AI Model Context Window Cheat Sheet
Compare how much text each model can ingest and return. Pick the right model for documents, codebases, transcripts, and agent memory.
What fits your content?
Models whose total window covers your content plus a reasonable output buffer (~25%):
OpenAI GPT
GPT-4o
Multimodal flagship; fast tokenizer.
GPT-4o-mini
Cheaper, same context window as 4o.
o3 / o4-mini
Reasoning models; large context for code and multi-step problems.
GPT-4.1 / GPT-4.1-mini
1M-token context for long-document analysis.
o1 / o1-mini
Legacy reasoning series; smaller output cap.
Anthropic Claude
Claude 4 Opus
Highest reasoning quality; 200K context.
Claude 4 Sonnet
Balanced speed and capability.
Claude 4 Haiku
Fast and cheap; same context as larger siblings.
Claude 3.5 Sonnet
Widely deployed; strong coding agent use.
Claude 3 Opus / Sonnet / Haiku
Legacy Claude 3 family.
Google Gemini
Gemini 2.5 Pro
1M context; strong long-context retrieval.
Gemini 2.5 Flash
Fast multimodal with 1M context.
Gemini 2.0 Pro / Flash
Up to 2M tokens; suited to long video and audio inputs.
Gemini 1.5 Pro / Flash
First mainstream 1M-context models.
Meta Llama
Llama 4 Scout
Multimodal MoE; long context for local deployment.
Llama 4 Maverick
Strong coding and reasoning.
Llama 3.3 70B
Efficient instruction model.
Llama 3.1 8B / 70B / 405B
128K standard across all sizes.
Llama 3.2 1B / 3B
Tiny edge models with long context.
Alibaba Qwen
Qwen3 235B / 32B / 30B / 14B / 8B / 4B
128K context across dense and MoE variants.
Qwen2.5 72B / 32B / 14B / 7B
Strong open models; 128K context.
Qwen2.5-VL
Vision-language with long context.
Mistral AI
Mistral Large 2
Commercial flagship; 128K window.
Mistral Small 3.1 / 24B
Open-weight small models with 128K.
Pixtral Large / 12B
Vision model with 128K context.
Codestral
Code-specialized; 32K context.
DeepSeek
DeepSeek-V3
MoE generalist; 64K context.
DeepSeek-R1
Open reasoning model; 64K context.
DeepSeek-V2
Earlier version; advertises 128K, but about 64K is often the practical limit.
xAI Grok
Grok 3 / Grok 3 Mini
128K context for chat and coding.
Grok 2
Prior generation with same 128K window.
Local / Quantized
Llama.cpp / Ollama
Usable context is usually limited by available VRAM rather than by the model's own limit; quantizing the KV cache reduces its memory footprint.
vLLM
Serves models at their full context length if GPU memory allows.
KTransformers / llamafile
Offload-aware; effective context depends on RAM/VRAM.
MLX (Apple Silicon)
Unified memory; large context possible on Macs with 32-128 GB RAM.
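To make the VRAM dependence concrete, here is a rough sketch of the usual KV-cache size estimate: two tensors (keys and values) per layer, each storing one head_dim-sized vector per KV head per token. The helper names and the example configuration (32 layers, 8 KV heads, head dimension 128, roughly an 8B-class model with grouped-query attention) are assumptions for illustration only. Read the real values from your model's config, and expect runtimes to add overhead on top of this figure.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV-cache size in bytes for one sequence (hypothetical helper)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


# Assumed example configuration; not taken from any specific model card.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT = 131_072  # 128K tokens

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_value=2.0)
# 8-bit KV quantization: about 1 byte per value, ignoring per-block scale overhead.
q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_value=1.0)

print(f"fp16 KV cache:  {to_gib(fp16):.1f} GiB")  # 16.0 GiB
print(f"8-bit KV cache: {to_gib(q8):.1f} GiB")    # 8.0 GiB
```

The estimate scales linearly with context length, so halving the context or halving the bytes per value each halve the cache. Model weights still have to fit alongside it.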
How to use this cheat sheet
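- Context = total tokens the model can process in one request (system prompt + input + history + tool definitions). For most models, generated output also counts against this window.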
- Output = maximum tokens the model is allowed to generate in one response.
- Local/quantized models are usually VRAM-limited; use the VRAM calculator to size the KV cache.
- For agent memory that spans many turns, prefer models with high context and low API cost. Use the LLM cost calculator to compare.
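The sizing rule used in this sheet (content plus a roughly 25% output buffer must fit inside the context window) can be sketched as below. The `ModelLimits` class, the `fits` and `usable_output` functions, and the two example models with their limits are hypothetical placeholders, not figures from any provider. Substitute limits from current provider documentation and count tokens with the model's own tokenizer.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    name: str
    context_tokens: int      # total window: input plus output
    max_output_tokens: int   # per-response generation cap


def fits(model: ModelLimits, content_tokens: int, buffer_ratio: float = 0.25) -> bool:
    """True if the content plus an output buffer fits in the model's window."""
    buffer = math.ceil(content_tokens * buffer_ratio)
    return content_tokens + buffer <= model.context_tokens


def usable_output(model: ModelLimits, content_tokens: int) -> int:
    """Tokens left for the response, bounded by the model's output cap."""
    remaining = max(model.context_tokens - content_tokens, 0)
    return min(remaining, model.max_output_tokens)


# Hypothetical models for illustration only.
small = ModelLimits("example-128k", context_tokens=128_000, max_output_tokens=16_000)
large = ModelLimits("example-1m", context_tokens=1_000_000, max_output_tokens=32_000)

content = 110_000  # e.g. a long transcript, counted in tokens
for m in (small, large):
    print(m.name, "fits" if fits(m, content) else "too small",
          "| usable output:", usable_output(m, content))
```

With these assumed limits, 110,000 tokens of content needs 137,500 tokens including the buffer, so only the larger window qualifies. The output cap still bounds how much the model can return in a single response.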