Which AI model has the largest context window?

Among the models in this cheat sheet, Google Gemini 2.0 Pro/Flash advertises the largest window, at up to 2 million tokens. Gemini 2.5 Pro and several OpenAI GPT-4.1 variants support 1 million tokens, while Anthropic Claude, OpenAI o3, and many open models top out at around 128K to 200K tokens.

What counts against the context window?

The context window covers everything in a request: your system prompt, user messages, uploaded documents, tool results, and the model's own output. Because output shares the same window, every token reserved for the response leaves less room for input.
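The accounting above can be sketched as simple subtraction. This is a minimal illustration, not any provider's API: the function name `input_budget` and the overhead figures in the example call are hypothetical, and the 128,000 / 16,384 limits are the GPT-4o values listed in this cheat sheet.

```python
def input_budget(context_window, max_output, system=0, history=0, tools=0):
    """Tokens left for new input after reserving output and fixed overhead.

    All arguments are token counts. The overhead categories mirror the
    ones named in the text (system prompt, history, tool results).
    """
    remaining = context_window - max_output - system - history - tools
    return max(remaining, 0)  # never report a negative budget


# GPT-4o figures from the cheat sheet; overhead numbers are made up.
budget = input_budget(128_000, 16_384, system=1_200, history=20_000, tools=5_000)
print(budget)  # 85416
```

Reserving the full output cap is the conservative choice. If you know the response will be short, pass a smaller `max_output` to free up room for input.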

How much context do I need for a codebase or transcript?

A short blog post is ~1,500 tokens, a research paper ~8,000 tokens, a book chapter ~25,000 tokens, a large codebase ~50,000-100,000 tokens, and a multi-hour transcript can exceed 100,000 tokens.
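To size your own content without running a tokenizer, a character-count heuristic is often close enough for planning. The sketch below assumes about 4 characters per token, a common rule of thumb for English text rather than a property of any specific model. Code, non-English text, and different tokenizers can differ noticeably. The name `estimate_tokens` is hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English prose


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Rough token estimate from character count (not a real tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


sample = "word " * 1_200  # 6,000 characters, about a short blog post
print(estimate_tokens(sample))  # 1500
```

For a hard limit check, count with the provider's actual tokenizer. The heuristic is only meant for picking a size class.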

Does quantization reduce the context window?

Quantizing the weights does not change the model's published context limit, but the context you can actually run locally is bounded by the VRAM/RAM available for the KV cache. Tools like llama.cpp can also quantize the KV cache itself, which fits a longer context into the same memory.
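A back-of-the-envelope KV-cache estimate shows why memory, not the published limit, is usually the constraint. The sketch uses the standard formula for a transformer with grouped-query attention (keys plus values, per layer, per KV head). The model shape in the example (32 layers, 8 KV heads, head dimension 128) is an assumption based on the publicly released Llama 3.1 8B configuration. The 1-byte case approximates an 8-bit cache and ignores the small per-block overhead that real quantization formats add.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Approximate KV-cache size for one sequence.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_value: 2.0 for fp16, about 1.0 for an 8-bit cache.
    """
    return int(2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value)


# Assumed Llama 3.1 8B shape at a full 131,072-token context.
fp16 = kv_cache_bytes(32, 8, 128, 131_072, bytes_per_value=2.0)
q8 = kv_cache_bytes(32, 8, 128, 131_072, bytes_per_value=1.0)
print(fp16 / GIB, q8 / GIB)  # 16.0 8.0
```

Under these assumptions the cache alone needs about 16 GiB at fp16, on top of the weights. Halving the bytes per value halves that figure.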

📜

AI Model Context Window Cheat Sheet

Compare how much text each model can ingest and return. Pick the right model for documents, codebases, transcripts, and agent memory.

🌀

OpenAI GPT

GPT-4o

Context 128,000 Output 16,384

Multimodal flagship; efficient tokenizer.

GPT-4o-mini

Context 128,000 Output 16,384

Cheaper; same context window as GPT-4o.

o3 / o4-mini

Context 200,000 Output 100,000

Reasoning models; large context and output budget for code and multi-step problems.

GPT-4.1 / GPT-4.1-mini

Context 1,000,000 Output 32,768

1M-token context for long-document analysis.

o1 / o1-mini

Context 128,000 Output 32,768

Legacy reasoning series; smaller output cap.

🟣

Anthropic Claude

Claude 4 Opus

Context 200,000 Output 64,000

Highest reasoning quality; 200K context.

Claude 4 Sonnet

Context 200,000 Output 64,000

Balanced speed and capability.

Claude 4 Haiku

Context 200,000 Output 8,192

Fast and cheap; same context as larger siblings.

Claude 3.5 Sonnet

Context 200,000 Output 8,192

Widely deployed; strong coding agent use.

Claude 3 Opus / Sonnet / Haiku

Context 200,000 Output 4,096

Legacy Claude 3 family.

🔵

Google Gemini

Gemini 2.5 Pro

Context 1,000,000 Output 64,000

1M context; strong long-context retrieval.

Gemini 2.5 Flash

Context 1,000,000 Output 64,000

Fast multimodal with 1M context.

Gemini 2.0 Pro / Flash

Context 2,000,000 Output 8,192

Up to 2M tokens; suited to long video, audio, and documents.

Gemini 1.5 Pro / Flash

Context 1,000,000 Output 8,192

First mainstream 1M-context models.

🦙

Meta Llama

Llama 4 Scout

Context 128,000 Output 128,000

Multimodal MoE; long context for local deployment.

Llama 4 Maverick

Context 128,000 Output 128,000

Strong coding and reasoning.

Llama 3.3 70B

Context 128,000 Output 128,000

Efficient instruction model.

Llama 3.1 8B / 70B / 405B

Context 128,000 Output 128,000

128K standard across all sizes.

Llama 3.2 1B / 3B

Context 128,000 Output 128,000

Tiny edge models with long context.

🟠

Alibaba Qwen

Qwen3 235B / 32B / 30B / 14B / 8B / 4B

Context 128,000 Output 8,192

128K context across dense and MoE variants.

Qwen2.5 72B / 32B / 14B / 7B

Context 128,000 Output 8,192

Strong open models; 128K context.

Qwen2.5-VL

Context 128,000 Output 8,192

Vision-language with long context.

🌬️

Mistral AI

Mistral Large 2

Context 128,000 Output 128,000

Commercial flagship; 128K window.

Mistral Small 3.1 / 24B

Context 128,000 Output 128,000

Open-weight small models with 128K.

Pixtral Large / 12B

Context 128,000 Output 128,000

Vision model with 128K context.

Codestral

Context 32,000 Output 32,000

Code-specialized; 32K context.

🐋

DeepSeek

DeepSeek-V3

Context 64,000 Output 8,192

MoE generalist; 64K context.

DeepSeek-R1

Context 64,000 Output 8,192

Open reasoning model; 64K context.

DeepSeek-V2

Context 128,000 Output 8,192

Earlier version; 128K advertised, though often about 64K in practice.

⚫

xAI Grok

Grok 3 / Grok 3 Mini

Context 131,072 Output 131,072

128K context for chat and coding.

Grok 2

Context 131,072 Output 131,072

Prior generation with same 128K window.

🏠

Local / Quantized

Llama.cpp / Ollama

Context 128,000 Output 128,000

Practical context is usually capped by VRAM before the model's own limit; quantizing the KV cache reduces memory use.

vLLM

Context 128,000 Output 128,000

Serves full-context models if GPU memory allows.

KTransformers / llamafile

Context 128,000 Output 128,000

Offload-aware; effective context depends on RAM/VRAM.

MLX (Apple Silicon)

Context 128,000 Output 128,000

Unified memory; large context possible on Macs with 32-128 GB RAM.

How to use this cheat sheet

Context = total tokens the model can process (system + input + history + tools + output).
Output = maximum tokens the model is allowed to generate in one response.
Local/quantized models are usually VRAM-limited; use the VRAM calculator to size the KV cache.
For agent memory that spans many turns, prefer models with high context and low API cost. Use the LLM cost calculator to compare.
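The Context and Output definitions above translate into a simple fit check. The sketch below copies a handful of limits from this cheat sheet into a hypothetical `MODELS` table and filters it with a hypothetical helper, `models_that_fit`. It is an illustration of the rule, not a complete or authoritative model list.

```python
# (context, max_output) pairs copied from the cheat sheet entries.
MODELS = {
    "GPT-4o": (128_000, 16_384),
    "o3": (200_000, 100_000),
    "GPT-4.1": (1_000_000, 32_768),
    "Claude 4 Sonnet": (200_000, 64_000),
    "Gemini 2.5 Pro": (1_000_000, 64_000),
    "DeepSeek-V3": (64_000, 8_192),
}


def models_that_fit(input_tokens, output_tokens, models=MODELS):
    """Return models whose output cap and total window both cover the job."""
    fits = []
    for name, (context, max_output) in models.items():
        within_output = output_tokens <= max_output
        within_context = input_tokens + output_tokens <= context
        if within_output and within_context:
            fits.append(name)
    return fits


print(models_that_fit(150_000, 20_000))
# ['o3', 'GPT-4.1', 'Claude 4 Sonnet', 'Gemini 2.5 Pro']
```

Here `input_tokens` should include the system prompt, history, and tool results, since all of them count against the window.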
