📜

AI Model Context Window Cheat Sheet

Compare how much text each model can ingest and return. Pick the right model for documents, codebases, transcripts, and agent memory.

What fits your content?

🌀

OpenAI GPT

GPT-4o

Context 128,000 Output 16,384

Multimodal flagship; token-efficient tokenizer (o200k_base).

GPT-4o-mini

Context 128,000 Output 16,384

Cheaper; same context window as GPT-4o.

o3 / o4-mini

Context 200,000 Output 100,000

Reasoning models; large context for code/reasoning.

GPT-4.1 / GPT-4.1-mini

Context 1,000,000 Output 32,768

1M-token context for long-document analysis.

o1 / o1-mini

Context 200,000 (o1) / 128,000 (o1-mini) Output 100,000 (o1) / 65,536 (o1-mini)

Legacy reasoning series; o1-mini has the smaller window and output cap.

🟣

Anthropic Claude

Claude 4 Opus

Context 200,000 Output 32,000

Highest reasoning quality; 200K context.

Claude 4 Sonnet

Context 200,000 Output 64,000

Balanced speed and capability.

Claude 4 Haiku

Context 200,000 Output 8,192

Fast and cheap; same context as larger siblings.

Claude 3.5 Sonnet

Context 200,000 Output 8,192

Widely deployed; strong coding agent use.

Claude 3 Opus / Sonnet / Haiku

Context 200,000 Output 4,096

Legacy Claude 3 family.

🔵

Google Gemini

Gemini 2.5 Pro

Context 1,000,000 Output 64,000

1M context; strong long-context retrieval.

Gemini 2.5 Flash

Context 1,000,000 Output 64,000

Fast multimodal with 1M context.

Gemini 2.0 Pro / Flash

Context 1,000,000 (Flash) / 2,000,000 (Pro) Output 8,192

Up to 2M tokens on Pro for long video and audio inputs.

Gemini 1.5 Pro / Flash

Context 1,000,000 Output 8,192

First mainstream 1M-context models.

🦙

Meta Llama

Llama 4 Scout

Context 10,000,000 Output 128,000

Multimodal MoE; 10M is Meta's advertised maximum, and most hosts and local setups run far less.

Llama 4 Maverick

Context 1,000,000 Output 128,000

Strong coding and reasoning.

Llama 3.3 70B

Context 128,000 Output 128,000

Efficient instruction model.

Llama 3.1 8B / 70B / 405B

Context 128,000 Output 128,000

128K standard across all sizes.

Llama 3.2 1B / 3B

Context 128,000 Output 128,000

Tiny edge models with long context.

🟠

Alibaba Qwen

Qwen3 235B / 32B / 30B / 14B / 8B / 4B

Context 128,000 Output 8,192

128K context across dense and MoE variants.

Qwen2.5 72B / 32B / 14B / 7B

Context 128,000 Output 8,192

Strong open models; 128K context.

Qwen2.5-VL

Context 128,000 Output 8,192

Vision-language with long context.

🌬️

Mistral AI

Mistral Large 2

Context 128,000 Output 128,000

Commercial flagship; 128K window.

Mistral Small 3.1 / 24B

Context 128,000 Output 128,000

Open-weight small models with 128K.

Pixtral Large / 12B

Context 128,000 Output 128,000

Vision model with 128K context.

Codestral

Context 32,000 Output 32,000

Code-specialized; 32K context.

🐋

DeepSeek

DeepSeek-V3

Context 64,000 Output 8,192

MoE generalist; 64K context.

DeepSeek-R1

Context 64,000 Output 8,192

Open reasoning model; 64K context.

DeepSeek-V2

Context 128,000 Output 8,192

Earlier release advertised at 128K; often closer to 64K in practice.

xAI Grok

Grok 3 / Grok 3 Mini

Context 131,072 Output 131,072

128K context for chat and coding.

Grok 2

Context 131,072 Output 131,072

Prior generation with same 128K window.

🏠

Local / Quantized

Llama.cpp / Ollama

Context 128,000 Output 128,000

Usable context is usually capped by VRAM before the model's trained maximum; quantizing the KV cache shrinks its memory footprint.

vLLM

Context 128,000 Output 128,000

Serves full-context models if GPU memory allows.

KTransformers / llamafile

Context 128,000 Output 128,000

Offload-aware; effective context depends on RAM/VRAM.

MLX (Apple Silicon)

Context 128,000 Output 128,000

Unified memory; large context possible on Macs with 32-128 GB RAM.
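
The VRAM limit on local context can be estimated with a little arithmetic. The sketch below uses the common approximation that the KV cache stores one key and one value tensor per layer, each with `n_kv_heads * head_dim` values per token. The function names and the example model shape (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values from any runtime listed above; read the real numbers from your model's config. It ignores weights, activations, and runtime overhead, and treats a quantized cache as simply fewer bytes per value.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Approximate KV-cache size in bytes for a single sequence.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def max_context_tokens(free_bytes, n_layers, n_kv_heads, head_dim, bytes_per_value=2.0):
    """Largest context whose KV cache fits in the given free memory."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_value)
    return int(free_bytes // per_token)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 128_000, bytes_per_value=2.0)
q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 128_000, bytes_per_value=1.0)

print(f"fp16 KV cache at 128K tokens:  {fp16 / GIB:.1f} GiB")
print(f"8-bit KV cache at 128K tokens: {q8 / GIB:.1f} GiB")
print("Tokens that fit in 8 GiB (fp16):",
      max_context_tokens(8 * GIB, LAYERS, KV_HEADS, HEAD_DIM))
```

Under these assumptions, halving the bytes per cached value halves the cache, which is why KV-cache quantization roughly doubles the context that fits in the same memory.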

How to use this cheat sheet

  • Context = total tokens the model can process in one request (system + input + history + tools). On most hosted APIs the generated output also counts against this window.
  • Output = maximum tokens the model is allowed to generate in one response.
  • Local/quantized models are usually VRAM-limited; use the VRAM calculator to size the KV cache.
  • For agent memory that spans many turns, prefer models with high context and low API cost. Use the LLM cost calculator to compare.
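
The two limits above can be turned into a quick pre-flight check. This is a minimal sketch: `fits` and `cumulative_input_tokens` are hypothetical helper names, and the check assumes generated tokens count against the context window, which holds for many hosted APIs but not all (some enforce input and output limits separately). The second helper shows why long agent sessions get expensive when the full history is resent every turn without caching.

```python
def fits(context_window, max_output_cap, input_tokens, desired_output_tokens):
    """Return True if a request respects both the context and output limits.

    Assumes input and output share one context window.
    """
    if desired_output_tokens > max_output_cap:
        return False
    return input_tokens + desired_output_tokens <= context_window


def cumulative_input_tokens(system_tokens, tokens_per_turn, turns):
    """Total input tokens sent over a session that resends its history each turn.

    tokens_per_turn is the amount of new text (user message plus prior reply)
    appended to the history on every turn.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn
        total += history
    return total


# 120K tokens of input plus a 10K-token reply, checked against two rows of the sheet.
print(fits(128_000, 16_384, 120_000, 10_000))  # 128K window, 16,384 output cap
print(fits(200_000, 64_000, 120_000, 10_000))  # 200K window, 64,000 output cap

# A 50-turn agent loop with a 1K system prompt and 2K new tokens per turn.
print(cumulative_input_tokens(1_000, 2_000, 50))
```

The first request overflows a 128K window even though each part looks small on its own. The cumulative total grows roughly with the square of the turn count, so per-token price matters as much as window size for agent memory.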
