LLM VRAM Calculator: How Much GPU Memory Do You Need?

Estimate the GPU memory needed to run a local LLM from its parameter count and quantization level. The estimate adds 20% overhead for the context (key-value) cache.

Estimated VRAM

5.4 GB

Formula: 8B × 4 bit ÷ 8 × 1.2 = 4.8 GB + cache

Recommended hardware for this config:

RTX 4060 Ti 16GB, RX 7600 XT 16GB, or Apple M4 16GB.

Common configs at a glance

Model            Q4 (fast)    Q8 (quality)    FP16 (best)
Llama 3.1 8B     5 GB         10 GB           20 GB
Llama 3.1 70B    42 GB        84 GB           168 GB
Llama 3.2 1B     1 GB         2 GB            3 GB
Llama 3.2 3B     2 GB         4 GB            8 GB
Qwen 2.5 7B      5 GB         9 GB            17 GB
Qwen 2.5 72B     44 GB        87 GB           173 GB
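
The table values are consistent with applying the rule of thumb to each model's nominal parameter count and rounding up to the next whole GB. Here is a minimal Python sketch under that assumption. The function name, the round-up rule, and the bit widths (Q4 = 4, Q8 = 8, FP16 = 16) are illustrative guesses, not the calculator's actual code.

```python
import math

# Nominal parameter counts in billions, read off the model names (assumption).
MODELS = {
    "Llama 3.1 8B": 8,
    "Llama 3.1 70B": 70,
    "Llama 3.2 1B": 1,
    "Llama 3.2 3B": 3,
    "Qwen 2.5 7B": 7,
    "Qwen 2.5 72B": 72,
}

# Bits per weight assumed for each column of the table.
BITS = {"Q4": 4, "Q8": 8, "FP16": 16}


def table_vram_gb(params_billion, bits, overhead=1.2):
    """Rule-of-thumb VRAM in GB, rounded up to a whole GB (hypothetical helper)."""
    raw = params_billion * bits / 8 * overhead
    # round() first to absorb floating-point noise, then round up.
    return math.ceil(round(raw, 6))


for name, params in MODELS.items():
    cells = "  ".join(f"{col}: {table_vram_gb(params, b)} GB" for col, b in BITS.items())
    print(f"{name:<14} {cells}")
```

Real parameter counts differ slightly from the nominal ones, so treat the output as a rough guide only.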

How this works

The calculator uses the standard rule of thumb: VRAM (GB) ≈ parameters (billions) × bits per weight ÷ 8 × 1.2. Dividing by 8 converts bits to bytes, and the 1.2 multiplier adds headroom for the key-value cache, model overhead, and a typical context window. For exact numbers, load the model in your inference engine (llama.cpp, Ollama, vLLM) with your actual prompt length and check the memory it reports.
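
The rule of thumb can be written as a small function. This is a sketch for illustration only: the function name and the `overhead` parameter are assumptions, and the result is a rough estimate, not a measurement.

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM estimate in GB (hypothetical helper).

    params_billion: model size in billions of parameters
    bits:           bits per weight (e.g. 4 for Q4, 8 for Q8, 16 for FP16)
    overhead:       multiplier for KV cache and runtime overhead (1.2 = +20%)
    """
    weights_gb = params_billion * bits / 8  # bits -> bytes; billions of bytes ~ GB
    return weights_gb * overhead


# The 8B, 4-bit configuration from the formula line:
print(f"{estimate_vram_gb(8, 4):.1f} GB")
```

Setting `overhead=1.0` gives the size of the weights alone, which is useful when you want to budget the context cache separately.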

Frequently asked questions

How much VRAM for Llama 3.1 70B?

About 42 GB at Q4, 84 GB at Q8, and 168 GB at FP16 using the rule of thumb above (35, 70, and 140 GB for the weights alone). A single RTX 3090/4090 (24 GB) cannot fit the full model even at Q4; you need two 3090s, a 48 GB card, or Apple Silicon with 64-128 GB of unified memory.

Can I run a 70B model on 24 GB VRAM?

No, not the full model. You would need aggressive offloading to system RAM and the CPU, which sharply reduces tokens/sec. For usable 70B performance, use a GPU setup with 48 GB or more of VRAM; otherwise choose a smaller model (8B-13B).
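
To make the comparison concrete, here is a short Python sketch that checks the rule-of-thumb estimate against a VRAM budget. The helper names are hypothetical. It also assumes that two GPUs pool their memory perfectly, which ignores the per-GPU overhead of splitting a model.

```python
def estimate_vram_gb(params_billion, bits, overhead=1.2):
    """Rule-of-thumb estimate: parameters x bits / 8 x overhead (hypothetical helper)."""
    return params_billion * bits / 8 * overhead


def fits(params_billion, bits, vram_gb):
    """True if the estimate is within the given VRAM budget."""
    return estimate_vram_gb(params_billion, bits) <= vram_gb


needed = estimate_vram_gb(70, 4)
# Simplification: total VRAM is the sum across cards.
setups = [("1 x 24 GB", 24), ("2 x 24 GB", 48), ("1 x 48 GB", 48)]
for label, vram in setups:
    verdict = "fits" if fits(70, 4, vram) else "does not fit"
    print(f"70B at Q4 (~{needed:.0f} GB) on {label}: {verdict}")
```

The same check shows why an 8B model is the practical choice on a 24 GB card: `fits(8, 4, 24)` holds with plenty of room left for a longer context.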
