LLM VRAM Calculator: How Much GPU Memory Do You Need?
Estimate GPU memory for local LLMs by model size and quantization. Add 20% overhead for context cache.
Estimated VRAM
5.4 GB
Formula: 8B × 4 bit ÷ 8 × 1.2 = 4.8 GB + cache
Recommended hardware for this config:
RTX 4060 Ti 16GB, RX 7600 XT 16GB, or Apple M4 16GB.
Common configs at a glance
| Model | Q4 (fast) | Q8 (quality) | FP16 (best) |
|---|---|---|---|
| Llama 3.1 8B | 5 GB | 10 GB | 20 GB |
| Llama 3.1 70B | 42 GB | 84 GB | 168 GB |
| Llama 3.2 1B | 1 GB | 2 GB | 3 GB |
| Llama 3.2 3B | 2 GB | 4 GB | 8 GB |
| Qwen 2.5 7B | 5 GB | 9 GB | 17 GB |
| Qwen 2.5 72B | 44 GB | 87 GB | 173 GB |
How this works
The calculator uses the standard rule of thumb: VRAM (GB) ≈ parameters (billions) × bits per weight ÷ 8 × 1.2. The 1.2 multiplier adds headroom for the key-value cache, runtime overhead, and a typical context window. Values in the table above are rounded up to the next whole GB. For exact numbers, load the model in your inference engine (llama.cpp, Ollama, vLLM) with your actual prompt length and check the memory it reports.
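The rule of thumb can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name `estimate_vram_gb` and its parameters are assumptions made for the example, and 1 GB is taken as 10^9 bytes.

```python
import math

def estimate_vram_gb(params_billion: float, bits: float, overhead: float = 1.2) -> float:
    """Rule of thumb: parameters (billions) x bits / 8 x overhead, in GB.

    Hypothetical helper for illustration; overhead=1.2 is the 20% headroom.
    """
    return params_billion * bits / 8 * overhead

# Llama 3.1 8B at the three precisions from the table, rounded up to whole GB.
for label, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    print(label, math.ceil(estimate_vram_gb(8, bits)), "GB")
# Q4 5 GB
# Q8 10 GB
# FP16 20 GB
```

Real usage depends on context length and the inference engine, so treat the result as a starting estimate.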
Frequently asked questions
How much VRAM for Llama 3.1 70B?
About 42 GB at Q4, 84 GB at Q8, and 168 GB at FP16, including the 20% overhead. A single RTX 3090/4090 (24 GB) cannot fit the full model even at Q4; you need two 3090s, a 48 GB card, or Apple Silicon with 64-128 GB unified memory.
Can I run a 70B model on 24 GB VRAM?
Not the full model. You would have to offload many layers to system RAM and the CPU, which sharply reduces tokens/sec. Use a smaller model (8B-13B), or a GPU setup with 48 GB or more of VRAM for usable 70B performance.
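The same rule of thumb can be turned around to check a given card. The sketch below assumes the parameters × bits ÷ 8 × 1.2 estimate and ignores offloading; `fits` and `max_params_billion` are hypothetical helper names introduced for this example.

```python
def fits(params_billion: float, bits: float, vram_gb: float, overhead: float = 1.2) -> bool:
    """True if the rule-of-thumb estimate is within the VRAM budget (no offloading)."""
    return params_billion * bits / 8 * overhead <= vram_gb

def max_params_billion(vram_gb: float, bits: float, overhead: float = 1.2) -> float:
    """Invert the rule of thumb: largest model size (billions of parameters) that fits."""
    return vram_gb * 8 / (bits * overhead)

print(fits(70, 4, 24))                   # False: 70B at Q4 needs about 42 GB
print(fits(70, 4, 48))                   # True: fits on 48 GB
print(round(max_params_billion(24, 4)))  # 40: rough Q4 ceiling for a 24 GB card
```

By this estimate a 24 GB card tops out at roughly 40B parameters at Q4, which is why 70B does not fit without offloading.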