GPU Benchmarks for Local LLMs
Real-world Ollama tokens/sec on popular models. Compare systems by VRAM, model size, and power budget.
Data: 4-bit quantized inference, ~4096 context, averaged from community Ollama/vLLM runs.
| GPU / System | VRAM | TDP | PassMark | llama-3.1-70b (tok/s) | llama-3.1-8b (tok/s) | qwen2.5-7b (tok/s) | Notes |
|---|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 5090 | 32 GB | 575 W | 48,000 | 22.0 | 155.0 | 165.0 | Flagship Blackwell gaming/AI card. Estimated data. |
| NVIDIA GeForce RTX 4090 | 24 GB | 450 W | 39,535 | 18.0 | 125.0 | 135.0 | Top pick for local LLMs today. |
| NVIDIA GeForce RTX 4080 | 16 GB | 320 W | 34,675 | 15.0 | 105.0 | 115.0 | |
| NVIDIA GeForce RTX 4070 Ti | 12 GB | 285 W | 31,704 | 11.0 | 85.0 | 92.0 | |
| NVIDIA GeForce RTX 3090 | 24 GB | 350 W | 26,655 | 14.0 | 95.0 | 105.0 | Best VRAM-per-dollar for large models on the used market. |
| NVIDIA GeForce RTX 3090 Ti | 24 GB | 450 W | 29,280 | 15.0 | 98.0 | 108.0 | |
| Apple Mac Studio M4 Max | 128 GB | 130 W | 17,500 | 9.0 | 65.0 | 70.0 | Massive unified memory for 70B+ models, lower raw tok/s. |
| Apple Mac Studio M3 Max | 128 GB | 110 W | 15,300 | 7.0 | 55.0 | 60.0 | |
| Apple Mac Studio M2 Ultra | 192 GB | 180 W | 14,450 | 8.0 | 50.0 | — | |
| Geekom A8 (Ryzen 9 8945HS, Radeon 780M) | 8 GB | 65 W | 3,500 | — | 12.0 | 13.0 | Mini PC. Good for 3B-7B models, very low power. |
| GMKtec NucBox K8+ (Ryzen 7 8845HS, Radeon 780M) | 8 GB | 65 W | 3,300 | — | 11.0 | 12.0 | Budget mini PC for local LLM experiments. |
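To narrow the table down to a VRAM floor and a power ceiling, a small script is enough. The sketch below is illustrative only: the `System` class and `shortlist` function are hypothetical names, and the data is a hand-copied subset of the rows above (llama-3.1-8b column).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class System:
    name: str
    vram_gb: int
    tdp_w: int
    tps_8b: Optional[float]  # llama-3.1-8b tok/s from the table, None if missing


# Subset of the table rows, copied by hand (not fetched from any API).
SYSTEMS = [
    System("RTX 5090", 32, 575, 155.0),
    System("RTX 4090", 24, 450, 125.0),
    System("RTX 4080", 16, 320, 105.0),
    System("RTX 4070 Ti", 12, 285, 85.0),
    System("RTX 3090", 24, 350, 95.0),
    System("Mac Studio M4 Max", 128, 130, 65.0),
    System("Geekom A8", 8, 65, 12.0),
]


def shortlist(systems: List[System], min_vram_gb: int, max_tdp_w: int) -> List[System]:
    """Keep systems that meet both limits, fastest on the 8B model first."""
    matches = [
        s for s in systems
        if s.vram_gb >= min_vram_gb and s.tdp_w <= max_tdp_w and s.tps_8b is not None
    ]
    return sorted(matches, key=lambda s: s.tps_8b, reverse=True)


for s in shortlist(SYSTEMS, min_vram_gb=24, max_tdp_w=400):
    print(f"{s.name}: {s.vram_gb} GB, {s.tdp_w} W, {s.tps_8b} tok/s")
```

With a 24 GB floor and a 400 W ceiling, only the RTX 3090 and the Mac Studio M4 Max remain from this subset.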
Best for 8B models
RTX 4070 Ti or RTX 3090. They run Llama 3.1 8B at 85 and 95 tok/s respectively. The 3090 wins on VRAM (24 GB vs 12 GB); the 4070 Ti wins on power (285 W vs 350 W).
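One rough way to put a number on the power trade-off is tokens per second per watt of TDP. This is a sketch under a loud assumption: TDP is a rated limit, not measured draw during inference, so the ratio is only a proxy. The function name `tokens_per_watt` is hypothetical; the figures come from the table.

```python
def tokens_per_watt(tokens_per_sec: float, tdp_w: float) -> float:
    """Throughput per rated watt. TDP is a proxy, not measured power draw."""
    return tokens_per_sec / tdp_w


# (llama-3.1-8b tok/s, TDP in watts) taken from the table
cards = {
    "RTX 4070 Ti": (85.0, 285),
    "RTX 3090": (95.0, 350),
}

efficiency = {name: tokens_per_watt(tps, tdp) for name, (tps, tdp) in cards.items()}

for name, eff in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {eff:.3f} tok/s per W")
# RTX 4070 Ti: 0.298 tok/s per W
# RTX 3090: 0.271 tok/s per W
```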
Best for 70B models
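Apple Mac Studio M4 Max (128 GB unified) or RTX 4090 (24 GB). A 70B model at 4-bit needs roughly 40 GB or more, so the Mac Studio can hold it entirely in memory, while a single 24 GB RTX 4090 cannot and has to offload part of the model to system RAM or split it across GPUs. Apple wins on memory capacity.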
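The ~40 GB figure can be sanity-checked with back-of-the-envelope arithmetic: 70 billion weights at 4 bits each is 35 GB, plus headroom for the KV cache and runtime buffers. In the sketch below, the 15% overhead fraction is an assumption chosen for illustration, not a measured value, and `estimate_memory_gb` and `fits` are hypothetical helper names.

```python
def estimate_memory_gb(params_billion: float,
                       bits_per_weight: int = 4,
                       overhead_fraction: float = 0.15) -> float:
    """Rough memory needed for quantized weights plus overhead, in GB (1e9 bytes).

    overhead_fraction is an assumed allowance for KV cache and runtime
    buffers; real usage depends on context length and the inference engine.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes per weight
    return weights_gb * (1 + overhead_fraction)


def fits(memory_gb: float, params_billion: float) -> bool:
    """True if the estimated footprint fits in the given memory."""
    return estimate_memory_gb(params_billion) <= memory_gb


needed = estimate_memory_gb(70)
print(f"70B at 4-bit: about {needed:.0f} GB")

for name, memory_gb in [("Mac Studio M4 Max", 128), ("RTX 4090", 24)]:
    verdict = "fits" if fits(memory_gb, 70) else "does not fit"
    print(f"{name} ({memory_gb} GB): {verdict}")
```

Under these assumptions the 128 GB Mac Studio fits the model and the 24 GB RTX 4090 does not, which matches the capacity argument above.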
Budget starter
Geekom A8 or GMKtec K8+. Roughly 12 tok/s on 7B models at 65 W, for under $600. Perfect for experiments.