GPU Benchmarks for Local LLMs

Real-world Ollama tokens/sec on popular models. Filter by VRAM, model size, and power budget.

Data: 4-bit quantized inference, ~4096 context, averaged from community Ollama/vLLM runs.
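A single data point of this kind can be reproduced by reading the timing fields Ollama returns with each generation. The sketch below assumes a local Ollama server at its default address and a model tag that has already been pulled. The helper names `tokens_per_second` and `run_once` are illustrative, and this is not necessarily how the community figures were collected.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Decode speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model: str, prompt: str, num_ctx: int = 4096) -> float:
    """Send one non-streaming request and return generated tokens per second."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # match the ~4096 context used for the table
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling `run_once("llama3.1:8b", "Explain what VRAM is.")` several times and averaging the results smooths out run-to-run variation; the model tag here is only an example.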

| GPU / System | VRAM | TDP | PassMark | llama-3.1-70b t/s | llama-3.1-8b t/s | qwen2.5-7b t/s | Notes |
|---|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 5090 | 32 GB | 575 W | 48,000 | 22.0 | 155.0 | 165.0 | Flagship Blackwell gaming/AI card. Estimated data. |
| NVIDIA GeForce RTX 4090 | 24 GB | 450 W | 39,535 | 18.0 | 125.0 | 135.0 | Featured. Top pick for local LLMs today. |
| NVIDIA GeForce RTX 4080 | 16 GB | 320 W | 34,675 | 15.0 | 105.0 | 115.0 | |
| NVIDIA GeForce RTX 4070 Ti | 12 GB | 285 W | 31,704 | 11.0 | 85.0 | 92.0 | |
| NVIDIA GeForce RTX 3090 | 24 GB | 350 W | 26,655 | 14.0 | 95.0 | 105.0 | Best VRAM-per-dollar for large models on used market. |
| NVIDIA GeForce RTX 3090 Ti | 24 GB | 450 W | 29,280 | 15.0 | 98.0 | 108.0 | |
| Apple Mac Studio M4 Max | 128 GB | 130 W | 17,500 | 9.0 | 65.0 | 70.0 | Massive unified memory for 70B+ models, lower raw TPS. |
| Apple Mac Studio M3 Max | 128 GB | 110 W | 15,300 | 7.0 | 55.0 | 60.0 | |
| Apple Mac Studio M2 Ultra | 192 GB | 180 W | 14,450 | 8.0 | 50.0 | n/a | |
| Geekom A8 (Ryzen 9 8945HS, Radeon 780M) | 8 GB | 65 W | 3,500 | n/a | 12.0 | 13.0 | Mini PC. Good for 3B-7B models, very low power. |
| GMKtec NucBox K8+ (Ryzen 7 8845HS, Radeon 780M) | 8 GB | 65 W | 3,300 | n/a | 11.0 | 12.0 | Budget mini PC for local LLM experiments. |

Best for 8B models

RTX 4070 Ti or RTX 3090. They run Llama 3.1 8B at 85 and 95 tok/s respectively. The 3090 wins on VRAM (24 GB vs. 12 GB); the 4070 Ti wins on power (285 W vs. 350 W TDP).
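The VRAM-versus-power trade-off can be put into numbers by dividing throughput by TDP. The sketch below uses the Llama 3.1 8B figures from the table above; the `Card` class and `tps_per_watt` helper are illustrative names, and TDP is a rated ceiling rather than measured draw during inference, so the ratio is only a rough guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int
    tdp_w: int
    llama_8b_tps: float


# Values copied from the benchmark table.
CARDS = [
    Card("RTX 4070 Ti", 12, 285, 85.0),
    Card("RTX 3090", 24, 350, 95.0),
]


def tps_per_watt(card: Card) -> float:
    """Tokens per second per watt of TDP (a ceiling, not measured power draw)."""
    return card.llama_8b_tps / card.tdp_w


for card in CARDS:
    print(f"{card.name}: {card.vram_gb} GB, {tps_per_watt(card):.3f} tok/s per W")
```

On these figures the 4070 Ti comes out at roughly 0.30 tok/s per watt against roughly 0.27 for the 3090, while the 3090 offers twice the VRAM.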

Best for 70B models

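Apple Mac Studio M4 Max (128 GB unified) or RTX 4090 (24 GB). A 4-bit 70B model needs roughly 40 GB or more, so it fits entirely in the Mac's unified memory but not in a single 24 GB card's VRAM. Apple wins on memory capacity; the RTX 4090 posts the higher throughput in the table (18 vs. 9 tok/s).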
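The ~40 GB figure can be sanity-checked with back-of-the-envelope arithmetic: quantized weights plus the key/value cache at the benchmark's ~4096 context. The sketch below is an estimate under stated assumptions. It takes about 4.5 effective bits per weight (4-bit formats also store scaling metadata, and the exact overhead depends on the quantization scheme), a 16-bit KV cache, and the Llama 3.1 70B shape of 80 layers, 8 KV heads, and head dimension 128. The function name `estimate_memory_gb` is illustrative.

```python
def estimate_memory_gb(
    n_params: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes: int = 2,  # assumption: KV cache kept in 16-bit precision
) -> float:
    """Rough footprint in GB (10^9 bytes): quantized weights plus KV cache."""
    weight_bytes = n_params * bits_per_weight / 8
    # Keys and values: one vector per layer, per KV head, per token.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weight_bytes + kv_cache_bytes) / 1e9


# Assumption: ~4.5 effective bits per weight for a "4-bit" quantization.
needed = estimate_memory_gb(70e9, 4.5, 80, 8, 128, 4096)
print(f"~{needed:.1f} GB; fits in 24 GB: {needed <= 24}; fits in 128 GB: {needed <= 128}")
```

The estimate leaves out activations and runtime overhead, so real usage lands somewhat higher. At exactly 4 bits per weight the weights alone come to 35 GB, which already exceeds 24 GB.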

Budget starter

Geekom A8 or GMKtec K8+. 11–13 tok/s on 7B–8B models, 65 W, under $600. A good fit for experiments.
