GPU Benchmarks for Local LLMs

Real-world Ollama tokens/sec on popular models. Filter by VRAM, model size, and power budget.

Data: 4-bit quantized inference at a context length of about 4096 tokens, averaged from community Ollama and vLLM runs.

GPU / System	VRAM	TDP	PassMark	llama-3.1-70b t/s	llama-3.1-8b t/s	qwen2.5-7b t/s	Notes
NVIDIA GeForce RTX 5090	32 GB	575 W	48,000	22.0	155.0	165.0	Flagship Blackwell gaming/AI card. Estimated data.
NVIDIA GeForce RTX 4090	24 GB	450 W	39,535	18.0	125.0	135.0	Top pick for local LLMs today.
NVIDIA GeForce RTX 4080	16 GB	320 W	34,675	15.0	105.0	115.0	—
NVIDIA GeForce RTX 4070 Ti	12 GB	285 W	31,704	11.0	85.0	92.0	—
NVIDIA GeForce RTX 3090	24 GB	350 W	26,655	14.0	95.0	105.0	Best VRAM-per-dollar for large models on used market.
NVIDIA GeForce RTX 3090 Ti	24 GB	450 W	29,280	15.0	98.0	108.0	—
Apple Mac Studio M4 Max	128 GB	130 W	17,500	9.0	65.0	70.0	Massive unified memory for 70B+ models, lower raw TPS.
Apple Mac Studio M3 Max	128 GB	110 W	15,300	7.0	55.0	60.0	—
Apple Mac Studio M2 Ultra	192 GB	180 W	14,450	8.0	50.0	—	—
Geekom A8 (Ryzen 9 8945HS, Radeon 780M)	8 GB	65 W	3,500	—	12.0	13.0	Mini PC. Good for 3B-7B models, very low power.
GMKtec NucBox K8+ (Ryzen 7 8845HS, Radeon 780M)	8 GB	65 W	3,300	—	11.0	12.0	Budget mini PC for local LLM experiments.
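
The VRAM and power-budget filters can be reproduced offline against the table. What follows is a minimal Python sketch, not the page's own implementation: the `Row` type, the `filter_rows` function, and the choice of rows are illustrative names and selections introduced here, with figures copied from the table above.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    name: str
    vram_gb: int
    tdp_w: int
    tps_70b: Optional[float]  # None where the table shows no figure
    tps_8b: Optional[float]


# A subset of the table; values copied as listed.
ROWS = [
    Row("RTX 5090", 32, 575, 22.0, 155.0),
    Row("RTX 4090", 24, 450, 18.0, 125.0),
    Row("RTX 4070 Ti", 12, 285, 11.0, 85.0),
    Row("RTX 3090", 24, 350, 14.0, 95.0),
    Row("Mac Studio M4 Max", 128, 130, 9.0, 65.0),
    Row("Geekom A8", 8, 65, None, 12.0),
]


def filter_rows(rows, min_vram_gb=0, max_tdp_w=None):
    """Keep rows meeting the VRAM floor and power ceiling, fastest 8B first."""
    kept = [
        r for r in rows
        if r.vram_gb >= min_vram_gb
        and (max_tdp_w is None or r.tdp_w <= max_tdp_w)
    ]
    return sorted(kept, key=lambda r: r.tps_8b or 0.0, reverse=True)


# Example: at least 24 GB of memory within a 400 W power budget.
for r in filter_rows(ROWS, min_vram_gb=24, max_tdp_w=400):
    print(f"{r.name}: {r.vram_gb} GB, {r.tdp_w} W, {r.tps_8b} t/s on llama-3.1-8b")
```

Sorting by the llama-3.1-8b column is an arbitrary choice for this sketch; any throughput column could serve as the sort key.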

RTX 4070 Ti or RTX 3090. They run Llama 3.1 8B at 85 and 95 tok/s respectively. The 3090 wins on VRAM (24 GB vs. 12 GB); the 4070 Ti wins on power (285 W vs. 350 W).
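
One way to put a number on the power trade-off is throughput per watt. The sketch below assumes TDP is an acceptable stand-in for real power draw during inference, which it may not be; `tokens_per_watt` and `CARDS` are names made up for this illustration, and the figures come from the table.

```python
# (tokens/sec on llama-3.1-8b, TDP in watts), copied from the table.
CARDS = {
    "RTX 4070 Ti": (85.0, 285),
    "RTX 3090": (95.0, 350),
}


def tokens_per_watt(tps, tdp_w):
    """Throughput per watt, using TDP as a proxy for measured draw (assumption)."""
    return tps / tdp_w


for name, (tps, tdp_w) in CARDS.items():
    print(f"{name}: {tokens_per_watt(tps, tdp_w):.3f} tok/s per watt")
```

Under that assumption the 4070 Ti comes out slightly ahead per watt, while the 3090 keeps the higher absolute throughput.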

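Apple Mac Studio M4 Max (128 GB unified memory) or RTX 4090 (24 GB). A 70B model at 4-bit needs roughly 40 GB or more (about 35 GB for the weights alone, plus KV cache and runtime overhead), so Apple wins on memory capacity. The RTX 4090 is faster in the table (18 vs. 9 tok/s on Llama 3.1 70B) but cannot hold the whole model in its 24 GB of VRAM.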
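
The ~40 GB figure follows from simple arithmetic: 70 billion parameters at 4 bits each is 35 GB of weights, and the rest is overhead. The sketch below works that out in Python. The 15% overhead fraction is an assumption picked here to cover KV cache and runtime buffers, not a measured value, and the function names are hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def estimated_total_gb(params_billions, bits_per_weight, overhead_fraction=0.15):
    """Weights plus an ASSUMED allowance for KV cache and runtime buffers."""
    weights = weight_memory_gb(params_billions, bits_per_weight)
    return weights * (1 + overhead_fraction)


weights = weight_memory_gb(70, 4)
total = estimated_total_gb(70, 4)
print(f"weights: {weights:.1f} GB, estimated total: {total:.1f} GB")

# Compare against nominal memory capacity from the table.
for name, capacity_gb in [("RTX 4090", 24), ("Mac Studio M4 Max", 128)]:
    verdict = "fits" if capacity_gb >= total else "does not fit"
    print(f"{name} ({capacity_gb} GB): model {verdict} entirely in memory")
```

This compares nominal capacity only; the share of unified memory actually available to the GPU, and any partial offload to system RAM, are outside what the sketch models.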

Geekom A8 or GMKtec NucBox K8+. Roughly 11–13 tok/s on 7B–8B models at 65 W, for under $600. Well suited to experiments.