NVIDIA RTX 5090 Inference Benchmark
Type: Reference (Hardware & Infrastructure)
Local inference performance data for the NVIDIA RTX 5090: benchmarks across model sizes, quantization levels, and batch configurations from BNL's production hardware.
At a glance: 32 GB VRAM · 45 tok/s (7B) · Blackwell architecture · 575 W TDP
RTX 5090 for Local LLM Inference
The NVIDIA RTX 5090 represents the current peak of consumer-grade GPU capability for local LLM inference. Blue Note Logic uses the RTX 5090 as the primary inference GPU in our Gilligan.TECH development environment.
Hardware Specifications
| Specification | RTX 5090 |
|---|---|
| GPU Architecture | Blackwell |
| CUDA Cores | 21,760 |
| VRAM | 32 GB GDDR7 |
| Memory Bandwidth | 1,792 GB/s |
| TDP | 575W |
| FP16 Performance | ~209 TFLOPS |
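For single-stream decoding, throughput is roughly memory-bandwidth bound: each generated token reads the full set of quantized weights, so tokens/sec can never exceed bandwidth divided by model size. A minimal sketch of that upper bound (the helper name is ours, not part of any benchmark tool; real-world figures land well below the ceiling due to kernel overhead and KV-cache reads):

```python
def decode_ceiling_toks(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed: every token reads all weights once."""
    return bandwidth_gbs / model_gb

# RTX 5090 (1,792 GB/s) with a ~5.9 GB Q4_K_M 7B model:
ceiling = decode_ceiling_toks(1792, 5.9)
print(f"~{ceiling:.0f} tok/s theoretical ceiling")
```

The gap between this ceiling and the measured 52 tok/s reflects real-world overheads; the useful takeaway is the scaling: halving model size roughly doubles the achievable decode rate.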
Inference Benchmarks (7B–8B Models)
| Model / Quant | Tokens/sec | VRAM Used | Time to First Token |
|---|---|---|---|
| Qwen 2.5 7B Q4_K_M | 52 tok/s | 5.9 GB | 0.8s |
| dobetter-norge-v2 Q5_K_M | 45 tok/s | 6.6 GB | 0.9s |
| Qwen 2.5 7B Q8_0 | 33 tok/s | 9.2 GB | 1.1s |
| Llama 3.1 8B Q5_K_M | 42 tok/s | 7.1 GB | 1.0s |
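The VRAM-used column tracks the quantization's bits per weight. A rough sketch of the relationship (the bits-per-weight figures are approximate values we assume for llama.cpp-style quants, not exact file sizes; measured VRAM adds KV cache and CUDA context on top):

```python
# Approximate bits-per-weight for common llama.cpp quantizations (assumed figures).
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

def weights_gb(params_billion: float, quant: str) -> float:
    """Quantized weight size in GB: parameters x bits-per-weight / 8 bits per byte."""
    return params_billion * BPW[quant] / 8

print(f"7B Q5_K_M weights: ~{weights_gb(7, 'Q5_K_M'):.1f} GB")
```

This is why the table's 6.6 GB for a Q5_K_M 7B model is a couple of gigabytes above the raw weight size: runtime overhead rides on top.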
Inference Benchmarks (13B–70B Models)
| Model / Quant | Tokens/sec | VRAM Used | Notes |
|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 28 tok/s | 9.8 GB | Fits comfortably |
| Llama 3.1 70B Q2_K | 8 tok/s | 28.1 GB | Near VRAM limit |
| Llama 3.1 70B Q4_K_M | — | >32 GB | Does not fit |
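The fit/no-fit outcomes above can be estimated before downloading a model: quantized weight size plus runtime overhead must stay under available VRAM. A rough sketch, assuming ~2 GB of headroom for CUDA context and KV cache at modest context lengths (that overhead figure is an assumption, not a measured value):

```python
def fits_in_vram(weights_gb: float, vram_gb: float = 32.0, overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights plus estimated runtime overhead vs. available VRAM."""
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(28.1))  # 70B Q2_K: fits, but barely -- near the VRAM limit
print(fits_in_vram(33.0))  # anything over 32 GB of weights fails outright
```

Long contexts grow the KV cache well past 2 GB, so a model that "barely fits" by this check may still OOM in practice.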
Recommendations
- 7B models: Any quantization fits easily. Q5_K_M is the sweet spot.
- 13B–14B models: Q4_K_M or Q5_K_M fit comfortably with good performance.
- 70B models: Only Q2_K/Q3_K fit in 32 GB VRAM. Consider cloud GPUs or multi-GPU for higher quantizations.
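The tokens/sec figures in the tables above can be reproduced by timing a generation call and dividing token count by wall-clock time. A minimal harness sketch; the `generate` callable is a stand-in for whatever runtime you use (llama.cpp, Ollama, etc.), and `fake_generate` exists only so the sketch runs on its own:

```python
import time
from typing import Callable, List

def measure_toks(generate: Callable[[str, int], List[str]],
                 prompt: str, max_tokens: int = 128) -> float:
    """Time one generation call and return decoded tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stand-in generator so the sketch is runnable; swap in a real model call.
def fake_generate(prompt: str, n: int) -> List[str]:
    return ["tok"] * n

rate = measure_toks(fake_generate, "Hello", 64)
```

Note this measures end-to-end rate including time to first token; for a pure decode figure, subtract the prefill time or discard the first token's latency.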
BNL Perspective
The RTX 5090 is genuinely impressive for local inference. We run dobetter-norge-v2 at 45 tokens per second with Q5_K_M, fast enough for real-time interactive use. The 32 GB of VRAM is the real differentiator versus the 4090's 24 GB: it opens up 13B–14B models at reasonable quantization levels. For our workflow (fine-tune on cloud A100s, deploy locally on the 5090 for development and testing) it's the ideal hardware.