Estimate GPU requirements, concurrent sessions, and latency for self-hosted LLM deployments: no API calls, all offline math.
Good balance of speed and quality for most tasks
Best for high-throughput cloud deployments. PagedAttention maximizes concurrent sessions.
Model weights need 9 GB + KV cache 0.64 GB = 9.64 GB total VRAM
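A minimal sketch of that math in Python. The model config below (9B parameters at int8 weights, 32 layers, 8 KV heads, head dim 128, 4k context, fp16 cache) is an illustrative assumption, not the exact model behind the figures above:

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight VRAM: parameter count x bytes per parameter."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1,
                bytes_per_elem: int = 2) -> float:
    """KV cache VRAM: 2 (K and V) x layers x KV heads x head dim x tokens."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem) / 1e9


w = weights_gb(9e9, 1.0)                      # 9B params at int8 -> 9.00 GB
kv = kv_cache_gb(n_layers=32, n_kv_heads=8,
                 head_dim=128, seq_len=4096)  # fp16 cache -> ~0.54 GB
print(f"{w:.2f} GB weights + {kv:.2f} GB KV cache = {w + kv:.2f} GB VRAM")
```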
What you get at different replica counts
| Replicas | Total GPUs | Total tokens/sec | Max concurrent sessions | Est. $/month |
|---|---|---|---|---|
| 1 (current) | 1 | 156 | 1 | $1,400 |
| 2 | 2 | 312 | 3 | $2,800 |
| 4 | 4 | 624 | 6 | $5,600 |
| 8 | 8 | 1,248 | 12 | $11,200 |
| 16 | 16 | 2,496 | 24 | $22,400 |
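The throughput and cost columns are linear in the per-replica numbers. A minimal sketch, taking the 156 tokens/sec and $1,400/GPU-month figures from the single-replica row as assumptions (the concurrency column follows the latency-budget rule in the caveats below):

```python
TPS_PER_REPLICA = 156        # tokens/sec for one replica (row 1)
COST_PER_GPU_MONTH = 1_400   # USD/month per GPU (row 1, assumed on-demand)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} replicas: {n:2d} GPUs, "
          f"{n * TPS_PER_REPLICA:5,} tokens/sec, "
          f"${n * COST_PER_GPU_MONTH:,}/month")
```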
⚠️ Estimates & assumptions
• Tokens/sec numbers are empirical estimates based on published benchmarks; actual results vary ±30% by batch size, sequence length, hardware config, and driver version.
• KV cache VRAM scales with sequence length. Long-context requests will consume significantly more VRAM than shown.
โข "Max concurrent sessions" assumes 5-second latency budget. Lower budgets reduce concurrency linearly.
• Quantized models (int4/int8) may show slightly reduced output quality. Always benchmark on your target use case.
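One way to make the latency-budget rule concrete: assume max concurrency = ⌊total tokens/sec × budget ÷ tokens per response⌋. The 512-token average response length is an assumption chosen because it reproduces the table's concurrency column; the tool's actual formula may differ.

```python
import math


def max_concurrent(total_tps: float, budget_s: float = 5.0,
                   tokens_per_response: int = 512) -> int:
    """Sessions that can each finish one response within the latency budget.

    512 tokens/response is an assumption picked to match the table above;
    the tool's real formula is not published here.
    """
    return math.floor(total_tps * budget_s / tokens_per_response)


for tps in (156, 312, 624, 1_248, 2_496):
    print(tps, "->", max_concurrent(tps))  # prints 1, 3, 6, 12, 24
```

Halving the budget to 2.5 s roughly halves each value, consistent with the "lower budgets reduce concurrency linearly" caveat.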