AI Workload Calculator

Estimate GPU requirements, concurrent sessions, and latency for self-hosted LLM deployments. No API calls, all offline math.

🤖 Model

Family: Llama 3
Params: 8B
Context: 8,192 tokens

A good balance of speed and quality for most tasks.

โš–๏ธ Quantization

๐Ÿ–ฅ๏ธ GPU

📊 Workload

Concurrent users: 10
Prompt tokens: 256
Output tokens: 512
Replicas: 1

🚀 Serving Framework

vLLM: best for high-throughput cloud deployments; PagedAttention maximizes concurrent sessions.

✅ Feasible: 1 × A100 40GB per replica

Model weights (9 GB) + KV cache (0.64 GB) = 9.6 GB total VRAM.
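
That figure can be reproduced offline. A minimal Python sketch, assuming the published Llama 3 8B shapes (32 layers, 8 KV heads, head dim 128), an 8-bit weight format with roughly 1 GB of runtime overhead, and a ~5,000-token fp16 KV reservation; none of those assumptions appear on this page:

# vram_estimate.py -- back-of-envelope VRAM math (a sketch, not the
# calculator's exact formula; model shapes are published Llama 3 8B values,
# overhead and cached-token count are assumptions)

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Raw weight storage: parameter count times bytes per parameter."""
    return params_billions * bytes_per_param

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_elem: int, cached_tokens: int) -> float:
    """K and V tensors: 2 * layers * heads * dim * bytes per cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * cached_tokens / 1e9

w = weights_gb(8, 1.0) + 1.0            # 8-bit weights + ~1 GB overhead (assumed)
kv = kv_cache_gb(32, 8, 128, 2, 5000)   # fp16 KV cache, ~5k cached tokens (assumed)
print(f"weights ~{w:.1f} GB + KV ~{kv:.2f} GB = ~{w + kv:.1f} GB total")
# -> weights ~9.0 GB + KV ~0.66 GB = ~9.7 GB total (close to the 9.6 GB shown)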

🖥️ GPUs per replica: 1 (40 GB each)
⚡ Throughput: 156 tokens/sec total
👥 Max concurrent sessions: 1 (at ≤5s latency)
⏱️ P50 latency: 3.3s (512 output tokens)
🎯 TTFT estimate: ~100ms (time to first token)
💾 VRAM per replica: 9.6 GB (model + KV cache)
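
The latency and concurrency figures above are consistent with simple division. A sketch of the apparent formulas, inferred from the displayed numbers rather than taken from the calculator's source:

# latency_math.py -- reproduces the P50 and concurrency estimates above
OUTPUT_TOKENS = 512   # tokens generated per request (Workload inputs)
TPS = 156             # total tokens/sec per replica (estimate above)
BUDGET_S = 5.0        # latency budget behind "max concurrent sessions"

p50 = OUTPUT_TOKENS / TPS                           # a lone session gets the full TPS
max_sessions = int(BUDGET_S * TPS / OUTPUT_TOKENS)  # sessions that fit the budget

print(f"P50 ~{p50:.1f}s, max concurrent sessions = {max_sessions}")
# -> P50 ~3.3s, max concurrent sessions = 1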

Scaling options

What you get at different replica counts

Replicas      Total GPUs   Total TPS   Max Concurrent   Est. $/month
1 (current)   1            156         1                $1,400
2             2            312         3                $2,800
4             4            624         6                $5,600
8             8            1,248       12               $11,200
16            16           2,496       24               $22,400
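
Every row is a linear extrapolation of the single-replica figures; a short loop reproduces the table (the $1,400/replica/month rate is the table's own estimate):

# scaling_table.py -- regenerates the scaling table, assuming the same
# perfectly linear per-replica scaling the page assumes
TPS_PER_REPLICA = 156
COST_PER_REPLICA = 1400        # $/month, from the table above
OUTPUT_TOKENS, BUDGET_S = 512, 5.0

for replicas in (1, 2, 4, 8, 16):
    tps = TPS_PER_REPLICA * replicas
    concurrent = int(BUDGET_S * tps / OUTPUT_TOKENS)
    print(f"{replicas:>2} replicas: {tps:>5,} TPS, {concurrent:>2} concurrent, "
          f"${COST_PER_REPLICA * replicas:,}/month")
# -> 16 replicas: 2,496 TPS, 24 concurrent, $22,400/month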

📋 Suggested deployment (vLLM)

# Pull & run with vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama-3-8b \
--dtype auto \
--max-num-seqs 1 \
--port 8000
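
Once the server is up, a quick smoke test against vLLM's OpenAI-compatible endpoint. A stdlib-only sketch; it assumes the default localhost:8000 from the command above, and the "model" field must match whatever was passed to --model:

# smoke_test.py -- minimal completion request to the running server
import json
import urllib.request

payload = {
    "model": "meta-llama/Meta-Llama-3-8B",  # must match the --model flag
    "prompt": "Hello, world",
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])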

โš ๏ธ Estimates & assumptions

โ€ข Tokens/sec numbers are empirical estimates based on published benchmarks โ€” actual results vary ยฑ30% by batch size, sequence length, hardware config, and driver version.

โ€ข KV cache VRAM scales with sequence length. Long-context requests will consume significantly more VRAM than shown.

โ€ข "Max concurrent sessions" assumes 5-second latency budget. Lower budgets reduce concurrency linearly.

โ€ข Quantized models (int4/int8) may show slightly reduced output quality. Always benchmark on your target use case.