Estimate GPU requirements, concurrent sessions, and latency for self-hosted LLM deployments: no API calls, all offline math.
Good balance of speed and quality for most tasks
Best for high-throughput cloud deployments. PagedAttention maximizes concurrent sessions.
Model weights need 9 GB + KV cache 0.64 GB = 9.64 GB total VRAM
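A minimal sketch of that math in Python. The model config below (9B parameters at int8 weights, 32 layers, 8 KV heads, head dim 128, 4k context, fp16 cache) is an illustrative assumption, not the exact model behind the figures above:

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight VRAM: parameter count x bytes per parameter."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1,
                bytes_per_elem: int = 2) -> float:
    """KV cache VRAM: 2 (K and V) x layers x KV heads x head dim x tokens."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem) / 1e9


w = weights_gb(9e9, 1.0)                      # 9B params at int8 -> 9.00 GB
kv = kv_cache_gb(n_layers=32, n_kv_heads=8,
                 head_dim=128, seq_len=4096)  # fp16 cache -> ~0.54 GB
print(f"{w:.2f} GB weights + {kv:.2f} GB KV cache = {w + kv:.2f} GB VRAM")
```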
What you get at different replica counts
| Replicas | Total GPUs | Total tokens/sec | Max concurrent sessions | Est. $/month |
|---|---|---|---|---|
| 1 (current) | 1 | 156 | 1 | $1,400 |
| 2 | 2 | 312 | 3 | $2,800 |
| 4 | 4 | 624 | 6 | $5,600 |
| 8 | 8 | 1,248 | 12 | $11,200 |
| 16 | 16 | 2,496 | 24 | $22,400 |
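The throughput and cost columns are linear in the per-replica numbers. A minimal sketch, taking the 156 tokens/sec and $1,400/GPU-month figures from the single-replica row as assumptions (the concurrency column follows the latency-budget rule in the caveats below):

```python
TPS_PER_REPLICA = 156        # tokens/sec for one replica (row 1)
COST_PER_GPU_MONTH = 1_400   # USD/month per GPU (row 1, assumed on-demand)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} replicas: {n:2d} GPUs, "
          f"{n * TPS_PER_REPLICA:5,} tokens/sec, "
          f"${n * COST_PER_GPU_MONTH:,}/month")
```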
⚠️ Estimates & assumptions
• Tokens/sec numbers are empirical estimates based on published benchmarks; actual results vary ±30% by batch size, sequence length, hardware config, and driver version.
• KV cache VRAM scales with sequence length. Long-context requests will consume significantly more VRAM than shown.
โข "Max concurrent sessions" assumes 5-second latency budget. Lower budgets reduce concurrency linearly.
• Quantized models (int4/int8) may show slightly reduced output quality. Always benchmark on your target use case.
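One way to make the latency-budget rule concrete: assume max concurrency = ⌊total tokens/sec × budget ÷ tokens per response⌋. The 512-token average response length is an assumption chosen because it reproduces the table's concurrency column; the tool's actual formula may differ.

```python
import math


def max_concurrent(total_tps: float, budget_s: float = 5.0,
                   tokens_per_response: int = 512) -> int:
    """Sessions that can each finish one response within the latency budget.

    512 tokens/response is an assumption picked to match the table above;
    the tool's real formula is not published here.
    """
    return math.floor(total_tps * budget_s / tokens_per_response)


for tps in (156, 312, 624, 1_248, 2_496):
    print(tps, "->", max_concurrent(tps))  # prints 1, 3, 6, 12, 24
```

Halving the budget to 2.5 s roughly halves each value, consistent with the "lower budgets reduce concurrency linearly" caveat.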