If you’re planning to run open-weight LLMs locally or in production, one of the first questions is:
How much GPU VRAM do I actually need?
The answer depends on three major components:
- Model weights
- KV cache (context memory)
- Runtime overhead
Let’s break each one down clearly and practically.
1️⃣ Model Weights: The Base Memory Cost
The largest fixed memory cost comes from the model weights.
Simple Formula
Weights (GB) ≈ Parameters (in billions) × (bits per weight / 8)
Because:
- 8 bits = 1 byte
- 1 billion parameters ≈ 1e9 values
Typical Memory Per Parameter
| Precision | Bytes per Parameter | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 bytes | 28 GB | 280 GB |
| FP16/BF16 | 2 bytes | 14 GB | 140 GB |
| INT8 | 1 byte | 7 GB | 70 GB |
| 4-bit | ~0.5 byte | ~3.5–5 GB | ~35–50 GB |
Note: Quantized models use extra scaling factors, so real usage is slightly higher than the theoretical number.
Example Calculation
For a 13B model in FP16:
13 × (16 / 8) = 26 GB
So you need ~26 GB just for weights.
2️⃣ KV Cache: The Hidden Memory Multiplier
The second major memory consumer is the KV cache (Key-Value cache).
This stores attention history so the model doesn’t recompute previous tokens.
KV Cache Scales With:
- Context length (number of tokens)
- Batch size (concurrent requests)
- Number of layers
- Hidden size
- KV precision
Rough Practical Estimates
For many modern models:
| Model Size | KV per 1k Tokens |
|---|---|
| 7B | ~0.2–0.6 GB |
| 13B | ~0.4–1.0 GB |
| 70B | Several GB |
So:
- 8k context can consume several GB
- 32k context can consume tens of GB
- 64k+ context becomes extremely expensive
3️⃣ Runtime Overhead
Even after weights + KV cache, you need extra headroom for:
- CUDA workspace buffers
- FlashAttention
- Temporary activations
- Memory fragmentation
- Sampling buffers
Safe Rule:
Add 10–30% extra VRAM for stability.
📌 Total VRAM Formula
Total VRAM ≈ Weights + KV Cache + 20% Overhead
Practical VRAM Requirements by Model Class
| Model | 4-bit | FP16 |
|---|---|---|
| 7B | 6–10 GB | 16–24 GB |
| 13B | 12–20 GB | 32–48 GB |
| 70B | 40–60 GB | 160+ GB |
These assume moderate context (8k–16k).
Long context multiplies requirements quickly.
🚀 Example: Qwen3.5-397B-A17B
Now let’s apply this to a real model.
The model described:
- 397 billion total parameters
- 17B activated per token (Mixture-of-Experts)
- 262k native context (extendable past 1M)
- 60 layers
- Apache 2.0 license
- Verified on 8× H200 GPUs
⚠️ Important: “17B Active” Does NOT Reduce Weight Memory
Even though only 17B parameters activate per token,
you still must load all 397B parameters into memory.
MoE reduces compute cost — not weight storage cost.
1️⃣ Weight Memory
BF16 / FP16
397 × (16 / 8) ≈ 794 GB
So roughly:
~800 GB VRAM for weights
FP8 / INT8
397 × (8 / 8) ≈ 397 GB
~400 GB VRAM
4-bit Quantization
397 × (4 / 8) ≈ 198.5 GB
Realistically:
~220–260 GB VRAM including scaling overhead
2️⃣ KV Cache at 64k Context
This model uses:
- 60 layers
- 2 KV heads
- Head dimension 256
Approximate KV usage:
- ~120 KB per token
- At 64k tokens:
≈ 7.5 GB per request
So:
- 10 concurrent 64k sessions ≈ 75 GB
- 30 concurrent sessions ≈ 225 GB
KV memory scales linearly with concurrency.
3️⃣ Realistic Deployment: 8× H200 141GB
Total VRAM:
8 × 141 GB = 1128 GB
In BF16:
- ~800 GB weights
- ~100–200 GB KV cache
- ~100 GB overhead
This fits safely.
Speed=(tokensprompt+tokensgeneration)/ time
Qwen’s official “Speed Benchmark
📊 Expected Throughput on 8× H200
For 64k max context:
| Workload Type | Aggregate Output Throughput |
|---|---|
| Online serving | ~200–800 tokens/sec |
| High-batch offline | ~600–1500 tokens/sec |
Per single request:
- Usually single-digit to few dozen tokens/sec
- Depends on batching efficiency
Throughput depends heavily on:
- Continuous batching
- KV cache pressure
- Network latency
- All-reduce efficiency
- Thinking mode usage
🧠 Key Takeaways
- Parameter count determines weight memory.
- Context length determines KV memory.
- Concurrency multiplies KV usage.
- MoE reduces compute cost, not memory cost.
- Large 400B-class models require multi-GPU clusters.
- Long context (64k+) dramatically increases memory needs.
🎯 Final Rule of Thumb
If you’re sizing hardware:
VRAM ≈ (Params × bits/8) + (Context × KV per token × concurrency) + 20%
For Qwen3.5-397B-A17B:
- BF16 → ~800 GB baseline
- Practical deployment → 8× H200 class system
- 64k context → ~7.5 GB per active request