If you’re planning to run open-weight LLMs locally or in production, one of the first questions is:

How much GPU VRAM do I actually need?

The answer depends on three major components:

Model weights
KV cache (context memory)
Runtime overhead

Let’s break each one down clearly and practically.

1️⃣ Model Weights: The Base Memory Cost

The largest fixed memory cost comes from the model weights.

Simple Formula

Weights (GB) ≈ Parameters (in billions) × (bits per weight / 8)

Because:

8 bits = 1 byte
1 billion parameters ≈ 1e9 values

Typical Memory Per Parameter

Precision	Bytes per Parameter	7B Model	70B Model
FP32	4 bytes	28 GB	280 GB
FP16/BF16	2 bytes	14 GB	140 GB
INT8	1 byte	7 GB	70 GB
4-bit	~0.5 byte	~3.5–5 GB	~35–50 GB

Note: Quantized models use extra scaling factors, so real usage is slightly higher than the theoretical number.

Example Calculation

For a 13B model in FP16:

13 × (16 / 8) = 26 GB

So you need ~26 GB just for weights.

2️⃣ KV Cache: The Hidden Memory Multiplier

The second major memory consumer is the KV cache (Key-Value cache).

This stores attention history so the model doesn’t recompute previous tokens.

KV Cache Scales With:

Context length (number of tokens)
Batch size (concurrent requests)
Number of layers
Hidden size
KV precision

Rough Practical Estimates

For many modern models:

Model Size	KV per 1k Tokens
7B	~0.2–0.6 GB
13B	~0.4–1.0 GB
70B	Several GB

So:

8k context can consume several GB
32k context can consume tens of GB
64k+ context becomes extremely expensive

3️⃣ Runtime Overhead

Even after weights + KV cache, you need extra headroom for:

CUDA workspace buffers
FlashAttention
Temporary activations
Memory fragmentation
Sampling buffers

Safe Rule:

Add 10–30% extra VRAM for stability.

📌 Total VRAM Formula

Total VRAM ≈ Weights + KV Cache + 20% Overhead

Practical VRAM Requirements by Model Class

Model	4-bit	FP16
7B	6–10 GB	16–24 GB
13B	12–20 GB	32–48 GB
70B	40–60 GB	160+ GB

These assume moderate context (8k–16k).

Long context multiplies requirements quickly.

🚀 Example: Qwen3.5-397B-A17B

Now let’s apply this to a real model.

The model described:

397 billion total parameters
17B activated per token (Mixture-of-Experts)
262k native context (extendable past 1M)
60 layers
Apache 2.0 license
Verified on 8× H200 GPUs

⚠️ Important: “17B Active” Does NOT Reduce Weight Memory

Even though only 17B parameters activate per token,
you still must load all 397B parameters into memory.

MoE reduces compute cost — not weight storage cost.

1️⃣ Weight Memory

BF16 / FP16

397 × (16 / 8) ≈ 794 GB

So roughly:

~800 GB VRAM for weights

FP8 / INT8

397 × (8 / 8) ≈ 397 GB

~400 GB VRAM

4-bit Quantization

397 × (4 / 8) ≈ 198.5 GB

Realistically:

~220–260 GB VRAM including scaling overhead

2️⃣ KV Cache at 64k Context

This model uses:

60 layers
2 KV heads
Head dimension 256

Approximate KV usage:

~120 KB per token
At 64k tokens:

≈ 7.5 GB per request

So:

10 concurrent 64k sessions ≈ 75 GB
30 concurrent sessions ≈ 225 GB

KV memory scales linearly with concurrency.

3️⃣ Realistic Deployment: 8× H200 141GB

Total VRAM:

8 × 141 GB = 1128 GB

In BF16:

~800 GB weights
~100–200 GB KV cache
~100 GB overhead

This fits safely.

Speed=(tokensprompt+tokensgeneration)/ time

Qwen’s official “Speed Benchmark

📊 Expected Throughput on 8× H200

For 64k max context:

Workload Type	Aggregate Output Throughput
Online serving	~200–800 tokens/sec
High-batch offline	~600–1500 tokens/sec

Per single request:

Usually single-digit to few dozen tokens/sec
Depends on batching efficiency

Throughput depends heavily on:

Continuous batching
KV cache pressure
Network latency
All-reduce efficiency
Thinking mode usage

🧠 Key Takeaways

Parameter count determines weight memory.
Context length determines KV memory.
Concurrency multiplies KV usage.
MoE reduces compute cost, not memory cost.
Large 400B-class models require multi-GPU clusters.
Long context (64k+) dramatically increases memory needs.

🎯 Final Rule of Thumb

If you’re sizing hardware:

VRAM ≈ (Params × bits/8) + (Context × KV per token × concurrency) + 20%

For Qwen3.5-397B-A17B:

BF16 → ~800 GB baseline
Practical deployment → 8× H200 class system
64k context → ~7.5 GB per active request

1️⃣ Model Weights: The Base Memory Cost#

Simple Formula#

Typical Memory Per Parameter#

Example Calculation#

2️⃣ KV Cache: The Hidden Memory Multiplier#

KV Cache Scales With:#

Rough Practical Estimates#

3️⃣ Runtime Overhead#

Safe Rule:#

📌 Total VRAM Formula#

Practical VRAM Requirements by Model Class#

🚀 Example: Qwen3.5-397B-A17B#

⚠️ Important: “17B Active” Does NOT Reduce Weight Memory#

1️⃣ Weight Memory#

BF16 / FP16#

FP8 / INT8#

4-bit Quantization#

2️⃣ KV Cache at 64k Context#

3️⃣ Realistic Deployment: 8× H200 141GB#

In BF16:#

📊 Expected Throughput on 8× H200#

🧠 Key Takeaways#

🎯 Final Rule of Thumb#

Related Articles