If you’re planning to run open-weight LLMs locally or in production, one of the first questions is:

How much GPU VRAM do I actually need?

The answer depends on three major components:

  1. Model weights
  2. KV cache (context memory)
  3. Runtime overhead

Let’s break each one down clearly and practically.


1️⃣ Model Weights: The Base Memory Cost

The largest fixed memory cost comes from the model weights.

Simple Formula

Weights (GB) ≈ Parameters (in billions) × (bits per weight / 8)

Because:

  • 8 bits = 1 byte
  • 1 billion parameters ≈ 1e9 values

Typical Memory Per Parameter

Precision  | Bytes per Parameter | 7B Model  | 70B Model
FP32       | 4 bytes             | 28 GB     | 280 GB
FP16/BF16  | 2 bytes             | 14 GB     | 140 GB
INT8       | 1 byte              | 7 GB      | 70 GB
4-bit      | ~0.5 byte           | ~3.5–5 GB | ~35–50 GB

Note: Quantized models use extra scaling factors, so real usage is slightly higher than the theoretical number.
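To see where the extra memory in quantized models comes from, here is a minimal sketch. It assumes group-wise quantization in which every group of weights shares one FP16 scale; the group size of 128 and the function names are illustrative assumptions, not the settings of any particular quantization tool.

```python
def effective_bits_per_weight(bits: int, group_size: int = 128,
                              scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Bits stored per weight once per-group metadata is amortized.

    Assumption: each group of `group_size` weights shares one scale
    (and optionally one zero point). Real formats differ in the details.
    """
    return bits + (scale_bits + zero_point_bits) / group_size


def quantized_weight_gb(params_billions: float, bits: int, **kwargs) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), including scale overhead."""
    return params_billions * effective_bits_per_weight(bits, **kwargs) / 8


print(effective_bits_per_weight(4))        # 4.125 bits per weight
print(quantized_weight_gb(7, 4))           # about 3.61 GB instead of 3.5 GB
print(quantized_weight_gb(7, 4, group_size=32, zero_point_bits=4))
```

Smaller groups cost more metadata per weight. Some layers are also often kept in higher precision, which is one reason real files can land above this estimate.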


Example Calculation

For a 13B model in FP16:

13 × (16 / 8) = 26 GB

So you need ~26 GB just for weights.
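The same arithmetic as a small helper, in case you want to try other sizes and precisions. This is a sketch of the simple formula only; the function name is made up for illustration, and 1 GB is taken as 1e9 bytes.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Theoretical weight memory in GB (1 GB = 1e9 bytes).

    Ignores quantization metadata such as scales and zero points.
    """
    return params_billions * bits_per_weight / 8


# The 13B FP16 example from the text
print(weight_memory_gb(13, 16))

# A few other precision and size combinations
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    sizes = ", ".join(f"{b}B: {weight_memory_gb(b, bits):.1f} GB" for b in (7, 13, 70))
    print(f"{name:<10} {sizes}")
```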


2️⃣ KV Cache: The Hidden Memory Multiplier

The second major memory consumer is the KV cache (Key-Value cache).

It stores the attention keys and values of every previous token, so the model doesn’t have to recompute them at each generation step.

KV Cache Scales With:

  • Context length (number of tokens)
  • Batch size (concurrent requests)
  • Number of layers
  • Number of KV heads × head dimension (equal to the hidden size unless the model uses grouped-query attention)
  • KV precision
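The factors above combine into one product. The sketch below assumes a standard transformer that caches one key and one value vector per KV head in every layer; the 32-layer, 128-dimension configuration is a hypothetical 7B-class model chosen for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added by one token (factor 2 = key + value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_tokens: int, batch_size: int = 1,
                 bytes_per_value: int = 2) -> float:
    """Total KV cache in GiB (2**30 bytes) for a batch of equal-length sequences."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value)
    return per_token * context_tokens * batch_size / 2**30


# Hypothetical 7B-class model, FP16 KV, full multi-head attention (32 KV heads)
print(kv_cache_gib(32, 32, 128, context_tokens=1024))   # 0.5 GiB per 1k tokens

# Same model with grouped-query attention (8 KV heads): 4x smaller
print(kv_cache_gib(32, 8, 128, context_tokens=1024))    # 0.125 GiB per 1k tokens
```

Doubling the context length or the batch size doubles the result, which is why long-context, high-concurrency serving is dominated by KV memory.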

Rough Practical Estimates

For many modern models:

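Model Size | KV per 1k Tokens
7B         | ~0.2–0.6 GB
13B        | ~0.4–1.0 GB
70B        | Several GB

The upper ends assume FP16 KV with full multi-head attention. Models with grouped-query attention (fewer KV heads) need considerably less.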

So:

  • 8k context can consume several GB
  • 32k context can consume tens of GB
  • 64k+ context becomes extremely expensive

3️⃣ Runtime Overhead

Even after weights + KV cache, you need extra headroom for:

  • CUDA workspace buffers
  • Attention kernel workspace (e.g. FlashAttention)
  • Temporary activations
  • Memory fragmentation
  • Sampling buffers

Safe Rule:

Add 10–30% extra VRAM for stability.


📌 Total VRAM Formula

Total VRAM ≈ (Weights + KV Cache) × 1.2

The factor 1.2 adds 20% overhead, the midpoint of the 10–30% range.
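A minimal sketch of the total formula, reading the 20% as a fraction of weights plus KV cache (that reading is an assumption; adjust `overhead_fraction` anywhere in the 10–30% range). The 8 GB KV figure in the example is a placeholder, not a measurement.

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  overhead_fraction: float = 0.2) -> float:
    """Estimated total VRAM: weights + KV cache, plus proportional overhead."""
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction should be a fraction such as 0.2")
    return (weights_gb + kv_cache_gb) * (1 + overhead_fraction)


# 13B model in FP16 (26 GB of weights) with a placeholder 8 GB KV cache
print(round(total_vram_gb(26, 8), 1))

# The same setup at the low and high ends of the overhead range
print(round(total_vram_gb(26, 8, overhead_fraction=0.1), 1))
print(round(total_vram_gb(26, 8, overhead_fraction=0.3), 1))
```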


Practical VRAM Requirements by Model Class

Model | 4-bit    | FP16
7B    | 6–10 GB  | 16–24 GB
13B   | 12–20 GB | 32–48 GB
70B   | 40–60 GB | 160+ GB

These assume moderate context (8k–16k).

Long context multiplies requirements quickly.



🚀 Example: Qwen3.5-397B-A17B

Now let’s apply this to a real model.

The model described:

  • 397 billion total parameters
  • 17B activated per token (Mixture-of-Experts)
  • 262k native context (extendable past 1M)
  • 60 layers
  • Apache 2.0 license
  • Verified on 8× H200 GPUs

⚠️ Important: “17B Active” Does NOT Reduce Weight Memory

Even though only 17B parameters are activated per token, all 397B parameters must still be loaded into memory, because any expert may be selected for the next token.

MoE reduces compute cost, not weight storage cost.
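The split between memory and compute can be made concrete with a short sketch. It uses the common rule of thumb of about 2 FLOPs per active parameter per generated token; that rule and the function name are assumptions for illustration, not figures from the model card.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: int = 16):
    """Return (weight memory in GB, approximate forward FLOPs per token).

    Memory follows TOTAL parameters; compute follows ACTIVE parameters.
    Assumption: about 2 FLOPs per active parameter per token.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    flops_per_token = 2 * active_params_b * 1e9
    return weights_gb, flops_per_token


moe_gb, moe_flops = moe_footprint(397, 17)       # MoE: 397B total, 17B active
dense_gb, dense_flops = moe_footprint(397, 397)  # hypothetical dense model of equal size

print(moe_gb, dense_gb)            # identical weight memory
print(dense_flops / moe_flops)     # the dense model needs ~23x the compute per token
```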


1️⃣ Weight Memory

BF16 / FP16

397 × (16 / 8) ≈ 794 GB

So roughly:

~800 GB VRAM for weights


FP8 / INT8

397 × (8 / 8) ≈ 397 GB

~400 GB VRAM


4-bit Quantization

397 × (4 / 8) ≈ 198.5 GB

Realistically:

~220–260 GB VRAM including scaling overhead
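As a quick sanity check, the sketch below turns the three theoretical weight figures into a minimum GPU count, assuming 141 GB per GPU (the H200 capacity used later in this article). It covers weights only: KV cache, runtime overhead, quantization scales, and any constraints your serving framework places on GPU counts are deliberately ignored.

```python
import math

GPU_GB = 141  # assumed per-GPU capacity (H200)


def min_gpus_for_weights(params_billions: float, bits_per_weight: float,
                         gpu_gb: float = GPU_GB) -> int:
    """Smallest number of GPUs whose combined VRAM holds the weights alone."""
    weights_gb = params_billions * bits_per_weight / 8
    return math.ceil(weights_gb / gpu_gb)


for label, bits in [("BF16/FP16", 16), ("FP8/INT8", 8), ("4-bit", 4)]:
    print(f"{label:<10} {397 * bits / 8:6.1f} GB -> at least "
          f"{min_gpus_for_weights(397, bits)} GPUs for weights alone")
```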


2️⃣ KV Cache at 64k Context

This model uses:

  • 60 layers
  • 2 KV heads
  • Head dimension 256

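Approximate KV usage, assuming BF16 (2 bytes per value) and one key plus one value vector per KV head in each layer:

  • 2 × 60 × 2 × 256 × 2 bytes ≈ 120 KB per token
  • At 64k tokens: ≈ 7.5 GB per request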

So:

  • 10 concurrent 64k sessions ≈ 75 GB
  • 30 concurrent sessions ≈ 225 GB

KV memory scales linearly with concurrency.
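The numbers in this section can be reproduced in a few lines. The sketch assumes BF16 KV values and that all 60 layers keep a standard KV cache, which is what the figures above imply. Note that the ~120 KB and ~7.5 GB values come out in binary units (KiB and GiB).

```python
LAYERS, KV_HEADS, HEAD_DIM = 60, 2, 256   # configuration quoted in the text
BYTES_PER_VALUE = 2                       # assumption: BF16 KV cache
CONTEXT_TOKENS = 64 * 1024                # "64k" taken as 65,536 tokens

# Factor 2 = one key vector plus one value vector per KV head per layer
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_gib_per_request = kv_bytes_per_token * CONTEXT_TOKENS / 2**30


def kv_gib_for_sessions(sessions: int) -> float:
    """KV cache for a number of concurrent full-length sessions."""
    return sessions * kv_gib_per_request


print(kv_bytes_per_token / 1024)      # KiB per token
print(kv_gib_per_request)             # GiB per 64k request
for n in (1, 10, 30):
    print(n, "sessions:", kv_gib_for_sessions(n), "GiB")
```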


3️⃣ Realistic Deployment: 8× H200 141GB

Total VRAM:

8 × 141 GB = 1128 GB

In BF16:

  • ~800 GB weights
  • ~100–200 GB KV cache
  • ~100 GB overhead

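This adds up to roughly 1,000–1,100 GB, so it fits within the 1,128 GB available, though headroom shrinks to under 30 GB at the upper end of the KV range.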
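The budget above is easy to check in code. This sketch simply adds the three rounded figures from the list and compares them with the cluster total; the function name is made up for illustration.

```python
def headroom_gb(total_vram_gb: float, weights_gb: float,
                kv_cache_gb: float, overhead_gb: float) -> float:
    """Remaining VRAM after weights, KV cache, and overhead (negative = does not fit)."""
    return total_vram_gb - (weights_gb + kv_cache_gb + overhead_gb)


TOTAL_GB = 8 * 141  # 8x H200

for kv_gb in (100, 200):
    spare = headroom_gb(TOTAL_GB, weights_gb=800, kv_cache_gb=kv_gb, overhead_gb=100)
    status = "fits" if spare >= 0 else "does NOT fit"
    print(f"KV cache {kv_gb} GB: {status}, {spare} GB to spare")
```

With 100 GB of KV cache there is comfortable room; at 200 GB the margin is thin, so the KV budget is the number to watch.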


Qwen’s official “Speed Benchmark” measures speed as:

Speed = (prompt tokens + generated tokens) / time
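The speed definition above can be sketched as a small helper. The token counts and timing in the example are made-up placeholders, not benchmark results.

```python
def speed_tokens_per_sec(prompt_tokens: int, generated_tokens: int,
                         elapsed_seconds: float) -> float:
    """Speed = (prompt tokens + generated tokens) / time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (prompt_tokens + generated_tokens) / elapsed_seconds


# Placeholder run: 6,000 prompt tokens and 2,000 generated tokens in 40 seconds
print(speed_tokens_per_sec(6000, 2000, 40.0))
```

Because this definition counts prompt tokens too, it yields higher numbers than output-only throughput for the same run, so compare like with like.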

📊 Expected Throughput on 8× H200

For 64k max context:

Workload Type      | Aggregate Output Throughput
Online serving     | ~200–800 tokens/sec
High-batch offline | ~600–1500 tokens/sec

Per single request:

  • Usually single-digit to a few dozen tokens/sec
  • Depends on batching efficiency
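A rough way to relate aggregate and per-request numbers is to divide one by the other. The sketch assumes every concurrent request gets an equal share of the aggregate output throughput, which real schedulers only approximate.

```python
def per_request_tokens_per_sec(aggregate_tokens_per_sec: float,
                               concurrent_requests: int) -> float:
    """Average output rate per request, assuming an equal share for each."""
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return aggregate_tokens_per_sec / concurrent_requests


# Online-serving range from the table, shared across 30 concurrent requests
for aggregate in (200, 800):
    rate = per_request_tokens_per_sec(aggregate, 30)
    print(f"{aggregate} tokens/sec aggregate -> {rate:.1f} tokens/sec per request")
```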

Throughput depends heavily on:

  • Continuous batching
  • KV cache pressure
  • Network latency
  • All-reduce efficiency
  • Thinking mode usage

🧠 Key Takeaways

  1. Parameter count determines weight memory.
  2. Context length determines KV memory.
  3. Concurrency multiplies KV usage.
  4. MoE reduces compute cost, not weight memory.
  5. Large 400B-class models require multi-GPU clusters.
  6. Long context (64k+) dramatically increases memory needs.

🎯 Final Rule of Thumb

If you’re sizing hardware:

VRAM ≈ [(Params × bits/8) + (Context × KV per token × concurrency)] × 1.2

For Qwen3.5-397B-A17B:

  • BF16 → ~800 GB baseline
  • Practical deployment → 8× H200 class system
  • 64k context → ~7.5 GB per active request
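The rule of thumb can also be turned around to ask how many concurrent requests a given system can hold. This is a sketch under the same assumptions as the rest of the article: 20% overhead applied to weights plus KV cache, theoretical weight sizes, and ~7.5 GB of KV cache per 64k request. Treat the results as planning estimates, not guarantees.

```python
def max_concurrent_requests(total_vram_gb: float, weights_gb: float,
                            kv_gb_per_request: float,
                            overhead_fraction: float = 0.2) -> int:
    """Largest concurrency such that (weights + KV) * (1 + overhead) fits in VRAM."""
    usable_gb = total_vram_gb / (1 + overhead_fraction)
    kv_budget_gb = usable_gb - weights_gb
    if kv_budget_gb <= 0:
        return 0  # the weights alone do not fit once overhead is reserved
    return int(kv_budget_gb // kv_gb_per_request)


TOTAL_GB = 8 * 141  # 8x H200

print(max_concurrent_requests(TOTAL_GB, weights_gb=794, kv_gb_per_request=7.5))  # BF16
print(max_concurrent_requests(TOTAL_GB, weights_gb=397, kv_gb_per_request=7.5))  # FP8
```

Dropping from BF16 to FP8 weights frees enough memory to multiply the number of 64k sessions several times over, which is often a better trade than adding GPUs.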