LLM Inference

TTFT in LLMs Explained: What Time to First Token Really Measures

When I evaluate an LLM system, one of the first latency metrics I look at is TTFT, or time to first token. This metric answers a simple question: After a user sends a request, how long does it take before the first output token appears? That sounds narrow, but it matters a lot. Users usually forgive a response that streams steadily after it starts. What feels bad is the dead time before anything appears on screen. ...

How Much GPU VRAM Do You Need to Run Large Language Models?

If you’re planning to run open-weight LLMs locally or in production, one of the first questions is: How much GPU VRAM do I actually need? The answer depends on three major components: model weights, KV cache (context memory), and runtime overhead. Let’s break each one down clearly and practically.

1️⃣ Model Weights: The Base Memory Cost

The largest fixed memory cost comes from the model weights.

Simple Formula: Weights (GB) ≈ Parameters (in billions) × (bits per weight / 8) ...
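The weight formula in the excerpt above is easy to try as a small calculator. This is a minimal sketch, not code from the article: the function name `weight_memory_gb` and the sample model sizes are illustrative assumptions, and the result covers weights only, not the KV cache or runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB.

    Implements: Weights (GB) ~= parameters (in billions) x (bits per weight / 8).
    Hypothetical helper for illustration; it ignores KV cache and runtime overhead.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    # Sample configurations chosen for illustration only.
    for params, bits in [(7, 16), (7, 4), (70, 8)]:
        gb = weight_memory_gb(params, bits)
        print(f"{params}B parameters at {bits}-bit: about {gb:g} GB of weights")
```

Under this formula a 7B-parameter model needs roughly 14 GB for weights at 16 bits and roughly 3.5 GB at 4 bits, which is why quantization is usually the first lever for fitting a model on a smaller GPU.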

Understanding LLM Inference Basics: Prefill and Decode, TTFT, and ITL

Large language models (LLMs) like GPT-4, Llama, or Grok generate text by running inference — the phase where a trained model produces outputs from a given input prompt. While training is resource-intensive and done once, inference happens every time a user sends a query. Understanding the mechanics of inference is key to grasping why some models feel “fast” while others lag, and why certain optimizations matter. At a high level, modern LLM inference (for autoregressive transformer-based models) splits into two distinct phases: prefill and decode. These phases behave very differently in terms of computation and directly affect two critical user-facing metrics: Time to First Token (TTFT) and Inter-Token Latency (ITL). ...
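Both metrics named in the excerpt above can be computed from token arrival timestamps. The sketch below assumes TTFT is the gap between sending the request and receiving the first token, and takes ITL to be the mean gap between consecutive output tokens, which is a common reading but an assumption here. The function name `ttft_and_itl` and the sample timestamps are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def ttft_and_itl(
    request_time: float, token_times: Sequence[float]
) -> Tuple[float, Optional[float]]:
    """Return (TTFT, mean ITL) in the same time unit as the inputs.

    request_time: when the request was sent.
    token_times: arrival time of each output token, in order.
    ITL is None when fewer than two tokens arrived.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # TTFT: waiting time before the first token appears (dominated by prefill).
    ttft = token_times[0] - request_time
    # ITL: average gap between consecutive tokens (decode phase).
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else None
    return ttft, itl


if __name__ == "__main__":
    # Illustrative timestamps in seconds, not measurements from a real system.
    ttft, itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.25])
    print(f"TTFT: {ttft} s, mean ITL: {itl} s")
```

Keeping the two numbers separate matters because they come from different phases: a long prompt raises TTFT through prefill, while decode speed determines ITL.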

Agentic Context Engineering (ACE): Turning Context Into a Self-Improving Playbook for LLMs

Large language models are getting smarter—but the real superpower may be how we feed them context. Instead of constantly fine-tuning weights, a growing family of techniques improves models by upgrading the inputs they see: richer instructions, reusable strategies, domain heuristics, and concrete evidence. The paper “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models” proposes ACE, a practical framework that treats context like an evolving playbook—something you grow, refine, and curate over time to make agents and reasoning systems measurably better. ...