When I evaluate an LLM system, one of the first latency metrics I look at is TTFT, or time to first token.
This metric answers a simple question:
After a user sends a request, how long does it take before the first output token appears?
That sounds narrow, but it matters a lot. Users usually forgive a response that streams steadily after it starts. What feels bad is the dead time before anything appears on screen.
In this post, I explain what TTFT really measures, how it differs from throughput metrics like tokens per second, what usually makes TTFT worse, and what teams actually do to improve it.
If you want the broader background on prefill, decode, and KV cache behavior, I already covered that in "Understanding LLM Inference Basics: Prefill and Decode, TTFT, and ITL".
What Is TTFT?
TTFT (time to first token) is the elapsed time between:
- the moment a request reaches the inference system
- the moment the first generated token is returned to the user
In a streamed chat response, TTFT is the delay before the model starts “speaking”.
For autoregressive LLMs, TTFT is dominated by two things:
- processing the input prompt
- generating the first output token
That is why TTFT is closely tied to the prefill phase of inference.
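To make the definition concrete, here is a minimal sketch of measuring TTFT from the client side by timing how long a streaming iterator takes to yield its first item. The names `measure_ttft` and `fake_stream` and the injectable `clock` parameter are my own illustrations, not part of any particular SDK. A client-side number like this also includes network time, so it will be somewhat higher than what the inference server itself reports.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[str]]:
    """Consume a token stream and return (ttft_seconds, tokens).

    The timer starts when this function is called, so call it right as the
    request is sent. ttft_seconds is None if the stream yields nothing.
    """
    start = clock()
    ttft: Optional[float] = None
    tokens: List[str] = []
    for token in stream:
        if ttft is None:
            ttft = clock() - start  # first token observed
        tokens.append(token)
    return ttft, tokens


def fake_stream(startup_delay_s: float, tokens: List[str]) -> Iterator[str]:
    """Stand-in for a real streaming response (assumption for this demo)."""
    time.sleep(startup_delay_s)  # simulates queueing plus prefill
    for tok in tokens:
        yield tok


if __name__ == "__main__":
    ttft, out = measure_ttft(fake_stream(0.05, ["Hello", ",", " world"]))
    print(f"TTFT: {ttft * 1000:.1f} ms, tokens: {out}")
```

Passing the clock in as a parameter is only there so the function can be tested without real sleeps.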
TTFT vs TPS vs ITL
These terms get mixed together all the time, but they are not the same.
| Metric | What it tells me | Why it matters |
|---|---|---|
| TTFT | How long the user waits before seeing the first token | Perceived responsiveness at the start |
| TPS (tokens per second) | How many tokens per second are produced once generation begins | Streaming speed |
| ITL (inter-token latency) | The gap between consecutive generated tokens | Smoothness of output once streaming starts |
A useful way to think about it is:
- TTFT is startup latency
- TPS is output rate
- ITL is the reciprocal of the per-user output rate
If ITL is 20 ms, the stream is effectively producing about 50 tokens per second for that user.
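The conversion is just a reciprocal with a unit change. A small sketch, with function names of my own choosing:

```python
def itl_ms_to_tps(itl_ms: float) -> float:
    """Per-user tokens per second implied by a mean inter-token latency."""
    if itl_ms <= 0:
        raise ValueError("itl_ms must be positive")
    return 1000.0 / itl_ms


def tps_to_itl_ms(tps: float) -> float:
    """Mean inter-token latency in milliseconds implied by a per-user TPS."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1000.0 / tps


print(itl_ms_to_tps(20))  # 50.0
print(tps_to_itl_ms(50))  # 20.0
```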
One reason people get confused is that TPS is sometimes used as a latency metric per user and sometimes as a throughput metric for the whole service. I prefer being explicit:
- Per-user TPS for how fast one response streams
- Total throughput for how many tokens the service generates overall
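To show how far apart those two numbers can be, here is a small sketch that computes both from the same data. The `StreamRecord` type and its fields are hypothetical, and the figures are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamRecord:
    output_tokens: int     # tokens generated for this response
    decode_seconds: float  # time from first token to last token


def per_user_tps(rec: StreamRecord) -> float:
    """How fast one response streamed, from that user's point of view."""
    return rec.output_tokens / rec.decode_seconds


def total_throughput(records: List[StreamRecord], window_seconds: float) -> float:
    """Tokens generated per second by the whole service in a time window."""
    return sum(r.output_tokens for r in records) / window_seconds


# Four concurrent streams that all ran inside the same 10-second window.
records = [StreamRecord(output_tokens=200, decode_seconds=10.0) for _ in range(4)]
print(per_user_tps(records[0]))         # 20.0
print(total_throughput(records, 10.0))  # 80.0
```

The service is producing 80 tokens per second in total, but each user only sees 20. Quoting one number without saying which one it is causes most of the confusion.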
Why TTFT Matters So Much
TTFT has a disproportionate effect on how fast a system feels.
Consider these two cases:
- System A starts in 200 ms and then streams a little slower
- System B starts in 2 seconds and then streams very fast
Most users will say System A feels better, especially in chat, coding assistants, search copilots, and voice or agent interfaces.
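A quick back-of-envelope calculation makes the comparison concrete. This sketch assumes total response time is TTFT plus one inter-token gap for every token after the first. The two TTFT values come from the example above, but the ITL values are figures I picked for illustration, not measurements.

```python
def total_response_seconds(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Time until the last token arrives: TTFT plus one ITL per later token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * itl_s


# Assumed figures for illustration only; the ITL values are not from a benchmark.
SYSTEM_A = {"ttft_s": 0.2, "itl_s": 0.040}  # fast start, slower stream
SYSTEM_B = {"ttft_s": 2.0, "itl_s": 0.020}  # slow start, faster stream

for n in (50, 500):
    a = total_response_seconds(output_tokens=n, **SYSTEM_A)
    b = total_response_seconds(output_tokens=n, **SYSTEM_B)
    print(f"{n} tokens: A finishes in {a:.2f} s, B in {b:.2f} s")
```

With these assumed numbers, System A shows its first text 1.8 seconds earlier and also finishes the 50-token reply first (about 2.16 s versus 2.98 s). System B only wins on total time for long outputs, and even then the user stares at a blank screen for two seconds first.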
That is because the first visible token is a trust signal. It tells the user the system is alive, the request is accepted, and the response is on its way.
In practical terms, TTFT matters most for:
- chatbots
- coding assistants
- agent UIs with streaming text
- search copilots
- interactive internal tools
If the system is returning a tool result, JSON blob, or a fully buffered response, total response time may matter more than TTFT. But for user-facing streamed text, TTFT is usually one of the most important latency metrics.
Why TTFT Is Mostly a Prefill Problem
In most modern LLM systems, the first token cannot be produced until the model has processed the entire prompt.
That means the system must:
- tokenize the input
- run the prompt through all transformer layers
- build the KV cache for every prompt token
- sample the first output token from the logits at the last prompt position
This front-loaded work is why TTFT grows with prompt length.
A short prompt usually gives lower TTFT.
A long prompt, large system prompt, large retrieved context window, or oversized chat history usually pushes TTFT up.
This is also why prefill, and therefore TTFT, is often described as compute-bound. During prefill, the GPU runs large matrix multiplications across every prompt token at once. The work is highly parallel, but there is still a lot of it, and it grows with prompt length.
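For a rough sense of scale, a common rule of thumb is that a forward pass costs about 2 FLOPs per model parameter per token. The sketch below turns that into a prefill-time estimate. It ignores the attention term that grows quadratically with sequence length, so it underestimates very long prompts, and the model size, peak FLOP/s, and utilization values are hypothetical placeholders rather than numbers for any real accelerator.

```python
def estimate_prefill_seconds(
    n_params: float,
    prompt_tokens: int,
    peak_flops: float,
    utilization: float = 0.5,
) -> float:
    """Back-of-envelope prefill time from the ~2 * params * tokens FLOPs rule.

    Ignores attention's quadratic term, queueing, and memory effects.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    flops = 2.0 * n_params * prompt_tokens
    return flops / (peak_flops * utilization)


# Hypothetical setup: 7e9-parameter model, 1e15 peak FLOP/s, 50% utilization.
for tokens in (500, 4000, 32000):
    ms = estimate_prefill_seconds(7e9, tokens, 1e15) * 1000
    print(f"{tokens:>6} prompt tokens -> ~{ms:.0f} ms of prefill compute")
```

The absolute numbers depend entirely on the assumptions. The useful part is the shape: under this approximation, prefill compute scales linearly with prompt length.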
What Increases TTFT?
These are the most common causes I see:
1. Long prompts
This is the biggest one.
Every extra token in the input has to be processed before the first output token can be emitted. Long RAG context, repeated conversation history, and bloated system prompts all hurt TTFT.
2. Larger models
Bigger models have more layers and wider layers, so every prompt token costs more compute. On the same hardware, that usually means higher TTFT.
3. Queueing and contention
Even if raw model execution is fast, the request may sit in a scheduler queue behind other work. In production, TTFT often includes this waiting time too.
4. Cold starts
If the model weights are not already loaded into accelerator memory, or the runtime has to spin up workers first, TTFT can spike badly.
5. Inefficient prompt construction
Some systems pass far more context than they need. I have seen teams spend weeks optimizing model serving while leaving prompt bloat untouched.
6. Slow tokenization or preprocessing
It is usually not the main bottleneck, but in some stacks preprocessing, template rendering, retrieval joins, guardrails, or request routing add noticeable time before inference even starts.
What Does “Good” TTFT Look Like?
There is no single universal number because TTFT depends on:
- model size
- prompt length
- hardware
- batching policy
- whether the request is cold or warm
- how much orchestration happens before inference
Still, the intuition is straightforward:
- lower TTFT feels better
- stable TTFT is often more important than one perfect benchmark number
For interactive systems, the user notices startup delay immediately. A system with great average TTFT but terrible tail latency still feels unreliable.
That is why I would track at least:
- p50 TTFT
- p95 TTFT
- p99 TTFT
Average TTFT alone hides too much.
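Here is a small sketch of why the average misleads. It uses the nearest-rank percentile definition, which is one of several common conventions, and the TTFT samples are invented for the example.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds: mostly fast, with two slow outliers.
ttft_ms = [180] * 90 + [250] * 8 + [2400, 3100]

print("mean:", sum(ttft_ms) / len(ttft_ms))
for p in (50, 95, 99):
    print(f"p{p}:", percentile(ttft_ms, p))
```

On this made-up sample the mean is 237 ms and p50 is 180 ms, both of which look healthy, while p99 is 2400 ms. One user in a hundred waits more than two seconds, and only the tail percentiles show it.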
How To Improve TTFT
If I needed to reduce TTFT, I would look at these levers first.
1. Cut prompt length aggressively
This is usually the highest-leverage fix.
- trim old chat history
- summarize earlier turns
- shrink retrieved chunks
- reduce repeated instructions
- avoid stuffing context “just in case”
Many systems have a prompt design problem disguised as an inference problem.
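As one example of the first item, here is a minimal sketch of trimming chat history to a token budget while keeping the system message and the most recent turns. The message shape, the function names, and the whitespace-based token counter are all assumptions for illustration. A real system would plug in the model's own tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def count_tokens_approx(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> List[Message]:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    for msg in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if cost > budget:
            break  # stop at the first turn that does not fit
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Dropping old turns outright is the bluntest option. Summarizing them first keeps more meaning per token, at the cost of an extra model call somewhere off the critical path.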
2. Use prompt caching where it actually helps
If a large prefix is reused across requests, caching that prefix can reduce repeated prefill work.
This is especially useful when the system prompt, tool schema block, or shared context stays stable across many requests.
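The toy below illustrates the idea, not how any particular serving stack implements it. A real server stores KV-cache blocks for reused prefixes. This sketch only remembers which fixed-size prefix blocks it has seen, so it can count how many prompt tokens would still need prefill. The class name, method name, and block size are my own choices.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Tracks which prompt prefixes were already prefilled.

    Only prefixes are remembered (in fixed-size blocks), not real KV tensors,
    so this can count saved work but cannot actually serve a model.
    """

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._seen: Set[Tuple[int, ...]] = set()

    def tokens_to_prefill(self, token_ids: List[int]) -> int:
        """Return how many tokens still need prefill, then record this prompt."""
        cached = 0
        full_blocks = len(token_ids) // self.block_size
        for i in range(1, full_blocks + 1):
            prefix = tuple(token_ids[: i * self.block_size])
            if prefix not in self._seen:
                break
            cached = i * self.block_size
        # Record every full-block prefix of this prompt for later requests.
        for i in range(1, full_blocks + 1):
            self._seen.add(tuple(token_ids[: i * self.block_size]))
        return len(token_ids) - cached


cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(8))  # 8 shared token ids, e.g. a fixed system prompt
print(cache.tokens_to_prefill(system_prompt + [100, 101]))       # 10 (cold)
print(cache.tokens_to_prefill(system_prompt + [200, 201, 202]))  # 3 (prefix hit)
```

The second request reuses the 8-token prefix and only pays for its 3 new tokens. The cache only helps when the shared part sits at the very start of the prompt, which is why putting stable content first and per-request content last matters.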
3. Choose the right model size
If the use case does not need the largest model, a smaller model can reduce TTFT significantly and often improve the overall product experience.
4. Reduce cold starts
Keep workers warm when possible. If the runtime repeatedly unloads and reloads model state, TTFT becomes unpredictable.
5. Tune batching carefully
Batching can improve hardware efficiency, but aggressive batching can also increase waiting time before a request starts running. That tradeoff is good for throughput, but sometimes bad for perceived latency.
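To see the tradeoff in numbers, here is a sketch of a deliberately simple "fill or time out" batching policy. It is a simplified model of my own, not how any specific scheduler works, and it ignores execution time entirely so that only the queueing delay shows up.

```python
from typing import List


def batching_waits_ms(
    arrivals_ms: List[float], max_batch: int, max_wait_ms: float
) -> List[float]:
    """Queue wait per request under a simple 'fill or time out' policy.

    A batch launches when it holds max_batch requests, or when max_wait_ms
    has passed since its first request arrived. Execution time is ignored.
    Waits are returned in arrival order.
    """
    waits: List[float] = []
    batch: List[float] = []
    for t in sorted(arrivals_ms):
        if batch and t > batch[0] + max_wait_ms:
            launch = batch[0] + max_wait_ms  # batch timed out before t arrived
            waits.extend(launch - a for a in batch)
            batch = []
        batch.append(t)
        if len(batch) == max_batch:
            waits.extend(t - a for a in batch)  # batch is full, launch now
            batch = []
    if batch:
        launch = batch[0] + max_wait_ms  # last partial batch times out
        waits.extend(launch - a for a in batch)
    return waits


arrivals = [0, 10, 20, 300]  # hypothetical arrival times in milliseconds
for max_wait in (50, 200):
    waits = batching_waits_ms(arrivals, max_batch=4, max_wait_ms=max_wait)
    print(f"max_wait={max_wait} ms -> waits {waits}")
```

With a 50 ms window the four requests wait 50, 40, 30, and 50 ms. With a 200 ms window they wait 200, 190, 180, and 200 ms, and every one of those milliseconds lands directly in TTFT.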
6. Simplify pre-inference orchestration
If your stack does retrieval, reranking, safety checks, routing, prompt templating, and tracing before the model sees the request, TTFT can suffer even when the model runtime is fine.
Measure the whole path, not just the GPU kernel time.
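One lightweight way to do that is to wrap each stage in a timer and look at the breakdown. The `StageTimer` class below is a hypothetical helper, and the `time.sleep` calls are placeholders standing in for real retrieval, templating, and model calls.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.seconds.values())


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("retrieval"):
        time.sleep(0.02)   # placeholder for a vector search
    with timer.stage("templating"):
        time.sleep(0.001)  # placeholder for prompt rendering
    with timer.stage("model_first_token"):
        time.sleep(0.05)   # placeholder for queueing plus prefill
    for name, secs in timer.seconds.items():
        print(f"{name:>18}: {secs * 1000:6.1f} ms")
    print(f"{'end-to-end TTFT':>18}: {timer.total() * 1000:6.1f} ms")
```

In a real service I would send these durations to whatever tracing or metrics system is already in place. What matters is that the stages before the model get measured at all.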
A Simple Mental Model
I think of LLM latency like this:
- TTFT tells me how long it takes to get started
- ITL tells me how smoothly the response continues
- Total response time tells me when the full job is done
Different products care about these differently.
For example:
- a chatbot cares a lot about TTFT and ITL
- an offline batch summarization job cares more about total throughput
- an agent waiting for a tool call may care more about full completion time than token-by-token streaming
That is why one latency metric never tells the whole story.
TTFT and User Experience
The important part is not just model performance. It is perceived performance.
A user cannot see FLOPs, KV cache efficiency, or scheduler behavior. They only see:
- Did the answer start quickly?
- Did it keep moving?
- Did it finish in a reasonable time?
TTFT directly affects the first of those.
That is why it is such a useful metric for real systems. It maps technical behavior to an obvious human experience.
Summary
TTFT in LLMs measures how long it takes for the first generated token to appear after a request is sent.
It is mostly driven by prompt processing and the first decode step, which makes it strongly influenced by prompt length, model size, queueing, and cold-start behavior.
If the goal is to reduce user-visible latency in a streaming LLM product, TTFT is one of the first metrics worth checking. A fast-starting system usually feels much better than one that stays silent for too long, even if both eventually generate at similar speeds.