TTFT in LLMs Explained: What Time to First Token Really Measures

When I evaluate an LLM system, one of the first latency metrics I look at is TTFT, or time to first token.

This metric answers a simple question:

After a user sends a request, how long does it take before the first output token appears?

That sounds narrow, but it matters a lot. Users usually forgive a response that streams steadily after it starts. What feels bad is the dead time before anything appears on screen.

In this post, I explain what TTFT really measures, how it differs from throughput metrics like tokens per second, what usually makes TTFT worse, and what teams actually do to improve it.

If you want the broader background on prefill, decode, and KV cache behavior, I already covered that in “Understanding LLM Inference Basics: Prefill and Decode, TTFT, and ITL”.

What Is TTFT?

TTFT (time to first token) is the elapsed time between:

the moment a request reaches the inference system
the moment the first generated token is returned to the user

In a streamed chat response, TTFT is the delay before the model starts “speaking”.
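To make the definition concrete, here is a minimal sketch of how I might time TTFT from the client side. The `start_stream` callable and the `fake_stream` generator are hypothetical stand-ins for a real streaming client, not any particular library's API. Because the clock starts when the client sends the request, this version also includes network time, so it will read slightly higher than a server-side measurement.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[str]]:
    """Return (ttft_seconds, tokens) for one streamed response.

    start_stream is a hypothetical callable that sends the request and
    returns an iterable yielding tokens as they arrive.
    """
    sent_at = clock()  # the request is sent here
    ttft: Optional[float] = None
    tokens: List[str] = []
    for token in start_stream():
        if ttft is None:
            ttft = clock() - sent_at  # first token observed
        tokens.append(token)
    return ttft, tokens


def fake_stream() -> Iterator[str]:
    # Stand-in for a real client: about 50 ms of "prefill", then tokens.
    time.sleep(0.05)
    for tok in ["Hello", ",", " world"]:
        yield tok
        time.sleep(0.01)


ttft, tokens = measure_ttft(fake_stream)
print(f"TTFT: {ttft * 1000:.0f} ms over {len(tokens)} tokens")
```

If the stream yields nothing, `ttft` stays `None`, which is worth counting as a failure rather than averaging in as zero.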

For autoregressive LLMs, TTFT is dominated by two things:

processing the input prompt
generating the first output token

That is why TTFT is closely tied to the prefill phase of inference.

TTFT vs TPS vs ITL

These terms get mixed together all the time, but they are not the same.

Metric	What it tells me	Why it matters
TTFT	How long the user waits before seeing the first token	Perceived responsiveness at the start
TPS	How many tokens per second are produced after generation begins	Streaming speed
ITL	The gap between consecutive generated tokens	Smoothness of output once streaming starts

A useful way to think about it is:

TTFT is startup latency
TPS is output rate
ITL is the inverse of the per-user output rate

If ITL is 20 ms, the stream is effectively producing about 50 tokens per second for that user.

One reason people get confused is that TPS sometimes means the streaming speed one user sees and sometimes means the throughput of the whole service. I prefer being explicit:

Per-user TPS for how fast one response streams
Total throughput for how many tokens the service generates overall
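The arithmetic behind both readings of TPS fits in a few lines. This is a sketch that assumes every stream runs at a steady ITL, which real traffic rarely does. The function names are mine.

```python
from typing import List


def per_user_tps(itl_ms: float) -> float:
    """Tokens per second one user sees, given a steady inter-token latency."""
    if itl_ms <= 0:
        raise ValueError("itl_ms must be positive")
    return 1000.0 / itl_ms


def total_throughput(itl_ms_per_stream: List[float]) -> float:
    """Aggregate tokens per second across concurrent streams."""
    return sum(per_user_tps(itl) for itl in itl_ms_per_stream)


print(per_user_tps(20))              # 50.0
print(total_throughput([40.0] * 8))  # 200.0
```

Eight concurrent streams at 40 ms ITL each look slow per user (25 tokens per second) but add up to 200 tokens per second for the service, which is why the two numbers should not share a name.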

Why TTFT Matters So Much

TTFT has a disproportionate effect on how fast a system feels.

Consider these two cases:

System A starts in 200 ms and then streams a little slower
System B starts in 2 seconds and then streams very fast

Most users will say System A feels better, especially in chat, coding assistants, search copilots, and voice or agent interfaces.

That is because the first visible token is a trust signal. It tells the user the system is alive, the request is accepted, and the response is on its way.

In practical terms, TTFT matters most for:

chatbots
coding assistants
agent UIs with streaming text
search copilots
interactive internal tools

If the system is returning a tool result, a JSON blob, or a fully buffered response, total response time may matter more than TTFT. But for user-facing streamed text, TTFT is usually one of the most important latency metrics.

Why TTFT Is Mostly a Prefill Problem

In most modern LLM systems, the first token cannot be produced until the model has processed the entire prompt.

That means the system must:

tokenize the input
run the prompt through all transformer layers
build the KV cache
perform the first decode step

This front-loaded work is why TTFT grows with prompt length.

A short prompt usually gives lower TTFT.

A long prompt, large system prompt, large retrieved context window, or oversized chat history usually pushes TTFT up.

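This is also why prefill, and therefore TTFT, is often described as compute-bound. During prefill, the GPU runs large matrix multiplications over every prompt token at once. The work is highly parallel, but there is still a lot of it. Decode, by contrast, produces one new token per step and tends to be limited by memory bandwidth instead.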
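A back-of-envelope estimate shows why prompt length matters so much. The sketch below uses the common rule of thumb that a dense transformer forward pass costs roughly 2 FLOPs per parameter per token. It ignores the attention term that grows with sequence length, as well as queueing, tokenization, and network time. The hardware peak and the 50% utilization figure are assumptions for illustration, not measurements.

```python
def estimate_prefill_seconds(
    n_params: float,
    prompt_tokens: int,
    peak_flops: float,
    utilization: float = 0.5,
) -> float:
    """Rough lower bound on prefill time for a dense model.

    Assumes ~2 FLOPs per parameter per prompt token and that the hardware
    sustains `utilization` of its peak throughput (both are approximations).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    flops_needed = 2.0 * n_params * prompt_tokens
    return flops_needed / (peak_flops * utilization)


# Hypothetical setup: 7B-parameter model, accelerator with 300 TFLOP/s peak.
short = estimate_prefill_seconds(7e9, 2_000, 300e12)
long = estimate_prefill_seconds(7e9, 16_000, 300e12)
print(f"2k-token prompt:  ~{short:.2f} s")
print(f"16k-token prompt: ~{long:.2f} s")
```

Under these assumptions the estimate scales linearly with prompt length, so an eight times longer prompt costs about eight times the prefill time before the first token can appear.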

What Increases TTFT?

These are the most common causes I see:

1. Long prompts

This is the biggest one.

Every extra token in the input has to be processed before the first output token can be emitted. Long RAG context, repeated conversation history, and bloated system prompts all hurt TTFT.

2. Larger models

Bigger models have more layers, wider layers, or both, so they do more work per prompt token and usually have higher TTFT on the same hardware.

3. Queueing and contention

Even if raw model execution is fast, the request may sit in a scheduler queue behind other work. In production, TTFT often includes this waiting time too.
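A toy model makes the queueing effect visible. The sketch below assumes a single worker that handles one request at a time in arrival order, with no batching. Real schedulers are more sophisticated, so treat the numbers as illustrative only.

```python
from typing import List


def ttft_with_queueing(arrivals_s: List[float], prefill_s: List[float]) -> List[float]:
    """TTFT per request (in arrival order) for a FIFO, one-at-a-time worker.

    TTFT here is queue wait plus prefill time; everything else is ignored.
    """
    worker_free_at = 0.0
    ttfts = []
    for arrival, prefill in sorted(zip(arrivals_s, prefill_s)):
        start = max(arrival, worker_free_at)  # wait if the worker is busy
        worker_free_at = start + prefill
        ttfts.append(worker_free_at - arrival)
    return ttfts


# Three requests arrive 100 ms apart; each needs 500 ms of prefill.
for ttft in ttft_with_queueing([0.0, 0.1, 0.2], [0.5, 0.5, 0.5]):
    print(f"{ttft:.1f} s")
```

Every request needs the same 0.5 s of model work, yet the third one waits about 1.3 s for its first token because it sat behind the other two.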

4. Cold starts

If the model weights are not already loaded into accelerator memory, or the runtime has to spin up workers first, TTFT can spike badly.

5. Inefficient prompt construction

Some systems pass far more context than they need. I have seen teams spend weeks optimizing model serving while leaving prompt bloat untouched.

6. Slow tokenization or preprocessing

Tokenization is usually not the main bottleneck, but in some stacks preprocessing, template rendering, retrieval joins, guardrails, or request routing add noticeable time before inference even starts.

What Does “Good” TTFT Look Like?

There is no single universal number because TTFT depends on:

model size
prompt length
hardware
batching policy
whether the request is cold or warm
how much orchestration happens before inference

Still, the intuition is straightforward:

lower TTFT feels better
stable TTFT is often more important than one perfect benchmark number

For interactive systems, the user notices startup delay immediately. A system with great average TTFT but terrible tail latency still feels unreliable.

That is why I would track at least:

p50 TTFT
p95 TTFT
p99 TTFT

Average TTFT alone hides too much.
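Here is a small sketch of why the average misleads. It uses the nearest-rank percentile definition on made-up TTFT samples. The sample values are invented for illustration, and other percentile definitions interpolate and would give slightly different numbers.

```python
import math
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented TTFT samples in milliseconds: mostly fast, with a slow tail.
ttfts_ms = [200] * 90 + [400] * 8 + [3000] * 2

print("mean:", sum(ttfts_ms) / len(ttfts_ms))  # 272.0
print("p50: ", percentile(ttfts_ms, 50))       # 200
print("p95: ", percentile(ttfts_ms, 95))       # 400
print("p99: ", percentile(ttfts_ms, 99))       # 3000
```

A mean of 272 ms looks healthy, but one request in fifty takes three seconds to start, and only the p99 shows it.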

How To Improve TTFT

If I needed to reduce TTFT, I would look at these levers first.

1. Cut prompt length aggressively

This is usually the highest-leverage fix.

trim old chat history
summarize earlier turns
shrink retrieved chunks
reduce repeated instructions
avoid stuffing context “just in case”

Many systems have a prompt design problem disguised as an inference problem.
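As one example of trimming, the sketch below keeps only the most recent chat turns that fit a token budget. The `count_tokens` argument is a hypothetical hook for whatever tokenizer the model uses. The whitespace split in the usage example is a crude stand-in, not a real token count.

```python
from typing import Callable, List


def trim_history(
    messages: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Keep the newest messages whose combined token count fits max_tokens."""
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                   # older turns no longer fit
        kept.append(msg)
        used += cost
    kept.reverse()                  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i", "j"]
rough_count = lambda text: len(text.split())  # crude stand-in for a tokenizer
print(trim_history(history, 6, rough_count))  # ['f g h i', 'j']
```

Dropping turns outright is the bluntest option. Summarizing the dropped turns into a short message is a gentler variant of the same budget idea.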

2. Use prompt caching where it actually helps

If a large prefix is reused across requests, caching that prefix can reduce repeated prefill work.

This is especially useful when the system prompt, tool schema block, or shared context stays stable across many requests.
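The saving from prefix caching can be sketched as a count of tokens that still need prefill. This toy version compares token ID sequences directly and assumes the cached prefix's KV state is fully reusable. Real serving stacks typically cache at block granularity and handle eviction, so the actual saving will differ.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token IDs that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_prefill(prompt: Sequence[int], cached_prefix: Sequence[int]) -> int:
    """Prompt tokens that still need prefill if the shared prefix is reused."""
    return len(prompt) - shared_prefix_len(prompt, cached_prefix)


system_prompt = list(range(1000))           # hypothetical 1000-token shared prefix
request = system_prompt + [5000, 5001, 5002]
print(tokens_to_prefill(request, system_prompt))  # 3
print(tokens_to_prefill(request, []))             # 1003
```

The cache only helps when the reused content sits at the very start of the prompt. Putting a timestamp or user ID before the system prompt would break the shared prefix at position zero.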

3. Choose the right model size

If the use case does not need the largest model, a smaller model can reduce TTFT significantly and often improve the total product experience.

4. Reduce cold starts

Keep workers warm when possible. If the runtime repeatedly unloads and reloads model state, TTFT becomes unpredictable.

5. Tune batching carefully

Batching can improve hardware efficiency, but aggressive batching can also increase waiting time before a request starts running. That tradeoff is good for throughput, but sometimes bad for perceived latency.
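To illustrate the waiting cost, the sketch below assumes a deliberately simple scheduler that launches a batch at every multiple of a fixed window. This is a hypothetical model and is not how continuous-batching runtimes work, but it shows the direction of the tradeoff. Times are integer milliseconds to keep the arithmetic exact.

```python
from typing import List


def batch_wait_ms(arrival_ms: int, window_ms: int) -> int:
    """Wait until the next batch launch, with launches at multiples of window_ms."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return (-arrival_ms) % window_ms


def mean_batch_wait_ms(arrivals_ms: List[int], window_ms: int) -> float:
    """Average wait added to TTFT by the batching window."""
    return sum(batch_wait_ms(a, window_ms) for a in arrivals_ms) / len(arrivals_ms)


arrivals = [0, 10, 30, 55, 90]            # invented arrival times in ms
print(mean_batch_wait_ms(arrivals, 20))   # 7.0
print(mean_batch_wait_ms(arrivals, 100))  # 43.0
```

A wider window collects more requests per batch, but in this model it also adds tens of milliseconds to every request's TTFT before any model work begins.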

6. Simplify pre-inference orchestration

If your stack does retrieval, reranking, safety checks, routing, prompt templating, and tracing before the model sees the request, TTFT can suffer even when the model runtime is fine.

Measure the whole path, not just the GPU kernel time.
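One lightweight way to see the whole path is to time each stage separately and compare the pre-inference total with the model's own time to first token. The `StageTimer` class and the stage names below are hypothetical, and the `time.sleep` calls stand in for real retrieval, templating, and model calls.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.02)   # stand-in for a retrieval call
with timer.stage("prompt_templating"):
    time.sleep(0.005)  # stand-in for template rendering
with timer.stage("model_first_token"):
    time.sleep(0.05)   # stand-in for waiting on the first streamed token

end_to_end = sum(timer.durations.values())
for name, seconds in timer.durations.items():
    print(f"{name}: {seconds * 1000:.0f} ms ({seconds / end_to_end:.0%} of TTFT)")
```

If the stages before `model_first_token` account for a large share of the total, serving-side tuning will not move user-visible TTFT much.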

A Simple Mental Model

I think of LLM latency like this:

TTFT tells me how long it takes to get started
ITL tells me how smoothly the response continues
Total response time tells me when the full job is done
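These three numbers are linked by a simple approximation: total response time is roughly TTFT plus one ITL for each token after the first. The sketch below assumes a constant ITL. The ITL values and output length are invented to illustrate the earlier System A and System B comparison.

```python
def total_response_ms(ttft_ms: int, itl_ms: int, output_tokens: int) -> int:
    """Approximate total time: TTFT, then one ITL per token after the first."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + itl_ms * (output_tokens - 1)


# Hypothetical numbers for a 200-token response.
system_a = total_response_ms(ttft_ms=200, itl_ms=30, output_tokens=200)
system_b = total_response_ms(ttft_ms=2000, itl_ms=20, output_tokens=200)
print(system_a)  # 6170
print(system_b)  # 5980
```

With these numbers System B finishes slightly sooner, yet System A shows its first token 1.8 seconds earlier, which is the part a chat user notices.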

Different products care about these differently.

For example:

a chatbot cares a lot about TTFT and ITL
an offline batch summarization job cares more about total throughput
an agent waiting for a tool call may care more about full completion time than token-by-token streaming

That is why one latency metric never tells the whole story.

TTFT and User Experience

The important part is not just model performance. It is perceived performance.

A user cannot see FLOPs, KV cache efficiency, or scheduler behavior. They only see:

Did the answer start quickly?
Did it keep moving?
Did it finish in a reasonable time?

TTFT directly affects the first of those.

That is why it is such a useful metric for real systems. It maps technical behavior to an obvious human experience.

Summary

TTFT in LLMs measures how long it takes for the first generated token to appear after a request is sent.

It is mostly driven by prompt processing and the first decode step, which makes it strongly influenced by prompt length, model size, queueing, and cold-start behavior.

If the goal is to reduce user-visible latency in a streaming LLM product, TTFT is one of the first metrics worth checking. A fast-starting system usually feels much better than one that stays silent for too long, even if both eventually generate at similar speeds.
