Attention is the idea that made modern transformers practical and powerful. Instead of compressing an entire input into one fixed vector, a model can decide, token by token, which earlier pieces of information matter most right now.
That sounds simple, but there are many different kinds of attention mechanisms, and they exist because models face different constraints:
- some need strong alignment between an encoder and a decoder
- some need to generate text one token at a time without looking ahead
- some need to handle very long documents
- some need to reduce GPU memory traffic at inference time
This article walks through the main families of attention, shows where they fit, and explains why newer variants such as DeepSeek’s multi-head latent attention (MLA) matter.
A practical timeline: attention started as an alignment mechanism for sequence-to-sequence models, then became the core compute pattern inside transformers, and is now being redesigned for long-context and low-latency inference.
Why Attention Matters
Before transformers, sequence models such as RNNs and LSTMs processed text step by step. They were useful, but they struggled to keep distant information alive over long spans, because everything seen so far had to be squeezed into one fixed-size hidden state. If a model had to connect a pronoun to a noun many words earlier, or tie the end of a long paragraph back to the beginning, that bottleneck became a problem.
Attention changed this by turning memory lookup into a learned operation. Instead of asking the network to remember everything in one hidden state, attention lets the current token selectively read from many token positions.
In plain language:
- the current token asks a question
- earlier tokens advertise what information they contain
- the model retrieves a weighted mixture of the most relevant information
That is the heart of attention.
The Core Idea: Query, Key, and Value
The modern transformer version of attention is usually written as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
You do not need the notation to get the idea:
- Query (Q) represents what the current token is looking for
- Key (K) represents what each token can offer
- Value (V) is the actual content that will be mixed into the output
The query is compared with all keys. High similarity means high relevance. After normalization with softmax, the model uses those weights to combine the value vectors.
Scaled dot-product attention in one picture: token embeddings are projected into queries, keys, and values; scores are computed from query-key similarity; the resulting weights mix the value vectors into contextual outputs.
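As a minimal sketch of the formula above, the following NumPy code computes scaled dot-product attention for a toy sequence. The sizes, the random projection matrices `W_q`, `W_k`, `W_v`, and the helper names are illustrative assumptions, not taken from any particular model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted mixture of the values


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, toy embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.shape)
```

Each row of `weights` says how strongly one token reads from every token in the sequence, and `out` holds the resulting contextual vectors.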
A Short History of Attention
1. Additive Attention (Bahdanau Attention)
One of the most influential early uses of attention appeared in sequence-to-sequence machine translation. In the 2014 Bahdanau paper, the decoder did not rely on a single final encoder state. Instead, for each output word, it learned a soft alignment over encoder states.
Why it mattered:
- it improved translation quality
- it made long sequences easier to handle
- it made alignment between source and target words more explicit
This mechanism is often called additive attention because the score is produced by a learned feed-forward function over encoder and decoder states rather than a raw dot product.
2. Multiplicative / Dot-Product Attention (Luong Attention)
Luong-style attention simplified scoring by using dot products or closely related forms. It was faster and easier to scale than additive attention, especially as vectorized linear algebra became the dominant implementation style.
This family sits conceptually between early seq2seq attention and the fully transformer-native attention used later.
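The two early scoring rules can be contrasted in a short sketch. It assumes a single decoder state `s` and a matrix of encoder states `H`; the weight names `W_s`, `W_h`, `v` and all sizes are hypothetical, and only the plain dot-product form of Luong-style scoring is shown.

```python
import numpy as np


def additive_score(s, H, W_s, W_h, v):
    """Additive (Bahdanau-style) score: v . tanh(s W_s + h_i W_h) for each h_i."""
    return np.tanh(s @ W_s + H @ W_h) @ v


def dot_score(s, H):
    """Dot-product (Luong-style) score: s . h_i for each encoder state h_i."""
    return H @ s


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


rng = np.random.default_rng(0)
d, d_att, n_src = 6, 5, 4
s = rng.normal(size=d)               # one decoder state
H = rng.normal(size=(n_src, d))      # encoder states, one row per source token
W_s = rng.normal(size=(d, d_att))
W_h = rng.normal(size=(d, d_att))
v = rng.normal(size=d_att)

align_add = softmax(additive_score(s, H, W_s, W_h, v))   # soft alignment
align_dot = softmax(dot_score(s, H))
context = align_add @ H              # weighted summary of the source
print(align_add.round(3), align_dot.round(3))
```

The additive version needs extra learned parameters and a nonlinearity per pair, while the dot-product version is a single matrix-vector product.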
Transformer Attention: The Variants You Must Know
The transformer did not invent attention, but it made attention the central compute primitive in the architecture.
Self-Attention
In self-attention, the queries, keys, and values all come from the same sequence.
If the input is:
The flag is blue
then each token can look at the other tokens in that same sentence. The token flag can attend to The, is, and blue, and the final representation for flag becomes contextual rather than isolated.
Why self-attention is powerful:
- every token can directly access every other token
- long-range interactions no longer require many recurrent steps
- the whole sequence can be processed in parallel during training
Cross-Attention
In cross-attention, queries come from one sequence while keys and values come from another.
This is the classic pattern in encoder-decoder models:
- the encoder builds representations of the source sentence
- the decoder uses cross-attention to read from the encoded source while generating the target sentence
Cross-attention also appears outside translation:
- retrieval-augmented generation
- image-text models
- audio-text models
- multi-document or multi-modal fusion
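A rough sketch of cross-attention, assuming a toy source sequence of 7 tokens and a target sequence of 3 tokens. The only difference from self-attention is where the queries and the keys and values come from; the names and sizes here are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attention(X_tgt, X_src, W_q, W_k, W_v):
    """Queries come from the target; keys and values come from the source."""
    Q = X_tgt @ W_q
    K = X_src @ W_k
    V = X_src @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights


rng = np.random.default_rng(0)
X_src = rng.normal(size=(7, 8))      # e.g. encoder outputs for 7 source tokens
X_tgt = rng.normal(size=(3, 8))      # e.g. decoder states for 3 target tokens
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = cross_attention(X_tgt, X_src, W_q, W_k, W_v)
print(out.shape, weights.shape)      # one output row per target token
```

The weight matrix is rectangular: one row per target position, one column per source position.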
Causal or Masked Self-Attention
For autoregressive language models, a token must not see future tokens during training. Otherwise the model could simply read the token it is supposed to predict.
That is why GPT-style decoders use causal attention:
- token 1 can attend only to token 1
- token 2 can attend to tokens 1 and 2
- token 3 can attend to tokens 1, 2, and 3
Mathematically, this is handled by masking the entries above the diagonal of the attention score matrix (setting them to negative infinity) before the softmax, so future positions receive exactly zero weight.
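The masking step can be sketched in a few lines of NumPy. The example uses all-zero scores for three tokens so the effect of the mask is easy to read; the function name is an assumption made for illustration.

```python
import numpy as np


def causal_attention_weights(scores):
    """Apply a causal mask to an (n, n) score matrix, then softmax each row."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)           # hide future positions
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                   # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)


scores = np.zeros((3, 3))            # equal raw scores for every pair
weights = causal_attention_weights(scores)
print(weights)
# row 0 -> [1, 0, 0]; row 1 -> [0.5, 0.5, 0]; row 2 -> [1/3, 1/3, 1/3]
```

Token 1 can only read itself, while token 3 spreads its weight over all three visible positions.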
Self-attention reads within one sequence. Cross-attention reads from a different sequence. Causal attention applies a future-token mask so next-token prediction remains valid.
Multi-Head Attention
One attention pattern is useful, but a single pattern is limiting. That is why transformers use multi-head attention.
Instead of learning one query, key, and value projection, the model learns many heads in parallel. Each head gets its own projection space and can specialize:
- one head might focus on nearby syntax
- another might focus on coreference
- another might track delimiters, list structure, or repeated phrases
The heads are then concatenated and projected back into the model dimension.
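The split, attend, concatenate, and project steps can be sketched as follows. This is a simplified, unmasked version with hypothetical weight names (`W_q`, `W_k`, `W_v`, `W_o`) and toy sizes; it assumes the model dimension divides evenly by the number of heads.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). Returns (n, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    heads = softmax(scores) @ V                              # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate heads
    return concat @ W_o                                      # project back


rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)
```

Each head works in its own 4-dimensional slice here, so the four heads can form four different attention patterns over the same five tokens.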
Why multi-head attention works so well:
- it increases representational diversity
- it lets different relation types coexist
- it is still easy to implement efficiently on accelerators
This is the dominant form in vanilla transformers, but it comes with a cost: each head usually stores its own key and value cache during decoding.
Soft Attention vs Hard Attention
Most large language models use soft attention. Every token receives a weighted mixture from many other tokens, and the weighting is differentiable, so the whole mechanism can be trained end to end with ordinary backpropagation.
Hard attention uses discrete selection, such as choosing one location or a very small subset of locations. It can be more selective, but it is much harder to train because the discrete choice breaks ordinary backpropagation.
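The difference can be illustrated with a toy example. The scores and values are made up, and the hard variant shown here is a deterministic argmax pick, which is only one possible form of hard attention (sampling a location is another).

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])                  # relevance of 3 positions
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Soft attention: a differentiable weighted mixture of all values
soft_w = np.exp(scores - scores.max())
soft_w = soft_w / soft_w.sum()
soft_out = soft_w @ values

# Hard attention (argmax variant): pick exactly one position
hard_idx = int(np.argmax(scores))
hard_out = values[hard_idx]

print(soft_w.round(3), soft_out.round(3), hard_idx, hard_out)
```

A small change in `scores` shifts `soft_out` smoothly, whereas `hard_out` either stays the same or jumps to a different row, which is why gradients do not flow through the selection.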
In practice:
- soft attention dominates mainstream LLMs
- hard attention ideas still appear in routing, retrieval, and some specialized sparse systems
Attention for Long Contexts
Standard full attention is expensive because the score matrix grows with the square of sequence length. If a sequence has N tokens, each head scores N x N token pairs, so doubling the context roughly quadruples the work.
That is acceptable for short contexts. It becomes painful for long documents, code repositories, or long conversations.
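A back-of-the-envelope calculation makes the growth concrete. It assumes 2 bytes per score (for example 16-bit floats) and counts a single dense score matrix for one head in one layer, as if it were fully materialized; real kernels may avoid storing it in full.

```python
def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes for one dense N x N score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_score


for n in (1_024, 8_192, 65_536):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB")
# 1,024 tokens -> 2 MiB; 8,192 tokens -> 128 MiB; 65,536 tokens -> 8,192 MiB
```

An 8x longer context costs 64x more, which is the pressure the variants in this section respond to.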
This pressure led to several families of efficient attention.
Local or Sliding-Window Attention
Local attention restricts each token to a fixed neighborhood, such as a window of nearby tokens.
Why it helps:
- compute and memory drop sharply
- nearby dependencies in language are often very important
- it is simple and hardware-friendly
Where it struggles:
- purely local windows can miss long-range dependencies
- information must hop across many layers to travel far
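A sliding window is just a different mask. The sketch below assumes a causal window, where each token sees itself and a fixed number of previous tokens; bidirectional windows are also used in practice. The function name and sizes are illustrative.

```python
import numpy as np


def sliding_window_mask(n, window):
    """True where query i may attend to key j: j <= i and i - j < window."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)


mask = sliding_window_mask(n=6, window=3)
print(mask.astype(int))
print(int(mask.sum()), "scored pairs instead of", 6 * 6)   # 15 instead of 36
```

With a fixed window the number of scored pairs grows linearly with sequence length, but the last token here can no longer read the first three tokens directly.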
Sparse Attention
Sparse attention does not let every token attend everywhere. Instead, it computes a carefully chosen subset of token pairs.
Typical sparse patterns include:
- local bands
- strided links
- dilated patterns
- designated global tokens
The goal is to keep important long-range routes while avoiding a dense N x N matrix.
Global + Local Attention
Architectures such as Longformer combine local windows with a few globally visible tokens. Those global tokens act like hubs, allowing information to travel long distances without fully dense attention.
This is often a strong compromise for document processing:
- local attention captures nearby structure
- global tokens provide long-range communication
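The combination can be sketched as a mask with a local band plus a few fully connected rows and columns. This is a simplified illustration in the spirit of Longformer, not its actual implementation; the function name, the window size, and the choice of token 0 as the global token are assumptions.

```python
import numpy as np


def local_global_mask(n, window, global_idx):
    """True where token i may attend to token j (bidirectional, encoder-style)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window      # local band around the diagonal
    mask[global_idx, :] = True          # global tokens read every position
    mask[:, global_idx] = True          # every position reads the global tokens
    return mask


mask = local_global_mask(n=8, window=1, global_idx=[0])
print(mask.astype(int))
print(int(mask.sum()), "scored pairs instead of", 8 * 8)   # 34 instead of 64
```

Token 5 cannot read token 2 directly in this mask, but both can read and be read by token 0, so information can travel between them in two layers.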
Linear Attention
Linear attention changes the computation rather than only changing the pattern. Instead of explicitly forming the full pairwise attention matrix, it rewrites or approximates the operation so the cost grows roughly linearly with sequence length.
That can be attractive for very long inputs, but there is a trade-off:
- it is faster or more memory-efficient in the right setting
- it does not always match the quality of exact full attention
- implementation details matter a lot
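One way to see the algebraic change is a kernel-style sketch. It uses the positive feature map elu(x) + 1 from the linear-attention paper by Katharopoulos et al., in its non-causal form; the function names and sizes are illustrative. Note that this replaces the softmax kernel with a different one, so it does not reproduce softmax attention.

```python
import numpy as np


def phi(x):
    """Positive feature map elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(Q, K, V):
    """Non-causal linear attention: never forms the (n, n) weight matrix."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                  # (d_k, d_v): size does not depend on n
    z = Kf.sum(axis=0)             # (d_k,): normalizer statistics
    return (Qf @ kv) / (Qf @ z)[:, None]


def quadratic_reference(Q, K, V):
    """Same kernel computed the slow way, through an explicit (n, n) matrix."""
    A = phi(Q) @ phi(K).T
    return (A / A.sum(axis=-1, keepdims=True)) @ V


rng = np.random.default_rng(0)
Q, K = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(np.allclose(fast, slow))
```

The two functions agree because matrix multiplication is associative: computing `phi(K).T @ V` first gives a small summary that every query can reuse.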
Long-context variants either limit which token pairs are scored, as in local and sparse attention, or they change the algebra itself, as in linear attention.
Cross-Attention vs Self-Attention in Real Systems
A useful way to think about deployment is this:
- encoder-only models use self-attention heavily for representation learning
- decoder-only LLMs use causal self-attention for next-token prediction
- encoder-decoder models use self-attention inside each stack and cross-attention between them
- multimodal models often use cross-attention to connect text with image, audio, or video features
That means “attention mechanism” can refer either to the scoring rule itself or to the connectivity pattern between information sources.
The Inference Problem: KV Cache
A particularly important decoder-side issue is the key-value cache, usually shortened to KV cache.
When a decoder-only language model generates text one token at a time, it does not want to recompute all keys and values for the full prefix on every step. So it stores previously computed keys and values in memory and reuses them.
That is a huge win for compute, but it creates a memory bottleneck:
- the cache grows with sequence length
- it must exist for every layer
- in vanilla multi-head attention, each head contributes its own keys and values
As models and contexts get larger, moving KV cache data can become a major cost during inference.
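The caching idea can be sketched for a single head in a single layer. The weights and token vectors are random placeholders, and the loop simply checks that reading from a growing cache gives the same result as recomputing keys and values for the whole prefix at every step.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(q, K, V):
    """One query vector reading from all cached keys and values."""
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V


rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))         # stand-ins for 5 generated tokens

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    # Only the newest token's key and value are computed; older ones are reused
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    cached_outputs.append(attend(x @ W_q, K_cache, V_cache))

# Reference: recompute keys and values for the full prefix at every step
recomputed = [
    attend(tokens[t] @ W_q, tokens[: t + 1] @ W_k, tokens[: t + 1] @ W_v)
    for t in range(len(tokens))
]
print(np.allclose(cached_outputs, recomputed), K_cache.shape)
```

The cache saves the repeated projections, but `K_cache` and `V_cache` gain one row per generated token, and a real model holds such a pair for every layer and every key/value head.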
This is the background for MQA, GQA, and MLA.
MQA, GQA, and MLA: Attention Variants for Faster Decoding
Multi-Head Attention (MHA)
This is the standard transformer setup:
- every head has its own Q, K, and V projections
- every head keeps its own key and value cache
Strength:
- maximum head-specific flexibility
Trade-off:
- largest KV cache and highest memory bandwidth pressure
Multi-Query Attention (MQA)
MQA keeps separate queries per head, but it shares the key and value projections across all heads.
Why it helps:
- the KV cache becomes much smaller
- decoding can become much faster
Trade-off:
- heads lose some freedom because they read from the same shared keys and values
Grouped-Query Attention (GQA)
GQA is a compromise between MHA and MQA. Instead of one shared K/V for all heads, it shares K/V within groups of heads.
This gives:
- a smaller KV cache than MHA
- more specialization than MQA
That is why GQA has become a common practical choice in large models.
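The relationship between MHA, GQA, and MQA can be sketched with one function whose behavior depends only on how many key/value heads are supplied. The toy configuration (8 query heads, head size 16, 4 layers) and the function names are assumptions, and the causal mask is omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d)."""
    group = Q.shape[0] // K.shape[0]
    # Each key/value head serves `group` consecutive query heads
    K_full = np.repeat(K, group, axis=0)
    V_full = np.repeat(V, group, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_full


def kv_cache_floats_per_token(n_kv_heads, d_head, n_layers):
    """Cached numbers per token: keys and values, every KV head, every layer."""
    return 2 * n_kv_heads * d_head * n_layers


rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 5, 16))          # 8 query heads, 5 tokens
K = rng.normal(size=(2, 5, 16))          # 2 KV heads -> groups of 4 query heads
V = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(Q, K, V)

for name, n_kv in (("MHA", 8), ("GQA", 2), ("MQA", 1)):
    print(name, kv_cache_floats_per_token(n_kv, d_head=16, n_layers=4))
# MHA 1024, GQA 256, MQA 128
```

Setting the number of key/value heads equal to the number of query heads recovers MHA, and setting it to one recovers MQA.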
Multi-Head Latent Attention (MLA)
This is the mechanism highlighted in the Welch Labs video and poster.
The key idea is not simply to force heads to share the same keys and values. Instead, the model learns a compressed latent representation for the KV cache, shared across heads, and then uses learned projections so each head can still recover head-specific behavior from that shared latent space.
Conceptually, MLA does this:
- compress the information that would normally live in a large per-head KV cache into a smaller shared latent cache
- keep enough structure so different heads can still act differently
- rearrange the inference computation so the compression does not introduce a large new runtime penalty
That is why MLA is interesting. It is not just a memory-saving trick in the crude sense. It tries to preserve the benefits of multi-head specialization while shrinking the cache dramatically.
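The following is a heavily simplified sketch of the latent-cache idea, not DeepSeek's implementation. It assumes one shared down-projection `W_dkv` and per-head up-projections `W_uk` and `W_uv` (all names and sizes are hypothetical), and it leaves out positional encoding, query compression, and the output projection. It compares a naive path that rebuilds per-head keys and values from the latent cache with a rearranged path that works directly in the latent space.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n, d_model, n_heads, d_head, d_latent = 6, 32, 4, 16, 8   # toy sizes

W_dkv = rng.normal(size=(d_model, d_latent)) * 0.1        # shared down-projection
W_uk = rng.normal(size=(n_heads, d_latent, d_head)) * 0.1 # per-head key up-projection
W_uv = rng.normal(size=(n_heads, d_latent, d_head)) * 0.1 # per-head value up-projection
W_q = rng.normal(size=(n_heads, d_model, d_head)) * 0.1

H = rng.normal(size=(n, d_model))            # hidden states of the prefix
C = H @ W_dkv                                # (n, d_latent): the only thing cached
q = np.einsum("m,hmd->hd", H[-1], W_q)       # per-head queries for the newest token

# Naive path: expand the latent cache into per-head keys and values
K = np.einsum("nl,hld->hnd", C, W_uk)
V = np.einsum("nl,hld->hnd", C, W_uv)
w_naive = softmax(np.einsum("hd,hnd->hn", q, K) / np.sqrt(d_head))
out_naive = np.einsum("hn,hnd->hd", w_naive, V)

# Rearranged path: fold the key up-projection into the query and
# apply the value up-projection once, after mixing in latent space
q_lat = np.einsum("hd,hld->hl", q, W_uk)     # (n_heads, d_latent)
w_abs = softmax(q_lat @ C.T / np.sqrt(d_head))
out_abs = np.einsum("hl,hld->hd", w_abs @ C, W_uv)

print(np.allclose(out_naive, out_abs))
print("cached floats per token:", 2 * n_heads * d_head, "(per-head K/V) vs", d_latent)
```

Both paths give the same per-head outputs, but the rearranged one never materializes per-head keys or values for the prefix, and the cache holds 8 numbers per token instead of 128 in this toy setup.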
The decoder-efficiency family: MHA stores separate K/V per head, MQA shares one K/V set, GQA shares within groups, and MLA uses a learned shared latent cache with head-specific recovery.
Why DeepSeek MLA Drew So Much Attention
The Welch Labs video discusses DeepSeek R1 because that model made the mechanism famous to a wider audience in early 2025. The underlying attention design, however, was introduced in the DeepSeek-V2 technical report published in May 2024.
That distinction matters:
- the reasoning model publicized the result
- the earlier technical report introduced the architectural idea
The central engineering claim is that MLA drastically reduces KV-cache pressure relative to standard multi-head attention. That matters because modern decoding is often memory-bandwidth-bound, not purely arithmetic-bound.
In other words, the bottleneck is often not “how many multiplications can the GPU do?” but “how quickly can the system move cached keys and values back into the compute units for the next token?”
Why MLA is appealing:
- smaller KV cache
- lower memory traffic during decode
- preservation of more head-specific flexibility than simple MQA
- better fit for long-context inference workloads
One subtle but important point from the DeepSeek description is that MLA is combined with algebraic rearrangements so the latent-space trick does not simply add another expensive matrix multiply on every token. That design choice is part of what makes the mechanism practically useful instead of merely elegant on paper.
Not Everything Called “Attention Optimization” Is a New Attention Mechanism
This is an important distinction.
Some improvements change the attention mechanism itself:
- additive attention
- self-attention
- sparse attention
- linear attention
- MQA, GQA, MLA
Other improvements mainly change the implementation strategy:
- FlashAttention
- fused kernels
- better cache layouts
- quantized KV caches
These are all valuable, but they solve different layers of the problem.
For example:
- FlashAttention improves how attention is computed on hardware
- MLA changes what is stored and how head-specific information is represented
The first is mainly an execution optimization. The second is an architectural change.
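To illustrate why an execution optimization leaves the result unchanged, the sketch below processes keys and values in small blocks with a running softmax, which is the general idea that FlashAttention-style kernels build on. It is a single-query NumPy illustration with hypothetical names, not the actual kernel.

```python
import numpy as np


def exact_attention(q, K, V):
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V


def blockwise_attention(q, K, V, block=2):
    """Same output as exact_attention, but only one block of scores at a time."""
    m = -np.inf                        # running maximum of the scores seen so far
    l = 0.0                            # running softmax denominator
    acc = np.zeros(V.shape[1])         # running weighted sum of values
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(q.shape[0])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l


rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(7, 8))
V = rng.normal(size=(7, 4))
print(np.allclose(exact_attention(q, K, V), blockwise_attention(q, K, V)))
```

The model, its weights, and its cache contents are untouched; only the order in which the arithmetic is carried out changes.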
Which Attention Mechanism Should You Use?
There is no universally best choice. The right mechanism depends on the job.
Use full self-attention when:
- context lengths are moderate
- you want maximum modeling flexibility
- you are building a standard transformer baseline
Use cross-attention when:
- one sequence must read from another sequence
- you are building encoder-decoder or multimodal systems
Use local or sparse attention when:
- long-context cost is a real bottleneck
- the task has strong locality structure
- you can tolerate restricted token-to-token connectivity
Use linear attention when:
- you need streaming or long-sequence efficiency
- approximate or reformulated attention is acceptable
- you are optimizing for asymptotic scaling
Use MQA or GQA when:
- decoder inference speed matters
- KV-cache footprint is a bottleneck
- you need a practical LLM inference optimization
Use MLA when:
- you want more aggressive KV-cache reduction
- you still want more flexibility than naive K/V sharing
- you are designing for large-scale autoregressive decoding
A Practical Comparison Table
| Mechanism | What changes? | Main benefit | Main trade-off | Common use |
|---|---|---|---|---|
| Additive attention | Learned scoring network over encoder/decoder states | Strong seq2seq alignment | Less hardware-friendly than dot products | Early NMT |
| Dot-product attention | Similarity via vector products | Fast and scalable | Still expensive at long context | Seq2seq, transformers |
| Self-attention | Tokens attend within one sequence | Rich contextualization | Full version is quadratic | Encoders and decoders |
| Cross-attention | Queries read a different sequence | Great for conditioning and fusion | Extra memory and compute | Encoder-decoder, multimodal |
| Causal attention | Future positions are masked | Valid next-token prediction | Still quadratic over prefix during training | GPT-style LLMs |
| Multi-head attention | Multiple learned heads in parallel | Diverse relation modeling | Large KV cache in decoding | Vanilla transformers |
| Local attention | Restrict attention to nearby windows | Cheaper long-context processing | Weak direct long-range access | Long documents |
| Sparse attention | Compute only selected token pairs | Better scaling than full attention | Pattern design matters | Long-context models |
| Linear attention | Rewrite or approximate attention algebra | Near-linear scaling | May sacrifice exactness | Streaming and very long sequences |
| Multi-query attention | Share K/V across all heads | Much smaller KV cache | Less head specialization | Fast decoder inference |
| Grouped-query attention | Share K/V within groups | Good quality/speed compromise | Not as flexible as full MHA | Many modern LLMs |
| Multi-head latent attention | Learn a compressed shared latent KV space | Very small KV cache with stronger flexibility | More architectural complexity | DeepSeek-style decoder efficiency |
Common Misunderstandings
“Attention means the model understands language like a human.”
No. Attention is a learned weighting mechanism. It is powerful, but it is still numerical pattern processing.
“Attention weights are a complete explanation of model reasoning.”
Not necessarily. Attention maps can be informative, but they are not the whole story. Feed-forward blocks, residual paths, normalization, and head interactions all contribute.
“All efficient attention methods solve the same problem.”
No. Some solve training-time sequence scaling, some solve long-context connectivity, and some solve decode-time KV-cache bandwidth.
“DeepSeek MLA replaces all earlier attention ideas.”
No. MLA is best understood as a decoder-efficiency architecture for transformer-style models, not as a universal replacement for every attention variant.
Final Takeaway
The word attention now covers a family of related ideas rather than one single mechanism.
The progression looks like this:
- early attention solved alignment in sequence-to-sequence models
- transformer self-attention made global token interaction the center of the architecture
- long-context variants reduced the cost of pairwise interactions
- decoder-focused variants such as MQA, GQA, and MLA attacked the KV-cache bottleneck during generation
If you are learning modern AI systems, that last step is especially important. Once models become large enough, architecture is no longer only about raw quality. It is also about bandwidth, latency, cache size, and deployability. That is exactly why newer mechanisms such as DeepSeek’s MLA matter.
Sources and Further Reading
- Bahdanau, Cho, and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2014): https://arxiv.org/abs/1409.0473
- Luong, Pham, and Manning, Effective Approaches to Attention-based Neural Machine Translation (2015): https://arxiv.org/abs/1508.04025
- Vaswani et al., Attention Is All You Need (2017): https://arxiv.org/abs/1706.03762
- Shazeer, Fast Transformer Decoding: One Write-Head is All You Need (2019): https://arxiv.org/abs/1911.02150
- Child et al., Generating Long Sequences with Sparse Transformers (2019): https://arxiv.org/abs/1904.10509
- Beltagy, Peters, and Cohan, Longformer: The Long-Document Transformer (2020): https://arxiv.org/abs/2004.05150
- Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020): https://arxiv.org/abs/2006.16236
- Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023): https://arxiv.org/abs/2305.13245
- DeepSeek-AI, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): https://arxiv.org/abs/2405.04434
- Welch Labs poster page, MLA/DeepSeek Attention Poster 13x19: https://www.welchlabs.com/resources/mladeepseek-attention-poster-13x19