Attention Mechanisms Explained: Self-Attention, Cross-Attention, Sparse Attention, MQA, GQA, and DeepSeek MLA

Attention is the idea that made modern transformers practical and powerful. Instead of compressing an entire input into one fixed vector, a model can decide, token by token, which earlier pieces of information matter most right now.

That sounds simple, but there are many different kinds of attention mechanisms, and they exist because models face different constraints:

some need strong alignment between an encoder and a decoder
some need to generate text one token at a time without looking ahead
some need to handle very long documents
some need to reduce GPU memory traffic at inference time

This article walks through the main families of attention, shows where they fit, and explains why newer variants such as DeepSeek’s multi-head latent attention (MLA) matter.

Figure: Timeline of attention mechanisms. Attention started as an alignment mechanism for sequence-to-sequence models, then became the core compute pattern inside transformers, and is now being redesigned for long-context and low-latency inference.

Why Attention Matters

Before transformers, sequence models such as RNNs and LSTMs processed text step by step, carrying everything they had read in a fixed-size hidden state. They were useful, but they struggled to keep distant information alive over long spans. If a model had to connect a pronoun to a noun many words earlier, or tie the end of a long paragraph back to the beginning, that fixed-size state became a bottleneck.

Attention changed this by turning memory lookup into a learned operation. Instead of asking the network to remember everything in one hidden state, attention lets the current token selectively read from many token positions.

In plain language:

the current token asks a question
earlier tokens advertise what information they contain
the model retrieves a weighted mixture of the most relevant information

That is the heart of attention.

The Core Idea: Query, Key, and Value

The modern transformer version of attention is usually written as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V

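Here d_k is the dimension of the key vectors; dividing by sqrt(d_k) keeps the scores from growing with the vector size. You do not need the notation to get the idea: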

Query (Q) represents what the current token is looking for
Key (K) represents what each token can offer
Value (V) is the actual content that will be mixed into the output

The query is compared with all keys. High similarity means high relevance. After normalization with softmax, the model uses those weights to combine the value vectors.
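The formula above can be sketched in a few lines. This is a minimal single-head illustration, assuming NumPy and random made-up inputs; the function names and shapes are chosen here for illustration and do not come from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # query-key similarity
    weights = softmax(scores, axis=-1)    # one distribution over keys per query
    return weights @ V, weights           # weighted mix of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is non-negative and sums to 1, so every output row is a weighted average of the rows of `V`.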

Figure: QKV attention flow. Scaled dot-product attention in one picture: token embeddings are projected into queries, keys, and values; scores are computed from query-key similarity; the resulting weights mix the value vectors into contextual outputs.

A Short History of Attention

1. Additive Attention (Bahdanau Attention)

One of the most influential early uses of attention appeared in sequence-to-sequence machine translation. In the 2014 Bahdanau paper, the decoder did not rely on a single final encoder state. Instead, for each output word, it learned a soft alignment over encoder states.

Why it mattered:

it improved translation quality
it made long sequences easier to handle
it made alignment between source and target words more explicit

This mechanism is often called additive attention because the score is produced by a learned feed-forward function over encoder and decoder states rather than a raw dot product.
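A rough sketch of that scoring rule, assuming the common form score_j = v^T tanh(W_s s + W_h h_j). The parameter names (W_s, W_h, v) and all sizes are illustrative assumptions, and the parameters are random rather than trained.

```python
import numpy as np

def additive_attention(s, H, W_s, W_h, v):
    """s: decoder state (d_s,); H: encoder states (n, d_h).
    W_s: (d_s, d_a), W_h: (d_h, d_a), v: (d_a,) stand in for learned parameters."""
    # A one-hidden-layer network scores each encoder state against s.
    scores = np.tanh(s @ W_s + H @ W_h) @ v      # (n,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()            # soft alignment over source positions
    context = weights @ H                        # (d_h,) weighted mix of encoder states
    return context, weights

rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 5, 6, 7, 4                    # illustrative sizes only
s = rng.normal(size=d_s)
H = rng.normal(size=(n, d_h))
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
context, weights = additive_attention(s, H, W_s, W_h, v)
```

Because the score passes through a small network, the decoder and encoder states do not need to share a dimension.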

2. Multiplicative / Dot-Product Attention (Luong Attention)

Luong-style attention simplified scoring by using dot products or closely related forms. It was faster and easier to scale than additive attention, especially as vectorized linear algebra became the dominant implementation style.

This family sits conceptually between early seq2seq attention and the fully transformer-native attention used later.
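For comparison with the additive form, here is a minimal sketch of two multiplicative scoring rules, the plain dot product and a "general" form with a learned matrix. The function names and shapes are assumptions made for illustration.

```python
import numpy as np

def dot_scores(s, H):
    """score_j = s . h_j; the two states must have the same dimension."""
    return H @ s

def general_scores(s, H, W):
    """score_j = s^T W h_j, with W of shape (d_s, d_h) standing in for a learned matrix."""
    return H @ (W.T @ s)

rng = np.random.default_rng(0)
s = rng.normal(size=4)           # decoder state
H = rng.normal(size=(6, 4))      # six encoder states
W = rng.normal(size=(4, 4))
scores = general_scores(s, H, W)
```

Both versions reduce to matrix-vector products, which is the reason they map well onto vectorized hardware.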

Transformer Attention: The Variants You Must Know

The transformer did not invent attention, but it made attention the central compute primitive in the architecture.

Self-Attention

In self-attention, the queries, keys, and values all come from the same sequence.

If the input is:

The flag is blue

then each token can look at the other tokens in that same sentence. The token "flag" can attend to "The", "is", and "blue", and the final representation for "flag" becomes contextual rather than isolated.

Why self-attention is powerful:

every token can directly access every other token
long-range interactions no longer require many recurrent steps
the whole sequence can be processed in parallel during training

Cross-Attention

In cross-attention, queries come from one sequence while keys and values come from another.

This is the classic pattern in encoder-decoder models:

the encoder builds representations of the source sentence
the decoder uses cross-attention to read from the encoded source while generating the target sentence
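The encoder-decoder pattern above can be sketched as follows. This is a single-head illustration assuming NumPy, random weights, and hypothetical names such as X_enc and X_dec; the point is only where each of Q, K, and V comes from.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(X_dec, X_enc, W_q, W_k, W_v):
    Q = X_dec @ W_q      # queries come from the decoder side
    K = X_enc @ W_k      # keys and values come from the encoder side
    V = X_enc @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_dec, n_enc)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
X_enc = rng.normal(size=(7, d))     # 7 source tokens
X_dec = rng.normal(size=(3, d))     # 3 target tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = cross_attention(X_dec, X_enc, W_q, W_k, W_v)
```

The weight matrix is rectangular: one row per target token, one column per source token. Self-attention is the special case where both inputs are the same sequence.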

Cross-attention also appears outside translation:

retrieval-augmented generation
image-text models
audio-text models
multi-document or multi-modal fusion

Causal or Masked Self-Attention

For autoregressive language models, a token cannot be allowed to see future tokens during training. Otherwise the model could read the answer it is supposed to predict instead of learning to predict it.

That is why GPT-style decoders use causal attention:

token 1 can attend only to token 1
token 2 can attend to tokens 1 and 2
token 3 can attend to tokens 1, 2, and 3

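Mathematically, this is handled by setting every entry above the main diagonal of the attention score matrix (the positions where a query would read a later key) to negative infinity before the softmax, so those positions receive zero weight.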
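A small sketch of the mask, assuming NumPy and a random score matrix in place of real query-key scores. Rows are query positions and columns are key positions.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) raw attention scores for one head."""
    n = scores.shape[0]
    # True strictly above the diagonal, i.e. where the key is later than the query.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                 # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = causal_attention_weights(scores)
```

The first row ends up with all of its weight on the first token, and every row still sums to 1 over the positions it is allowed to see.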

Figure: Self, cross, and causal attention. Self-attention reads within one sequence. Cross-attention reads from a different sequence. Causal attention applies a future-token mask so next-token prediction remains valid.

Multi-Head Attention

One attention pattern is useful, but a single pattern is limiting. That is why transformers use multi-head attention.

Instead of learning one query, key, and value projection, the model learns many heads in parallel. Each head gets its own projection space and can specialize:

one head might focus on nearby syntax
another might focus on coreference
another might track delimiters, list structure, or repeated phrases

The heads are then concatenated and projected back into the model dimension.
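The split, attend, concatenate, and project steps can be sketched like this. It is a minimal illustration assuming NumPy, random weights, no masking, and a model dimension divisible by the number of heads; none of the names come from a specific framework.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    heads = softmax(scores) @ V                            # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # heads side by side
    return concat @ W_o                                    # back to model dimension

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each head computes its own (n, n) weight matrix, which is what lets different heads attend to different relations over the same tokens.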

Why multi-head attention works so well:

it increases representational diversity
it lets different relation types coexist
it is still easy to implement efficiently on accelerators

This is the dominant form in vanilla transformers, but it comes with a cost: each head usually stores its own key and value cache during decoding.

Soft Attention vs Hard Attention

Most large language models use soft attention. Every token receives a weighted mixture from many other tokens, and the weighting is differentiable. That makes the whole operation trainable end to end with ordinary backpropagation.

Hard attention uses discrete selection, such as choosing one location or a very small subset of locations. It can be more selective, but it is much harder to train because the discrete choice breaks ordinary backpropagation.
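A toy contrast between the two, assuming NumPy and hand-picked scores. Hard attention is represented here by a plain argmax, which is only one of several possible discrete selection rules.

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0, 1.5])
values = np.arange(8.0).reshape(4, 2)     # rows: [0,1], [2,3], [4,5], [6,7]

# Soft attention: every position contributes, and the weights vary smoothly.
soft = np.exp(scores - scores.max())
soft = soft / soft.sum()
soft_out = soft @ values

# Hard attention (argmax variant): exactly one position is selected.
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0
hard_out = hard @ values                  # equals values[0] for these scores
```

A small change in `scores` shifts `soft_out` a little, but it leaves `hard_out` unchanged until the argmax flips, which is why the discrete choice provides no useful gradient on its own.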

In practice:

soft attention dominates mainstream LLMs
hard attention ideas still appear in routing, retrieval, and some specialized sparse systems

Attention for Long Contexts

Standard full attention is expensive because the score matrix grows with the square of sequence length. If a sequence has N tokens, the matrix has N x N interactions.

That is acceptable for short contexts. It becomes painful for long documents, code repositories, or long conversations.
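To make the quadratic growth concrete, here is a small arithmetic sketch. It assumes the score matrix is stored in full at 2 bytes per entry, for a single head in a single layer; the sequence lengths are arbitrary examples.

```python
def score_matrix_bytes(n_tokens, bytes_per_score=2):
    # One score per (query, key) pair, for one head in one layer.
    return n_tokens * n_tokens * bytes_per_score

for n in (1_024, 8_192, 65_536):
    print(n, score_matrix_bytes(n) / 2**20, "MiB")
# 1024 2.0 MiB
# 8192 128.0 MiB
# 65536 8192.0 MiB
```

Going from 1,024 to 65,536 tokens multiplies the length by 64 and the matrix size by 4,096.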

This pressure led to several families of efficient attention.

Local or Sliding-Window Attention

Local attention restricts each token to a fixed neighborhood, such as a window of nearby tokens.

Why it helps:

compute and memory drop sharply
nearby dependencies in language are often very important
it is simple and hardware-friendly

Where it struggles:

purely local windows can miss long-range dependencies
information must hop across many layers to travel far
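The window restriction can be written as a boolean mask. This sketch assumes NumPy and a causal window (each token sees itself and the tokens just before it); the function name and sizes are illustrative.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: j is not in the future
    and lies within the last `window` positions, counting i itself."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(n=8, window=3)
scored_pairs = int(mask.sum())          # 21 pairs instead of 36 for full causal attention
```

With a window of 3, token 5 can read tokens 3, 4, and 5 directly. Anything from token 2 has to arrive indirectly through an intermediate token in a later layer, which is the multi-layer hopping described above.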

Sparse Attention

Sparse attention does not let every token attend everywhere. Instead, it computes a carefully chosen subset of token pairs.

Typical sparse patterns include:

local bands
strided links
dilated patterns
designated global tokens

The goal is to keep important long-range routes while avoiding a dense N x N matrix.
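One way to picture such a pattern is a mask that combines a local band with strided links. This sketch assumes NumPy and a causal setting, and the specific rule (attend to every position a multiple of `stride` back) is one illustrative choice among many.

```python
import numpy as np

def local_plus_strided_mask(n, window, stride):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < window            # nearby band
    strided = (i - j) % stride == 0     # periodic long-range links
    return causal & (local | strided)

n = 16
mask = local_plus_strided_mask(n, window=4, stride=4)
sparse_pairs = int(mask.sum())          # 82
dense_pairs = n * (n + 1) // 2          # 136 for full causal attention
```

The gap between the two counts widens as the sequence grows, since the sparse pattern adds far fewer pairs per new token than the dense one.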

Global + Local Attention

Architectures such as Longformer combine local windows with a few globally visible tokens. Those global tokens act like hubs, allowing information to travel long distances without fully dense attention.

This is often a strong compromise for document processing:

local attention captures nearby structure
global tokens provide long-range communication
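The hub effect can be checked on a small mask. This is a sketch assuming NumPy, a symmetric (non-causal) window, and a single global token at position 0; it illustrates the connectivity idea and is not Longformer's actual implementation.

```python
import numpy as np

def global_local_mask(n, radius, global_idx):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= radius      # symmetric local window
    mask[global_idx, :] = True          # global tokens read every position
    mask[:, global_idx] = True          # every position reads the global tokens
    return mask

mask = global_local_mask(n=10, radius=1, global_idx=[0])

# Two-hop reachability: i -> k -> j for some intermediate k.
reach = (mask.astype(int) @ mask.astype(int)) > 0
```

Token 9 cannot read token 5 directly, but `reach` is true everywhere: any pair of tokens is connected in two steps through the global token.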

Linear Attention

Linear attention changes the computation rather than only changing the pattern. Instead of explicitly forming the full pairwise attention matrix, it rewrites or approximates the operation so the cost grows roughly linearly with sequence length.

That can be attractive for very long inputs, but there is a trade-off:

it is faster or more memory-efficient in the right setting
it does not always match the quality of exact full attention
implementation details matter a lot
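A sketch of the algebraic rewrite, assuming NumPy, non-causal attention, and the feature map elu(x) + 1 used by Katharopoulos et al. It replaces the softmax similarity with a kernel, so it is a different attention function from exact softmax attention, not a faster way to compute the same one.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map applied to queries and keys.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                  # (d_k, d_v) summary; size independent of n
    z = Kf.sum(axis=0)             # (d_k,) normalizer summary
    return (Qf @ kv) / (Qf @ z)[:, None]

def quadratic_reference(Q, K, V):
    A = phi(Q) @ phi(K).T          # explicit (n, n) similarity matrix
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 12, 4, 6
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

Both functions return the same values; the first simply never builds the (n, n) matrix, because matrix multiplication is associative and the keys can be combined with the values before the queries are applied.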

Figure: Long-context attention patterns. Long-context variants either limit which token pairs are scored, as in local and sparse attention, or they change the algebra itself, as in linear attention.

Cross-Attention vs Self-Attention in Real Systems

A useful way to think about deployment is this:

encoder-only models use self-attention heavily for representation learning
decoder-only LLMs use causal self-attention for next-token prediction
encoder-decoder models use self-attention inside each stack and cross-attention between them
multimodal models often use cross-attention to connect text with image, audio, or video features

That means “attention mechanism” can refer either to the scoring rule itself or to the connectivity pattern between information sources.

The Inference Problem: KV Cache

A particularly important decoder-side issue is the key-value cache, usually shortened to KV cache.

When a decoder-only language model generates text one token at a time, it does not want to recompute all keys and values for the full prefix on every step. So it stores previously computed keys and values in memory and reuses them.
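A minimal single-head sketch of that reuse, assuming NumPy, random weights, and a toy sequence of five token states. It checks that decoding with a cache gives the same outputs as recomputing causal attention over the whole prefix.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(5, d))              # one hidden state per decode step

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in X:                               # one decode step per token
    q = x @ W_q
    # Only the new token's key and value are computed; older rows are reused.
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    w = softmax(q @ K_cache.T / np.sqrt(d))
    cached_outputs.append(w @ V_cache)
cached_outputs = np.stack(cached_outputs)

# Reference: recompute all keys and values and apply a causal mask.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.triu(np.ones((5, 5), dtype=bool), k=1), -np.inf, scores)
full_outputs = softmax(scores) @ V
```

The two results match, so caching trades repeated key and value projections for stored keys and values.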

That is a huge win for compute, but it creates a memory bottleneck:

the cache grows with sequence length
it must exist for every layer
in vanilla multi-head attention, each head contributes its own keys and values

As models and contexts get larger, moving KV cache data can become a major cost during inference.
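The three growth factors listed above multiply together. The following arithmetic sketch uses a hypothetical model shape (32 layers, 32 heads, head dimension 128, 2-byte values) chosen only to make the numbers concrete; it does not describe any specific model.

```python
def mha_kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_value=2):
    # Factor of 2 for keys and values; one vector per layer, head, and cached token.
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value

# Hypothetical model shape, for illustration only.
size = mha_kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=4096)
print(size / 2**30, "GiB per sequence")   # 2.0 GiB per sequence
```

That figure is for a single sequence, so it scales again with batch size, and the cache has to be read back at every decode step.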

This is the background for MQA, GQA, and MLA.

MQA, GQA, and MLA: Attention Variants for Faster Decoding

Multi-Head Attention (MHA)

This is the standard transformer setup:

every head has its own Q, K, and V projections
every head keeps its own key and value cache

Strength:

maximum head-specific flexibility

Trade-off:

largest KV cache and highest memory bandwidth pressure

Multi-Query Attention (MQA)

MQA keeps separate queries per head, but it shares the key and value projections across all heads.

Why it helps:

the KV cache becomes much smaller
decoding can become much faster

Trade-off:

heads lose some freedom because they read from the same shared keys and values

Grouped-Query Attention (GQA)

GQA is a compromise between MHA and MQA. Instead of one shared K/V for all heads, it shares K/V within groups of heads.

This gives:

a smaller KV cache than MHA
more specialization than MQA

That is why GQA has become a common practical choice in large models.
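The sharing scheme can be sketched with one function that covers all three cases. This assumes NumPy, already-projected Q, K, and V tensors, and no causal mask for brevity; the function name and shapes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d); K and V: (n_kv_heads, n, d).
    n_kv_heads == n_q_heads gives MHA, n_kv_heads == 1 gives MQA."""
    n_q_heads, n_kv_heads = Q.shape[0], K.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Each K/V head serves `group` consecutive query heads.
    K_full = np.repeat(K, group, axis=0)
    V_full = np.repeat(V, group, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_full

rng = np.random.default_rng(0)
n, d = 5, 4
Q = rng.normal(size=(8, n, d))           # 8 query heads
K = rng.normal(size=(2, n, d))           # 2 shared K/V heads, so groups of 4
V = rng.normal(size=(2, n, d))
out = grouped_query_attention(Q, K, V)
cached_values = K.size + V.size          # 80, versus 320 if all 8 heads kept their own K/V
```

Only the small K and V tensors need to be cached; the repeat happens at compute time. In this toy setup the cache is a quarter of the MHA size while the model still has eight distinct query heads.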

Multi-Head Latent Attention (MLA)

This is the mechanism highlighted in the Welch Labs video and poster.

The key idea is not simply to force heads to share the same keys and values. Instead, the model learns a compressed latent representation for the KV cache, shared across heads, and then uses learned projections so each head can still recover head-specific behavior from that shared latent space.

Conceptually, MLA does this:

compress the information that would normally live in a large per-head KV cache into a smaller shared latent cache
keep enough structure so different heads can still act differently
rearrange the inference computation so the compression does not introduce a large new runtime penalty

That is why MLA is interesting. It is not just a memory-saving trick in the crude sense. It tries to preserve the benefits of multi-head specialization while shrinking the cache dramatically.
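A heavily simplified sketch of the latent-cache idea, assuming NumPy and random weights. It keeps only the low-rank K/V compression and the algebraic rearrangement, and leaves out other parts of the published design such as positional-encoding handling. All names (W_dkv, W_uk, W_uv) and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, n_heads, d_head, d_latent = 6, 32, 4, 8, 4

W_dkv = rng.normal(size=(d_model, d_latent))          # shared down-projection
W_uk = rng.normal(size=(n_heads, d_latent, d_head))   # per-head key up-projection
W_uv = rng.normal(size=(n_heads, d_latent, d_head))   # per-head value up-projection
W_q = rng.normal(size=(n_heads, d_model, d_head))

X = rng.normal(size=(n, d_model))     # token states so far, current token last
C = X @ W_dkv                         # (n, d_latent): the only thing cached
x_new = X[-1]

def naive(h):
    # Rebuild this head's full keys and values from the latent cache.
    K = C @ W_uk[h]
    V = C @ W_uv[h]
    q = x_new @ W_q[h]
    return softmax(K @ q / np.sqrt(d_head)) @ V

def absorbed(h):
    # Fold the key up-projection into the query, and apply the value
    # up-projection after the weighted sum, so K and V are never rebuilt.
    q_latent = W_uk[h] @ (x_new @ W_q[h])             # (d_latent,)
    w = softmax(C @ q_latent / np.sqrt(d_head))
    return (w @ C) @ W_uv[h]

latent_cache_values = C.size                          # 24
mha_cache_values = 2 * n_heads * n * d_head           # 384
```

Both functions give the same output for every head, yet the second one works directly on the shared latent cache. Each head still has its own W_uk and W_uv, which is where the head-specific behavior comes from.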

Figure: KV cache attention families. The decoder-efficiency family: MHA stores separate K/V per head, MQA shares one K/V set, GQA shares within groups, and MLA uses a learned shared latent cache with head-specific recovery.

Why DeepSeek MLA Drew So Much Attention

The Welch Labs video discusses DeepSeek R1 because that model made the mechanism famous to a wider audience in early 2025. The underlying attention design, however, was introduced in the DeepSeek-V2 technical report published in May 2024.

That distinction matters:

the reasoning model publicized the result
the earlier technical report introduced the architectural idea

The central engineering claim is that MLA drastically reduces KV-cache pressure relative to standard multi-head attention. That matters because modern decoding is often memory-bandwidth-bound, not purely arithmetic-bound.

In other words, the bottleneck is often not “how many multiplications can the GPU do?” but “how quickly can the system move cached keys and values back into the compute units for the next token?”

Why MLA is appealing:

smaller KV cache
lower memory traffic during decode
preservation of more head-specific flexibility than simple MQA
better fit for long-context inference workloads

One subtle but important point from the DeepSeek description is that MLA is combined with algebraic rearrangements so the latent-space trick does not simply add another expensive matrix multiply on every token. That design choice is part of what makes the mechanism practically useful instead of merely elegant on paper.

Not Everything Called “Attention Optimization” Is a New Attention Mechanism

This is an important distinction.

Some improvements change the attention mechanism itself:

additive attention
self-attention
sparse attention
linear attention
MQA, GQA, MLA

Other improvements mainly change the implementation strategy:

FlashAttention
fused kernels
better cache layouts
quantized KV caches

These are all valuable, but they solve different layers of the problem.

For example:

FlashAttention improves how attention is computed on hardware
MLA changes what is stored and how head-specific information is represented

The first is mainly an execution optimization. The second is an architectural change.
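To illustrate what "execution optimization" means, here is a sketch of the running-softmax idea that tiled attention kernels build on, assuming NumPy and a single query vector. It is not the FlashAttention kernel itself, only a demonstration that processing keys in blocks can give the same result as ordinary attention.

```python
import numpy as np

def blockwise_attention(q, K, V, block=4):
    """Exact attention for one query, reading K and V one block at a time."""
    m = -np.inf                    # running maximum of the scores seen so far
    l = 0.0                        # running softmax denominator
    acc = np.zeros(V.shape[1])     # running unnormalized output
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(q.shape[0])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescales earlier partial sums to the new maximum
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
n, d = 10, 8
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

s_full = K @ q / np.sqrt(d)
w_full = np.exp(s_full - s_full.max())
reference = (w_full / w_full.sum()) @ V
tiled = blockwise_attention(q, K, V)
```

The outputs are identical up to floating-point error. The mathematical function is unchanged; only the order of operations and the memory access pattern differ, which is what separates this kind of optimization from an architectural change such as MLA.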

Which Attention Mechanism Should You Use?

There is no universally best choice. The right mechanism depends on the job.

Use full self-attention when:

context lengths are moderate
you want maximum modeling flexibility
you are building a standard transformer baseline

Use cross-attention when:

one sequence must read from another sequence
you are building encoder-decoder or multimodal systems

Use local or sparse attention when:

long-context cost is a real bottleneck
the task has strong locality structure
you can tolerate restricted token-to-token connectivity

Use linear attention when:

you need streaming or long-sequence efficiency
approximate or reformulated attention is acceptable
you are optimizing for asymptotic scaling

Use MQA or GQA when:

decoder inference speed matters
KV-cache footprint is a bottleneck
you need a practical LLM inference optimization

Use MLA when:

you want more aggressive KV-cache reduction
you still want more flexibility than naive K/V sharing
you are designing for large-scale autoregressive decoding

A Practical Comparison Table

Mechanism	What changes?	Main benefit	Main trade-off	Common use
Additive attention	Learned scoring network over encoder/decoder states	Strong seq2seq alignment	Less hardware-friendly than dot products	Early NMT
Dot-product attention	Similarity via vector products	Fast and scalable	Still expensive at long context	Seq2seq, transformers
Self-attention	Tokens attend within one sequence	Rich contextualization	Full version is quadratic	Encoders and decoders
Cross-attention	Queries read a different sequence	Great for conditioning and fusion	Extra memory and compute	Encoder-decoder, multimodal
Causal attention	Future positions are masked	Valid next-token prediction	Still quadratic over prefix during training	GPT-style LLMs
Multi-head attention	Multiple learned heads in parallel	Diverse relation modeling	Large KV cache in decoding	Vanilla transformers
Local attention	Restrict attention to nearby windows	Cheaper long-context processing	Weak direct long-range access	Long documents
Sparse attention	Compute only selected token pairs	Better scaling than full attention	Pattern design matters	Long-context models
Linear attention	Rewrite or approximate attention algebra	Near-linear scaling	May sacrifice exactness	Streaming and very long sequences
Multi-query attention	Share K/V across all heads	Much smaller KV cache	Less head specialization	Fast decoder inference
Grouped-query attention	Share K/V within groups	Good quality/speed compromise	Not as flexible as full MHA	Many modern LLMs
Multi-head latent attention	Learn a compressed shared latent KV space	Very small KV cache with stronger flexibility	More architectural complexity	DeepSeek-style decoder efficiency

Common Misunderstandings

“Attention means the model understands language like a human.”

No. Attention is a learned weighting mechanism. It is powerful, but it is still numerical pattern processing.

“Attention weights are a complete explanation of model reasoning.”

Not necessarily. Attention maps can be informative, but they are not the whole story. Feed-forward blocks, residual paths, normalization, and head interactions all contribute.

“All efficient attention methods solve the same problem.”

No. Some solve training-time sequence scaling, some solve long-context connectivity, and some solve decode-time KV-cache bandwidth.

“DeepSeek MLA replaces all earlier attention ideas.”

No. MLA is best understood as a decoder-efficiency architecture for transformer-style models, not as a universal replacement for every attention variant.

Final Takeaway

The word attention now covers a family of related ideas rather than one single mechanism.

The progression looks like this:

early attention solved alignment in sequence-to-sequence models
transformer self-attention made global token interaction the center of the architecture
long-context variants reduced the cost of pairwise interactions
decoder-focused variants such as MQA, GQA, and MLA attacked the KV-cache bottleneck during generation

If you are learning modern AI systems, that last step is especially important. Once models become large enough, architecture is no longer only about raw quality. It is also about bandwidth, latency, cache size, and deployability. That is exactly why newer mechanisms such as DeepSeek’s MLA matter.

Sources and Further Reading

Bahdanau, Cho, and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2014): https://arxiv.org/abs/1409.0473
Luong, Pham, and Manning, Effective Approaches to Attention-based Neural Machine Translation (2015): https://arxiv.org/abs/1508.04025
Vaswani et al., Attention Is All You Need (2017): https://arxiv.org/abs/1706.03762
Shazeer, Fast Transformer Decoding: One Write-Head is All You Need (2019): https://arxiv.org/abs/1911.02150
Child et al., Generating Long Sequences with Sparse Transformers (2019): https://arxiv.org/abs/1904.10509
Beltagy, Peters, and Cohan, Longformer: The Long-Document Transformer (2020): https://arxiv.org/abs/2004.05150
Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020): https://arxiv.org/abs/2006.16236
Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023): https://arxiv.org/abs/2305.13245
DeepSeek-AI, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): https://arxiv.org/abs/2405.04434
Welch Labs poster page, MLA/DeepSeek Attention Poster 13x19: https://www.welchlabs.com/resources/mladeepseek-attention-poster-13x19
