Attention is the idea that made modern transformers practical and powerful. Instead of compressing an entire input into one fixed vector, a model can decide, token by token, which earlier pieces of information matter most right now.

That sounds simple, but there are many different kinds of attention mechanisms, and they exist because models face different constraints:

  • some need strong alignment between an encoder and a decoder
  • some need to generate text one token at a time without looking ahead
  • some need to handle very long documents
  • some need to reduce GPU memory traffic at inference time

This article walks through the main families of attention, shows where they fit, and explains why newer variants such as DeepSeek’s multi-head latent attention (MLA) matter.

Figure: Timeline of attention mechanisms. Attention started as an alignment mechanism for sequence-to-sequence models, then became the core compute pattern inside transformers, and is now being redesigned for long-context and low-latency inference.

Why Attention Matters

Before transformers, sequence models such as RNNs and LSTMs processed text step by step. They were useful, but they struggled to keep distant information alive over long spans. If a model had to connect a pronoun to a noun many words earlier, or tie the end of a long paragraph back to the beginning, the fixed-size hidden state became a bottleneck.

Attention changed this by turning memory lookup into a learned operation. Instead of asking the network to remember everything in one hidden state, attention lets the current token selectively read from many token positions.

In plain language:

  • the current token asks a question
  • earlier tokens advertise what information they contain
  • the model retrieves a weighted mixture of the most relevant information

That is the heart of attention.

The Core Idea: Query, Key, and Value

The modern transformer version of attention is usually written as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V

You do not need the notation to get the idea:

  • Query (Q) represents what the current token is looking for
  • Key (K) represents what each token can offer
  • Value (V) is the actual content that will be mixed into the output

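The query is compared with all keys, and higher similarity means higher relevance. Dividing by sqrt(d_k) keeps the scores from growing with the key dimension, which would otherwise push the softmax toward near one-hot weights. After normalization with softmax, the model uses those weights to combine the value vectors.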
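The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel: the function names, shapes, and random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted mix of value vectors

# Illustrative random inputs: 3 queries reading from 5 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the key positions, and each row of `out` is the corresponding mixture of value vectors.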

Figure: QKV attention flow. Scaled dot-product attention in one picture: token embeddings are projected into queries, keys, and values; scores are computed from query-key similarity; the resulting weights mix the value vectors into contextual outputs.

A Short History of Attention

1. Additive Attention (Bahdanau Attention)

One of the most influential early uses of attention appeared in sequence-to-sequence machine translation. In the 2014 Bahdanau paper, the decoder did not rely on a single final encoder state. Instead, for each output word, it learned a soft alignment over encoder states.

Why it mattered:

  • it improved translation quality
  • it made long sequences easier to handle
  • it made alignment between source and target words more explicit

This mechanism is often called additive attention because the score is produced by a learned feed-forward function over encoder and decoder states rather than a raw dot product.
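As a rough sketch of that scoring function, the commonly cited form is score(s, h_j) = v^T tanh(W_s s + W_h h_j). The parameter names `W_s`, `W_h`, and `v`, and all the dimensions below, are illustrative assumptions rather than values from any particular implementation.

```python
import numpy as np

def additive_attention(s, H, W_s, W_h, v):
    """One decoder step of additive attention.

    s: decoder state (d_s,); H: encoder states (n, d_h).
    W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,) stand in for learned parameters.
    """
    # Score each encoder position j with v^T tanh(W_s s + W_h h_j).
    scores = np.tanh(W_s @ s + H @ W_h.T) @ v    # (n,)
    # Softmax turns the scores into a soft alignment over source positions.
    e = np.exp(scores - scores.max())
    alignment = e / e.sum()
    context = alignment @ H                       # weighted mix of encoder states
    return context, alignment

rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 6, 4, 5, 3
context, alignment = additive_attention(
    rng.normal(size=d_s),
    rng.normal(size=(n, d_h)),
    rng.normal(size=(d_a, d_s)),
    rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a),
)
```

Note that the decoder and encoder states may have different sizes here, because both are first mapped into a shared scoring space of size `d_a`.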

2. Multiplicative / Dot-Product Attention (Luong Attention)

Luong-style attention simplified scoring by using dot products or closely related forms. It was faster and easier to scale than additive attention, especially as vectorized linear algebra became the dominant implementation style.

This family sits conceptually between early seq2seq attention and the fully transformer-native attention used later.
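For comparison with the additive form, here is a minimal sketch of two multiplicative scoring rules: a plain dot product and a bilinear "general" form with a learned matrix. The function names and shapes are assumptions made for illustration.

```python
import numpy as np

def dot_score(s, H):
    # Plain dot product; requires decoder and encoder states to share a dimension.
    return H @ s

def general_score(s, H, W):
    # Bilinear form h_j^T W s, with W of shape (d_h, d_s) standing in for a learned matrix.
    return H @ (W @ s)

rng = np.random.default_rng(0)
s = rng.normal(size=4)          # decoder state
H = rng.normal(size=(6, 4))     # six encoder states
scores_dot = dot_score(s, H)
scores_general = general_score(s, H, np.eye(4))  # identity W reduces to the dot product
```

Both rules are a single matrix-vector product per decoder step, which is one reason they map well onto vectorized hardware.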

Transformer Attention: The Variants You Must Know

The transformer did not invent attention, but it made attention the central compute primitive in the architecture.

Self-Attention

In self-attention, the queries, keys, and values all come from the same sequence.

If the input is:

The flag is blue

then each token can look at the other tokens in that same sentence. The token flag can attend to The, is, and blue, and the final representation for flag becomes contextual rather than isolated.

Why self-attention is powerful:

  • every token can directly access every other token
  • long-range interactions no longer require many recurrent steps
  • the whole sequence can be processed in parallel during training

Cross-Attention

In cross-attention, queries come from one sequence while keys and values come from another.

This is the classic pattern in encoder-decoder models:

  • the encoder builds representations of the source sentence
  • the decoder uses cross-attention to read from the encoded source while generating the target sentence
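The only difference between self-attention and cross-attention is where the inputs come from, which a small sketch can make concrete. The single-head function below and its weight shapes are illustrative assumptions; real models add multiple heads, masking, and an output projection.

```python
import numpy as np

def attention(X_q, X_kv, W_q, W_k, W_v):
    """Queries come from X_q; keys and values come from X_kv."""
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
encoder_states = rng.normal(size=(7, d_model))   # source sentence, 7 tokens
decoder_states = rng.normal(size=(3, d_model))   # target prefix, 3 tokens

# Cross-attention: decoder positions read from encoder positions.
cross_out, cross_w = attention(decoder_states, encoder_states, W_q, W_k, W_v)
# Self-attention: the same function with one sequence in both roles.
self_out, self_w = attention(encoder_states, encoder_states, W_q, W_k, W_v)
```

The cross-attention weight matrix is rectangular (3 target positions by 7 source positions), while the self-attention matrix is square.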

Cross-attention also appears outside translation:

  • retrieval-augmented generation
  • image-text models
  • audio-text models
  • multi-document or multi-modal fusion

Causal or Masked Self-Attention

For autoregressive language models, a token cannot be allowed to see future tokens during training. Otherwise the model could read the answer directly from its input instead of learning to predict it.

That is why GPT-style decoders use causal attention:

  • token 1 can attend only to token 1
  • token 2 can attend to tokens 1 and 2
  • token 3 can attend to tokens 1, 2, and 3

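Mathematically, this is handled by masking the entries above the diagonal of the attention score matrix (the upper-right triangle), typically by setting them to negative infinity before the softmax so they receive zero weight.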
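A minimal sketch of that masking step follows. The function name is an assumption, and the all-zero score matrix is used only so the resulting weights are easy to read.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) raw attention scores; returns causally masked softmax weights."""
    n = scores.shape[0]
    # True strictly above the diagonal, i.e. where key position j > query position i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)            # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

# With equal scores, each token spreads its weight evenly over itself and its past.
weights = causal_attention_weights(np.zeros((3, 3)))
```

Here token 1 puts all its weight on itself, token 2 splits evenly over two positions, and token 3 splits evenly over three.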

Figure: Self, cross, and causal attention. Self-attention reads within one sequence. Cross-attention reads from a different sequence. Causal attention applies a future-token mask so next-token prediction remains valid.

Multi-Head Attention

One attention pattern is useful, but a single pattern is limiting. That is why transformers use multi-head attention.

Instead of learning one query, key, and value projection, the model learns many heads in parallel. Each head gets its own projection space and can specialize:

  • one head might focus on nearby syntax
  • another might focus on coreference
  • another might track delimiters, list structure, or repeated phrases

The heads are then concatenated and projected back into the model dimension.
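The project, split, attend, concatenate, and re-project steps can be sketched as follows. This is an unmasked, single-sequence illustration; the weight names and sizes are assumptions, and biases are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (n_heads, n, n)
    heads = weights @ V                                             # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)           # concatenate heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each head produces its own (n, n) attention pattern, which is what allows different heads to specialize.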

Why multi-head attention works so well:

  • it increases representational diversity
  • it lets different relation types coexist
  • it is still easy to implement efficiently on accelerators

This is the dominant form in vanilla transformers, but it comes with a cost: each head usually stores its own key and value cache during decoding.

Soft Attention vs Hard Attention

Most large language models use soft attention. Every token receives a weighted mixture from many other tokens, and the weighting is differentiable. That makes training stable with standard gradient descent.

Hard attention uses discrete selection, such as choosing one location or a very small subset of locations. It can be more selective, but it is much harder to train because the discrete choice breaks ordinary backpropagation.
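A toy comparison, using made-up scores and values, shows the difference between the two selection rules:

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])                   # illustrative relevance scores
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # one value vector per position

# Soft attention: a smooth, differentiable weighting over all positions.
e = np.exp(scores - scores.max())
soft_weights = e / e.sum()
soft_out = soft_weights @ V

# Hard attention: a one-hot choice of the single best position.
# argmax is piecewise constant, so its gradient with respect to the scores is zero
# almost everywhere; that is why ordinary backpropagation does not train it directly.
hard_weights = np.zeros_like(scores)
hard_weights[np.argmax(scores)] = 1.0
hard_out = hard_weights @ V
```

The soft output blends all three value vectors, while the hard output is exactly the value at the highest-scoring position.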

In practice:

  • soft attention dominates mainstream LLMs
  • hard attention ideas still appear in routing, retrieval, and some specialized sparse systems

Attention for Long Contexts

Standard full attention is expensive because the score matrix grows with the square of sequence length. If a sequence has N tokens, the matrix has N x N entries, so doubling the context length roughly quadruples the attention cost.

That is acceptable for short contexts. It becomes painful for long documents, code repositories, or long conversations.
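A quick back-of-the-envelope calculation makes the quadratic growth concrete. The helper below is hypothetical and assumes the full score matrix is materialized at 2 bytes per entry (for example fp16), for a single head in a single layer.

```python
def score_matrix_bytes(n_tokens, bytes_per_score=2):
    # One score per (query, key) pair: n_tokens * n_tokens entries.
    return n_tokens * n_tokens * bytes_per_score

for n_tokens in (1_000, 10_000, 100_000):
    megabytes = score_matrix_bytes(n_tokens) / 1e6
    print(f"{n_tokens:>7} tokens: {megabytes:,.0f} MB per head per layer")
```

Going from 1,000 to 100,000 tokens multiplies the sequence length by 100 but the score matrix by 10,000, which is why fused kernels avoid materializing the matrix and why the variants below avoid computing all of it.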

This pressure led to several families of efficient attention.

Local or Sliding-Window Attention

Local attention restricts each token to a fixed neighborhood, such as a window of nearby tokens.

Why it helps:

  • compute and memory drop sharply
  • nearby dependencies in language are often very important
  • it is simple and hardware-friendly

Where it struggles:

  • purely local windows can miss long-range dependencies
  • information must hop across many layers to travel far
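One way to picture local attention is as a boolean mask over token pairs. The sketch below assumes a causal window (each token sees itself and the previous `window - 1` tokens); bidirectional windows are also common, and the function name is an assumption.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: j is not in the future
    and lies at most window - 1 positions back."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(n=5, window=2)
allowed_pairs = int(mask.sum())   # grows like n * window instead of n * n
```

With a window of 2, the last token can read position 3 directly but reaches position 2 only through an intermediate layer, which is the multi-hop cost noted above.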

Sparse Attention

Sparse attention does not let every token attend everywhere. Instead, it computes a carefully chosen subset of token pairs.

Typical sparse patterns include:

  • local bands
  • strided links
  • dilated patterns
  • designated global tokens

The goal is to keep important long-range routes while avoiding a dense N x N matrix.

Global + Local Attention

Architectures such as Longformer combine local windows with a few globally visible tokens. Those global tokens act like hubs, allowing information to travel long distances without fully dense attention.

This is often a strong compromise for document processing:

  • local attention captures nearby structure
  • global tokens provide long-range communication
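A small mask sketch shows how a single hub token changes connectivity. This is an illustration in the spirit of the local-plus-global pattern, not Longformer's actual implementation; the function name, radius, and choice of global index are assumptions.

```python
import numpy as np

def local_global_mask(n, radius, global_idx):
    """True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= radius    # bidirectional local band
    mask[global_idx, :] = True        # global tokens read every position
    mask[:, global_idx] = True        # every position reads the global tokens
    return mask

mask = local_global_mask(n=8, radius=1, global_idx=[0])

# Two-hop reachability: can information flow from j to i via one intermediate token?
two_hop = (mask.astype(int) @ mask.astype(int)) > 0
```

Only 34 of the 64 token pairs are scored, yet every pair is connected within two hops through the global token.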

Linear Attention

Linear attention changes the computation rather than only changing the pattern. Instead of explicitly forming the full pairwise attention matrix, it rewrites or approximates the operation so the cost grows roughly linearly with sequence length.

That can be attractive for very long inputs, but there is a trade-off:

  • it is faster or more memory-efficient in the right setting
  • it does not always match the quality of exact full attention
  • implementation details matter a lot
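One well-known reformulation replaces the softmax with a positive feature map and then reorders the matrix products. The sketch below assumes the feature map elu(x) + 1 and the non-causal case; other feature maps and causal variants exist, and the function names are illustrative.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                  # (d, d_v) summary whose size does not depend on n
    z = Kf.sum(axis=0)             # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

def quadratic_reference(Q, K, V):
    # Same kernelized attention, but forming the full (n, n) matrix explicitly.
    Qf, Kf = feature_map(Q), feature_map(K)
    A = Qf @ Kf.T
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 4)) for _ in range(3))
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

The two functions compute the same result; the linear version simply never builds the n x n matrix. Neither is equal to softmax attention, which is the quality trade-off mentioned above.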

Figure: Long-context attention patterns. Long-context variants either limit which token pairs are scored, as in local and sparse attention, or they change the algebra itself, as in linear attention.

Cross-Attention vs Self-Attention in Real Systems

A useful way to think about deployment is this:

  • encoder-only models use self-attention heavily for representation learning
  • decoder-only LLMs use causal self-attention for next-token prediction
  • encoder-decoder models use self-attention inside each stack and cross-attention between them
  • multimodal models often use cross-attention to connect text with image, audio, or video features

That means “attention mechanism” can refer either to the scoring rule itself or to the connectivity pattern between information sources.

The Inference Problem: KV Cache

A particularly important decoder-side issue is the key-value cache, usually shortened to KV cache.

When a decoder-only language model generates text one token at a time, it does not want to recompute all keys and values for the full prefix on every step. So it stores previously computed keys and values in memory and reuses them.

That is a huge win for compute, but it creates a memory bottleneck:

  • the cache grows with sequence length
  • it must exist for every layer
  • in vanilla multi-head attention, each head contributes its own keys and values

As models and contexts get larger, moving KV cache data can become a major cost during inference.
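The caching idea can be sketched for a single head: each decode step computes one new key and value, appends them to the cache, and attends over everything stored so far. The function names are assumptions, and real implementations preallocate the cache rather than growing it with `np.vstack`.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decode_with_cache(X, W_q, W_k, W_v):
    d_k = W_k.shape[1]
    K_cache = np.empty((0, d_k))
    V_cache = np.empty((0, W_v.shape[1]))
    outputs = []
    for x in X:                                    # one token per decode step
        q, k, v = x @ W_q, x @ W_k, x @ W_v        # only the new token is projected
        K_cache = np.vstack([K_cache, k])          # cache grows by one row per step
        V_cache = np.vstack([V_cache, v])
        w = softmax(K_cache @ q / np.sqrt(d_k))    # attend over the whole prefix
        outputs.append(w @ V_cache)
    return np.array(outputs), K_cache

def full_causal_attention(X, W_q, W_k, W_v):
    # Reference: recompute everything at once with a causal mask.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
cached_out, K_cache = decode_with_cache(X, W_q, W_k, W_v)
full_out = full_causal_attention(X, W_q, W_k, W_v)
```

The cached path matches the full recomputation, but every step must read the entire cache back from memory, which is the bandwidth cost described above.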

This is the background for MQA, GQA, and MLA.

MQA, GQA, and MLA: Attention Variants for Faster Decoding

Multi-Head Attention (MHA)

This is the standard transformer setup:

  • every head has its own Q, K, and V projections
  • every head keeps its own key and value cache

Strength:

  • maximum head-specific flexibility

Trade-off:

  • largest KV cache and highest memory bandwidth pressure

Multi-Query Attention (MQA)

MQA keeps separate queries per head, but it shares the key and value projections across all heads.

Why it helps:

  • the KV cache shrinks by a factor equal to the number of heads, since only one set of keys and values is stored per layer
  • decoding can become much faster

Trade-off:

  • heads lose some freedom because they read from the same shared keys and values

Grouped-Query Attention (GQA)

GQA is a compromise between MHA and MQA. Instead of one shared K/V for all heads, it shares K/V within groups of heads.

This gives:

  • a smaller KV cache than MHA
  • more specialization than MQA

That is why GQA has become a common practical choice in large models.
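The difference between MHA, GQA, and MQA is easiest to see as arithmetic on the number of key/value heads. The configuration below (32 layers, 32 query heads, head size 128, 4,096 tokens, 2 bytes per value) is a hypothetical example, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    # Factor of 2 because both keys and values are stored.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

config = dict(n_layers=32, d_head=128, seq_len=4096)   # hypothetical model
mha = kv_cache_bytes(n_kv_heads=32, **config)   # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **config)    # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **config)    # all query heads share one K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**20:,.0f} MiB per sequence")
```

In this example the per-sequence cache drops from 2 GiB with MHA to 512 MiB with GQA and 64 MiB with MQA, and the savings multiply across every sequence in a batch.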

Multi-Head Latent Attention (MLA)

This is the mechanism highlighted in the Welch Labs video and poster.

The key idea is not simply to force heads to share the same keys and values. Instead, the model learns a compressed latent representation for the KV cache, shared across heads, and then uses learned projections so each head can still recover head-specific behavior from that shared latent space.

Conceptually, MLA does this:

  1. compress the information that would normally live in a large per-head KV cache into a smaller shared latent cache
  2. keep enough structure so different heads can still act differently
  3. rearrange the inference computation so the compression does not introduce a large new runtime penalty

That is why MLA is interesting. It is not just a memory-saving trick in the crude sense. It tries to preserve the benefits of multi-head specialization while shrinking the cache dramatically.
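The three steps can be illustrated with a heavily simplified sketch. It assumes a single shared down-projection `W_dkv` and per-head up-projections `W_uk` and `W_uv` (names chosen for this example), and it omits positional encoding, query compression, and the output projection, so it should be read as an illustration of the algebra rather than DeepSeek's implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, n_heads, d_head, d_latent = 6, 32, 4, 8, 8

W_dkv = rng.normal(size=(d_model, d_latent)) * 0.1          # shared compression
W_uk = rng.normal(size=(n_heads, d_latent, d_head)) * 0.1   # per-head key recovery
W_uv = rng.normal(size=(n_heads, d_latent, d_head)) * 0.1   # per-head value recovery
W_q = rng.normal(size=(n_heads, d_model, d_head)) * 0.1

X = rng.normal(size=(n, d_model))
C = X @ W_dkv        # (n, d_latent): the only thing cached per token
x_new = X[-1]        # the token currently being decoded

def naive(x, C):
    # Step 2 made explicit: rebuild per-head keys and values from the latent cache.
    outs = []
    for h in range(n_heads):
        K_h, V_h = C @ W_uk[h], C @ W_uv[h]
        w = softmax(K_h @ (x @ W_q[h]) / np.sqrt(d_head))
        outs.append(w @ V_h)
    return np.concatenate(outs)

def absorbed(x, C):
    # Step 3: fold the key up-projection into the query and attend in latent space.
    outs = []
    for h in range(n_heads):
        q_latent = W_uk[h] @ (x @ W_q[h])            # (d_latent,)
        w = softmax(C @ q_latent / np.sqrt(d_head))
        outs.append((w @ C) @ W_uv[h])               # up-project once, after mixing
    return np.concatenate(outs)

naive_out, absorbed_out = naive(x_new, C), absorbed(x_new, C)
floats_per_token_mla = d_latent                  # 8
floats_per_token_mha = 2 * n_heads * d_head      # 64
```

Both paths give the same output because matrix multiplication is associative, but the absorbed path never materializes per-head keys or values for the cached tokens. In this toy configuration the cache holds 8 numbers per token instead of 64.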

Figure: KV cache attention families. The decoder-efficiency family: MHA stores separate K/V per head, MQA shares one K/V set, GQA shares within groups, and MLA uses a learned shared latent cache with head-specific recovery.

Why DeepSeek MLA Drew So Much Attention

The Welch Labs video discusses DeepSeek R1 because that model made the mechanism famous to a wider audience in early 2025. The underlying attention design, however, was introduced in the DeepSeek-V2 technical report published in May 2024.

That distinction matters:

  • the reasoning model publicized the result
  • the earlier technical report introduced the architectural idea

The central engineering claim is that MLA drastically reduces KV-cache pressure relative to standard multi-head attention. That matters because modern decoding is often memory-bandwidth-bound, not purely arithmetic-bound.

In other words, the bottleneck is often not “how many multiplications can the GPU do?” but “how quickly can the system move cached keys and values back into the compute units for the next token?”

Why MLA is appealing:

  • smaller KV cache
  • lower memory traffic during decode
  • preservation of more head-specific flexibility than simple MQA
  • better fit for long-context inference workloads

One subtle but important point from the DeepSeek description is that MLA is combined with algebraic rearrangements so the latent-space trick does not simply add another expensive matrix multiply on every token. That design choice is part of what makes the mechanism practically useful instead of merely elegant on paper.

Not Everything Called “Attention Optimization” Is a New Attention Mechanism

This is an important distinction.

Some improvements change the attention mechanism itself:

  • additive attention
  • self-attention
  • sparse attention
  • linear attention
  • MQA, GQA, MLA

Other improvements mainly change the implementation strategy:

  • FlashAttention
  • fused kernels
  • better cache layouts
  • quantized KV caches

These are all valuable, but they solve different layers of the problem.

For example:

  • FlashAttention improves how attention is computed on hardware
  • MLA changes what is stored and how head-specific information is represented

The first is mainly an execution optimization. The second is an architectural change.

Which Attention Mechanism Should You Use?

There is no universally best choice. The right mechanism depends on the job.

Use full self-attention when:

  • context lengths are moderate
  • you want maximum modeling flexibility
  • you are building a standard transformer baseline

Use cross-attention when:

  • one sequence must read from another sequence
  • you are building encoder-decoder or multimodal systems

Use local or sparse attention when:

  • long-context cost is a real bottleneck
  • the task has strong locality structure
  • you can tolerate restricted token-to-token connectivity

Use linear attention when:

  • you need streaming or long-sequence efficiency
  • approximate or reformulated attention is acceptable
  • you are optimizing for asymptotic scaling

Use MQA or GQA when:

  • decoder inference speed matters
  • KV-cache footprint is a bottleneck
  • you need a practical LLM inference optimization

Use MLA when:

  • you want more aggressive KV-cache reduction
  • you still want more flexibility than naive K/V sharing
  • you are designing for large-scale autoregressive decoding

A Practical Comparison Table

Mechanism | What changes? | Main benefit | Main trade-off | Common use
Additive attention | Learned scoring network over encoder/decoder states | Strong seq2seq alignment | Less hardware-friendly than dot products | Early NMT
Dot-product attention | Similarity via vector products | Fast and scalable | Still expensive at long context | Seq2seq, transformers
Self-attention | Tokens attend within one sequence | Rich contextualization | Full version is quadratic | Encoders and decoders
Cross-attention | Queries read a different sequence | Great for conditioning and fusion | Extra memory and compute | Encoder-decoder, multimodal
Causal attention | Future positions are masked | Valid next-token prediction | Still quadratic over prefix during training | GPT-style LLMs
Multi-head attention | Multiple learned heads in parallel | Diverse relation modeling | Large KV cache in decoding | Vanilla transformers
Local attention | Restrict attention to nearby windows | Cheaper long-context processing | Weak direct long-range access | Long documents
Sparse attention | Compute only selected token pairs | Better scaling than full attention | Pattern design matters | Long-context models
Linear attention | Rewrite or approximate attention algebra | Near-linear scaling | May sacrifice exactness | Streaming and very long sequences
Multi-query attention | Share K/V across all heads | Much smaller KV cache | Less head specialization | Fast decoder inference
Grouped-query attention | Share K/V within groups | Good quality/speed compromise | Not as flexible as full MHA | Many modern LLMs
Multi-head latent attention | Learn a compressed shared latent KV space | Very small KV cache with stronger flexibility | More architectural complexity | DeepSeek-style decoder efficiency

Common Misunderstandings

“Attention means the model understands language like a human.”

No. Attention is a learned weighting mechanism. It is powerful, but it is still numerical pattern processing.

“Attention weights are a complete explanation of model reasoning.”

Not necessarily. Attention maps can be informative, but they are not the whole story. Feed-forward blocks, residual paths, normalization, and head interactions all contribute.

“All efficient attention methods solve the same problem.”

No. Some solve training-time sequence scaling, some solve long-context connectivity, and some solve decode-time KV-cache bandwidth.

“DeepSeek MLA replaces all earlier attention ideas.”

No. MLA is best understood as a decoder-efficiency architecture for transformer-style models, not as a universal replacement for every attention variant.

Final Takeaway

The word attention now covers a family of related ideas rather than one single mechanism.

The progression looks like this:

  • early attention solved alignment in sequence-to-sequence models
  • transformer self-attention made global token interaction the center of the architecture
  • long-context variants reduced the cost of pairwise interactions
  • decoder-focused variants such as MQA, GQA, and MLA attacked the KV-cache bottleneck during generation

If you are learning modern AI systems, that last step is especially important. Once models become large enough, architecture is no longer only about raw quality. It is also about bandwidth, latency, cache size, and deployability. That is exactly why newer mechanisms such as DeepSeek’s MLA matter.

Sources and Further Reading