LoRA stands for Low-Rank Adaptation. It is one of the most useful ideas in modern LLM fine-tuning because it changes the question from:
How do we update all of the model's weights?
to:
How do we learn a small update that is still expressive enough for the new task?
That is the whole trick.
Instead of fine-tuning every entry of a large weight matrix, LoRA keeps the original pretrained weight frozen and learns a low-rank correction on top of it. This makes training much cheaper in parameters, optimizer state, checkpoint size, and often VRAM.
LoRA is a PEFT method, short for Parameter-Efficient Fine-Tuning.
If you have seen people say things like:
- “I trained only adapters”
- “I fine-tuned a 7B model on one GPU”
- “I shipped one base model with many small task-specific checkpoints”
there is a good chance they were using LoRA or something very close to it.
Why Full Fine-Tuning Gets Expensive Fast
Take one linear layer with weight matrix
$$ W_0 \in \mathbb{R}^{d_{out} \times d_{in}}. $$
In a normal forward pass:
$$ y = W_0 x. $$
If we do full fine-tuning, we allow every entry of that matrix to change. So training really learns
$$ W = W_0 + \Delta W $$
and the layer becomes
$$ y = (W_0 + \Delta W)x. $$
That sounds harmless for one layer, but LLMs contain many huge projection matrices:
- attention projections
- output projections
- MLP up and down projections
Once the model is large, updating all of them becomes expensive.
The main cost is not just storing the pretrained weights. During training you also need memory for:
- gradients
- optimizer states such as Adam’s first and second moments
- checkpoints for all trainable tensors
So full fine-tuning is often overkill if the downstream task only needs a relatively structured change.
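To make the cost concrete, here is a rough back-of-the-envelope estimate of the training state for fully fine-tuning a single projection. It is only a sketch under stated assumptions: every tensor is fp32 (4 bytes per value), the optimizer is Adam with two moment buffers, and activations are ignored. The helper name `full_finetune_bytes` is made up for illustration.

```python
def full_finetune_bytes(d_out, d_in, bytes_per_value=4):
    """Rough training-state memory for one dense layer under Adam.

    Assumes weights, gradients, and both Adam moments share one dtype;
    activations and framework overhead are ignored.
    """
    n = d_out * d_in
    return {
        "weights": n * bytes_per_value,
        "gradients": n * bytes_per_value,
        "adam_moments": 2 * n * bytes_per_value,  # first and second moment
    }


costs = full_finetune_bytes(4096, 4096)
total_mib = sum(costs.values()) / 2**20
print(costs, f"total = {total_mib:.0f} MiB")
```

Under these assumptions, the training state for one 4096 x 4096 layer is four times the size of its weights, and a large model has many such layers.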
LoRA in One Equation
LoRA keeps $W_0$ frozen and parameterizes the update as
$$ \Delta W = \frac{\alpha}{r}BA $$
where
$$ A \in \mathbb{R}^{r \times d_{in}}, \qquad B \in \mathbb{R}^{d_{out} \times r}. $$
Then the forward pass becomes
$$ y = W_0 x + \frac{\alpha}{r}BAx. $$
The factor $\alpha / r$ is just a scaling term. Many libraries expose $\alpha$ as `lora_alpha`; together with $r$, it sets the effective size of the adapter update.
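A quick numerical check can make the equation less abstract. The NumPy sketch below uses toy sizes and random matrices (all values are illustrative assumptions, not trained weights) to confirm that keeping the adapter as a side branch gives the same output as folding the update into the weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 4, 8   # toy sizes, chosen for illustration

W0 = rng.normal(size=(d_out, d_in))    # frozen pretrained weight (random stand-in)
A = rng.normal(size=(r, d_in))         # factor A, shape (r, d_in)
B = rng.normal(size=(d_out, r))        # factor B, shape (d_out, r)
x = rng.normal(size=d_in)

scale = alpha / r
delta_W = scale * (B @ A)              # low-rank update, shape (d_out, d_in)

y_branch = W0 @ x + scale * (B @ (A @ x))   # adapter kept as a side branch
y_merged = (W0 + delta_W) @ x               # update folded into the weight

print(delta_W.shape, np.allclose(y_branch, y_merged))
```

The branch form never materializes the full $d_{out} \times d_{in}$ update, which is the form used during training.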
Some sources swap the letters and write the factors in the opposite order. Do not get hung up on the names. The important point is this:
- the big pretrained matrix stays frozen
- the learned update is factored through a small intermediate dimension $r$
- $r$ is much smaller than $d_{in}$ or $d_{out}$
What “Low Rank” Actually Means
The word rank is doing the heavy lifting here.
If a matrix has rank $r$, every output it produces lies in an $r$-dimensional subspace: it can only express $r$ linearly independent directions. A useful identity is:
$$ \operatorname{rank}(BA) \le \min(\operatorname{rank}(B), \operatorname{rank}(A)) \le r. $$
So the LoRA update cannot be an arbitrary full matrix. It is restricted to a smaller family of updates.
Another way to see it is to write the product as a sum of rank-1 outer products:
$$ BA = \sum_{i=1}^{r} b_i a_i^T $$
where $b_i$ is the $i$th column of $B$ and $a_i^T$ is the $i$th row of $A$.
That means LoRA is really saying:
Instead of learning one giant free-form update, learn a small number of direction pairs and add them together.
This restriction is exactly why LoRA is efficient.
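Both views of the update, the rank bound and the sum of outer products, are easy to verify numerically. The following NumPy sketch uses small random factors as stand-ins for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 3             # toy sizes for illustration

A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
BA = B @ A                             # shape (d_out, d_in), but rank at most r

# Sum of r rank-1 terms: column i of B times row i of A.
outer_sum = sum(np.outer(B[:, i], A[i, :]) for i in range(r))

print("rank of BA:", np.linalg.matrix_rank(BA), "bound r:", r)
print("matches sum of outer products:", np.allclose(BA, outer_sum))
```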
Why a Low-Rank Update Can Still Work
At first glance, LoRA looks like it should be too restrictive.
Why would a tiny rank-8 or rank-16 update be enough for a giant transformer layer?
The intuition is that many downstream tasks do not need the model to move in every possible direction in weight space. The useful adaptation often lives in a much smaller subspace.
This is closely related to a standard fact from linear algebra: the best rank-r approximation to a matrix, in Frobenius norm, comes from truncated SVD:
$$ M_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T. $$
If the singular values $\sigma_i$ decay quickly, then a small rank already captures most of the matrix’s energy.
LoRA is not literally computing the SVD of the perfect update during training. But it is making the same bet:
The task-specific change is low-dimensional enough that a low-rank parameterization can capture most of what matters.
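The truncated-SVD fact can be illustrated with a synthetic matrix. In the sketch below, `M` is a hypothetical "ideal" update built as a rank-4 signal plus small dense noise; that construction is an assumption chosen to make the singular values decay quickly, not a claim about real fine-tuning updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Hypothetical "ideal" update: low-rank signal plus small dense noise.
M = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
M += 0.01 * rng.normal(size=(d, d))

U, S, Vt = np.linalg.svd(M, full_matrices=False)


def truncate(rank):
    """Best rank-`rank` approximation of M in Frobenius norm (truncated SVD)."""
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]


for rank in (1, 2, 4, 8):
    err = np.linalg.norm(M - truncate(rank))           # Frobenius error
    tail = np.sqrt(np.sum(S[rank:] ** 2))              # norm of dropped singular values
    captured = np.sum(S[:rank] ** 2) / np.sum(S ** 2)  # fraction of squared energy kept
    print(rank, f"error={err:.4f}", f"tail={tail:.4f}", f"captured={captured:.4f}")
```

The approximation error equals the norm of the discarded singular values, so once the rank reaches the size of the underlying signal, almost nothing is left to capture.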
The Parameter-Count Math
This is where LoRA becomes obviously attractive.
For full fine-tuning of one linear layer, the number of trainable parameters is
$$ d_{out}d_{in}. $$
For LoRA, the trainable parameters are only
$$ rd_{in} + d_{out}r = r(d_{in} + d_{out}). $$
So the trainable fraction is
$$ \frac{r(d_{in} + d_{out})}{d_{out}d_{in}}. $$
If the layer is square, with $d_{in} = d_{out} = d$, this simplifies to
$$ \frac{2r}{d}. $$
That simple ratio explains why LoRA becomes dramatic on big models.
For a 4096 x 4096 projection:
| Setup | Trainable parameters | Fraction of full |
|---|---|---|
| Full fine-tuning | 16,777,216 | 100% |
| LoRA, r = 8 | 65,536 | 0.39% |
| LoRA, r = 16 | 131,072 | 0.78% |
| LoRA, r = 64 | 524,288 | 3.13% |
So even rank 64 is still training only a small slice of the full matrix.
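The counts in the table follow directly from the formula $r(d_{in} + d_{out})$. A few lines of plain Python are enough to recompute them; the helper name `lora_params` is just for illustration.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters in one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


d_out = d_in = 4096
full = d_out * d_in   # trainable parameters under full fine-tuning

for r in (8, 16, 64):
    n = lora_params(d_out, d_in, r)
    print(f"r={r:>2}  params={n:>9,}  fraction={n / full:.3%}")
```

For a square layer the fraction reduces to $2r/d$, so doubling the rank doubles the adapter size while the full count stays fixed.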
Figure: for a 4096 x 4096 layer, LoRA stays tiny even as rank increases. The full dense update is a horizontal line because it never changes with rank.
A Small but Important Nuance About Memory
LoRA reduces trainable parameters, but it does not remove the need to hold the base model itself in memory.
That distinction matters.
- LoRA saves a lot of optimizer memory because only `A` and `B` get optimizer states.
- LoRA saves checkpoint size because you can store just the adapters.
- LoRA does not magically make the frozen base weights disappear.
If base-model memory is still the bottleneck, that is where QLoRA comes in:
- quantize the frozen base model, often to 4-bit
- keep LoRA adapters trainable in higher precision
So LoRA and QLoRA solve related but not identical problems.
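A rough estimate shows why the two methods target different bottlenecks. The sketch below is deliberately crude: the 7B-parameter base and the adapter size are hypothetical numbers, the adapters are assumed to be fp32 with Adam state, and activations and quantization metadata are ignored entirely.

```python
GIB = 2**30


def training_state_gib(base_params, adapter_params, base_bits, adapter_bytes=4):
    """Very rough memory split; ignores activations, quantization metadata, and overhead."""
    base = base_params * base_bits / 8             # frozen weights: no gradients, no moments
    adapter = adapter_params * adapter_bytes * 4   # weights + gradients + 2 Adam moments
    return base / GIB, adapter / GIB


base_params = 7_000_000_000    # hypothetical 7B-parameter base model
adapter_params = 4_194_304     # hypothetical adapter size, for illustration only

for label, bits in (("LoRA, 16-bit base", 16), ("QLoRA-style, 4-bit base", 4)):
    base_gib, adapter_gib = training_state_gib(base_params, adapter_params, bits)
    print(f"{label}: base={base_gib:.2f} GiB, adapters+optimizer={adapter_gib:.4f} GiB")
```

In both rows the adapter state is small; what changes is the footprint of the frozen base, which is the part quantization shrinks.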
Forward Pass Intuition
It helps to break the LoRA forward pass into two smaller steps.
First project the input down into rank r:
$$ h = Ax \in \mathbb{R}^{r}. $$
Then project it back up:
$$ u = Bh \in \mathbb{R}^{d_{out}}. $$
Now combine with the frozen layer:
$$ y = W_0 x + \frac{\alpha}{r}u. $$
That gives a nice mental model:
- `A` compresses the useful task signal into a small latent space
- `B` expands that signal back into the output dimension
- the frozen pretrained path still carries the original model behavior
In other words, LoRA is not replacing the pretrained model. It is adding a small learned correction on top of it.
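The down-then-up picture is easiest to follow by tracking shapes on a batch of inputs. This NumPy sketch stores one input per row, which is the layout deep-learning frameworks typically use, so the factors appear transposed; sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, r, alpha = 5, 64, 48, 4, 8   # toy sizes for illustration

W0 = rng.normal(size=(d_out, d_in))   # frozen path
A = rng.normal(size=(r, d_in))        # compresses d_in -> r
B = rng.normal(size=(d_out, r))       # expands r -> d_out
X = rng.normal(size=(batch, d_in))    # one input per row

H = X @ A.T                           # (batch, r): the small latent code
U = H @ B.T                           # (batch, d_out): back in output space
Y = X @ W0.T + (alpha / r) * U        # frozen output plus scaled correction

print(H.shape, U.shape, Y.shape)

# Multiply-adds per input: adapter branch vs. applying a dense d_out x d_in update.
print("two-step:", r * (d_in + d_out), "dense:", d_out * d_in)
```

Going through the rank-$r$ bottleneck is also cheaper per input than applying a dense update of the same shape.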
What Backpropagation Updates
Suppose the loss is $L$ and the layer output is
$$ y = W_0 x + sBAx $$
with $s = \alpha / r$.
Let
$$ g = \frac{\partial L}{\partial y}. $$
If we define
$$ h = Ax, $$
then the gradients for the trainable LoRA factors are
$$ \frac{\partial L}{\partial B} = sgh^T $$
and
$$ \frac{\partial L}{\partial A} = sB^Tgx^T. $$
The important practical point is not the notation. It is this:
- gradients flow through the LoRA branch
- `A` and `B` get updated
- `W_0` stays frozen
That is why optimizer state stays small.
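The two gradient formulas can be sanity-checked against finite differences. The sketch below assumes a simple squared-error loss $L = \tfrac{1}{2}\lVert y - t \rVert^2$ (so $g = y - t$) and tiny random matrices; the loss and the helper `numeric_grad` are illustrative choices, not part of LoRA itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 6, 5, 2, 4
s = alpha / r

W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)
target = rng.normal(size=d_out)


def loss(A_, B_):
    y = W0 @ x + s * (B_ @ (A_ @ x))
    return 0.5 * np.sum((y - target) ** 2)


# Analytic gradients from the formulas for dL/dB and dL/dA.
h = A @ x
g = (W0 @ x + s * (B @ h)) - target   # dL/dy for this squared-error loss
grad_B = s * np.outer(g, h)           # shape (d_out, r)
grad_A = s * np.outer(B.T @ g, x)     # shape (r, d_in)


def numeric_grad(f, M, eps=1e-6):
    """Central finite differences, one entry at a time."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        E = np.zeros_like(M)
        E[idx] = eps
        G[idx] = (f(M + E) - f(M - E)) / (2 * eps)
    return G


num_B = numeric_grad(lambda M: loss(A, M), B)
num_A = numeric_grad(lambda M: loss(M, B), A)

print(np.allclose(grad_B, num_B, atol=1e-5), np.allclose(grad_A, num_A, atol=1e-5))
```

No gradient is ever computed or stored for `W0` here, which mirrors what freezing the base weight means in practice.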
Why the Initialization Matters
Most LoRA implementations initialize the factors so that the adapter path starts at zero.
A common choice is:
- initialize `A` with small random values
- initialize `B` to zeros
Then initially
$$ BA = 0 $$
and the model behaves exactly like the original pretrained model at step 0.
That is a good default because the adapter starts as a no-op and only learns deviations supported by the downstream data.
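The effect of this initialization can be checked directly, along with a consequence of the gradient formulas: while `B` is still zero, the gradient for `A` is zero too, but the gradient for `B` is not, so training can get started. The NumPy sketch below uses toy sizes and an arbitrary upstream gradient `g` as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8
s = alpha / r

W0 = rng.normal(size=(d_out, d_in))
A = 0.01 * rng.normal(size=(r, d_in))   # small random values
B = np.zeros((d_out, r))                # zeros, so B @ A is exactly zero
x = rng.normal(size=d_in)

y_base = W0 @ x
y_lora = W0 @ x + s * (B @ (A @ x))
print("no-op at step 0:", np.array_equal(y_base, y_lora))

# Gradients at step 0 for an arbitrary upstream gradient g = dL/dy.
g = rng.normal(size=d_out)
grad_B = s * np.outer(g, A @ x)         # generally nonzero: B can start learning
grad_A = s * np.outer(B.T @ g, x)       # exactly zero while B is still zero
print("B gets a signal:", bool(np.any(grad_B != 0)))
print("A waits for B:", bool(np.all(grad_A == 0)))
```

This also shows why both factors should not start at zero: with `A` and `B` both zero, both gradients vanish and the adapter never moves.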
A Minimal PyTorch Implementation
Here is a stripped-down LoRALinear module. This is not a full library replacement, but it shows the mechanics clearly. In a real fine-tuning run, self.weight would be copied from the pretrained checkpoint and then frozen.
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=16, alpha=32, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.r = r
        self.scaling = alpha / r if r > 0 else 0.0

        # Frozen base weight; in practice this is loaded from the pretrained checkpoint.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.weight.requires_grad = False

        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
            self.bias.requires_grad = False
        else:
            self.register_parameter("bias", None)

        if r > 0:
            # A starts small and random, B starts at zero, so B @ A is zero at step 0.
            self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
            self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        else:
            self.register_parameter("lora_A", None)
            self.register_parameter("lora_B", None)

        # Placeholder init for the demo; a real run overwrites this with pretrained values.
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, x):
        base = F.linear(x, self.weight, self.bias)
        if self.r == 0:
            return base
        # Down-project to rank r, then up-project back to out_features.
        update = (x @ self.lora_A.t()) @ self.lora_B.t()
        return base + self.scaling * update
```
And when you fine-tune, you would optimize only the adapter parameters:
```python
model = LoRALinear(4096, 4096, r=16, alpha=32)

# Only lora_A and lora_B have requires_grad=True, so only they reach the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)
```
Real-world implementations add more details:
- dropout on the adapter path
- merge and unmerge logic for inference
- automatic insertion into specific transformer modules
- optional bias tuning rules
But the mathematical core is still the same BA update.
A Short Python + Matplotlib Example
The parameter-savings chart above can be reproduced with a small script:
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter

d_in = 4096
d_out = 4096
ranks = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256])

full_params = d_in * d_out             # constant: does not depend on rank
lora_params = ranks * (d_in + d_out)   # grows linearly with rank

fig, ax = plt.subplots(figsize=(9, 5.6), constrained_layout=True)
ax.set_yscale("log")
ax.plot(ranks, lora_params, marker="o", linewidth=2.5, label="LoRA trainable params")
ax.axhline(full_params, linestyle="--", linewidth=2, label="full fine-tuning")

# Log-2 x axis with plain integer tick labels at each rank.
ax.set_xscale("log", base=2)
ax.set_xticks(ranks)
ax.get_xaxis().set_major_formatter(ScalarFormatter())

ax.set_xlabel("rank r")
ax.set_ylabel("trainable parameters")
ax.legend()
plt.show()
```
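The script reproduces the parameter-savings plot: LoRA's trainable-parameter count grows with rank while the full fine-tuning count stays a horizontal line.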
This is one of those cases where a quick plot is more persuasive than a paragraph.
Where LoRA Is Applied in a Transformer
LoRA is usually attached to selected linear layers, not every parameter in the model.
Common choices:
- `q_proj` and `v_proj` in self-attention
- sometimes `k_proj` and `o_proj` as well
- occasionally embeddings or the output head, depending on the goal
This is another reason LoRA is efficient: you can decide where adaptation capacity matters most.
For many instruction-tuning setups, adding LoRA to a small subset of projection layers is already enough to get strong results.
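The choice of target modules translates directly into adapter size. The sketch below counts adapter parameters for a hypothetical decoder with 32 layers and square 4096 x 4096 attention projections; the configuration and the helper name `adapter_params` are illustrative assumptions, not taken from a specific model or library.

```python
def adapter_params(num_layers, shapes, targets, r):
    """Total LoRA parameters when every listed target gets its own A and B."""
    per_layer = sum(r * (shapes[name][0] + shapes[name][1]) for name in targets)
    return num_layers * per_layer


# Hypothetical decoder config; shapes are (d_out, d_in) and purely illustrative.
shapes = {name: (4096, 4096) for name in ("q_proj", "k_proj", "v_proj", "o_proj")}

for targets in (("q_proj", "v_proj"), ("q_proj", "k_proj", "v_proj", "o_proj")):
    total = adapter_params(num_layers=32, shapes=shapes, targets=targets, r=8)
    print(targets, f"{total:,}")
```

Adding more target modules scales the adapter linearly, so the module list works as a capacity knob alongside the rank.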
What the Hyperparameters Mean
Three knobs matter most:
1. Rank r
This controls adapter capacity.
- smaller `r` means fewer trainable parameters
- larger `r` means more expressive updates
If rank is too small, the adapter may underfit. If it is too large, you lose some of the efficiency advantage.
2. alpha
This scales the adapter contribution.
You can think of it as controlling how strongly the low-rank branch is allowed to influence the frozen base path.
3. LoRA dropout
Some implementations apply dropout to the adapter input during training. This can help regularize the adapter when data is limited.
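The scaling and dropout knobs can be sketched together in a few lines of NumPy. This is a toy forward function, not any library's implementation: the name `lora_forward` is hypothetical, and the inverted-dropout rescaling by `1 / (1 - p)` is an assumption borrowed from how dropout layers commonly behave.

```python
import numpy as np


def lora_forward(x, W0, A, B, alpha, r, dropout_p=0.0, training=False, rng=None):
    """Frozen path plus scaled adapter path, with optional dropout on the adapter input only."""
    x_adapter = x
    if training and dropout_p > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(x.shape) >= dropout_p    # True where the input is kept
        x_adapter = x * keep / (1.0 - dropout_p)   # inverted dropout (an assumption)
    return W0 @ x + (alpha / r) * (B @ (A @ x_adapter))


rng = np.random.default_rng(0)
d_in, d_out, r = 16, 12, 4
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

base = W0 @ x
for alpha in (4, 8, 16):
    y = lora_forward(x, W0, A, B, alpha=alpha, r=r)
    # The adapter's contribution grows linearly with alpha at fixed rank.
    print(alpha, np.linalg.norm(y - base))
```

Note that the frozen path always sees the clean input; only the adapter branch is regularized, and dropout is inactive outside training.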
Why LoRA Is So Convenient Operationally
LoRA is not only about memory. It is also operationally neat.
Because the base model is frozen:
- the base checkpoint can be reused across tasks
- each new task can be stored as a small adapter file
- multiple adapters can be swapped in and out without copying the whole model
That is why LoRA has become such a standard workflow for practical LLM customization.
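The swap-in, swap-out workflow can be mimicked with one shared base matrix and a dictionary of adapters. In this NumPy sketch the adapters are random stand-ins and the task names are hypothetical; real factors would come from separate fine-tuning runs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 32, 4, 8

W0 = rng.normal(size=(d_out, d_in))   # one shared, frozen base weight


def make_adapter():
    """Stand-in for a trained adapter; real factors would come from fine-tuning."""
    return {"A": rng.normal(size=(r, d_in)), "B": rng.normal(size=(d_out, r))}


adapters = {"summarize": make_adapter(), "translate": make_adapter()}  # hypothetical tasks


def forward(x, task=None):
    """Run the base layer, optionally adding the adapter for the chosen task."""
    y = W0 @ x
    if task is not None:
        ad = adapters[task]
        y = y + (alpha / r) * (ad["B"] @ (ad["A"] @ x))
    return y


x = rng.normal(size=d_in)
for task in (None, "summarize", "translate"):
    print(task, forward(x, task)[:3])

base_values = W0.size
adapter_values = sum(m.size for m in adapters["summarize"].values())
print(f"adapter stores {adapter_values} values vs {base_values} for the base matrix")
```

Switching tasks only changes which small pair of factors is looked up; the base weight is never copied or modified.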
Can LoRA Match Full Fine-Tuning?
Sometimes yes, sometimes no.
LoRA is often surprisingly competitive, especially when:
- the downstream task is close to what the base model already knows
- the data volume is moderate
- the target behavior can be expressed with a structured update
But LoRA is still a constraint.
Cases where full fine-tuning or a larger adapter may help:
- the task requires a very large behavioral shift
- you need to update many more parts of the model
- the chosen rank is too small
- the base model itself is a poor starting point
So the right question is not:
Is LoRA always better than full fine-tuning?
The right question is:
Is a low-rank update enough for this task, given the cost savings?
The Main Takeaway
LoRA works because it separates two roles:
- the pretrained model stores broad general capability
- the adapter stores a compact task-specific correction
Mathematically, the idea is simple:
$$ W = W_0 + \frac{\alpha}{r}BA $$
Engineering-wise, that small factorization changes a lot:
- far fewer trainable parameters
- smaller optimizer state
- smaller checkpoints
- easy adapter sharing and reuse
That is why LoRA became one of the default answers to the question:
How do I fine-tune a large model without paying the full price of full fine-tuning?
If you want one sentence to remember, use this one:
LoRA freezes the big pretrained matrix and learns a small low-rank update that captures the task-specific change.