[Cover illustration: "LoRA Fine-Tuning: Low-Rank Adaptation for LLMs". A frozen pretrained matrix W0 (d_out x d_in) plus a small low-rank adapter update made of two trainable matrices, A (r x d_in) and B (d_out x r), so that W = W0 + (alpha / r) BA and only r(d_in + d_out) parameters are trained.]

LoRA Fine-Tuning Explained: What It Is, Why It Works, and the Math Behind It

LoRA stands for Low-Rank Adaptation. It is one of the most useful ideas in modern LLM fine-tuning because it changes the question from "How do we update all of the model's weights?" to "How do we learn a small update that is still expressive enough for the new task?" That is the whole trick. Instead of fine-tuning every entry of a large weight matrix, LoRA keeps the original pretrained weight frozen and learns a low-rank correction on top of it. This makes training much cheaper in trainable parameters, optimizer state, and checkpoint size, and often in VRAM as well. ...
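To make the "frozen weight plus low-rank correction" idea concrete, here is a minimal pure-Python sketch. It assumes the commonly used update form W = W0 + (alpha / r) BA, with A of shape r x d_in and B of shape d_out x r. The helper names (`matmul`, `lora_effective_weight`, `trainable_params`) and the 4096 x 4096 layer size are hypothetical, chosen only for illustration; this is not the post's own implementation.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_effective_weight(W0, A, B, alpha, r):
    """Return W = W0 + (alpha / r) * (B @ A) without modifying W0."""
    scale = alpha / r
    BA = matmul(B, A)  # shape d_out x d_in, rank at most r
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W0, BA)]


def trainable_params(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


# Toy example with d_out = 2, d_in = 3, r = 1.
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]   # frozen pretrained weight
A = [[1.0, 2.0, 3.0]]    # r x d_in, trainable
B = [[0.5], [0.0]]       # d_out x r, trainable
W = lora_effective_weight(W0, A, B, alpha=2.0, r=1)

# Hypothetical layer size, for illustration only.
full_count = 4096 * 4096
lora_count = trainable_params(4096, 4096, r=8)
print(W, full_count, lora_count)
```

For the hypothetical 4096 x 4096 layer with r = 8, the adapter holds 65,536 trainable values against 16,777,216 in the full matrix, which is where the savings in parameters and optimizer state come from.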

April 5, 2026 · 11 min · Nitin

Loss Functions for LLMs: A Practical, Hands-On Guide

Introduction
When training large language models (LLMs), the most important question is simple: how do we measure whether the model is doing well? For regression you use mean squared error; for classification you might use cross-entropy or hinge loss. But LLMs predict sequences of discrete tokens, so the right way to turn “this output feels wrong” into a number you can optimize is a specific kind of probability loss: categorical cross-entropy, or negative log likelihood, along with the closely related and more interpretable metric, perplexity. ...
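As a rough illustration of how the two quantities relate, the sketch below computes the mean negative log likelihood over a short sequence and then exponentiates it to get perplexity. The function names and the probability values are hypothetical; a real training loop would compute these from model logits rather than from hand-written probabilities.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Hypothetical probabilities the model assigned to each correct next token.
probs = [0.5, 0.25, 0.125]
loss = cross_entropy(probs)
ppl = perplexity(probs)
print(f"loss = {loss:.4f} nats, perplexity = {ppl:.4f}")
```

One way to read perplexity: a model that always spreads its probability evenly over k candidate tokens has perplexity k, so lower values mean the model is, on average, less "confused" about the next token.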

October 18, 2025 · 5 min · Nitin

Supervised Fine-Tuning (SFT)

What is Supervised Fine-Tuning (SFT)?
Supervised fine-tuning is a training strategy in which a pre-trained language model is further refined on a carefully curated dataset of prompt-response pairs. The primary goal is to “teach” the model to generate appropriate, contextually relevant, and human-aligned responses. Key points about SFT include:
Data Curation: The model is exposed to a dataset of high-quality examples, often created by human annotators, that demonstrate the desired behavior (e.g., step-by-step reasoning, correct coding outputs, or helpful dialogue responses).
Instruction Following: By training on these examples, the model learns to interpret prompts as instructions and produce answers that mimic the reasoning and style of the training data.
Limitations: While SFT works well to instill basic response quality, it is typically limited by the dataset’s scope and may not encourage the model to “think” beyond what is explicitly provided. Furthermore, excessive fine-tuning can lead to overfitting and reduce the model’s ability to generalize to unseen tasks.
For many contemporary language models, SFT is the standard method used to bridge the gap between raw pre-training and interactive, user-facing performance. ...

February 2, 2025 · 5 min · Nitin

Intro to Large Language Models

The Busy Person’s Guide to Large Language Models: From Inner Workings to Future Possibilities (and Security Concerns)
This post explores the fascinating world of large language models (LLMs) like ChatGPT and Llama 2, diving into their inner workings, potential future developments, and even the security challenges they present. It’s a summary of a talk by Andrej Karpathy, offering a comprehensive overview for anyone curious about this rapidly evolving technology.
What are LLMs and How Do They Work?
Imagine a massive file containing compressed knowledge from the internet: that’s essentially what an LLM is. It’s a complex neural network trained on vast amounts of text data, enabling it to predict and generate human-like text. The process involves two key stages: ...

April 23, 2024 · 4 min · Nitin