Transformers

LoRA Fine-Tuning Explained: What It Is, Why It Works, and the Math Behind It

LoRA stands for Low-Rank Adaptation. It is one of the most useful ideas in modern LLM fine-tuning because it changes the question from: How do we update all of the model's weights? to: How do we learn a small update that is still expressive enough for the new task? That is the whole trick. Instead of fine-tuning every entry of a large weight matrix, LoRA keeps the original pretrained weight frozen and learns a low-rank correction on top of it. This makes training much cheaper in parameters, optimizer state, checkpoint size, and often VRAM. ...
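To make the "small update on a frozen weight" idea concrete, here is a minimal sketch in plain Python. It assumes the usual LoRA formulation in which the correction is the product of two thin matrices, B (d_out x rank) and A (rank x d_in); the function names, the `scale` argument, and the example sizes are illustrative choices, not taken from the article.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in             # full fine-tuning: every entry of W is trainable
    lora = rank * (d_out + d_in)    # LoRA: only B (d_out x rank) and A (rank x d_in)
    return full, lora


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, B, A, scale=1.0):
    """W stays frozen; only B and A would be trained. Returns W + scale * (B @ A)."""
    delta = matmul(B, A)  # the low-rank correction, same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Hypothetical sizes: one 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, full // lora)  # 16777216 65536 256
```

Under these assumed sizes the adapter trains 256 times fewer parameters than full fine-tuning of the same matrix, which is where the savings in optimizer state and checkpoint size come from.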

RoPE Explained: The Positional Encoding Trick Behind Modern Language Models

When people talk about transformers, they usually focus on attention, scale, or training data. But one smaller design choice has an outsized effect on model quality: How does the model know where each token appears in the sequence? That question matters because transformers do not understand order by default. Without positional information, a sequence starts to look more like an unordered set of tokens than a structured sentence, paragraph, or program. ...

GPT-2 XL architecture diagram showing token embeddings, positional embeddings, 48 transformer blocks, 25 attention heads, and the output layer

Understanding LLM Architecture: Layers, Transformer Blocks, and Attention Heads

Large Language Models (LLMs) such as GPT-2, GPT-3, LLaMA, and BERT are built on top of the Transformer architecture. That architecture changed natural language processing by replacing recurrence with attention, which lets models process sequences more efficiently and capture long-range relationships more directly. If you are trying to understand what terms like layer, transformer block, and attention head actually mean, the easiest way is to follow the path a sentence takes through a GPT-style model. ...

Attention Mechanisms Explained: Self-Attention, Cross-Attention, Sparse Attention, MQA, GQA, and DeepSeek MLA

Attention is the idea that made modern transformers practical and powerful. Instead of compressing an entire input into one fixed vector, a model can decide, token by token, which earlier pieces of information matter most right now. That sounds simple, but there are many different kinds of attention mechanisms, and they exist because models face different constraints: some need strong alignment between an encoder and a decoder; some need to generate text one token at a time without looking ahead; some need to handle very long documents; and some need to reduce GPU memory traffic at inference time. This article walks through the main families of attention, shows where they fit, and explains why newer variants such as DeepSeek’s multi-head latent attention (MLA) matter. ...

Q K V : Query (Q), Key (K), and Value (V) Vectors in the Attention Mechanism

Introduction In the attention mechanism used by transformer-based Large Language Models (LLMs) such as GPT, the core idea is to let the model dynamically focus on relevant parts of the input sequence when generating or understanding text. This is achieved through a process called scaled dot-product attention, where input tokens (e.g., words or subwords) are transformed into three types of vectors: Query (Q), Key (K), and Value (V). These are not arbitrary; they’re learned projections of the input embeddings via linear transformation matrices ...
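The projection-then-attention step described in this excerpt can be sketched in plain Python as follows. This is a minimal single-head illustration with no masking; the names `W_q`, `W_k`, `W_v` and the toy inputs are assumptions made for the example, and a real model would learn these matrices and use a tensor library.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over token embeddings X."""
    Q = matmul(X, W_q)  # queries: what each token is looking for
    K = matmul(X, W_k)  # keys: what each token offers for matching
    V = matmul(X, W_v)  # values: the content that gets mixed together
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # one row of raw scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Toy input: two tokens with 2-dimensional embeddings and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(X, identity, identity, identity)
```

With these toy inputs each token attends most strongly to itself, and every row of `weights` sums to 1, so each output row is a weighted average of the value vectors.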

Tokenization in Large Language Models: A Hands-On Guide

Introduction In this blog post, we dive deep into tokenization, the very first step in preparing data for training large language models (LLMs). Tokenization is more than just splitting sentences into words—it’s about transforming raw text into a structured format that neural networks can process. We’ll build a tokenizer, encoder, and decoder from scratch in Python, and walk through handling unknown tokens and special context markers. By the end, you’ll not only understand how tokenization works but also have working Python code you can adapt for your own projects. ...
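As a rough preview of the kind of tokenizer this excerpt describes, here is a minimal sketch of a word-level tokenizer with an encoder, a decoder, an unknown-token fallback, and one special marker. The class name, the regular expression, and the marker strings `<|unk|>` and `<|endoftext|>` are assumptions chosen for illustration; the article's own code may differ.

```python
import re

UNK, EOT = "<|unk|>", "<|endoftext|>"  # assumed marker names


def tokenize(text):
    """Split text into special markers, words, and single punctuation marks."""
    return re.findall(r"<\|[a-z]+\|>|\w+|[^\w\s]", text)


class SimpleTokenizer:
    def __init__(self, corpus):
        # Vocabulary: sorted unique corpus tokens, with the special tokens appended.
        tokens = sorted(set(tokenize(corpus)) - {UNK, EOT}) + [UNK, EOT]
        self.str_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        """Map text to token IDs, falling back to the unknown-token ID."""
        unk_id = self.str_to_id[UNK]
        return [self.str_to_id.get(tok, unk_id) for tok in tokenize(text)]

    def decode(self, ids):
        """Map IDs back to text and remove the space before punctuation."""
        text = " ".join(self.id_to_str[i] for i in ids)
        return re.sub(r"\s+([,.!?;:])", r"\1", text)


tok = SimpleTokenizer("the cat sat on the mat.")
ids = tok.encode("the dog sat. <|endoftext|>")
print(tok.decode(ids))  # the <|unk|> sat. <|endoftext|>
```

The word "dog" never appeared in the corpus, so it round-trips as the unknown token, while the end-of-text marker survives encoding and decoding intact.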

Unveiling the Secrets Behind ChatGPT – Part 1

Introduction Hello everyone! By now, you’ve likely heard of ChatGPT, the revolutionary AI system that has taken the world and the AI community by storm. This remarkable technology allows you to interact with an AI through text-based tasks. The Technology Behind ChatGPT: Transformers The neural network that powers ChatGPT is based on the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need.” GPT stands for “Generative Pre-trained Transformer.” The Transformer architecture is a landmark development in AI that revolutionized the field, primarily in natural language processing (NLP). Initially designed for machine translation, it became the backbone for numerous AI applications, including ChatGPT. ...

Intro to Large Language Models

The Busy Person’s Guide to Large Language Models: From Inner Workings to Future Possibilities (and Security Concerns) This post explores the fascinating world of large language models (LLMs) like ChatGPT and Llama 2, diving into their inner workings, potential future developments, and even the security challenges they present. It’s a summary of a talk by Andrej Karpathy, offering a comprehensive overview for anyone curious about this rapidly evolving technology. What are LLMs and How Do They Work? Imagine a massive file containing compressed knowledge from the internet – that’s essentially what an LLM is. It’s a complex neural network trained on vast amounts of text data, enabling it to predict and generate human-like text. The process involves two key stages: ...