Forward pass cover illustration: a visual summary of the forward pass from inputs through weighted sum and activation to output, then scaling to matrix form (z = Wx + b, a = f(z)). The same computation repeats: inputs -> weighted sum -> activation -> output.

Forward Pass Explained: From a Single Neuron to Matrix Form

The forward pass is the part of a neural network that actually produces a prediction. You feed inputs into the model, the model applies a sequence of mathematical operations, and an output comes out the other side. That sounds trivial, but it is one of the most important ideas in deep learning because everything else depends on it: the loss compares the forward-pass output to the target; backpropagation differentiates through the forward pass; training is just repeating the forward pass and improving it. The easiest way to understand the forward pass is to start with a single neuron and then scale it up into a full layer written in matrix form. ...
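
To make the single-neuron-to-layer idea concrete, here is a minimal sketch in plain Python. It assumes ReLU as the activation (as in the cover illustration); the function names and the example numbers are made up for illustration and are not taken from the full post.

```python
def relu(z):
    # ReLU activation: pass positive values through, clamp negatives to 0.
    return max(0.0, z)


def neuron_forward(x, w, b):
    # One neuron: weighted sum of the inputs, plus a bias, then the activation.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return relu(z)


def dense_forward(x, W, b):
    # A dense layer is the same computation repeated once per neuron.
    # Each row of W holds one neuron's weights, so this is a = f(Wx + b).
    return [neuron_forward(x, row, bi) for row, bi in zip(W, b)]


x = [1.0, 2.0, 3.0]            # illustrative inputs
w = [0.5, -1.0, 1.0]           # illustrative weights for a single neuron
print(neuron_forward(x, w, 0.5))   # 0.5 - 2.0 + 3.0 + 0.5 = 2.0

W = [[0.5, -1.0, 1.0],         # neuron 1 (same weights as above)
     [-1.0, 0.0, 0.0]]         # neuron 2
b = [0.5, 0.0]
print(dense_forward(x, W, b))      # [2.0, 0.0]
```

The second neuron's weighted sum is negative, so ReLU clamps its output to 0.0. A real implementation would normally use a matrix library instead of nested lists; the lists are only there to keep the sketch dependency-free.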

April 3, 2026 · 7 min · Nitin
Backpropagation cover illustration: the forward pass flowing left to right and gradients flowing backward from loss to parameters. Backpropagation is the chain rule applied efficiently over the computation graph.

Backpropagation Explained Visually: How Neural Networks Actually Learn

Backpropagation is the core algorithm that makes neural networks trainable. The forward pass tells the model what prediction it currently makes. Backpropagation tells the model how each weight contributed to the error so the optimizer can update those weights in the right direction. People often hear that backpropagation is “just the chain rule,” which is true but not especially helpful. The useful mental model is this: the forward pass computes values; the backward pass computes sensitivities; each node only needs its own local derivative; the full gradient is built by multiplying those local derivatives along the path. If that sounds abstract, it becomes much clearer once you look at one neuron first and then scale up. ...
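
The mental model above can be sketched for one neuron in a few lines of Python. This is an illustration under stated assumptions, not the post's own code: it assumes a scalar input, ReLU as the activation, and a squared-error loss L = (a - y)^2, and all names and numbers are hypothetical.

```python
def forward(x, w, b):
    # Forward pass computes values.
    z = w * x + b          # weighted sum
    a = max(0.0, z)        # ReLU activation
    return z, a


def backward(x, w, b, y):
    # Backward pass computes sensitivities, one local derivative per node.
    z, a = forward(x, w, b)
    dL_da = 2.0 * (a - y)              # loss node: L = (a - y)^2 (assumed loss)
    da_dz = 1.0 if z > 0 else 0.0      # ReLU node
    dz_dw = x                          # weighted-sum node, with respect to w
    dz_db = 1.0                        # weighted-sum node, with respect to b
    # Chain rule: multiply the local derivatives along the path to each parameter.
    dL_dz = dL_da * da_dz
    return dL_dz * dz_dw, dL_dz * dz_db


grad_w, grad_b = backward(x=2.0, w=0.5, b=1.0, y=1.0)
print(grad_w, grad_b)   # 4.0 2.0
```

Here z = 2.0 and a = 2.0, so dL/da = 2.0, the ReLU derivative is 1.0, and multiplying along the path gives dL/dw = 2.0 × 1.0 × 2.0 = 4.0 and dL/db = 2.0.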

April 3, 2026 · 8 min · Nitin

RoPE Explained: The Positional Encoding Trick Behind Modern Language Models

When people talk about transformers, they usually focus on attention, scale, or training data. But one smaller design choice has an outsized effect on model quality: How does the model know where each token appears in the sequence? That question matters because transformers do not understand order by default. Without positional information, a sequence starts to look more like an unordered set of tokens than a structured sentence, paragraph, or program. ...

March 19, 2026 · 10 min · Nitin
GPT-2 XL architecture diagram showing token embeddings, positional embeddings, 48 transformer blocks, 25 attention heads, and the output layer

Understanding LLM Architecture: Layers, Transformer Blocks, and Attention Heads

Large Language Models (LLMs) such as GPT-2, GPT-3, LLaMA, and BERT are built on top of the Transformer architecture. That architecture changed natural language processing by replacing recurrence with attention, which lets models process sequences more efficiently and capture long-range relationships more directly. If you are trying to understand what terms like layer, transformer block, and attention head actually mean, the easiest way is to follow the path a sentence takes through a GPT-style model. ...

March 16, 2026 · 8 min · Nitin

How Much Do LLMs Hallucinate in Document Q&A? Key Lessons from a 172B-Token Study

If you are building a RAG system, internal knowledge assistant, or document search chatbot, one question matters more than almost anything else: When the answer is supposed to come from the provided documents, how often does the model still make things up? That is exactly what the March 9, 2026 paper “How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms” tries to measure. ...

March 13, 2026 · 9 min · Nitin
How Attention Evolved: from sequence-to-sequence alignment to long-context decoder efficiency. Timeline: 2014, Bahdanau additive attention; 2015, Luong dot-product styles; 2017, Transformer multi-head self-attention; 2019-2023, sparse, local, linear, MQA, GQA; 2024-2025, DeepSeek MLA focus. The trend is consistent: keep the expressive power of attention, then remove its biggest bottlenecks.

Attention Mechanisms Explained: Self-Attention, Cross-Attention, Sparse Attention, MQA, GQA, and DeepSeek MLA

Attention is the idea that made modern transformers practical and powerful. Instead of compressing an entire input into one fixed vector, a model can decide, token by token, which earlier pieces of information matter most right now. That sounds simple, but there are many different kinds of attention mechanisms, and they exist because models face different constraints: some need strong alignment between an encoder and a decoder; some need to generate text one token at a time without looking ahead; some need to handle very long documents; some need to reduce GPU memory traffic at inference time. This article walks through the main families of attention, shows where they fit, and explains why newer variants such as DeepSeek’s multi-head latent attention (MLA) matter. ...

March 9, 2026 · 14 min · Nitin

How Much GPU VRAM Do You Need to Run Large Language Models?

If you’re planning to run open-weight LLMs locally or in production, one of the first questions is: How much GPU VRAM do I actually need? The answer depends on three major components: model weights, KV cache (context memory), and runtime overhead. Let’s break each one down clearly and practically. 1️⃣ Model Weights: The Base Memory Cost. The largest fixed memory cost comes from the model weights. Simple formula: Weights (GB) ≈ Parameters (in billions) × (bits per weight / 8) ...
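
As a quick sanity check on the weights formula, here is a small Python sketch. The function name and the example model sizes are illustrative assumptions; the estimate covers weights only and ignores the KV cache and runtime overhead mentioned above.

```python
def weights_gb(params_billions, bits_per_weight):
    # Weights (GB) ≈ parameters (in billions) × (bits per weight / 8).
    # bits / 8 is bytes per weight; billions of parameters × bytes gives GB.
    return params_billions * (bits_per_weight / 8)


# Illustrative examples (hypothetical model sizes and precisions):
print(weights_gb(7, 16))   # 7B parameters at 16-bit -> 14.0
print(weights_gb(7, 4))    # 7B parameters at 4-bit  -> 3.5
print(weights_gb(70, 8))   # 70B parameters at 8-bit -> 70.0
```

The same model shrinks by 4× when moving from 16-bit to 4-bit weights, which is why quantization is usually the first lever for fitting a model into limited VRAM.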

February 16, 2026 · 4 min · Nitin

Agentic Vision in Gemini 3 Flash: Turning “Seeing” into an Active Investigation

Frontier vision models have gotten really good at understanding images — but they’ve also had a consistent weakness: They still often treat an image like a single static glance. So if the answer depends on something tiny (a serial number, a distant street sign, a gauge reading, a small UI label), the model might miss it… and then it has to guess. Google’s new capability called Agentic Vision, launched with Gemini 3 Flash, is a major step toward fixing that. ...

January 29, 2026 · 5 min · Nitin

Understanding LLM Inference Basics: Prefill and Decode, TTFT, and ITL

Large language models (LLMs) like GPT-4, Llama, or Grok generate text by running inference — the phase where a trained model produces outputs from a given input prompt. While training is resource-intensive and done once, inference happens every time a user sends a query. Understanding the mechanics of inference is key to grasping why some models feel “fast” while others lag, and why certain optimizations matter. At a high level, modern LLM inference (for autoregressive transformer-based models) splits into two distinct phases: prefill and decode. These phases behave very differently in terms of computation and directly affect two critical user-facing metrics: Time to First Token (TTFT) and Inter-Token Latency (ITL). ...

December 21, 2025 · 5 min · Nitin

Analysis of the OpenAI Home Directory

Recently, someone shared a screenshot on x.com showing how to download OpenAI home directories. I tried it, and it works. In this post, we will try to understand exactly what this home directory contains. Tweet by Nitin Kalra (@nkalra0123), December 13, 2025 (https://twitter.com/nkalra0123/status/1999771366397231386?ref_src=twsrc%5Etfw): “working with GPT-5.2 thinking with gpt 5.2, i got error zip file not found. https://t.co/c1zTfBlWb9 pic.twitter.com/85tEv28MuJ” Let’s analyse the contents inside the OpenAI home directory. oai/ folder: slides, docs, PDFs, and spreadsheets tooling. This folder is a small toolkit for working with common “office” artifacts: PowerPoint decks, DOCX files, PDFs, and spreadsheets. It combines a few Python utilities with a set of practical guides that describe the preferred tools and a quality-check workflow (render → visually inspect → iterate). ...

December 13, 2025 · 5 min · Nitin