Universal Approximation Theorem illustration showing a neural network approximating a smooth function

Universal Approximation Theorem Explained: Why Neural Networks Can Approximate Any Continuous Function

The Universal Approximation Theorem (UAT) gets quoted constantly, but it is usually described in a fuzzier way than it deserves. It does not say neural networks are magically good at every task. It does not say a shallow network is the most practical architecture. It does not say gradient descent will easily find the right weights. What it does say is still important: With a suitable nonlinear activation and enough hidden units, a feedforward network can approximate any continuous function on a compact (closed and bounded) domain as closely as we want. ...
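The flavor of the theorem is easy to demonstrate numerically. A minimal sketch: fix one hidden layer of random sigmoid units and fit only the output weights (by least squares, not gradient descent) to approximate sin(x) on [0, π]. The unit count and weight scales here are arbitrary choices for illustration, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a smooth function on a bounded domain.
x = np.linspace(0.0, np.pi, 200)
target = np.sin(x)

# One hidden layer of 100 sigmoid units with random weights and biases;
# only the linear output layer is fit, via least squares.
w = rng.normal(scale=4.0, size=100)
b = rng.normal(scale=4.0, size=100)
hidden = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))  # shape (200, 100)

coeffs, *_ = np.linalg.lstsq(hidden, target, rcond=None)
approx = hidden @ coeffs

max_err = np.max(np.abs(approx - target))
```

With enough hidden units the worst-case error on the grid becomes negligible, which is exactly the "as closely as we want" part of the statement.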

April 3, 2026 · 8 min · Nitin
GPT-2 XL architecture diagram showing token embeddings, positional embeddings, 48 transformer blocks, 25 attention heads, and the output layer

Understanding LLM Architecture: Layers, Transformer Blocks, and Attention Heads

Large Language Models (LLMs) such as GPT-2, GPT-3, LLaMA, and BERT are built on top of the Transformer architecture. That architecture changed natural language processing by replacing recurrence with attention, which lets models process sequences more efficiently and capture long-range relationships more directly. If you are trying to understand what terms like layer, transformer block, and attention head actually mean, the easiest way is to follow the path a sentence takes through a GPT-style model. ...
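The numbers in the diagram above can be checked with a little arithmetic. A sketch assuming the standard GPT-2 XL hyperparameters (48 layers, model width 1600, 25 heads, vocabulary 50257); the 10-token sequence length is just an example:

```python
# GPT-2 XL hyperparameters from the architecture diagram.
n_layers, d_model, n_heads, vocab = 48, 1600, 25, 50257
head_dim = d_model // n_heads  # each head works in a 64-dim subspace
seq_len = 10                   # a short example sentence

# Tensor shapes along the path a sentence takes through the model:
emb_shape = (seq_len, d_model)               # token + positional embeddings
qkv_shape = (n_heads, seq_len, head_dim)     # per-head queries/keys/values
attn_shape = (n_heads, seq_len, seq_len)     # attention weights per head
logits_shape = (seq_len, vocab)              # output layer over the vocabulary
```

Tracking shapes like this is often the fastest way to make "layer", "block", and "head" concrete: the 25 heads split the 1600-dim residual stream into 25 subspaces of 64 dimensions each.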

March 16, 2026 · 8 min · Nitin

How Much Do LLMs Hallucinate in Document Q&A? Key Lessons from a 172B-Token Study

If you are building a RAG system, internal knowledge assistant, or document search chatbot, one question matters more than almost anything else: When the answer is supposed to come from the provided documents, how often does the model still make things up? That is exactly what the March 9, 2026 paper “How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms” tries to measure. ...

March 13, 2026 · 9 min · Nitin
Timeline of attention mechanisms from additive attention to DeepSeek MLA

Attention Mechanisms Explained: Self-Attention, Cross-Attention, Sparse Attention, MQA, GQA, and DeepSeek MLA

Attention is the idea that made modern transformers practical and powerful. Instead of compressing an entire input into one fixed vector, a model can decide, token by token, which earlier pieces of information matter most right now. That sounds simple, but there are many different kinds of attention mechanisms, and they exist because models face different constraints: some need strong alignment between an encoder and a decoder; some need to generate text one token at a time without looking ahead; some need to handle very long documents; some need to reduce GPU memory traffic at inference time. This article walks through the main families of attention, shows where they fit, and explains why newer variants such as DeepSeek’s multi-head latent attention (MLA) matter. ...
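The common core of all these variants fits in a few lines. A minimal single-head sketch (identity projections instead of learned Q/K/V matrices, to keep it short); the causal flag shows the decoder-style "no looking ahead" constraint mentioned above:

```python
import numpy as np

def self_attention(x, causal=True):
    """Minimal single-head self-attention over a (seq_len, dim) input."""
    d = x.shape[-1]
    q, k, v = x, x, x  # identity projections keep the sketch short
    scores = q @ k.T / np.sqrt(d)  # (seq, seq) scaled dot-product similarity
    if causal:
        # Decoder-style masking: a token may not attend to future tokens.
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

x = np.random.default_rng(0).normal(size=(4, 8))
out, w = self_attention(x)
```

Every family in the article modifies some piece of this: sparse attention restricts which score entries are computed, while MQA, GQA, and MLA change how the keys and values are stored to cut memory traffic.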

March 9, 2026 · 14 min · Nitin

How Much GPU VRAM Do You Need to Run Large Language Models?

If you’re planning to run open-weight LLMs locally or in production, one of the first questions is: How much GPU VRAM do I actually need? The answer depends on three major components: model weights, the KV cache (context memory), and runtime overhead. Let’s break each one down clearly and practically. 1️⃣ Model Weights: The Base Memory Cost. The largest fixed memory cost comes from the model weights. Simple formula: Weights (GB) ≈ Parameters (in billions) × (bits per weight / 8) ...
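The formula above turns directly into a tiny calculator. A sketch with the three components from the post; the default overhead figure is an assumed placeholder, not a measured value:

```python
def vram_estimate_gb(params_b, bits_per_weight=16, kv_cache_gb=0.0, overhead_gb=1.5):
    """Rough VRAM estimate: weights + KV cache + runtime overhead.

    params_b: parameter count in billions.
    overhead_gb: assumed runtime overhead (placeholder, varies by stack).
    """
    weights_gb = params_b * bits_per_weight / 8
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    return weights_gb, total_gb

# A 7B model in FP16: about 14 GB of weights before cache and overhead.
weights, total = vram_estimate_gb(7, bits_per_weight=16)
```

Quantization shows up as a change to `bits_per_weight`: the same 7B model at 4 bits needs roughly 3.5 GB for weights.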

February 16, 2026 · 4 min · Nitin

Agentic Vision in Gemini 3 Flash: Turning “Seeing” into an Active Investigation

Frontier vision models have gotten really good at understanding images — but they’ve also had a consistent weakness: They still often treat an image like a single static glance. So if the answer depends on something tiny (a serial number, a distant street sign, a gauge reading, a small UI label), the model might miss it… and then it has to guess. Google’s new capability called Agentic Vision, launched with Gemini 3 Flash, is a major step toward fixing that. ...

January 29, 2026 · 5 min · Nitin

Understanding LLM Inference Basics: Prefill and Decode, TTFT, and ITL

Large language models (LLMs) like GPT-4, Llama, or Grok generate text by running inference — the phase where a trained model produces outputs from a given input prompt. While training is resource-intensive and done once, inference happens every time a user sends a query. Understanding the mechanics of inference is key to grasping why some models feel “fast” while others lag, and why certain optimizations matter. At a high level, modern LLM inference (for autoregressive transformer-based models) splits into two distinct phases: prefill and decode. These phases behave very differently in terms of computation and directly affect two critical user-facing metrics: Time to First Token (TTFT) and Inter-Token Latency (ITL). ...
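The two metrics are simple to compute from timestamps. A minimal sketch with made-up timings (0.5 s of prefill, then one decoded token every 50 ms):

```python
def latency_metrics(request_t, token_times):
    """TTFT: time from request to first token (dominated by prefill).
    ITL: average gap between subsequent tokens (the decode phase)."""
    ttft = token_times[0] - request_t
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)
    return ttft, itl

# Request sent at t=0.0 s; first token at 0.5 s, then tokens every 50 ms.
ttft, itl = latency_metrics(0.0, [0.50, 0.55, 0.60, 0.65])
```

This is why the two phases feel so different: prefill sets TTFT (how long the response takes to start), while decode sets ITL (how smoothly the rest streams out).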

December 21, 2025 · 5 min · Nitin

Analysis of the OpenAI Home Directory

Recently, someone shared a screenshot on x.com showing how to download OpenAI home directories. I tried it, and it works. In this blog post, we will try to understand exactly what this home directory contains, working with GPT-5.2 Thinking. "with gpt 5.2, i got error zip file not found." — Nitin Kalra (@nkalra0123) <a href="https://twitter.com/nkalra0123/status/1999771366397231386?ref_src=twsrc%5Etfw">December 13, 2025</a> Let’s analyse the contents of the OpenAI home directory. The oai/ folder: tooling for slides, docs, PDFs, and spreadsheets. This folder is a small toolkit for working with common “office” artifacts: PowerPoint decks, DOCX files, PDFs, and spreadsheets. It combines a few Python utilities with a set of practical guides that describe the preferred tools and a quality-check workflow (render → visually inspect → iterate). ...

December 13, 2025 · 5 min · Nitin

How to Stop Hallucinations in RAG Chatbots: A Complete Guide

Hallucinations in RAG (Retrieval-Augmented Generation) chatbots can undermine user trust and lead to misinformation. In this comprehensive guide, we’ll explore proven strategies to minimize these AI-generated inaccuracies and build more reliable chatbot systems. If you’re building a RAG chatbot, you’ve likely encountered the frustrating problem of hallucinations—when your AI confidently provides incorrect or fabricated information. The good news? There are effective, battle-tested solutions to dramatically reduce these errors. Let’s dive into the multi-layered approach that actually works. ...

November 3, 2025 · 5 min · Nitin

Loss Functions for LLMs: A Practical, Hands-On Guide

Introduction When training large language models (LLMs), the most important question is simple: how do we measure whether the model is doing well? For regression you use mean squared error; for classification you might use cross-entropy or hinge loss. But for LLMs, which predict sequences of discrete tokens, the right way to turn “this output feels wrong” into a number you can optimize is a specific kind of probability loss: categorical cross-entropy (negative log likelihood), and the closely related, more interpretable metric, perplexity. ...
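Both quantities fit in a few lines. A minimal sketch: given the probability the model assigned to each correct token, average the negative log likelihoods, then exponentiate to get perplexity.

```python
import math

def cross_entropy_and_perplexity(token_probs):
    """token_probs[i] is the probability the model assigned to the
    correct i-th token. Returns (mean NLL, perplexity = exp(NLL))."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return nll, math.exp(nll)

# A model that assigns probability 0.25 to every correct token has
# perplexity 4: on average it is "choosing among 4 equally likely options".
loss, ppl = cross_entropy_and_perplexity([0.25, 0.25, 0.25, 0.25])
```

The exponentiation is what makes perplexity interpretable: a perplexity of 4 reads as the model being, on average, as uncertain as a uniform choice over 4 tokens.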

October 18, 2025 · 5 min · Nitin