Managing Multiple Python and Java Versions: A Developer’s Guide to pyenv and SDKMAN

Introduction As developers, we often find ourselves juggling multiple projects, each requiring different versions of Python or Java. Maybe you’re maintaining a legacy application that runs on Python 3.8 while building a new microservice on Python 3.12. Or perhaps you’re working with Java 11 for one client and Java 21 for another. Manually managing these versions can quickly become a nightmare of PATH variables, symlinks, and “it works on my machine” debugging sessions. ...

November 14, 2025 · 8 min · Nitin

Debugging HTTP Traffic Like a Pro: HTTP Toolkit and Terminal Interception

Introduction If you’ve ever stared at a cryptic error message from a CLI tool wondering “What HTTP requests is this thing actually making?”, you’re not alone. Whether you’re diagnosing a failed git clone, chasing a mysterious npm install error, inspecting the prompts Claude Code sends, or finding out what data your application sends to third-party services, understanding HTTP traffic is crucial for modern development. Enter HTTP Toolkit – an open-source powerhouse that makes intercepting and debugging HTTP traffic almost effortless. ...

November 5, 2025 · 6 min · Nitin

How to Stop Hallucinations in RAG Chatbots: A Complete Guide

Hallucinations in RAG (Retrieval-Augmented Generation) chatbots can undermine user trust and lead to misinformation. In this comprehensive guide, we’ll explore proven strategies to minimize these AI-generated inaccuracies and build more reliable chatbot systems. If you’re building a RAG chatbot, you’ve likely encountered the frustrating problem of hallucinations—when your AI confidently provides incorrect or fabricated information. The good news? There are effective, battle-tested solutions to dramatically reduce these errors. Let’s dive into the multi-layered approach that actually works. ...

November 3, 2025 · 5 min · Nitin

Agentic Context Engineering (ACE): Turning Context Into a Self-Improving Playbook for LLMs

Large language models are getting smarter—but the real superpower may be how we feed them context. Instead of constantly fine-tuning weights, a growing family of techniques improves models by upgrading the inputs they see: richer instructions, reusable strategies, domain heuristics, and concrete evidence. The paper “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models” proposes ACE, a practical framework that treats context like an evolving playbook—something you grow, refine, and curate over time to make agents and reasoning systems measurably better. ...

October 22, 2025 · 9 min · Nitin

Loss functions for LLMs: a practical, hands-on guide

Introduction When training large language models (LLMs), the most important question is simple: how do we measure whether the model is doing well? For regression you use mean squared error; for classification you might use cross-entropy or hinge loss. But for LLMs, which predict sequences of discrete tokens, the right way to turn “this output feels wrong” into a number you can optimize is a specific kind of probabilistic loss: categorical cross-entropy (equivalently, negative log-likelihood), together with the closely related and more interpretable metric, perplexity. ...
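As a rough illustration of how these two quantities relate, here is a minimal sketch in plain Python. It is not taken from the post itself; the function names and the example probabilities are assumptions made for illustration. It computes the average negative log-likelihood of the correct tokens and exponentiates it to get perplexity.

```python
import math

def cross_entropy(target_probs):
    """Average negative log-likelihood (in nats) of the correct tokens.

    target_probs holds the probability the model assigned to the
    correct next token at each position of the sequence.
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Perplexity is the exponential of the average cross-entropy."""
    return math.exp(cross_entropy(target_probs))

# Hypothetical probabilities assigned to the correct token at four positions.
probs = [0.5, 0.25, 0.5, 0.25]
loss = cross_entropy(probs)
ppl = perplexity(probs)
print(f"cross-entropy: {loss:.4f} nats, perplexity: {ppl:.4f}")
```

For these made-up numbers the perplexity is about 2.83, which can be read as the model being roughly as uncertain as if it were choosing uniformly among that many tokens at each step.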

October 18, 2025 · 5 min · Nitin

Q, K, V: Query (Q), Key (K), and Value (V) Vectors in the Attention Mechanism

Introduction In the attention mechanism used by transformer-based Large Language Models (LLMs) such as GPT, the core idea is to let the model dynamically focus on relevant parts of the input sequence when generating or understanding text. This is achieved through a process called scaled dot-product attention, where input tokens (e.g., words or subwords) are transformed into three types of vectors: Query (Q), Key (K), and Value (V). These are not arbitrary; they’re learned projections of the input embeddings via linear transformation matrices ...
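To make scaled dot-product attention concrete, here is a minimal pure-Python sketch of softmax(QKᵀ / √d_k)·V. It is an illustration under simplifying assumptions, not the post's own code: the Q, K, and V matrices are written by hand, whereas in a real model they would come from learned projections of the input embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hand-written toy matrices (assumed for illustration; normally learned projections).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

Each query attends most strongly to the key it aligns with, so the first output row leans toward the first value vector and the second row toward the second.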

October 1, 2025 · 3 min · Nitin

Token Embeddings — what they are, why they matter, and how to build them (with working code)

Introduction Token embeddings (aka vector embeddings) turn tokens (words, subwords, or characters) into numeric vectors that encode meaning. They’re the essential bridge between raw text and a neural network. In this post we will run a few small demos (Word2Vec-style analogies, similarity checks) and provide concrete PyTorch code that demonstrates how an embedding layer works. I also include a tiny toy training loop so you can see embeddings being updated by backprop. ...

September 28, 2025 · 7 min · Nitin

Byte Pair Encoding (BPE): the tokenizer that made GPTs practical

Introduction Byte Pair Encoding (BPE) is a subword tokenization scheme that combines the strengths of word-level and character-level tokenization: a compact vocabulary (not the full wordlist), the ability to represent any unknown word (by falling back to subwords/characters), and meaningful shared pieces (roots, suffixes) that help models generalize. GPT-2 used a BPE tokenizer with a vocabulary of 50,257 tokens, and OpenAI’s tiktoken is a fast Rust-backed implementation you can use today. Below I explain the why, the how (intuition + algorithm), and a short hands-on demo using tiktoken. ...

September 27, 2025 · 4 min · Nitin

Tokenization in Large Language Models: A Hands-On Guide

Introduction In this blog post, we dive deep into tokenization, the very first step in preparing data for training large language models (LLMs). Tokenization is more than just splitting sentences into words—it’s about transforming raw text into a structured format that neural networks can process. We’ll build a tokenizer, encoder, and decoder from scratch in Python, and walk through handling unknown tokens and special context markers. By the end, you’ll not only understand how tokenization works but also have working Python code you can adapt for your own projects. ...
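As a rough idea of what such a from-scratch tokenizer might look like, here is a minimal sketch. It is not the post's implementation: the class name `SimpleTokenizer`, the regular expression, and the `<|unk|>` marker are assumptions chosen for illustration. It builds a vocabulary from a small corpus, encodes text to token IDs, maps out-of-vocabulary words to an unknown-token ID, and decodes IDs back to text.

```python
import re

def tokenize(text):
    """Split text into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

class SimpleTokenizer:
    """Toy word-level tokenizer with an unknown-token fallback (hypothetical)."""

    UNK = "<|unk|>"  # assumed name for the unknown-token marker

    def __init__(self, corpus):
        tokens = sorted(set(tokenize(corpus)))
        tokens.append(self.UNK)
        self.token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text):
        """Map each token to its ID; unseen tokens map to the UNK ID."""
        unk_id = self.token_to_id[self.UNK]
        return [self.token_to_id.get(tok, unk_id) for tok in tokenize(text)]

    def decode(self, ids):
        """Join tokens with spaces, then drop the space before punctuation."""
        text = " ".join(self.id_to_token[i] for i in ids)
        return re.sub(r"\s+([,.!?;:])", r"\1", text)

tok = SimpleTokenizer("the cat sat on the mat.")
ids = tok.encode("the dog sat.")
print(ids, tok.decode(ids))
```

Because "dog" never appears in the toy corpus, it is encoded as the unknown-token ID and decodes back as `<|unk|>`, which is the behavior an unknown-token fallback is meant to provide.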

September 27, 2025 · 4 min · Nitin

Unlocking Deeper AI: The Power of Thinking in LLM Models

Ever wondered how advanced AI models can tackle truly complex problems with a depth of analysis that seems to mimic human thought? The secret lies in a groundbreaking capability known as “thinking.” This fascinating development is designed to unblock key bottlenecks on the path to greater intelligence in AI. Moving Beyond Fixed Compute Historically, powerful large language models (LLMs) were designed to respond immediately to requests. This meant they applied a constant amount of computing power at “test time”—the moment you ask a question or give a command—to generate a response. This fixed compute budget restricted how deeply the model could “think” about a problem, limiting its ability to handle extremely hard or challenging tasks. Imagine if your brain only spent a fixed millisecond on every problem, no matter its complexity! ...

July 13, 2025 · 5 min · Nitin