Fix Context, Fix AI: Context Engineering 101
Close the Great AI Expectation Gap by treating context as code, not an afterthought.
As someone who has been building AI products for a few years, I get this question a lot: "Isn't this all just snake oil?" Friends see viral demos, try a public tool, and get a nonsensical answer. Their experience doesn't match the promise, creating what I call the "Great AI Expectation Gap"—the chasm between mind-bending demos and applications that deliver inconsistent results.
That's why the recent buzz around "Context Engineering," a term coined by Shopify CEO Tobi Lütke and championed by Andrej Karpathy, has struck such a chord. It reframes our frustration not as a model failure, but as an engineering challenge we can solve by moving from tinkering with a black box to engineering a transparent, controllable system.
Why Your Old Mental Model Fails
Much of our frustration stems from a fundamental misunderstanding. Traditional software is deterministic; an API call with the same payload returns the same response. LLMs are probabilistic, predicting the next most likely word based on the input they're given.
The core idea is simple: change the input, and you change the probable output. Since you can't change the model's probabilistic nature, you must engineer the input it receives. This is the disciplined work of Context Engineering.
Consider these two prompts:
Prompt 1: "I'm writing a fairytale. Once upon a time, in a land filled with dragons and..."
Likely completion: "...wizards, there lived a brave knight."
Prompt 2: "I'm writing a sci-fi epic. Once upon a time, in a land filled with dragons and..."
Likely completion: "...starships, there lived a lonely cyborg."
The model isn’t being clever—it’s just playing the odds based on the context provided. Your job isn't to hope the model gets it right. It's to engineer the context so your desired outcome becomes the most probable one.
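To make "playing the odds" concrete, here is a toy word-frequency model. This is emphatically not how an LLM works internally (real models use learned neural representations, not lookup tables), but it illustrates the core idea: the same continuation function produces different "most probable" outputs when the context changes. The corpus and function names are invented for illustration.

```python
from collections import Counter

# Toy "training data": a few genre-tagged word sequences.
corpus = [
    "fairytale dragons wizards knight",
    "fairytale dragons wizards princess",
    "sci-fi dragons starships cyborg",
]

def most_probable_next(genre: str, word: str):
    """Return the word that most often follows `word` in
    sequences whose first token matches `genre`."""
    counts = Counter()
    for line in corpus:
        tokens = line.split()
        if tokens[0] != genre:
            continue  # context filter: only count matching sequences
        for a, b in zip(tokens, tokens[1:]):
            if a == word:
                counts[b] += 1
    return counts.most_common(1)[0][0] if counts else None

most_probable_next("fairytale", "dragons")  # "wizards"
most_probable_next("sci-fi", "dragons")     # "starships"
```

Same word, same function, different context, different output: that is the lever Context Engineering pulls.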
The Solution: Master Context Engineering
Context Engineering is the practice of designing, managing, and orchestrating the entire information flow to an LLM. It's about building a system that delivers a multi-layered context stack to the model at runtime. This isn't a new role, but a new skill that software engineers need to develop.
This stack has three distinct layers:
The Knowledge Context: This is the raw, factual information the model needs—the "what." This includes documents retrieved via Retrieval-Augmented Generation (RAG), real-time data from APIs, or user data.
The Instructional Context: This layer specifies the user's intent and business logic—the "how." It's your system prompt, the user's query, constraints ("answer in JSON"), and few-shot examples.
The Delivery Context: This is the technical infrastructure that assembles and delivers the other two layers. It includes your RAG pipeline, API error handlers, and context window management logic.
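The three layers can be sketched in a few lines of code. This is a minimal, framework-free illustration, not a production implementation; the function names and the character-based budget (a stand-in for real token counting) are my own assumptions.

```python
MAX_CONTEXT_CHARS = 2000  # stand-in for a real token budget

def build_context(knowledge_docs: list[str], system_prompt: str,
                  user_query: str) -> str:
    """Delivery Context: assemble the other two layers into one
    prompt, trimming the knowledge layer to fit the window."""
    # Instructional Context: the "how" (intent, constraints)
    instructions = (
        f"{system_prompt}\n\nAnswer in JSON.\n\nQuestion: {user_query}"
    )

    # Knowledge Context: the "what", packed into the remaining budget
    budget = MAX_CONTEXT_CHARS - len(instructions)
    knowledge = ""
    for doc in knowledge_docs:
        if len(knowledge) + len(doc) > budget:
            break  # window management: drop what doesn't fit
        knowledge += doc + "\n"

    return f"Context:\n{knowledge}\n{instructions}"
```

Even at this scale, the engineering decisions are visible: which documents get in, in what order, and what is silently dropped when the budget runs out.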
Here’s a real-world example of a context failure. A support bot is asked, "What's your refund policy for enterprise customers?" The RAG pipeline (Delivery Context) retrieves an outdated policy from 2023 (Knowledge Context). The LLM then uses this outdated information to generate a perfectly worded but confidently incorrect answer. The model didn't fail; the context did.
Context Engineering vs. Fine-Tuning
As an engineer, you must choose the right tool for the job. The two most common methods for customizing models are Context Engineering (using RAG) and fine-tuning. They aren't competitors; they solve different problems.
Think of it this way: RAG provides knowledge, while fine-tuning teaches a skill. You use RAG when facts matter and knowledge changes frequently, like answering questions about a real-time inventory database. It's also cheaper to update—you just change the data source, not the model.
Fine-tuning is best for teaching a model a new behavior, like adopting a pirate's voice or responding in a specific XML schema. The skill becomes baked in, often leading to lower costs per call because prompts can be shorter. However, fine-tuning is expensive to update and can become stale.
Often, the best solution is a hybrid. An insurance company might fine-tune a model to master its legal jargon (a skill). Then, it could use RAG to feed it a specific customer's policy documents at runtime (the knowledge).
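The "RAG provides knowledge" half of that hybrid can be sketched with a naive retriever. A production pipeline would use embedding similarity search; this keyword-overlap version is a simplified stand-in, and the documents are invented.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Score each document by word overlap with the query and
    return the top-k matches to inject into the prompt."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "Enterprise refund policy (2025): refunds within 60 days.",
    "Shipping policy: orders ship within 2 business days.",
]
retrieve("refund policy for enterprise customers", docs)
```

When RAG returns the wrong document, this is the function (or its embedding-based equivalent) you debug, because the model downstream will faithfully use whatever knowledge it is handed.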
Are Tools Like LangChain Enough?
Frameworks like LangChain and LlamaIndex are fantastic productivity enhancers. They provide powerful abstractions that help you build the layers of your context stack. However, a framework is like a high-end power tool; you still need to be a skilled carpenter to build a sturdy house.
When your application fails, you need to debug the system, not just the prompt. To fix a RAG pipeline that returns irrelevant documents, you must understand the underlying principles of Context Engineering. The framework is the how, but you, the engineer, must understand the why and the what.
Why Smarter Models Still Require Context Engineering
It’s tempting to believe that a future model, like GPT-5, will make this practice obsolete. This is a fallacy. Context Engineering is a durable skill for three key reasons:
The Knowledge Access Problem: No model will ever be pre-trained on your company's private data or a user's real-time information. You will always need to provide this knowledge.
The Specificity Problem: A general intelligence must still be given specific instructions and constraints to perform a business task. You will always need to provide the Instructional Context.
The Control Problem: Businesses require deterministic guardrails. The Delivery Context is your mechanism for imposing that control and ensuring the final output is reliable.
These problems aren't going away. Mastering the systems that solve them is a skill that will last your entire career.
When It Is the Model
While context is your biggest lever, we must be honest: base models have genuine limitations, and your job as an engineer is to know them and design guardrails to mitigate them. One example is the fact-conflicting hallucination, where a model's pre-trained knowledge contradicts the facts you provide.
Models also have architectural limits and a fixed computational budget per token. This can limit their ability to perform deep, multi-step reasoning on highly complex problems. Finally, context is a new attack surface for "prompt injection," so designing for these edge cases is part of the job.
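As one example of such a guardrail, here is a naive injection filter that scans text before it enters the context. Real defenses are layered (input filtering, output validation, privilege separation), and this pattern list is illustrative only; it would not stop a determined attacker.

```python
import re

# Illustrative patterns for common injection phrasings; a real
# system would use a much broader, continuously updated defense.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"disregard .* system prompt",
    r"you are now",
]

def is_suspicious(text: str) -> bool:
    """Flag text that looks like an attempted prompt injection
    before it is added to the model's context."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

The point is architectural: because context is an input channel, it must be treated with the same suspicion as any other untrusted input in your system.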
Measure What Matters
How do you know if your changes to the context stack are actually working? In a deterministic system, we have unit tests. In a probabilistic one, we need a new approach to validation.
That approach is eval-driven development: building robust evaluation frameworks to measure quality, consistency, and accuracy over time. This means creating a representative set of test cases ("evals") and running them automatically whenever you change any part of your context stack, so you move from guessing to measuring.
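A minimal eval harness might look like the sketch below. The `answer` function is a stub standing in for your real pipeline (context assembly plus model call); the eval structure, a question paired with a programmatic check, is the part that carries over to real systems.

```python
def answer(question: str) -> str:
    """Stub pipeline: in production this would build the context
    stack and call the model."""
    if "refund" in question.lower():
        return "Refunds are available within 60 days."
    return "I don't know."

# Representative test cases ("evals"): a question plus a check
# function that grades the pipeline's output.
EVALS = [
    ("What is the refund window?", lambda out: "60 days" in out),
    ("Do you sell rockets?", lambda out: "don't know" in out),
]

def run_evals() -> float:
    """Run every eval and return the pass rate."""
    passed = sum(check(answer(q)) for q, check in EVALS)
    return passed / len(EVALS)
```

Run this in CI on every change to your prompts, retrieval logic, or data sources, and a regression shows up as a dropped pass rate instead of an angry user.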
From Prompt Tinkering to Context Engineering
Forget prompt tricks. The lasting lesson from the AI hype is that value comes from engineering, not tinkering. State-of-the-art models are now so powerful they've become a commodity, not a competitive edge. The factor that separates a frustrating product from a reliable one is the quality of the context you provide. This deliberate work of managing information is the discipline of Context Engineering.
By methodically engineering the layers of context and adopting an eval-driven development process, you can measure what matters and prove your changes lead to more reliable applications. Your next step isn't to find a magic prompt or to wait for the next model release. It's to embrace Context Engineering as a core part of your software development skillset. The power to build production-ready AI that your skeptical friends—and your users—can rely on is already in your hands.