What happens when a team of engineers ships a million lines of production code over five months, and never types a single line by hand?
That’s not a thought experiment. In February 2026, OpenAI’s Codex team published exactly that claim. Approximately one million lines. Roughly 1,500 merged pull requests. Development velocity estimated at ten times faster than manual work. And the secret wasn’t a better model. It wasn’t a breakthrough in reasoning or context windows or chain-of-thought prompting. It was the system wrapped around the model. The constraints, the feedback loops, the verification layers, the environmental scaffolding that kept autonomous agents productive and on-mission.
They called it the harness. And that term, quietly coined in an engineering blog post, is rapidly becoming the most important concept in production AI.
Martin Fowler picked it up and described it as “the tooling and practices we can use to keep AI agents in check.” But he added something crucial. A good harness doesn’t just control agents. It makes them more capable. Not a leash. A force multiplier.
Now, the OpenAI story is about software engineering. But harness engineering isn’t a software-only concept. It applies equally to AI agents that trade securities, manage rental properties, coordinate military operations, or run customer service pipelines. Anywhere an AI agent acts autonomously in a consequential environment, the harness is what determines whether it performs or misfires.
This post is about what harness engineering actually is, why it matters more than the model you’re running, and how to think about building one. I won’t rehash prompt engineering basics or walk you through LangChain tutorials. Not because they aren’t useful, but because they’re table stakes at this point. What I want to talk about is the layer above all of that. The layer that determines whether your AI agents are a competitive weapon or an expensive liability.
Let’s get into it.
The Drone Operations Center
Before we define the term formally, I want you to picture something.
A U.S. Air Force MQ-9 Reaper drone flies a reconnaissance mission over contested airspace. The drone is autonomous. It navigates waypoints, adjusts altitude for weather, tracks targets using onboard sensors. But it doesn’t operate in a vacuum. Back at Creech Air Force Base in Nevada, an entire operations center governs its behavior.
Restricted airspace boundaries define where the drone can and cannot fly. Hard constraints it physically cannot override. A mission briefing package loaded before launch tells it what to look for, what frequencies to monitor, what rules of engagement apply. That’s context. Continuous telemetry checks confirm the drone is where it should be, doing what it should be doing. That’s verification. And when the satellite link degrades below a threshold, a return-to-base protocol activates automatically. That’s correction.
The drone is the AI agent. The operations center is the harness.
Now here’s what makes this analogy load-bearing. Nobody in the military would deploy a Reaper with no airspace boundaries, no mission briefing, no telemetry, and no contingency protocols. That would be insane. You’d lose the drone, cause an international incident, or both.
And yet, that’s exactly how most companies deploy AI agents today. They hand the model a system prompt, connect it to a dozen tools, and hope for the best.
Harness engineering is the discipline that says: stop hoping; start engineering.
What a Harness Actually Is
So what does the discipline actually prescribe? Harness engineering is the practice of designing systems that do four things.
First, constrain what an AI agent can do. Boundaries, dependency rules, tool restrictions, permission models. Deterministic walls the agent cannot talk its way past.
Second, inform the agent about what it should do. Context engineering. Dynamically curating what information appears in each model invocation based on the current task state. Not a static template. An active, task-aware information supply chain.
Third, verify the agent did it correctly. Testing, validation rules, multi-agent review. Automated checks that don’t rely on the model’s self-assessment.
Fourth, correct the agent when it goes wrong. Feedback loops that inject error information back into the agent’s context, self-repair mechanisms, and escalation protocols when automated correction fails.
Constrain. Inform. Verify. Correct. Four verbs. That’s the whole discipline.
And here’s the thing. These four functions aren’t new individually. What’s new is recognizing them as a unified engineering discipline with its own design patterns, its own failure modes, and its own body of practice. The term is weeks old. The practice is what separates teams shipping production AI from teams still running demos.
The Three Nested Layers: Prompt, Context, Harness
To understand where harness engineering sits, you need to see how we got here. Three engineering disciplines evolved sequentially over the past three years, and they nest inside each other like Russian nesting dolls.
Prompt engineering arrived first, roughly 2023 to 2024. The question it answers: what should I ask? It’s the instruction text you send to the LLM. Craft the right prompt, get a better response. Simple. Important. And, as it turns out, wildly insufficient for production systems.
Context engineering emerged mid-2025. The question it answers: what should the model see? All tokens the LLM processes at reasoning time. Not just your prompt, but the documents, tool outputs, conversation history, and metadata you feed it. Andrej Karpathy popularized the term, arguing that the real art isn’t writing prompts but curating context. He was right. And it moved the needle significantly.
Harness engineering is where we are now, early 2026. The question it answers: how should the whole environment be designed? It encompasses everything outside the model. Constraints, feedback loops, verification systems, operational infrastructure. Prompt engineering lives inside context engineering. Context engineering lives inside harness engineering. The harness is the outermost layer, the one that governs everything else.
Let me make this concrete with a non-technical example.
Picture a corporate M&A due diligence process. Prompt engineering is writing the question to the AI: “Summarize this contract’s liability clauses.” Context engineering is feeding the AI the relevant contracts, financial statements, and legal precedents it needs to reason well. Harness engineering is the entire pipeline. Which documents get pulled and from where. What the AI is forbidden from concluding, like rendering legal opinions or making valuation recommendations. How its output gets validated by a compliance checker before any human sees it. And what happens when it flags an ambiguity, like routing it to associate counsel instead of the partner.
Each layer added capability. But the harness is what makes the system production-grade.
The Core Components
Now let’s break the harness into its working parts. Five components, each doing distinct and critical work.
Context Engineering Layer
The first component is the context engineering layer, the information supply chain I mentioned earlier. It dynamically curates what appears in each model invocation. Not static templates, but active context selection based on task state, agent role, and operational phase.
Think of it this way. In Star Trek, when the Enterprise encounters an unknown vessel, the ship’s computer doesn’t dump its entire database into Captain Picard’s ready room. It pulls relevant stellar cartography, species databases, prior Starfleet contact logs for that sector, and applicable diplomatic protocols. The selection is dynamic, task-aware, and filtered for relevance. Picard gets exactly what he needs to make a decision. No more, no less.
That’s what a well-engineered context layer does. When your agent is reviewing a contract, it sees the contract, the relevant legal framework, and prior similar reviews. It doesn’t see the company’s marketing calendar or last quarter’s revenue numbers. When a warehouse management agent is deciding where to route an inbound shipment, it sees current bin occupancy, pending outbound orders, and expiration dates. It doesn’t see the company’s HR policies. The context shifts based on the task.
The engineering challenge is non-trivial. You’re building a system that reasons about what another reasoning system needs to see. Get it wrong, whether that means too much context, irrelevant context, or missing context, and the downstream agent performance degrades in ways that are hard to diagnose. Get it right, and the agent operates like a well-briefed analyst walking into a meeting with exactly the right folder.
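Here's a sketch of task-aware context selection. Tag matching stands in for whatever retrieval and ranking a real system would use, and the token budget is the hard constraint that forces curation; everything here is an illustrative assumption.

```typescript
// Task-aware context selection: pick only documents relevant to the
// current task, under a hard token budget. Tag matching is a stand-in
// for real retrieval/ranking.

interface Doc { id: string; tags: string[]; tokens: number; }

function selectContext(pool: Doc[], taskTags: string[], budget: number): Doc[] {
  const relevant = pool
    .filter((d) => d.tags.some((t) => taskTags.includes(t)))
    // Prefer documents that match more of the task's tags.
    .sort((a, b) =>
      b.tags.filter((t) => taskTags.includes(t)).length -
      a.tags.filter((t) => taskTags.includes(t)).length);
  const picked: Doc[] = [];
  let used = 0;
  for (const d of relevant) {
    if (used + d.tokens > budget) continue; // enforce the budget
    picked.push(d);
    used += d.tokens;
  }
  return picked;
}
```

A contract-review task tagged `["contract", "legal"]` pulls the contract and the legal framework, and never sees the marketing calendar.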
Architectural Constraints
The second component is architectural constraints. This is where determinism meets autonomy. These are mechanistic enforcement mechanisms that physically prevent the agent from violating design rules. Not by asking nicely. Not by including “please don’t do this” in the system prompt. By making violation impossible at the infrastructure level.
Isaac Asimov imagined this in 1942 with the Three Laws of Robotics. Hard-coded behavioral constraints that override a robot’s autonomous decision-making regardless of its reasoning. The robot physically cannot harm a human, even if its logic suggests it should. The constraint isn’t a suggestion. It’s architectural.
In practice, this looks like a financial trading AI agent that cannot execute trades above a dollar threshold or in blacklisted securities, enforced by the harness infrastructure, not by the prompt. When the agent tries to exceed the threshold, the harness blocks the action and injects a correction message directly into the agent’s context: “Transaction rejected. Policy: maximum single trade $50,000. Replan your approach.”
Or consider a medical triage agent that can prioritize cases and suggest next steps but is architecturally forbidden from modifying patient records or issuing prescriptions. The harness doesn’t ask the model to refrain. The write-access simply doesn’t exist in the agent’s environment.
The common thread across both examples is important. In every case, the constraint doesn’t just block. It teaches. The error message becomes context for the agent’s next reasoning step. Good constraints create a feedback loop. Bad constraints create a brick wall.
When I designed the Rishon AI Developer Agent, one of three harnesses we operate, I built architectural constraints directly into the agent’s execution environment. The agent generates production code, but it can only modify files within its assigned module scope, can only call approved APIs, and must pass all structural validation before any change is accepted. The agent doesn’t know about these boundaries in some abstract sense. It discovers them through interaction, the same way a new employee discovers that the compliance system will reject their expense report if they violate the spending policy. The difference is the agent hits those boundaries fifty times an hour. And it learns every time.
Entropy Management
The third component is entropy management. And here’s a problem nobody talks about until they hit it. When AI agents generate artifacts at scale, whether that’s code, documents, operational procedures, or design specifications, they inevitably replicate poor patterns. They propagate inconsistencies across outputs. They accumulate drift at a rate that would make a junior team member blush. Except the junior team member produces a few deliverables a day. The agent produces dozens, or hundreds.
OpenAI coined a vivid term for this: AI entropy. And their solution was equally vivid. They deployed background agents, separate from the producing agents, that continuously scanned for divergence from established standards and automatically generated correction proposals.
Now, calling this “entropy management” may seem ambitious. In physics, the second law of thermodynamics tells us that entropy in a closed system always increases. You don’t manage it. You surrender to it, gracefully. But here’s the thing: a well-harnessed AI system isn’t a closed system. You’re continuously injecting energy in the form of rules, standards, and sweep agents. So maybe we’re not violating thermodynamics. We’re just running a very aggressive air conditioner.
In The Matrix, the Agents serve a similar function. Smith, Brown, and Jones aren’t the architects of the system. They’re the cleanup crew, hunting down anomalies like Neo that violate the system’s structural rules, correcting drift, maintaining coherence. The system generates entropy through the autonomous behavior of its inhabitants. The Agents push back against it.
Your harness needs the same thing. Periodic sweep agents that audit generated outputs for consistency, flag divergence, and either auto-correct or escalate. Without this, AI-generated artifacts degrade faster than human-produced ones, because the generation rate is so much higher that drift accumulates before anyone notices.
One important caveat. Entropy management is primarily relevant when your agents produce artifacts in volume. If your agent handles one customer inquiry at a time with no persistent output, this component matters less. But if your agents are generating documents, building configurations, producing reports, or writing code at scale, entropy management becomes load-bearing infrastructure. It feels optional at output number one hundred. By output number five hundred, you can’t live without it.
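The sweep-agent idea itself fits in a few lines. The rules below are toy placeholders, not anyone's actual standards; the structural point is that the sweep emits correction proposals rather than silently editing artifacts in place.

```typescript
// Entropy sweep: audit generated artifacts against house standards and
// emit correction proposals. The rules here are toy placeholders.

interface Artifact { id: string; body: string; }
interface Proposal { artifactId: string; rule: string; detail: string; }

type Rule = { name: string; violates: (body: string) => string | null };

const rules: Rule[] = [
  {
    name: "no-todo-markers",
    violates: (b) => (b.includes("TODO") ? "unresolved TODO marker" : null),
  },
  {
    name: "max-length",
    violates: (b) => (b.length > 500 ? `body is ${b.length} chars, limit 500` : null),
  },
];

function sweep(artifacts: Artifact[]): Proposal[] {
  const proposals: Proposal[] = [];
  for (const a of artifacts) {
    for (const r of rules) {
      const detail = r.violates(a.body);
      if (detail) proposals.push({ artifactId: a.id, rule: r.name, detail });
    }
  }
  return proposals;
}
```

Run it on a schedule, route the proposals back through the normal review pipeline, and you have a very small air conditioner.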
Verification and Feedback Loops
The fourth component is verification and feedback loops. The self-correcting loop is the heartbeat of a mature harness. The agent produces an output. Automated validation runs. If something fails, the error output gets injected back into the agent’s context. The agent revises. Validation runs again. This cycle repeats until all checks pass, or until a maximum iteration count triggers escalation to a human.
What “validation” means depends entirely on domain. In software, it’s tests and structural checks. In legal document review, it’s compliance rules and precedent matching. In financial analysis, it’s regulatory constraints and arithmetic verification. In logistics, it’s feasibility checks against real-world capacity. The mechanism is universal. The rules are domain-specific.
OpenAI calls their version of this the “Ralph Wiggum Loop.” For those who haven’t watched The Simpsons, Ralph Wiggum is the lovably oblivious kid in Lisa Simpson’s class, famous for cheerful non sequiturs and blissful persistence in the face of failure. The name captures something real about how these loops work: the agent doesn’t get frustrated, doesn’t take feedback personally, and doesn’t give up. It just keeps trying, incorporating each correction, until it gets it right. Or until someone intervenes.
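The loop itself is simple enough to sketch. Here `produce` and `validate` are stand-ins for a model call and a domain-specific check, both hypothetical; the real content of any such loop is what those two functions do.

```typescript
// The verify-correct loop: produce, validate, feed errors back as
// context, repeat, escalate after a bounded number of attempts.
// `produce` and `validate` stand in for a model call and a domain check.

type LoopResult =
  | { status: "passed"; output: string; attempts: number }
  | { status: "escalated"; attempts: number };

function correctionLoop(
  produce: (feedback: string | null) => string,
  validate: (output: string) => string | null, // null = pass, string = error
  maxAttempts = 3,
): LoopResult {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const output = produce(feedback); // the error becomes context
    feedback = validate(output);
    if (feedback === null) return { status: "passed", output, attempts: attempt };
  }
  return { status: "escalated", attempts: maxAttempts }; // a human takes over
}
```

The bounded attempt count is not optional. Without it, a blissfully persistent agent will cheerfully burn tokens forever.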
But verification doesn’t stop at automated checks. The most sophisticated harnesses include multi-agent review, where one agent’s output is reviewed by a separate agent with different instructions, different priorities, and sometimes a different underlying model. Think of a consulting firm’s AI drafting client deliverables, then routing them through a quality-check agent, a brand-compliance agent, and a factual-accuracy agent before any human sees the output. Consensus builds confidence. Disagreement triggers investigation.
There’s a cautionary tale here too. In 2001: A Space Odyssey, HAL 9000 runs self-diagnostics and concludes, incorrectly, that it’s functioning perfectly, even as its behavior becomes increasingly erratic and dangerous. HAL’s harness failed because the verification loop was self-referential. The system checking its own work was the same system doing the work. That’s not verification. That’s an echo chamber.
The lesson: verification agents must be independent of the agents they verify. Different instructions. Different context. Ideally, different models. Cross-checking isn’t a feature. It’s an architectural requirement.
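The review-board pattern looks something like this. The reviewer roles and their rules are illustrative; what matters structurally is that each reviewer applies its own checks, and anything short of consensus routes to investigation rather than approval.

```typescript
// Independent multi-agent review: each reviewer applies its own rules.
// Consensus approves; any disagreement routes to investigation.

type Review = { reviewer: string; approved: boolean };
type Decision = { verdict: "approved" | "investigate"; reviews: Review[] };

type Reviewer = (output: string) => Review;

function reviewBoard(output: string, reviewers: Reviewer[]): Decision {
  const reviews = reviewers.map((r) => r(output));
  const unanimous = reviews.every((r) => r.approved);
  return { verdict: unanimous ? "approved" : "investigate", reviews };
}

// Toy reviewers with deliberately different priorities.
const compliance: Reviewer = (o) => ({
  reviewer: "compliance",
  approved: !o.toLowerCase().includes("guarantee"), // no promised outcomes
});
const accuracy: Reviewer = (o) => ({
  reviewer: "accuracy",
  approved: !o.includes("[citation needed]"),
});
```

In production, the reviewers would be separate agent invocations with separate context, ideally on separate models. The shape of the decision logic stays the same.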
Security
The fifth component is security. I wrote at length about this in my previous post, “One Sentence Can Hijack Your AI. Here’s How to Stop It.” Rather than repeat all of that here, let me give you the digest.
The core insight: a pure LLM is essentially harmless. It has no hands. The danger lives in the harness, the software that invokes tools, reads databases, sends emails, and touches production systems. That’s where nearly all the risk sits.
Three attack vectors matter most. Direct prompt injection, where crafted inputs override system instructions. Think of it as handing forged orders to a field agent. Indirect prompt injection, where malicious instructions hide inside documents, emails, or web pages that the AI consumes. This is what the KGB called “active measures,” planting disinformation in sources you know the target reads. And agent-to-agent propagation, where one compromised AI agent infects others in a trust chain, like the Cambridge Five, where one turned double agent could poison an entire spy ring.
Karpowicz’s Impossibility Theorem proved that an LLM cannot be both fully truthful and fully resistant to manipulation at the same time. Some degree of adversarial exploitation is mathematically guaranteed under hostile conditions.
The fix must be architectural, not behavioral. Six techniques form the foundation. Compartmentalization, modeled on the Manhattan Project’s need-to-know isolation. Source verification, inspired by the multi-precog consensus in Philip K. Dick’s Minority Report. DMZ architecture, which creates isolated buffer zones for untrusted inputs. Human-in-the-loop approval gates for high-stakes actions. Full observability and audit trails. And rate limiting with anomaly detection.
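One of those six, the human-in-the-loop approval gate, is easy to sketch. The risk scores, threshold, and action names below are illustrative placeholders; the architectural point is that high-stakes actions queue for a human instead of executing, and the model has no code path around that.

```typescript
// Human-in-the-loop gate: low-risk actions execute, high-risk actions
// queue for human approval. Risk scoring here is a toy placeholder.

interface Action { name: string; riskScore: number } // 0..1

interface GateOutcome { executed: string[]; pendingApproval: string[] }

function approvalGate(actions: Action[], threshold = 0.5): GateOutcome {
  const outcome: GateOutcome = { executed: [], pendingApproval: [] };
  for (const a of actions) {
    if (a.riskScore >= threshold) outcome.pendingApproval.push(a.name);
    else outcome.executed.push(a.name);
  }
  return outcome;
}
```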
If you’re building a harness, security isn’t a feature you add later. It’s a design constraint you bake in from day one. For the full treatment, read the previous post. What matters here is understanding that security is a first-class component of harness engineering, not an afterthought bolted on at deployment.
The Evidence Is In
The results from early harness engineering adopters aren’t incremental improvements. They’re dramatic leaps.
OpenAI built one million lines of production code with zero hand-written source. Roughly 1,500 merged PRs. Estimated ten times faster than manual development. The model didn’t change. The harness made the difference.
Vercel took a different approach, and the results were counterintuitive. They removed 80% of their agent’s tools. Went from 80% accuracy to 100%. Ran 3.5 times faster. Less capability, more constraint, dramatically better outcomes. If that doesn’t make you rethink your “give the agent access to everything” architecture, nothing will.
LangChain jumped from Top 30 to Top 5 on a major coding benchmark by changing only the harness. Not the model, not the prompt, not the training data. Same brain, different operating environment. Massive performance gain.
And Manus, the autonomous agent startup, rebuilt their entire framework five times in six months. Their biggest insight? The largest gains came from removing things. Simplifying the harness. Cutting features that looked useful but created unpredictable behavior.
There’s a pattern here, and it’s the surprising inversion of this entire field. The most powerful harnesses aren’t the ones that give agents the most capability. They’re the ones that impose the tightest, smartest constraints. It’s the drone operations center again. The drone doesn’t fly better when you remove the airspace boundaries. It crashes.
Three Harnesses, Three Purposes
At Rishon, we don’t run one harness. We run three, each with an entirely different purpose, each designed for a different phase of the product lifecycle.
The first is the Rishon AI Product Agent. This harness performs product development. It follows a structured creative process based on the Disney Method. If you’re not familiar with it, I wrote about it on my blog. Look it up. It’s worth understanding.
In short, the agent researches subject domains, analyzes existing solutions, trawls forums for ideas and pain points, and formulates a complete product specification focused on business automation. Its constraints are about scope and feasibility. Its context layer dynamically pulls market data, competitor analysis, and user feedback. Its verification loop checks specs against technical feasibility and business viability before any engineering begins.
The second is the Rishon AI Developer Agent. This harness takes a specification text, regardless of its source, and turns it into a working program in the Rishon language. If the specification has clarity gaps, it asks pointed questions before proceeding.
What makes it different from most AI coding tools is the reasoning process. The Rishon Developer Agent implements a proprietary multi-phase approach that draws on proven Domain-Driven Design principles, natively supported by the Rishon Compiler. Most AI coding tools treat development as a single-pass activity: give the model a task, let it generate code, fix what breaks. Ours works differently.
First, the agent considers the core concepts and data entities. What are the fundamental things in this system, and how do they relate to each other? A property. A tenant. A lease. A maintenance request. The agent maps these relationships before writing a single line of implementation.
Next, it fleshes out specific attributes, one entity at a time. What fields does a lease need? What states can a maintenance request be in? This is deliberate and sequential, not a bulk generation pass.
Then the agent shifts to user-facing functionality. It thinks in terms of user activities and the data required to support those activities. Critically, it considers how users in different roles need different levels of data access. A property manager sees everything. A tenant sees only their own property and requests.
After that, the same treatment for AI automations. What work can agents do on behalf of users, and what data and tools do those automations require?
Then comes role-based security, ensuring that access boundaries are enforced architecturally.
And finally, translations into spoken languages. Not programming languages. Human ones. English, Spanish, Italian, whatever the application requires.
The key property of this phased design: every phase is validated for consistency and completeness before the agent moves on. The agent never has to backtrack to an earlier phase, which means fewer errors, cleaner design, and dramatically fewer wasted cycles.
There’s another distinctive design choice worth mentioning. The Rishon Compiler is given to the AI as a tool, and its error diagnostics are specifically designed to simplify the agent’s job. When a traditional compiler says “type mismatch on line 47,” a human knows to look at line 47 and figure it out. But an AI model, faced with an ambiguous error, will often hallucinate a fix. The Rishon Compiler’s diagnostics are explicit and prescriptive. They tell the agent exactly what went wrong, what the valid options are, and what structural rule was violated. The agent doesn’t need to guess. It reads the diagnosis and acts on it. That’s a harness design decision that dramatically reduces the correction cycles in our Ralph Wiggum Loop.
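To illustrate the difference, here's a hypothetical diagnostic shape, emphatically not the actual Rishon format: instead of “type mismatch on line 47,” the compiler names the rule, the thing it found, the valid options, and the action to take.

```typescript
// Hypothetical prescriptive diagnostic: name the violated rule, show
// what was found, list the valid options, prescribe the fix. This is
// an illustration of the idea, not the actual Rishon format.

interface Diagnostic {
  rule: string;
  found: string;
  expected: string[];
  instruction: string;
}

function renderForAgent(d: Diagnostic): string {
  return [
    `Violation of rule: ${d.rule}.`,
    `Found: ${d.found}. Valid options: ${d.expected.join(", ")}.`,
    `Action: ${d.instruction}`,
  ].join(" ");
}
```

A diagnostic like this leaves the agent nothing to guess about, which is exactly what cuts the correction cycles.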
The third harness is the most interesting one, because it’s the one our customers’ applications use for live business automation. And it’s where the rubber meets the road.
Here’s a concrete example. A tenant in a rental property managed by our platform reports a problem. Say the kitchen outlets stopped working. The agent doesn’t just log a ticket. It performs an analysis of the lease agreement to determine responsibility, tenant versus landlord, considering both legal obligations and common sense factors like habitability. It assesses urgency. Does this affect the livability of the home? Is there a safety concern?
Then it does something remarkably human. It asks the tenant to try a few things first. “Have you checked the circuit breaker panel? Flip the breaker labeled ‘Kitchen’ off and back on.” If the tenant reports that resolved it, done. Issue closed. No vendor, no cost, no landlord involvement.
If not, the agent goes online searching for local licensed electricians. It prioritizes the list by ratings, availability, and proximity. Then it starts calling them, checking availability, getting quotes, comparing pricing. And it makes a scheduling decision, but within constraints set by the landlord. Automatic approval for repairs under a certain dollar threshold. Escalation to the landlord for anything above it.
Constrain. Inform. Verify. Correct. All four harness functions, working together in a real-world business process that involves legal analysis, human interaction, web research, vendor communication, and financial decision-making. That’s harness engineering in production. Not a demo. Not a benchmark. A system that handles real problems for real people.
Frameworks and Tools
Now, let’s talk about what you can actually use today. Several open-source frameworks already provide harness-like capabilities, even though most predate the term itself.
A note on scope. My research here was restricted to the Node.js and TypeScript ecosystem, the platform of choice for my current projects. If you’re working in Python, Go, or another runtime, you’ll need to do your own survey.
Mastra comes from the team behind Gatsby.js, backed by Y Combinator. Typed tools via Zod, a graph-based workflow engine with branching and parallelism, a local development studio, built-in evals, and integrations with over forty model providers. It reached version 1.10 in March 2026 and is the most fully featured TypeScript-native option in the ecosystem right now.
OpenAI Agents SDK provides an agent loop, guardrails, inter-agent handoffs, tracing, and output schema enforcement. Provider-agnostic despite the name. The TypeScript version includes sessions, human-in-the-loop, and realtime voice agents. Best for multi-agent orchestration, especially if you’re already in the OpenAI ecosystem.
LangGraph.js offers graph-based state machines with time-travel debugging, human-in-the-loop approvals, conditional routing, and persistent state for long-running workflows. It hit 1.0 stable and is trusted by Klarna, Replit, and Elastic. Best for mission-critical workflows where you need deterministic control over every step.
DeepAgents builds on LangGraph, adding hierarchical subagent delegation, planning tools, a filesystem backend for context management, and long-term memory. TypeScript-first. Designed for deep research agents that need to break complex tasks into subtasks and delegate them.
Strands Agents SDK takes a model-driven, provider-agnostic approach, supporting Amazon Bedrock and OpenAI out of the box with native Model Context Protocol support. Lightweight, works in both Node.js and browser environments, with multi-agent orchestration through directed graphs. Just updated in March 2026.
Two more worth watching. VoltAgent offers a TypeScript framework plus a cloud-hosted observability console for production monitoring and evals, with over 5,000 GitHub stars. And KaibanJS takes a Kanban-inspired approach to multi-agent systems, with a visual board interface and Redux-style state management that’s particularly interesting for teams coming from a frontend background.
I haven’t had hands-on experience with any of these yet. I’m just starting to evaluate them for upcoming projects. If you’re interested in what I find, drop a comment. I’ll share the results of that investigation as it progresses.
The Bookshelf, While We Wait
No book titled “Harness Engineering” exists yet. The term is only weeks old. But the underlying concepts have been building for years, and several recent books cover the territory well.
The three I’d start with.
Generative AI Design Patterns by Valliappa Lakshmanan and Hannes Hapke, published in 2025. It lays out thirty-two design patterns including tool calling, multi-agent orchestration, guardrails, and reliability engineering. It’s the closest thing to a harness engineering reference that predates the term. If you’re designing agent infrastructure, this is your patterns catalog.
Designing Multi-Agent Systems by Victor Dibia, also 2025. It takes a framework-agnostic approach to orchestration, evaluation, and agent architecture. It’s particularly strong on the verification and feedback loop components, specifically how to evaluate whether your agents are actually doing what you think they’re doing. Essential reading for anyone running multiple agents in production.
AI Engineering by Chip Huyen covers production-focused AI systems design with the rigor of someone who’s shipped these systems at scale. It’s less focused on agents specifically, but the infrastructure thinking around monitoring, evaluation, and deployment patterns maps directly onto harness engineering’s operational layer.
Given the February 2026 coining of the term, expect dedicated harness engineering books to start appearing late 2026 into 2027. Watch for titles from O’Reilly, Manning, and Packt on topics like AI agent infrastructure, agent harness design, or production AI agent engineering. The field is moving fast enough that whatever ships first will define the vocabulary for the next generation of practitioners.
Where This Goes
Harness engineering is where AI development grows up. The model is the brain. The harness is everything else. The nervous system, the skeletal structure, the immune system, the operational doctrine. And as models commoditize, and they will, the harness becomes the primary source of competitive differentiation.
The teams that figure this out first won’t just build better AI products. They’ll build AI products that their competitors literally cannot replicate by switching to a newer model, because the value isn’t in the model. It’s in the system around it.
One million lines of code. Zero keystrokes. That was OpenAI’s proof of concept. The question for you is: what could your team build if the harness was right?
Cheers!
References
Industry and Technical Sources
- OpenAI Codex Team, How We Built 1M Lines of Code with AI Agents (February 2026)
- Martin Fowler, AI Harness Engineering (2026)
- Andrej Karpathy on context engineering (2025)
- Vercel, Simplifying Our Agent Architecture
- LangChain engineering blog
- Manus, What We Learned Rebuilding Our Framework Five Times (2025–2026)
- Karpowicz, Impossibility Theorem (2025)
- IBM X-Force enterprise AI security findings
- Anthropic, Claude Sonnet 4.6 prompt injection resistance benchmarks
- Simon Willison on the “lethal trifecta”
- The Cambridge Five
Frameworks and Tools (Node.js / TypeScript)
- Mastra
- OpenAI Agents SDK
- LangGraph.js
- DeepAgents
- Strands Agents SDK
- VoltAgent
- KaibanJS
Books
- Valliappa Lakshmanan & Hannes Hapke, Generative AI Design Patterns (2025)
- Victor Dibia, Designing Multi-Agent Systems (2025)
- Chip Huyen, AI Engineering
Cultural and Historical References
- Isaac Asimov, I, Robot (1950)
- The Wachowskis, The Matrix (1999)
- Stanley Kubrick, 2001: A Space Odyssey (1968)
- Gene Roddenberry, Star Trek
- Matt Groening, The Simpsons
- U.S. Air Force MQ-9 Reaper drone operations
- The Manhattan Project
- Philip K. Dick, The Minority Report (1956)