Armature
2026-06-08
By Bryan Sparks

Why I built Armature (and why your multi-agent system will fail in production)

I know where the failure is going to happen, because I've been there.

You built an agentic workflow. It worked in development. Maybe it even worked in the first few weeks of production. Then, slowly, it started returning garbage. A stage that reliably produced valid JSON started returning text. A retry loop that was supposed to catch this started looping past its limit. You found out when a human noticed the output looked wrong — not because any alarm fired, not because a log filled up, but because someone was paying enough attention to notice.

You were flying blind. There was no IHR (the composite health score described later in this post), no trace store, no signal to watch. Just a Python loop calling an LLM, and silence when it stopped working.

This is the failure I built Armature to prevent.

The wall I kept hitting

I spent two years building multi-agent systems with the frameworks that existed — LangChain, CrewAI, LangGraph. I learned a lot from each of them, and I'm not dismissing them. But every project I built had the same structural problem: the orchestration logic lived in Python code that I had to rewrite every time, the quality measurement didn't exist unless I built it, and the system had no way to know when it was degrading.

The most telling sign was when I needed a second version of a workflow I'd already built. I copied the Python file. Changed the prompts. Struggled with the places where I'd hardcoded assumptions about the prior version's output shape. Then I needed a third version. Now I had three copies of substantially the same retry logic, context passing, and error handling — each slightly different, each accumulating its own bugs.

You're copying boilerplate instead of describing the problem. That's always the sign that something is structurally wrong.

The insight: declarative + self-improving

Two things had to be true at once for a harness to actually work in production.

First, the workflow specification had to be declarative — text that describes what the workflow does, not how to execute it. When the spec is code, only engineers can read or modify it. When the spec is YAML, your domain experts can engage with it, an optimizer can propose changes as a clean diff, and you can version-control the logic without it being tangled up with implementation details.

A Tsinghua research team published results confirming what I'd suspected from practice: YAML-defined harnesses outperform equivalent Python-coded harnesses 47.2% vs. 30.4% on complex task benchmarks. The key finding was that when the specification is readable text, the entire system — including an optimizer — can reason about it. You can't feed Python orchestration code to a model and ask it to improve your workflow. You can feed YAML.

Second, the workflow had to be self-improving. Not "you can tune it manually." Self-improving: runs produce traces, traces produce diagnostics, diagnostics drive targeted spec rewrites, rewrites get applied automatically. Every run should make the next run slightly better.

Stanford published a paper showing that a frontier model given access to full execution traces (not just pass/fail scores) could propose harness improvements with 57% accuracy — versus 41% with just scores. The model can reason causally about why a run failed when it has the trace. "The output_valid_rate on the analyst stage dropped to 0.4 in the last 5 runs; here is a more constrained output_schema that should fix it" is a different — and more useful — kind of improvement proposal than "the workflow scored 0.71; try again."

How Armature works

You write a YAML spec. The spec declares model tiers (named capability slots, not hardcoded model names), a stage DAG with depends_on relationships, and whatever safety rules you need.

name: risk-assessment
version: "1.0"
mission: >
  Assess contract documents for legal and financial risk.
  Be conservative — flag ambiguous clauses as medium risk, not low.

model_tiers:
  small: {provider: anthropic, model: claude-haiku-4-5-20251001}
  large: {provider: anthropic, model: claude-sonnet-4-6}

role_type_defaults:
  researcher: large
  judge: large

stages:
  - id: extractor
    role:
      type: researcher
      description: "Extract key clauses from: {{ document }}"
    output_mode: guided_json
    output_schema:
      type: object
      required: [clauses, parties]
      properties:
        clauses: {type: array, items: {type: string}}
        parties: {type: array, items: {type: string}}
    depends_on: []

  - id: assessor
    role:
      type: judge
      description: |
        Assess risk for each clause: {{ extractor.clauses }}
        Parties: {{ extractor.parties }}
    output_mode: guided_json
    output_schema:
      type: object
      required: [risk_level, flagged_clauses, recommendation]
      properties:
        risk_level: {type: string, enum: [low, medium, high, critical]}
        flagged_clauses: {type: array, items: {type: string}}
        recommendation: {type: string}
    depends_on: [extractor]

The engine resolves the execution order from depends_on, runs stages in parallel when their dependencies are met, handles guided_json validation failures by escalating to the next model tier, and records every stage execution to SQLite.
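To make the ordering step concrete, here is a minimal sketch of how an execution order with parallel batches can be derived from depends_on using only Python's standard library. It is not Armature's actual engine code: the stages dict is a hypothetical, trimmed-down view of the spec above, and execution_batches is a name invented for this illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical, trimmed-down view of the spec: stage id -> its depends_on list.
stages = {
    "extractor": [],
    "assessor": ["extractor"],
}


def execution_batches(depends_on):
    """Group stage ids into batches whose dependencies are all complete.

    Stages inside one batch do not depend on each other, so an engine
    could run them in parallel.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # get_ready() returns every stage whose dependencies are marked done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        # A real engine would execute the batch here before marking it done.
        sorter.done(*ready)
    return batches


print(execution_batches(stages))  # [['extractor'], ['assessor']]
```

With a diamond-shaped graph (one stage feeding two independent stages that both feed a fourth), the two middle stages land in the same batch, which is where parallelism would come from.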

After the run, the command armature dashboard risk-assessment.yml shows a four-panel health view: IHR trend, success rate, output validity rate, and latency percentiles. The command armature improve risk-assessment.yml runs the self-improvement cycle.

The self-improvement loop (IHR)

IHR — Implicit Harness Rating — is a single composite score:

IHR = 0.35 × output_valid_rate
    + 0.25 × success_rate
    + 0.20 × avg_quorum_score
    + 0.10 × latency_score
    + 0.10 × happy_path_rate
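The weights sum to 1.0, so IHR stays in the same range as its inputs. As a worked example, here is a minimal sketch of the arithmetic in Python. It assumes every component is already normalized to the range 0 to 1, which this post does not spell out; compute_ihr and the sample metrics are made up for illustration and are not Armature's API.

```python
# Weights copied from the formula above.
IHR_WEIGHTS = {
    "output_valid_rate": 0.35,
    "success_rate": 0.25,
    "avg_quorum_score": 0.20,
    "latency_score": 0.10,
    "happy_path_rate": 0.10,
}


def compute_ihr(metrics):
    """Return the weighted sum of the five components.

    Assumption: each metric is a float already normalized to [0, 1].
    """
    missing = set(IHR_WEIGHTS) - set(metrics)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for name in IHR_WEIGHTS:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {metrics[name]}")
    return sum(weight * metrics[name] for name, weight in IHR_WEIGHTS.items())


# Made-up numbers: a workflow whose output validity has collapsed.
example = {
    "output_valid_rate": 0.4,
    "success_rate": 0.9,
    "avg_quorum_score": 0.8,
    "latency_score": 0.7,
    "happy_path_rate": 0.6,
}

# 0.14 + 0.225 + 0.16 + 0.07 + 0.06
print(f"{compute_ihr(example):.3f}")  # 0.655
```

Because output_valid_rate carries the largest weight, a drop in validity alone pulls the score down sharply even when the other four signals look healthy.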

When IHR drops below 0.75, the self-improvement cycle fires. DiagnosticAnalyzer extracts failure signatures from the traces (stage_failed, output_invalid, low_confidence, high_escalation). A medium-tier LLM then proposes targeted YAML rewrites based on those specific failure modes. Safe changes (prompt rewrites, schema tightening, model tier adjustments) are applied in-place, and structural changes go to .pending.yaml for your review.
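The gate and the safe-versus-structural split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Armature's implementation: ProposedChange, should_improve, route_changes, and the kind labels are hypothetical names. Only the 0.75 threshold and the three categories of safe change come from the description above.

```python
from dataclasses import dataclass

IHR_THRESHOLD = 0.75  # threshold stated in this post

# Hypothetical labels for the three kinds of change described as safe.
SAFE_KINDS = {"prompt_rewrite", "schema_tightening", "model_tier_adjustment"}


@dataclass(frozen=True)
class ProposedChange:
    kind: str      # what sort of rewrite the optimizer proposed
    stage_id: str  # which stage in the spec it targets
    summary: str   # human-readable description of the change


def should_improve(ihr):
    """The improvement cycle fires only when IHR is below the threshold."""
    return ihr < IHR_THRESHOLD


def route_changes(changes):
    """Split proposals into (apply in place, queue for human review)."""
    apply_now, pending = [], []
    for change in changes:
        target = apply_now if change.kind in SAFE_KINDS else pending
        target.append(change)
    return apply_now, pending


proposals = [
    ProposedChange("schema_tightening", "assessor", "require non-empty flagged_clauses"),
    ProposedChange("add_stage", "reviewer", "insert a review stage after assessor"),
]

if should_improve(0.655):
    applied, queued = route_changes(proposals)
    print([c.kind for c in applied])  # ['schema_tightening']
    print([c.kind for c in queued])   # ['add_stage']
```

Treating any unrecognized kind as structural is a deliberately conservative choice in this sketch: a new kind of change is queued for review until someone explicitly marks it safe.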

The v0.2.0 release added --auto-improve to run this automatically after every execution. If IHR falls below 0.75, the workflow applies the safe fixes itself and queues any fix that requires review.

The research backing

Seven arXiv papers published between February and May 2026 converge on the same insight from different angles: the harness is more important than the model. NLAH (Tsinghua) defined the architectural primitives. MetaHarness (Stanford) proved trace-driven optimization works. Continual Harness formalized the two-loop self-improvement design. AgentSpec gave the safety DSL. AHE introduced prediction-verification to make improvement cycles accountable. KYA added static risk scoring and safety composition rules. ActiveGraph added caching, reactive behavior rules, and the post-run improvement gate.

Every major design decision in Armature traces directly to one of these papers. The citations are in CHANGELOG.md.

The ElfTech vision

Armature is one piece of something larger.

I'm building a platform called ElfTech — an autonomous-organization stack. Armature handles reasoning workflows. Other components handle deliberation, code generation, deployment, and inter-system coordination. The goal is an organization where AI systems handle end-to-end business processes — not as assistants to human workflows, but as the primary actors in those workflows, with humans reviewing outcomes rather than approving every step.

That's a large claim and a long road. Armature is working and shipped. The rest is in progress.

What to do next

pip install armature
armature doctor           # verify your environment
armature new              # interactive spec wizard
armature validate my_workflow.yml
armature run my_workflow.yml --input topic="your topic here" --auto-improve

The docs are at https://bryansparks.github.io/armature. The full USER-GUIDE.md covers fan-out pipelines, continuation blocks for long-horizon workflows, the safety DSL, memory, and the HTTP service.

If your multi-agent system is failing silently in production, I built this for you.

GitHub: https://github.com/bryansparks/armature
pip install armature
MIT license, 1,330 tests, Python 3.11+