Open Source · Python 3.11+ · MIT License

A maturity of AI agents.

Not just a collective noun — a design principle.

Armature is a YAML-first multi-agent workflow harness. Define researcher, worker, and judge agents. Execute them as a DAG. Then let the system study its own traces and rewrite its own specification — every run, every time.

Just as a murder of crows is more dangerous than one,
a maturity of agents is smarter than before.

Why "a maturity"?

Birds flock. Geese gaggle.
Crows murder.

AI agents mature.

Every collective noun for animals captures something true about how they move and behave together. A murder of crows isn't just a group — it names the coordinated, intelligent behavior that makes them formidable as a collective. We chose maturity deliberately.

Armature's agents don't just coordinate to complete a task. After every run, the system collects execution traces, runs the DiagnosticAnalyzer against them, and uses the SpecRefiner to rewrite the YAML sections that underperformed. The next run is better. Your workflow doesn't just run. It matures.

“A murder of crows is more dangerous than one.
A maturity of agents is smarter than before.”

How It Works

Three steps. One cycle.
Spec. Execute. Improve.

Armature isn't a run-once tool. It's a loop. Each step feeds the next, and the next run is smarter than the last.

SPECIFY

Write a YAML spec. That's it.

Define your agents by role, model tier, and dependencies. Armature validates the DAG before the first run — catching cycles, missing deps, and misconfigured stages. No framework to learn. No graph API to wire.

▸role: researcher | worker | judge | orchestrator

▸tier: small | medium | large (maps to your model config)

▸depends_on: [list of upstream stage IDs]

▸output_mode: text | guided_json with schema validation
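
The pre-run checks described above (cycles, missing dependencies) can be illustrated with Python's standard-library `graphlib`. This is a minimal sketch under stated assumptions, not Armature's actual validator: the function name `validate_dag` and the in-memory stage shape are hypothetical and chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(stages: list[dict]) -> list[str]:
    """Return one valid execution order, or raise ValueError on a bad graph.

    `stages` is a hypothetical in-memory form of the YAML `stages:` list:
    each item has an `id` and an optional `depends_on` list.
    """
    ids = [s["id"] for s in stages]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate stage id")

    graph = {s["id"]: set(s.get("depends_on", [])) for s in stages}

    # Missing deps: every dependency must name a declared stage.
    for stage_id, deps in graph.items():
        missing = deps - set(graph)
        if missing:
            raise ValueError(f"{stage_id}: unknown dependency {sorted(missing)}")

    # Cycles: graphlib raises CycleError when no topological order exists.
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"cycle detected: {exc.args[1]}") from exc
```

Called on a linear researcher, analyst, editor chain, this returns the stages in dependency order; a spec where two stages depend on each other raises before anything runs.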

EXECUTE

DAG execution. Context flows automatically.

Independent stages run in parallel. Dependent stages wait for their inputs. Every stage receives the full accumulated context from all upstream stages — no wiring, no passing variables by hand. One shared dict, built up as the workflow runs.

▸Parallel fan-out for independent branches

▸Context dict accumulates all upstream outputs

▸guided_json with automatic tier escalation on failure

▸Checkpoint & resume — survive crashes mid-workflow

▸Deliberate iteration with loop: and carry_forward
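
The fan-out and shared-context behavior listed above can be sketched with `asyncio` and `graphlib`. This is an illustrative simplification, not Armature's engine: `run_dag` and the `run_stage` callback are hypothetical names, stages run in waves (a stage waits for its whole wave rather than only its own inputs), and each stage is handed a snapshot of everything produced so far, which is a superset of its upstream outputs.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_dag(stages, run_stage, initial_context=None):
    """Run stages wave by wave; stages in the same wave run concurrently.

    `run_stage(stage_id, context)` is a hypothetical async callback that
    returns the stage's output.
    """
    context = dict(initial_context or {})
    sorter = TopologicalSorter(
        {s["id"]: set(s.get("depends_on", [])) for s in stages}
    )
    sorter.prepare()
    while sorter.is_active():
        ready = sorter.get_ready()  # every stage whose deps are all done
        # Independent stages fan out in parallel, each with a context snapshot.
        outputs = await asyncio.gather(
            *(run_stage(stage_id, dict(context)) for stage_id in ready)
        )
        # Outputs accumulate in one shared dict, keyed by stage id.
        for stage_id, output in zip(ready, outputs):
            context[stage_id] = output
            sorter.done(stage_id)
    return context
```

Passing a copy of the dict to each stage keeps concurrently running stages from mutating each other's view; only the executor writes to the shared context.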

IMPROVE

The workflow rewrites itself.

Every run generates a trace. The SelfImproveRunner computes HQS across all stages, identifies which ones drag the score down, and rewrites targeted YAML sections. Add --auto-improve to any run and Armature applies safe fixes automatically — or stages structural rewrites for human review. The next run is better. Verifiably.

▸HQS = 0.35×valid + 0.25×success + 0.20×quorum + 0.10×latency + 0.10×HFR

▸DiagnosticAnalyzer identifies the lowest-scoring stages

▸SpecRefiner rewrites only the underperforming YAML sections

▸Prediction-verification: fixes are confirmed or flagged each cycle

Agent Roles

Three core roles. Every agent has one.

Roles aren't a label — they determine execution order, context access, and contribution to the self-improvement health score. A well-designed maturity has all three.

◎

Researcher

Gathers.

The information foundation. Researchers query tools, read context, search external sources, and build the knowledge base that downstream agents draw from. They run first — and in parallel when independent.

Common uses

Market signal aggregation

Competitor analysis

Evidence synthesis across sources

Tool call fan-out

◈

Worker

Transforms.

The production engine. Workers synthesize research into drafts, summaries, reports, code, or structured data. They consume upstream researcher output and produce the artifacts that judges and downstream workers will evaluate.

Common uses

Draft generation

Data transformation

Code synthesis

Report writing

◉

Judge

Evaluates.

The quality gate. Judges score output quality, validate against criteria, flag hallucinations, and decide whether a result meets the bar. Only judges contribute to the quorum score in the HQS — they are the accountability layer.

Common uses

Output quality scoring (0–10)

Hallucination detection

Criteria validation

Structured pass/fail decisions

The Differentiator

Static orchestration is table stakes.
Armature learns.

AWS AgentCore, LangGraph, and CrewAI let you build agent workflows. Armature does that too — and then automatically improves them across runs using the Harness Quality Score loop.

Harness Quality Score (HQS)

HQS = 0.35 × valid_rate + 0.25 × success_rate + 0.20 × avg_quorum + 0.10 × latency_score + 0.10 × HFR

Scored 0–1.0 per run. SpecRefiner targets stages whose contribution drops the overall HQS.
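
The formula above is a plain weighted sum, so it is easy to check by hand. The sketch below assumes each component is already normalized to the 0 to 1 range (consistent with the 0–1.0 score); the function name and the dictionary keys are illustrative, and HFR is treated as an opaque input because this page does not define it.

```python
# Weights copied from the HQS formula above; they sum to 1.0.
HQS_WEIGHTS = {
    "valid_rate": 0.35,
    "success_rate": 0.25,
    "avg_quorum": 0.20,
    "latency_score": 0.10,
    "hfr": 0.10,
}


def harness_quality_score(metrics: dict[str, float]) -> float:
    """Weighted sum of the five HQS components, each assumed to be in [0, 1]."""
    for name in HQS_WEIGHTS:
        value = metrics[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return sum(weight * metrics[name] for name, weight in HQS_WEIGHTS.items())


if __name__ == "__main__":
    example = {
        "valid_rate": 0.9,
        "success_rate": 0.8,
        "avg_quorum": 0.7,
        "latency_score": 1.0,
        "hfr": 0.5,
    }
    # 0.315 + 0.200 + 0.140 + 0.100 + 0.050
    print(round(harness_quality_score(example), 3))
```

Because valid_rate carries the largest weight, a stage that emits invalid output moves the score more than one that is merely slow.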

Run & Trace

Every workflow run generates a structured trace — inputs, outputs, scores, latencies, and errors per stage.

Diagnose

DiagnosticAnalyzer computes HQS and identifies stages with the lowest per-metric contribution.

Rewrite

SpecRefiner (an LLM) receives the underperforming stage spec and rewrites the system prompt, output schema, or parameters.

Verify

The next run's HQS is compared to predictions. SpecRefiner tracks which fixes held and which missed — so it improves its own rewrites too.

Prediction-verification closes the loop: SpecRefiner declares what it expects each rewrite to fix, and the subsequent run's trace either confirms that prediction or flags it as missed.
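
A prediction-verification cycle can be pictured as set bookkeeping. The sketch below is an assumption-laden illustration, not Armature's data model: the `ImprovementCycle` class, the `missed_fixes` field, and the use of per-stage scores as the confirmation signal are all hypothetical. Only the idea of tracking predicted and verified fixes comes from this page.

```python
from dataclasses import dataclass, field


@dataclass
class ImprovementCycle:
    """One spec revision and the stages it claims it will fix."""
    predicted_fixes: set[str]
    verified_fixes: set[str] = field(default_factory=set)
    missed_fixes: set[str] = field(default_factory=set)


def verify_cycle(
    cycle: ImprovementCycle,
    before: dict[str, float],
    after: dict[str, float],
    min_gain: float = 0.0,
) -> ImprovementCycle:
    """Confirm each prediction against per-stage scores from the next run.

    `before` and `after` map stage id -> score (an assumed signal). A
    prediction counts as verified only if that stage's score rose by
    more than `min_gain`.
    """
    for stage_id in cycle.predicted_fixes:
        gain = after.get(stage_id, 0.0) - before.get(stage_id, 0.0)
        if gain > min_gain:
            cycle.verified_fixes.add(stage_id)
        else:
            cycle.missed_fixes.add(stage_id)
    return cycle
```

Checking each predicted stage separately is what makes a revision falsifiable: an overall score increase cannot hide a predicted fix that did not land.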

Auto self-improvement — zero manual steps

armature run my-workflow.yaml --auto-improve

Add --auto-improve to any run. When HQS drops below 0.75, Armature automatically calls SpecRefiner after execution — rewriting prompts, relaxing schemas, rebalancing model tiers, or tuning retry limits. Safe changes apply immediately; structural rewrites stage to {spec}.pending.yaml for human review.
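
The routing described above (threshold check, then apply or stage) can be sketched as follows. Everything here is illustrative: `plan_auto_improve` and the `structural` flag on each proposal are hypothetical, this page does not say which change types count as safe, and the sketch assumes `{spec}` expands to the spec filename without its extension.

```python
from pathlib import Path

HQS_THRESHOLD = 0.75  # value taken from the text above


def pending_path(spec_path: str) -> Path:
    """Assumed expansion of {spec}.pending.yaml: spec stem + .pending.yaml."""
    path = Path(spec_path)
    return path.with_name(f"{path.stem}.pending.yaml")


def plan_auto_improve(hqs: float, proposals: list[dict], spec_path: str) -> dict:
    """Split refiner proposals into apply-now and stage-for-review.

    Each proposal is a hypothetical dict with a `structural` boolean set
    by whatever classifies the change.
    """
    if hqs >= HQS_THRESHOLD:
        # Healthy run: the refiner is not invoked at all.
        return {"apply": [], "stage": [], "pending_file": None}
    safe = [p for p in proposals if not p["structural"]]
    structural = [p for p in proposals if p["structural"]]
    return {
        "apply": safe,
        "stage": structural,
        "pending_file": pending_path(spec_path) if structural else None,
    }
```

Under these assumptions, a run at HQS 0.91 changes nothing, while a run at 0.60 with one prompt tweak and one structural rewrite applies the tweak and stages the rewrite in `my-workflow.pending.yaml`.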

Iteration ≠ Retry.
First-class loops.

Most agent frameworks give you retry-on-failure as the only looping mechanism. Armature adds first-class iteration: declare a loop with intent — “run until approved”, “carry forward gaps between rounds” — and the engine handles the rest.

- id: research_round
  loop:
    max_iterations: 5
    until: "{{ confidence > 0.85 }}"
    carry_forward: [findings, gaps, confidence]
  role:
    name: Researcher
    type: researcher
    description: |
      {% if _iteration.is_first %}Start fresh: {{ topic }}{% else %}
      Iteration {{ _iteration.num }}. Prior gaps: {{ gaps }}
      Build on findings: {{ findings }}{% endif %}
  output_mode: guided_json
  depends_on: []

loop:

Deliberate iteration on any stage. Research rounds, refinement cycles, convergence loops — declared as intent, not retry logic.

_iteration

Always-defined context: .num (1-based), .is_first, .is_last. No undefined-on-first-pass surprises.

carry_forward:

Dot-paths for selective state carry between iterations. Pass only what matters — not the entire prior result.

until:

Jinja2 stop condition evaluated against the stage result. "{{ approved == true }}" — says exactly what you mean.

Research Foundation

Nine papers. One toolkit.
All implemented.

Armature isn't invented from first principles. It's a synthesis of current academic thinking on agent harness design, with all but one of the papers published in 2026, together with Microsoft's Agent Governance Toolkit, Yohei Nakajima's event-sourced execution model, and Veldt Labs' KYA trust layer. Every source contributed concrete, implemented capabilities.

Mature has two meanings here. The agents grow smarter every run — and the harness itself matures alongside the field, tracking the latest research as it ships.

01 · Mar 2026

arXiv:2603.25723↗

Natural-Language Agent Harnesses

Tsinghua University

Workflows defined in structured natural language outperform equivalent code-based harnesses — and can be reasoned about and rewritten by an optimizer.

▸YAML spec format & DAG executor

▸Four role types (researcher/worker/judge/orchestrator)

▸HQS quality metric & parallel fan-out

02 · Mar 2026

arXiv:2603.28052↗

Meta-Harness: Automated Optimization

Stanford University

Giving a frontier model access to full execution traces — not just pass/fail scores — enables causal reasoning about why runs failed and how to fix them.

▸`armature optimize` command

▸A/B spec testing by HQS

▸Multi-iteration optimizer with proposal history

03 · Feb 2026

arXiv:2603.03329↗

AutoHarness: LLM-Synthesized Harnesses

LLMs can generate, run, evaluate, and refine their own harness specs — producing systems that outperform larger models running without a harness.

▸`armature new` spec wizard

▸NL → YAML synthesis loop

▸Prompt bootstrapping from trace examples

04 · Mar 2025

arXiv:2503.18666↗

AgentSpec: Runtime Safety Enforcement

Safety constraints should be declarative rules co-located with the workflow spec — not hardcoded logic — so they can be audited, reasoned about, and generated by LLMs.

▸Declarative `safety_rules` YAML DSL

▸Pre/post-stage and pre/post-tool hooks

▸`ToolBlocked` non-retryable exception
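
To make the non-retryable behavior concrete, here is a minimal sketch of a pre-tool hook feeding a retry wrapper. It is an illustration under assumptions, not the real `safety_rules` DSL: the rule shape (`id`, `tool`, `action`), the function names, and the locally defined `ToolBlocked` class are all stand-ins.

```python
class ToolBlocked(Exception):
    """Local stand-in for the non-retryable exception named in the list above."""


def enforce_pre_tool(tool: str, safety_rules: list[dict]) -> None:
    """Pre-tool hook: raise ToolBlocked if any rule blocks this tool."""
    for rule in safety_rules:
        if rule["action"] == "block" and rule["tool"] == tool:
            raise ToolBlocked(f"{tool} blocked by rule {rule.get('id', '?')}")


def call_tool(tool, fn, safety_rules, max_retries=2):
    """Run `fn` with retries; a ToolBlocked is never retried."""
    enforce_pre_tool(tool, safety_rules)  # raises before the tool ever runs
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return fn()
        except ToolBlocked:
            raise  # a policy decision, not a transient failure
        except Exception as exc:
            last_error = exc  # transient failure: try again
    raise last_error
```

Separating the policy failure from ordinary exceptions means a blocked call surfaces immediately instead of burning the retry budget on something that will never be allowed.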

05 · May 2026

arXiv:2605.09998↗

Continual Harness: Reset-Free Self-Improvement

Agentic systems can improve continuously — without human intervention or new training runs — using a two-loop design: in-run adaptation and cross-run spec refinement.

▸`post_run` in-run refiner stage

▸`armature improve` outer self-improvement loop

▸Trace export for SFT/DPO fine-tuning

06 · Apr 2026

arXiv:2604.25850↗

AHE: Observability-Driven Automatic Evolution

Every improvement proposal must declare what it predicts it will fix — and the next cycle must verify those predictions. "Did the score go up?" is not enough.

▸Prediction-verification loop per improvement cycle

▸`predicted_fixes` / `verified_fixes` tracking

▸Falsifiable contracts on every spec revision

07 · May 2026

arXiv:2605.26112↗

From Model Scaling to System Scaling

Three system-level failure modes that model size alone cannot fix: stale memory reaching LLMs without warning, context values flowing without provenance, and tool side effects going unverified.

▸Memory staleness detection + `_stale_memory_keys` injection

▸Context provenance tracking per trace key

▸Post-condition verification for tool side effects

▸Drift score + component governance classification

AGT · 2025

Agent Governance Toolkit

Microsoft

Production agents require auditable governance primitives baked into the execution layer — not bolted on as policy checks. Reversibility, trace integrity, and fail-closed safety modes belong in the harness spec itself.

▸Reversibility classification on every tool (FULL / PARTIAL / NONE)

▸SHA-256 trace input hashing + policy version fingerprint

▸`require_approval` gate on the tool-call path

▸`safety_mode: strict` — fail-closed, deny on no-match

AG · May 2026

arXiv:2605.21997↗

The Log is the Agent

Yohei Nakajima

Append-only event logs make agent runs reproducible and auditable. Content-addressed LLM caching turns expensive re-runs into instant cache hits — enabling replay, debugging, and future fork-and-diff without paying LLM costs.

▸Content-addressed LLM response cache (`--no-cache` to opt out)

▸`armature replay <run_id>` — stage-by-stage audit from TraceStore

▸Trace-triggered behaviors (`BehaviorRule`) with HQS feedback built-in

▸`--auto-improve`: after each run, auto-applies spec improvements when HQS drops below 0.75
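
Content-addressed caching means the cache key is a hash of the request itself, so identical requests always hit the same entry. The sketch below shows the general technique with `hashlib` and is not Armature's cache: the `ResponseCache` class, the key fields, and the `call_llm` callback are hypothetical, and the `use_cache` argument only mimics what a `--no-cache` opt-out might do.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], params: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the request."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache keyed by request content (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_call(self, model, messages, params, call_llm, use_cache=True):
        key = cache_key(model, messages, params)
        if use_cache and key in self._store:
            self.hits += 1
            return self._store[key]
        response = call_llm(model, messages, params)
        if use_cache:
            self._store[key] = response
        return response
```

Canonical serialization is the important detail: without sorted keys, two logically identical requests could hash differently and miss the cache.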

KYA · May 2026

arXiv:2605.25376↗

KYA: Trust Layer for Autonomous Systems

Veldt Labs

Governance must operate before execution, not only at runtime. A risk score computed from the agent's definition — its tools, governance mode, and safety rules — tells you how dangerous a workflow is before it runs. And safety rules must only tighten: an allow rule that contradicts a block rule is a misconfiguration, not a feature.

▸Static spec risk score [0–100] surfaced by `armature validate` (LOW/MEDIUM/HIGH/CRITICAL)

▸Rogue signal counter — every tool block incremented, shown in run summary

▸Only-tighten rule validation — `CONFLICTING_SAFETY_RULES` when allow loosens a block
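
The only-tighten check reduces to looking for an allow rule that targets something a block rule already covers. The sketch below assumes a deliberately simple rule shape (`tool` plus `action`), which is not the real `safety_rules` schema; only the `CONFLICTING_SAFETY_RULES` code comes from this page.

```python
def find_conflicts(safety_rules: list[dict]) -> list[str]:
    """Return one error per allow rule that loosens an existing block.

    Rules are hypothetical dicts like {"tool": "shell", "action": "block"}.
    """
    blocked = {r["tool"] for r in safety_rules if r["action"] == "block"}
    return [
        f"CONFLICTING_SAFETY_RULES: allow rule for '{r['tool']}' loosens a block"
        for r in safety_rules
        if r["action"] == "allow" and r["tool"] in blocked
    ]
```

Treating the conflict as a validation error, rather than resolving it by rule order, keeps the outcome independent of how the rules happen to be listed.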

The core finding shared across all of them: the harness is more important than the model. Armature ships the harness — production-grade, self-improving, and open source.

Quick Start

From zero to running in minutes.

Write a YAML spec, point Armature at it, and watch your maturity of agents get to work.

market-briefing.yaml

name: market-briefing
model_tiers:
  small: {provider: anthropic, model: claude-haiku-4-5-20251001}
  large: {provider: anthropic, model: claude-sonnet-4-6}

stages:
  - id: researcher
    role: researcher
    tier: small
    system: |
      Gather and summarize key signals on the given topic.
      Focus on recent developments, key players, and trends.

  - id: analyst
    role: worker
    tier: small
    depends_on: [researcher]
    system: |
      From the research, identify the top 3 opportunities.
      Quantify each with available evidence.

  - id: editor
    role: judge
    tier: large
    depends_on: [analyst]
    system: |
      Review the analysis. Score quality 0–10.
      Flag any gaps or unsupported claims.

terminal

$ armature run market-briefing.yaml \
    --topic "AI in healthcare diagnostics"
✓ DAG validated (3 stages, no cycles)
◌ researcher running...
✓ researcher done (1.4s)
◌ analyst running...
✓ analyst done (2.2s)
◌ editor running...
✓ editor done (0.9s, score=8.7/10)
✓ Complete in 4.5s · HQS=0.91
→ .armature/traces/run-20260517.json

$ pip install armature-agents

Then set ANTHROPIC_API_KEY and run.

Open Source

Built to be shared.

Armature is free, MIT licensed, and built in the open. Fork it, extend it, build on it. Contributions welcome — especially new role types, tool integrations, and self-improvement strategies.

Runtime: Python 3.11+
License: MIT
Provider layer: LiteLLM
Tests passing: 1,432