Open Source · Python 3.11+ · MIT License

A maturity

of AI agents.

Not just a collective noun — a design principle.

Armature is a YAML-first multi-agent workflow harness. Define researcher, worker, and judge agents. Execute them as a DAG. Then let the system study its own traces and rewrite its own specification — every run, every time.

Star on GitHub →How it works ↓

Like a murder of crows is more dangerous than one,
a maturity of agents is smarter than before.

scoperesearcherworker aworkerworker bworkerjudgejudgeoutput
Why "a maturity"?

Birds flock. Geese gaggle.
Crows murder.

AI agents mature.

Every collective noun for animals captures something true about how they move and behave together. A murder of crows isn't just a group — it names the coordinated, intelligent behavior that makes them formidable as a collective. We chose maturity deliberately.

Armature's agents don't just coordinate to complete a task. After every run, the system collects execution traces, runs the DiagnosticAnalyzer against them, and uses the SpecRefiner to rewrite the YAML sections that underperformed. The next run is better. Your workflow doesn't just run. It matures.

“A murder of crows is more dangerous than one.
A maturity of agents is smarter than before.”

How It Works

Three steps. One cycle.
Spec. Execute. Improve.

Armature isn't a run-once tool. It's a loop. Each step feeds the next, and the next run is smarter than the last.

01
SPECIFY

Write a YAML spec. That's it.

Define your agents by role, model tier, and dependencies. Armature validates the DAG before the first run — catching cycles, missing deps, and misconfigured stages. No framework to learn. No graph API to wire.

role: researcher | worker | judge | orchestrator
tier: small | medium | large (maps to your model config)
depends_on: [list of upstream stage IDs]
output_mode: text | guided_json with schema validation
02
EXECUTE

DAG execution. Context flows automatically.

Independent stages run in parallel. Dependent stages wait for their inputs. Every stage receives the full accumulated context from all upstream stages — no wiring, no passing variables by hand. One shared dict, built up as the workflow runs.

Parallel fan-out for independent branches
Context dict accumulates all upstream outputs
guided_json with automatic tier escalation on failure
Checkpoint & resume — survive crashes mid-workflow
03
IMPROVE

The workflow rewrites itself.

Every run generates a trace. The SelfImproveRunner computes IHR across all stages, identifies which ones drag the score down, and rewrites targeted YAML sections. Add --auto-improve to any run and Armature applies safe fixes automatically — or stages structural rewrites for human review. The next run is better. Verifiably.

IHR = 0.40×valid + 0.30×success + 0.20×quorum + 0.10×latency
DiagnosticAnalyzer identifies the lowest-scoring stages
SpecRefiner rewrites only the underperforming YAML sections
Prediction-verification: fixes are confirmed or flagged each cycle
Agent Roles

Three roles. Every agent has one.

Roles aren't a label — they determine execution order, context access, and contribution to the self-improvement health score. A well-designed maturity has all three.

Researcher

Gathers.

The information foundation. Researchers query tools, read context, search external sources, and build the knowledge base that downstream agents draw from. They run first — and in parallel when independent.

Common uses
Market signal aggregation
Competitor analysis
Evidence synthesis across sources
Tool call fan-out
Worker

Transforms.

The production engine. Workers synthesize research into drafts, summaries, reports, code, or structured data. They consume upstream researcher output and produce the artifacts that judges and downstream workers will evaluate.

Common uses
Draft generation
Data transformation
Code synthesis
Report writing
Judge

Evaluates.

The quality gate. Judges score output quality, validate against criteria, flag hallucinations, and decide whether a result meets the bar. Only judges contribute to the quorum score in the IHR — they are the accountability layer.

Common uses
Output quality scoring (0–10)
Hallucination detection
Criteria validation
Structured pass/fail decisions
The Differentiator

Static orchestration is table stakes.
Armature learns.

AWS AgentCore, LangGraph, and CrewAI let you build agent workflows. Armature does that too — and then automatically improves them across runs using the Improvement Health Rating loop.

Improvement Health Rating (IHR)
IHR = 0.40 × valid_rate + 0.30 × success_rate + 0.20 × avg_quorum + 0.10 × latency_score
Scored 0–1.0 per run. SpecRefiner targets stages whose contribution drops the overall IHR.
01
Run & Trace
Every workflow run generates a structured trace — inputs, outputs, scores, latencies, and errors per stage.
02
Diagnose
DiagnosticAnalyzer computes IHR and identifies stages with the lowest per-metric contribution.
03
Rewrite
SpecRefiner (an LLM) receives the underperforming stage spec and rewrites the system prompt, output schema, or parameters.
04
Verify
The next run's IHR is compared to predictions. SpecRefiner tracks which fixes held and which missed — so it improves its own rewrites too.

Prediction-verification closes the loop: SpecRefiner declares what it expects each rewrite to fix. The subsequent run confirms whether the fixes held — and which ones missed. The rewriter improves its own judgment over time.

Auto self-improvement — zero manual steps
armature run my-workflow.yaml --auto-improve

Add --auto-improve to any run. When IHR drops below 0.75, Armature automatically calls SpecRefiner after execution — rewriting prompts, relaxing schemas, rebalancing model tiers, or tuning retry limits. Safe changes apply immediately; structural rewrites stage to {spec}.pending.yaml for human review.

Research Foundation

Nine papers. One framework.
All implemented.

Armature isn't invented from first principles — it's a synthesis of the best current academic thinking on agent harness design, published between February and May 2026, plus Microsoft's Agent Governance Toolkit, ActiveGraph's event-sourced execution model, and Veldt Labs' KYA trust layer. Every source contributed concrete, implemented capabilities.

Mature has two meanings here. The agents grow smarter every run — and the harness itself matures alongside the field, tracking the latest research as it ships.

01 · Mar 2026
arXiv:2603.25723

Natural-Language Agent Harnesses

Tsinghua University

Workflows defined in structured natural language outperform equivalent code-based harnesses — and can be reasoned about and rewritten by an optimizer.

YAML spec format & DAG executor
Four role types (researcher/worker/judge/orchestrator)
IHR quality metric & parallel fan-out
02 · Mar 2026
arXiv:2603.28052

Meta-Harness: Automated Optimization

Stanford University

Giving a frontier model access to full execution traces — not just pass/fail scores — enables causal reasoning about why runs failed and how to fix them.

`armature optimize` command
A/B spec testing by IHR
Multi-iteration optimizer with proposal history
03 · Feb 2026
arXiv:2603.03329

AutoHarness: LLM-Synthesized Harnesses

arXiv:2603.03329

LLMs can generate, run, evaluate, and refine their own harness specs — producing systems that outperform larger models running without a harness.

`armature new` spec wizard
NL → YAML synthesis loop
Prompt bootstrapping from trace examples
04 · Mar 2025
arXiv:2503.18666

AgentSpec: Runtime Safety Enforcement

arXiv:2503.18666

Safety constraints should be declarative rules co-located with the workflow spec — not hardcoded logic — so they can be audited, reasoned about, and generated by LLMs.

Declarative `safety_rules` YAML DSL
Pre/post-stage and pre/post-tool hooks
`ToolBlocked` non-retryable exception
05 · May 2026
arXiv:2605.09998

Continual Harness: Reset-Free Self-Improvement

arXiv:2605.09998

Agentic systems can improve continuously — without human intervention or new training runs — using a two-loop design: in-run adaptation and cross-run spec refinement.

`post_run` in-run refiner stage
`armature improve` outer self-improvement loop
Trace export for SFT/DPO fine-tuning
06 · Apr 2026
arXiv:2604.25850

AHE: Observability-Driven Automatic Evolution

arXiv:2604.25850

Every improvement proposal must declare what it predicts it will fix — and the next cycle must verify those predictions. "Did the score go up?" is not enough.

Prediction-verification loop per improvement cycle
`predicted_fixes` / `verified_fixes` tracking
Falsifiable contracts on every spec revision
07 · May 2026
arXiv:2605.26112

From Model Scaling to System Scaling

arXiv:2605.26112

Three system-level failure modes that model size alone cannot fix: stale memory reaching LLMs without warning, context values flowing without provenance, and tool side effects going unverified.

Memory staleness detection + `_stale_memory_keys` injection
Context provenance tracking per trace key
Post-condition verification for tool side effects
Drift score + component governance classification
AGT · 2025

Agent Governance Toolkit

Microsoft

Production agents require auditable governance primitives baked into the execution layer — not bolted on as policy checks. Reversibility, trace integrity, and fail-closed safety modes belong in the harness spec itself.

Reversibility classification on every tool (FULL / PARTIAL / NONE)
SHA-256 trace input hashing + policy version fingerprint
`require_approval` gate on the tool-call path
`safety_mode: strict` — fail-closed, deny on no-match
AG · May 2026
arXiv:2605.21997

ActiveGraph: Event-Sourced Agents

Yohei Nakajima

Append-only event logs make agent runs reproducible and auditable. Content-addressed LLM caching turns expensive re-runs into instant cache hits — enabling replay, debugging, and future fork-and-diff without paying LLM costs.

Content-addressed LLM response cache (`--no-cache` to opt out)
`armature replay <run_id>` — stage-by-stage audit from TraceStore
Trace-triggered behaviors (`BehaviorRule`) with IHR feedback built-in
`--auto-improve`: after each run, auto-applies spec improvements when IHR drops below 0.75
KYA · May 2026
arXiv:2605.25376

KYA: Trust Layer for Autonomous Systems

Veldt Labs

Governance must operate before execution, not only at runtime. A risk score computed from the agent's definition — its tools, governance mode, and safety rules — tells you how dangerous a workflow is before it runs. And safety rules must only tighten: an allow rule that contradicts a block rule is a misconfiguration, not a feature.

Static spec risk score [0–100] surfaced by `armature validate` (LOW/MEDIUM/HIGH/CRITICAL)
Rogue signal counter — every tool block incremented, shown in run summary
Only-tighten rule validation — `CONFLICTING_SAFETY_RULES` when allow loosens a block

The core finding shared across all seven: the harness is more important than the model. Armature ships the harness — production-grade, self-improving, and open source.

Quick Start

From zero to running
in minutes.

Write a YAML spec, point Armature at it, and watch your maturity of agents get to work.

market-briefing.yaml
name: market-briefing
model_tiers:
  small: {provider: anthropic, model: claude-haiku-4-5-20251001}
  large: {provider: anthropic, model: claude-sonnet-4-6}

stages:
  - id: researcher
    role: researcher
    tier: small
    system: |
      Gather and summarize key signals on the given topic.
      Focus on recent developments, key players, and trends.

  - id: analyst
    role: worker
    tier: small
    depends_on: [researcher]
    system: |
      From the research, identify the top 3 opportunities.
      Quantify each with available evidence.

  - id: editor
    role: judge
    tier: large
    depends_on: [analyst]
    system: |
      Review the analysis. Score quality 0–10.
      Flag any gaps or unsupported claims.
terminal
$armature run market-briefing.yaml \
--topic "AI in healthcare diagnostics"
DAG validated (3 stages, no cycles)
researcher running...
researcher done (1.4s)
analyst running...
analyst done (2.2s)
editor running...
editor done (0.9s, score=8.7/10)
Complete in 4.5s · IHR=0.91
.armature/traces/run-20260517.json
$ pip install armature-harness  · then set ANTHROPIC_API_KEY and run.
Open Source

Built to be shared.

Armature is free, MIT licensed, and built in the open. Fork it, extend it, build on it. Contributions welcome — especially new role types, tool integrations, and self-improvement strategies.

View on GitHub →Read the Docs
Python 3.11+
runtime
MIT
license
LiteLLM
provider layer
1,221+
tests passing