ACE Self-Improvement¶
ACE (Autonomous Context Engineering) is the Factory's self-improvement loop. It evolves the agent playbooks — behavioral rules that guide each specialist agent — based on real experiment outcomes.
How It Works¶
```
Experiment outcomes       Reflect          Curate          Inject
(results.tsv)        ──▶  Generate   ──▶  Merge &    ──▶  Auto-append  ──▶  Agent prompts
across all projects       candidate       prune           at runtime
                          bullets         playbooks
```
1. Reflect (factory/ace/reflector.py)¶
Analyzes experiment outcomes across all factory-managed projects (discovered via the global registry at ~/.factory/registry.json, with directory scanning as fallback):
- Loads data from performance reports (.factory/performance_report.json) with TSV fallback
- Computes category success rates (which types of changes get kept vs reverted)
- Generates candidate playbook bullets for all 7 agent roles from experiment outcomes, CEO verdict patterns, and observation coverage
- Each bullet is a behavioral rule: DO (reinforced pattern) or DON'T (anti-pattern)
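The core of the reflect step can be sketched as follows. This is an illustrative sketch, not the actual `reflector.py` API: the function names, the record keys (`category`, `verdict`), and the thresholds are all assumptions.

```python
from collections import defaultdict

def category_success_rates(experiments):
    """Compute the keep-rate per change category.

    `experiments` is a list of dicts with hypothetical keys
    'category' and 'verdict' ('kept' or 'reverted').
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [kept, total]
    for exp in experiments:
        kept, total = totals[exp["category"]]
        totals[exp["category"]] = [kept + (exp["verdict"] == "kept"), total + 1]
    return {cat: kept / total for cat, (kept, total) in totals.items()}

def candidate_bullets(rates, low=0.3, high=0.8):
    """Turn low-success categories into DON'T bullets and high-success ones into DO bullets."""
    bullets = []
    for cat, rate in sorted(rates.items()):
        if rate <= low:
            bullets.append(f"DON'T :: Avoid '{cat}' changes — kept only {rate:.0%} of the time")
        elif rate >= high:
            bullets.append(f"DO :: '{cat}' changes are usually kept ({rate:.0%})")
    return bullets
```

The real reflector also folds in CEO verdict patterns and observation coverage; this sketch covers only the category-rate signal.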
2. Curate (factory/ace/curator.py)¶
Merges candidate bullets with existing playbooks:
- Deduplicates similar rules
- Increments helpful/harmful counters on existing bullets
- Prunes low-value bullets (low net score)
- Caps playbook size to prevent unbounded growth
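A minimal sketch of that merge-and-prune logic, under stated assumptions: the `Bullet` shape and `curate` signature are hypothetical, and dedup here is naive exact-text matching where the real curator presumably uses similarity matching.

```python
from dataclasses import dataclass

@dataclass
class Bullet:
    id: str
    text: str
    helpful: int = 0
    harmful: int = 0

    @property
    def net(self):
        return self.helpful - self.harmful

def curate(existing, candidates, max_items=30):
    """Merge candidate bullets into an existing playbook, then prune and cap."""
    by_text = {b.text: b for b in existing}
    for cand in candidates:
        if cand.text in by_text:
            by_text[cand.text].helpful += cand.helpful  # reinforce an existing rule
        else:
            by_text[cand.text] = cand                   # genuinely new rule
    # Prune rules with a negative net score, then cap size by net score.
    kept = [b for b in by_text.values() if b.net >= 0]
    kept.sort(key=lambda b: b.net, reverse=True)
    return kept[:max_items]
```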
3. Inject (factory/ace/injector.py)¶
At runtime, when an agent is spawned, evolved playbooks are automatically appended to the agent's prompt. This happens transparently in factory/agents/runner.py.
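The injection step itself is simple; a sketch (the function name and signature are hypothetical, not the actual `runner.py` interface):

```python
from pathlib import Path

def inject_playbook(prompt: str, playbook_path: Path) -> str:
    """Append an evolved playbook to an agent prompt, if one exists."""
    if not playbook_path.exists():
        return prompt  # no playbook yet — prompt is unchanged
    return prompt + "\n\n" + playbook_path.read_text()
```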
Playbook Format¶
Factory ships clean default playbooks in factory/agents/playbooks/<role>.md. When ACE evolves playbooks from your experiment data, it writes to ~/.factory/playbooks/<role>.md (user-local). The injector checks user-local first, then falls back to factory defaults. Your evolved playbooks are never committed to the factory repo — they're personal to your experiment history.
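That lookup order can be sketched as follows. The directory paths come from the text above; the function name and the injectable default arguments are illustrative assumptions.

```python
from pathlib import Path

def resolve_playbook(role: str,
                     user_dir: Path = Path.home() / ".factory" / "playbooks",
                     default_dir: Path = Path("factory/agents/playbooks")):
    """Return the playbook path for a role: user-local first, then the shipped default, else None."""
    for candidate in (user_dir / f"{role}.md", default_dir / f"{role}.md"):
        if candidate.exists():
            return candidate
    return None
```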
Example format:
```
---
role: builder
updated: 2026-04-22
item_count: 5
---
## Behavioral Playbook — Builder

### DO
- [build-00001] helpful=12 harmful=1 :: Always run ruff + mypy after making changes

### DON'T
- [build-00002] helpful=3 harmful=0 :: Don't add type: ignore comments — fix the actual type error
```
Each bullet tracks:
- ID: Unique identifier (e.g. build-00001)
- helpful/harmful counters: How many times this rule correlated with kept vs reverted experiments
- Net score: helpful - harmful — rules with negative net scores get pruned
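Given the format above, a bullet line can be parsed with a small regular expression. This parser is an illustrative sketch, not ACE's actual loader:

```python
import re

# Matches e.g. "- [build-00001] helpful=12 harmful=1 :: Always run ruff + mypy ..."
BULLET_RE = re.compile(
    r"- \[(?P<id>[\w-]+)\] helpful=(?P<helpful>\d+) harmful=(?P<harmful>\d+) :: (?P<text>.+)"
)

def parse_bullet(line: str):
    """Parse one playbook bullet line into (id, net_score, text), or None for non-bullet lines."""
    m = BULLET_RE.match(line.strip())
    if m is None:
        return None
    net = int(m["helpful"]) - int(m["harmful"])
    return m["id"], net, m["text"]
```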
Running ACE¶
```shell
# Run ACE on a specific project
factory ace ~/my-project

# Run ACE as part of meta mode (includes full improve cycle first)
factory ceo ~/my-project --mode meta
```
Meta mode runs the full improvement loop, then reflects on the outcomes to evolve all 7 agent playbooks. See Self-Improvement Loop for the full picture — including cross-project learning, CEO self-evaluation, and how the pieces fit together.
When to Run¶
ACE produces meaningful playbook updates only when there is enough experiment data to analyze. Running it too frequently churns rules on small samples; ignoring it means agents never learn from their mistakes.
Recommended cadence: Weekly for most projects, nightly if you are running 5+ experiments per day. Wait until at least 5 experiments have been recorded across your managed projects before the first run (10+ preferred for stronger signal).
When to skip: Right after initial project setup (no experiment data yet), or when fewer than 3 new experiments have completed since the last ACE run. See When to Run Meta Mode for the full decision framework.
What Gets Evolved¶
All 7 agent roles have playbooks:
| Role | What ACE learns |
|---|---|
| CEO | Keep/revert decision patterns, when to trust eval scores |
| Researcher | Which research approaches produce actionable insights |
| Strategist | Which hypothesis categories succeed in which contexts |
| Builder | Implementation patterns that pass review, common pitfalls |
| Reviewer | What to focus on in code review, false positive patterns |
| Evaluator | Score interpretation, when to flag anomalies |
| Archivist | What to record, archive organization patterns |
Design Principles¶
- Evidence-based: Every playbook bullet is derived from real experiment outcomes, not hand-written rules
- Self-correcting: If a rule leads to reverted experiments, its harmful counter increases until it's pruned
- Bounded: Playbook size is capped to prevent prompt bloat
- Transparent: Playbooks are human-readable markdown — you can read, edit, or override them
- Cross-project: Learnings from one project inform behavior on others