re:factory¶
Describe what you want. re:factory builds it, tests it, and keeps improving it — autonomously.
You give it a spec file, a rough idea, or an existing codebase. re:factory researches best practices, scaffolds the project, sets up evaluation, and runs a continuous improvement loop — measuring every change and keeping only what makes things better. The agents that do this work learn from every experiment and get sharper over time.
# Build — have a fleshed-out idea? Pass the file.
factory ceo ~/ideas/weather-dashboard.md
# Design — just starting to think about it? Brainstorm first.
factory ceo "distributed eval runner" --mode design
# Research — have a metric to optimize? re:factory runs experiments.
factory ceo "SWE-bench solver agent" --mode research
# Improve — point it at any codebase
factory ceo ~/my-project
# Focus — build exactly one thing
factory ceo ~/my-project --focus "add WebSocket support"
How It Works¶
graph LR
A["🔍 Researcher<br><i>observe</i>"] --> B["🎯 Strategist<br><i>hypothesize</i>"]
B --> C["🔨 Builder<br><i>implement</i>"]
C --> RV["🛡️ Reviewer<br><i>guard</i>"]
RV --> D["📊 Evaluator<br><i>measure</i>"]
D --> E{"CEO<br><i>decide</i>"}
E -- "score ↑" --> F["✅ KEEP"]
E -- "score ↓" --> G["↩️ REVERT"]
F --> H["📝 Archivist<br><i>record</i>"]
G --> H
H -.-> A
style E fill:#5c6bc0,color:#fff,stroke:#3949ab
style F fill:#43a047,color:#fff,stroke:#2e7d32
style G fill:#e53935,color:#fff,stroke:#c62828
A CEO agent orchestrates eight specialists — Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Refiner, and Failure Analyst — each running as an independent Claude Code subprocess. The Researcher searches the web and reads prior knowledge from the archive. The Strategist generates ranked hypotheses and also handles design-mode ideation. The Builder implements one on an experiment branch. The Evaluator scores before and after. The CEO decides keep or revert. The Archivist records everything to .factory/archive/ and regenerates performance reports for cross-project learning. In design mode, the Strategist synthesizes research into a buildable plan through user feedback. In research mode, the Failure Analyst classifies run failures to guide targeted hypothesis generation.
Workflows¶
Build — start from an idea¶
factory ceo "Build a REST API for bookmark management"
factory ceo ~/ideas/weather-dashboard.md
factory ceo https://github.com/user/repo
Give re:factory an idea (raw string, spec file, or GitHub URL) and it builds a complete project: scaffolding, tests, eval, and iterative improvement.
Improve — make an existing codebase better¶
Point it at any codebase. Each cycle observes the project, hypothesizes changes, implements one, and keeps it only if the score goes up.
Focus — build exactly one thing¶
When you know exactly what you want, --focus pins a single backlog item, generates one hypothesis, runs one experiment, and exits. The entire pipeline is scoped to that single target.
Design — brainstorm before building¶
Have a rough idea? Design mode researches the space, drafts a structured plan via the Strategist, and lets you iterate on it before any code is written.
Research — optimize a metric iteratively¶
factory ceo "SWE-bench solver agent" --mode research
factory ceo ~/my-research-project --mode research
For projects with a measurable target metric (benchmark accuracy, solve rate, query precision). Research mode replaces the standard Improve loop with a specialized cycle: Baseline → Failure Analyst → Researcher → Strategist → Builder → Run → Verdict. Leakage guards prevent ground truth from contaminating hypotheses, and monotonic improvement ensures the metric never regresses below the previous best. See Getting Started for the full picture.
Headless & continuous loop¶
For unattended operation — scripting, cron jobs, or always-on machines:
# Headless — pipe mode, no interaction
factory ceo ~/my-project --headless
# Loop — continuous improvement (default: every 30 min)
factory run ~/my-project --loop
# Detached tmux — loop in the background
factory tmux ~/my-project --loop
--headless disables the interactive session. --loop wraps the CEO in a heartbeat loop: run one cycle, sleep, repeat. Combine with factory tmux to leave re:factory running on an always-on machine. See Getting Started for full details.
Quick Start¶
# Install from source (recommended — re:factory evolves fast)
git clone https://github.com/akashgit/remote-factory.git
cd remote-factory && uv sync && uv tool install -e .
# Register the CEO as a Claude Code agent
factory install
Prerequisites: Python 3.11+ and Claude Code (installed and authenticated). No external services, databases, or Obsidian required — re:factory stores all state locally.
Per-project state lives in .factory/ (experiment history, strategy, archive notes). Global state lives in ~/.factory/ (project registry, evolved playbooks). Projects are auto-registered when experiments begin — no manual setup needed. See Setup Guide for environment variables and authentication options.
Self-Evolving Agents¶
re:factory doesn't just improve your project — it improves itself. Every keep/revert decision becomes training data for the next cycle.
This is powered by ACE (Autonomous Context Engineering) — inspired by Anthropic's work on context engineering — a Reflect → Curate → Inject loop that evolves agent playbooks from real experiment outcomes.
graph LR
A["Experiment Outcomes<br><i>kept or reverted</i>"] -->|Reflect| B["Generate<br>candidate bullets"]
B -->|Curate| C["Merge & prune<br>playbooks"]
C -->|Inject| D["Agent Prompts<br><i>auto-appended</i>"]
D -.->|"next cycle"| A
style A fill:#fff3e0,stroke:#ff8f00
style D fill:#e8eaf6,stroke:#5c6bc0
Each agent accumulates behavioral rules — DOs and DON'Ts — with evidence counters. Rules that correlate with kept experiments get reinforced. Rules that correlate with reverts get pruned.
# Run a full improvement cycle, then evolve all agent playbooks
factory ceo ~/my-project --mode meta
See Self-Improvement Loop for the full picture — how the CEO tracks agents, how cross-project learning works, and how the CEO improves itself. See ACE Playbook Evolution for the playbook mechanics.
Architecture¶
graph TB
subgraph agents ["Specialist Agents"]
R["Researcher"] ~~~ S["Strategist"] ~~~ BU["Builder"]
RE["Reviewer"] ~~~ EV["Evaluator"] ~~~ AR["Archivist"]
RF["Refiner"] ~~~ FA["Failure Analyst"]
end
subgraph ceo ["CEO Agent"]
C["Detect state → Route mode → Spawn agents → Keep/Revert → Archive"]
end
subgraph cli ["Python CLI"]
T["eval · guard · store · discover · events · strategy"]
end
agents --> ceo --> cli
style agents fill:#e8eaf6,stroke:#5c6bc0
style ceo fill:#fff3e0,stroke:#ff8f00
style cli fill:#e8f5e9,stroke:#43a047
The Eval System¶
graph LR
subgraph hygiene ["Hygiene · 6 dims"]
H1["tests · lint · types<br>coverage · guards · config"]
end
subgraph growth ["Growth · 5 dims"]
G1["capability · diversity<br>observability · research<br>effectiveness"]
end
subgraph project ["Project · N dims"]
P1["your custom metrics<br>benchmarks · latency<br>accuracy · win rate"]
end
hygiene --> M["⚖️ Weighted<br>Composite"]
growth --> M
project --> M
M --> S{"score ≥<br>threshold?"}
S -- "yes" --> K["✅ Keep"]
S -- "no" --> R["↩️ Revert"]
style hygiene fill:#e8eaf6,stroke:#5c6bc0
style growth fill:#fff3e0,stroke:#ff8f00
style project fill:#e8f5e9,stroke:#43a047
style K fill:#43a047,color:#fff
style R fill:#e53935,color:#fff
| Tier | What it measures | Examples |
|---|---|---|
| Hygiene (6 dimensions) | Code quality basics | Tests, lint, type checking, coverage |
| Growth (5 dimensions) | Capability evolution | API surface area, experiment diversity, observability |
| Project (user-defined) | Domain-specific metrics | Benchmark accuracy, latency, win rate |
Built with re:factory¶
re:factory has shipped something every day for the last 30 days — products, research experiments, production features, papers. Here are a few examples:
| Project | What it does | Mode |
|---|---|---|
| SWE-bench solver | Autonomous agent that resolves GitHub issues from the SWE-bench dataset, iteratively improved via failure analysis | Research |
| HMMT math solver | Multi-agent team (Explorer, Theorist, Computationalist, Critic, Synthesizer) that solved HMMT Feb 2025 Combinatorics Problem 7 | Research |
| Text/Sketch → CAD | Converts natural language and hand-drawn sketches into executable CadQuery code for 3D model generation | Research |
| HLS design space explorer | Per-function AI agents explore HLS pragma/code variants in parallel, an ILP solver finds the optimal combination, then global expert agents apply cross-function optimizations — achieving up to 92% execution time reduction on cryptographic benchmarks | Build |
| Pluck | iOS app that extracts structured data from screenshots, links, and shared content using on-device AI | Build + Improve |
| Group chat digest | Turns iMessage group chats into weekly family newsletters with AI-curated highlights and photo selection | Build + Improve |
| Production enterprise features | Complete UI components and backend features shipped into a large-scale production codebase | Focus + Improve |
| re:factory itself | re:factory runs on itself in meta mode — its own agent playbooks are evolved from its own experiment outcomes | Meta |
Built something with re:factory? Open a PR to add it here.
License¶
MIT — Akash Srivastava