re:factory

Describe what you want. re:factory builds it, tests it, and keeps improving it — autonomously.

You give it a spec file, a rough idea, or an existing codebase. re:factory researches best practices, scaffolds the project, sets up evaluation, and runs a continuous improvement loop — measuring every change and keeping only what makes things better. The agents that do this work learn from every experiment and get sharper over time.

# Build — have a fleshed-out idea? Pass the file.
factory ceo ~/ideas/weather-dashboard.md

# Design — just starting to think about it? Brainstorm first.
factory ceo "distributed eval runner" --mode design

# Research — have a metric to optimize? re:factory runs experiments.
factory ceo "SWE-bench solver agent" --mode research

# Improve — point it at any codebase
factory ceo ~/my-project

# Focus — build exactly one thing
factory ceo ~/my-project --focus "add WebSocket support"

How It Works

graph LR
    A["🔍 Researcher<br><i>observe</i>"] --> B["🎯 Strategist<br><i>hypothesize</i>"]
    B --> C["🔨 Builder<br><i>implement</i>"]
    C --> RV["🛡️ Reviewer<br><i>guard</i>"]
    RV --> D["📊 Evaluator<br><i>measure</i>"]
    D --> E{"CEO<br><i>decide</i>"}
    E -- "score ↑" --> F["✅ KEEP"]
    E -- "score ↓" --> G["↩️ REVERT"]
    F --> H["📝 Archivist<br><i>record</i>"]
    G --> H
    H -.-> A

    style E fill:#5c6bc0,color:#fff,stroke:#3949ab
    style F fill:#43a047,color:#fff,stroke:#2e7d32
    style G fill:#e53935,color:#fff,stroke:#c62828

A CEO agent orchestrates eight specialists (Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Refiner, and Failure Analyst), each running as an independent Claude Code subprocess. The Researcher searches the web and reads prior knowledge from the archive. The Strategist generates ranked hypotheses. The Builder implements one hypothesis on an experiment branch. The Reviewer guards the change before it is scored. The Evaluator scores the project before and after the change. The CEO decides whether to keep or revert it. The Archivist records everything to .factory/archive/ and regenerates performance reports for cross-project learning. In design mode, the Strategist also handles ideation, synthesizing research into a buildable plan through user feedback. In research mode, the Failure Analyst classifies run failures to guide targeted hypothesis generation.
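The keep-or-revert step can be sketched in a few lines of Python. This is a minimal illustration, not re:factory's actual code: `run_cycle`, `Experiment`, and the callback parameters are hypothetical names, and the strict "score must go up" comparison is an assumption taken from the description of the Improve workflow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str
    score_before: float
    score_after: float
    verdict: str  # "KEEP" or "REVERT"


def run_cycle(
    hypothesis: str,
    evaluate: Callable[[], float],
    apply_change: Callable[[], None],
    revert_change: Callable[[], None],
    archive: List[Experiment],
) -> Experiment:
    """Score, apply one change, score again, then keep or revert."""
    before = evaluate()
    apply_change()
    after = evaluate()
    # Assumption: a change is kept only on a strict improvement.
    verdict = "KEEP" if after > before else "REVERT"
    if verdict == "REVERT":
        revert_change()
    record = Experiment(hypothesis, before, after, verdict)
    archive.append(record)  # recorded whether kept or reverted
    return record
```

Passing the evaluation and change steps in as callbacks keeps the decision logic independent of how a change is applied, for example on a git experiment branch.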

Workflows

Build — start from an idea

factory ceo "Build a REST API for bookmark management"
factory ceo ~/ideas/weather-dashboard.md
factory ceo https://github.com/user/repo

Give re:factory an idea (raw string, spec file, or GitHub URL) and it builds a complete project: scaffolding, tests, eval, and iterative improvement.

Improve — make an existing codebase better

factory ceo ~/my-project
factory run ~/my-project --loop

Point it at any codebase. Each cycle observes the project, hypothesizes changes, implements one, and keeps it only if the score goes up.

Focus — build exactly one thing

factory ceo ~/my-project --focus "add authentication middleware"

When you know exactly what you want, --focus pins a single backlog item, generates one hypothesis, runs one experiment, and exits. The entire pipeline is scoped to that single target.

Design — brainstorm before building

factory ceo "distributed eval runner" --mode design

Have a rough idea? Design mode researches the space, drafts a structured plan via the Strategist, and lets you iterate on it before any code is written.

Research — optimize a metric iteratively

factory ceo "SWE-bench solver agent" --mode research
factory ceo ~/my-research-project --mode research

Use research mode for projects with a measurable target metric (benchmark accuracy, solve rate, query precision). It replaces the standard Improve loop with a specialized cycle: Baseline → Failure Analyst → Researcher → Strategist → Builder → Run → Verdict. Leakage guards prevent ground truth from contaminating hypotheses, and monotonic improvement ensures the metric never regresses below the previous best. See Getting Started for the full picture.
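One way to picture monotonic improvement is a best-so-far tracker: a run is kept only if it beats the best metric seen so far. The sketch below is an assumption about how such a rule could work, not re:factory's implementation. The function name `monotonic_verdicts` is hypothetical, and it assumes a higher metric is better.

```python
from typing import Iterable, List, Tuple


def monotonic_verdicts(
    baseline: float, run_metrics: Iterable[float]
) -> Tuple[float, List[str]]:
    """Return the final best metric and a KEEP/REVERT verdict per run."""
    best = baseline
    verdicts: List[str] = []
    for metric in run_metrics:
        if metric > best:  # assumption: higher is better
            best = metric
            verdicts.append("KEEP")
        else:
            # Reverting leaves the previous best in place,
            # so the tracked metric never regresses.
            verdicts.append("REVERT")
    return best, verdicts
```

With a baseline of 0.40 and runs scoring 0.42, 0.41, and 0.45, the second run is reverted and the tracked best ends at 0.45.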

Headless & continuous loop

For unattended operation — scripting, cron jobs, or always-on machines:

# Headless — pipe mode, no interaction
factory ceo ~/my-project --headless

# Loop — continuous improvement (default: every 30 min)
factory run ~/my-project --loop

# Detached tmux — loop in the background
factory tmux ~/my-project --loop

--headless disables the interactive session. --loop wraps the CEO in a heartbeat loop: run one cycle, sleep, repeat. Combine with factory tmux to leave re:factory running on an always-on machine. See Getting Started for full details.
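The heartbeat pattern (run one cycle, sleep, repeat) can be sketched as follows. This is an illustration under stated assumptions rather than the code behind --loop: `heartbeat_loop`, `max_cycles`, and the injectable `sleep` parameter are hypothetical, and the 30-minute default simply mirrors the interval mentioned above.

```python
import time
from typing import Callable, Optional


def heartbeat_loop(
    run_cycle: Callable[[], None],
    interval_seconds: float = 30 * 60,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cycles separated by a fixed sleep; return how many cycles ran."""
    completed = 0
    while max_cycles is None or completed < max_cycles:
        run_cycle()
        completed += 1
        if max_cycles is not None and completed >= max_cycles:
            break  # skip the sleep after the final cycle
        sleep(interval_seconds)
    return completed
```

Leaving `max_cycles` as `None` gives an unbounded loop, which is the always-on case. Bounding it and injecting `sleep` makes the loop easy to exercise in a test without waiting.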

Quick Start

# Install from source (recommended — re:factory evolves fast)
git clone https://github.com/akashgit/remote-factory.git
cd remote-factory && uv sync && uv tool install -e .

# Register the CEO as a Claude Code agent
factory install

Prerequisites: Python 3.11+ and Claude Code (installed and authenticated). No external services, databases, or Obsidian required — re:factory stores all state locally.

Per-project state lives in .factory/ (experiment history, strategy, archive notes). Global state lives in ~/.factory/ (project registry, evolved playbooks). Projects are auto-registered when experiments begin — no manual setup needed. See Setup Guide for environment variables and authentication options.
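As a quick orientation, the two state locations can be expressed with `pathlib`. Only the directory names `.factory/`, `.factory/archive/`, and `~/.factory/` come from this page. The helper `state_locations` and its dictionary keys are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional


def state_locations(project: Path, home: Optional[Path] = None) -> Dict[str, Path]:
    """Map the per-project and global state directories described above."""
    home = home if home is not None else Path.home()
    return {
        "project_state": project / ".factory",         # experiment history, strategy
        "archive": project / ".factory" / "archive",   # archive notes
        "global_state": home / ".factory",             # project registry, playbooks
    }
```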

Self-Evolving Agents

re:factory doesn't just improve your project — it improves itself. Every keep/revert decision becomes training data for the next cycle.

This is powered by ACE (Autonomous Context Engineering), a Reflect → Curate → Inject loop that evolves agent playbooks from real experiment outcomes. ACE is inspired by Anthropic's work on context engineering.

graph LR
    A["Experiment Outcomes<br><i>kept or reverted</i>"] -->|Reflect| B["Generate<br>candidate bullets"]
    B -->|Curate| C["Merge & prune<br>playbooks"]
    C -->|Inject| D["Agent Prompts<br><i>auto-appended</i>"]
    D -.->|"next cycle"| A

    style A fill:#fff3e0,stroke:#ff8f00
    style D fill:#e8eaf6,stroke:#5c6bc0

Each agent accumulates behavioral rules — DOs and DON'Ts — with evidence counters. Rules that correlate with kept experiments get reinforced. Rules that correlate with reverts get pruned.
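A rough sketch of the Reflect → Curate → Inject idea, with rules that carry evidence counters. Everything here is an assumption made for illustration: the `Rule` fields, the `min_evidence` threshold, the pruning criterion (more reverts than keeps), and the prompt format are hypothetical and do not describe ACE's real playbook schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str          # a DO or DON'T bullet
    kept: int = 0      # kept experiments that followed this rule
    reverted: int = 0  # reverted experiments that followed this rule


def reflect(rule: Rule, verdict: str) -> None:
    """Update a rule's evidence counters from one experiment outcome."""
    if verdict == "KEEP":
        rule.kept += 1
    elif verdict == "REVERT":
        rule.reverted += 1
    else:
        raise ValueError(f"unknown verdict: {verdict!r}")


def curate(playbook: List[Rule], min_evidence: int = 3) -> List[Rule]:
    """Drop rules with enough evidence that mostly precede reverts."""
    survivors = []
    for rule in playbook:
        total = rule.kept + rule.reverted
        if total >= min_evidence and rule.reverted > rule.kept:
            continue  # pruned
        survivors.append(rule)
    return survivors


def inject(base_prompt: str, playbook: List[Rule]) -> str:
    """Append the surviving rules to an agent prompt."""
    if not playbook:
        return base_prompt
    bullets = "\n".join(
        f"- {r.text} (kept={r.kept}, reverted={r.reverted})" for r in playbook
    )
    return f"{base_prompt}\n\nPlaybook:\n{bullets}"
```

Requiring a minimum amount of evidence before pruning avoids discarding a rule after a single unlucky revert.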

# Run a full improvement cycle, then evolve all agent playbooks
factory ceo ~/my-project --mode meta

See Self-Improvement Loop for the full picture — how the CEO tracks agents, how cross-project learning works, and how the CEO improves itself. See ACE Playbook Evolution for the playbook mechanics.

Architecture

graph TB
    subgraph agents ["Specialist Agents"]
        R["Researcher"] ~~~ S["Strategist"] ~~~ BU["Builder"]
        RE["Reviewer"] ~~~ EV["Evaluator"] ~~~ AR["Archivist"]
        RF["Refiner"] ~~~ FA["Failure Analyst"]
    end
    subgraph ceo ["CEO Agent"]
        C["Detect state → Route mode → Spawn agents → Keep/Revert → Archive"]
    end
    subgraph cli ["Python CLI"]
        T["eval · guard · store · discover · events · strategy"]
    end

    agents --> ceo --> cli

    style agents fill:#e8eaf6,stroke:#5c6bc0
    style ceo fill:#fff3e0,stroke:#ff8f00
    style cli fill:#e8f5e9,stroke:#43a047

The Eval System

graph LR
    subgraph hygiene ["Hygiene · 6 dims"]
        H1["tests · lint · types<br>coverage · guards · config"]
    end
    subgraph growth ["Growth · 5 dims"]
        G1["capability · diversity<br>observability · research<br>effectiveness"]
    end
    subgraph project ["Project · N dims"]
        P1["your custom metrics<br>benchmarks · latency<br>accuracy · win rate"]
    end

    hygiene --> M["⚖️ Weighted<br>Composite"]
    growth --> M
    project --> M
    M --> S{"score ≥<br>threshold?"}
    S -- "yes" --> K["✅ Keep"]
    S -- "no" --> R["↩️ Revert"]

    style hygiene fill:#e8eaf6,stroke:#5c6bc0
    style growth fill:#fff3e0,stroke:#ff8f00
    style project fill:#e8f5e9,stroke:#43a047
    style K fill:#43a047,color:#fff
    style R fill:#e53935,color:#fff

| Tier | What it measures | Examples |
| --- | --- | --- |
| Hygiene (6 dimensions) | Code quality basics | Tests, lint, type checking, coverage |
| Growth (5 dimensions) | Capability evolution | API surface area, experiment diversity, observability |
| Project (user-defined) | Domain-specific metrics | Benchmark accuracy, latency, win rate |
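The weighted composite in the diagram can be illustrated as a weighted average of per-tier scores, followed by a threshold check. The weights, the 0-to-1 score scale, the threshold value, and the function names below are hypothetical. The page does not state how re:factory weights the tiers.

```python
from typing import Dict


def composite_score(tier_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of tier scores (assumed to lie in [0, 1])."""
    total_weight = sum(weights[tier] for tier in tier_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(tier_scores[tier] * weights[tier] for tier in tier_scores)
    return weighted / total_weight


def verdict(score: float, threshold: float) -> str:
    """Keep when the composite meets the threshold, as in the diagram."""
    return "KEEP" if score >= threshold else "REVERT"


# Illustrative numbers only.
scores = {"hygiene": 1.0, "growth": 0.5, "project": 0.0}
weights = {"hygiene": 2.0, "growth": 1.0, "project": 1.0}
print(verdict(composite_score(scores, weights), threshold=0.6))
```

Normalizing by the total weight keeps the composite on the same scale as the tier scores, so a threshold stays meaningful when a project adds custom dimensions.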

Built with re:factory

re:factory has shipped something every day for the last 30 days — products, research experiments, production features, papers. Here are a few examples:

| Project | What it does | Mode |
| --- | --- | --- |
| SWE-bench solver | Autonomous agent that resolves GitHub issues from the SWE-bench dataset, iteratively improved via failure analysis | Research |
| HMMT math solver | Multi-agent team (Explorer, Theorist, Computationalist, Critic, Synthesizer) that solved HMMT Feb 2025 Combinatorics Problem 7 | Research |
| Text/Sketch → CAD | Converts natural language and hand-drawn sketches into executable CadQuery code for 3D model generation | Research |
| HLS design space explorer | Per-function AI agents explore HLS pragma/code variants in parallel, an ILP solver finds the optimal combination, then global expert agents apply cross-function optimizations, achieving up to 92% execution time reduction on cryptographic benchmarks | Build |
| Pluck | iOS app that extracts structured data from screenshots, links, and shared content using on-device AI | Build + Improve |
| Group chat digest | Turns iMessage group chats into weekly family newsletters with AI-curated highlights and photo selection | Build + Improve |
| Production enterprise features | Complete UI components and backend features shipped into a large-scale production codebase | Focus + Improve |
| re:factory itself | re:factory runs on itself in meta mode: its own agent playbooks are evolved from its own experiment outcomes | Meta |

Built something with re:factory? Open a PR to add it here.

License

MIT — Akash Srivastava