The Factory

Describe what you want. The Factory builds it, tests it, and keeps improving it — autonomously.

You give it a spec file, a rough idea, or an existing codebase. The Factory researches best practices, scaffolds the project, sets up evaluation, and runs a continuous improvement loop — measuring every change and keeping only what makes things better. The agents that do this work learn from every experiment and get sharper over time.

# Build — have a fleshed-out idea? Pass the file.
factory ceo ~/ideas/weather-dashboard.md

# Interactive — just starting to think about it? Brainstorm first.
factory ceo "distributed eval runner" --mode interactive

# Research — have a metric to optimize? The factory runs experiments.
factory ceo "SWE-bench solver agent" --mode research

# Improve — point it at any codebase
factory ceo ~/my-project

# Focus — build exactly one thing
factory ceo ~/my-project --focus "add WebSocket support"

How It Works

graph LR
    A["🔍 Researcher<br><i>observe</i>"] --> B["🎯 Strategist<br><i>hypothesize</i>"]
    B --> C["🔨 Builder<br><i>implement</i>"]
    C --> RV["🛡️ Reviewer<br><i>guard</i>"]
    RV --> D["📊 Evaluator<br><i>measure</i>"]
    D --> E{"CEO<br><i>decide</i>"}
    E -- "score ↑" --> F["✅ KEEP"]
    E -- "score ↓" --> G["↩️ REVERT"]
    F --> H["📝 Archivist<br><i>record</i>"]
    G --> H
    H -.-> A

    style E fill:#5c6bc0,color:#fff,stroke:#3949ab
    style F fill:#43a047,color:#fff,stroke:#2e7d32
    style G fill:#e53935,color:#fff,stroke:#c62828

A CEO agent orchestrates eight specialists — Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Distiller, and Failure Analyst — each running as an independent Claude Code subprocess. The Researcher searches the web and reads prior knowledge from the archive. The Strategist generates ranked hypotheses. The Builder implements one on an experiment branch. The Reviewer guards the change before it is measured. The Evaluator scores the project before and after. The CEO decides keep or revert. The Archivist records everything to .factory/archive/ and regenerates performance reports for cross-project learning. In interactive mode, the Distiller synthesizes research into a buildable spec through user feedback. In research mode, the Failure Analyst classifies run failures to guide targeted hypothesis generation.
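The cycle above can be pictured as a simple loop body. This is a conceptual sketch only: spawn_agent and decide_keep are illustrative stand-ins, not the factory's actual subprocess API.

```python
# Agent order for one standard cycle, taken from the pipeline described above.
PIPELINE = ["Researcher", "Strategist", "Builder", "Reviewer", "Evaluator"]

def run_cycle(spawn_agent, decide_keep) -> str:
    # Spawn each specialist in order and collect its result.
    results = {name: spawn_agent(name) for name in PIPELINE}
    # The CEO's keep/revert decision follows the Evaluator's measurement.
    verdict = "KEEP" if decide_keep(results) else "REVERT"
    # The Archivist records the outcome either way.
    spawn_agent("Archivist")
    return verdict
```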

Workflows

Build — start from an idea

factory ceo "Build a REST API for bookmark management"
factory ceo ~/ideas/weather-dashboard.md
factory ceo https://github.com/user/repo

Give the Factory an idea (raw string, spec file, or GitHub URL) and it builds a complete project: scaffolding, tests, eval, and iterative improvement.

Improve — make an existing codebase better

factory ceo ~/my-project
factory run ~/my-project --loop

Point it at any codebase. Each cycle observes the project, hypothesizes changes, implements one, and keeps it only if the score goes up.
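The observe → hypothesize → keep-or-revert cycle boils down to this shape (illustrative function arguments, not the factory's real API):

```python
def improve_cycle(evaluate, apply_change, revert_change) -> bool:
    # Score the project before touching anything.
    before = evaluate()
    # Implement exactly one hypothesized change.
    apply_change()
    after = evaluate()
    if after > before:
        return True   # keep: measurable improvement
    revert_change()
    return False      # revert: the change didn't help
```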

Focus — build exactly one thing

factory ceo ~/my-project --focus "add authentication middleware"

When you know exactly what you want, --focus pins a single backlog item, generates one hypothesis, runs one experiment, and exits. The entire pipeline is scoped to that single target.

Interactive — brainstorm before building

factory ceo "distributed eval runner" --mode interactive

Have a rough idea? Interactive mode researches the space, drafts a structured spec via the Distiller agent, and lets you iterate on it before any code is written.

Research — optimize a metric iteratively

factory ceo "SWE-bench solver agent" --mode research
factory ceo ~/my-research-project --mode research

For projects with a measurable target metric (benchmark accuracy, solve rate, query precision). Research mode replaces the standard Improve loop with a specialized cycle: Baseline → Failure Analyst → Researcher → Strategist → Builder → Run → Verdict. Leakage guards prevent ground truth from contaminating hypotheses, and monotonic improvement ensures the metric never regresses below the previous best. See Getting Started for the full picture.
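A minimal sketch of the monotonic-improvement guard, assuming a hypothetical run_experiment callable that returns the target metric for one run:

```python
def research_loop(run_experiment, baseline: float, n_cycles: int) -> float:
    best = baseline
    for _ in range(n_cycles):
        score = run_experiment()
        if score > best:
            best = score   # keep: new best checkpoint
        # otherwise revert, so best never regresses below the previous best
    return best
```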

Headless & continuous loop

For unattended operation — scripting, cron jobs, or always-on machines:

# Headless — pipe mode, no interaction
factory ceo ~/my-project --headless

# Loop — continuous improvement (default: every 30 min)
factory run ~/my-project --loop

# Detached tmux — loop in the background
factory tmux ~/my-project --loop

--headless disables the interactive session. --loop wraps the CEO in a heartbeat loop: run one cycle, sleep, repeat. Combine with factory tmux to leave the Factory running on an always-on machine. See Getting Started for full details.
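Conceptually, the heartbeat wrapper is just this (an illustrative sketch; run_cycle and max_cycles are stand-ins, and the real loop runs until stopped):

```python
import time

def heartbeat(run_cycle, interval_s: float = 1800.0, max_cycles=None) -> int:
    done = 0
    while max_cycles is None or done < max_cycles:
        run_cycle()          # one full improvement cycle
        done += 1
        time.sleep(interval_s)  # default: every 30 minutes
    return done
```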

Quick Start

# Install from source (recommended — the factory evolves fast)
git clone https://github.com/akashgit/remote-factory.git
cd remote-factory && uv sync && uv tool install -e .

# Register the CEO as a Claude Code agent
factory install

Prerequisites: Python 3.11+ and Claude Code (installed and authenticated). No external services, databases, or Obsidian required — the factory stores all state locally.

Per-project state lives in .factory/ (experiment history, strategy, archive notes). Global state lives in ~/.factory/ (project registry, evolved playbooks). Projects are auto-registered when experiments begin — no manual setup needed. See Setup Guide for environment variables and authentication options.

Self-Evolving Agents

The factory doesn't just improve your project — it improves itself. Every keep/revert decision becomes training data for the next cycle.

This is powered by ACE (Autonomous Context Engineering) — inspired by Anthropic's work on context engineering — a Reflect → Curate → Inject loop that evolves agent playbooks from real experiment outcomes.

graph LR
    A["Experiment Outcomes<br><i>kept or reverted</i>"] -->|Reflect| B["Generate<br>candidate bullets"]
    B -->|Curate| C["Merge & prune<br>playbooks"]
    C -->|Inject| D["Agent Prompts<br><i>auto-appended</i>"]
    D -.->|"next cycle"| A

    style A fill:#fff3e0,stroke:#ff8f00
    style D fill:#e8eaf6,stroke:#5c6bc0

Each agent accumulates behavioral rules — DOs and DON'Ts — with evidence counters. Rules that correlate with kept experiments get reinforced. Rules that correlate with reverts get pruned.
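One way to picture the evidence counters, as a hedged sketch: field names and the prune threshold below are assumptions for illustration, not the factory's actual data model.

```python
def update_playbook(rules, rule_id, kept, prune_below=0.4):
    # Reinforce or count against the rule tied to this experiment outcome.
    for rule in rules:
        if rule["id"] == rule_id:
            rule["kept" if kept else "reverted"] += 1

    def win_rate(rule):
        total = rule["kept"] + rule["reverted"]
        return rule["kept"] / total if total else 1.0  # untested rules survive

    # Prune rules that correlate with reverts more than with keeps.
    return [rule for rule in rules if win_rate(rule) >= prune_below]
```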

# Run a full improvement cycle, then evolve all agent playbooks
factory ceo ~/my-project --mode meta

See Self-Improvement Loop for the full picture — how the CEO tracks agents, how cross-project learning works, and how the CEO improves itself. See ACE Playbook Evolution for the playbook mechanics.

Architecture

graph TB
    subgraph agents ["Specialist Agents"]
        R["Researcher"] ~~~ S["Strategist"] ~~~ BU["Builder"]
        RE["Reviewer"] ~~~ EV["Evaluator"] ~~~ AR["Archivist"]
        DI["Distiller"] ~~~ FA["Failure Analyst"]
    end
    subgraph ceo ["CEO Agent"]
        C["Detect state → Route mode → Spawn agents → Keep/Revert → Archive"]
    end
    subgraph cli ["Python CLI"]
        T["eval · guard · store · discover · events · strategy"]
    end

    agents --> ceo --> cli

    style agents fill:#e8eaf6,stroke:#5c6bc0
    style ceo fill:#fff3e0,stroke:#ff8f00
    style cli fill:#e8f5e9,stroke:#43a047

The Eval System

graph LR
    subgraph hygiene ["Hygiene · 6 dims"]
        H1["tests · lint · types<br>coverage · guards · config"]
    end
    subgraph growth ["Growth · 5 dims"]
        G1["capability · diversity<br>observability · research<br>effectiveness"]
    end
    subgraph project ["Project · N dims"]
        P1["your custom metrics<br>benchmarks · latency<br>accuracy · win rate"]
    end

    hygiene --> M["⚖️ Weighted<br>Composite"]
    growth --> M
    project --> M
    M --> S{"score ≥<br>threshold?"}
    S -- "yes" --> K["✅ Keep"]
    S -- "no" --> R["↩️ Revert"]

    style hygiene fill:#e8eaf6,stroke:#5c6bc0
    style growth fill:#fff3e0,stroke:#ff8f00
    style project fill:#e8f5e9,stroke:#43a047
    style K fill:#43a047,color:#fff
    style R fill:#e53935,color:#fff

| Tier | What it measures | Examples |
| --- | --- | --- |
| Hygiene (6 dimensions) | Code quality basics | Tests, lint, type checking, coverage |
| Growth (5 dimensions) | Capability evolution | API surface area, experiment diversity, observability |
| Project (user-defined) | Domain-specific metrics | Benchmark accuracy, latency, win rate |
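Under assumed tier weights (the real weights and threshold are configurable and not stated here), the composite works roughly like this:

```python
def composite_score(hygiene, growth, project, weights=(0.4, 0.3, 0.3)):
    # Average each tier's dimensions (scored in [0, 1]), then blend tiers.
    def tier(dims):
        return sum(dims.values()) / len(dims) if dims else 0.0
    return sum(w * tier(d) for w, d in zip(weights, (hygiene, growth, project)))

def verdict(score, threshold):
    # The keep/revert gate from the diagram above.
    return "KEEP" if score >= threshold else "REVERT"
```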

Built with the Factory

The factory has shipped something every day for the last 30 days — products, research experiments, production features, papers. Here are a few examples:

| Project | What it does | Mode |
| --- | --- | --- |
| SWE-bench solver | Autonomous agent that resolves GitHub issues from the SWE-bench dataset, iteratively improved via failure analysis | Research |
| HMMT math solver | Multi-agent team (Explorer, Theorist, Computationalist, Critic, Synthesizer) that solved HMMT Feb 2025 Combinatorics Problem 7 | Research |
| Text/Sketch → CAD | Converts natural language and hand-drawn sketches into executable CadQuery code for 3D model generation | Research |
| HLS design space explorer | Per-function AI agents explore HLS pragma/code variants in parallel, an ILP solver finds the optimal combination, then global expert agents apply cross-function optimizations — achieving up to 92% execution time reduction on cryptographic benchmarks | Build |
| Pluck | iOS app that extracts structured data from screenshots, links, and shared content using on-device AI | Build |
| Group chat digest | Turns iMessage group chats into weekly family newsletters with AI-curated highlights and photo selection | Build + Improve |
| Production enterprise features | Complete UI components and backend features shipped into a large-scale production codebase | Focus + Improve |
| The Factory itself | The factory runs on itself in meta mode — its own agent playbooks are evolved from its own experiment outcomes | Meta |

Built something with the Factory? Open a PR to add it here.

License

MIT — Akash Srivastava