Getting Started

This guide follows the lifecycle of a Factory project — from a one-line idea through autonomous improvement and back to your steering wheel.

Prerequisites

Make sure you've completed the Setup steps:

  • Python 3.11+
  • Claude Code installed and authenticated
  • The Factory installed (factory --help should work)

The Lifecycle

Every Factory project follows the same arc:

Idea → Build → Backlog appears → Improve (auto / focus / prompt / issues) → Steer → Loop

The Factory handles the transitions automatically. You decide when to intervene.


1. Start from an Idea

The Factory accepts three entry points depending on how far along your thinking is.

Build — you know what you want

The simplest path. Describe what you want and the Factory handles everything else:

factory ceo "Build a CLI that converts CSV to JSON with streaming support"

This will:

  1. Create a project directory at ~/factory-projects/build-a-cli-that-converts-csv-to-json-with-streami/
  2. Initialize a git repo and scaffold the project
  3. Save your prompt as the build spec (.factory/strategy/current.md)
  4. Launch the CEO agent in Build mode

The directory name is derived from your prompt (lowercased, slugified, truncated to 50 chars). Set FACTORY_PROJECTS_DIR to change the parent directory.
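The slug derivation described above can be sketched in Python (an illustrative reimplementation of the described behavior, not the Factory's actual code):

```python
import re

def slugify(prompt: str, max_len: int = 50) -> str:
    # Lowercase, collapse runs of non-alphanumerics into hyphens,
    # trim stray hyphens, then truncate to the length limit.
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return slug[:max_len]

slugify("Build a CLI that converts CSV to JSON with streaming support")
# → "build-a-cli-that-converts-csv-to-json-with-streami"
```

Truncating at exactly 50 characters is why the example directory name above ends mid-word.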

You can also pass a spec file or a GitHub URL:

factory ceo ~/ideas/weather-dashboard.md      # longer spec as markdown
factory ceo https://github.com/user/repo      # clone and improve

Interactive — you have a rough idea

When you want to brainstorm before committing to a design:

factory ceo "distributed eval runner" --mode interactive

Interactive mode runs a three-step loop before any code is written:

  1. Research — the Researcher surveys similar projects, tech stacks, and pitfalls
  2. Distill — the Distiller synthesizes the research into a structured spec (features, architecture, non-goals)
  3. Iterate — the CEO presents the draft to you for feedback. Revise until you approve.

Once you sign off, the spec is persisted and the Factory proceeds to Build mode. Incompatible with --headless and --focus.

Research — you have a metric to optimize

For projects where the goal is to improve a measurable metric against a dataset — benchmarks, model tuning, prompt optimization:

factory ceo "SWE-bench solver agent" --mode research

Research ideation works like interactive mode, but the Distiller collects additional configuration:

  • Research Target — the metric to improve, the command to run evaluation, where results are written
  • Mutable Surfaces — files the Builder is allowed to modify
  • Fixed Surfaces — ground truth data and eval infrastructure that must never be touched
  • Research Constraints — additional rules (e.g., "do not use GPT-4 for cost reasons")

Once you approve the spec, the Factory builds the project and transitions to the research improvement loop. See Research Mode in Detail below.


2. The Build Phase

Whichever entry point you chose, Build mode follows the same sequence:

  1. The Researcher does a focused research pass ("how do we build this?")
  2. The Strategist creates a phased implementation plan
  3. The Builder implements each phase, opening PRs along the way
  4. An E2E verification gate confirms the project actually runs

When Build completes, the project has code, tests, a factory.md configuration, and a discovered eval profile. Items that were deferred during build — performance improvements, edge cases, nice-to-haves — appear in the backlog.


3. The Backlog Appears

After the first build, the Factory creates .factory/strategy/backlog.md — a unified work queue that feeds all future improvement. The backlog accumulates items from several sources:

  • Features deferred during initial build
  • Issues you file on GitHub
  • Items the Researcher discovers during observation
  • Ideas you add manually with factory backlog-add

Manage the queue from the command line:

factory backlog-list ~/my-project                     # see what's queued
factory backlog-add ~/my-project "add rate limiting"  # add your own item
factory backlog-remove ~/my-project "old item"        # remove a completed item

4. Improve — The Core Loop

Point the Factory at an existing codebase and it runs the improvement cycle:

factory ceo ~/my-project

If the project already has a .factory/ directory, the Factory resumes where it left off. If not, it runs discovery first — detecting the language, framework, and test setup — then starts improving.

What happens in a cycle

  1. Observe — the Researcher analyzes the project and searches for best practices
  2. Hypothesize — the Strategist generates ranked hypotheses from the backlog using FEEC priority (Fix > Exploit > Explore > Combine)
  3. Build — the Builder implements one hypothesis on an experiment branch
  4. Guard — the Reviewer checks for guard violations and code quality
  5. Measure — the Evaluator scores before and after using the three-tier eval system
  6. Decide — the CEO runs the precheck (a non-overridable hard gate), then keeps the change if the score improved or reverts it if it regressed
  7. Record — the Archivist records the outcome for future learning
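The FEEC ranking in step 2 amounts to sorting by category priority, then by estimated impact. A sketch, where the `category` and `impact` hypothesis fields are assumptions for illustration, not the Factory's real schema:

```python
# Fix > Exploit > Explore > Combine: lower number means higher priority.
FEEC_PRIORITY = {"fix": 0, "exploit": 1, "explore": 2, "combine": 3}

def rank_hypotheses(hypotheses: list[dict]) -> list[dict]:
    # Sort by FEEC category first, then by impact (descending) within a category.
    return sorted(
        hypotheses,
        key=lambda h: (FEEC_PRIORITY[h["category"]], -h["impact"]),
    )
```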

Each cycle produces a numbered experiment directory under .factory/experiments/ with the hypothesis, diffs, eval results, and verdict.


5. Steering the Factory

The Factory runs autonomously, but you have four ways to steer it:

--focus — build exactly one thing

When you know exactly what you want, --focus pins a single backlog item, generates one hypothesis, runs one experiment, and exits:

factory ceo ~/my-project --focus "add authentication middleware"
factory ceo ~/my-project --focus "fix the CSV export bug"

The entire pipeline is scoped to that single target — the Researcher focuses its research, the Strategist generates exactly one hypothesis, and after the keep/revert decision the cycle ends. Mutually exclusive with --loop.

--prompt — give general direction

Nudge the Strategist's hypothesis generation without pinning a specific item:

factory ceo ~/my-project --prompt "focus on performance improvements"

GitHub Issues — async steering

File issues on the project's GitHub repo. The Strategist reads open issues and factors them into hypothesis ranking:

gh issue create --title "Add WebSocket support" --body "Need real-time updates for the dashboard"

backlog-add — queue an item

Add items directly to the backlog for the next cycle to pick up:

factory backlog-add ~/my-project "add structured logging"

6. Continuous Loop

For unattended operation, wrap the CEO in a heartbeat loop:

factory run ~/my-project --loop                    # every 30 min (default)
factory run ~/my-project --loop --interval 900     # every 15 min
factory run ~/my-project --loop --max-cycles 5     # stop after 5 cycles

For long-running sessions, use tmux:

factory tmux ~/my-project --loop              # launches in a detached tmux session
factory tmux-ls                               # list active factory sessions
factory tmux-stop --path ~/my-project         # stop a session

Interactive vs headless

By default, factory ceo launches an interactive Claude Code session — you can see what the agents are doing and intervene if needed:

factory ceo ~/my-project              # interactive (default)
factory ceo ~/my-project --headless   # pipe mode, no interaction

Research Mode in Detail

Research mode replaces the standard Improve loop with a specialized cycle designed for metric optimization against a dataset. It adds the Failure Analyst agent, leakage guards, and monotonic improvement enforcement.

When to use it

Use research mode when your project has a measurable target metric and a reproducible evaluation command — benchmarks (SWE-bench, HumanEval), model accuracy, prompt optimization, CAD query systems, mathematical reasoning.

Configuring a research project

The research target is configured in factory.md:

## Research Target
- objective: maximize SWE-bench resolve rate
- metric: resolved/total
- target: 0.35
- run_command: python run_benchmark.py
- result_path: results/output.json
- timeout: 3600

## Mutable Surfaces
- src/agent.py
- src/localization.py
- prompts/*.md

## Fixed Surfaces
- eval/
- data/ground_truth.json
- tests/

Mutable surfaces are files the Builder can change. Fixed surfaces are ground truth data and eval infrastructure that must never be modified. Fixed surfaces are fingerprinted for leakage detection.
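Fingerprinting can be as simple as content-hashing every fixed-surface file so a later cycle can verify nothing was touched (an illustrative sketch; the Factory's actual mechanism is not specified here):

```python
import hashlib
import pathlib

def fingerprint(paths: list[pathlib.Path]) -> dict[str, str]:
    # Map each fixed-surface file to a SHA-256 digest of its contents;
    # a changed digest on a later cycle means the file was modified.
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}
```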

The research cycle

Research mode follows seven phases:

| Phase | Agent | What happens |
|---|---|---|
| R0 — Baseline | Evaluator | Run run_command, record starting metric |
| R1 — Failure Analysis | Failure Analyst | Classify failures by root cause, aggregate into categories, suggest interventions |
| R1.5 — Research | Researcher | Search web for targeted solutions to dominant failure patterns |
| R2 — Strategy | Strategist | Generate 1–3 hypotheses targeting dominant failure modes |
| R3 — Build | Builder | Implement hypothesis, modifying only mutable surfaces |
| R4 — Run | Evaluator | Re-run run_command, extract new metric |
| R5 — Verdict | CEO | Keep if metric improved monotonically; revert otherwise |

Cycle progression example

A SWE-bench solver agent improving over five cycles:

| Cycle | Metric | Failure Mode Targeted | Verdict | Cumulative |
|---|---|---|---|---|
| 000 | 0.18 | — (baseline) | — | 0.18 |
| 001 | 0.22 | FILE_NOT_FOUND — agent searched wrong directories | KEEP | 0.22 |
| 002 | 0.24 | SYNTAX_ERROR — generated patches had indentation bugs | KEEP | 0.24 |
| 003 | 0.21 | TIMEOUT — overly broad search strategy | REVERT | 0.24 |
| 004 | 0.27 | INCOMPLETE_EDIT — partial file modifications | KEEP | 0.27 |
| 005 | 0.30 | WRONG_FILE — localization errors | KEEP | 0.30 |

Cycle 003 regressed below the previous best (0.24), so it was automatically reverted. The metric ratchets forward — it can never go below the previous best.
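The ratchet amounts to a one-line comparison against the best metric so far (a minimal sketch, not the Factory's implementation):

```python
def verdict(new_metric: float, best_so_far: float) -> tuple[str, float]:
    # Keep only strict improvements; anything else reverts, and the
    # cumulative best carries forward unchanged.
    if new_metric > best_so_far:
        return "KEEP", new_metric
    return "REVERT", best_so_far
```

Replaying the cycles above: 0.21 at cycle 003 loses to the running best of 0.24, so the verdict is REVERT and 0.24 carries into cycle 004.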

Leakage guards

Research mode enforces three layers of ground truth protection:

  1. Token overlap — fingerprints fixed surface files and checks hypothesis/diff text for suspicious token overlap using Jaccard similarity
  2. Negation hints — detects patterns like "do NOT use subtraction" that encode ground truth by exclusion
  3. Specific values — extracts numeric literals and quoted strings from fixed surfaces, flags if they appear in hypothesis text
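A minimal version of the token-overlap check (layer 1) could look like this; the tokenizer and the 0.3 threshold are illustrative assumptions:

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def token_overlap_risk(fixed_text: str, hypothesis_text: str,
                       threshold: float = 0.3) -> bool:
    # Jaccard similarity between the two token sets; high overlap
    # suggests the hypothesis is echoing ground-truth content.
    a, b = _tokens(fixed_text), _tokens(hypothesis_text)
    if not (a | b):
        return False
    return len(a & b) / len(a | b) >= threshold
```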

Leakage checks run at three hard gates: Strategy review, Builder review, and Precheck. A medium or high leakage risk triggers an automatic redirect or revert.

Running research mode

# New research project (ideation → build → research loop)
factory ceo "SWE-bench solver agent" --mode research

# Existing research project (skip ideation, run research loop)
factory ceo ~/my-swe-bench-solver --mode research

# Focus on a specific hypothesis within research mode
factory ceo ~/my-swe-bench-solver --mode research --focus "try chain-of-thought prompting"

# Continuous research loop
factory run ~/my-swe-bench-solver --mode research --loop

Named use cases

| Project | Metric | Mutable Surfaces | What improves |
|---|---|---|---|
| SWE-bench solver | resolve rate | agent logic, prompts, localization | Patch generation accuracy |
| Mathematical reasoning | solve rate | chain-of-thought templates, tool calls | Proof strategy selection |
| CAD query optimization | query accuracy | query builder, schema mapping | Entity resolution, join logic |

Writing a factory.md

Once the CEO creates your project, it auto-generates a factory.md configuration file. You can also write one manually for more control:

## Goal
A CLI tool that converts CSV files to JSON with streaming support.

## Scope
### Modifiable
- src/**
- tests/**

## Guards
- Do not delete existing tests
- Do not modify files outside scope

## Eval
### Command
pytest --tb=short -q

### Threshold
0.8
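
One way a pass-rate score could be computed against the 0.8 threshold is by reading pytest's summary line (a sketch; the Factory's actual eval parsing is not documented here):

```python
import re

def parse_score(pytest_output: str) -> float:
    # Read counts from a summary like "3 passed, 1 failed in 0.12s"
    # and return the pass rate as a 0..1 score.
    passed = int(m.group(1)) if (m := re.search(r"(\d+) passed", pytest_output)) else 0
    failed = int(m.group(1)) if (m := re.search(r"(\d+) failed", pytest_output)) else 0
    total = passed + failed
    return passed / total if total else 0.0

parse_score("3 passed, 1 failed in 0.12s")  # → 0.75
```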

See the Configuration Reference for all available sections.

Next Steps