Getting Started¶
This guide follows the lifecycle of a Factory project — from a one-line idea through autonomous improvement and back to your steering wheel.
Prerequisites¶
Make sure you've completed the Setup steps:
- Python 3.11+
- Claude Code installed and authenticated
- The Factory installed (`factory --help` should work)
The Lifecycle¶
Every Factory project follows the same arc:
The Factory handles the transitions automatically. You decide when to intervene.
1. Start from an Idea¶
The Factory accepts three entry points depending on how far along your thinking is.
Build — you know what you want¶
The simplest path. Describe what you want and the Factory handles everything else:

factory ceo "Build a CLI that converts CSV to JSON with streaming"
This will:
- Create a project directory at ~/factory-projects/build-a-cli-that-converts-csv-to-json-with-streami/
- Initialize a git repo and scaffold the project
- Save your prompt as the build spec (.factory/strategy/current.md)
- Launch the CEO agent in Build mode
The directory name is derived from your prompt (lowercased, slugified, truncated to 50 chars). Set FACTORY_PROJECTS_DIR to change the parent directory.
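The slug derivation above can be sketched as follows (a minimal approximation for illustration; the Factory's actual implementation may differ):

```python
import re

def slugify(prompt: str, max_len: int = 50) -> str:
    """Lowercase, collapse non-alphanumeric runs into hyphens, truncate."""
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return slug[:max_len]

print(slugify("Build a CLI that converts CSV to JSON with streaming"))
# build-a-cli-that-converts-csv-to-json-with-streami
```

Note that truncation can cut mid-word, which is why the example directory name ends in "streami".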
You can also pass a spec file or a GitHub URL:
factory ceo ~/ideas/weather-dashboard.md # longer spec as markdown
factory ceo https://github.com/user/repo # clone and improve
Interactive — you have a rough idea¶
When you want to brainstorm before committing to a design, use interactive mode. It runs a three-step loop before any code is written:
- Research — the Researcher surveys similar projects, tech stacks, and pitfalls
- Distill — the Distiller synthesizes the research into a structured spec (features, architecture, non-goals)
- Iterate — the CEO presents the draft to you for feedback. Revise until you approve.
Once you sign off, the spec is persisted and the Factory proceeds to Build mode. Interactive mode is incompatible with --headless and --focus.
Research — you have a metric to optimize¶
For projects where the goal is to improve a measurable metric against a dataset — benchmarks, model tuning, prompt optimization:

factory ceo "SWE-bench solver agent" --mode research

Research ideation works like interactive mode, but the Distiller collects additional configuration:
- Research Target — the metric to improve, the command to run evaluation, where results are written
- Mutable Surfaces — files the Builder is allowed to modify
- Fixed Surfaces — ground truth data and eval infrastructure that must never be touched
- Research Constraints — additional rules (e.g., "do not use GPT-4 for cost reasons")
Once you approve the spec, the Factory builds the project and transitions to the research improvement loop. See Research Mode in Detail below.
2. The Build Phase¶
Whichever entry point you chose, Build mode follows the same sequence:
- The Researcher does a focused research pass ("how do we build this?")
- The Strategist creates a phased implementation plan
- The Builder implements each phase, opening PRs along the way
- An E2E verification gate confirms the project actually runs
When Build completes, the project has code, tests, a factory.md configuration, and a discovered eval profile. Items that were deferred during build — performance improvements, edge cases, nice-to-haves — appear in the backlog.
3. The Backlog Appears¶
After the first build, the Factory creates .factory/strategy/backlog.md — a unified work queue that feeds all future improvement. The backlog accumulates items from several sources:
- Features deferred during initial build
- Issues you file on GitHub
- Items the Researcher discovers during observation
- Ideas you add manually with `factory backlog-add`
factory backlog-list ~/my-project # see what's queued
factory backlog-add ~/my-project "add rate limiting" # add your own item
factory backlog-remove ~/my-project "old item" # remove a completed item
4. Improve — The Core Loop¶
Point the Factory at an existing codebase and it runs the improvement cycle:

factory ceo ~/my-project
If the project already has a .factory/ directory, the Factory resumes where it left off. If not, it runs discovery first — detecting the language, framework, and test setup — then starts improving.
What happens in a cycle¶
- Observe — the Researcher analyzes the project and searches for best practices
- Hypothesize — the Strategist generates ranked hypotheses from the backlog using FEEC priority (Fix > Exploit > Explore > Combine)
- Build — the Builder implements one hypothesis on an experiment branch
- Guard — the Reviewer checks for guard violations and code quality
- Measure — the Evaluator scores before and after using the three-tier eval system
- Decide — the CEO runs precheck (non-overridable hard gate) then keeps (score went up) or reverts (score went down)
- Record — the Archivist records the outcome for future learning
Each cycle produces a numbered experiment directory under .factory/experiments/ with the hypothesis, diffs, eval results, and verdict.
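As an illustration of the Hypothesize step, FEEC ordering might rank backlog items like this (the field names and impact scores below are invented for the example, not the Factory's actual data model):

```python
# FEEC priority: Fix > Exploit > Explore > Combine.
FEEC_ORDER = {"fix": 0, "exploit": 1, "explore": 2, "combine": 3}

def rank_hypotheses(hypotheses: list[dict]) -> list[dict]:
    """Sort by FEEC category first, then by estimated impact (descending)."""
    return sorted(hypotheses, key=lambda h: (FEEC_ORDER[h["category"]], -h["impact"]))

backlog = [
    {"category": "explore", "impact": 0.5, "title": "try a new parser"},
    {"category": "fix",     "impact": 0.2, "title": "CSV export bug"},
    {"category": "exploit", "impact": 0.8, "title": "cache eval results"},
]
print([h["title"] for h in rank_hypotheses(backlog)])
# ['CSV export bug', 'cache eval results', 'try a new parser']
```

A fix outranks higher-impact items in later categories; impact only breaks ties within a category.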
5. Steering the Factory¶
The Factory runs autonomously, but you have four ways to steer it:
--focus — build exactly one thing¶
When you know exactly what you want, --focus pins a single backlog item, generates one hypothesis, runs one experiment, and exits:
factory ceo ~/my-project --focus "add authentication middleware"
factory ceo ~/my-project --focus "fix the CSV export bug"
The entire pipeline is scoped to that single target — the Researcher focuses its research, the Strategist generates exactly one hypothesis, and after the keep/revert decision the cycle ends. Mutually exclusive with --loop.
--prompt — give general direction¶
Nudge the Strategist's hypothesis generation without pinning a specific item:

factory ceo ~/my-project --prompt "prioritize performance and robustness"
GitHub Issues — async steering¶
File issues on the project's GitHub repo. The Strategist reads open issues and factors them into hypothesis ranking.
backlog-add — queue an item¶
Add items directly to the backlog for the next cycle to pick up:

factory backlog-add ~/my-project "add rate limiting"
6. Continuous Loop¶
For unattended operation, wrap the CEO in a heartbeat loop:
factory run ~/my-project --loop # every 30 min (default)
factory run ~/my-project --loop --interval 900 # every 15 min
factory run ~/my-project --loop --max-cycles 5 # stop after 5 cycles
For long-running sessions, use tmux:
factory tmux ~/my-project --loop # launches in a detached tmux session
factory tmux-ls # list active factory sessions
factory tmux-stop --path ~/my-project # stop a session
Interactive vs headless¶
By default, factory ceo launches an interactive Claude Code session — you can see what the agents are doing and intervene if needed:
factory ceo ~/my-project # interactive (default)
factory ceo ~/my-project --headless # pipe mode, no interaction
Research Mode in Detail¶
Research mode replaces the standard Improve loop with a specialized cycle designed for metric optimization against a dataset. It adds the Failure Analyst agent, leakage guards, and monotonic improvement enforcement.
When to use it¶
Use research mode when your project has a measurable target metric and a reproducible evaluation command — benchmarks (SWE-bench, HumanEval), model accuracy, prompt optimization, CAD query systems, mathematical reasoning.
Configuring a research project¶
The research target is configured in factory.md:
## Research Target
- objective: maximize SWE-bench resolve rate
- metric: resolved/total
- target: 0.35
- run_command: python run_benchmark.py
- result_path: results/output.json
- timeout: 3600
## Mutable Surfaces
- src/agent.py
- src/localization.py
- prompts/*.md
## Fixed Surfaces
- eval/
- data/ground_truth.json
- tests/
Mutable surfaces are files the Builder can change. Fixed surfaces are ground truth data and eval infrastructure that must never be modified. Fixed surfaces are fingerprinted for leakage detection.
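Fingerprinting can be sketched as follows (a hypothetical illustration using SHA-256; the Factory's actual mechanism is not documented here):

```python
import hashlib
from pathlib import Path

def fingerprint(path: str) -> str:
    """SHA-256 digest of a fixed-surface file, recorded before the Builder runs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def fixed_surfaces_intact(baseline: dict[str, str]) -> bool:
    """Re-hash every fixed surface and compare against the recorded baseline."""
    return all(fingerprint(path) == digest for path, digest in baseline.items())
```

Any change to a fixed surface between baseline and verdict changes its digest, so tampering with ground truth is detectable even if the diff never shows it.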
The research cycle¶
Research mode follows seven phases:
| Phase | Agent | What happens |
|---|---|---|
| R0 — Baseline | Evaluator | Run run_command, record starting metric |
| R1 — Failure Analysis | Failure Analyst | Classify failures by root cause, aggregate into categories, suggest interventions |
| R1.5 — Research | Researcher | Search web for targeted solutions to dominant failure patterns |
| R2 — Strategy | Strategist | Generate 1–3 hypotheses targeting dominant failure modes |
| R3 — Build | Builder | Implement hypothesis, modifying only mutable surfaces |
| R4 — Run | Evaluator | Re-run run_command, extract new metric |
| R5 — Verdict | CEO | Keep if metric improved monotonically; revert otherwise |
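The metric-extraction step in R4 might look like this, assuming a hypothetical result schema with resolved and total counts written to result_path:

```python
import json
from pathlib import Path

def extract_metric(result_path: str = "results/output.json") -> float:
    """Read the eval output and compute resolved/total (schema is hypothetical)."""
    data = json.loads(Path(result_path).read_text())
    return data["resolved"] / data["total"]
```

The actual schema depends on what your run_command writes; the Factory only needs a single comparable number out of it.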
Cycle progression example¶
A SWE-bench solver agent improving over five cycles:
| Cycle | Metric | Failure Mode Targeted | Verdict | Cumulative |
|---|---|---|---|---|
| 000 | 0.18 | — (baseline) | — | 0.18 |
| 001 | 0.22 | FILE_NOT_FOUND — agent searched wrong directories | KEEP | 0.22 |
| 002 | 0.24 | SYNTAX_ERROR — generated patches had indentation bugs | KEEP | 0.24 |
| 003 | 0.21 | TIMEOUT — overly broad search strategy | REVERT | 0.24 |
| 004 | 0.27 | INCOMPLETE_EDIT — partial file modifications | KEEP | 0.27 |
| 005 | 0.30 | WRONG_FILE — localization errors | KEEP | 0.30 |
Cycle 003 regressed below the previous best (0.24), so it was automatically reverted. The metric ratchets forward — it can never go below the previous best.
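The ratchet logic, applied to the cycle table above, can be sketched as:

```python
def verdict(best: float, new_metric: float) -> tuple[str, float]:
    """Keep only if the metric strictly beats the best so far; otherwise revert."""
    if new_metric > best:
        return "KEEP", new_metric
    return "REVERT", best

best = 0.18  # cycle 000 baseline
verdicts = []
for metric in [0.22, 0.24, 0.21, 0.27, 0.30]:  # cycles 001-005
    v, best = verdict(best, metric)
    verdicts.append(v)

print(verdicts)  # ['KEEP', 'KEEP', 'REVERT', 'KEEP', 'KEEP']
print(best)      # 0.3
```

Cycle 003's 0.21 never becomes the cumulative value; the best-so-far carries forward.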
Leakage guards¶
Research mode enforces three layers of ground truth protection:
- Token overlap — fingerprints fixed surface files and checks hypothesis/diff text for suspicious token overlap using Jaccard similarity
- Negation hints — detects patterns like "do NOT use subtraction" that encode ground truth by exclusion
- Specific values — extracts numeric literals and quoted strings from fixed surfaces, flags if they appear in hypothesis text
Leakage checks run at three hard gates: Strategy review, Builder review, and Precheck. A medium or high leakage risk triggers an automatic redirect or revert.
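A minimal sketch of the token-overlap check, with an invented example and no claim about the Factory's actual tokenization or thresholds:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection over union of two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def token_overlap(fixed_text: str, hypothesis_text: str) -> float:
    """Overlap between fixed-surface text and hypothesis/diff text."""
    return jaccard(set(fixed_text.lower().split()),
                   set(hypothesis_text.lower().split()))

# A hypothesis that quotes ground-truth tokens scores high:
score = token_overlap("expected answer 42 tolerance 0.01",
                      "hardcode answer 42 with tolerance 0.01")
print(round(score, 2))  # 0.57
```

A high score does not prove leakage by itself, which is why the guard feeds into a review gate rather than an automatic block.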
Running research mode¶
# New research project (ideation → build → research loop)
factory ceo "SWE-bench solver agent" --mode research
# Existing research project (skip ideation, run research loop)
factory ceo ~/my-swe-bench-solver --mode research
# Focus on a specific hypothesis within research mode
factory ceo ~/my-swe-bench-solver --mode research --focus "try chain-of-thought prompting"
# Continuous research loop
factory run ~/my-swe-bench-solver --mode research --loop
Named use cases¶
| Project | Metric | Mutable Surfaces | What improves |
|---|---|---|---|
| SWE-bench solver | resolve rate | agent logic, prompts, localization | Patch generation accuracy |
| Mathematical reasoning | solve rate | chain-of-thought templates, tool calls | Proof strategy selection |
| CAD query optimization | query accuracy | query builder, schema mapping | Entity resolution, join logic |
Writing a factory.md¶
Once the CEO creates your project, it auto-generates a factory.md configuration file. You can also write one manually for more control:
## Goal
A CLI tool that converts CSV files to JSON with streaming support.
## Scope
### Modifiable
- src/**
- tests/**
## Guards
- Do not delete existing tests
- Do not modify files outside scope
## Eval
### Command
pytest --tb=short -q
### Threshold
0.8
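Conceptually, the threshold gate compares the eval pass rate against this value; a minimal sketch (not the Factory's actual code):

```python
def passes_eval(passed: int, total: int, threshold: float = 0.8) -> bool:
    """Keep an experiment only if the test pass rate meets the threshold."""
    return total > 0 and passed / total >= threshold

print(passes_eval(9, 10))  # True
print(passes_eval(7, 10))  # False
```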
See the Configuration Reference for all available sections.
Next Steps¶
- Configuration Reference — all factory.md options
- Architecture — how the CEO and specialist agents work
- Eval System — how projects are scored
- Self-Improvement Loop — how agents evolve over time