# Eval System
The Factory uses a three-tier composite scoring system to measure every change objectively. No change is kept without a measured improvement.
## Three Tiers

### Tier 1: Hygiene (6 dimensions)
Auto-detected from project tooling. These measure basic code quality:
| Dimension | What it checks | How |
|---|---|---|
| `tests` | Test suite passes | Runs detected test command |
| `lint` | No lint errors | Runs detected linter (ruff, eslint, etc.) |
| `type_check` | Type checking passes | Runs mypy, pyright, tsc, etc. |
| `coverage` | Test coverage level | Parses coverage reports |
| `guard_patterns` | Guard rules respected | Checks scope and immutability rules |
| `config_parser` | `factory.md` is valid | Validates configuration |
Implementation: `factory/eval/hygiene.py`
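Most hygiene dimensions reduce to "run the detected tool and check its exit status." A minimal sketch of that pattern — `run_hygiene_check` is a hypothetical helper, not the actual API of `factory/eval/hygiene.py`:

```python
import subprocess


def run_hygiene_check(command: str, timeout: int = 120) -> float:
    """Run a detected tool command; exit 0 scores 1.0, anything else 0.0.

    Illustrative helper -- the real logic lives in factory/eval/hygiene.py.
    """
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # A hung test suite counts as a failure, not a crash of the factory.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

Coverage is the exception: it parses a report into a fractional score rather than mapping pass/fail to 1.0/0.0.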
### Tier 2: Growth (5 dimensions)
Computed by the factory itself. These measure whether the project is actually evolving:
| Dimension | What it measures | Weight |
|---|---|---|
| `capability_surface` | Modules, public functions, entry points | 0.28 |
| `experiment_diversity` | Variety of hypothesis categories attempted | 0.22 |
| `observability` | Logging, error handling, monitoring | 0.20 |
| `research_grounding` | Changes informed by research (archive, papers) | 0.16 |
| `factory_effectiveness` | Keep rate, score trajectory | 0.14 |
Implementation: `factory/eval/growth.py`
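The growth tier aggregates as a weighted average using the weights in the table above. A sketch, normalizing over whichever dimensions were actually computed (the normalization behavior is an assumption, not confirmed from `factory/eval/growth.py`):

```python
# Weights from the Tier 2 table above.
GROWTH_WEIGHTS = {
    "capability_surface": 0.28,
    "experiment_diversity": 0.22,
    "observability": 0.20,
    "research_grounding": 0.16,
    "factory_effectiveness": 0.14,
}


def growth_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of growth dimensions, normalized over those present."""
    total_weight = sum(GROWTH_WEIGHTS[d] for d in dimension_scores)
    weighted = sum(GROWTH_WEIGHTS[d] * s for d, s in dimension_scores.items())
    return weighted / total_weight
```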
### Tier 3: Project Eval (user-defined)
Custom dimensions for domain-specific metrics, defined in `factory.md`:

```md
## Project Eval

- name: benchmark_accuracy
  command: python eval/benchmark.py
  parse: json
  weight: 0.6
  timeout: 300
- name: inference_latency
  command: python eval/latency.py
  parse: exit_code
  weight: 0.4
```
Each command must output either:

- `json`: `{"score": 0.0-1.0}` to stdout (optionally `{"score": 0.85, "details": "..."}`)
- `exit_code`: exit 0 for pass (score 1.0), non-zero for fail (score 0.0)
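The two output contracts can be sketched in a few lines. `run_project_eval` is a hypothetical name for illustration; only the contracts themselves come from the documentation above:

```python
import json
import subprocess


def run_project_eval(command: str, parse: str, timeout: int = 300) -> float:
    """Run a user-defined eval command and extract a score in [0.0, 1.0]."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    if parse == "exit_code":
        # Exit 0 -> pass (1.0), anything else -> fail (0.0).
        return 1.0 if result.returncode == 0 else 0.0
    # parse == "json": expects {"score": ...} on stdout; clamp defensively.
    payload = json.loads(result.stdout)
    return max(0.0, min(1.0, float(payload["score"])))
```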
## Weight Distribution
| Scenario | Hygiene | Growth | Project |
|---|---|---|---|
| No project eval (default) | 50% | 50% | — |
| With project eval (default) | 30% | 20% | 50% |
| Custom (via `## Eval Weights`) | Configurable | Configurable | Configurable |
Custom weights are configured in an `## Eval Weights` section of `factory.md`.
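The exact syntax of the `## Eval Weights` section is not shown in this document; a plausible sketch, mirroring the Project Eval block above (key names and structure are assumptions):

```md
## Eval Weights

- hygiene: 0.4
- growth: 0.2
- project: 0.4
```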
## Scoring
The composite score is computed by `factory/eval/scorer.py`:
- Each dimension produces a score (0.0 to 1.0) and a weight
- Within each tier, scores are weighted and normalized
- Tiers are combined using the weight distribution above
- Guard violations force the composite to fail regardless of score
- The threshold (default 0.8) determines keep/revert
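The steps above can be sketched as a single function. The signature is hypothetical — the real scorer is `factory/eval/scorer.py` — but the logic follows the list: weighted tier combination, guard override, threshold decision:

```python
def composite_score(
    tier_scores: dict[str, float],
    tier_weights: dict[str, float],
    guard_violation: bool,
    threshold: float = 0.8,
) -> tuple[float, bool]:
    """Combine normalized tier scores into (composite, keep-decision).

    A guard violation forces a failing decision regardless of the score.
    """
    score = sum(tier_weights[t] * tier_scores[t] for t in tier_weights)
    keep = (not guard_violation) and score >= threshold
    return score, keep
```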
## Guards
Guard rules are inviolable constraints checked via `factory/eval/guards.py`:

- Scope guard: changes must be within `## Scope / Modifiable` patterns
- Eval immutability: the eval system itself cannot be modified by experiments
Guard failures override eval scores — a failing guard means mandatory revert.
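The scope guard can be sketched with glob matching — assuming `## Scope / Modifiable` holds glob patterns, which is an illustration rather than a confirmed detail of `factory/eval/guards.py`:

```python
from fnmatch import fnmatch


def scope_guard_ok(changed_paths: list[str], modifiable: list[str]) -> bool:
    """True only if every changed file matches some modifiable pattern."""
    return all(
        any(fnmatch(path, pattern) for pattern in modifiable)
        for path in changed_paths
    )
```

A change touching any file outside the declared patterns fails the guard, and with it the whole experiment.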
## Precheck Gate
`factory precheck` runs 4 non-overridable checks before any keep/revert decision:
- Score direction — score must not regress and must meet threshold
- Scope — guard check must pass
- Anti-pattern — hypothesis must not be >60% similar to a previously reverted one
- Smoke test — if configured in `factory.md`, the smoke test command must pass
The CEO cannot override a failed precheck.
```shell
factory precheck ~/my-project \
  --score-before 0.7 \
  --score-after 0.85 \
  --hypothesis "add structured logging" \
  --baseline abc123
```
## Research Mode Interaction
In research mode, the eval system works differently. The research target metric is the primary signal — hygiene scores serve as a hard gate but don't drive the keep/revert decision.
### Decision hierarchy
- Hygiene gate — any regression in tests, lint, or type_check forces an automatic revert, regardless of metric improvement
- Monotonic improvement — the research target metric must be `>= previous_best`. If the metric regresses below the highest value achieved in any prior run, the experiment is reverted. The metric ratchets forward; it can never go backward.
- Leakage guard — if ground truth contamination is detected, the experiment is reverted
- Precheck — standard precheck (scope, anti-pattern, smoke test) still applies
### Leakage guards for fixed surfaces
Research mode defines fixed surfaces — ground truth data, eval scripts, and test fixtures that must never be modified or leaked into hypotheses. Three layers of protection:
| Guard | What it detects | Risk level |
|---|---|---|
| Token overlap | Distinctive tokens from fixed surfaces appearing in hypothesis/diff text (Jaccard similarity) | low–medium |
| Negation hints | Patterns like "do NOT use X" where X appears in ground truth — encoding answers by exclusion | high |
| Specific values | Numeric literals or quoted strings extracted from fixed surfaces appearing in hypothesis text | medium |
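The token-overlap guard boils down to Jaccard similarity over token sets. A minimal sketch — the tokenizer and any risk thresholds here are illustrative assumptions, not the guard's actual parameters:

```python
import re


def _tokens(text: str) -> set[str]:
    """Crude tokenizer: identifiers of 4+ characters (assumed, not actual)."""
    return set(re.findall(r"[A-Za-z_]\w{3,}", text.lower()))


def token_jaccard(fixed_surface: str, hypothesis: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the two token sets."""
    a, b = _tokens(fixed_surface), _tokens(hypothesis)
    return len(a & b) / len(a | b) if a | b else 0.0
```

A high overlap suggests distinctive ground-truth tokens have leaked into the hypothesis text, which is why this guard is rated low–medium risk on its own: overlap can also be benign (shared domain vocabulary).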
Leakage checks run at three hard gates:
- Strategy review — CEO scans each hypothesis before approving
- Builder review — CEO scans the PR diff after implementation
- Precheck — automated guard check before keep/revert
A medium or high leakage risk triggers an automatic redirect (at Strategy/Builder) or revert (at Precheck).
### Monotonic improvement policy
The research target metric must satisfy `metric_after >= previous_best` for every accepted experiment. This prevents:
- Oscillation between local optima
- Aggregate regression from individually plausible changes
- "Two steps forward, one step back" patterns
If a change improves the metric on some instances but regresses on others, the aggregate must still be at or above the previous best. The CEO cannot override a monotonic improvement violation.
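The ratchet can be sketched as a tiny stateful check (illustrative, not the factory's actual implementation):

```python
class MetricRatchet:
    """Tracks the best research-target metric seen across accepted runs."""

    def __init__(self, previous_best: float = float("-inf")) -> None:
        self.previous_best = previous_best

    def accept(self, metric_after: float) -> bool:
        """Keep only if metric_after >= previous_best; ratchet forward on keep."""
        if metric_after >= self.previous_best:
            self.previous_best = metric_after
            return True
        return False  # mandatory revert; no CEO override
```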