AI Harness
Why this exists
Three problems that drove us to build the harness.
Inconsistent specs
Specs lived in Linear, Slack, Notion, and people's heads: scattered across too many places.
Silent regressions
Features broke when adjacent code was touched. No one was looking.
Ad-hoc review
Code review happened when someone had time. Not every change was audited.
Three-Phase Flow
Verifiable end to end. Same playbook for every change.
Plan
Source → proposal → validate → grill-me → alignment
Implement
Ralph Loop per task · checkpoint commits
Ship
Review + auto-fix + contract + changelog + deploy
Phase 1 ยท Plan
Source → proposal → 3 parallel validators → composite score
Source of Intent
- Linear ticket
- Video transcript
- Verbal prompt
- Slack thread
├─ design.md
├─ tasks.md
└─ specs/
Requirement
- completeness
- traceability
- clarity
NFR
- coverage
- baseline
- testability
Impact
- coverage
- dep. depth
- risk ID
Composite score
The proposal reads the raw source directly: no pre-refinement. Separating refinement from authoring creates anchoring bias that degrades the alignment check later.
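A minimal sketch of how the three validator scores above could roll up into the composite. The weights and the 0–100 scale here are illustrative assumptions, not the harness's actual formula:

```typescript
// Illustrative only: validator names match the three above; the weights
// are assumptions, not the harness's real values.
type ValidatorScores = {
  requirement: number; // completeness, traceability, clarity
  nfr: number;         // coverage, baseline, testability
  impact: number;      // coverage, dependency depth, risk ID
};

function compositeScore(s: ValidatorScores): number {
  const weights = { requirement: 0.4, nfr: 0.3, impact: 0.3 }; // assumed
  return (
    s.requirement * weights.requirement +
    s.nfr * weights.nfr +
    s.impact * weights.impact
  );
}

const score = compositeScore({ requirement: 90, nfr: 80, impact: 70 });
console.log(score); // 81
```

Running the three validators in parallel and reducing to one number is what makes the Phase 1 approval gate a single yes/no decision.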
Grill-me · Pre-mortem · Re-validation
Mandatory in Attendly. Exception scenarios are explicit, not implicit.
1. Grill-me
- Actor & access
- Happy path
- Data model
- Edge cases
- Error handling
- Multi-tenant
- Existing behavior
2. Pre-mortem
- data
- load
- partial failure
- rollout
- external
- security
- adjacent
3. Re-validation
- 3 validators re-run
- Pre-grill vs post-grill
- Delta shown to user
- Both kept in history
- Approval uses FINAL
- Critical pre-mortem risks → task or acceptance
Phase 1 ยท Alignment Check
The last gate of Phase 1. Clean-context comparison on Sonnet.
Raw source
Unchanged since the proposal was written.
- Covered
- Gaps
- Scope creep
- Intent mismatch
🚨 Conditional scope modifiers sweep
Dedicated pass for stray-sentence qualifiers, the #1 source of post-deploy regressions. Modifier gaps default to Critical.
Example: "only for Sant'Anna, high school students only", buried in paragraph 3 of a long ticket. Caught before production.
Verdict: Aligned → proceed · Misaligned → grill-me / edit spec, re-run alignment.
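A toy illustration of what the modifier sweep hunts for. The qualifier list and sentence-splitting are invented for the example; the real pass is an LLM check over the raw source, not a regex scan:

```typescript
// Toy sweep: flag sentences carrying conditional scope qualifiers.
// The qualifier list is illustrative, not the harness's real pattern set.
const QUALIFIERS = /\b(only (for|if|when)|except|unless|excluding|limited to)\b/i;

function findScopeModifiers(source: string): string[] {
  return source
    .split(/(?<=[.!?])\s+/)           // naive sentence split
    .filter((sentence) => QUALIFIERS.test(sentence));
}

const ticket =
  "Add attendance export. Teachers can download CSV. " +
  "Enable this only for Sant'Anna, high school students only.";
console.log(findScopeModifiers(ticket));
// flags only the third sentence, the one carrying the scope qualifier
```

The point of the dedicated pass: a qualifier like this changes who the feature applies to, so a miss defaults to Critical rather than being averaged away.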
Phase 2 ยท Implement
Ralph Loop per task · checkpoint commits · 3-attempt convergence rule.
🔴 Red
Write failing test
🟢 Green
Make it pass
🔵 Refactor
Improve, stay green
Checkpoint commits
- task 1 ✓
- task 2 ✓
- task 3 ✓
- …
Phase 3a + 3b ยท Review & Auto-Fix
Parallel sub-agent review. Critical/High auto-fixed. Drill on design decisions.
- architecture
- data access
- security
- performance
- testing
- code quality
- contract violations
- blast radius
- exception paths
- perf degradation
Auto-Fix โ Exception Paths
- Critical / Major (review) + Critical / High (regression) → attempt fix autonomously
- Fix needs a design decision (hard-error vs warn, A vs B) → STOP + drill user via AskUserQuestion
- 3-attempt convergence rule → if fix doesn't converge, revert and report
- Fix breaks pre-existing tests → revert (never degrade test integrity)
- Re-score after fix · both runs kept in daily score JSON history
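The 3-attempt convergence rule as plain control flow. This is a sketch: `attemptFix`, `testsPass`, and `revert` are hypothetical stand-ins for the agent's actual tool calls:

```typescript
// Sketch of the convergence rule; the three callbacks are hypothetical
// placeholders for the agent's real edit / test / revert actions.
type Action = () => void;
type Check = () => boolean;

function autoFix(attemptFix: Action, testsPass: Check, revert: Action): "fixed" | "reverted" {
  const MAX_ATTEMPTS = 3;
  for (let i = 1; i <= MAX_ATTEMPTS; i++) {
    attemptFix();
    if (testsPass()) return "fixed"; // converged: keep the fix
  }
  revert(); // didn't converge within 3 attempts: revert and report
  return "reverted";
}

// Demo: a fix that converges on the second attempt.
let attempts = 0;
const result = autoFix(() => { attempts++; }, () => attempts >= 2, () => {});
console.log(result, attempts); // fixed 2
```

The hard cap is the safety property: the loop can never thrash indefinitely, and a revert always lands back on the last green checkpoint commit.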
Phase 3c ยท Contract Check
Binary verdict: PASS or FAIL. Breaking changes require explicit mitigation.
GraphQL Schema
- field removal
- required arg add
- type rename
- nullable → non-null
Prisma / DB
- DROP COLUMN
- NOT NULL without default
- column rename
- narrowing type
TypeScript Public Types
- removed symbol
- field removal
- stricter generic
- narrower return
REST API when applicable
- route removal
- required param add
- response schema change
- auth change
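One of the GraphQL rules above, sketched as data. The `FieldDef` shape is invented for illustration; the real check diffs the schema itself:

```typescript
// Illustrative breaking-change classifier for a single field.
// FieldDef is an invented shape, not the harness's real model.
interface FieldDef {
  name: string;
  type: string;
  nullable: boolean;
}

function isBreaking(before: FieldDef | undefined, after: FieldDef | undefined): boolean {
  if (before && !after) return true;                   // field removal
  if (!before || !after) return false;                 // field addition is safe
  if (before.type !== after.type) return true;         // type rename / narrowing
  if (before.nullable && !after.nullable) return true; // nullable → non-null
  return false;
}

console.log(isBreaking(
  { name: "email", type: "String", nullable: true },
  { name: "email", type: "String", nullable: false },
)); // true: tightening nullability breaks existing clients
```

Note the asymmetry: non-null → nullable is safe for clients, which is why the verdict is binary per change rather than a score.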
Phase 3d + 3e ยท Changelog & Deploy
Deliver the narrative. Verify ready to ship.
3d · Changelog
Auto-detects ONE delivery target:
- Linear ticket: stakeholder-friendly comment
- Open PR: technical description update
- CHANGELOG.md: Keep-a-Changelog format
3e · Deploy Checklist
Conditional: only on Critical/High or Contract FAIL
- Migrations tested
- Rollback plan
- Feature flags
- Secrets in vault
- Runbook updated
- Observability
- Dependencies
- Stakeholders
OpenSpec Structure
One source of truth for architecture, specs, and change history.
Six Sub-Agents
Each in an isolated context. Four in Phase 1, two in Phase 3.
PR Workflow
Every change ships via pull request. No direct pushes.
Create Branch
type/ATT-XXXX-description
Draft PR
when Phase 1 is approved · still in progress
Ready for Review
when Phase 2 completes · tests green
Phase 3 updates PR
changelog · scores · risk matrix in body
Squash Merge
single commit · --delete-branch cleanup
Commands Cheat Sheet
Tags in parentheses show whether a command auto-runs inside the 3-phase flow.
Setup · once per project
- /update-skills: install / sync skills, rules, agents (manual)
- /setup-permissions: pre-authorize non-destructive commands (manual)
Diagnostic · anytime
- /check-skills-version: compare project vs latest tooling (manual)
- /telemetry report: aggregated view of tool / skill usage (manual)
Workflow · inside the 3-phase flow
- /openspec-proposal: raw source → full proposal (auto · Phase 1)
- /workflow-impact-analysis: blast radius analysis (auto · Phase 1)
- /workflow-grill-me: interview + pre-mortem (auto · Phase 1)
- /workflow-alignment-check: closing gate (spec vs source) (auto · Phase 1)
- /workflow-contract-check: breaking-change validation (auto · Phase 3)
- /workflow-changelog: final narrative delivery (auto · Phase 3)
- /workflow-refine-ticket: structure a vague ticket into a pre-spec (manual, standalone)
A Typical Task
Mostly autonomous. A handful of human touchpoints.
The 6 touchpoints
- Start: paste the ticket link.
- Grill-me: answer branch-by-branch questions.
- Pre-mortem: tell the "two weeks live, something broke" story.
- Phase 1 approval: confirm the final composite score + alignment verdict.
- During auto-fix (only if prompted): resolve a design-decision drill from the AI.
- Merge: approve the squash merge.
Everything else (proposal drafting, validation, re-validation, implementation, review, regression analysis, auto-fix, contract check, changelog posting) runs on its own.
Getting Started · First Install
Three steps, ~2 minutes. Once per machine + once per project.
Clone the tooling repo
Source of truth for skills, agents, and rules. Do this once per machine.
Install global commands
Writes update-skills, setup-permissions, telemetry, check-skills-version into ~/.claude/ so every Claude Code session finds them.
Install skills in each project
Detects the stack, sets up .claude/ symlinks to .ai-tooling/, initializes OpenSpec, wires up CLAUDE.md.
Staying Updated
/update-skills auto-runs git pull on the tooling repo and version-gates the sync.
git pull + compute hash
Diagnostic, read-only
See where you stand without syncing. Reports installed version + latest available + delta.
Zero wasted context: the version gate only syncs when there's something new.