Heartbeat & Health
Heartbeat Protocol
Grace implements a continuous health monitoring system defined in HEARTBEAT.md. The heartbeat is not a periodic cron job; it is an activity-triggered protocol that verifies operational health across STRATT, Choco HQ, the 22-organization ecosystem, and Grace itself.
Health Check Tiers
Grace checks health on a cadence set by each organization's tier:
Tier 1: Active Core (Daily)
Organizations: stratt-hq, grace-hq, choco-hq, so1-io, iris-hq
Checks (every session or daily):
| System | Check | Success Criteria |
|---|---|---|
| STRATT | Fingerprint integrity (FM-01) | Stored and computed hashes match for every unit |
| STRATT | Import resolution (FM-02) | All stratt:// and choco:// imports resolve |
| STRATT | Protected agent compliance (FM-04) | All protected agents exist in their councils |
| STRATT | Capability check (FM-09) | All agents have required capabilities for assigned steps |
| STRATT | Gate queue | No gates pending > 24 hours (timeout threshold) |
| STRATT | Council state | All 7 councils have gate authorities defined |
| STRATT | MERIDIAN build | Documentation site builds without errors |
| Grace | Agent context readable | All 8 files load without error |
| Grace | Memory accessible | Both session and long-term memory available |
| Grace | Learning pipeline | .learnings/ can be read and written |
Tier 2: Active Products (Weekly)
Organizations: traceo-ai, sparki-tools, nestr-tools, cookr-hq, devarno-cloud, chronicle-hq
Checks (every Monday or on demand):
| System | Check | Success Criteria |
|---|---|---|
| Primer/Neuro | Domain health | Neuro domain units valid and importable |
| Choco HQ | Contracts valid | No breaking contract changes unsigned |
| Choco HQ | Requirements stable | Requirement counts within expected ranges |
| Choco HQ | Service availability | ChocoBridgeResolver can reach API |
| Integration | Cross-namespace resolution | choco:// URIs resolvable |
Tier 3: Scaffolded (Monthly)
Organizations: clari-tools, aphelion-craft, msx-io, skyflow-me, tektree-io, v01t-io, smo1-io
Checks (first day of month or on demand):
| System | Check | Success Criteria |
|---|---|---|
| Ecosystem | Org registry current | ecosystem.yaml last updated within 30 days |
| Ecosystem | GitHub reachable | All 22 orgs have accessible repos |
| Ecosystem | Integration patterns | Cross-org imports follow documented patterns |
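The cadence across the three tiers can be sketched as a small scheduling function. This is a minimal illustration, not Grace's actual scheduler: the function name `tiers_due` and the integer tier labels are assumptions, and only the cadence itself (daily, every Monday, first day of the month) comes from the tables above. On-demand runs are not modeled.

```python
from datetime import date


def tiers_due(today: date) -> list[int]:
    """Return the health-check tiers due on `today` (hypothetical helper)."""
    due = [1]                  # Tier 1: every session or daily
    if today.weekday() == 0:   # Tier 2: every Monday (weekday() == 0)
        due.append(2)
    if today.day == 1:         # Tier 3: first day of the month
        due.append(3)
    return due
```

A Monday that is also the first of the month would run all three tiers under this sketch.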
STRATT Health Checks
FM-01: Fingerprint Integrity
Test: For each published unit, recompute fingerprint and compare to stored.
    bun packages/cli/src/index.ts verify <all-units>
Pass Condition: All fingerprints match. No tampering detected.
Failure Action: Alert ops. Flag unit as tampered (FM-01). Block execution.
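The recompute-and-compare step can be illustrated as follows. This is a sketch under stated assumptions: the page does not say how STRATT computes fingerprints, so SHA-256 over the raw unit content is a stand-in, and `compute_fingerprint` and `find_tampered` are hypothetical names, not part of the CLI.

```python
import hashlib


def compute_fingerprint(content: bytes) -> str:
    # Assumption: the fingerprint is a SHA-256 hex digest of the unit content.
    return hashlib.sha256(content).hexdigest()


def find_tampered(units: dict[str, tuple[bytes, str]]) -> list[str]:
    """Return URIs whose stored fingerprint differs from the recomputed one.

    `units` maps a unit URI to (content, stored_fingerprint).
    """
    return sorted(
        uri
        for uri, (content, stored) in units.items()
        if compute_fingerprint(content) != stored
    )
```

An empty result corresponds to the pass condition; any returned URI would be flagged under FM-01.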
FM-02: Broken Imports
Test: For each unit, validate all imports exist and are resolvable.
    bun packages/cli/src/index.ts ci <all-units>
Pass Condition: All imports resolve to existing units (stratt:// or choco://).
Failure Action: Report unresolvable import. Suggest fix (correct URI, publish missing unit, update import).
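A minimal sketch of the resolution check, assuming the set of published unit URIs is already available in memory. The helper name `unresolved_imports` and the `published` set are illustrative; the real check is performed by the `ci` command.

```python
ALLOWED_SCHEMES = ("stratt://", "choco://")


def unresolved_imports(imports: list[str], published: set[str]) -> list[str]:
    """Return imports with an unknown scheme or no matching published unit."""
    return [
        uri
        for uri in imports
        if not uri.startswith(ALLOWED_SCHEMES) or uri not in published
    ]
```

Each returned URI is a candidate for one of the suggested fixes: correct the URI, publish the missing unit, or update the import.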
FM-04: Protected Agent Compliance
Test: For each council, verify all protected agents exist.
Check:
    Pathfinder: protected: [BECK-02]  ← Does BECK-02 exist in Pathfinder agents?
    Hermes: protected: [EECOM-02]  ← Does EECOM-02 exist in Hermes agents?
    [... check all 7 councils]
Pass Condition: Every protected agent designation exists in its council roster.
Failure Action: Report missing protected agent. This is a critical governance issue.
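The roster check can be sketched like this. The dictionary layout (an `agents` roster and a `protected` list per council) is an assumption made for illustration; the actual council schema is not shown on this page.

```python
def missing_protected(
    councils: dict[str, dict[str, list[str]]],
) -> dict[str, list[str]]:
    """Map each council to the protected agents absent from its roster."""
    missing: dict[str, list[str]] = {}
    for name, council in councils.items():
        roster = set(council.get("agents", []))
        absent = [a for a in council.get("protected", []) if a not in roster]
        if absent:
            missing[name] = absent
    return missing
```

An empty dictionary means every protected designation was found in its council roster.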
FM-09: Capability Check
Test: For each step in a chain, verify the assigned agent has the required capabilities.
Capabilities (11 values): analysis, architecture, audit, code, data, domain, gates, memory, operations, planning, write
Check:
    For each composition step:
      - Extract required capability from unit
      - Look up agent (e.g., LEWIS-06)
      - Verify agent has that capability
Failure Action: Report capability mismatch. Suggest reassigning the step to an agent that has the capability.
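The per-step loop can be sketched as below. The 11 capability values come from the list above; the tuple layout for steps, the `agents` mapping, and the function name are assumptions for illustration only.

```python
CAPABILITIES = frozenset({
    "analysis", "architecture", "audit", "code", "data", "domain",
    "gates", "memory", "operations", "planning", "write",
})


def capability_mismatches(
    steps: list[tuple[str, str, str]],
    agents: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Check (step_id, agent_id, required_capability) triples.

    Returns (step_id, reason) for every step that fails the check.
    """
    problems = []
    for step_id, agent_id, required in steps:
        if required not in CAPABILITIES:
            problems.append((step_id, f"unknown capability {required!r}"))
        elif required not in agents.get(agent_id, set()):
            problems.append((step_id, f"{agent_id} lacks {required!r}"))
    return problems
```

An agent missing from the roster is treated here as having no capabilities, which is one possible reading of "agent roster lookup fails".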
Gate Queue Monitoring
Test: For each gate in PENDING state, check age.
Check:
    gates:
      - id: "gate-123"
        unit: stratt://dev/task/review-code@1.0.0
        created_at: "2026-04-05T14:32Z"  ← 2 days old?
        timeout: "2026-04-06T14:32Z"  ← 1 day remaining?
        status: PENDING
Escalation:
- 0-12 hours: No action
- 12-24 hours: Proactive reminder to gate authority
- 24+ hours: Auto-escalate to Veritas council for review
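The escalation bands above map naturally onto a function of gate age. This is a sketch only: the action labels are hypothetical, and treating exactly 12 hours as a reminder and exactly 24 hours as an escalation is an assumption about the boundaries, which the list leaves open.

```python
from datetime import datetime, timedelta


def escalation_action(created_at: datetime, now: datetime) -> str:
    """Classify a PENDING gate by age into one of three escalation bands."""
    age = now - created_at
    if age < timedelta(hours=12):
        return "none"                    # 0-12 hours: no action
    if age < timedelta(hours=24):
        return "remind_gate_authority"   # 12-24 hours: proactive reminder
    return "escalate_to_veritas"         # 24+ hours: auto-escalate
```

Both timestamps should be either naive or timezone-aware; mixing the two raises a `TypeError` in Python.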
Council State Verification
Test: All 7 councils have valid configuration.
Check:
    Councils:
      - Pathfinder (dev): LEWIS-06 gate authority, BECK-02 protected
      - Hermes (ops): RETRO-04 gate authority, EECOM-02 protected
      - [... verify all 7]
Pass Condition: All councils are defined with a gate authority and protected agents.
Failure Action: Report the missing council. Governance cannot function without it.
MERIDIAN Build Verification
Test: MERIDIAN documentation site builds cleanly.
    cd ~/code/workspace/stratt-hq/apps/meridian
    bun run build
Pass Condition: Zero build errors, all pages render.
Failure Action: Report build failure. Block deployment until fixed.
Grace Internal Health Checks
Agent Context Files Readable
Test: Load all 8 context files.
    IDENTITY.md ✓
    SOUL.md ✓
    TOOLS.md ✓
    USER.md ✓
    AGENTS.md ✓
    BOOTSTRAP.md ✓ (or deleted if not first session)
    HEARTBEAT.md ✓
    MEMORY.md ✓
Failure Action: Report missing or corrupted file. Grace cannot operate without all 8 files.
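A minimal sketch of the load test, assuming the 8 files live in a single directory and are UTF-8 text. The function name, the `root` parameter, and the `first_session` flag are illustrative; the flag models the note that BOOTSTRAP.md may be deleted after the first session.

```python
from pathlib import Path

CONTEXT_FILES = [
    "IDENTITY.md", "SOUL.md", "TOOLS.md", "USER.md",
    "AGENTS.md", "BOOTSTRAP.md", "HEARTBEAT.md", "MEMORY.md",
]


def unreadable_context_files(root: Path, first_session: bool = False) -> list[str]:
    """Return the context files under `root` that cannot be read."""
    problems = []
    for name in CONTEXT_FILES:
        path = root / name
        # BOOTSTRAP.md is allowed to be absent once the first session is over.
        if name == "BOOTSTRAP.md" and not first_session and not path.exists():
            continue
        try:
            path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            problems.append(name)
    return problems
```

An empty list corresponds to a passing check; any entry would trigger the failure action.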
Memory Accessibility
Test: Read session and long-term memory.
    Session memory: /memory/YYYY-MM-DD.md ✓
    Long-term: MEMORY.md ✓
    Learning: .learnings/ ✓
Failure Action: Report memory corruption. Cannot proceed without memory.
Learning Pipeline
Test: Write a test entry to learnings.
    .learnings/LEARNINGS.md ← Can write
    .learnings/ERRORS.md ← Can write
    .learnings/FEATURE_REQUESTS.md ← Can write
Failure Action: Report learning pipeline blocked. Grace cannot capture improvements.
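One way to probe writability without polluting the files is to open each one in append mode and write nothing. This is a non-destructive variant of the test described above, not necessarily what Grace does; the function name and the `root` parameter are assumptions.

```python
from pathlib import Path

LEARNING_FILES = ["LEARNINGS.md", "ERRORS.md", "FEATURE_REQUESTS.md"]


def blocked_learning_files(root: Path) -> list[str]:
    """Return learning files under root/.learnings that cannot be opened for append."""
    blocked = []
    for name in LEARNING_FILES:
        try:
            # Append mode creates the file if it is missing and leaves
            # existing entries untouched, since nothing is written.
            with open(root / ".learnings" / name, "a", encoding="utf-8"):
                pass
        except OSError:
            blocked.append(name)
    return blocked
```

If the `.learnings/` directory itself is missing, all three files are reported as blocked.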
Failure Mode Awareness
Grace monitors for all 9 failure modes during normal operations:
| FM | Name | Detection | Handler |
|---|---|---|---|
| FM-01 | Fingerprint Tamper | Verify CLI detects hash mismatch | Block execution, alert |
| FM-02 | Broken Imports | CI detects unresolvable URI | Block publish |
| FM-03 | DAG Cycles | DAG algorithm detects cycle | Block publish, suggest refactor |
| FM-04 | Protected Agent Missing | Council roster check fails | Block step, report |
| FM-05 | Gate Removal | Lifecycle transition without approval | Block transition |
| FM-06 | Contract Breaking Change | Schema validation detects signature change | Block publish, suggest major bump |
| FM-07 | Draft Isolation | Importer status check fails | Block import, promote both |
| FM-08 | R2 Infrastructure Failure | Publish request fails | Retry with backoff, escalate if persistent |
| FM-09 | Capability Check | Agent roster lookup fails | Block assignment, suggest alternative |
Timing & Quiet Hours
Work Rhythm
Grace respects operator work rhythm:
| Day | Mode | Activity |
|---|---|---|
| Monday | Build Day + Heartbeat | Normal operations, weekly health checks |
| Tuesday | Connect Day | Cross-org coordination, external sync |
| Wednesday | Build Day | Focused implementation, unit authoring |
| Thursday | Connect Day | Meetings, ecosystem coordination |
| Friday | Flex | Either build or connect as needed |
Quiet Hours
Grace does not initiate proactive work during quiet hours:
- 23:00 to 08:00 London time: Quiet hours (operator sleep)
- Outside quiet hours: Heartbeat checks and proactive actions allowed
- Emergency: FM violations always escalate immediately, regardless of time
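The quiet-hours rule, including the emergency override, can be sketched with the standard-library `zoneinfo` module. The function name `may_act` is hypothetical, and treating 23:00 as inside quiet hours and 08:00 as outside is an assumption about the boundaries.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")
QUIET_START, QUIET_END = time(23, 0), time(8, 0)


def may_act(now: datetime, fm_violation: bool = False) -> bool:
    """Return True if Grace may initiate work at `now` (timezone-aware)."""
    if fm_violation:
        return True  # FM violations escalate regardless of time
    local = now.astimezone(LONDON).time()
    # The quiet window wraps past midnight, so it is a union of two ranges.
    in_quiet_hours = local >= QUIET_START or local < QUIET_END
    return not in_quiet_hours
```

Converting to `Europe/London` before comparing keeps the window correct across daylight saving changes, which a fixed UTC offset would not.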
Heartbeat Triggers
Scheduled
- Daily: STRATT health (FM-01, FM-02, FM-04, FM-09, gate queue, council state)
- Weekly (Monday): Tier 2 health checks + gate queue review
- Monthly (1st): Tier 3 health checks + learning pipeline review
Reactive
- On FM violation: Immediate escalation and remediation attempt
- On gate timeout: Escalation to gate authority
- On campaign milestone: Activation of next phase
- On demand: User can trigger HEARTBEAT explicitly
Proactive
- Gate nearing timeout: Notify gate authority at 12h remaining
- Campaign stalled: Check n8n workflow status
- Import broken: Notify unit maintainer
- Capability mismatch: Suggest alternative agent assignment
Self-Improvement Health
Grace captures improvements systematically:
Error Tracking
    # .learnings/ERRORS.md
    - FM-06 violation message too technical (feature request #3)
    - Gate timeout UX could be clearer (learning #5)
Feature Requests
    # .learnings/FEATURE_REQUESTS.md
    - Add blast radius computation for FM-09 capability changes
    - Implement DSPy trace export (SPEC-05)
    - Add council meeting scheduler
Quality Baseline
    {
      "operations_per_day": 23.5,
      "fm_detection_rate": 0.98,
      "gate_approval_time_hours": 4.2,
      "ecosystem_health_score": 0.94
    }
Manual Heartbeat Trigger
Users can trigger a heartbeat check explicitly:
    HEARTBEAT
    → Checking STRATT health...
      ✓ Fingerprint integrity (FM-01)
      ✓ Import resolution (FM-02)
      ✓ Protected agents (FM-04)
      ✓ Capability check (FM-09)
      ✓ Gate queue (1 pending, 8h old)
      ✓ Council state (7 active)
      ✓ MERIDIAN build
    → Checking Grace health...
      ✓ Agent context readable
      ✓ Memory accessible
      ✓ Learning pipeline
    → Checking ecosystem health...
      ✓ 22 orgs reachable
      ✓ ecosystem.yaml current (2 days old)
    HEARTBEAT COMPLETE: All systems nominal. No proactive action required.
Next Steps
- Orchestration & Routing: Trigger words and tool routing
- Agent Context: The 8 markdown files that define behavior
- Ecosystem: 22 organizations and health monitoring
- Protocol: Failure modes FM-01 through FM-09