Heartbeat & Health

Grace implements a continuous health monitoring system, defined in HEARTBEAT.md. The heartbeat is not a periodic cron job; it is an activity-triggered protocol that verifies operational health across STRATT, Choco HQ, the 22-organization ecosystem, and Grace itself.


Grace checks health on a cadence that depends on each organization's tier:

Tier 1 organizations: stratt-hq, grace-hq, choco-hq, so1-io, iris-hq

Checks (every session or daily):

| System | Check | Success Criteria |
|--------|-------|------------------|
| STRATT | Fingerprint integrity (FM-01) | No stored ≠ computed hashes detected |
| STRATT | Import resolution (FM-02) | All stratt:// and choco:// imports resolve |
| STRATT | Protected agent compliance (FM-04) | All protected agents exist in their councils |
| STRATT | Capability check (FM-09) | All agents have required capabilities for assigned steps |
| STRATT | Gate queue | No gates pending > 24 hours (timeout threshold) |
| STRATT | Council state | All 7 councils have gate authorities defined |
| STRATT | MERIDIAN build | Documentation site builds without errors |
| Grace | Agent context readable | All 8 files load without error |
| Grace | Memory accessible | Both session and long-term memory available |
| Grace | Learning pipeline | .learnings/ can be read and written |

Tier 2 organizations: traceo-ai, sparki-tools, nestr-tools, cookr-hq, devarno-cloud, chronicle-hq

Checks (every Monday or on demand):

| System | Check | Success Criteria |
|--------|-------|------------------|
| Primer/Neuro | Domain health | Neuro domain units valid and importable |
| Choco HQ | Contracts valid | No breaking contract changes unsigned |
| Choco HQ | Requirements stable | Requirement counts within expected ranges |
| Choco HQ | Service availability | ChocoBridgeResolver can reach API |
| Integration | Cross-namespace resolution | choco:// URIs resolvable |

Tier 3 organizations: clari-tools, aphelion-craft, msx-io, skyflow-me, tektree-io, v01t-io, smo1-io

Checks (first day of month or on demand):

| System | Check | Success Criteria |
|--------|-------|------------------|
| Ecosystem | Org registry current | ecosystem.yaml last updated within 30 days |
| Ecosystem | GitHub reachable | All 22 orgs have accessible repos |
| Ecosystem | Integration patterns | Cross-org imports follow documented patterns |
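
The three cadences above (every session or daily, every Monday, first day of the month) can be expressed as a small scheduling function. This is only a sketch: `Cadence` and `cadencesDue` are hypothetical names, and the date is evaluated in UTC as a simplifying assumption, whereas the real protocol may use the operator's local time.

```typescript
// Hypothetical names; the date is evaluated in UTC as a simplifying assumption.
type Cadence = "daily" | "weekly" | "monthly";

// Returns which check groups are due at the given instant.
// `onDemand` models an explicit request, which runs every group.
function cadencesDue(at: Date, onDemand = false): Cadence[] {
  const due: Cadence[] = ["daily"]; // every session or daily
  const isMonday = at.getUTCDay() === 1; // 0 = Sunday, 1 = Monday
  const isFirstOfMonth = at.getUTCDate() === 1;
  if (onDemand || isMonday) due.push("weekly");
  if (onDemand || isFirstOfMonth) due.push("monthly");
  return due;
}
```

Returning a list rather than a single cadence covers days that are both a Monday and the first of the month, when all three groups are due.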

Fingerprint integrity (FM-01)

Test: For each published unit, recompute the fingerprint and compare it to the stored value.

bun packages/cli/src/index.ts verify <all-units>

Pass Condition: All fingerprints match. No tampering detected.

Failure Action: Alert ops. Flag unit as tampered (FM-01). Block execution.
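
The comparison at the heart of this check can be sketched as follows. It is an illustration, not the STRATT CLI's implementation: the `Unit` shape, the helper names, and the choice of SHA-256 over the raw unit body are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical unit shape; the real STRATT unit schema may differ.
interface Unit {
  uri: string;
  body: string;
  fingerprint: string; // stored fingerprint (assumed hex-encoded SHA-256)
}

// Assumption: the fingerprint is a SHA-256 digest of the unit body.
function computeFingerprint(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Returns the URIs of units whose stored fingerprint differs from the recomputed one.
function findTamperedUnits(units: Unit[]): string[] {
  return units
    .filter((unit) => computeFingerprint(unit.body) !== unit.fingerprint)
    .map((unit) => unit.uri);
}
```

An empty result corresponds to the pass condition; any returned URI would be flagged under FM-01.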


Import resolution (FM-02)

Test: For each unit, validate that every import exists and resolves.

bun packages/cli/src/index.ts ci <all-units>

Pass Condition: All imports resolve to existing units (stratt:// or choco://).

Failure Action: Report unresolvable import. Suggest fix (correct URI, publish missing unit, update import).
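
A minimal sketch of the resolution logic, assuming the set of published unit URIs is already available in memory. `UnitImports`, `BrokenImport`, and `findBrokenImports` are hypothetical names, and the actual `ci` command may resolve imports differently.

```typescript
// Hypothetical shapes for illustration only.
interface UnitImports {
  uri: string;
  imports: string[];
}

interface BrokenImport {
  unit: string;
  target: string;
  reason: "unknown-scheme" | "not-found";
}

// The two namespaces named in the pass condition.
const KNOWN_SCHEMES = ["stratt://", "choco://"];

// Reports every import that uses an unknown scheme or points at an unpublished unit.
function findBrokenImports(units: UnitImports[], published: Set<string>): BrokenImport[] {
  const broken: BrokenImport[] = [];
  for (const unit of units) {
    for (const target of unit.imports) {
      if (!KNOWN_SCHEMES.some((scheme) => target.startsWith(scheme))) {
        broken.push({ unit: unit.uri, target, reason: "unknown-scheme" });
      } else if (!published.has(target)) {
        broken.push({ unit: unit.uri, target, reason: "not-found" });
      }
    }
  }
  return broken;
}
```

Recording a reason per broken import makes it easier to suggest a fix: an unknown scheme points to a wrong URI, while a missing target suggests publishing the unit or updating the import.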


Protected agent compliance (FM-04)

Test: For each council, verify that all protected agents exist.

Check:

Pathfinder:
  protected: [BECK-02]   ← Does BECK-02 exist in Pathfinder agents?
Hermes:
  protected: [EECOM-02]  ← Does EECOM-02 exist in Hermes agents?
[... check all 7 councils]

Pass Condition: Every protected agent designation exists in its council roster.

Failure Action: Report missing protected agent. This is a critical governance issue.
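
The roster check could look like the sketch below. The `Council` shape and the function name are assumptions for illustration; the real council configuration format is not shown in this page.

```typescript
// Hypothetical council shape; field names are assumptions.
interface Council {
  name: string;
  agents: string[]; // full roster of agent designations
  protectedAgents: string[]; // designations that must appear in the roster
}

interface MissingAgent {
  council: string;
  agent: string;
}

// Lists every protected designation that is absent from its council's roster.
function findMissingProtectedAgents(councils: Council[]): MissingAgent[] {
  return councils.flatMap((council) => {
    const roster = new Set(council.agents);
    return council.protectedAgents
      .filter((agent) => !roster.has(agent))
      .map((agent) => ({ council: council.name, agent }));
  });
}
```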


Capability check (FM-09)

Test: For each step in a chain, verify the assigned agent has the required capabilities.

Capabilities (11 values): analysis, architecture, audit, code, data, domain, gates, memory, operations, planning, write

Check:

For each composition step:
- Extract required capability from unit
- Look up agent (e.g., LEWIS-06)
- Verify agent has that capability

Failure Action: Report capability mismatch. Suggest reassigning the step to an agent that has the required capability.
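
The three steps of this check, together with the suggested reassignment, can be sketched as below. The `Step`, `Roster`, and `Mismatch` types are hypothetical; only the 11 capability values come from this page.

```typescript
// The 11 capability values listed in this section.
const CAPABILITIES = [
  "analysis", "architecture", "audit", "code", "data", "domain",
  "gates", "memory", "operations", "planning", "write",
] as const;
type Capability = (typeof CAPABILITIES)[number];

// Hypothetical shapes; the real composition format may differ.
interface Step {
  id: string;
  agent: string;
  requires: Capability;
}
type Roster = Map<string, Set<Capability>>;

interface Mismatch {
  step: string;
  agent: string;
  requires: Capability;
  alternatives: string[]; // agents that do have the capability
}

// Finds steps whose assigned agent is unknown or lacks the required capability.
function findCapabilityMismatches(steps: Step[], roster: Roster): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const step of steps) {
    if (roster.get(step.agent)?.has(step.requires)) continue;
    const alternatives = [...roster.entries()]
      .filter(([, caps]) => caps.has(step.requires))
      .map(([agent]) => agent);
    mismatches.push({ step: step.id, agent: step.agent, requires: step.requires, alternatives });
  }
  return mismatches;
}
```

An agent that is missing from the roster is treated the same as one without the capability, which matches the "Agent roster lookup fails" detection for FM-09.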


Gate queue

Test: For each gate in PENDING state, check its age.

Check:

gates:
  - id: "gate-123"
    unit: stratt://dev/task/review-code@1.0.0
    created_at: "2026-04-05T14:32Z"   ← 2 days old?
    timeout: "2026-04-06T14:32Z"      ← 1 day remaining?
    status: PENDING

Escalation:

  • 0-12 hours: No action
  • 12-24 hours: Proactive reminder to gate authority
  • 24+ hours: Auto-escalate to Veritas council for review
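
The escalation ladder above maps a gate's age to an action. A minimal sketch follows; the function and type names are hypothetical, and treating exactly 12 and exactly 24 hours as belonging to the higher band is an assumption, since the ranges above overlap at their boundaries.

```typescript
// Hypothetical action names for the three escalation bands.
type Escalation = "none" | "remind-authority" | "escalate-veritas";

const HOUR_MS = 60 * 60 * 1000;

// Age of a gate in hours, from its created_at timestamp (ISO 8601) to `now`.
function gateAgeHours(createdAt: string, now: Date): number {
  return (now.getTime() - new Date(createdAt).getTime()) / HOUR_MS;
}

// Assumption: boundary values (exactly 12h or 24h) fall into the higher band.
function escalationFor(ageHours: number): Escalation {
  if (ageHours >= 24) return "escalate-veritas"; // 24+ hours
  if (ageHours >= 12) return "remind-authority"; // 12-24 hours
  return "none"; // 0-12 hours
}
```

For example, a gate created at 2026-04-05T14:32Z and checked exactly two days later has an age of 48 hours and lands in the Veritas escalation band.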

Council state

Test: Verify that all 7 councils have a valid configuration.

Check:

Councils:
- Pathfinder (dev): LEWIS-06 gate authority, BECK-02 protected
- Hermes (ops): RETRO-04 gate authority, EECOM-02 protected
- [... verify all 7]

Pass Condition: All councils defined with gate authority and protected agents.

Failure Action: Report the missing council. A missing council blocks governance.


MERIDIAN build

Test: Verify that the MERIDIAN documentation site builds cleanly.

cd ~/code/workspace/stratt-hq/apps/meridian
bun run build

Pass Condition: Zero build errors, all pages render.

Failure Action: Report build failure. Block deployment until fixed.


Agent context readable

Test: Load all 8 context files.

IDENTITY.md ✓
SOUL.md ✓
TOOLS.md ✓
USER.md ✓
AGENTS.md ✓
BOOTSTRAP.md ✓ (or deleted if not first session)
HEARTBEAT.md ✓
MEMORY.md ✓

Failure Action: Report missing or corrupted file. Grace cannot operate without all 8 files.


Memory accessible

Test: Read session and long-term memory.

Session memory: /memory/YYYY-MM-DD.md ✓
Long-term: MEMORY.md ✓
Learning: .learnings/ ✓

Failure Action: Report memory corruption. Cannot proceed without memory.


Learning pipeline

Test: Write a test entry to each learnings file.

.learnings/LEARNINGS.md ← Can write
.learnings/ERRORS.md ← Can write
.learnings/FEATURE_REQUESTS.md ← Can write

Failure Action: Report learning pipeline blocked. Grace cannot capture improvements.


Grace monitors for all 9 failure modes during normal operations:

| FM | Name | Detection | Handler |
|----|------|-----------|---------|
| FM-01 | Fingerprint Tamper | Verify CLI detects hash mismatch | Block execution, alert |
| FM-02 | Broken Imports | CI detects unresolvable URI | Block publish |
| FM-03 | DAG Cycles | DAG algorithm detects cycle | Block publish, suggest refactor |
| FM-04 | Protected Agent Missing | Council roster check fails | Block step, report |
| FM-05 | Gate Removal | Lifecycle transition without approval | Block transition |
| FM-06 | Contract Breaking Change | Schema validation detects signature change | Block publish, suggest major bump |
| FM-07 | Draft Isolation | Importer status check fails | Block import, promote both |
| FM-08 | R2 Infrastructure Failure | Publish request fails | Retry with backoff, escalate if persistent |
| FM-09 | Capability Check | Agent roster lookup fails | Block assignment, suggest alternative |
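
One way to keep handlers in step with this table is a typed registry keyed by failure-mode ID, so the compiler rejects a registry that omits any of the nine modes. The sketch below is illustrative: `FAILURE_MODES` and `describeViolation` are hypothetical names, and the handler strings are copied from the table, not from actual code.

```typescript
type FailureModeId =
  | "FM-01" | "FM-02" | "FM-03" | "FM-04" | "FM-05"
  | "FM-06" | "FM-07" | "FM-08" | "FM-09";

interface FailureMode {
  name: string;
  handler: string; // human-readable handler summary from the table
}

// Record<FailureModeId, ...> forces an entry for every failure mode.
const FAILURE_MODES: Record<FailureModeId, FailureMode> = {
  "FM-01": { name: "Fingerprint Tamper", handler: "Block execution, alert" },
  "FM-02": { name: "Broken Imports", handler: "Block publish" },
  "FM-03": { name: "DAG Cycles", handler: "Block publish, suggest refactor" },
  "FM-04": { name: "Protected Agent Missing", handler: "Block step, report" },
  "FM-05": { name: "Gate Removal", handler: "Block transition" },
  "FM-06": { name: "Contract Breaking Change", handler: "Block publish, suggest major bump" },
  "FM-07": { name: "Draft Isolation", handler: "Block import, promote both" },
  "FM-08": { name: "R2 Infrastructure Failure", handler: "Retry with backoff, escalate if persistent" },
  "FM-09": { name: "Capability Check", handler: "Block assignment, suggest alternative" },
};

// Formats a one-line summary suitable for an escalation message.
function describeViolation(id: FailureModeId): string {
  const mode = FAILURE_MODES[id];
  return `${id} ${mode.name}: ${mode.handler}`;
}
```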

Grace respects operator work rhythm:

| Day | Mode | Activity |
|-----|------|----------|
| Monday | Build Day + Heartbeat | Normal operations, weekly health checks |
| Tuesday | Connect Day | Cross-org coordination, external sync |
| Wednesday | Build Day | Focused implementation, unit authoring |
| Thursday | Connect Day | Meetings, ecosystem coordination |
| Friday | Flex | Either build or connect as needed |

Grace does not initiate proactive work during quiet hours:

  • 23:00 to 08:00 London time: Quiet hours (operator sleep)
  • Outside quiet hours: Heartbeat checks and proactive actions allowed
  • Emergency: FM violations always escalate immediately, regardless of time
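
The quiet-hours rule can be sketched with the standard `Intl.DateTimeFormat` API, which handles the GMT/BST switch for the `Europe/London` time zone. The function names are hypothetical, and the sketch assumes the quiet window is inclusive of 23:00 and exclusive of 08:00.

```typescript
// Returns the local hour (0-23) in London for the given instant.
function londonHour(at: Date): number {
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone: "Europe/London",
    hour: "numeric",
  }).formatToParts(at);
  const hour = parts.find((part) => part.type === "hour");
  // "% 24" guards against runtimes that render midnight as 24.
  return Number(hour ? hour.value : NaN) % 24;
}

// Assumption: quiet hours run from 23:00 (inclusive) to 08:00 (exclusive).
function isQuietHour(hour: number): boolean {
  return hour >= 23 || hour < 8;
}

// FM violations bypass quiet hours; everything else waits.
function mayActProactively(at: Date, isFmViolation: boolean): boolean {
  if (isFmViolation) return true;
  return !isQuietHour(londonHour(at));
}
```

Because the window wraps past midnight, the check is an OR of two comparisons, not a single range test.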

  • Daily: STRATT health (FM-01, FM-02, FM-04, FM-09, gate queue, council state)
  • Weekly (Monday): Tier 2 health checks + gate queue review
  • Monthly (1st): Tier 3 health checks + learning pipeline review

  • On FM violation: Immediate escalation and remediation attempt
  • On gate timeout: Escalation to gate authority
  • On campaign milestone: Activation of next phase
  • On demand: User can trigger HEARTBEAT explicitly

  • Gate nearing timeout: Notify gate authority at 12h remaining
  • Campaign stalled: Check n8n workflow status
  • Import broken: Notify unit maintainer
  • Capability mismatch: Suggest alternative agent assignment

Grace captures improvements systematically:

# .learnings/ERRORS.md
- FM-06 violation message too technical (feature request #3)
- Gate timeout UX could be clearer (learning #5)

# .learnings/FEATURE_REQUESTS.md
- Add blast radius computation for FM-09 capability changes
- Implement DSPy trace export (SPEC-05)
- Add council meeting scheduler

{
  "operations_per_day": 23.5,
  "fm_detection_rate": 0.98,
  "gate_approval_time_hours": 4.2,
  "ecosystem_health_score": 0.94
}

Users can trigger a heartbeat check explicitly:

HEARTBEAT
→ Checking STRATT health...
✓ Fingerprint integrity (FM-01)
✓ Import resolution (FM-02)
✓ Protected agents (FM-04)
✓ Capability check (FM-09)
✓ Gate queue (1 pending, 8h old)
✓ Council state (7 active)
✓ MERIDIAN build
→ Checking Grace health...
✓ Agent context readable
✓ Memory accessible
✓ Learning pipeline
→ Checking ecosystem health...
✓ 22 orgs reachable
✓ ecosystem.yaml current (2 days old)
HEARTBEAT COMPLETE: All systems nominal. No proactive action required.
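
A report in the shape shown above could be assembled from individual check results roughly as follows. This is a sketch under assumptions: `CheckResult` and `formatHeartbeat` are hypothetical names, and the `✗` marker and the failure summary line are invented for the example, since this page only shows the all-passing case.

```typescript
// Hypothetical result shape for a single health check.
interface CheckResult {
  group: string; // e.g. "STRATT", "Grace", "ecosystem"
  name: string;
  ok: boolean;
  detail?: string; // optional note such as "1 pending, 8h old"
}

// Builds the report lines; results are assumed to arrive grouped in order.
function formatHeartbeat(results: CheckResult[]): string[] {
  const lines: string[] = ["HEARTBEAT"];
  let currentGroup = "";
  for (const result of results) {
    if (result.group !== currentGroup) {
      currentGroup = result.group;
      lines.push(`→ Checking ${result.group} health...`);
    }
    const detail = result.detail ? ` (${result.detail})` : "";
    lines.push(`${result.ok ? "✓" : "✗"} ${result.name}${detail}`);
  }
  const failed = results.filter((result) => !result.ok).length;
  lines.push(
    failed === 0
      ? "HEARTBEAT COMPLETE: All systems nominal. No proactive action required."
      : `HEARTBEAT COMPLETE: ${failed} check(s) failed.`, // assumed wording
  );
  return lines;
}
```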