
Execution Traces & Self-Enhancement

Traces are the observability layer for GRACE. Every chain execution produces a structured trace recording what ran, how well it performed, and what changed. Over time, traces enable Grace to detect regressions, identify high-performing units, and feed data to automated refinement loops.

Traces serve four strategic functions:

  1. Auditability — Answer: “exactly which prompt ran, at what version, with what inputs/outputs?”
  2. Quality tracking — Measure each unit’s performance over time
  3. Regression detection — Flag when a unit’s quality degrades unexpectedly
  4. Self-improvement — Feed traces into DSPy MIPROv2 to automatically refine prompts

The trace format follows SPEC-05 and the IRProgram interface from @stratt/ir, which keeps compiled chains and their execution records structurally compatible.

Each execution produces a top-level trace record:

trace_id: "tr-01aryz6s41tsqt0yp91qsydysq"
chain_uri: "stratt://dev/chain/sol-1-boot@0.1.0"
version: "0.1.0"
fingerprint: "blake3:a1b2c3d4e5f6..."
session_id: "sess-2026-04-01T10:00:00Z"
council: "pathfinder"
started_at: "2026-04-01T10:00:00Z"
completed_at: "2026-04-01T10:05:23Z"
duration_ms: 323000
status: "completed"    # completed | failed | gated | timeout
quality_score: 0.82    # 0.0–1.0, heuristic-based
token_counts:
  input: 4200
  output: 1850
  total: 6050
steps: []              # see Step-Level Trace below
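The derived fields in this record can be cross-checked against the raw ones. The following is a minimal sketch, assuming the trace has already been loaded into a plain dictionary; `parse_ts` and `check_trace_consistency` are hypothetical helper names, not part of the stratt tooling.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z", so normalise it first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def check_trace_consistency(trace: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []

    # duration_ms should equal completed_at - started_at in milliseconds.
    elapsed = parse_ts(trace["completed_at"]) - parse_ts(trace["started_at"])
    elapsed_ms = elapsed // timedelta(milliseconds=1)
    if elapsed_ms != trace["duration_ms"]:
        problems.append(f"duration_ms is {trace['duration_ms']}, expected {elapsed_ms}")

    # total should equal input + output.
    counts = trace["token_counts"]
    if counts["input"] + counts["output"] != counts["total"]:
        problems.append("token_counts.total does not equal input + output")

    return problems


trace = {
    "started_at": "2026-04-01T10:00:00Z",
    "completed_at": "2026-04-01T10:05:23Z",
    "duration_ms": 323000,
    "token_counts": {"input": 4200, "output": 1850, "total": 6050},
}
print(check_trace_consistency(trace))  # → []
```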

Each step in the chain produces its own trace entry:

- step_id: "step-01"
  unit_uri: "stratt://dev/task/intake-parse@0.1.0"
  unit_fingerprint: "blake3:a1b2c3d4e5f6..."
  agent: "WATNEY-01"
  gate: false
  started_at: "2026-04-01T10:00:00Z"
  completed_at: "2026-04-01T10:01:12Z"
  duration_ms: 72000
  status: "completed"          # completed | failed | retry_exhausted
  input_hash: "blake3:..."     # hash of actual inputs (for privacy)
  output_hash: "blake3:..."    # hash of actual outputs
  token_counts:
    input: 800
    output: 350
  quality_indicators:
    contract_conformance: true # output matches declared types
    completeness: 0.9          # all required fields present
    token_efficiency: 0.75     # output tokens / input tokens

Gate steps (checkpoints requiring human approval) produce audit records:

- step_id: "step-06"
  unit_uri: "stratt://dev/gate/boot-review@0.1.0"
  agent: "LEWIS-06"
  gate: true
  gate_resolution:
    state: "APPROVED"          # APPROVED | REJECTED | TIMEOUT | ESCALATED
    resolved_by: "LEWIS-06"
    resolved_at: "2026-04-01T10:04:50Z"
    wait_duration_ms: 180000
    reason: null               # populated on REJECTED (why human rejected)
    packet_hash: "blake3:..."  # hash of the unit output being gated

A pluggable quality scorer evaluates each execution. The default heuristic uses three factors:

| Factor | Weight | Measurement |
| --- | --- | --- |
| Contract conformance | 40% | Output matches declared types and constraints |
| Completeness | 35% | All required outputs present and non-empty |
| Token efficiency | 25% | Ratio of useful output tokens to input tokens |

Formula:

quality_score = (conformance × 0.40) + (completeness × 0.35) + (efficiency × 0.25)

| Range | Meaning | Action |
| --- | --- | --- |
| ≥ 0.80 | ✅ Good | No action; unit performing well |
| 0.60–0.79 | ⚠️ Review | Flag for manual inspection; check for false negatives |
| < 0.60 | ❌ Poor | Trigger refinement cycle; regenerate prompts |
| Drop > 0.05 between versions | ⛔ Regression | Auto-trigger Veritas gate; block publish until approved |
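The default heuristic and its action bands can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the boolean conformance flag is assumed to map to 1.0 or 0.0, and scores between 0.79 and 0.80 are assumed to fall in the review band.

```python
def quality_score(contract_conformance: bool, completeness: float,
                  token_efficiency: float) -> float:
    # Assumption: a boolean conformance result counts as 1.0 or 0.0.
    conformance = 1.0 if contract_conformance else 0.0
    return conformance * 0.40 + completeness * 0.35 + token_efficiency * 0.25


def classify(score: float) -> str:
    # Bands follow the table above; regression is a separate, version-to-version check.
    if score >= 0.80:
        return "good"
    if score >= 0.60:
        return "review"
    return "poor"


# Indicators taken from the step-level example (step-01).
score = quality_score(True, 0.9, 0.75)
print(round(score, 4), classify(score))  # → 0.9025 good
```

With these indicators the three terms are 0.40, 0.315, and 0.1875, which sum to 0.9025.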

Location: automation/traces/YYYY-MM-DD-{chain-slug}.yaml

Gitignore policy: Traces contain prompt/response content and are NOT committed to version control.

Retention policy:

  • Local storage: 90 days (rolling window)
  • Archive to Cloudflare R2: Long-term retention (planned)

Sampling strategy:

  • Default: Trace all executions
  • At 100+ runs/day: Sample at 0.2 rate (20%), but always trace:
    • Failures (status: failed)
    • Gate resolutions (gate_resolution.state != APPROVED)
    • Quality scores < 0.60
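The sampling rules above amount to a small decision function. The sketch below is one possible reading, with hypothetical names (`must_trace`, `should_trace`) and the assumption that the daily run count is known at the moment a trace is about to be written.

```python
import random

SAMPLE_RATE = 0.2               # 20% sampling at high volume
HIGH_VOLUME_RUNS_PER_DAY = 100  # threshold from the sampling strategy
POOR_SCORE = 0.60


def must_trace(trace: dict) -> bool:
    """True for executions that are always traced regardless of sampling."""
    if trace.get("status") == "failed":
        return True
    if trace.get("quality_score", 1.0) < POOR_SCORE:
        return True
    for step in trace.get("steps", []):
        resolution = step.get("gate_resolution")
        if resolution and resolution.get("state") != "APPROVED":
            return True
    return False


def should_trace(trace: dict, runs_today: int, rng=random.random) -> bool:
    # Below the volume threshold every execution is traced.
    if runs_today < HIGH_VOLUME_RUNS_PER_DAY or must_trace(trace):
        return True
    return rng() < SAMPLE_RATE
```

Passing `rng` as a parameter keeps the function deterministic under test, for example `should_trace(trace, 500, rng=lambda: 0.5)`.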

When a unit is updated, compare its quality score against the previous version:

delta = new_version_score - old_version_score
if delta < -0.05:
    → Regression detected
    → Auto-trigger Veritas gate
    → Block publication until human approves or reverts change

Example:

  • Unit stratt://dev/task/parse@1.0.0 averaged 0.85 over 10 runs
  • Unit stratt://dev/task/parse@1.1.0 averages 0.79 over 5 runs
  • Delta = 0.79 − 0.85 = −0.06 (exceeds −0.05 threshold)
  • Action: Veritas gate blocks publication; human must decide to revert or approve
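The comparison in this example can be reproduced as follows. This is a sketch that assumes each version's score is the plain mean of its per-run quality scores; `detect_regression` is a hypothetical name.

```python
from statistics import mean

REGRESSION_THRESHOLD = -0.05


def detect_regression(old_scores: list[float], new_scores: list[float]):
    """Return (delta, regressed) for two versions of the same unit."""
    delta = mean(new_scores) - mean(old_scores)
    return delta, delta < REGRESSION_THRESHOLD


# parse@1.0.0 averaged 0.85 over 10 runs; parse@1.1.0 averages 0.79 over 5 runs.
delta, regressed = detect_regression([0.85] * 10, [0.79] * 5)
print(round(delta, 2), regressed)  # → -0.06 True
```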

Every Monday morning (as part of HEARTBEAT memory maintenance):

  1. Collect — Read all traces from the past week
  2. Score — Compute average quality per unit and per chain
  3. Compare — Check for regressions against previous week
  4. Identify — Flag underperforming units (score < 0.60, or a drop of more than 0.05 against the previous week)
  5. Refine — For flagged units, determine action:
    • Adjust prompt_body content
    • Change agent assignment in chain
    • Modify step ordering or gate placement
    • Tune gate approval thresholds
  6. Log — Write evaluation summary to memory/YYYY-MM-DD.md
  7. Validate — If a unit was modified, run stratt validate + stratt fingerprint --verify

This cycle ensures Grace continuously learns from execution data and proactively identifies degradation.
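The Score, Compare, and Identify steps of this cycle could look roughly like the sketch below. It assumes traces are already loaded as dictionaries, that a unit's score is the default heuristic applied to each step's quality_indicators, and that last week's averages are available as a mapping; all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def step_score(qi: dict) -> float:
    # Default heuristic; a boolean conformance flag is assumed to count as 1.0 or 0.0.
    conformance = 1.0 if qi["contract_conformance"] else 0.0
    return conformance * 0.40 + qi["completeness"] * 0.35 + qi["token_efficiency"] * 0.25


def weekly_unit_averages(traces: list[dict]) -> dict[str, float]:
    scores = defaultdict(list)
    for trace in traces:
        for step in trace.get("steps", []):
            qi = step.get("quality_indicators")
            if qi:  # gate steps in the examples carry no quality indicators
                scores[step["unit_uri"]].append(step_score(qi))
    return {unit: mean(values) for unit, values in scores.items()}


def flag_units(this_week: dict[str, float], last_week: dict[str, float]) -> list[str]:
    flagged = []
    for unit, score in this_week.items():
        previous = last_week.get(unit)
        dropped = previous is not None and score - previous < -0.05
        if score < 0.60 or dropped:
            flagged.append(unit)
    return sorted(flagged)
```

The Refine, Log, and Validate steps are left out because they depend on human judgment and on stratt commands whose behavior is not described here.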

For future integration with DSPy MIPROv2, traces export to JSONL format with flexible filtering.

stratt export-dspy \
  --min-score 0.7 \
  --date-range 2026-03-01..2026-04-01 \
  --version-range 0.1.0..0.2.0 \
  --chain stratt://dev/chain/sol-1-boot@* \
  --exclude-gates \
  > training-data.jsonl

{
  "trace_id": "tr-01aryz6s41tsqt0yp91qsydysq",
  "chain_uri": "stratt://dev/chain/sol-1-boot@0.1.0",
  "chain_version": "0.1.0",
  "quality_score": 0.82,
  "steps": [
    {
      "step_id": "step-01",
      "unit_uri": "stratt://dev/task/intake-parse@0.1.0",
      "agent": "WATNEY-01",
      "input": "raw user request...",
      "output": "parsed task object...",
      "quality_indicators": {
        "contract_conformance": true,
        "completeness": 0.9,
        "token_efficiency": 0.75
      }
    }
  ]
}

| Option | Purpose | Example |
| --- | --- | --- |
| --min-score | Only export high-quality traces | --min-score 0.7 |
| --date-range | Date window for collection | --date-range 2026-03-01..2026-04-01 |
| --version-range | Version window | --version-range 0.1.0..0.2.0 |
| --chain | Specific chain only | --chain stratt://dev/chain/boot@* |
| --exclude-gates | Omit gate resolution data | (useful for privacy) |
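To make the effect of two of these filters concrete, here is a rough Python sketch of score filtering and gate-data removal that produces JSONL lines. It is not the implementation of `stratt export-dspy`; the function name is hypothetical, and `--exclude-gates` is assumed to drop only the gate_resolution field from each step.

```python
import json
from typing import Iterable, Iterator


def export_records(traces: Iterable[dict], min_score: float = 0.0,
                   exclude_gates: bool = False) -> Iterator[str]:
    """Yield one compact JSON string per qualifying trace (one JSONL line each)."""
    for trace in traces:
        if trace["quality_score"] < min_score:
            continue
        steps = trace["steps"]
        if exclude_gates:
            # Assumption: only the gate_resolution field is removed.
            steps = [{k: v for k, v in s.items() if k != "gate_resolution"}
                     for s in steps]
        yield json.dumps({**trace, "steps": steps}, separators=(",", ":"))


traces = [
    {"trace_id": "t1", "quality_score": 0.82,
     "steps": [{"step_id": "step-01", "gate_resolution": {"state": "APPROVED"}}]},
    {"trace_id": "t2", "quality_score": 0.55, "steps": []},
]
for line in export_records(traces, min_score=0.7, exclude_gates=True):
    print(line)
```

Only the first trace passes the 0.7 threshold, and its step is written without gate_resolution.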

The full self-improvement loop (currently design-phase, BUNDLE-C target):

1. COLLECT
   Gather traces from past week
2. ANALYZE
   Compute quality scores per unit
   Detect regressions (score drop > 0.05)
   Identify underperformers (score < 0.60)
3. EXPORT
   Run `stratt export-dspy` with filters
   Produce JSONL training dataset
4. OPTIMIZE
   Feed JSONL into DSPy MIPROv2
   MIPROv2 auto-tunes prompt parameters
   Produces candidate prompt revisions
5. VALIDATE
   CI checks new prompts (contract conformance, fingerprint)
   Manual review (domain expertise check)
6. PUBLISH
   Update unit with new prompt_body
   Increment minor version (0.1.0 → 0.2.0)
   Re-fingerprint + publish
7. MONITOR
   Run new version in production
   Collect traces for next cycle
   Compare quality against previous version
8. REPEAT
   Weekly evaluation, continuous improvement

The trace format is structurally compatible with @stratt/ir interfaces:

| IRProgram Field | Trace Equivalent | Purpose |
| --- | --- | --- |
| chainUri | chain_uri | Unique chain identifier |
| version | version | Version of chain executed |
| meta.council | council | Assigned council for this chain |
| steps[].id | steps[].step_id | Step sequence number |
| steps[].unitUri | steps[].unit_uri | Which unit ran |
| steps[].agent | steps[].agent | Which agent was assigned |
| steps[].gate | steps[].gate | Is this a gate checkpoint? |
| steps[].inputs | steps[].input_hash | Hashed inputs (for privacy) |
| steps[].outputs | steps[].output_hash | Hashed outputs (for privacy) |
| edges | (implicit) | Step ordering and gate dependencies |
| failureModes | steps[].status | Transition states (completed, failed, gated) |

This means a compiled IRProgram can be directly used as the template for generating trace records during execution—no schema translation needed.
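The field mapping above suggests how a trace skeleton could be derived from a compiled program before execution starts. The sketch below assumes the IRProgram is available as a plain dictionary using the field names from the table; the real @stratt/ir types are not shown in this document, and `trace_skeleton` is a hypothetical helper written in Python purely for illustration.

```python
def trace_skeleton(ir_program: dict) -> dict:
    """Build an empty trace record whose shape mirrors the compiled chain."""
    return {
        "chain_uri": ir_program["chainUri"],
        "version": ir_program["version"],
        "council": ir_program["meta"]["council"],
        "steps": [
            {
                "step_id": step["id"],
                "unit_uri": step["unitUri"],
                "agent": step["agent"],
                "gate": step["gate"],
                "status": None,  # filled in once the step finishes
            }
            for step in ir_program["steps"]
        ],
    }


program = {
    "chainUri": "stratt://dev/chain/sol-1-boot@0.1.0",
    "version": "0.1.0",
    "meta": {"council": "pathfinder"},
    "steps": [
        {"id": "step-01", "unitUri": "stratt://dev/task/intake-parse@0.1.0",
         "agent": "WATNEY-01", "gate": False},
    ],
}
print(trace_skeleton(program)["steps"][0]["unit_uri"])
# → stratt://dev/task/intake-parse@0.1.0
```

Timestamps, hashes, token counts, and quality indicators would be added to each step as it executes.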

Scenario: stratt://dev/chain/sol-1-boot@0.1.0 executes with 3 steps + 1 gate

trace_id: tr-2026-04-01-001
chain_uri: stratt://dev/chain/sol-1-boot@0.1.0
version: 0.1.0
fingerprint: blake3:9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f
session_id: sess-2026-04-01T10:00:00Z
council: pathfinder
started_at: 2026-04-01T10:00:00Z
completed_at: 2026-04-01T10:05:23Z
duration_ms: 323000
status: completed
quality_score: 0.82
token_counts:
  input: 4200
  output: 1850
  total: 6050
steps:
  - step_id: step-01
    unit_uri: stratt://dev/task/intake-parse@0.1.0
    unit_fingerprint: blake3:a1...
    agent: WATNEY-01
    gate: false
    started_at: 2026-04-01T10:00:00Z
    completed_at: 2026-04-01T10:01:12Z
    duration_ms: 72000
    status: completed
    input_hash: blake3:8a...
    output_hash: blake3:2d...
    token_counts:
      input: 800
      output: 350
    quality_indicators:
      contract_conformance: true
      completeness: 0.9
      token_efficiency: 0.75
  - step_id: step-02
    unit_uri: stratt://dev/task/process-analysis@0.1.0
    unit_fingerprint: blake3:c3...
    agent: LEWIS-06
    gate: false
    started_at: 2026-04-01T10:01:13Z
    completed_at: 2026-04-01T10:04:45Z
    duration_ms: 212000
    status: completed
    input_hash: blake3:5b...
    output_hash: blake3:7f...
    token_counts:
      input: 3200
      output: 1200
    quality_indicators:
      contract_conformance: true
      completeness: 0.85
      token_efficiency: 0.68
  - step_id: step-03
    unit_uri: stratt://dev/gate/boot-review@0.1.0
    agent: LEWIS-06
    gate: true
    started_at: 2026-04-01T10:04:46Z
    completed_at: 2026-04-01T10:04:50Z
    duration_ms: 4000
    status: completed
    gate_resolution:
      state: APPROVED
      resolved_by: LEWIS-06
      resolved_at: 2026-04-01T10:04:50Z
      wait_duration_ms: 4000
      reason: null
      packet_hash: blake3:7f...