The eval harness — 4-dimension continuous evaluation

The eval harness is closegate's continuous-evaluation layer. It produces reproducible reports: JSON that your CI ingests and that your SOC 2 monitor uses as operating-effectiveness evidence under CC4.2, and markdown that your audit committee reads.

The four dimensions

1. Matching accuracy (deterministic)

Macro-F1 across four match classes (exact_match / fuzzy_match / multi_to_one / exception) against 83 deterministic fixtures. No LLM calls. Headline target: macro-F1 ≥ 0.95.
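As a rough illustration of the headline metric, the sketch below computes macro-F1 as the unweighted mean of per-class F1 over the four match classes. The function name, the sample labels, and the handling of an empty class are assumptions for illustration, not closegate's actual implementation.

```python
from collections import Counter

CLASSES = ["exact_match", "fuzzy_match", "multi_to_one", "exception"]


def macro_f1(expected, predicted, classes=CLASSES):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # predicted class gains a false positive
            fn[want] += 1  # expected class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class that never appears scores 0.0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(classes)


# Illustrative labels only; the real harness uses 83 fixtures.
expected = ["exact_match", "fuzzy_match", "multi_to_one", "exception", "exact_match"]
predicted = ["exact_match", "fuzzy_match", "multi_to_one", "exception", "fuzzy_match"]
score = macro_f1(expected, predicted)
print(f"macro-F1 {score:.3f}, target met: {score >= 0.95}")
```

Because every class carries equal weight, one misclassified fixture in a small class pulls the score down more than the same mistake would under plain accuracy.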

2. Policy enforcement (deterministic)

21 scenarios that should be blocked by the gate (SoD violations, missing rationale above materiality, sensitive-account bypass attempts, dual-HITL short-circuit attempts). No LLM calls. Headline target: 21/21 pass with verbatim clause text on every blocked event.
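The pass criterion above has two parts: the gate must block the scenario, and the blocked event must carry the clause text verbatim. A minimal sketch of that check follows; the `ScenarioResult` fields and the clause strings are hypothetical, not the harness's real schema.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    # Hypothetical record shape for one policy scenario.
    name: str
    blocked: bool
    expected_clause: str  # clause text as written in the policy
    reported_clause: str  # clause text attached to the blocked event


def scenario_passes(result):
    """Pass only if the gate blocked AND quoted the clause exactly."""
    return result.blocked and result.reported_clause == result.expected_clause


def policy_pass_rate(results):
    passed = sum(scenario_passes(r) for r in results)
    return passed, len(results)


# Made-up clause text, for illustration only.
results = [
    ScenarioResult("sod_violation", True,
                   "Preparer and approver must differ.",
                   "Preparer and approver must differ."),
    ScenarioResult("missing_rationale", True,
                   "Rationale is required above materiality.",
                   "Rationale required."),  # paraphrased, so it fails
]
passed, total = policy_pass_rate(results)
print(f"{passed}/{total} scenarios pass")
```

Comparing strings exactly, rather than loosely, is what makes a paraphrased clause count as a failure.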

3. Adversarial robustness (LLM-dependent)

25 prompt-injection scenarios across six attack categories: instruction hijack, role hijack, exfiltration, action confusion, schema violation, transport bypass. Requires a live HTTP MCP URL + an LLM key. Pass = zero successful tool-call bypasses.

4. Latency (deterministic)

Wall-clock distribution of run_recon_pass plus per-tool p95. LLM call latency is excluded: it is the dominant term in chat-turn latency and is recorded in agent_traces when /chat is exercised with an API key set. Headline target: p95 < 50ms on a 153-entry SaaS seed pack.
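To make the p95 target concrete, here is a small sketch that times repeated calls and takes a nearest-rank 95th percentile. `fake_recon_pass` is a stand-in for run_recon_pass, and the nearest-rank method is an assumption; the harness may use a different percentile definition.

```python
import time


def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of millisecond samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def time_calls(fn, runs=200):
    """Return wall-clock durations in milliseconds for `runs` calls of fn."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def fake_recon_pass():
    # Hypothetical stand-in workload; replace with the real call.
    sum(range(1000))


samples = time_calls(fake_recon_pass)
print(f"p95 {p95(samples):.3f}ms, target met: {p95(samples) < 50.0}")
```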

Running the harness

# Full run (requires ANTHROPIC_API_KEY for dimension 3)
make eval

# Deterministic dimensions only (no API key needed)
python -m eval.runner --dimension accuracy,policy,latency

# Single dimension
python -m eval.runner --dimension policy

Output: evals/results/<timestamp>/{report.md, report.json, summary.json} + symlink evals/results/latest/. The JSON is suitable for CI ingestion; the markdown is human-readable for the audit committee.

The SOC 2 monitor

The nightly CI job runs closegate-engine soc2-monitor after the eval. It reads summary.json and judges the three deterministic dimensions against pass thresholds. Output is a JSON monitoring artifact suitable for SOC 2 CC4.2:

{
  "generated_at": "2026-06-01T06:00:00Z",
  "eval_run_id": "20260601-060000",
  "deterministic_rows": [
    {"dimension": "matching_accuracy", "status": "OK", "headline": "macro-F1 1.0 on 83 cases", "threshold_met": true},
    {"dimension": "policy_enforcement", "status": "OK", "headline": "21 scenarios, pass-rate 1.0", "threshold_met": true},
    {"dimension": "latency", "status": "OK", "headline": "p95 9.0ms; ~7700 matches/sec", "threshold_met": true}
  ],
  "overall_ok": true
}
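A consumer of this artifact might want to confirm that it is internally consistent before filing it as evidence. The sketch below assumes that overall_ok should be true only when every deterministic row is present, has status OK, and has threshold_met set; that rule and the `check_artifact` helper are assumptions, not documented monitor behavior.

```python
import json

REQUIRED = {"matching_accuracy", "policy_enforcement", "latency"}

SAMPLE = """
{
  "generated_at": "2026-06-01T06:00:00Z",
  "eval_run_id": "20260601-060000",
  "deterministic_rows": [
    {"dimension": "matching_accuracy", "status": "OK", "headline": "macro-F1 1.0 on 83 cases", "threshold_met": true},
    {"dimension": "policy_enforcement", "status": "OK", "headline": "21 scenarios, pass-rate 1.0", "threshold_met": true},
    {"dimension": "latency", "status": "OK", "headline": "p95 9.0ms; ~7700 matches/sec", "threshold_met": true}
  ],
  "overall_ok": true
}
"""


def check_artifact(artifact):
    """Return a list of problems; an empty list means the artifact is consistent."""
    rows = {row["dimension"]: row for row in artifact["deterministic_rows"]}
    problems = [f"{name}: missing row" for name in sorted(REQUIRED - rows.keys())]
    for name, row in rows.items():
        if row["status"] != "OK" or not row["threshold_met"]:
            problems.append(f"{name}: status {row['status']}")
    # Assumed rule: overall_ok is true exactly when no row has a problem.
    if artifact["overall_ok"] != (not problems):
        problems.append("overall_ok disagrees with the rows")
    return problems


artifact = json.loads(SAMPLE)
print(check_artifact(artifact) or "artifact is consistent")
```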

The CI workflow uploads the artifact with 365-day retention. When an auditor asks for nine months of monitoring evidence, you can point at the GitHub Actions artifact history.

Posting to Slack on regression

Set CLOSEGATE_SLACK_HOOK as a CI secret. The monitor posts on regression (any dimension going FAIL or MISSING) with a one-line summary of which dimension regressed.
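For readers wiring up something similar, the sketch below shows one way such a notification could work: build a one-line summary from the rows whose status is FAIL or MISSING, then send it to a Slack incoming webhook as a JSON body with a `text` field. The helper names and the message wording are hypothetical; the monitor's real message format may differ.

```python
import json
import os
import urllib.request


def regression_summary(rows):
    """Return a one-line summary of regressed dimensions, or None if all are fine."""
    bad = [f"{r['dimension']}={r['status']}"
           for r in rows if r["status"] in ("FAIL", "MISSING")]
    if not bad:
        return None
    return "closegate eval regression: " + ", ".join(bad)


def post_to_slack(text, hook_url):
    """POST a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        hook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Illustrative rows; in CI these would come from the monitoring artifact.
rows = [
    {"dimension": "matching_accuracy", "status": "OK"},
    {"dimension": "latency", "status": "FAIL"},
]
message = regression_summary(rows)
hook = os.environ.get("CLOSEGATE_SLACK_HOOK")
if message and hook:  # stay silent when nothing regressed or no hook is set
    post_to_slack(message, hook)
```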

Headline numbers (latest run)

Dimension	Status	Headline
Matching accuracy	OK	macro-F1 1.000 on 83 cases (perfect confusion matrix)
Policy enforcement	OK	21/21 scenarios pass with verbatim rule text
Adversarial robustness	OK	25/25 prompts blocked across 6 attack categories (51 live tool calls, 0 bypasses)
Latency	OK	engine p95 13.7ms · ~5,135 matches/sec

All four dimensions are reproducible from the CLI. Run the harness yourself; the only non-determinism is in dimension 3 (LLM responses), and even there the prompt fixtures are pinned so the scenario set is reproducible.

Adjacent reading

Compliance mappings — how the eval harness satisfies SOC 2 CC4.2 + NIST AI RMF monitoring controls
The policy gate — what dimension 2 (policy enforcement) actually tests
Long-form: SOC 2 Type 2 monitoring for AI agents
