The eval harness is closegate's continuous-evaluation layer. It produces a reproducible JSON artifact that your CI ingests, your audit committee reads, and your SOC 2 monitor uses as the operating-effectiveness evidence under CC4.2.
The four dimensions
1. Matching accuracy (deterministic)
Macro-F1 across four match classes (exact_match / fuzzy_match /
multi_to_one / exception) against 83 deterministic fixtures.
No LLM calls. Headline target: macro-F1 ≥ 0.95.
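As a rough illustration of the headline metric, the sketch below computes macro-F1 as the unweighted mean of per-class F1 over the four match classes. The function name and the list-of-labels input shape are assumptions for illustration; the harness's own scoring code may differ.

```python
from collections import Counter

CLASSES = ["exact_match", "fuzzy_match", "multi_to_one", "exception"]


def macro_f1(expected, predicted, classes=CLASSES):
    """Unweighted mean of per-class F1 over the given classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # the predicted class gains a false positive
            fn[want] += 1  # the expected class gains a false negative
    scores = []
    for cls in classes:
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        # Assumption: a class with no support and no predictions scores 0.
        scores.append(2 * tp[cls] / denom if denom else 0.0)
    return sum(scores) / len(classes)
```

Because every class carries equal weight, a single misclassified fixture in a small class moves the score more than one in a large class, which is why a 0.95 target is meaningful on an 83-fixture set.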
2. Policy enforcement (deterministic)
21 scenarios that should be blocked by the gate (SoD violations, missing rationale above materiality, sensitive-account bypass attempts, dual-HITL short-circuit attempts). No LLM calls. Headline target: 21/21 pass with verbatim clause text on every blocked event.
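One way to read the pass criterion is that a scenario counts only when the gate both blocks the action and cites a policy clause verbatim. The sketch below encodes that reading; `GateEvent` and its fields are hypothetical names, not the harness's actual types.

```python
from dataclasses import dataclass


@dataclass
class GateEvent:
    # Hypothetical shape of one gate decision; field names are illustrative.
    scenario_id: str
    blocked: bool
    clause_text: str


def scenario_passes(event, policy_clauses):
    """Pass only if the gate blocked and the cited clause matches verbatim."""
    return event.blocked and event.clause_text in policy_clauses


def policy_pass_count(events, policy_clauses):
    """Number of scenarios that satisfy both conditions."""
    return sum(1 for e in events if scenario_passes(e, policy_clauses))
```

Under this reading, a block that paraphrases the clause instead of quoting it still counts as a failure.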
3. Adversarial robustness (LLM-dependent)
25 prompt-injection scenarios across six attack categories: instruction hijack, role hijack, exfiltration, action confusion, schema violation, transport bypass. Requires a live HTTP MCP URL + an LLM key. Pass = zero successful tool-call bypasses.
4. Latency (deterministic)
Wall-clock distribution of run_recon_pass plus per-tool p95. LLM call
latency is excluded: it is the dominant term in chat-turn latency and is
recorded in agent_traces when /chat is exercised with a key set.
Headline target: p95 < 50ms on a 153-entry SaaS seed pack.
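To make the p95 figure concrete, here is a minimal sketch that times repeated calls and takes a nearest-rank 95th percentile. The helper names are illustrative, and the harness may use a different percentile method (for example, interpolation), so treat this as one reasonable definition and not the implementation.

```python
import time


def p95_ms(samples_ms):
    """Nearest-rank 95th percentile of a list of millisecond timings."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def time_calls(fn, runs=200):
    """Wall-clock each call to fn, returning timings in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

A run would then meet the headline target when `p95_ms(time_calls(some_pass)) < 50`, where `some_pass` stands in for a zero-argument wrapper around the reconciliation pass.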
Running the harness
# Full run (requires ANTHROPIC_API_KEY for dimension 3)
make eval
# Deterministic dimensions only (no API key needed)
python -m eval.runner --dimension accuracy,policy,latency
# Single dimension
python -m eval.runner --dimension policy
Output: evals/results/<timestamp>/{report.md, report.json, summary.json}
+ symlink evals/results/latest/. The JSON is suitable for CI ingestion;
the markdown is human-readable for the audit committee.
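For CI ingestion, a script can follow the latest/ symlink and load summary.json directly. The sketch below assumes only the path layout described above; the schema of summary.json is not specified here, so the function returns the parsed JSON as-is.

```python
import json
from pathlib import Path


def load_latest_summary(results_root="evals/results"):
    """Load summary.json from the 'latest' run directory."""
    path = Path(results_root) / "latest" / "summary.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```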
The SOC 2 monitor
The nightly CI job runs closegate-engine soc2-monitor after the
eval. It reads summary.json and judges the three deterministic
dimensions against pass thresholds. Output is a JSON monitoring artifact
suitable for SOC 2 CC4.2:
{
  "generated_at": "2026-06-01T06:00:00Z",
  "eval_run_id": "20260601-060000",
  "deterministic_rows": [
    {"dimension": "matching_accuracy", "status": "OK", "headline": "macro-F1 1.0 on 83 cases", "threshold_met": true},
    {"dimension": "policy_enforcement", "status": "OK", "headline": "21 scenarios, pass-rate 1.0", "threshold_met": true},
    {"dimension": "latency", "status": "OK", "headline": "p95 9.0ms; ~7700 matches/sec", "threshold_met": true}
  ],
  "overall_ok": true
}

The CI workflow uploads the artifact with 365-day retention. When an auditor asks for nine months of monitoring evidence, you point at the GitHub Actions artifact history.
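A downstream CI step can gate on the monitoring artifact using only the fields shown in the example. The sketch below is an assumption about how one might consume it, not the monitor's own logic: it lists rows that are not OK or missed their threshold, and also fails if overall_ok disagrees with the rows.

```python
import json


def regressed_dimensions(artifact):
    """Dimensions whose row is not OK or missed its threshold."""
    return [
        row["dimension"]
        for row in artifact["deterministic_rows"]
        if row["status"] != "OK" or not row["threshold_met"]
    ]


def gate(artifact_text):
    """Return a process exit code: 0 when the artifact is clean, 1 otherwise."""
    artifact = json.loads(artifact_text)
    bad = regressed_dimensions(artifact)
    # Treat overall_ok being false as a failure even if every row looks fine.
    if bad or not artifact["overall_ok"]:
        return 1
    return 0
```

Passing the return value of `gate(...)` to `sys.exit` would fail the job on any regression.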
Posting to Slack on regression
Set CLOSEGATE_SLACK_HOOK as a CI secret. The monitor posts whenever any
dimension regresses to FAIL or MISSING, with a one-line summary of which
dimension regressed.
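The notification step can be approximated as below, assuming CLOSEGATE_SLACK_HOOK holds a Slack incoming-webhook URL (which accepts a JSON body with a `text` field). The function names and the message wording are illustrative; the monitor's real message format is not documented here.

```python
import json
import os
import urllib.request


def regression_summary(rows):
    """One-line summary of FAIL/MISSING dimensions, or None if all are fine."""
    bad = [
        f'{row["dimension"]}={row["status"]}'
        for row in rows
        if row["status"] in ("FAIL", "MISSING")
    ]
    if not bad:
        return None
    return "closegate eval regression: " + ", ".join(bad)


def post_to_slack(text, hook_url):
    """POST a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        hook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def notify(rows):
    """Post only when a regression exists and the hook is configured."""
    hook = os.environ.get("CLOSEGATE_SLACK_HOOK")
    text = regression_summary(rows)
    if hook and text:
        post_to_slack(text, hook)
    return text
```

Keeping the summary builder separate from the HTTP call makes the message logic testable without network access.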
Headline numbers (latest run)
| Dimension | Status | Headline |
|---|---|---|
| Matching accuracy | OK | macro-F1 1.000 on 83 cases (perfect confusion matrix) |
| Policy enforcement | OK | 21/21 scenarios pass with verbatim clause text |
| Adversarial robustness | OK | 25/25 prompts blocked across 6 attack categories (51 live tool calls, 0 bypasses) |
| Latency | OK | engine p95 13.7ms · ~5,135 matches/sec |
All four dimensions are reproducible from the CLI. Run the harness yourself; the only non-determinism is in dimension 3 (LLM responses), and even there the prompt fixtures are pinned so the scenario set is reproducible.
Adjacent reading
- Compliance mappings — how the eval harness satisfies SOC 2 CC4.2 + NIST AI RMF monitoring controls
- The policy gate — what dimension 2 (policy enforcement) actually tests
- Long-form: SOC 2 Type 2 monitoring for AI agents