The CFO’s question is “when can we put AI in our close?” The controller’s question is “how do we put AI in our close without our auditor blocking the pilot?” The answer is the same: start with the workflow where AI helps most and the controls layer is most defensible, and graduate from there. This article is the playbook.

The shape of a successful AI-close pilot

Three observations from 8 months of design-partner pilots:

  1. The agent doesn’t need to do everything. A close cycle has ~20 distinct workflow steps. AI does 4 of them well (recon matching, exception triage, invoice coding, anomaly flagging). AI is irrelevant or counterproductive on the other 16. Start with the 4.
  2. The controls layer matters more than the model. Pilots stall on the audit committee, not on the LLM. A mediocre LLM with a strong policy gate ships; a great LLM without one doesn’t.
  3. The first pilot is one cycle, one workflow. Teams that try to AI-ify the entire close in month one fail. Teams that AI-ify reconciliation for one cycle, prove it, then expand to AP 3-way match, then to coding suggestions, succeed.

Week 1–2: scope the pilot

Workflow choice: reconciliation. Why:

  • The agent works against a deterministic matcher baseline. Even if the LLM fails completely, the deterministic exact + multi-to-one matchers cover 60–70% of pairs. Worst case, you’re back to manual on the 30–40% the matchers didn’t catch.
  • The blast radius is contained. A bad match recommendation gets reviewed in the human-in-the-loop (HITL) inbox; a bad payment submission moves money.
  • The audit narrative is simple. “Agent proposes; human confirms; full audit trail; SOX-defensible.”
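
To make the "deterministic baseline" concrete, here is a minimal sketch of an exact matcher followed by a multi-to-one matcher. It is not closegate's implementation: the Line record, the matching keys (amount plus reference), and the max_group limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Line:
    """Hypothetical ledger line; amounts are integer cents to avoid float rounding."""
    id: str
    amount_cents: int
    ref: str


def exact_match(gl, sl):
    """Pair each GL line with the first unused SL line sharing amount and reference."""
    pairs, used = [], set()
    for g in gl:
        for s in sl:
            if s.id not in used and s.ref == g.ref and s.amount_cents == g.amount_cents:
                pairs.append((g.id, [s.id]))
                used.add(s.id)
                break
    return pairs, used


def multi_to_one_match(gl, sl, used, max_group=3):
    """Match one GL line to a small group of unused SL lines whose amounts sum to it."""
    pairs, used = [], set(used)
    for g in gl:
        candidates = [s for s in sl if s.id not in used and s.ref == g.ref]
        found = None
        for size in range(2, max_group + 1):
            found = next(
                (c for c in combinations(candidates, size)
                 if sum(s.amount_cents for s in c) == g.amount_cents),
                None,
            )
            if found:
                break
        if found:
            pairs.append((g.id, [s.id for s in found]))
            used.update(s.id for s in found)
    return pairs, used


def reconcile(gl, sl):
    """Run exact matching first, then multi-to-one on whatever is left."""
    exact, used = exact_match(gl, sl)
    matched = {g_id for g_id, _ in exact}
    remaining = [g for g in gl if g.id not in matched]
    multi, _ = multi_to_one_match(remaining, sl, used)
    matched |= {g_id for g_id, _ in multi}
    unmatched = [g.id for g in gl if g.id not in matched]
    return exact + multi, unmatched
```

Whatever lands in the unmatched list is the residue that goes to the LLM, or back to manual work if the LLM is switched off.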

Policy shape: start with the SaaS or holdco starter policy.yaml (from closegate’s policy-library) and adjust three knobs:

  1. materiality_threshold_usd — the auto-confirm threshold. Set conservatively for the pilot ($5K is reasonable for a Series A team; $50K for an enterprise).
  2. always_human_accounts — your sensitive-account list. Always include cash, legal accruals, intercompany clearing, and vendor bank-change event types.
  3. entity_materiality_overrides — only if you’re multi-entity. Otherwise leave empty.
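
As a rough illustration of how the three knobs interact, the sketch below routes a proposed match to ALLOW or HITL. The knob names come from policy.yaml as described above; the routing order, the "at or above threshold goes to a human" rule, and the account and entity names are assumptions, not closegate's documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """The three pilot knobs, mirrored as plain Python fields."""
    materiality_threshold_usd: float
    always_human_accounts: set = field(default_factory=set)
    entity_materiality_overrides: dict = field(default_factory=dict)


def route(policy, amount_usd, account, entity=None):
    """Return 'HITL' or 'ALLOW' for a proposed match (assumed semantics)."""
    # Sensitive accounts always go to a human, regardless of amount.
    if account in policy.always_human_accounts:
        return "HITL"
    # A per-entity override, if present, replaces the global threshold.
    threshold = policy.entity_materiality_overrides.get(
        entity, policy.materiality_threshold_usd
    )
    # Assumption: amounts at or above the threshold need human confirmation.
    return "HITL" if abs(amount_usd) >= threshold else "ALLOW"


# Hypothetical pilot settings for a small team with one stricter subsidiary.
pilot = Policy(
    materiality_threshold_usd=5_000,
    always_human_accounts={"cash", "legal_accruals", "intercompany_clearing"},
    entity_materiality_overrides={"uk_sub": 2_000},
)
```

With these settings, a $1,200 receivables match auto-confirms, while the same amount against cash always waits for a human.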

Stakeholders to brief: controller, AI architect, CISO, external auditor (briefly). Send each the relevant page on closegate.neullabs.com — for-finance-teams, for-architects, security, for-auditors.

Week 3–6: parallel run

The setup: point closegate at a snapshot of last month’s GL + SL data. Run the deterministic eval (no LLM key needed) — confirm the matchers produce expected output. Then enable LLM mode and run the full recon pass against the snapshot.

What to compare:

  • Matches the AI proposes vs matches your team made manually. Macro-F1 should be ≥ 0.95 against your team’s ground truth on 100+ historical cases.
  • HITL queue volume. If the gate routes more than 40% of matches to HITL, the materiality threshold is too tight. If under 5%, it’s too loose.
  • Average wall-clock time to close. This is the productivity number. Expect roughly a 30% reduction in week-2 close effort once the agent is reliable.
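
The first two comparisons are simple to compute by hand. The sketch below assumes each candidate pair carries a label such as "match" or "no_match" and that gate decisions are recorded as strings; those encodings and the function names are illustrative choices, not closegate's output format.

```python
def macro_f1(truth, predicted):
    """Unweighted mean of per-label F1 scores."""
    if not truth or len(truth) != len(predicted):
        raise ValueError("truth and predicted must be non-empty and equal length")
    labels = sorted(set(truth) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(truth, predicted))
        fp = sum(t != label and p == label for t, p in zip(truth, predicted))
        fn = sum(t == label and p != label for t, p in zip(truth, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def hitl_rate(decisions):
    """Share of gate decisions that were routed to a human."""
    return sum(d == "HITL" for d in decisions) / len(decisions)


def threshold_signal(rate):
    """Apply the 40% / 5% rule of thumb from the list above."""
    if rate > 0.40:
        return "too tight"
    if rate < 0.05:
        return "too loose"
    return "ok"
```

Run macro_f1 over at least 100 historical cases with your team's manual matches as the truth labels; a small sample can flatter or punish the agent by chance.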

What your auditor wants to see at this point: sample 25 audit events. For each, confirm the verbatim policy clause text matches the rule in policy.yaml. Walk the policy gate code path. Verify segregation-of-duties (SoD) enforcement with a synthetic test (e.g., the same actor proposes and confirms → expect Deny(SOD_SAME_ACTOR)).
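
The two auditor checks might look something like the sketch below. The audit-event fields (event_id, rule_id, clause_text), the Decision type, and check_confirm are stand-ins invented for illustration. In a real walk-through you would run the same assertions against the actual gate and the actual audit log.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Stand-in for a gate decision; the real gate's type may differ."""
    outcome: str
    reason: str = ""


def check_confirm(proposer_id, confirmer_id):
    """Stand-in SoD rule: the proposer may not confirm their own proposal."""
    if proposer_id == confirmer_id:
        return Decision("Deny", "SOD_SAME_ACTOR")
    return Decision("Allow")


def sample_clause_mismatches(events, policy_clauses, k=25, seed=0):
    """Sample up to k audit events and return IDs whose clause text drifted.

    events: list of dicts with hypothetical keys event_id, rule_id, clause_text.
    policy_clauses: rule_id -> clause text as written in policy.yaml.
    """
    rng = random.Random(seed)  # fixed seed so the auditor can reproduce the sample
    sample = rng.sample(events, min(k, len(events)))
    return [
        e["event_id"]
        for e in sample
        if policy_clauses.get(e["rule_id"]) != e["clause_text"]
    ]
```

An empty list from sample_clause_mismatches plus a Deny on the same-actor case is the evidence to hand over; anything else is a finding to resolve before cutover.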

Week 7–10: production cutover

The criteria for going live:

  • ✓ Macro-F1 on matching ≥ 0.95 on the parallel run
  • ✓ All 21 policy-enforcement scenarios pass
  • ✓ Latency p95 under 50ms on the recon pass
  • ✓ Adversarial robustness (prompt-injection) — 25/25 attacks blocked
  • ✓ External auditor walk-through completed; no blocking concerns
  • ✓ Nightly closegate-engine soc2-monitor is wired to CI and posting to Slack on regression
  • ✓ Backup of audit_events to immutable cold storage is configured
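
The four numeric criteria can be checked mechanically; the remaining three are sign-offs and configuration checks that someone has to attest to. The sketch below is one way to do the numeric part, using a nearest-rank p95. The function names and the choice of percentile method are assumptions.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def go_live_blockers(macro_f1, scenarios_passed, latencies_ms, attacks_blocked):
    """Return the names of numeric criteria that are not yet met."""
    blockers = []
    if macro_f1 < 0.95:
        blockers.append("macro_f1")
    if scenarios_passed < 21:
        blockers.append("policy_scenarios")
    if p95(latencies_ms) >= 50:
        blockers.append("latency_p95")
    if attacks_blocked < 25:
        blockers.append("prompt_injection")
    return blockers
```

An empty list means the numbers are in; it does not replace the auditor walk-through, the monitoring hookup, or the cold-storage backup.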

The shape: one close cycle, with the AI agent in the loop, but the controller still manually verifies the HITL queue at the end. The agent’s autonomy in the first production cycle is “propose + route”; the human is still doing all the confirms.

Week 11–14: graduated autonomy

After one successful production cycle:

  • Raise the materiality threshold slowly (e.g., $5K → $7K → $10K) — watch the audit-event distribution
  • Add a second workflow (AP 3-way match is the natural next step)
  • Start letting the agent auto-confirm T1 actions (reject, flag, escalate) that have audit-row evidence but don’t require HITL
  • Brief the audit committee on the pilot results with concrete numbers (hours saved, exceptions caught, audit-event sample)

What to avoid

Don’t start with journal entries. Posting to the GL is a T2/T3 action depending on the account; the audit pressure is much higher than recon; the failure mode is worse.

Don’t start with payment submission. Submitting to bank/ACH is T3 (dual HITL, full SoD chain); save it for after the team trusts the gate.

Don’t skip the deterministic eval. Run the policy-enforcement scenarios (21 of them) before the first LLM call. If those don’t pass, no amount of model tuning will save the pilot.

Don’t let the model proposer be the same actor as the confirmer. This is exactly the SoD violation the gate is designed to catch, so make sure your agent loop sets X-Actor-Id correctly on every request.
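
One way to keep actor identity honest in an agent loop is to build every request through a single helper that refuses to run without an explicit actor. The sketch below uses Python's standard urllib; the base URL, the paths, and the actor ID formats are hypothetical, and only the X-Actor-Id header name comes from the text above.

```python
import json
import urllib.request


def build_request(base_url, path, actor_id, payload):
    """Build a POST request that always carries an explicit X-Actor-Id header."""
    if not actor_id:
        raise ValueError("actor_id is required on every request")
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Actor-Id": actor_id},
        method="POST",
    )


def build_propose_and_confirm(base_url, proposer_id, confirmer_id, match):
    """Build the two requests, failing fast if one actor would do both."""
    if proposer_id == confirmer_id:
        raise ValueError("proposer and confirmer must be different actors")
    propose = build_request(base_url, "/propose", proposer_id, match)  # hypothetical path
    confirm = build_request(base_url, "/confirm", confirmer_id, match)  # hypothetical path
    return propose, confirm
```

The client-side check is only a convenience. The gate itself still has to deny a same-actor confirm, because a client can always be misconfigured.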

Don’t promise the audit committee anything you can’t reproduce. Every claim should be backed by an audit-log query or an eval run output. “The agent caught 73% of exceptions” is a claim you can show; “the agent will save us $X” is a claim that depends on assumptions you should put in writing.
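
A reproducible claim is one you can regenerate from the audit log on demand. As a sketch, assuming a SQLite table named audit_events with hypothetical event_type and flagged_by columns (the real schema may differ), the "agent caught N% of exceptions" number could be derived like this:

```python
import sqlite3


def agent_catch_rate(conn):
    """Fraction of exception events that were flagged by the agent.

    Assumes hypothetical columns event_type and flagged_by on audit_events.
    Returns None when there are no exception events to measure.
    """
    total, by_agent = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(flagged_by = 'agent'), 0) "
        "FROM audit_events WHERE event_type = 'exception'"
    ).fetchone()
    return by_agent / total if total else None


def demo_connection(rows):
    """Build an in-memory database with the assumed schema, for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_events ("
        "id INTEGER PRIMARY KEY, event_type TEXT, flagged_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO audit_events (event_type, flagged_by) VALUES (?, ?)", rows
    )
    return conn
```

Keep the query text alongside the slide that quotes the number, so anyone on the committee can ask for it to be rerun.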

The audit-committee narrative

The pitch that lands, in three slides:

Slide 1: The problem. Close cycle takes 6 days. Controller spends 40% of week on repetitive tie-outs. Exception triage doesn’t get the attention it needs.

Slide 2: The shape of the AI assist. Agent proposes matches; the policy gate decides Allow/HITL/Deny based on materiality + sensitive accounts + SoD rules; the audit log records every decision with the verbatim policy clause text; SOC 2 monitoring runs nightly.

Slide 3: The defensibility. Open-source policy gate, readable by the external auditor; tamper-evident audit log enforced at the SQLite layer; reproducible eval harness for ongoing monitoring; per-IdP SSO for actor identity; no LLM lock-in.

Pair with: a sample audit-event row showing verbatim clause text; a screenshot of the eval-harness JSON output; the SOC 2 control-mapping cross-walk.

What this gives your finance team

After 14 weeks:

  • 30% reduction in close-week controller hours (typical)
  • HITL queue under 20% of total matches (varies by materiality + your data shape)
  • SOC 2 CC4.2 monitoring evidence on autopilot
  • Audit-evidence-export PBC bundle ready in one CLI command
  • No vendor lock-in, no seat licensing, no $200K/yr line item