A controller I spoke with last month got paged at 2:14am.

An AI agent had just auto-confirmed an $87,432 vendor invoice match. The vendor’s bank account had been changed 9 hours earlier, through an email request that looked routine, and the agent matched the new bank details to an existing PO without flagging it.

The payment run wasn’t scheduled to submit until Monday. So nothing left the company. But the audit trail showed an LLM had decided this was fine in the early hours of a Sunday, with no human in the loop, in 1.8 seconds.

That controller spent her Sunday writing a memo to her audit committee.

What the memo said

The memo did not say “we’ll prompt-engineer it better.”

It said:

We need to be able to prove, on every single state-changing decision, who authorized it, against what policy clause, with what materiality threshold, against what intercompany rule, with which actor identity — and the agent cannot be the one authorizing irreversible actions, no matter what the prompt says.

That sentence is the entire architectural commitment in 47 words.
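
To make that commitment concrete, here is a minimal Python sketch of what such a decision record and authorization rule could look like. Every name in it (DecisionRecord, actor_kind, authorize) is an illustrative assumption, not the team's code and not closegate's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    """One row per state-changing decision (hypothetical schema)."""
    action: str                       # e.g. "confirm_match"
    actor_id: str                     # which identity made the call
    actor_kind: str                   # "human" or "agent"
    policy_clause: str                # clause the decision was checked against
    materiality_threshold: str        # threshold in force at decision time
    intercompany_rule: Optional[str]  # None when no intercompany rule applies
    irreversible: bool


class AuthorizationError(Exception):
    """Raised when an actor may not authorize the requested action."""


def authorize(record: DecisionRecord) -> DecisionRecord:
    # The rule from the memo: an agent never authorizes an irreversible
    # action, whatever the prompt says. Enforced in code, not in the prompt.
    if record.irreversible and record.actor_kind != "human":
        raise AuthorizationError(
            f"{record.action}: irreversible actions need a human authorizer"
        )
    return record
```

The point of the sketch is where the rule lives: the check sits in the runtime path, so no wording of the prompt can route around it.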

What actually happened, step by step

The pilot had been running for three months. The team had wired their AI agent to QuickBooks Online, Mercury for bank feeds, and an internally built matcher. The matcher worked well: 94% macro-F1 on a backlog of historical reconciliations.

The agent was given autonomy for “below-materiality matches” — the team had set the threshold at $50K, reasoning that any single match below $50K was acceptable for auto-confirm because no individual match could move the financials meaningfully.

The flaw in the reasoning: the materiality threshold doesn’t catch fraud vectors that operate below the threshold.

Hour 0 (Saturday, ~5pm)

A spear-phishing email lands in the AP inbox from a domain that’s a homoglyph of a real vendor. The email says: “Hi, just a reminder our banking details have changed effective immediately. Please update for the next invoice run.” Attached: a PDF with the “updated” bank details. The bank details point to a fraudster’s account.

The AP coordinator, a junior staffer working her first Saturday catching up on the week’s email, updates the vendor’s banking record in QuickBooks. She doesn’t see the homoglyph; she trusts the email; the PDF looks legitimate.

Hour 9 (Sunday, 2:14am)

The agent’s nightly recon pass runs on schedule. It encounters an $87,432 invoice from the same vendor — a recurring SaaS bill that this vendor has been issuing for 14 months. The matcher:

  • Sees the PO ($87,432 approved for this vendor, this month)
  • Sees the invoice ($87,432, vendor name match)
  • Sees the bank account on the invoice matches the vendor’s stored bank account
  • Confirms the match, treating $87,432 as below the $50K materiality threshold

Wait — $87K is above $50K. Why did this auto-confirm?

The materiality threshold was applied to the unsigned change in GL balance, not to the payment amount. The match was a routine recurring SaaS bill the team had been seeing every month for over a year. The team’s reasoning: “if it matched the PO and prior invoices, it’s not a ‘change’ in the materiality sense.”

The agent’s logic followed the team’s reasoning. The threshold was technically being respected on the right field; the field was the wrong field.
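
A small Python sketch shows how the same $50K threshold gives opposite answers depending on the field it reads. The field name gl_balance_change and the value of zero for a recurring bill are assumptions made for illustration; the pilot's real schema isn't shown here.

```python
from dataclasses import dataclass
from decimal import Decimal

MATERIALITY_THRESHOLD = Decimal("50000")  # the $50K threshold from the pilot


@dataclass(frozen=True)
class Match:
    invoice_amount: Decimal     # gross amount that will actually be paid
    gl_balance_change: Decimal  # hypothetical: deviation from the expected posting


def below_materiality_as_configured(m: Match) -> bool:
    # What the pilot effectively checked: the size of the *change*.
    return abs(m.gl_balance_change) < MATERIALITY_THRESHOLD


def below_materiality_on_payment(m: Match) -> bool:
    # What the team thought it was checking: the size of the *payment*.
    return m.invoice_amount < MATERIALITY_THRESHOLD


# A recurring bill that matches the PO and prior invoices: assumed to
# register as no change, even though $87,432 will leave the account.
recurring = Match(invoice_amount=Decimal("87432"), gl_balance_change=Decimal("0"))
```

Under these assumptions, below_materiality_as_configured(recurring) passes while below_materiality_on_payment(recurring) does not. Even the corrected field would not have been a fraud control, only a more honest productivity control.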

Hour 9.5 (Sunday, 2:48am)

The bank-account match check fires. The matcher compares the bank account on the invoice to the vendor’s stored bank account. They match — because the AP coordinator updated the stored bank account 9 hours earlier.

The agent has no way to know the stored bank account changed recently. It sees only the current state. The auto-confirm proceeds. An audit-event row lands: OK, match-id=m-2126-abc, amount=$87,432.

Hour 36 (Monday morning)

The controller arrives at work. Reviews the weekend’s audit-event log. Sees the $87K match auto-confirmed. Goes to verify the vendor record. Sees the bank details changed Saturday afternoon. Cross-references with the vendor’s most recent prior invoice (which has the old bank details). Realizes what happened.

The payment run wasn’t scheduled to submit until Monday at 4pm. She has six hours to pull it back. She does.

No money left the company. But the audit committee meeting is on Wednesday.

What the policy gate would have done

If closegate had been deployed, two checks would have fired before auto-confirm:

  1. Vendor bank-change event in always_human_accounts. Any state change touching a vendor banking record routes to human-in-the-loop (HITL) review regardless of materiality. The AP coordinator’s update at hour 0 would have hit the HITL queue; the controller would have reviewed it Monday morning, before any subsequent invoice could auto-confirm against the new account.

  2. First-payment-to-changed-bank check. Per the closegate policy library, the first payment to a changed bank account always routes to HITL even if the match looks clean otherwise. The agent’s recon pass at hour 9 would have produced a HITL escalation, not an auto-confirm. The controller would have seen it Monday morning.

Both checks are in policy.yaml. The runtime enforces them server-side. The LLM cannot override them.
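
The two checks can be sketched as a small routing function. This is a sketch under stated assumptions, not closegate's implementation: the POLICY dict stands in for policy.yaml, and the key first_payment_to_changed_bank_requires_human, the Event fields, and route() are all hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical in-memory mirror of the two policy.yaml rules described above.
POLICY = {
    "always_human_accounts": {"vendor-bank-change"},
    "first_payment_to_changed_bank_requires_human": True,
}


@dataclass(frozen=True)
class Event:
    event_type: str                  # "vendor-bank-change" or "invoice-match"
    vendor_id: str
    bank_changed: bool = False       # has the stored bank account been changed?
    paid_since_change: bool = False  # has a payment already gone to the new account?


def route(event: Event, policy: dict = POLICY) -> str:
    """Return "HITL" or "AUTO_CONFIRM". Runs server-side, outside the LLM."""
    # Check 1: sensitive event types always go to a human.
    if event.event_type in policy["always_human_accounts"]:
        return "HITL"
    # Check 2: the first payment to a changed bank account goes to a human.
    if (
        policy["first_payment_to_changed_bank_requires_human"]
        and event.event_type == "invoice-match"
        and event.bank_changed
        and not event.paid_since_change
    ):
        return "HITL"
    return "AUTO_CONFIRM"


# Hour 0: the AP coordinator's update to the vendor record.
hour_0 = route(Event("vendor-bank-change", "v-1"))
# Hour 9: the agent's recon pass against the changed account.
hour_9 = route(Event("invoice-match", "v-1", bank_changed=True))
```

In this sketch both hour_0 and hour_9 come back as "HITL". The matcher's opinion of the invoice never enters the function, which is the design choice that matters.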

The audit-committee narrative that landed

The controller’s memo on Sunday made one specific request: rebuild the controls layer such that the agent cannot auto-confirm any payment that touches a recently-changed banking record, regardless of materiality, regardless of prompt, regardless of what the matcher thinks.

The team rebuilt the controls layer the following month. They adopted closegate. Three concrete changes:

  1. always_human_accounts now includes the vendor-bank-change event type. Any update to a vendor’s banking record fires HITL.
  2. The matcher’s bank-account-check now references the policy.yaml bank_change_lockout_days: 14 field; matches against a bank account changed in the prior 14 days route to HITL.
  3. The audit log carries the bank-change-timestamp on every match event so the auditor can sample for this specific pattern.
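
Changes 2 and 3 can be sketched together: a lockout check driven by the 14-day value, and an audit event that carries the bank-change timestamp. The function names, the event's dict layout, and the dates are illustrative assumptions; only the 14-day figure and the match-id format come from the account above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BANK_CHANGE_LOCKOUT_DAYS = 14  # mirrors bank_change_lockout_days: 14


def within_lockout(
    bank_changed_at: Optional[datetime],
    matched_at: datetime,
    lockout_days: int = BANK_CHANGE_LOCKOUT_DAYS,
) -> bool:
    # No recorded change means no lockout applies.
    if bank_changed_at is None:
        return False
    return matched_at - bank_changed_at < timedelta(days=lockout_days)


def audit_event(
    match_id: str,
    amount: str,
    bank_changed_at: Optional[datetime],
    matched_at: datetime,
) -> dict:
    decision = "HITL" if within_lockout(bank_changed_at, matched_at) else "OK"
    return {
        "decision": decision,
        "match_id": match_id,
        "amount": amount,
        # Carried on every match event so an auditor can sample for the pattern.
        "bank_change_timestamp": (
            bank_changed_at.isoformat() if bank_changed_at else None
        ),
        "matched_at": matched_at.isoformat(),
    }


# Hypothetical timestamps: record changed at 17:00, recon pass at 02:14 next day.
changed = datetime(2024, 6, 1, 17, 0, tzinfo=timezone.utc)
matched = datetime(2024, 6, 2, 2, 14, tzinfo=timezone.utc)
event = audit_event("m-2126-abc", "87432.00", changed, matched)
```

With these inputs the event's decision is "HITL" instead of "OK", and the row itself shows why: the bank-change timestamp sits next to the match timestamp, nine hours apart.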

What the audit committee said

Three things. In order:

  • “We approve continued AI-pilot operation.” With the new controls.
  • “We want an external audit-firm walkthrough of the controls layer next quarter.” They’re already scheduled.
  • “This is exactly why we asked you to ensure the controls code is auditable.” The committee had asked for this from day one; the team had been planning a custom build.

What this means for your pilot

Three lessons:

  1. Materiality is not a fraud control. It’s a productivity control. Fraud vectors operate inside the materiality envelope (think small-dollar wire fraud, vendor identity spoofing, recurring-payment hijacks). Fraud needs sensitive-account routing and pattern-specific checks, not materiality alone.

  2. The audit-committee question lands the moment something goes wrong. “Where was the control?” If the answer is “the prompt was supposed to handle it” — the pilot stalls. If the answer is “clause X of policy.yaml; here’s the runtime check; here’s the audit-event row showing it fires” — the pilot continues.

  3. The 2:14am page is going to happen on every pilot. Not in this exact shape; maybe it’s intercompany clearing posted to the wrong entity, or a duplicate invoice from a vendor that renamed itself. The question isn’t whether the agent will get something wrong; the question is whether the controls layer catches it before it matters.