Agentic Fraud Detection: Designing Real-Time, Explainable Decisioning Pipelines
The Elementary post on fraud detection covered the customer-facing experience: a transaction happens, an agent evaluates it, you might get a quick verification ping. This post is about what has to be true architecturally for that experience to actually work — reliably, at scale, fast enough to matter, and in a way that can be explained to a regulator, an internal auditor, or an angry customer whose legitimate purchase just got declined.
Real-time fraud decisioning sits at a genuinely hard intersection of requirements: it needs to be extremely fast (often sub-second), highly accurate in both directions (catching real fraud, not blocking real customers), and fully explainable after the fact — three constraints that don’t always pull in the same direction.
The Latency Constraint Shapes Everything
Start with the hardest constraint, because it disqualifies certain architectural choices outright: a card-present transaction decision typically needs to complete in well under a second, often in the 100-300 millisecond range, to avoid a noticeably awkward pause at checkout. This rules out architectures that involve, say, a large language model generating a lengthy multi-step reasoning chain for every single transaction — that kind of reasoning is valuable, but too slow for the front-line, every-transaction decision.
The architectural pattern that has emerged to handle this is a tiered decisioning pipeline. A fast, lightweight scoring layer handles the overwhelming majority of transactions in milliseconds, and only the smaller subset of genuinely ambiguous or higher-risk transactions gets escalated to slower, more deliberate reasoning. Where appropriate, that escalation includes a full agentic investigation step that can take a few seconds, or can even be queued for asynchronous human review, because the transaction has already been provisionally held or flagged.
A Tiered Architecture, Layer by Layer
Tier 0: Real-time scoring (milliseconds). A lightweight, highly optimized model evaluates each transaction the instant it occurs, using behavioral features (the customer’s typical spending patterns, device fingerprints, merchant risk profiles) that are computed and cached ahead of time rather than calculated fresh for every transaction. This tier makes the immediate approve, decline, or step-up-verification decision for the vast majority of transactions, and it has to be fast enough that it doesn’t become the bottleneck in the payment authorization flow itself.
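As a rough illustration of what "computed and cached ahead of time" can look like, the sketch below builds a Tier 0 feature set using only dictionary lookups and simple arithmetic. Everything named here (`CachedProfile`, `PROFILE_CACHE`, `MERCHANT_RISK`, `tier0_features`, the IDs and values) is hypothetical, and the in-memory dictionaries stand in for whatever low-latency store a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedProfile:
    """Behavioral features computed ahead of time, outside the request path."""
    avg_amount: float          # customer's typical transaction amount
    known_devices: frozenset   # device fingerprints seen before


# Stand-ins for a low-latency key-value store (hypothetical data).
PROFILE_CACHE = {
    "cust-1": CachedProfile(avg_amount=40.0, known_devices=frozenset({"dev-a"})),
}
MERCHANT_RISK = {"m-grocery": 0.05, "m-giftcards": 0.6}
DEFAULT_MERCHANT_RISK = 0.3  # assumed fallback for unknown merchants


def tier0_features(customer_id, amount, device_id, merchant_id):
    """Assemble features from cache lookups only; no slow queries at decision time."""
    merchant_risk = MERCHANT_RISK.get(merchant_id, DEFAULT_MERCHANT_RISK)
    profile = PROFILE_CACHE.get(customer_id)
    if profile is None:
        # No cached history: signal that explicitly instead of guessing.
        return {"has_profile": False, "amount_ratio": None,
                "new_device": True, "merchant_risk": merchant_risk}
    ratio = amount / profile.avg_amount if profile.avg_amount > 0 else None
    return {"has_profile": True, "amount_ratio": ratio,
            "new_device": device_id not in profile.known_devices,
            "merchant_risk": merchant_risk}
```

The point of the shape is that the request path does no aggregation of its own: anything expensive has already happened before the transaction arrives.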
Tier 1: Real-time agentic triage (seconds). Transactions that Tier 0 flags as ambiguous (not clearly fine, not clearly fraudulent) get routed to a slower but still fast agentic process that can afford a few seconds: pulling additional context (recent account activity, similar patterns across other customers, device history), reasoning over it, and arriving at a more confident decision than the lightweight scorer alone could produce. This tier is where genuine agentic reasoning, with tool use and multi-step investigation, starts to show up.
Tier 2: Case investigation (minutes to hours). Transactions or patterns that Tier 1 can’t confidently resolve, or broader patterns suggesting a coordinated attack rather than an isolated incident, get escalated into a fuller case investigation. This often combines a more thorough agentic analysis with human fraud analyst review, since this tier has the time budget for genuinely deep investigation without holding up the original transaction any further than a provisional hold already has.
This tiering is the architectural answer to the tension between speed and depth: the system does not have to choose between “fast” and “thorough.” It is deliberately both, with each applied to the right transactions at the right tier.
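The hand-off from Tier 0 to Tier 1 can be sketched as a two-threshold router: scores that are clearly low or clearly high are resolved immediately, and the band in between is escalated. This is a minimal sketch under assumptions; the `Route` names, the `Tier0Thresholds` values, and the `route_tier0` function are illustrative, and real thresholds would come from the institution's risk appetite.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    APPROVE = "approve"            # Tier 0 resolves: clearly low risk
    DECLINE = "decline"            # Tier 0 resolves: clearly high risk
    TIER1_TRIAGE = "tier1_triage"  # ambiguous: provisionally hold and escalate


@dataclass(frozen=True)
class Tier0Thresholds:
    # Hypothetical values, chosen only for illustration.
    approve_below: float = 0.20
    decline_at_or_above: float = 0.90


def route_tier0(score: float, t: Tier0Thresholds = Tier0Thresholds()) -> Route:
    """Route a transaction from a Tier 0 risk score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < t.approve_below:
        return Route.APPROVE
    if score >= t.decline_at_or_above:
        return Route.DECLINE
    return Route.TIER1_TRIAGE
```

Widening the band between the two thresholds sends more traffic to the slower tier, so the band width is effectively a budget for how much deliberate reasoning the system can afford.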
Designing for Explainability From the Start
Fraud decisions carry serious consequences in both directions, and that means explainability can’t be an afterthought bolted onto a model that was originally built without it in mind.
Feature-level attribution. Every score the system produces should be traceable to the specific factors that drove it: this transaction was flagged because it was an unusually large amount, from a new device, in a location far from the customer’s typical pattern. An opaque model output with no such trace is not good enough. This is achievable with the right model design choices made early, and very difficult to retrofit onto certain model architectures after the fact, which is a strong argument for architects making it a first-class design requirement from day one rather than something to add later.
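One simple case where attribution falls out of the model design is an additive (logistic) scorer, where each feature's contribution to the log-odds is just its weight times its value. The sketch below assumes that kind of model; the feature names, `WEIGHTS`, and `BIAS` are invented for illustration, and non-additive models would need a different attribution method.

```python
import math

# Hypothetical weights for an additive (logistic) model; not from any real system.
WEIGHTS = {"amount_ratio": 0.5, "new_device": 1.5, "distance_km_hundreds": 0.4}
BIAS = -4.0


def score_with_attribution(features):
    """Return (fraud probability, per-feature log-odds contributions, largest first)."""
    contributions = {name: WEIGHTS[name] * float(features[name]) for name in WEIGHTS}
    log_odds = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked
```

Because the contributions sum (with the bias) to the exact log-odds behind the score, the explanation and the decision cannot drift apart.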
Decision logging that captures the full context, not just the outcome. A useful audit log doesn’t just record “transaction declined.” It records which signals contributed, what thresholds were crossed, which tier made the final call, and — for anything that reached an agentic reasoning tier — what information the agent gathered and how it reasoned toward its conclusion.
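A decision record along these lines might look like the following sketch. The `DecisionRecord` class and its field names are assumptions made for illustration; they mirror the items listed above (signals, thresholds, deciding tier, agent reasoning) rather than any particular logging schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class DecisionRecord:
    transaction_id: str
    outcome: str               # e.g. "approve", "decline", "step_up"
    deciding_tier: int         # 0, 1, or 2
    score: float
    signals: dict              # signal name -> contribution to the score
    thresholds_crossed: list   # names of the thresholds that fired
    # Only populated when an agentic tier was involved:
    # what the agent gathered and the steps it reasoned through.
    agent_trace: Optional[list] = None
    decided_at: str = field(default_factory=_utc_now)

    def to_json(self) -> str:
        """Serialize the full decision context as one JSON line."""
        return json.dumps(asdict(self), sort_keys=True)
```

A usage example with made-up values:

```python
record = DecisionRecord(
    transaction_id="txn-123",
    outcome="decline",
    deciding_tier=1,
    score=0.93,
    signals={"amount_ratio": 2.5, "new_device": 1.5},
    thresholds_crossed=["tier0_escalate", "tier1_decline"],
    agent_trace=["fetched device history", "no prior use of this device"],
)
print(record.to_json())
```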
Customer-facing explanations, not just internal ones. Regulations in a growing number of jurisdictions require that customers receive a meaningful explanation when an automated decision affects them adversely. Architecting for this means the system needs to be able to generate a clear, accurate, plain-language explanation of a decline — “this transaction was declined because it didn’t match your typical spending pattern” — derived from the same underlying signals that drove the decision, rather than a generic, unhelpful boilerplate message.
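To keep the customer-facing message tied to the actual decision, one option is to generate it from the same per-signal contributions that produced the score. The sketch below assumes such contributions are available as a dictionary; the signal names, the wording in `REASON_TEXT`, and the `explain_decline` function are all illustrative.

```python
# Hypothetical mapping from internal signal names to plain-language reasons.
REASON_TEXT = {
    "amount_ratio": "the amount was much larger than your usual purchases",
    "new_device": "it came from a device we have not seen you use before",
    "distance": "it was made far from where you usually shop",
}
FALLBACK_REASON = "it did not match your typical spending pattern"


def explain_decline(contributions, max_reasons=2):
    """Build a plain-language explanation from the top risk-increasing signals."""
    positive = [(name, value) for name, value in contributions.items()
                if value > 0 and name in REASON_TEXT]
    top = sorted(positive, key=lambda kv: kv[1], reverse=True)[:max_reasons]
    reasons = [REASON_TEXT[name] for name, _ in top] or [FALLBACK_REASON]
    return "This transaction was declined because " + " and ".join(reasons) + "."
```

Capping the number of reasons keeps the message readable, and the fallback only appears when no mapped signal contributed.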
Handling the Two Failure Modes Deliberately
A well-architected system treats false positives and false negatives as two distinct problems requiring distinct monitoring and tuning, not a single blended “accuracy” metric:
- Minimizing false positives (blocking legitimate transactions) usually involves richer context at decision time (knowing a customer is traveling because they recently used their card at an airport, for instance, rather than treating an unfamiliar location in isolation as automatically suspicious) and giving customers fast, low-friction ways to self-verify a flagged transaction rather than forcing a hard decline.
- Minimizing false negatives (missing actual fraud) usually involves continuously updated behavioral baselines and pattern detection that adapts as fraud tactics evolve, plus the case investigation tier’s ability to spot coordinated patterns across multiple transactions or customers that no single transaction’s score would reveal in isolation.
Critically, these two goals are often in tension — tightening thresholds to catch more fraud generally increases false positives, and loosening them to reduce customer friction generally lets more fraud through. The architecture needs explicit, deliberately chosen thresholds reflecting the institution’s actual risk appetite, reviewed and adjusted regularly as fraud patterns and customer expectations evolve, rather than a single “set it and forget it” configuration.
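The trade-off is easy to see by measuring both error rates separately at different thresholds over transactions whose outcome is already known. The following is a small sketch with made-up scores and labels; `rates_at_threshold` is a hypothetical helper, not part of any library.

```python
def rates_at_threshold(scored, threshold):
    """Return (false_positive_rate, false_negative_rate).

    scored: list of (score, is_fraud) pairs with confirmed outcomes.
    A transaction is flagged when its score is at or above the threshold.
    """
    false_pos = sum(1 for s, fraud in scored if s >= threshold and not fraud)
    false_neg = sum(1 for s, fraud in scored if s < threshold and fraud)
    legit = sum(1 for _, fraud in scored if not fraud)
    frauds = sum(1 for _, fraud in scored if fraud)
    fpr = false_pos / legit if legit else 0.0
    fnr = false_neg / frauds if frauds else 0.0
    return fpr, fnr


# Made-up data: four legitimate and four fraudulent transactions.
SAMPLE = [(0.05, False), (0.10, False), (0.30, False), (0.55, False),
          (0.40, True), (0.70, True), (0.85, True), (0.95, True)]

for threshold in (0.25, 0.5, 0.8):
    print(threshold, rates_at_threshold(SAMPLE, threshold))
# 0.25 -> (0.5, 0.0), 0.5 -> (0.25, 0.25), 0.8 -> (0.0, 0.5)
```

On this toy data, lowering the threshold removes missed fraud at the cost of blocking half the legitimate transactions, and raising it does the reverse, which is why the two rates need to be tracked separately.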
Data Infrastructure Requirements
This kind of system places real demands on the surrounding data infrastructure that are easy to underestimate during initial design:
- Low-latency access to behavioral baselines and recent transaction history, since Tier 0 simply doesn’t have the time budget to query a slow data store mid-transaction.
- A continuously updated feature pipeline that keeps behavioral baselines current without requiring a full retrain or reprocessing cycle every time, since customer behavior genuinely does shift over time (a customer who starts traveling frequently for a new job, for instance) and a stale baseline produces a steady stream of unnecessary false positives.
- A feedback loop from confirmed outcomes — both confirmed fraud and confirmed false positives, once known — back into the system, since fraud patterns evolve continuously and a system that doesn’t learn from its own confirmed misses will gradually become less accurate over time, not more.
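One way to keep a behavioral baseline current without reprocessing history is an exponentially weighted running mean and variance, updated one transaction at a time. The sketch below is an assumption about how such a baseline could be maintained, not a description of any specific system; `SpendBaseline` and the `alpha` value are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SpendBaseline:
    """Exponentially weighted mean and variance of a customer's spend."""
    alpha: float = 0.1  # hypothetical smoothing factor; higher adapts faster
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def update(self, amount: float) -> None:
        """Fold one new observation into the baseline in constant time."""
        if self.count == 0:
            self.mean = amount
        else:
            diff = amount - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1.0 - self.alpha) * (self.var + diff * incr)
        self.count += 1

    def z_score(self, amount: float) -> float:
        """How unusual an amount is relative to the current baseline."""
        if self.count < 2 or self.var == 0.0:
            return 0.0  # not enough history to judge
        return (amount - self.mean) / self.var ** 0.5
```

A reasonable design choice, tied to the feedback loop above, is to feed the baseline only with transactions that were not later confirmed as fraud, so that a fraudster's spending does not become part of the customer's "normal."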
A Note on Regulatory Exposure
Fraud decisioning sits squarely inside the kind of automated decision-making that is attracting increasing regulatory attention globally: explainability requirements, rights to human review of automated decisions, and fair-lending-adjacent concerns about whether fraud models inadvertently produce disparate outcomes across customer demographics. None of this is a reason to avoid building these systems. It is a strong argument for building the explainability and audit infrastructure described above as core architecture from the outset, rather than as a compliance retrofit under regulatory pressure later. That theme comes up repeatedly across this series and gets its own dedicated, deeper treatment in the Expert posts.
Coming Up Next
We’ve now covered two of banking’s highest-stakes real-time decisioning domains: KYC and fraud. The next post applies the same architectural thinking to a third: credit underwriting, where the decision happens more slowly than a fraud check, but the stakes — and the regulatory scrutiny around fairness — are, if anything, even higher.
