🏦
Financial Services Post
This post covers architecture patterns specifically for banking and financial services environments — including regulated cloud, audit requirements, and compliance constraints.

Model Risk Management Meets Agentic AI: Extending Three-Lines-of-Defence to Autonomous Agents

Most large financial institutions have a well-established model risk management (MRM) framework: a governance structure that defines how models used in consequential decisions (credit scoring, pricing, trading, fraud detection) are validated, monitored, and retired. The framework is typically anchored to guidance such as the US Federal Reserve’s SR 11-7, the OCC’s parallel guidance (OCC Bulletin 2011-12), or equivalent frameworks from other regulators, and is operationalised through the three-lines-of-defence structure: model developers and business owners as the first line, a dedicated model risk and validation team as the second, and internal audit as the third.

That framework was built for a world of well-behaved, relatively static statistical models — models with defined inputs, defined outputs, stable production behaviour, and a clear, reviewable mathematical specification. Agentic AI systems are not that kind of model. They reason dynamically, use tools, produce varied outputs depending on runtime context, and can be updated (through prompt changes, retrieved knowledge updates, or tool modifications) in ways that don’t always trigger a conventional model change event. Extending MRM governance to cover them is non-trivial, genuinely important, and the subject of active guidance development from regulatory bodies in multiple jurisdictions. This post is a practitioner’s guide to how that extension needs to be designed.

Why Existing MRM Frameworks Struggle With Agentic Systems

A conventional model risk management lifecycle has a reasonably clear structure: a model is developed against a defined specification, validated by an independent team against a test dataset before production deployment, monitored through a defined set of performance metrics once live, and subject to a formal review trigger when it’s substantially changed or when performance metrics indicate something has shifted. At every stage, the “model” in question is a relatively well-defined, self-contained artefact — the parameters and architecture of a statistical model, or the logic of a rules-based decisioning system.

Agentic systems resist this structure at multiple points. The “model” is not a single, clearly bounded artefact. It is a composition of an underlying language model, prompt instructions, a toolset, a retrieval pipeline, an orchestration framework, and potentially a memory system, any of which can be modified independently and any of which can meaningfully change the system’s production behaviour. Validation against a static test dataset is necessary but insufficient: an agent’s behaviour on a set of scripted test cases may not predict its behaviour on the genuinely novel, unpredictable inputs that define real-world deployment. Conventional statistical performance metrics don’t easily capture the dimensions of agent behaviour that matter most: did it take the right sequence of actions, did it use its tools appropriately, did it correctly recognise the cases it should have escalated?

None of these challenges are reasons to exempt agentic systems from MRM governance. They are reasons to extend that governance thoughtfully rather than applying it unchanged and pretending the gaps don’t exist.

Redefining “The Model” for MRM Purposes in an Agentic Context

The most foundational extension required is a broader, more precise definition of what constitutes the governed “model” in an agentic system. A workable definition treats the entire agentic system as collectively constituting the governed model artefact: the underlying language model version, the system prompt and any prompt templates, the toolset and its interfaces, the RAG pipeline and its knowledge sources, the orchestration logic, and any memory or state management. Under this definition, a substantive change to any of these components triggers a proportionate review process, not just a change to the underlying language model weights.
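One way to make this definition concrete is to record the whole composition as a single versioned object and derive a fingerprint from it, so that a change to any component yields a new governed artefact. The sketch below is illustrative only: the class name, field names, and version strings are assumptions for this example, not a standard or any vendor’s schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentSystemVersion:
    """Hypothetical record of every component of the governed 'model'."""
    llm_version: str            # underlying language model identifier
    prompt_hash: str            # digest of system prompt and prompt templates
    toolset_version: str        # tool definitions and their interfaces
    retrieval_version: str      # RAG pipeline and knowledge source snapshot
    orchestration_version: str  # orchestration logic
    memory_version: str         # memory or state management

    def fingerprint(self) -> str:
        """Stable digest of the full composition; differs if any field differs."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(old: AgentSystemVersion, new: AgentSystemVersion) -> list[str]:
    """Names of the components that differ between two compositions."""
    old_fields, new_fields = asdict(old), asdict(new)
    return [name for name in old_fields if old_fields[name] != new_fields[name]]
```

Comparing the fingerprints of the deployed and candidate compositions gives a simple, auditable signal that something governed has changed, and `changed_components` says which part, which is the input a proportionate review process needs.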

This has a specific, practical implication that catches many teams by surprise: a prompt change — modifying the system instructions given to an agent — can meaningfully change that agent’s behaviour in consequential decisions, and should therefore be subject to a review and approval process under MRM governance, not treated as a lightweight configuration change that can be deployed without review. Institutions that have tried to exempt prompt changes from MRM governance on the grounds that “it’s just text, not a real model change” have generally discovered, through uncomfortable experience, that the exemption was wrong.

Extending Each Line of Defence

First line — Model development and business ownership. In an agentic context, the first line needs to maintain documentation not just of the underlying model but of the full system composition described above, with version control and change logs that would allow an examiner to reconstruct exactly what combination of components was in production at any given point in time. The first line also owns the definition of what the agent is intended to do and not do — its “model use case” in MRM terminology — which for an agentic system needs to be specific enough to be testable, not just a high-level description of intended purpose.
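The point-in-time reconstruction requirement can be sketched as an append-only deployment log keyed by effective timestamp. This is a minimal illustration under assumed names (`DeploymentLog`, `composition_id`); a real implementation would sit on durable, tamper-evident storage rather than in memory.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional


class DeploymentLog:
    """Hypothetical append-only log of which composition went live, and when."""

    def __init__(self) -> None:
        self._times: list[datetime] = []
        self._ids: list[str] = []

    def record(self, effective_from: datetime, composition_id: str) -> None:
        """Append an entry; entries must arrive in strictly increasing time order."""
        if self._times and effective_from <= self._times[-1]:
            raise ValueError("entries must be appended in increasing time order")
        self._times.append(effective_from)
        self._ids.append(composition_id)

    def in_production_at(self, when: datetime) -> Optional[str]:
        """Composition live at `when`, or None if nothing had been deployed yet."""
        idx = bisect_right(self._times, when)
        return self._ids[idx - 1] if idx else None
```

If `composition_id` is a digest of the full system composition, an examiner’s question of the form “what was running on this date?” resolves to one lookup plus the stored component versions behind that digest.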

Second line — Independent model risk and validation. The validation function needs to develop new methodologies specific to agentic systems, alongside (not replacing) conventional quantitative validation approaches. These include: behavioural validation through adversarial and edge-case testing designed to probe failure modes specific to agent reasoning (injection susceptibility, reasoning loop failure, incorrect tool selection under ambiguous conditions); consistency validation confirming the agent behaves equivalently on semantically similar inputs across multiple runs, since language model stochasticity means a given input might not produce identical outputs on every call; and scope validation confirming the agent correctly recognises and appropriately escalates cases outside its intended operating domain rather than confidently guessing through them.
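Consistency validation, in particular, lends itself to a simple harness: run the agent several times over a set of paraphrases of the same case and measure how often the decisions agree. The sketch below assumes the agent can be wrapped as a function from input text to a decision label; the function name and the choice of “share of runs matching the most common decision” as the metric are illustrative assumptions, not an established standard.

```python
from collections import Counter
from typing import Callable, Iterable


def consistency_rate(
    agent: Callable[[str], str],
    paraphrases: Iterable[str],
    runs_per_input: int = 5,
) -> float:
    """Fraction of all runs whose decision matches the most common decision.

    `paraphrases` should be semantically equivalent phrasings of one case,
    so any disagreement reflects instability rather than a real difference.
    """
    decisions = [agent(text) for text in paraphrases for _ in range(runs_per_input)]
    if not decisions:
        raise ValueError("no inputs supplied")
    _, modal_count = Counter(decisions).most_common(1)[0]
    return modal_count / len(decisions)
```

A validation team would set a minimum acceptable rate per use case, with a stricter threshold where the decision is consequential, and investigate any paraphrase set that falls below it.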

The second line also needs an explicit methodology for validating changes to individual components of the agentic system composition — particularly prompt changes and knowledge base updates — in proportion to the likely effect of that change on consequential decisions, rather than either ignoring these changes or treating every minor update as requiring a full validation cycle regardless of scope.
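A proportionate methodology can be expressed as an explicit policy table rather than left to case-by-case judgment. The sketch below is one possible shape: the tier names, the baseline tier assigned to each component, and the rule that consequential changes move up one tier are all assumptions for illustration, and each institution would set its own.

```python
from enum import Enum


class ReviewTier(Enum):
    LIGHT = 1     # peer review and logged sign-off
    TARGETED = 2  # second-line review plus regression tests on affected behaviour
    FULL = 3      # full revalidation cycle


# Illustrative baseline tier per component type (an assumption, not a standard).
BASE_TIER = {
    "llm_version": ReviewTier.FULL,
    "prompt": ReviewTier.TARGETED,
    "toolset": ReviewTier.TARGETED,
    "orchestration": ReviewTier.TARGETED,
    "knowledge_base": ReviewTier.LIGHT,
    "memory": ReviewTier.LIGHT,
}


def required_review(component: str, affects_consequential_decisions: bool) -> ReviewTier:
    """Map a change to a review tier; unknown components fail safe to FULL."""
    base = BASE_TIER.get(component, ReviewTier.FULL)
    if not affects_consequential_decisions:
        return base
    # Changes likely to affect consequential decisions move up one tier, capped at FULL.
    return ReviewTier(min(base.value + 1, ReviewTier.FULL.value))
```

Writing the policy down this way makes it reviewable by audit, and defaulting unrecognised component types to the strictest tier avoids silently under-reviewing something the table never anticipated.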

Third line — Internal audit. Audit’s role in this context extends beyond confirming that the validation process was followed, to specifically examining whether the first and second line processes have been genuinely adapted to the agentic context — whether validation methodologies are appropriate for the actual risk profile of the system, whether change management processes are actually capturing the full range of substantive system changes, and whether ongoing monitoring is providing the visibility into production behaviour that the governance framework requires. An audit function that applies conventional model audit approaches unchanged to agentic systems will likely miss the most significant governance gaps.

Ongoing Monitoring: What to Measure and How

The monitoring challenge for agentic systems is that the performance dimensions that matter most are not always easily reducible to the quantitative metrics conventional MRM monitoring tracks. A few approaches that have emerged as practically workable:

Task completion quality sampling. A regularly drawn, statistically designed sample of completed agentic tasks, reviewed by qualified human reviewers against a defined rubric, provides a direct read on whether the agent is actually doing its job well. This is more directly informative than indirect proxy metrics, at the cost of requiring genuine human review time and expertise. The cadence and sample size should be proportionate to the volume and risk level of the decisions involved.
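As a starting point for sizing the review sample, the standard sample-size formula for estimating a proportion, with a finite population correction, gives a defensible number. The sketch below uses that textbook formula; the function name and default parameters are assumptions, and the conservative default of 0.5 for the expected defect rate maximises the required sample.

```python
import math
from statistics import NormalDist


def review_sample_size(
    population: int,
    margin: float = 0.05,
    confidence: float = 0.95,
    expected_defect_rate: float = 0.5,
) -> int:
    """Tasks to review so the estimated defect rate is within +/- margin.

    Uses n0 = z^2 * p * (1 - p) / margin^2, then corrects for the finite
    number of tasks completed in the review period.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_defect_rate
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))
```

Higher-risk decision types would justify a tighter margin or higher confidence level, which increases the sample, while low-volume periods are capped at reviewing every task.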

Escalation pattern monitoring. Tracking what proportion of tasks are escalated to human review, and how that proportion changes over time, is a useful early indicator of distributional shift. If an agent that previously handled 80% of cases autonomously (escalating 20%) starts escalating 50%, something about the incoming cases or the agent’s behaviour has changed, and the cause is worth investigating before the metric drifts further.
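To separate a genuine shift in escalation rate from ordinary week-to-week noise, a two-proportion z-test comparing a baseline window with a recent window is a reasonable first check. The sketch below is a minimal version under assumed names; the windows, and whatever significance threshold triggers an investigation, are choices the monitoring team would need to make.

```python
import math
from statistics import NormalDist


def escalation_shift_p_value(
    base_escalated: int,
    base_total: int,
    recent_escalated: int,
    recent_total: int,
) -> float:
    """Two-sided p-value for a change in escalation rate between two windows."""
    if base_total <= 0 or recent_total <= 0:
        raise ValueError("both windows must contain at least one task")
    p_base = base_escalated / base_total
    p_recent = recent_escalated / recent_total
    pooled = (base_escalated + recent_escalated) / (base_total + recent_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / recent_total))
    if std_err == 0:
        return 1.0  # both rates are identical at 0% or 100%
    z = (p_recent - p_base) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

At high task volumes even small shifts become statistically significant, so in practice this check is best paired with a minimum absolute change in rate before anyone is paged.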

Tool use anomaly detection. Monitoring which tools the agent uses, and in what sequences, and comparing that against established baseline patterns surfaces significant deviations for review. These are flagged not automatically as errors, but as indicators of potentially changed behaviour worth examining.
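A simple baseline for this is to count tool-to-tool transitions across historical traces and flag any transition in a new trace that was rare or unseen. The sketch below assumes each trace is an ordered list of tool names; the function names, tool names in any usage, and the 1% rarity threshold are illustrative assumptions, and production systems may well need richer sequence models.

```python
from collections import Counter
from typing import Iterable, Sequence


def transition_counts(traces: Iterable[Sequence[str]]) -> Counter:
    """Count consecutive (tool, next_tool) pairs across baseline traces."""
    counts: Counter = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def unusual_transitions(
    baseline: Counter,
    trace: Sequence[str],
    min_share: float = 0.01,
) -> list[tuple[str, str]]:
    """Transitions in `trace` whose share of baseline transitions is below min_share."""
    total = sum(baseline.values())
    flagged: list[tuple[str, str]] = []
    for pair in zip(trace, trace[1:]):
        share = baseline[pair] / total if total else 0.0
        if share < min_share and pair not in flagged:
            flagged.append(pair)
    return flagged
```

Flagged transitions go to a review queue rather than blocking the task, which matches the intent that deviations are prompts for examination, not automatic errors.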

Human override tracking. Where human reviewers can override agent recommendations, tracking the rate and direction of overrides over time provides a continuous, organic signal on whether the agent’s recommendations are remaining appropriately aligned with human judgment on similar cases.
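Tracking both rate and direction means recording not just whether a reviewer disagreed, but what they changed the decision from and to. A minimal sketch, assuming each reviewed case is stored as a pair of decision labels (the `Review` type and label values are hypothetical):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Review:
    """Hypothetical record of one human-reviewed agent recommendation."""
    agent_decision: str
    human_decision: str


def override_summary(reviews: Sequence[Review]) -> tuple[float, Counter]:
    """Return (override rate, counts keyed by (agent_decision, human_decision))."""
    directions = Counter(
        (r.agent_decision, r.human_decision)
        for r in reviews
        if r.agent_decision != r.human_decision
    )
    rate = sum(directions.values()) / len(reviews) if reviews else 0.0
    return rate, directions
```

A stable override rate with a shifting direction (for example, reviewers increasingly turning approvals into declines) is a different signal from a rising rate overall, which is why the two are worth reporting separately.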

The Governance Gap That Most Needs Attention Right Now

Having reviewed MRM frameworks across financial institutions at various stages of agentic AI adoption, I find that the single most common and consequential gap is the treatment of the agentic system’s knowledge base and retrieval pipeline as outside MRM scope. The RAG pipeline is, from an MRM perspective, a form of dynamic input to the decisioning system, and changes to the knowledge base (new documents added, outdated ones removed, the retrieval logic adjusted) can change decisioning outcomes in ways that are every bit as consequential as a parameter update to a conventional model. Yet in most current implementations, knowledge base management sits outside the MRM lifecycle entirely, subject to content governance processes designed for document management rather than for governing changes to a decisioning system’s effective inputs.

Closing this gap — bringing knowledge base change management explicitly within MRM scope, with proportionate review processes for material changes — is probably the single highest-priority extension most institutions need to make to their existing MRM frameworks when extending them to agentic systems.
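A practical first step is to make knowledge base changes visible to the MRM process at all. The sketch below hashes each document into a manifest and diffs two manifests, producing the added, removed, and modified lists that a proportionate review could then be keyed to. The function names and the in-memory dictionary of documents are assumptions for illustration; it does not cover changes to retrieval logic, which would need to be versioned separately.

```python
import hashlib


def manifest(documents: dict[str, str]) -> dict[str, str]:
    """Map each document id to a SHA-256 digest of its content."""
    return {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in documents.items()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Document ids added, removed, or modified between two knowledge base snapshots."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(doc_id for doc_id in shared if old[doc_id] != new[doc_id]),
    }
```

Storing the manifest alongside each deployed system version also means the knowledge base state at any point in time can be reconstructed for an examiner, in the same way as prompts and tool definitions.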

Coming Up Next

This is the penultimate post in the Expert series. The final post addresses the practical question that ties the whole series together: how do you actually know whether a production agentic system is working correctly, at scale, over time — building the evaluation and observability infrastructure that makes everything discussed across this series inspectable and improvable in practice.

Ashish Pande
Solutions Architect · Agentic AI Specialist · AWS | GCP | Azure

20+ years delivering complex solutions in financial services. Currently building enterprise-grade Agentic AI on AWS, leading a team of 24 engineers.
