AI Agents in Banking Need Better Audit Trails. Here's an Azure Pattern Worth Considering.
Treasury's new FS AI RMF gives banks a practical lens for AI governance. Here's an executive view of the audit-trail questions to think through, and why Azure may be a pragmatic option.

On February 19, 2026, the US Treasury published the Financial Services AI Risk Management Framework. It's voluntary. It's non-binding. It contains 230 control objectives across a Risk and Control Matrix, a self-assessment questionnaire, and a Guidebook. In practical terms, it may be the most operationally detailed US framework yet for thinking about AI risk in financial services.
I do not hear enough bankers talking about it.
I've spent over two decades in financial services, and I'll tell you what I told my team when we started looking at AI agents for merchant onboarding and underwriting: I would not assume the audit trail an examiner, auditor, or internal risk team wants already exists in a turnkey product. In many cases, some portion of it will need to be designed deliberately, or at least validated much more rigorously than most teams expect.
The good news is that the components exist, the patterns are knowable, and Azure in particular has assembled a stack that can get you much of the way there. The harder part is that the most important piece of the architecture — the agent decision record — is still something each institution will need to define for itself.
Here is the operating pattern I think is worth considering.
The Gap Nobody Is Talking About
If you ask a mid-sized community bank today how they're logging AI agent activity, you'll typically hear one of three answers:
- "We're using specialized vendors for observability, we capture everything."
- "We have observability on our cloud provider."
- "We're not using AI agents in production yet."
The first two describe operational telemetry. The third is becoming less common. And neither of the first two, by itself, describes the kind of durable audit trail a regulated institution may eventually be asked to produce.
Here are the kinds of questions I would expect risk, audit, and examiner teams to gravitate toward, based on SR 11-7, SR 21-8, the FFIEC IT Examination Handbook, and the new Treasury FS AI RMF:
- Show me your model inventory. Pull three models. Show me the validation report and the last three monitoring cycle outputs.
- For this AI system, show me how you determined whether it qualifies as a "model" under SR 11-7.
- Show me the decision your agent made on this customer's underwriting on this date. Walk me through what the model was asked, what tools it called, what data it retrieved, what it returned, and who approved it.
- If your LLM provider updated the underlying model between January and March, how did you detect that, and what did you do about it?
These are not theoretical questions. Even without a formal published league table of AI exam findings, inventory gaps, weak documentation, and poor reproducibility are all easy places for a program to come under pressure.
General-purpose observability platforms may help with uptime and debugging, but they do not automatically answer those questions. LLM tracing tools help with workflow visibility, but they are not the same thing as a regulated recordkeeping approach. Cloud provider audit logs are useful, but they usually do not capture the full business context, tool activity, retrieval state, or human override story on their own.
What many banks may ultimately need is an agent decision record stored in a tamper-evident system, separated from operational telemetry, retained in line with legal and policy requirements, and organized so it can be reproduced during audit, validation, or examination.
That is the part I would not assume any vendor has fully solved for you.
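To make the idea concrete, here is one minimal sketch of what such a decision record could look like. This is an illustration, not a standard: every field name is an assumption, and a real schema would come out of your own risk and legal review. The one structural idea worth copying is the deterministic content hash, which is what lets you later show a stored record has not been altered.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentDecisionRecord:
    """Illustrative schema only; field names are assumptions, not a standard."""
    workflow_id: str        # which agent or workflow ran
    workflow_version: str   # what version was in production
    model_deployment: str   # e.g. a named model deployment identifier
    prompt_hash: str        # hash of the prompt template in force
    tool_calls: tuple       # tools invoked, with sensitive inputs tokenized
    retrieval_corpus: str   # which corpus version was consulted
    output_tokenized: str   # model output with PII tokenized out
    human_action: str       # approved / rejected / modified / none
    timestamp_utc: str      # when the decision occurred

def content_hash(record: AgentDecisionRecord) -> str:
    """Deterministic SHA-256 over a canonical JSON serialization, so the
    stored record can later be checked for completeness and tampering."""
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The exact fields will differ by institution; the canonical-serialization-plus-hash pattern is the portable part.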
What This Means for Banks
You do not need a new AI-specific rulebook to see where this is heading.
SR 11-7, SR 21-8, the FFIEC IT Examination Handbook, and now the Treasury FS AI RMF all point in the same direction: know what AI you are using, know how it is governed, know how it is monitored, and be able to explain a specific decision after the fact.
For executives, that translates into four practical questions:
- Do we have a complete inventory, including AI embedded in vendor products?
- Have we decided which use cases fall into model risk governance and documented why?
- Could we reconstruct a meaningful record of a past AI-assisted decision if risk, audit, or exam teams asked for it?
- Are we relying on generic observability tools where we really need governance evidence?
That is why Treasury's FS AI RMF matters. It is voluntary, but it gives banks a usable benchmark and a common language for these conversations.
The Architecture That Matters
The pattern I think is worth considering separates the system into three distinct planes: the live agent runtime, an operational telemetry layer for engineering, and a longer-lived decision record for governance and audit. In my experience, when banks collapse all of this into one observability backend, they often end up with a system that is easier to operate but harder to defend.
Three design choices matter more than the rest:
- For higher-risk decisions, consider making the decision-record write blocking. If the record cannot be written, there is a good argument the workflow should not proceed.
- Separate operational telemetry from governance-grade records. Engineers and auditors usually need different data, different retention, and different access controls.
- Minimize plaintext PII in long-lived logs. Tokenization or vault-based patterns can reduce privacy exposure and make retention decisions easier to manage.
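The first of those choices, the blocking write, can be sketched in a few lines. This assumes generic `record_sink` and `telemetry_sink` callables rather than any specific product API; the point is the fail-closed ordering, not the plumbing:

```python
class DecisionRecordWriteError(RuntimeError):
    """Raised when the governance record cannot be persisted."""

def execute_high_risk_step(decision_payload: dict, record_sink, telemetry_sink) -> dict:
    """Sketch of a blocking decision-record write: the governance record must
    persist before the workflow may proceed. The sinks are assumed callables,
    not a specific vendor API."""
    try:
        # Governance-grade, durable store: this write is on the critical path.
        record_sink(decision_payload)
    except Exception as exc:
        # Fail closed: for a high-risk decision, no record means no decision.
        raise DecisionRecordWriteError("decision record not persisted") from exc
    try:
        # Telemetry is best-effort; its failure should not halt the business flow.
        telemetry_sink(decision_payload)
    except Exception:
        pass
    return decision_payload
```

Note the asymmetry: the governance write can stop the workflow, the telemetry write cannot. That asymmetry is the design choice.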
Why Azure Specifically
Every major cloud provider has the building blocks for some version of this — managed model hosting, WORM object storage, and cloud-native audit logging. They all work if you're already committed to that ecosystem.
But for banks deciding where to land AI workloads in 2026, Azure is worth serious consideration. Not because every individual component is technically superior, but because the compliance posture and the legal/procurement surface can be easier to work through at many institutions that already have Microsoft deeply embedded.
Here is why Azure stands out to me in this context:
Governance and procurement
Microsoft Azure AI Foundry's ISO/IEC 42001:2023 certification is useful for vendor review conversations. More broadly, many banks already know how to diligence Microsoft as a strategic provider, which makes procurement easier than introducing a net-new observability vendor.
Tamper evidence and retention
Azure Immutable Blob Storage gives you practical WORM retention. Azure Confidential Ledger adds cryptographic receipts for the subset of decisions where stronger integrity proof matters. You may not need both for every workflow, but together they create a more defensible story.
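Independent of the Azure specifics, the integrity idea behind ledger-style receipts is worth understanding on its own. A minimal illustration, not the Confidential Ledger implementation, is a hash chain: each record commits to the digest of the one before it, so a later edit to any record breaks every subsequent digest.

```python
import hashlib

def chain_append(chain: list, entry: bytes) -> list:
    """Append an entry to a hash chain: each link's digest covers both the
    entry and the previous link's digest."""
    prev = chain[-1][1] if chain else b""
    digest = hashlib.sha256(prev + entry).digest()
    chain.append((entry, digest))
    return chain

def chain_verify(chain: list) -> bool:
    """Recompute every digest in order; True only if no entry was altered."""
    prev = b""
    for entry, digest in chain:
        if hashlib.sha256(prev + entry).digest() != digest:
            return False
        prev = digest
    return True
```

A managed service does this with stronger guarantees and signed receipts; the sketch only shows why tamper evidence is cheap to verify and expensive to fake.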
Data governance
Azure documents that prompts, completions, and related customer data stay within the Azure service boundary and follow Azure's privacy commitments, though the exact data-processing model still depends on deployment type and geography. For many banks, that is a more straightforward governance conversation than sending prompts to another SaaS platform.
The tradeoff
Azure's weakness is that its native AI monitoring is less polished than the best purpose-built LLM observability tools. For a regulated bank, I would still solve the audit-trail problem first and accept that the debugging experience may be less elegant.
What the Decision Record Should Capture
You do not need a giant technical spec to start. For higher-risk AI decisions, I would want the institution to be able to reconstruct at least this much:
- Which agent or workflow ran, when it ran, and what version was in production
- Which model deployment was used, plus any provider response ID or version metadata available
- What prompt or instruction set was in force, ideally via versioning or hashing
- What tools were called and what outside data materially shaped the result
- What documents or retrieval sources were consulted, and which corpus version was in scope
- What the model produced, with sensitive data tokenized or otherwise protected
- Whether a human reviewed, approved, rejected, or modified the result
- Whether the final record can be shown to be complete and free of tampering
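One of those points, protecting sensitive data in the record, deserves a concrete sketch. This is a toy in-memory illustration of vault-based tokenization, assuming a `TokenVault` class I am inventing for the example; a real deployment would use a managed tokenization or vault service with its own access controls:

```python
import secrets

class TokenVault:
    """Minimal sketch of vault-based tokenization: plaintext PII lives only
    inside the vault, and logs or decision records carry opaque tokens.
    An in-memory dict stands in for what would be a managed, audited store."""

    def __init__(self):
        self._store = {}

    def tokenize(self, value: str) -> str:
        """Replace a sensitive value with a random, non-derivable token."""
        token = "tok_" + secrets.token_hex(8)
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Recover the original value; in production this path should be
        tightly access-controlled and itself logged."""
        return self._store[token]
```

Because the token is random rather than derived from the value, the long-lived record carries no recoverable PII, which simplifies both privacy exposure and retention decisions.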
That is the core idea. The exact schema, tiering logic, retention policy, and failure-mode policy are still yours to define. No vendor can do that institutional thinking for you.
A 90-Day Playbook for Banks
If you're running a community or regional bank looking at AI agents and you haven't started on this yet, here's a practical sequence to consider:
Days 1-30: Assess
- Read the Treasury FS AI RMF Risk and Control Matrix. Map your existing AI controls to its 230 objectives. The gaps are your roadmap.
- Inventory every AI system in production, including vendor AI features embedded in purchased SaaS products. This is one of the simplest places for governance blind spots to form.
- Document your determination on whether each AI system qualifies as a "model" under SR 11-7. Get a defensible written answer on file before an examiner asks.
Days 31-60: Architect
- Pick your platform strategy. If you're already on Azure, this article may be a useful starting point. If you're on another major provider, the same core design questions still apply.
- Decide where the long-lived decision record will live, who can access it, and how long it needs to be retained.
- Decide how sensitive customer data will be protected in prompts, logs, and downstream review workflows.
- Decide which decisions are important enough to justify stronger integrity controls and fuller documentation.
Days 61-90: Build
- Implement the decision record in the workflow itself. For higher-risk decisions, consider requiring the record to be written before the process can continue.
- Create a simple tiering model so not every use case is governed the same way.
- Write the policy for failure handling, retention, review, and human override. Get governance approval.
- Build the crosswalk from your controls to SR 11-7, the FS AI RMF, and NIST AI RMF.
- Document your institution's position on the gray areas before someone asks under pressure.
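The tiering step above can be sketched as a small lookup from risk tier to control requirements. The tiers, thresholds, and controls here are assumptions for illustration, not prescriptions from the FS AI RMF; each institution sets its own in policy:

```python
from enum import Enum

class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

# Illustrative control mapping; values are placeholders an institution
# would set for itself, not regulatory requirements.
TIER_CONTROLS = {
    RiskTier.LOW:    {"blocking_record": False, "human_review": False, "retention_years": 3},
    RiskTier.MEDIUM: {"blocking_record": True,  "human_review": False, "retention_years": 7},
    RiskTier.HIGH:   {"blocking_record": True,  "human_review": True,  "retention_years": 10},
}

def classify(customer_impact: bool, credit_decision: bool) -> RiskTier:
    """Toy classifier; the real tiering criteria belong in governance policy."""
    if credit_decision:
        return RiskTier.HIGH
    if customer_impact:
        return RiskTier.MEDIUM
    return RiskTier.LOW
```

The value of even a toy version is that tiering decisions become explicit and reviewable rather than implicit in each team's build.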
This is a quarter of disciplined work. For many institutions, it becomes a foundation they can reuse across future AI deployments rather than reinventing controls one use case at a time.
The Window Is Open
Treasury has now given the industry a practical framework it did not have before. The interagency position remains that existing guidance generally applies. My expectation is that AI-specific questions in bank governance, audit, and examination contexts will become more structured from here, even though the FS AI RMF itself is technically voluntary.
Banks that build the decision record, the data-handling controls, the tiering model, and the SR 11-7 / FS AI RMF crosswalk early will likely have a cleaner answer when those questions come.
This is less about a single prescribed architecture than about operating discipline. The components exist. Azure has assembled them in a way that I think is practical for many banks. The harder part is usually not technology. It is deciding what your institution wants to be able to prove later.
Instead of starting with the observability vendor list, start with the audit-trail requirements.
That is the foundation everything else stands on.
Corey Young is EVP of Fintech Banking at Commercial Bank of California and former CEO of Agile Financial Systems. He has over two decades of experience in financial services, payment processing, and fintech.
Further Reading
Regulatory
- SR 11-7 — Federal Reserve Supervisory Letter on Model Risk Management
- FDIC FIL-22-2017 — Adoption of SR 11-7
- SR 21-8 — Model Risk Management for BSA/AML
- Treasury FS AI Risk Management Framework (February 2026)
- Treasury AI in Financial Services Report (December 2024)
- FFIEC IT Examination Handbook DAM Booklet (September 2024)
- GAO-25-107197: AI Use and Oversight in Financial Services
- ABA Statement for the Record on AI Innovation (December 2025)
- NIST AI Risk Management Framework Crosswalks
Industry Frameworks
- FINOS AI Governance Framework v2.0
- JPMorgan Chase: AI and Model Risk Governance
- BPI: Navigating Artificial Intelligence in Banking (April 2024)