Resources & frameworks · March 2025

OpRisk fundamentals

RCSA, scenario analysis, three lines of defense, and the regulatory landscape. Lessons from the front lines of operational risk management.

A trader books a position in the wrong legal entity. A settlement instruction goes out with a stale SSI. A phishing email compromises a payments system. A third-party vendor pushes a bad update and half the bank's trading platforms go down for six hours.

These are operational risk events. They happen every day in CIB, and they range from minor annoyances that get fixed over coffee to billion-dollar losses that end careers and close banks. Operational risk is the risk of loss from inadequate or failed processes, people, systems, or external events — and in an industry that runs on complex, interconnected systems processing millions of transactions daily, it is ever-present.

Who This Is For

Two audiences. First: data scientists, engineers, PMs, and BAs building tools for operational risk — dashboards, automation, analytics platforms. You need to understand how OpRisk actually works before you can build anything useful for it. Second: practitioners running OpRisk processes — risk managers, control owners, compliance officers — who want a structured reference for how the pieces fit together.

What You'll Learn

  • How operational risk is defined, categorized, and governed under Basel
  • The Three Lines model and how accountability works in practice
  • The core OpRisk processes: RCSA, incident management, control testing, KRIs, reporting
  • How these processes connect into a single framework (they're not isolated)
  • Where the automation opportunities are — and they're everywhere
  • How AI and data analytics are changing the function

How to Navigate

This guide covers operational risk in six sections:

  1. What is Operational Risk? — Definition, Basel framework, Three Lines model, OpRisk in CIB
  2. Process & Taxonomy — How banks organize and categorize operational risks
  3. RCSA — Risk and Control Self-Assessment: the cornerstone process
  4. Incidents & Controls — Loss events, near misses, and how controls are tested
  5. KRIs & Reporting — Metrics, thresholds, committees, and what gets escalated
  6. Data, Analytics & AI — Where the function is headed and what to build

Each section includes Automation Opportunity callouts — concrete use cases where small teams deliver outsized impact. If you've read the CIB Fundamentals module, you'll recognize the format.

A Note on Sources

All content draws from public sources: Basel Committee publications (BCBS 195/d515, Basel III), EBA guidelines, IIA standards, ORX industry data, bank annual reports and Pillar 3 disclosures, and industry research from McKinsey, Deloitte, and PwC. No proprietary information is included.

Operational risk is the risk of loss from inadequate or failed internal processes, people, systems, or external events. That is the Basel Committee's definition, unchanged since 2001 and carried forward through every revision of the framework. It includes legal risk. It excludes strategic risk and reputational risk — though in practice, a major operational failure almost always damages reputation and can force strategic changes.

What makes operational risk different from market or credit risk: you can't diversify it away. A bank can reduce credit concentration by spreading loans across sectors. It can hedge market risk with derivatives. But operational risk is embedded in everything the bank does. Every trade booked, every payment processed, every system deployed, every person hired creates operational risk exposure. In CIB, where systems process millions of transactions daily across dozens of legal entities and jurisdictions, the surface area is enormous.

The Seven Basel Event Types

Basel II established seven categories for classifying operational risk losses. These remain the industry standard — every loss event gets mapped to one of them.

# | Event Type | What It Covers | CIB Examples
1 | Internal Fraud | Unauthorized activity, theft, or intentional misreporting by employees | Rogue trading, deliberate mismarking of positions, insider trading
2 | External Fraud | Theft, forgery, or attack by third parties | Cyber theft on trading systems, phishing targeting deal teams, forged trade finance documents
3 | Employment Practices & Workplace Safety | Labour law violations, discrimination, health & safety | Discrimination lawsuits, compensation disputes, harassment claims
4 | Clients, Products & Business Practices | Suitability failures, mis-selling, market manipulation, AML violations | Mis-selling derivatives to corporates, sanctions breaches, benchmark manipulation
5 | Damage to Physical Assets | Natural disasters, terrorism, vandalism | Data centre damage, trading floor disruption from natural disaster
6 | Business Disruption & System Failures | IT outages, infrastructure failures | Trading platform going down, settlement system failure, exchange connectivity loss
7 | Execution, Delivery & Process Management | Data entry errors, accounting mistakes, failed reporting, missed deadlines | Settlement fails, booking errors, incorrect client reporting, collateral management failures

Category 7 — Execution, Delivery & Process Management — generates the highest volume of losses in CIB. This is the daily operational grind: trades booked wrong, settlements that fail, reports submitted late or with errors. ORX data from 2023 shows transaction-related losses reaching nearly EUR 8 billion across member banks, making it the costliest category.

Category 4 — Clients, Products & Business Practices — generates the largest individual losses. This is where conduct fines, mis-selling penalties, and AML violations sit. TD Bank's $3.09 billion AML fine in 2024 — the largest ever imposed on a US bank — falls here.

The EBA updated the taxonomy in 2024, keeping the seven Level 1 categories but adding 26 new Level 2 subcategories. Notable additions: a standalone data privacy breach category, a financial crime category, and a cyber risk flag. The ORX Reference Taxonomy goes further, expanding to 14 Level 1 types to cover conduct risk, cyber risk, and third-party risk as standalone categories.

How Losses Are Measured

Loss data collection follows specific rules that matter for anyone building loss event systems:

  • Gross loss = total loss before any recoveries (insurance payouts, tax offsets)
  • Net loss = loss after recoveries
  • Opportunity costs and forgone revenue are excluded
  • Every loss must be classified into a Basel event type and mapped to a business line
  • Banks must maintain 10 years of loss history for capital calculation under the new Standardised Measurement Approach (SMA)
  • Losses above EUR 10 million and EUR 100 million receive heavier weighting in capital models — the framework explicitly accounts for fat tails

The boundary between "near miss" and "loss event" matters. A trade booking error caught before settlement is a near miss. The same error caught after settlement, requiring a correction trade at market prices, is a loss event. Most banks track both, but the rigour applied to each varies.
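
Loss data rules like these translate naturally into a data model. Below is a minimal sketch of a loss event record in Python; the field names and example values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LossEvent:
    event_id: str
    description: str
    event_date: date
    business_line: str        # one of the eight Basel business lines
    basel_event_type: str     # one of the seven Level 1 event types
    gross_loss: float         # total loss before recoveries
    recoveries: float = 0.0   # insurance payouts, tax offsets, etc.
    near_miss: bool = False   # caught before any financial loss materialised

    @property
    def net_loss(self) -> float:
        # Net loss = gross loss minus recoveries
        return self.gross_loss - self.recoveries

# Example: a booking error caught only after settlement (a loss event, not a near miss)
event = LossEvent(
    event_id="EV-2024-0183",
    description="Trade booked with wrong notional, corrected post-settlement",
    event_date=date(2024, 6, 3),
    business_line="Trading & Sales",
    basel_event_type="Execution, Delivery & Process Management",
    gross_loss=250_000.0,
    recoveries=40_000.0,
)
print(event.net_loss)  # 210000.0
```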

The Three Lines Model

The Three Lines model is the governance backbone of operational risk in banking. Originally published by the Institute of Internal Auditors in 2013 as the "Three Lines of Defence," it was updated in 2020 — dropping "Defence" to shift the framing from protection to value creation.

Line | Role | Who | Day-to-Day Reality
1st Line | Owns and manages risk | Business and operations — trading desks, operations teams, technology | Identifies risks in their processes, designs and operates controls, reports incidents. They live with the risk daily.
2nd Line | Oversees, challenges, sets policy | Risk management, compliance, information security | Writes the policies, designs the framework, monitors the 1st line, runs the RCSA campaigns, challenges whether controls actually work.
3rd Line | Independent assurance | Internal audit | Tests whether the 1st and 2nd lines are doing what they say. Reports to the Board Audit Committee, not to management.

The Board sits above all three lines, accountable for risk appetite and oversight. In practice, the CRO reports to both the CEO and the Board Risk Committee.

How It Works in Practice

The model sounds clean on paper. Reality is messier. The 1st line is busy doing the work — running the operations floor, managing the trading desk — and often treats risk management as overhead. Getting front-office teams to complete RCSA questionnaires or report near misses is a perennial struggle. The 2nd line has the framework expertise but depends on the 1st line for accurate information. If the 1st line under-reports incidents or rates their own controls as "effective" without evidence, the 2nd line's risk picture is wrong.

The Financial Stability Institute argued in 2015 that the three-line model may not be enough for complex banks, proposing a fourth line: external auditors and regulators. Credit Suisse's Archegos disaster in 2021 proved the point — the risk framework was well-designed but nobody acted on the warnings. Risk limits were breached repeatedly and the response was to raise the limits. The 2nd line identified the problem months before the loss; the governance structure failed to force action.

A common tension: when the 2nd line gets too involved in daily operations, it loses independence. When it stays too removed, it loses relevance. Banks that get this balance right tend to have 2nd-line staff who have done 1st-line roles — they can challenge credibly because they understand the business.

Automation opportunity

The 2nd line spends enormous time collecting information from the 1st line — chasing RCSA completions, gathering incident data, compiling KRI reports. Most of this happens over email and spreadsheets. A workflow tool that automates data collection, sends reminders, and provides a real-time dashboard of completion status saves hundreds of hours per campaign and gives the 2nd line time for actual analysis instead of data chasing.

Operational Risk Capital

Banks must hold capital against operational risk, and the framework for calculating how much has changed significantly.

Under Basel II (2004), banks could choose from three approaches: the Basic Indicator Approach (15% of gross income), the Standardised Approach (income split across business lines with different multipliers), or the Advanced Measurement Approach (AMA) — internal models using the bank's own loss data, external loss data, scenario analysis, and business environment factors.

AMA was retired. Banks used the modelling flexibility to produce wildly different capital numbers for similar risk profiles. Two banks with comparable size and business mix could calculate operational risk capital that differed by a factor of three. The Basel Committee concluded this destroyed comparability and credibility.

The replacement is the Standardised Measurement Approach (SMA), effective January 2025 in the EU under CRR3. It has two components:

Business Indicator Component (BIC): calculated from three sub-indicators derived from the bank's financial statements (interest/dividend income, services income, financial income), averaged over three years. Marginal coefficients increase with size: 12% for the first EUR 1 billion, 15% for EUR 1-30 billion, 18% above EUR 30 billion.

Internal Loss Multiplier (ILM): adjusts the BIC based on the bank's actual 10-year loss history. Banks with higher historical losses relative to their BIC face a multiplier above 1 (more capital). Banks with lower losses get a multiplier below 1. National regulators can override this and set ILM = 1 for all banks in their jurisdiction.
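
A worked sketch helps make the mechanics concrete. The marginal coefficients below are the ones quoted above; the ILM formula follows the Basel standardised formulation (loss component = 15 times the 10-year average annual loss), and the input figures are hypothetical.

```python
import math

def bic_component(business_indicator_eur_bn: float) -> float:
    """Business Indicator Component with marginal coefficients of 12% / 15% / 18%."""
    bi = business_indicator_eur_bn
    bic = 0.12 * min(bi, 1.0)
    if bi > 1.0:
        bic += 0.15 * (min(bi, 30.0) - 1.0)
    if bi > 30.0:
        bic += 0.18 * (bi - 30.0)
    return bic

def internal_loss_multiplier(avg_annual_loss_eur_bn: float, bic: float) -> float:
    """ILM = ln(e - 1 + (LC / BIC)^0.8), where LC = 15 x average annual losses."""
    lc = 15.0 * avg_annual_loss_eur_bn
    return math.log(math.e - 1.0 + (lc / bic) ** 0.8)

bi = 12.0                                  # EUR 12bn business indicator (hypothetical)
bic = bic_component(bi)                    # 0.12*1 + 0.15*11 = EUR 1.77bn
ilm = internal_loss_multiplier(0.15, bic)  # ~1.08: losses are high relative to the BIC
orc = bic * ilm                            # national regulators may instead fix ILM = 1
print(round(bic, 2), round(ilm, 2), round(orc, 2))
```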

The EBA estimated the impact: non-AMA banks face an average 21% increase in operational risk capital, AMA banks face 29%. For CIB divisions, where business indicators are large and loss histories can be volatile, the capital impact is material.

The practical implication: loss data quality now has direct capital consequences. Clean, complete, well-categorised loss data covering 10 years can reduce your capital requirement. Poor data means you're stuck with a higher default.

Operational Risk in CIB

CIB creates specific operational risk challenges that don't exist — or exist at much lower intensity — in retail banking.

Transaction complexity. A plain-vanilla equity trade is relatively simple. An exotic equity derivative with knock-in barriers, autocall features, and multi-currency settlement touches pricing models, risk systems, confirmation platforms, collateral management, and regulatory reporting — each a potential failure point. Structured products can have lifecycles spanning years, generating ongoing operational risk from lifecycle events (coupon calculations, barrier monitoring, exercise decisions).

Speed. Markets move in milliseconds. A system outage during a volatile trading session doesn't just cause inconvenience — it causes direct financial loss from unhedged positions. When ICBC's US broker-dealer was hit by ransomware in November 2023, it disrupted settlement of over $9 billion in Treasury-backed securities. The bank had to send settlement details via USB stick by messenger across Manhattan.

Multi-entity complexity. A single CIB trade can be booked through one legal entity, risk-managed in another, and settled through a third — each in different jurisdictions with different regulatory requirements. Booking errors that put a trade in the wrong entity create regulatory, tax, and capital consequences.

Interconnected systems. A large CIB runs hundreds of applications. A single trade can touch 20+ downstream systems. When one system fails or produces bad data, the errors cascade. The JPMorgan London Whale loss started with an Excel error — a formula that divided by the sum of two numbers instead of their average — but the real failure was that no independent system caught and corrected it.

Conduct risk. CIB activities — trading, advisory, lending — create information asymmetries and conflicts of interest that don't exist in retail. Information barriers between banking and markets divisions are a regulatory requirement; breaching them has severe consequences. Market manipulation, insider trading, and benchmark rigging all sit in this category.

Third-party and cyber risk. CIB's reliance on external platforms (Bloomberg, exchanges, CCPs, SWIFT, cloud providers) creates concentration risk. The July 2024 CrowdStrike outage — a faulty security update that crashed 8.5 million systems globally — hit Bank of America, JPMorgan, Wells Fargo, and others. ORX has ranked cybercrime as the #1 emerging risk for three consecutive years.

Automation opportunity

Operational risk event collection in CIB is notoriously incomplete. Front-office teams under-report near misses because the reporting process is cumbersome — typically a form email to the risk team, followed by back-and-forth for categorisation. A simple intake tool with smart defaults (pre-populated desk, pre-suggested Basel category based on keywords, auto-routing) lowers the friction enough to improve capture rates. Better data in means better risk analysis out.
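
The keyword-suggestion piece is simple to prototype. A minimal sketch, assuming an illustrative keyword map; a real tool would draw on the bank's own taxonomy definitions and historical events.

```python
KEYWORD_MAP = {
    "External Fraud": ["phishing", "ransomware", "forged", "hacked"],
    "Business Disruption & System Failures": ["outage", "downtime", "system failure", "connectivity"],
    "Execution, Delivery & Process Management": ["booking error", "settlement fail", "wrong entity", "reconciliation break"],
    "Clients, Products & Business Practices": ["mis-selling", "sanctions", "aml", "suitability"],
}

def suggest_category(description: str) -> str:
    """Pre-suggest a Basel Level 1 category from keyword hits; the reporter confirms or overrides."""
    text = description.lower()
    scores = {cat: sum(kw in text for kw in kws) for cat, kws in KEYWORD_MAP.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unclassified - route to 2nd line"

print(suggest_category("Trade booked to wrong entity, found via reconciliation break"))
# Execution, Delivery & Process Management
```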

Operational Resilience: The New Dimension

Operational risk management asks: "what can go wrong and how do we prevent or mitigate it?" Operational resilience asks a different question: "when something does go wrong, can we keep delivering critical services?"

Dimension | Operational Risk | Operational Resilience
Orientation | Backward-looking (losses, incidents) | Forward-looking (disruption preparedness)
Focus | What can go wrong? | What must keep running?
Key concept | Loss events, risk appetite | Impact tolerances, important business services
Scope | Internal processes, people, systems | End-to-end delivery including third parties

Two regulatory frameworks now mandate operational resilience. The UK PRA/FCA framework (compliance deadline: March 2025) requires firms to identify Important Business Services, set impact tolerances for maximum acceptable disruption, and test against severe but plausible scenarios. The EU's DORA (effective January 2025) focuses specifically on digital operational resilience — ICT risk management, incident reporting within 24 hours, resilience testing including threat-led penetration testing, and third-party risk management with mandatory contract provisions.

DORA carries teeth: fines up to 2% of annual worldwide turnover for financial entities, and up to 1% of daily worldwide turnover for critical ICT providers. It also brings board-level accountability for ICT resilience — this is no longer an IT problem.

The two disciplines are converging. Banks that treat operational resilience as a separate workstream from operational risk end up duplicating effort and creating gaps. The Basel Committee's Principles for Operational Resilience and its revised Principles for the Sound Management of Operational Risk (PSMOR), both published in March 2021, were explicitly written to work together.

Common Pitfalls

Treating operational risk as a compliance exercise. If RCSA is just a form to fill in and incidents are just logs to maintain, you have a framework that ticks boxes but doesn't reduce risk. The OCC found in 2024 that over half of large US banks have weaknesses in their operational risk frameworks. The most common weakness: the framework exists on paper but doesn't drive decisions.

Under-investing in loss data. Under the new SMA, 10 years of clean loss data directly affects capital requirements. Banks that treated loss data collection as low-priority now face higher capital charges. Building the loss database is a multi-year effort — starting late is expensive.

Confusing risk identification with risk management. A beautiful risk register with 500 well-described risks is useless if nobody acts on it. The measure of an operational risk function is not how many risks it identifies but how effectively it drives mitigation. Ask: what changed as a result of this assessment?

Ignoring the interconnections. Operational risk events rarely stay contained. A system failure causes settlement fails, which cause counterparty exposure breaches, which trigger regulatory notifications, which consume management attention. Building systems that track individual risk types without mapping dependencies misses the cascade effects that cause the worst losses.

Sources

  • Basel Committee on Banking Supervision — BCBS 195 / d515: Principles for the Sound Management of Operational Risk (2011, revised 2021)
  • Basel Committee — Standardised Measurement Approach for Operational Risk (d355, 2017)
  • IIA — Three Lines Model (July 2020)
  • ORX — Annual Banking Operational Risk Loss Data Report 2025
  • EBA — RTS on Operational Risk Loss Event Taxonomy (2024)
  • Risk.net — Top 10 Operational Risks for 2025
  • McKinsey — Global Risk Productivity Survey (February 2026)
  • Credit Suisse Independent Review (Archegos), JPMorgan London Whale Senate Report, TD Bank DOJ/FinCEN enforcement actions

Every operational risk tool — RCSA, KRIs, loss data collection, scenario analysis, capital modelling — depends on one thing working first: a shared language for describing risks. That shared language is the taxonomy.

Without a consistent taxonomy, the same risk gets described differently across business lines. The trading desk calls it a "booking error." Operations calls it a "settlement exception." Finance calls it a "P&L adjustment." All three are describing the same event, but when you try to aggregate loss data across the bank, you get three separate entries that never get linked. You can't spot patterns in data you can't compare.

The taxonomy is the data model for operational risk. Get it wrong and everything built on top of it — the reporting, the analytics, the capital calculation — inherits the problem.

The Basel Taxonomy

Basel II established the foundational classification scheme that remains the industry standard. It operates on three levels:

  • Level 1: Seven event types (the top-level categories every bank uses)
  • Level 2: Subcategories within each event type (more specific classifications)
  • Level 3: Activity-level examples (illustrative, not regulatory requirements)

The seven Level 1 event types:

# | Event Type | What Gets Classified Here
1 | Internal Fraud | Unauthorized activity, intentional misreporting, employee theft
2 | External Fraud | Hacking, cheque fraud, forgery, cyber theft
3 | Employment Practices & Workplace Safety | Discrimination, health & safety violations, labour disputes
4 | Clients, Products & Business Practices | Mis-selling, suitability failures, market manipulation, AML violations
5 | Damage to Physical Assets | Natural disasters, terrorism, vandalism
6 | Business Disruption & System Failures | IT outages, infrastructure failures, utility disruptions
7 | Execution, Delivery & Process Management | Data entry errors, failed settlements, missed deadlines, incorrect reporting

The 56-Cell Matrix

Basel didn't stop at event types. It also defined eight standard business lines covering the whole bank, from CIB through to retail:

  1. Corporate Finance
  2. Trading & Sales
  3. Retail Banking
  4. Commercial Banking
  5. Payment & Settlement
  6. Agency Services
  7. Asset Management
  8. Retail Brokerage

Cross these eight business lines with the seven event types and you get a 56-cell matrix. Every loss event must be mapped to exactly one cell — one business line, one event type. The principle is MECE: Mutually Exclusive, Collectively Exhaustive. No loss should fall into two cells. No loss should fall outside the matrix entirely.

This sounds straightforward. In practice, it's one of the most debated aspects of loss data collection. A rogue trader in the equities desk who mismarks positions to hide losses — is that Internal Fraud or Execution, Delivery & Process Management? The answer depends on intent, which isn't always clear at the time of discovery. A cyber attack that causes a trading platform outage — External Fraud or Business Disruption? Banks spend real time on these classification decisions because the answers affect capital calculations, trend analysis, and regulatory reporting.

How Losses Get Mapped

When a loss event occurs, the process typically follows this sequence:

  1. Detection — someone identifies the event (operations staff, automated exception report, audit finding, customer complaint)
  2. Initial classification — the reporting unit assigns a preliminary Basel category based on the event description
  3. Review — the 2nd line operational risk team validates the classification against taxonomy definitions
  4. Final mapping — event is assigned to a cell in the matrix (business line + event type), recorded with gross loss, net loss, root cause, and recovery information
  5. Aggregation — the event feeds into portfolio-level reporting and capital calculations

The quality of step 2 — initial classification — determines how much rework happens downstream. When the first-line reporter picks the wrong category, the 2nd line has to reclassify, which introduces delay and inconsistency. This is where taxonomy training and tooling matter most.

The ORX Reference Taxonomy

The Basel taxonomy was designed for regulatory capital purposes. It works for that job, but banks found it too coarse for operational risk management. Seven categories don't capture the nuances of modern risk landscapes.

ORX — the Operational Riskdata eXchange association, with 82+ global bank members — developed a Reference Taxonomy that expands the classification scheme significantly. The key difference: 14 Level 1 risk types instead of Basel's seven.

Why the expansion? Several risk categories that Basel lumps together or treats as subcategories have grown important enough to warrant standalone status:

  • Conduct risk — mis-selling, market manipulation, conflicts of interest. Basel puts this under "Clients, Products & Business Practices," but the regulatory and management attention it receives justifies its own category.
  • Cyber risk — Basel captures this partially under External Fraud and Business Disruption, but a ransomware attack that steals data, disrupts operations, and triggers regulatory fines doesn't fit neatly into either.
  • Third-party risk — vendor failures, outsourcing risks, fourth-party dependencies. Basel has no dedicated category. After the CrowdStrike outage in 2024 crashed 8.5 million systems globally and hit multiple major banks, the case for standalone tracking became hard to argue against.

The ORX taxonomy also includes a Cause and Impact Taxonomy that complements the event types. This lets banks capture not just what happened (the event) but why it happened (the cause) and what the consequences were (the impact). A trade booking error (event) caused by inadequate training (cause) that resulted in a financial loss and a regulatory breach (impacts) tells a much richer story than the event alone.

ORX also publishes a Process and Service Library — a standardised catalogue of banking processes and services that can be linked to the taxonomy. This bridges the gap between risk classification and process mapping, which we'll come back to when we discuss the process inventory.

EBA 2024 Taxonomy Updates

The European Banking Authority updated the operational risk event taxonomy in 2024 as part of the CRR3 implementation. The update retained Basel's seven Level 1 categories — no structural break — but introduced meaningful changes at Level 2:

  • 26 new Level 2 categories for greater granularity
  • Data privacy breach added as a standalone Level 2 category (previously buried under other classifications)
  • Financial crime given its own Level 2 category (reflecting the regulatory focus on AML/CFT — TD Bank's $3.09 billion fine in 2024 made the case emphatically)
  • Cyber flag ("ICT risk — Cyber") introduced as a cross-cutting marker. A loss event can be tagged as cyber-related regardless of which Level 1 category it falls under. This solves the classification problem where a cyber attack triggers multiple event types simultaneously.
  • Business disruption and system failures Level 2 categories were simplified and rationalised

The cyber flag is worth understanding. A ransomware attack might be classified as External Fraud (Level 1) because it involves a criminal act by a third party. But it also causes business disruption. The cyber flag lets you classify it under External Fraud while still capturing it in cyber risk reporting. It's a pragmatic solution to a real taxonomy problem.
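
In data terms the flag is just an extra attribute alongside the single-valued Level 1 classification. A tiny sketch, with illustrative labels:

```python
ransomware_event = {
    "event_id": "EV-2024-0412",
    "level_1": "External Fraud",           # primary classification stays single-valued
    "level_2": "Cyber theft / extortion",  # illustrative Level 2 label
    "cyber_flag": True,                    # cross-cutting marker per the EBA approach
}

# Cyber risk reporting picks up the event regardless of its Level 1 category
cyber_view = [e for e in [ransomware_event] if e.get("cyber_flag")]
```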

The EBA consulted extensively with ORX during development, and the updated taxonomy aligns more closely with the ORX Reference Taxonomy than previous versions did.

The Process Inventory

A taxonomy tells you how to classify risks. A process inventory tells you where risks live.

The process inventory maps operational risks to specific business processes. Instead of asking "what risks does the bank face?" in the abstract, it asks "what can go wrong in this process, and what controls do we have?" It's the bridge between the risk taxonomy and the operational reality of how work gets done.

Three Related Concepts

These terms get used interchangeably, but they mean different things:

Concept | What It Contains | Scope
Risk Universe | Comprehensive list of all risks the organisation could face | Broadest — everything possible
Risk Register | Principal repository of identified and assessed operational risks | Active risks with owners, ratings, and controls
Emerging Risks | Threats recognised as growing but not yet fully materialised | Forward-looking, often monitored quarterly

The risk universe is your complete catalogue. The risk register is your working list of risks that matter now, with assessments and ownership assigned. Emerging risks are the ones you're watching.

What a Process Inventory Contains

A well-built process inventory typically captures:

Component | Description
Process name | Specific business process (e.g., "OTC derivative trade booking")
Process owner | Named individual accountable for the process
Risk(s) mapped | Operational risks associated with the process, classified per taxonomy
Controls mapped | Controls that mitigate each risk, with design and effectiveness ratings
KRIs linked | Key Risk Indicators that monitor the process for early warning signals
Loss events linked | Historical loss events associated with this process
Basel category | Mapping to the 56-cell matrix
Frequency/volume | How often the process runs and at what volume

The power of the inventory comes from the linkages. When a loss event occurs, you can trace it back to the process, see which control failed, check whether the KRI was flashing amber, and determine whether the RCSA had flagged the risk. When you can't make those connections, you're investigating every incident from scratch.
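
A sketch of what one inventory entry might look like as structured data, with the linkages expressed as IDs. Field names, identifiers, and values are illustrative assumptions.

```python
process_inventory = [
    {
        "process_name": "OTC derivative trade booking",
        "process_owner": "Head of Derivatives Middle Office",   # hypothetical
        "risks": ["R-042"],                                      # risk library IDs
        "controls": ["C-107", "C-211"],                          # control library IDs
        "kris": ["KRI-booking-amendment-rate"],
        "loss_events": ["EV-2024-0183"],                         # linked historical losses
        "basel_cell": ("Trading & Sales", "Execution, Delivery & Process Management"),
        "frequency": "intraday",
    },
]

def trace_loss_to_process(event_id: str) -> list[dict]:
    """Walk back from a loss event to the processes, controls, and KRIs it touches."""
    return [p for p in process_inventory if event_id in p["loss_events"]]

hit = trace_loss_to_process("EV-2024-0183")[0]
print(hit["controls"], hit["kris"])  # which control failed? was the KRI flashing amber?
```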

Building a CIB Process Inventory

If you've read Module 1 on the front-to-back trade lifecycle, you already have the backbone for a CIB process inventory. The lifecycle stages — pre-trade, execution, capture, confirmation, clearing, settlement, post-trade — are your Level 1 processes. Each stage breaks down into activities, and each activity carries operational risks.

Here's how one process maps through the full chain:

Process: Trade booking (post-execution capture)

Risk: Booking error — trade details entered incorrectly (wrong notional, wrong counterparty, wrong trade date, wrong product type)

Basel category: Execution, Delivery & Process Management (Level 1) → Data Entry/Transaction Processing Error (Level 2)

Controls:

  • 4-eye check (maker-checker) on manual bookings
  • Automated validation rules rejecting trades with missing or inconsistent fields
  • End-of-day reconciliation between front-office and middle-office books

KRI: Booking amendment rate — percentage of trades requiring correction after initial booking. Thresholds set on a RAG basis: Green below 2%, Amber at 2-5%, Red above 5%.

Historical losses: Link to past events where booking errors resulted in financial loss (e.g., a trade booked in the wrong currency that settled at an unfavourable rate before the error was caught).
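
The KRI above is easy to express in code. A minimal sketch using the RAG thresholds just quoted; the trade counts are made up.

```python
def booking_amendment_rate(amended_trades: int, total_trades: int) -> float:
    """Percentage of trades requiring correction after initial booking."""
    return 100.0 * amended_trades / total_trades if total_trades else 0.0

def rag_status(rate_pct: float) -> str:
    if rate_pct < 2.0:
        return "Green"
    if rate_pct <= 5.0:
        return "Amber"
    return "Red"

rate = booking_amendment_rate(amended_trades=180, total_trades=5200)
print(f"{rate:.2f}% -> {rag_status(rate)}")  # 3.46% -> Amber
```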

Now extend this across the lifecycle. Trade confirmation introduces risks around unmatched confirmations and unsigned ISDAs. Clearing introduces risks around CCP margin calculation errors. Settlement introduces risks around failed deliveries and cash breaks. Each stage gets the same treatment: identify the process, name the risks, map the controls, link the KRIs, and connect historical losses.

The practical approach: start with the highest-volume, highest-value processes. In most CIB operations, that means equity and fixed income trade booking, OTC derivative confirmations, margin management, and cash/securities settlement. Don't try to inventory everything at once. Build out from the core and expand to lower-volume processes over time.

One thing that trips up inventory projects: assuming the process boundaries are obvious. In CIB, a single client transaction can span multiple desks, systems, and legal entities. Where does "trade booking" end and "trade confirmation" begin? The answer depends on your operating model, and it's different at every bank. Spending time upfront to define clear process boundaries — with input from operations, technology, and the business — prevents rework later when you discover two teams have inventoried the same activity differently.

The Risk and Control Library

The risk and control library is the structured reference that sits alongside the process inventory. It defines each risk and each control in a standardised way so they can be reused across processes.

A risk entry typically includes:

  • Risk ID and name (e.g., R-042: Trade booking error)
  • Risk description (what can go wrong, in specific terms)
  • Risk category (taxonomy mapping)
  • Inherent risk rating (the risk level before any controls are applied — pure exposure)
  • Controls mapped (which controls mitigate this risk)
  • Residual risk rating (the risk level after controls — the exposure you actually carry)

A control entry includes (both entry types are sketched in code after this list):

  • Control ID and name (e.g., C-107: Maker-checker on manual trade bookings)
  • Control description (what the control does, specifically)
  • Control type (preventive vs. detective; manual vs. automated)
  • Design effectiveness (is it designed to address the risk?)
  • Operating effectiveness (does it work in practice?)
  • Testing frequency (how often the control is validated)
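
A minimal sketch of the two entry types as structured records; the enumerations, rating scales, and example values are illustrative assumptions rather than a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class RiskEntry:
    risk_id: str                    # e.g. "R-042"
    name: str
    category: str                   # taxonomy mapping
    inherent_rating: str            # exposure before controls
    control_ids: list[str] = field(default_factory=list)
    residual_rating: str = "Not assessed"

@dataclass
class ControlEntry:
    control_id: str                 # e.g. "C-107"
    name: str
    control_type: str               # "preventive" or "detective"
    automated: bool
    design_effectiveness: str       # is it designed to address the risk?
    operating_effectiveness: str    # does it work in practice?
    testing_frequency: str          # e.g. "quarterly"

booking_risk = RiskEntry("R-042", "Trade booking error",
                         "Execution, Delivery & Process Management",
                         inherent_rating="High", control_ids=["C-107"],
                         residual_rating="Medium")
maker_checker = ControlEntry("C-107", "Maker-checker on manual trade bookings",
                             "preventive", automated=False,
                             design_effectiveness="Effective",
                             operating_effectiveness="Partially effective",
                             testing_frequency="quarterly")
```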

Inherent vs. Residual Risk

This distinction drives how banks prioritise mitigation. Inherent risk is the exposure assuming no controls exist — the raw danger of the activity. Residual risk is what remains after controls are applied. The gap between the two represents your control effectiveness.

A process with high inherent risk and effective controls can have low residual risk. That's fine — until a control fails. A process with moderate inherent risk and weak controls can have residual risk higher than a bank's appetite.

The RCSA process (covered in a later topic) systematically evaluates both. But the risk and control library provides the building blocks. Without standardised definitions, two business lines assessing the same risk will rate it differently — not because the risk is different, but because they're interpreting it differently.

Common Pitfalls

Inconsistent taxonomy across business lines. Global Markets uses one set of risk categories, Investment Banking uses another, and Transaction Banking uses a third. When the OpRisk function tries to aggregate risks at the firm level, the categories don't align. This happens more than you'd expect, especially after mergers or when business lines have built their own risk frameworks independently. The fix is governance: one taxonomy, centrally maintained, with a defined process for requesting changes.

Risks defined too broadly or too narrowly. "Technology risk" as a single entry in the risk register is too broad to manage — it covers everything from a server crash to a vendor going bankrupt. "Risk that the printer on floor 3 jams during end-of-day processing" is too narrow to be useful. The right granularity lets you assign a meaningful owner, attach specific controls, and track trends. A good test: can someone act on this risk description? If it's too vague to act on, it's too broad.

Process inventory that never gets updated. The inventory is built during a project, documented thoroughly, and then sits untouched for two years while the business changes around it. New products launch, systems are replaced, team structures shift — and the inventory becomes fiction. Maintenance has to be built into the operating rhythm, not treated as a one-off project. Tying updates to the RCSA cycle is one way to force it.

Orphaned controls. Controls that exist in the library but aren't linked to any current risk, or controls that are linked to a risk that no longer exists. This happens when the risk profile changes but the control inventory doesn't keep pace. Orphaned controls waste testing resources — the 2nd line is validating controls that don't actually mitigate anything relevant. Regular reconciliation between the risk register and the control library catches this.

Classification paralysis. Teams spend excessive time debating which Basel category a loss event belongs to, delaying reporting and creating a backlog. Clear taxonomy definitions with worked examples for each category reduce the debate. Decision trees — "Does the event involve intentional employee action? → Internal Fraud. Does it involve a third-party criminal act? → External Fraud" — help first-line reporters make consistent initial classifications.

Automation opportunity

Taxonomy mapping is one of the most manual and inconsistent parts of loss data collection. A first-line reporter describes a loss event in free text, and someone — usually in the 2nd line — reads the description and picks the Basel category. This is slow, subjective, and inconsistent across reviewers. An NLP-based classification tool that reads the event description and suggests the most likely Basel Level 1 and Level 2 categories — with a confidence score — can cut classification time significantly and improve consistency. The model doesn't need to be perfect; it just needs to get the suggestion right often enough that the reviewer is confirming rather than deciding from scratch. Training data is readily available from historical loss event databases. The hard part isn't the model — it's getting the taxonomy definitions clean enough to train on.
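
A minimal sketch of such a classifier. The handful of labelled descriptions below stand in for the historical loss database; a production version would train on thousands of events and suggest Level 2 categories as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in training data; in practice this comes from the historical loss event database
descriptions = [
    "trade booked with wrong notional, corrected after settlement",
    "phishing email compromised payments operator credentials",
    "trading platform outage during market open",
    "derivative mis-sold to corporate client without suitability checks",
]
labels = [
    "Execution, Delivery & Process Management",
    "External Fraud",
    "Business Disruption & System Failures",
    "Clients, Products & Business Practices",
]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(descriptions, labels)

new_event = ["settlement instruction sent with a stale SSI, trade failed to settle"]
probs = model.predict_proba(new_event)[0]
best = probs.argmax()
# Suggested Basel Level 1 category plus a confidence score for the reviewer
print(model.classes_[best], round(float(probs[best]), 2))
```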

Tying It Together

The taxonomy, process inventory, and risk and control library form a connected system. The taxonomy provides the classification structure. The process inventory maps where risks occur. The risk and control library defines what the risks are and what mitigates them. Together, they create the foundation that every other OpRisk tool relies on.

When you build a loss event reporting system, you need the taxonomy to classify events. When you design an RCSA campaign, you need the process inventory to scope it. When you create KRI dashboards, you need the control library to know what you're monitoring. When you calculate capital under SMA, you need all three working together to produce clean, well-categorised loss data covering 10 years.

None of this is glamorous work. Taxonomy maintenance, process inventory updates, control library reconciliation — these are the operational risk equivalent of data engineering. But skip them and you'll spend far more time later trying to make sense of data that doesn't fit together.

Sources

  • Basel Committee on Banking Supervision — BCBS 195 / d515: Principles for the Sound Management of Operational Risk (2011, revised 2021)
  • Basel Committee — Basel II Framework: 56-cell matrix (8 business lines x 7 event types)
  • ORX — Operational Risk Reference Taxonomy (14 Level 1 risk types)
  • ORX — Process and Service Library
  • EBA — Final Report on RTS Operational Risk Loss Event Taxonomy (2024)
  • OSFI — Guideline E-21: Operational Risk Management and Resilience (updated August 2024)
  • Open Risk — White Paper on Operational Risk Taxonomy
  • ORX — Annual Banking Operational Risk Loss Data Report 2025

RCSA is the process where business owners — the 1st line — systematically identify their own operational risks, evaluate whether their controls work, and document what they plan to do about the gaps. The 2nd line designs the framework, facilitates the campaigns, and challenges the results. The 3rd line validates that the whole thing is credible.

It sounds straightforward. In practice, RCSA is the most resource-intensive process in the entire operational risk toolkit. It consumes more 1st and 2nd line time than any other OpRisk activity — incident reporting, KRI monitoring, scenario analysis, all of it. And here is the uncomfortable part: despite consuming the most time, RCSA consistently ranks as delivering the least perceived value. Deloitte's 2025 analysis of RCSA practices put it bluntly — this paradox cannot continue.

The challenge isn't the concept. Self-assessment makes sense: the people closest to the work are best placed to identify what can go wrong. The challenge is execution. RCSA at a large CIB means coordinating assessments across hundreds of processes, dozens of desks, multiple geographies, and thousands of people who would rather be doing almost anything else.

The RCSA Process Step by Step

A full RCSA cycle has eight stages. Most banks run this annually or semi-annually, with event-driven updates when something significant changes — a new product launch, a major incident, a regulatory finding.

1. Scoping

Before anything else, you define what you are assessing. Which business lines? Which processes? At what level of granularity? A CIB might scope RCSA at the desk level (equities cash trading, rates derivatives, FX options) or at the process level (trade execution, trade confirmation, settlement, collateral management). The choice matters — too broad and you get generic risks that don't help anyone; too granular and the exercise becomes unmanageable.

Process-driven RCSA — scoping around end-to-end business processes rather than organisational units — is gaining traction. Deloitte's 2025 guidance advocates this approach because it captures handoff risks between teams that functional assessments miss.

2. Risk Identification

This is where you build the inventory of what can go wrong. Inputs come from multiple sources:

  • Internal loss data — what has actually gone wrong in the past
  • Audit findings — what internal audit flagged as deficient
  • Regulatory observations — what the regulators told you to fix
  • External loss data — what went wrong at other banks (ORX data, public enforcement actions)
  • Process analysis — walking through how a process works and identifying failure points
  • Workshops and interviews — structured conversations with process owners

The goal is a risk register: a list of risks specific enough to be actionable. "System failure" is not useful. "Settlement system outage during end-of-day processing causing failed trades above tolerance" is.

3. Risk Assessment — Inherent Risk

Each identified risk gets scored for inherent risk — the level of risk before considering any controls. This uses the likelihood-times-impact matrix described in detail below. Inherent risk matters because it tells you what you are exposed to if your controls fail. A risk with low inherent risk and weak controls might be acceptable. A risk with high inherent risk and weak controls is a problem even if losses have not materialised yet.

4. Control Identification and Assessment

For each risk, you identify the controls that mitigate it and assess two dimensions:

  • Design effectiveness — is the control designed to address the risk? A four-eyes check on trade entry is designed to catch booking errors. A monthly reconciliation is not designed to catch intraday pricing errors, no matter how well it runs.
  • Operating effectiveness — is the control actually working as designed? A four-eyes check that exists in policy but gets bypassed during busy periods has good design and poor operating effectiveness.

Control assessment is where most of the subjectivity lives. The 1st line rates their own controls, and the natural tendency is optimism. Nobody wants to admit their controls are weak — it creates work, invites scrutiny, and reflects poorly on the team.

5. Residual Risk Calculation

Residual risk = inherent risk adjusted for control effectiveness. If the inherent risk is high but controls are strong and well-operated, residual risk comes down. If controls are weak or partially effective, residual risk stays elevated.

Inherent Risk \ Controls | Strong Controls | Moderate Controls | Weak Controls
High Inherent Risk | Medium residual | High residual | Very high residual
Medium Inherent Risk | Low residual | Medium residual | High residual
Low Inherent Risk | Low residual | Low residual | Medium residual

The inherent-to-residual movement is the whole point of the exercise. It tells you: are your controls earning their keep? If a risk scores as high inherent and stays high residual, your controls are not doing enough.
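
The table above is effectively a lookup, which is how many GRC tools implement it. A minimal sketch:

```python
RESIDUAL = {
    ("High",   "Strong"):   "Medium",
    ("High",   "Moderate"): "High",
    ("High",   "Weak"):     "Very High",
    ("Medium", "Strong"):   "Low",
    ("Medium", "Moderate"): "Medium",
    ("Medium", "Weak"):     "High",
    ("Low",    "Strong"):   "Low",
    ("Low",    "Moderate"): "Low",
    ("Low",    "Weak"):     "Medium",
}

def residual_risk(inherent: str, control_effectiveness: str) -> str:
    return RESIDUAL[(inherent, control_effectiveness)]

# High inherent risk with only moderately effective controls stays High residual
print(residual_risk("High", "Moderate"))
```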

6. Action Planning

Every gap identified — a high residual risk, a control rated as weak, a risk without adequate controls — gets an action item. Each action needs an owner, a deadline, and a defined target state. This is where RCSA stops being an assessment and starts being a management tool. The action plan is the output that actually reduces risk.

7. Second-Line Review and Challenge

The 2nd line reviews every assessment. This is not rubber-stamping — it is credible challenge. Are the risk descriptions specific enough? Are the inherent risk scores calibrated consistently? Are control ratings backed by evidence, or are they the 1st line marking their own homework? The 2nd line pushes back where ratings look optimistic, where known issues are not reflected, or where loss data contradicts the self-assessment.

8. Committee Reporting

RCSA results flow up through the governance structure: business-level operational risk committees, the firm-wide operational risk committee, the executive risk committee, and ultimately the board risk committee. Reporting typically includes residual risk heat maps, movement since the prior assessment, material gaps, and action plan status.

The Likelihood x Impact Matrix

The matrix is deceptively simple. A 5x5 grid with likelihood on one axis and impact on the other. Each risk gets plotted, producing a heat map.

Likelihood \ Impact | Negligible | Minor | Moderate | Major | Severe
Almost Certain | Medium | High | High | Very High | Very High
Likely | Medium | Medium | High | High | Very High
Possible | Low | Medium | Medium | High | High
Unlikely | Low | Low | Medium | Medium | High
Rare | Low | Low | Low | Medium | Medium

The challenge is not the grid — it is calibration. What does "likely" mean? Once a year? Once a month? What does "major impact" mean in dollar terms for a CIB desk generating $500 million in annual revenue versus a support function?

Banks typically define these scales in dollar terms and frequency ranges:

Rating | Likelihood Definition | Impact Definition (Illustrative)
1 — Rare/Negligible | Less than once in 10 years | < $1M loss
2 — Unlikely/Minor | Once in 5-10 years | $1M - $10M
3 — Possible/Moderate | Once in 1-5 years | $10M - $50M
4 — Likely/Major | About once a year | $50M - $250M
5 — Almost Certain/Severe | Multiple times per year | > $250M

Calibration matters because RCSA results need to aggregate. If every desk defines "high" differently, the firm-wide risk picture is meaningless. You cannot compare residual risks across equities trading and loan operations if one desk calls $5 million losses "severe" and another calls them "minor." Consistent scale definitions, enforced by the 2nd line, are a prerequisite for useful RCSA output.

Impact should not be limited to financial loss. Banks typically assess multiple impact dimensions: financial, regulatory, reputational, and client impact. A small financial loss that triggers a regulatory investigation can be more consequential than a larger loss that stays internal.
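
Putting the scales and the grid together, a scoring helper might look like the sketch below. The dollar thresholds encode the illustrative impact scale above; real calibration would also cover the non-financial impact dimensions.

```python
IMPACT_THRESHOLDS_USD = [1e6, 10e6, 50e6, 250e6]   # boundaries for impact ratings 1-5

HEAT = {  # HEAT[likelihood_rating] -> list indexed by impact rating 1-5
    5: ["Medium", "High", "High", "Very High", "Very High"],   # Almost Certain
    4: ["Medium", "Medium", "High", "High", "Very High"],      # Likely
    3: ["Low", "Medium", "Medium", "High", "High"],            # Possible
    2: ["Low", "Low", "Medium", "Medium", "High"],             # Unlikely
    1: ["Low", "Low", "Low", "Medium", "Medium"],              # Rare
}

def impact_rating(estimated_loss_usd: float) -> int:
    return 1 + sum(estimated_loss_usd >= t for t in IMPACT_THRESHOLDS_USD)

def heat_rating(likelihood: int, impact: int) -> str:
    return HEAT[likelihood][impact - 1]

# A risk expected roughly once a year with an estimated $30M impact
imp = impact_rating(30e6)          # 3 (Moderate)
print(heat_rating(4, imp))         # High
```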

Workshop vs. Questionnaire Approaches

An RMA/PwC survey found that 61% of banks use a mix of workshops and offline work for RCSA. The remaining banks split roughly evenly between pure workshops and pure questionnaire-based approaches.

Workshops bring process owners and risk specialists into the same room. The facilitator walks through each process, the group identifies risks collectively, and debate produces better-calibrated ratings. The discussion itself has value — it forces people to articulate assumptions, surface disagreements, and share information that would never make it into a form. The downside: workshops are time-intensive. Getting a senior trader to spend three hours in a risk workshop during a busy trading week is a hard sell. Multiply that across dozens of desks and you understand the resource problem.

Questionnaires scale. You send a structured form to 200 process owners and collect responses. But the form cannot replicate the discussion that happens in a well-run workshop. People fill in what seems reasonable, default to ratings from the prior year, and move on. This is where RCSA becomes a box-ticking exercise.

The trend is toward continuous, process-driven RCSA — replacing the annual campaign model with ongoing assessment embedded in daily operations. Instead of a once-a-year workshop, risks and controls are reviewed continuously, with triggers for reassessment when conditions change. This is harder to implement but produces assessments that actually reflect current reality.

Who Does What

Role | Line | RCSA Responsibility
Process owners / desk heads | 1st | Own the assessment. Identify risks in their area, rate controls, develop action plans.
Operational risk officers | 2nd | Design the RCSA framework, facilitate workshops, maintain the risk taxonomy, challenge 1st-line ratings.
Compliance | 2nd | Input on regulatory risks, compliance control effectiveness.
Internal audit | 3rd | Validate RCSA accuracy. Test whether reported control ratings match reality.
Senior management | Above the lines | Review and approve RCSA results. Ensure action items are resourced.

The perennial challenge is 1st-line engagement. Front-office teams — traders, salespeople, structurers — are compensated on revenue. RCSA is overhead. It does not generate revenue, it does not close deals, and it takes time away from activities that do. Getting genuine engagement rather than grudging compliance requires one of two things: either a culture where risk management is genuinely valued (rare), or consequences for non-engagement that are real enough to change behaviour (also rare, but more common than it used to be).

Banks that get better results tend to embed operational risk officers within the business lines rather than running RCSA from a central function. An ORM specialist who sits on the trading floor, understands the products, and has relationships with the desk heads can facilitate a more honest conversation than someone from head office who visits once a year with a spreadsheet.

RCSA in CIB

CIB presents specific RCSA challenges that other parts of the bank do not face at the same scale.

Product diversity. A single CIB division might trade equities, fixed income, FX, commodities, and derivatives across all of those — each with different risk profiles, different systems, and different control requirements. An RCSA for equity cash trading looks nothing like an RCSA for OTC derivatives clearing.

Process volume. The trade lifecycle alone — from execution through booking, confirmation, settlement, reconciliation, and reporting — spans dozens of sub-processes. Each sub-process has its own risks and controls. A large CIB might have hundreds of distinct processes requiring assessment.

Cross-desk dependencies. Risks do not respect organisational boundaries. A booking error on the structured products desk creates a reconciliation break for operations, a P&L misstatement for finance, and a regulatory reporting error for compliance. RCSA needs to capture these handoff risks, but desk-level assessments tend to stop at the desk boundary.

Connection to the trade lifecycle. The most effective CIB RCSAs map directly to the trade lifecycle rather than treating it abstractly. For each stage — pre-trade, execution, post-trade processing, settlement, lifecycle management — the assessment identifies what goes wrong, what controls exist, and where the gaps are. This produces actionable output because it connects risks to specific operational steps that people can actually fix.

Automation opportunity

Automation Opportunity — Mapping RCSA to Process Mining

Most CIB RCSAs rely on process owners describing how their processes work. The descriptions are often outdated or idealised — they reflect how the process is supposed to work, not how it actually runs. Process mining tools analyse system logs to reconstruct actual workflows, including deviations, bottlenecks, and exception handling. Feeding process mining output into RCSA scoping gives you an evidence-based starting point: here is what actually happens, now assess the risks. Banks that have piloted this approach report that it surfaces risks the process owners did not mention — because the deviations had become normalised.
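
A toy illustration of the idea using pandas: reconstruct each trade's actual activity sequence from a system log and count the variants. The column names and log are made up; real implementations use dedicated process mining tools over much larger event logs.

```python
import pandas as pd

log = pd.DataFrame({
    "trade_id":  ["T1", "T1", "T1", "T2", "T2", "T2", "T2"],
    "activity":  ["book", "confirm", "settle", "book", "amend", "confirm", "settle"],
    "timestamp": pd.to_datetime([
        "2024-06-03 09:01", "2024-06-03 11:30", "2024-06-05 10:00",
        "2024-06-03 09:05", "2024-06-03 09:40", "2024-06-03 12:10", "2024-06-05 10:05",
    ]),
})

# Reconstruct the as-run workflow per trade and count how often each variant occurs
variants = (log.sort_values("timestamp")
               .groupby("trade_id")["activity"]
               .agg(" -> ".join)
               .value_counts())
print(variants)
# "book -> amend -> confirm -> settle" is the deviation an idealised process map misses
```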

Common Failures

RCSA programmes fail in predictable ways. Recognising these patterns is the first step to avoiding them.

Box-ticking. The assessment becomes a compliance exercise rather than a management tool. People copy ratings from last year, update the date, and submit. The 2nd line accepts it because they are drowning in volume and do not have the capacity to challenge everything. The result is an RCSA that reflects the past, not the present.

Optimistic control ratings. When the 1st line rates their own controls, ratings skew positive. A control rated "effective" with no evidence other than the process owner's assertion is not an assessment — it is an opinion. Without independent testing or evidence requirements, RCSA control ratings are unreliable.

Disconnection from loss data. If a desk reports three settlement failures in a quarter but the RCSA rates settlement risk as "low" with "effective" controls, something is wrong. The most useful RCSAs cross-reference self-assessment ratings against actual loss data, KRI trends, and audit findings. Disconnections between the self-assessment and objective data are red flags that should trigger deeper review.

Stale assessments. An RCSA completed in January that does not get updated after a major system migration in June is worse than useless — it provides false comfort. Event-driven updates are essential, but they require a process for identifying triggers and a mechanism for rapid reassessment.

Under-reporting. Some risks never make it into the assessment because acknowledging them creates work. If a desk knows their reconciliation process has a manual workaround that is fragile, reporting it means someone will have to fix it. The incentive to omit is real, and only culture or governance can overcome it.

AI and Automation in RCSA

RCSA is ripe for automation because so much of the current process is manual, repetitive, and low-value. Only 20% of institutions use modern tools like AI for RCSA — the vast majority still rely on spreadsheets, email, and manual compilation. That is changing, though slowly.

GenAI for control description evaluation. One of the most time-consuming 2nd-line activities is reviewing thousands of risk and control descriptions for completeness and quality. Do the descriptions answer the basic questions — what is the risk, what triggers it, who owns it, what is the control, how often is it performed? IBM's approach uses large language models to evaluate descriptions against established standards (the "5 Ws") without requiring large training datasets. The model flags descriptions that are vague, incomplete, or internally inconsistent, letting the 2nd line focus their challenge efforts where they matter most.

AI-driven risk ratings. Rather than relying solely on subjective 1st-line scoring, AI models can suggest risk ratings based on historical loss data, KRI trends, external loss benchmarks, and the characteristics of the process being assessed. The human still makes the final call, but the starting point is data-driven rather than opinion-driven. This addresses the calibration problem — AI does not suffer from the anchoring bias that makes people default to last year's rating.

Continuous monitoring replacing periodic campaigns. The annual RCSA campaign model is giving way to continuous assessment. Automated monitoring of KRIs, control performance metrics, and process health indicators can flag when a risk profile has shifted — triggering a targeted reassessment rather than waiting for the next scheduled cycle. This is closer to how market risk works (daily VaR monitoring) and far more responsive than a once-a-year exercise.

Natural language processing for incident correlation. NLP tools can scan incident reports, audit findings, and regulatory communications to identify themes and patterns that should feed into RCSA risk identification. Instead of relying on workshop participants to remember relevant incidents, the system surfaces them automatically.

Automation opportunity

Automation Opportunity — The 2nd-Line Bottleneck

Second-line review and challenge is where RCSA campaigns stall. A team of fifteen operational risk officers reviewing RCSA submissions from two hundred desks does not scale. The math does not work. GenAI can serve as a first-pass reviewer: checking descriptions against quality standards, comparing ratings against loss data and KRI trends, flagging inconsistencies between related assessments, and identifying submissions that are likely copy-pasted from prior periods. This does not replace human challenge — it triages it. The 2nd line focuses their time on the assessments that genuinely need expert judgment rather than spending weeks on administrative review. McKinsey estimates 10-20% FTE productivity gains from this kind of risk automation.
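
One of the cheaper triage checks mentioned above needs no AI at all: flag submissions that look copy-pasted from the prior period. A sketch with an arbitrary similarity threshold; an LLM-based quality review would sit alongside simple rules like this.

```python
from difflib import SequenceMatcher

def likely_copy_paste(current: str, prior: str, threshold: float = 0.95) -> bool:
    """Flag an RCSA narrative that is near-identical to last period's submission."""
    return SequenceMatcher(None, current.lower(), prior.lower()).ratio() >= threshold

prior = "Risk of booking errors on manual trades. Maker-checker control rated effective."
current = "Risk of booking errors on manual trades. Maker-checker control rated effective. "
print(likely_copy_paste(current, prior))  # True -> route to human challenge
```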

Making RCSA Actually Useful

The banks that extract real value from RCSA tend to share a few characteristics. They connect RCSA output to decisions — staffing, investment, process redesign. They hold action plan owners accountable with the same rigour as financial targets. They invest in calibration so the numbers are comparable across business lines. And they treat the assessment as a conversation between the 1st and 2nd line, not a form submission exercise.

The Deloitte paradox — most effort, least value — is not inherent to RCSA as a concept. It is a symptom of how most banks run it: periodic, manual, disconnected from operational reality, and treated as overhead rather than management intelligence. The banks breaking out of this pattern are the ones investing in process-driven approaches, continuous monitoring, and technology that shifts the effort from data collection to analysis.

RCSA remains the foundation of operational risk management. Every other OpRisk process — KRIs, scenario analysis, capital modelling — depends on the risk and control inventory that RCSA produces. Getting it right is not optional.

Sources

  • Deloitte UK — Ten Steps to RCSA Redemption (2025)
  • RMA/PwC — How Banks Are Refining RCSA Programs (2024)
  • Forvis Mazars — RCSA Best Practices (2024)
  • ISACA — Executing a Well-Executed RCSA (2024)
  • IBM — Automate RCSA with GenAI: Enhance Risk Management with Generative AI
  • McKinsey — The Future of Operational Risk Management in Financial Services
  • McKinsey — Global Risk Productivity Survey (February 2026)
  • Basel Committee on Banking Supervision — Principles for the Sound Management of Operational Risk (BCBS 195/d515)
  • ORX — Annual Banking Operational Risk Loss Data Report 2025
  • Capco — Building Operational Resilience through RCSA
  • MetricStream — Six Critical Factors to Modernize Your RCSA

A CIB operation that has never had an incident is either very new or very bad at detecting them. Losses happen. Trades get booked wrong. Systems go down during volatile sessions. Sanctions screening misses a match. The question is not whether incidents occur — it is whether the bank captures them honestly, investigates them properly, and fixes the underlying cause before the next one hits.

This page covers three things that connect tightly in practice: how incidents are captured and classified, what the major CIB failures teach us about what goes wrong, and how controls are designed, tested, and maintained to keep the loss rate down.

What Counts as an Incident

Banks distinguish between three categories, and getting the boundaries right matters for data quality, capital calculation, and regulatory reporting.

Loss event. Something went wrong and the bank lost money. A settlement fail that required a correction trade at an unfavourable price. A regulatory fine. A fraud payout. The loss is measured in financial terms — the amount the bank would not have lost if the process, person, or system had worked as intended. Under Basel rules, the gross loss (before any recoveries from insurance or other sources) is the headline figure. Net loss — after recoveries — is tracked separately.

Near miss. Something went wrong but was caught before it caused a financial loss. A trade booked to the wrong entity, detected and corrected before settlement. A phishing email that reached a deal team inbox but was reported before anyone clicked. Near misses are free lessons. They reveal control weaknesses without the price tag. Banks that track them rigorously get early warnings about process degradation. Banks that ignore them wait for the loss event.

Operational incident. A broader category that includes loss events and near misses but also covers events like system outages, process failures, or policy breaches that may not result in a direct financial loss but still require investigation and remediation. A trading platform going offline for 20 minutes during a low-volatility session might not produce a measurable loss, but it exposes a vulnerability that could be catastrophic on a different day.

The boundary between near miss and loss event is not always clean. Consider a booking error caught during end-of-day reconciliation. If the correction is made before settlement, no money changes hands — near miss. If it is caught the next morning after an overnight settlement cycle has already processed, the bank may need a compensating trade at a new market price — loss event. The same root cause, the same breakdown in controls, separated by a few hours. This is why serious operational risk programmes track both with equal discipline.

The Incident Lifecycle

Every loss event and significant near miss follows a lifecycle. The rigour applied at each stage determines whether the bank actually learns from the event or just logs it.

Detection. The incident surfaces. This can be a control catching an exception (reconciliation break, limit breach alert), a staff member noticing something wrong, a client complaint, a regulatory inquiry, or an automated monitoring system flagging anomalous activity. Detection speed matters — an undetected rogue trade compounds daily.

Initial assessment. The first-line team evaluates severity. Is this a booking error affecting one trade, or a systemic issue affecting a product class? Initial assessments determine escalation paths and resource allocation. Getting this wrong — classifying a systemic problem as a one-off — delays the response.

Escalation. Banks define materiality thresholds that trigger escalation to senior management, the operational risk function, or regulators. These thresholds vary by institution but typically include losses above a defined amount (often EUR 1 million for initial escalation, EUR 10 million for executive attention), events affecting multiple business lines, and events involving regulatory breach, fraud, or client harm. Under DORA, major ICT-related incidents must be reported to supervisors within 24 hours of classification. Most jurisdictions have similar notification windows for significant events.

Investigation. The operational risk team, sometimes with compliance or internal audit, digs into what happened. This is not just chronology — it is about identifying why existing controls did not prevent or detect the event sooner.

Root cause analysis. The most important stage and the one most often shortchanged. Surface-level root causes ("the analyst made an error") are useless. Effective root cause analysis asks why repeatedly. Why did the analyst make an error? Because the booking interface allows free-text entry for counterparty names. Why does it allow free-text? Because the system was built before the current counterparty master was implemented. Why has it not been updated? Because the change request has been deprioritised for three consecutive quarters. Now you have something actionable.

Remediation. Fix the root cause, not just the symptom. If the root cause is a system limitation, the remediation is a system change — not retraining the analyst who made the error. Remediation actions need named owners and deadlines. Credit Suisse identified the Archegos concentration risk in September 2020 and recommended solutions, but set no deadlines and assigned no owners. Six months later, the bank lost $5.5 billion.

Closure and lessons learned. The incident is closed when remediation is verified as effective. Lessons learned are documented and, critically, shared across the organisation. A booking error in equities structured products may have implications for the fixed income desk running a similar process.

Automation Opportunity: The incident lifecycle — from detection through closure — runs on emails, spreadsheets, and calendar reminders at most banks. A workflow engine that assigns tasks at each stage, enforces escalation rules based on severity, tracks remediation deadlines, and flags overdue actions turns a process that depends on individual diligence into one that runs on system logic. The data captured along the way feeds directly into loss trend analysis and RCSA updates without manual re-entry.
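
As a sketch of the escalation-rule piece, here is how severity-based routing might look in code, using the illustrative thresholds mentioned earlier (EUR 1 million for initial escalation, EUR 10 million for executive attention); the fields and rules are simplified assumptions, not any bank's actual policy.

```python
# Minimal sketch of severity-based escalation routing for incident workflow.
from dataclasses import dataclass

@dataclass
class Incident:
    gross_loss_eur: float
    regulatory_breach: bool = False
    client_harm: bool = False
    fraud: bool = False
    business_lines_affected: int = 1

def escalation_path(inc: Incident) -> list[str]:
    path = ["business_line_risk_manager"]            # every incident starts here
    if (inc.gross_loss_eur >= 1_000_000
            or inc.regulatory_breach or inc.fraud or inc.client_harm
            or inc.business_lines_affected > 1):
        path.append("operational_risk_function")      # 2nd-line notification
    if inc.gross_loss_eur >= 10_000_000:
        path.append("executive_risk_committee")       # executive attention
    if inc.regulatory_breach:
        path.append("regulatory_notification")        # subject to local reporting rules
    return path

# Example: a EUR 12m loss involving a regulatory breach escalates all the way up.
print(escalation_path(Incident(gross_loss_eur=12_000_000, regulatory_breach=True)))
```

Encoding the rules this way also means every escalation decision is logged and auditable, rather than depending on whoever happened to read the email.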

Loss Data in Practice

Collecting loss data sounds straightforward: something goes wrong, you record the amount. In reality, it is one of the hardest data problems in operational risk.

Who reports. The first line — business and operations teams — is responsible for reporting incidents. The second line (operational risk function) provides the framework, challenges completeness, and maintains the loss database. The third line (internal audit) periodically tests whether the capture process is working. The weak link is almost always the first line. Reporting an incident means admitting something went wrong in your area. The incentives to under-report are obvious.

The completeness challenge. Every bank's loss database is incomplete. The question is how incomplete. Small losses — a few thousand euros from a correction trade — often go unreported because the effort of logging them exceeds their individual significance. But aggregate small losses reveal process weaknesses that large losses may not. A desk that generates fifty small booking errors per quarter has a systemic problem, even if no single error is material.

The 10-year requirement. Under the Standardised Measurement Approach (SMA), banks must maintain 10 years of internal loss data for capital calculation. Losses above EUR 10 million and EUR 100 million receive heavier weighting — the framework is designed to capture fat-tail risk. Banks that historically under-invested in loss data collection now face a direct capital consequence: incomplete data means a less favourable Internal Loss Multiplier, which means more capital.
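
For readers building capital or loss-data tooling, the mechanics are worth seeing. In the Basel III standardised approach, the internal loss multiplier is ILM = ln(e - 1 + (LC/BIC)^0.8), where BIC is the business indicator component and LC is the loss component derived from the 10-year internal loss history (15 times the average annual loss in the final framework; the earlier consultative version weighted large losses more heavily, as noted above). A simplified sketch:

```python
# Simplified sketch of the Basel III standardised-approach internal loss multiplier.
# Banks that fail the loss-data quality standards can be required to apply an
# ILM of at least 1, which is the capital consequence of incomplete data.
import math

def internal_loss_multiplier(loss_component: float, bic: float) -> float:
    return math.log(math.e - 1 + (loss_component / bic) ** 0.8)

def operational_risk_capital(avg_annual_loss: float, bic: float) -> float:
    lc = 15 * avg_annual_loss           # final-framework loss component
    return bic * internal_loss_multiplier(lc, bic)

# A loss history low relative to the business indicator gives ILM below 1;
# a loss-heavy history pushes it above 1.
print(round(internal_loss_multiplier(0.5e9, 1.0e9), 3))   # ~0.83
print(round(internal_loss_multiplier(2.0e9, 1.0e9), 3))   # ~1.24
```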

Internal vs. external loss data. Internal data tells a bank about its own experience. External loss data — what happened at other banks — fills the gaps for low-frequency, high-severity events that a single institution may never have experienced. If your bank has never had a rogue trading loss, you still need to account for the possibility.

ORX. The Operational Riskdata eXchange association is the industry's primary loss data consortium. With 82 global banking members, it maintains the largest database of anonymised operational risk loss events in financial services. Member banks contribute their loss data and receive benchmarking in return. ORX also publishes the Reference Taxonomy (14 Level 1 risk types, expanded from Basel's 7), a Process and Service Library, and an annual loss trends report.

What ORX Loss Trends Tell Us

The ORX 2025 report (covering 2023 data from 82 member banks) revealed several patterns worth understanding.

Overall losses at a decade low. Financial losses across member banks dropped 32% year-on-year. This is encouraging, but one year does not make a trend — operational risk losses are inherently volatile, driven by a small number of large events.

Conduct losses dropped sharply. Conduct-related losses fell to EUR 3.2 billion, down from a five-year average of EUR 8.5 billion — a 63% decline. This likely reflects the tail end of the post-crisis conduct remediation cycle (PPI, FX benchmark, LIBOR) rather than a permanent improvement.

Transaction losses are the costliest category. Execution, delivery, and process management losses reached nearly EUR 8 billion — the most expensive operational risk type across the consortium. This is the daily grind of settlement fails, booking errors, and process breakdowns. It does not make headlines, but it dominates the loss register.

Low-severity fraud at record frequency. ORX recorded 36,811 fraud events in 2023, the highest count in its 22-year history. The individual amounts are small, but the volume is telling. Automated fraud at scale — account takeover, payment redirection, synthetic identity — produces many small losses rather than a few large ones. Traditional detection methods tuned for large anomalies miss the pattern.

Case Studies: What the Failures Teach

Public enforcement actions, independent reviews, and regulatory findings provide the raw material for understanding operational risk failures in CIB. Three cases from recent years illustrate distinct failure modes.

Credit Suisse and Archegos (2021) — $5.5 Billion

Archegos Capital Management, a family office run by Bill Hwang, defaulted on margin calls in March 2021. Credit Suisse had accumulated $24 billion in exposure to Archegos — more than four times its next largest hedge fund client, and over half of Credit Suisse Group's total equity.

The root cause was not a gap in the risk framework. Credit Suisse had limits, monitoring systems, and escalation procedures. The problem was that none of them were enforced. The bank's Counterparty Portfolio Oversight Committee identified the Archegos concentration risk in September 2020 and recommended corrective action. No deadlines were set. No owners were assigned. The issue was not raised at committee level again until March 2021, after the loss had occurred. When limits were breached, the response was to raise the limits rather than reduce the positions. The PRA fined Credit Suisse GBP 87 million — its largest-ever penalty — for breaching four Fundamental Rules. FINMA found serious and systematic violations.

The lesson: a risk framework is only as good as the culture that enforces it. Design effectiveness without operating effectiveness is worthless.

JPMorgan London Whale (2012) — $6.2 Billion

JPMorgan's Chief Investment Office (CIO) was supposed to hedge the bank's overall risk. Instead, it drifted into speculative proprietary trading. Trader Bruno Iksil accumulated outsized positions in credit default swaps. The loss, initially reported at $2 billion, ultimately exceeded $6 billion.

A now-infamous Excel error contributed to the understatement of risk: a VaR model formula divided by the sum of two hazard rates instead of their average, cutting the reported risk roughly in half. But the spreadsheet error was a symptom, not the cause. The CIO lacked an independent risk management function. Risk limit breaches — including VaR — occurred multiple times with no corrective action. The unit operated without a treasurer for five months. Management dismissed early reports as overstated. The result was $920 million in fines and a catalyst for the Volcker Rule.

The lesson: mandate drift without independent oversight is dangerous. Manual model processes (spreadsheets) in high-stakes environments create single points of failure that automated, independently validated systems would catch.

TD Bank AML Failures (2024) — $3.09 Billion

TD Bank became the first major US bank to plead guilty to money laundering, paying $3.09 billion in combined fines from the DOJ, FinCEN, and OCC. The scale of the failure was staggering: the bank's AML transaction monitoring covered only 8% of transaction volume and 26% of transaction value between 2018 and 2024. Monitoring systems had not been updated since 2014 — missing entire payment channels including Zelle and peer-to-peer transfers. The bank operated under a "zero expense growth" philosophy that kept compliance budgets flat while the business expanded. Employees were found to have accepted bribes to help launder narcotics proceeds.

The lesson: compliance investment must scale with business growth. Technology debt in monitoring systems creates blind spots that compound over time. When the AML function lacks authority over the technology and operations teams it depends on, governance is structurally broken.

What Is a Control?

A control is any process, policy, device, practice, or action that modifies risk. In operational risk, controls sit between a risk event and a loss. Their purpose is to prevent the event from occurring, detect it quickly if it does occur, or correct its impact after the fact.

Preventive controls stop the risk event from happening. Segregation of duties between front office and back office prevents a trader from booking and confirming their own trades. Dual-authorization requirements on payments above a threshold prevent a single person from initiating a large transfer. Pre-trade limit checks prevent orders that would breach risk limits from reaching the market.

Detective controls identify risk events that have already occurred. Daily reconciliation between trading systems and the general ledger detects booking errors. Exception reports flag transactions outside expected parameters. Surveillance systems monitor trader communications for potential market abuse.

Corrective controls fix or mitigate the impact after an event is detected. Trade correction procedures. Disaster recovery and business continuity plans. Incident response playbooks.

Controls are also classified by execution method. Manual controls depend on a person performing a step — reviewing a report, comparing two data sources, approving a transaction. Automated controls are embedded in systems — hard limits that reject transactions, automated reconciliation matching, system-enforced segregation of duties through access controls.

Two dimensions of effectiveness matter. Design effectiveness asks whether the control, if operating as intended, would actually mitigate the risk. A four-eyes check on trade booking is well-designed for catching data entry errors. Operating effectiveness asks whether the control is actually being performed as designed. If the second pair of eyes rubber-stamps every booking without review, the control exists on paper but provides no mitigation.

Every control needs an owner — a named individual accountable for ensuring the control operates as designed. Without ownership, controls degrade. The person who designed the control moves to another role. The team gets busy and skips the daily check "just this once." The system generates so many false positives that analysts start dismissing alerts without investigation. Control degradation is gradual, invisible, and dangerous.

Control Testing

Control testing answers a specific question: is this control actually working? It sits between RCSA (which asks the business to self-assess its controls) and internal audit (which provides independent assurance). Understanding the differences matters.

| Aspect | RCSA | Control Testing | Internal Audit |
| --- | --- | --- | --- |
| Who | 1st line (business), facilitated by 2nd line | 1st or 2nd line | 3rd line |
| Nature | Self-assessment | Evidence-based testing | Independent assurance |
| Independence | Low — self-reported | Moderate | High |
| Scope | Broad risk and control identification | Specific control effectiveness | Comprehensive risk-based review |
| Frequency | Annual or semi-annual | Ongoing or periodic | Audit plan cycle |
| Output | Risk ratings, action plans | Pass/fail on specific controls | Audit findings, recommendations |

RCSA asks: "Do we believe our controls are effective?" Control testing asks: "Can we prove it?" Internal audit asks: "Is the whole framework — including RCSA and control testing — reliable?"

Testing Methods

Sampling. Select a representative set of transactions and verify that the control was performed. For a four-eyes check on trade amendments, pull 30 amendments from the past quarter and confirm each has documented evidence of a second review. The sample size depends on the control frequency and risk significance.

Transaction walkthroughs. Follow a single transaction end-to-end through the process, verifying each control point along the way. A trade from execution through booking, confirmation, settlement, and reporting. Walkthroughs reveal gaps between how a process is documented and how it actually runs.

Scenario testing. Simulate a risk event and observe whether the control responds as designed. Attempt to book a trade that breaches a limit. Send a test phishing email to measure detection rates. Introduce an intentional reconciliation break and check whether it gets flagged and escalated within the expected timeframe.

The critical distinction is between evidence-based testing and self-assessment. Self-assessment asks the control owner whether the control works. Evidence-based testing looks at the evidence — transaction logs, system screenshots, signed-off checklists, audit trails. Self-assessment is fast and cheap. Evidence-based testing is slower and more expensive. Only one of them gives you reliable data.

Control Testing in CIB

CIB operations provide clear examples of how controls work — and fail — in practice.

The four-eyes check on trade booking. The most common preventive control in CIB operations. After a trade is booked, a second person reviews the key fields — counterparty, notional, currency, settlement date, economic terms — before the trade is confirmed. Testing this control means sampling booked trades and verifying that the review was documented, performed by someone with appropriate authority, and completed before confirmation was sent. A control that is performed but not documented cannot be evidenced during testing or audit.

Automated limit monitoring. Pre-trade and post-trade systems check positions against approved limits — credit limits, market risk limits, position limits. The control is automated, but that does not mean it does not need testing. Testing verifies that limits are correctly configured, that breaches trigger the expected alerts, that alerts reach the right people, and that the system handles edge cases (multi-leg trades, cross-entity positions) correctly. Systems that have not been recalibrated after business changes may enforce stale limits.

Reconciliation as a detective control. Daily reconciliation between front-office systems, risk systems, and the general ledger catches discrepancies — missing trades, incorrect valuations, failed settlements. Testing examines whether reconciliation is performed daily as required, how quickly breaks are investigated, and whether aged breaks (unresolved items older than a defined threshold) are escalated. A reconciliation process that runs daily but carries 200 aged unresolved breaks is not an effective control.

Segregation of duties. Front office books trades; back office confirms and settles them. This separation prevents a single individual from both initiating and approving a transaction — a fundamental fraud prevention control. Testing involves reviewing system access permissions to confirm that no individual has conflicting roles, and that access reviews are performed periodically. In practice, emergency access grants ("firefighter" access) sometimes create temporary SoD violations that are never revoked.
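
A conflicting-entitlements check is one of the simplest automated tests here. A rough sketch, with invented user IDs and an illustrative 30-day rule for unrevoked emergency access:

```python
# Illustrative segregation-of-duties check: flag users who can both book and
# confirm trades, and emergency ("firefighter") grants that were never revoked.
from datetime import date

booking_users = {"asmith", "jdoe", "mkhan", "firefighter01"}
confirmation_users = {"lchen", "jdoe", "pgarcia", "firefighter01"}

for user in sorted(booking_users & confirmation_users):
    print(f"SoD conflict: {user} holds both booking and confirmation entitlements")

emergency_grants = {"firefighter01": date(2025, 1, 3)}   # grant date
today = date(2025, 3, 1)
for user, granted in emergency_grants.items():
    if (today - granted).days > 30:
        print(f"Stale emergency access: {user} granted {granted}, never revoked")
```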

Automation Opportunity: Control testing generates large volumes of evidence — screenshots, transaction samples, sign-off records, system logs. Most banks manage this evidence in shared drives and spreadsheets, making it difficult to trace a specific control test back to its evidence or to identify which controls have not been tested on schedule. A control testing platform that links each control to its test plan, schedules tests automatically, stores evidence in a structured repository, and reports on testing coverage and results gives the 2nd line real-time visibility into control health instead of a quarterly static report.

Common Pitfalls

Controls that exist on paper but are not performed. The procedure manual says the reconciliation is reviewed and signed off daily. In reality, the sign-off happens weekly when the team lead batch-approves a week of reconciliations without reviewing them. The control appears effective in RCSA because the owner rates it as such. It appears effective in reporting because the sign-off exists. It is only exposed when an auditor or tester examines the timestamps and finds that all five daily sign-offs were completed at 4:58 PM on Friday.

Over-reliance on detective controls. Detective controls find problems after they occur. Preventive controls stop them from occurring. A bank that relies heavily on end-of-day reconciliation to catch booking errors (detective) instead of investing in system validations that reject bad data at entry (preventive) will always be playing catch-up. Detective controls are necessary, but they should not be the primary defence where prevention is feasible.

Assuming automated controls always work. Systems are configured by people and maintained by people. A limit check that was correctly configured three years ago may not account for new products, new counterparties, or changes in the risk methodology. Banks that test manual controls rigorously but exempt automated controls from testing on the assumption that "the system handles it" discover the gap when the system does not handle it. The London Whale's VaR model ran in a spreadsheet with a formula error. It was automated. It was wrong.

Control fatigue. When everything is a control, nothing is. Banks that respond to every incident by adding a new control end up with hundreds of controls per process, many of which overlap or contradict each other. The operations team spends more time evidencing controls than performing the underlying work. Effectiveness drops because reviewers cannot meaningfully assess every item. The best control environments are lean — fewer controls, each well-designed, properly tested, and genuinely mitigating risk.

Not closing the loop. An incident reveals a control gap. A remediation action is agreed. The action is marked complete when the new control is designed. But nobody tests whether the new control actually works in production. Six months later, the same incident type recurs, and the bank discovers the remediation was a paper exercise. Effective incident management does not end with remediation — it ends with verification that the remediation is effective.

Sources

  • Basel Committee on Banking Supervision — BCBS 195 / d515: Principles for the Sound Management of Operational Risk (2011, revised 2021)
  • Basel Committee — Standardised Measurement Approach for Operational Risk (d355, 2017)
  • ORX — Annual Banking Operational Risk Loss Data Report 2025
  • EBA — RTS on Operational Risk Loss Event Taxonomy (2024)
  • DORA — EU Regulation 2022/2554, effective January 2025
  • Credit Suisse Independent Review (Archegos): Paul Weiss report, PRA enforcement notice, FINMA proceedings (2023)
  • JPMorgan London Whale: US Senate Permanent Subcommittee on Investigations report (2013)
  • TD Bank: DOJ, FinCEN, and OCC enforcement actions (2024); ORX News Deep Dive
  • BIS — Four Lines of Defence Model for Financial Institutions (FSI Occasional Paper, 2015)
  • IIA — Three Lines Model (July 2020)
  • Deloitte — Ten Steps to RCSA Redemption (2025)

A loss event tells you what already happened. A KPI tells you how the business performed. A Key Risk Indicator tells you what might be about to go wrong.

KRIs are metrics that provide early warning of changing operational risk exposure. They sit between the rear-view mirror of loss data and the forward-looking judgement of scenario analysis. Done well, they translate board-level risk appetite statements into numbers that operations teams can actually monitor day to day. Done poorly — and this is common — they become another set of charts in a committee pack that nobody acts on.

The distinction from KPIs matters. A KPI measures business performance: revenue per trader, cost-to-income ratio, client onboarding time. A KRI measures risk exposure: settlement fail rates, system downtime incidents, vacancy rates in key control functions. Some metrics serve both purposes (trade break volumes affect both operational efficiency and risk exposure), but the intent is different. KPIs ask "how are we doing?" KRIs ask "how exposed are we?"

Leading vs. Lagging Indicators

Lagging KRIs reflect exposure that has already materialised. Actual loss amounts, incidents reported last month, regulatory fines received — these confirm what went wrong but offer no advance warning.

Leading KRIs are forward-looking. Rising system downtime, increasing staff turnover in the middle office, growing volumes of unconfirmed trades — these signal heightened risk before a loss materialises. The value is in the lead time: if settlement fail rates are trending up over three weeks, you can investigate and intervene before a major fail costs real money or triggers a regulatory penalty.

Most banks track both, but the balance matters. A KRI programme dominated by lagging indicators is just a loss database with extra steps. The harder — and more valuable — work is identifying leading indicators that genuinely predict changing risk profiles.

Typical CIB KRIs by Domain

| Domain | KRI | What It Signals |
| --- | --- | --- |
| Trading & Markets | Settlement fail rate | Post-trade processing breakdown, counterparty or funding issues |
| Trading & Markets | Trade break volumes | Booking errors, system mismatches, reconciliation failures |
| Trading & Markets | Confirmation match rate | Unconfirmed trades carry counterparty risk exposure |
| Trading & Markets | VaR limit breaches | Position risk exceeding approved thresholds |
| Trading & Markets | P&L attribution breaks | Model risk — unexplained P&L means something is mispriced or misbooked |
| Technology & Cyber | System downtime (hours/incidents) | Infrastructure instability, capacity issues |
| Technology & Cyber | Attempted cybersecurity breaches | Threat landscape intensity |
| Technology & Cyber | Patch compliance rate | Vulnerability window — unpatched systems are open doors |
| Technology & Cyber | Disaster recovery test results (pass/fail) | Whether you can actually recover when it matters |
| People | Turnover in key roles | Loss of institutional knowledge, control gaps during transitions |
| People | Vacancy rates in critical functions | Under-staffing in risk, compliance, or operations |
| People | Training completion rate | Regulatory and control awareness gaps |
| Process & Compliance | Policy exceptions granted | Workarounds accumulating — each one is a control bypass |
| Process & Compliance | Open regulatory findings | Unresolved regulator concerns, potential for enforcement |
| Process & Compliance | Open audit issues (and ageing) | Known control weaknesses not being fixed |
| Fraud | SARs filed | Suspicious activity detection volume |
| Fraud | Internal fraud incidents | Insider threat exposure |

Settlement fail rates deserve specific attention in CIB. Under the EU's CSDR penalty regime, fails on liquid equities cost 1 basis point per day. ESMA has recommended increasing penalties by 50-100%. At scale, these add up fast — and settlement fails tend to cluster during periods of market stress, exactly when you can least afford cascading problems.

Threshold Design

KRIs without thresholds are just data. The standard approach is RAG — Red, Amber, Green — and while it sounds simple, getting the calibration right is one of the hardest parts of the programme.

Green: metric is within acceptable limits. Business as usual. No escalation required.

Amber: metric has moved into a warning zone. Increased monitoring, root cause investigation begins, business line risk manager is notified. This is the early warning — the whole point of having KRIs.

Red: metric has breached risk tolerance. Immediate escalation to senior management and the relevant risk committee. Root cause analysis is mandatory. Remediation plan with named owners and deadlines. Depending on the KRI, this may also trigger regulatory notification.

How Thresholds Get Set

Three inputs drive threshold calibration:

Risk appetite. The board approves a risk appetite statement — typically qualitative ("we accept low to moderate operational risk") with quantitative boundaries. KRI thresholds are the operational translation. If the board says "we will not accept settlement fail rates that could materially impact client relationships or regulatory standing," the OpRisk function must convert that into specific numbers.

Historical data. What has the metric looked like over the past 12-24 months? Where does "normal" end and "concerning" begin? A settlement fail rate of 5% might be green for one desk and red for another, depending on their baseline.

Industry benchmarks. ORX data, peer bank disclosures, and regulatory expectations provide context. If your patch compliance rate is 85% and the industry norm is 95%, your green threshold needs adjustment.

Threshold Types

  • Cap: upper boundary. Settlement fail rate must not exceed X%.
  • Floor: lower boundary. Training completion rate must not fall below Y%.
  • Collar: range. Acceptable when between X and Y.

The Calibration Problem

Thresholds set too tight produce alert fatigue. If a KRI goes amber every month due to normal business variation, teams stop paying attention. The amber status becomes background noise, and real signals get lost.

Thresholds set too loose miss genuine deterioration. A settlement fail rate creeping from 4% to 7% over six months might never breach a 10% amber threshold, but that trend is exactly the kind of signal the programme should catch.

Recalibration should happen at least annually, and after any significant change in business volumes, systems, or operating model. In practice, many banks set thresholds once and forget them. Thresholds calibrated in 2019 for pre-pandemic transaction volumes may be meaningless in 2025.

Automation Opportunity — Dynamic Threshold Calibration

Static thresholds are the norm, but they don't have to be. A system that ingests KRI history and automatically adjusts amber/red bands based on rolling statistical measures (moving averages, standard deviations, seasonal patterns) would reduce both alert fatigue and missed signals. The logic doesn't need to be complex — even a basic statistical process control approach (flag when a metric exceeds two standard deviations from its 90-day rolling mean) outperforms a fixed number set two years ago. The harder part is governance: who approves the auto-adjusted thresholds, and how do you maintain the link to risk appetite?
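
A rough sketch of that statistical process control idea, assuming daily KRI observations in a pandas Series; the window and sigma values are the illustrative ones from the paragraph above, not a recommendation:

```python
# Flag KRI observations that exceed two standard deviations above their
# 90-day rolling mean (a simple dynamic upper band).
import pandas as pd

def spc_flags(kri: pd.Series, window: int = 90, n_sigma: float = 2.0) -> pd.Series:
    rolling_mean = kri.rolling(window, min_periods=30).mean()
    rolling_std = kri.rolling(window, min_periods=30).std()
    upper_band = rolling_mean + n_sigma * rolling_std
    return kri > upper_band          # True where the metric breaches the dynamic band

# Example usage: flags = spc_flags(settlement_fail_rate)
```

The statistics are the easy part; the governance question in the paragraph above (who owns the auto-adjusted bands) is what determines whether this survives audit scrutiny.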

From KRI Breach to Action

A KRI programme is only as good as the action it drives. Here is the typical escalation path:

Amber breach. The business line risk manager reviews the data. Is this a data quality issue or a genuine shift? If genuine, what is driving it? Increased monitoring frequency kicks in — a KRI normally reviewed monthly might move to weekly. Findings are documented and reported to the business line risk committee.

Red breach. Escalation to the executive risk committee. Root cause investigation is mandatory, typically with a named owner and a deadline. Remediation actions are tracked until closed. If the KRI relates to a regulatory threshold (for example, incident reporting timelines), the breach may trigger a regulatory notification.

The risk appetite link. This is where KRIs connect to governance. The board approves a risk appetite statement. The executive committee translates that into risk limits and tolerances. KRIs operationalise those limits at the business line level. When a KRI breaches red, it means the bank's actual risk exposure may have exceeded what the board said it would accept. That is why red breaches go to senior management — they need to know the bank is operating outside its stated appetite, and they need to decide whether to accept, mitigate, or escalate further.

The problem in practice: many banks have this escalation path on paper but don't follow it consistently. Amber breaches get noted but not investigated. Red breaches trigger a flurry of activity, then the root cause analysis stalls as people return to business as usual. The follow-through — tracking actions to closure, verifying the KRI returns to green, confirming the fix actually worked — is where most programmes fall short.

The Committee Structure

Operational risk governance in CIB follows a three-tier committee hierarchy. The names vary by bank, but the structure is consistent across the industry because regulators expect it.

| Tier | Committee | Composition | Frequency | Key Responsibilities |
| --- | --- | --- | --- | --- |
| 1 — Board | Board Risk Committee | Independent (non-executive) directors, minimum 3 members with risk expertise | Quarterly | Approves risk appetite, oversees risk framework, reviews material incidents |
| 2 — Executive | Firm-Wide OpRisk Committee (or Executive Risk Committee) | CRO, business line heads, COO, Head of OpRisk, CISO, Head of Compliance | Monthly | Reviews aggregate risk profile, KRI dashboards, material loss events, cross-business themes |
| 3 — Business Line | CIB OpRisk Committee, Markets OpRisk Committee, etc. | Business line head, desk heads, 1st-line risk managers, 2nd-line OpRisk | Monthly or more frequently | Reviews business-specific KRIs, RCSA progress, incidents, escalates issues to Tier 2 |

How Information Flows

Information flows up; authority flows down. Business line committees identify issues and escalate anything that exceeds their authority or crosses business boundaries. The executive committee makes decisions on firm-wide responses and resource allocation. The Board Risk Committee sets the appetite and holds senior management accountable.

In the US, 12 CFR 252.22 mandates a board risk committee for bank holding companies with $50 billion or more in total consolidated assets. The Basel Committee requires the board to approve and periodically review the operational risk framework. The CRO typically reports to both the CEO and the Board Risk Committee — dual reporting lines are a governance safeguard.

A trend worth noting: some banks have split technology risk governance from broader operational risk because combined committees were struggling to give adequate attention to both. Cyber risk, IT resilience, and third-party technology risk now have their own committee structures at several large banks, with a dotted line back into the main OpRisk governance.

What Gets Reported

A typical committee pack at the executive level covers:

  • RCSA results: residual risk ratings, newly identified risks, control effectiveness changes
  • Loss event trends: losses by category, business line, and severity; trends vs. prior periods
  • KRI dashboard: RAG status across all monitored KRIs, breaches since last meeting, trend lines
  • Risk appetite utilization: how actual exposure compares to approved appetite — are we within bounds?
  • Action item status: open remediation actions from prior incidents, audit findings, regulatory issues — how many are overdue?
  • Emerging risks: new or changing threats identified through horizon scanning (cyber threats, regulatory changes, geopolitical developments)
  • Scenario analysis updates: results from recent stress scenarios, capital implications

Board-level reporting condenses this further. The board does not need 50 KRI charts. It needs the five that matter most, the three loss events that exceeded materiality thresholds, and a clear view of whether the bank is operating within its stated risk appetite.

Reporting Frequency

| Audience | Frequency | What They Get |
| --- | --- | --- |
| Board Risk Committee | Quarterly | Aggregate risk profile, material incidents, risk appetite utilization, emerging risks |
| Executive Risk Committee | Monthly | Full KRI dashboard, loss event analysis, RCSA status, action tracking |
| Business Line Committees | Monthly (some weekly for high-risk areas) | Desk-level KRIs, local incidents, control testing results |
| Incident Reporting | Real-time / within 24 hours for material events | Individual event details, immediate impact assessment, initial response actions |

Material event reporting is where speed matters most. Under DORA, significant ICT incidents must be reported to regulators within defined timeframes. The UK PRA is also tightening incident reporting requirements. A bank that discovers a material operational risk event on a Friday afternoon cannot wait until the Monday committee meeting to escalate it.

The Reporting Challenge in CIB

Here is the reality that anyone building reporting tools for CIB operational risk needs to understand: the data comes from dozens of systems, spans multiple legal entities, and uses different risk taxonomies across desks and geographies.

Settlement fail data sits in one system. Trade break data sits in another. HR turnover data comes from a third. Cyber incident data from a fourth. Each system has its own data model, its own update frequency, and its own definition of what counts as an "incident." Pulling this together into a coherent committee pack is a manual exercise at many banks — analysts spend days each month extracting data, normalising it, building charts, writing commentary, and getting sign-off.

McKinsey's Global Risk Productivity Survey found that operational risk headcount increased 11% annually between 2020 and 2023, even as credit risk FTEs declined 7% per year. A significant portion of that growth is people doing data aggregation and report production, not risk analysis.

The multi-entity problem is particularly acute. A single CIB division might operate through ten legal entities across five jurisdictions. Each entity has its own regulatory reporting requirements. Aggregating risk data across entities while maintaining the entity-level detail that regulators require is a data engineering challenge that most banks have not fully solved.

Automation Opportunity — Committee Pack Generation

The monthly cycle of pulling KRI data from source systems, building dashboards, writing narrative commentary, and assembling the committee pack is a prime automation target. The technical components — automated data extraction, threshold-based RAG colouring, trend chart generation — are straightforward. The harder part is narrative: committees need context, not just data. GenAI can draft first-pass commentary on KRI movements ("Settlement fail rate increased from 4.2% to 5.8%, driven by a 30% volume increase following the index rebalance on March 15"). A human risk manager reviews, edits, and approves. McKinsey estimates 10-20% FTE productivity gains from this kind of risk automation. For a team that spends five days a month on report production, that is real capacity returned to actual risk management.
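
The templated part does not even need a model. A minimal sketch of first-draft commentary generation, with invented field names and the settlement-fail example from above:

```python
# Templated first-draft commentary for a committee pack. In practice the inputs
# come from the KRI datastore and the draft is reviewed and approved by a risk manager.
def draft_kri_commentary(name: str, prior: float, current: float,
                         threshold_amber: float, driver: str) -> str:
    direction = "increased" if current > prior else "decreased"
    status = "breaching the amber threshold" if current > threshold_amber else "within appetite"
    return (f"{name} {direction} from {prior:.1f}% to {current:.1f}%, {status}. "
            f"The movement is primarily driven by {driver}.")

print(draft_kri_commentary("Settlement fail rate", 4.2, 5.8, 5.0,
                           "a 30% volume increase following the index rebalance"))
```

A GenAI layer can then enrich the template output with context from incident records and prior commentary, but the human sign-off step stays.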

ML for KRI Monitoring

Machine learning applied to KRI data is early-stage in most banks but shows clear potential in three areas.

Anomaly detection. Traditional threshold monitoring catches breaches against fixed limits. ML models can detect unusual patterns — a KRI that hasn't breached its threshold but is behaving differently than its historical norm. A settlement fail rate sitting at 5% (green) but with a volatility pattern that has never been seen before might indicate a system issue that static thresholds would miss.

Cross-KRI correlation. KRIs are typically monitored individually, but operational risk events rarely have a single cause. ML can identify correlations across KRIs that precede loss events — for example, rising trade break volumes combined with increasing staff turnover in the middle office and a recent system change might be a pattern that historically preceded significant operational losses. Humans may not spot these multi-dimensional patterns in monthly committee packs.

Pattern recognition over time. ML models can learn seasonal and cyclical patterns in KRI data and distinguish genuine deterioration from predictable variation. Quarter-end and year-end spikes in trade breaks are normal; the same spike in the second week of February is not. Models that learn these patterns reduce false alerts and improve signal quality.
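
A minimal sketch of the anomaly-detection idea using scikit-learn's IsolationForest; the features (level, day-over-day change, rolling volatility) are illustrative choices, and any real deployment would need the explainability work discussed below:

```python
# Flag unusual KRI behaviour beyond fixed thresholds using an isolation forest.
import pandas as pd
from sklearn.ensemble import IsolationForest

def kri_anomalies(kri: pd.Series, contamination: float = 0.02) -> pd.Series:
    features = pd.DataFrame({
        "level": kri,
        "change": kri.diff(),
        "volatility": kri.rolling(20).std(),
    }).dropna()
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(features.values)      # -1 marks anomalies
    return pd.Series(labels == -1, index=features.index)

# Example usage: anomalous_days = kri_anomalies(settlement_fail_rate)
```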

The barriers are practical, not theoretical. KRI data is often incomplete, inconsistently defined across business lines, and stored in different systems. Building the integrated dataset that ML needs is the first challenge. The second is explainability — regulators expect banks to explain why a risk indicator was flagged, and black-box models that say "the algorithm detected an anomaly" are not sufficient.

Common Pitfalls

KRIs that nobody acts on. The most common failure. The programme tracks 200 KRIs, produces monthly dashboards, but when a KRI breaches amber, nothing happens. No investigation, no root cause analysis, no remediation. The dashboard becomes wallpaper. This usually reflects a governance problem — no clear ownership of the escalation path, or a culture where breaches are treated as normal.

Committee packs that are 100 pages of data and zero pages of insight. A committee that receives a 100-slide pack 24 hours before the meeting will not read it. Effective packs lead with the five things that matter — material KRI breaches, significant incidents, overdue actions — and put the supporting detail in appendices. The committee's job is to make decisions and direct action, not to review data. If the pack doesn't clearly state what decisions are needed, it has failed its purpose.

Thresholds that haven't been recalibrated in years. Business volumes change. Systems change. Risk profiles change. Thresholds set during a low-volatility period become meaningless when markets shift. Annual recalibration should be a formal step in the KRI programme, not something that happens only when someone notices the dashboard has been green for 18 straight months.

Too many KRIs, not enough key ones. There is a natural tendency to add KRIs over time and never retire any. A programme tracking 300 metrics across the bank has diluted the concept of "key." If everything is a KRI, nothing is. Effective programmes maintain a core set of 15-25 KRIs at the firm level, with more granular metrics available for business-line deep dives.

Lagging indicators masquerading as leading ones. Counting last month's incidents and calling it a KRI does not provide early warning — it provides a history lesson. The discipline of identifying genuinely predictive metrics is harder but essential.

Sources

  • IOR — Key Risk Indicators Sound Practice Guidance
  • Basel Committee on Banking Supervision — BCBS 195: Principles for the Sound Management of Operational Risk (2011, revised 2021)
  • ORX — Annual Banking Operational Risk Loss Data Report 2025
  • McKinsey — Global Risk Productivity Survey (February 2026)
  • McKinsey — The Future of Operational Risk Management in Financial Services
  • Deloitte/ORX — CTRL + Operational Risk: Perspectives on AI in Banking (December 2025)
  • Deutsche Bank Research — Breaking the Settlement Failure Chain
  • JPMorgan Chase — Risk Committee Charter; Morgan Stanley — Risk Committee Charter
  • OCC — Corporate and Risk Governance Handbook
  • 12 CFR 252.22 — US Board Risk Committee Requirements
  • ESMA — CSDR Penalty Regime Review

Here is a number that should bother anyone working in risk technology: between 2020 and 2023, operational risk headcount at banks grew 11% per year while credit risk headcount declined 7% per year. Credit risk teams shrank because they automated. Operational risk teams grew because they didn't. IT risk, cyber risk, third-party risk, and data risk teams expanded 40–130% over the same three years. The function is scaling with people, not with technology.

Most operational risk teams still run on spreadsheets, email workflows, and periodic manual assessments. RCSA campaigns are coordinated over email. KRI dashboards are compiled monthly in Excel. Incident reports arrive as form emails that get re-keyed into a tracking system. Committee packs are assembled by hand from data pulled across half a dozen sources. A 2024 RMA/PwC survey found that only 20% of institutions use modern tools like AI for RCSA — the process that consumes more first-line and second-line resource time than any other OpRisk process.

This is changing. But understanding where the function is today — and where the real leverage points are — matters more than chasing the latest GenAI headline.

The Analytics Maturity Curve

Data analytics maturity follows a well-established progression. Four stages, each building on the previous one.

Stage 1 — Descriptive: What happened? Backward-looking reporting on losses, incidents, and KRI trends. Monthly risk reports. Historical loss analysis. This is where most operational risk functions live today. The data exists, but it sits in spreadsheets and disconnected systems. Assembling it into a coherent picture requires manual effort.

Stage 2 — Diagnostic: Why did it happen? Root cause analysis. Correlation between incident types and business changes. Understanding which processes generate the most losses and why. Some banks do this well for major incidents (post-mortems, lessons learned) but rarely apply it systematically across the full loss database.

Stage 3 — Predictive: What will happen? Statistical models that forecast risk based on leading indicators. ML anomaly detection on KRI data. Pattern recognition across incident narratives. Very few operational risk functions have reached this stage — though credit risk teams have been doing it for years.

Stage 4 — Prescriptive: What should we do? Automated recommendations for control adjustments, resource allocation, and risk mitigation based on predicted outcomes. This stage requires the previous three to be working. Almost nobody in OpRisk is here yet.

The gap between where OpRisk sits (stages 1–2) and where credit risk sits (stages 3–4) is not about technology availability. The same ML and NLP tools that power credit decisioning are commercially available. The gap is about data, culture, and investment priority.

Why OpRisk Data Is Hard

Data scientists entering operational risk for the first time tend to expect the kind of datasets they're used to — high-volume, structured, frequently updated. OpRisk data is none of those things.

Loss data is sparse. Operational risk follows a fat-tailed distribution: many small events, very few catastrophic ones. A bank might record tens of thousands of low-severity incidents per year but only a handful of losses above EUR 10 million. Training predictive models on a dataset where the events you most care about almost never occur is a fundamental statistical challenge.
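
A toy frequency-severity simulation makes the point: with many small events (Poisson frequency) and a heavy-tailed severity distribution (lognormal), the years that matter for capital sit far from the average year. The parameters below are invented for illustration, not calibrated to any institution.

```python
# Toy frequency-severity simulation showing why annual totals are dominated by rare tail events.
import numpy as np

rng = np.random.default_rng(42)
years = 10_000
annual_totals = []
for _ in range(years):
    n_events = rng.poisson(200)                               # many small events per year
    severities = rng.lognormal(mean=9.0, sigma=2.5, size=n_events)
    annual_totals.append(severities.sum())

annual_totals = np.array(annual_totals)
print(f"mean annual loss:  {annual_totals.mean():,.0f}")
print(f"99.9th percentile: {np.quantile(annual_totals, 0.999):,.0f}")
```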

Assessment data is subjective. RCSA ratings — the backbone of the risk profile — are self-reported by the first line. A business unit rates its own controls as "effective" or "needs improvement." These ratings reflect judgement, organisational culture, and sometimes politics, not objective measurement. Two teams with identical processes can produce different RCSA ratings depending on who fills in the form.

Data is fragmented. Loss events live in one system. RCSA results in another. KRIs in a spreadsheet. Audit findings in an audit tool. Control evidence in shared drives and email. Regulatory findings in yet another tracker. Getting a unified view of operational risk across these silos requires manual data integration — which is exactly what the monthly committee pack assembly process does, painfully and slowly.

Categorisation is inconsistent. Even with Basel's seven event types and ORX's 14-category reference taxonomy, banks apply these classifications differently. The same event might be categorised as "Execution, Delivery & Process Management" in one business unit and "Business Disruption & System Failures" in another. Inconsistent taxonomy undermines any analysis that depends on comparing risk profiles across the organisation.

None of this means analytics can't work in OpRisk. It means the foundational data work has to happen before the models can deliver value.

AI Use Cases by OpRisk Process

The practical applications of AI in operational risk map directly to the core processes. Some are already in production at large banks. Others are emerging from pilot.

RCSA

GenAI can evaluate control descriptions against established standards (the "five Ws" — who, what, when, where, why) without requiring large training datasets. An LLM reads a control description and flags whether it specifies the control owner, the frequency of execution, and the evidence of operation. This addresses a persistent quality problem: control descriptions that are too vague to be testable.

Beyond description quality, AI can auto-suggest risk ratings by comparing a unit's risk profile against loss history, audit findings, and KRI trends. The human still makes the final call, but the AI provides a data-informed starting point instead of a blank form.

The longer-term trajectory: continuous RCSA. Instead of an annual campaign, the risk profile updates as new data arrives — an incident, a KRI breach, a regulatory finding. The assessment stays current without requiring a formal reassessment cycle.

Incidents

NLP is effective at auto-categorising incident reports. An incident narrative describes what happened in free text; an NLP model assigns the Basel event type, the affected business line, and a severity estimate. This eliminates the back-and-forth between the reporter and the risk team over classification.
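
A minimal sketch of what such a classifier could look like, using TF-IDF features and a linear model; the training narratives and labels below are invented, and a real model would train on the bank's historical, human-classified incident register:

```python
# Assign a Basel event type to a free-text incident narrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

narratives = [
    "trade booked to wrong counterparty, corrected before settlement",
    "settlement instruction rejected, stale SSI on file",
    "trading platform outage during market open",
    "phishing email credential compromise on payments team",
]
labels = [
    "Execution, Delivery & Process Management",
    "Execution, Delivery & Process Management",
    "Business Disruption & System Failures",
    "External Fraud",
]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(narratives, labels)

print(model.predict(["overnight batch failure delayed confirmations"])[0])
```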

More valuable: pattern detection across incident narratives. NLP can surface clusters of similar events that individually look minor but collectively signal a systemic issue. A string of small booking errors in the same product type across different desks might indicate a system defect or a training gap that wouldn't be visible from the individual reports.

KRIs

ML anomaly detection flags unusual patterns in KRI time series that rule-based thresholds miss. A KRI that drifts slowly upward over six months might never breach its red threshold, but the trend itself is a warning sign. Correlation analysis across KRIs can identify leading indicators — for example, rising staff turnover in a specific operations team predicting an increase in processing errors three months later.

Predictive breach forecasting uses historical KRI data to estimate the probability of a threshold breach in the next period, giving the first line time to intervene before the breach occurs rather than after.

Controls

Automated evidence collection pulls control evidence from source systems — approval logs, reconciliation outputs, access certifications — without requiring manual screenshots or email attachments. Continuous control monitoring runs automated tests against live data (for example, verifying that four-eyes approval is present on every trade above a threshold) and flags exceptions in real time rather than waiting for periodic testing.
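
The four-eyes example translates into a short data check. A sketch, with assumed column names for the trade extract and an invented notional threshold:

```python
# Continuous four-eyes check: every trade above the threshold must carry a second
# approver who is not the person who booked it. Exceptions are flagged as they occur.
import pandas as pd

FOUR_EYES_THRESHOLD = 5_000_000  # illustrative notional threshold

def four_eyes_exceptions(trades: pd.DataFrame) -> pd.DataFrame:
    in_scope = trades[trades["notional"] >= FOUR_EYES_THRESHOLD]
    missing_approval = in_scope["approver_id"].isna()
    self_approved = in_scope["approver_id"] == in_scope["booked_by"]
    return in_scope[missing_approval | self_approved]

# Example usage: exceptions = four_eyes_exceptions(todays_trades)
# Each exception is routed to the control owner rather than waiting for a quarterly sample.
```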

Reporting

This is where GenAI has the most immediate, measurable impact. Moody's reported that its GenAI research assistant reduces the time to write comprehensive risk summaries by more than 60%. Applied to OpRisk, GenAI can generate first-draft committee narratives from structured data — pulling in loss trends, KRI breaches, open action items, and emerging risk assessments and assembling them into a readable report. The risk manager reviews and edits rather than writing from scratch.

A Celent survey found 59% of financial services firms implementing or testing GenAI for risk and compliance functions. The reporting use case has the lowest risk (a human reviews the output before it reaches the committee) and the highest time savings.

Automation Opportunity — The Quick Win: The monthly OpRisk committee pack at a typical CIB division takes 3–5 days to assemble. An analyst pulls KRI data from one system, loss data from another, open action items from a third, and writes the narrative connecting them. A dashboard that integrates these data sources and auto-generates a draft narrative can compress this to hours. This isn't AI — it's data integration and templated reporting. But it frees the team to spend time on analysis instead of assembly. Banks that have made this shift report 10–20% FTE productivity gains from risk automation alone.

GRC Platforms

The shift from spreadsheets to integrated Governance, Risk, and Compliance (GRC) platforms is underway but far from complete. These platforms centralise RCSA, incident management, KRI tracking, control testing, and regulatory reporting into a single system.

| Platform | Strengths | Best For | AI Capabilities |
| --- | --- | --- | --- |
| MetricStream | Connected GRC data model linking enterprise risk, OpRisk, compliance, audit, third-party risk. Leader in IDC MarketScape 2025. | Banking, regulated industries | AI-powered analytics, low-code configuration |
| IBM OpenPages | Modular architecture — deploy operational risk, IT governance individually or as a suite. Watson AI for predictive analytics. | Large banks, complex deployments | Watson AI, predictive analytics, automated reporting |
| ServiceNow GRC | Integrates with existing IT operations management. Strong workflow automation. | IT-centric organisations | AI-driven workflows, real-time insights |
| Archer IRM | Highly customisable. Strong in controlled compliance processes. Modular design. | Complex enterprises, government | Customisable dashboards and reporting |

MetricStream and IBM OpenPages dominate banking-specific deployments because of their deep regulatory compliance features — Basel categorisation, loss event management, and multi-jurisdiction reporting are built in rather than bolted on. ServiceNow's strength is integration: if the bank already runs IT operations on the Now Platform, extending to GRC avoids building new integrations.

The market is consolidating around AI. All major vendors now embed ML and NLP capabilities. The differentiator is increasingly the data model — whether the platform connects risk, compliance, audit, and third-party data into a unified view or treats them as separate modules that happen to share a login screen.

A word of caution: buying a GRC platform does not solve an operational risk problem. It solves a data and workflow problem. If the underlying risk culture treats RCSA as a box-ticking exercise and incidents go unreported, the same problems will exist on the new platform — just with better formatting.

Emerging Risks

The operational risk landscape is shifting. Four categories dominate the forward-looking view.

Cyber risk remains the number one concern. ORX has ranked cybercrime as the top emerging risk for three consecutive years. Ransomware incidents targeting financial workloads hit 78% in 2024. Cloud environment intrusions rose 75% between 2022 and 2023. The IMF estimates that extreme cyber losses have quadrupled since 2017, reaching roughly $2.5 billion in 2023. The ICBC ransomware attack in November 2023 demonstrated how a single cyber event at one institution can disrupt Treasury market settlement worth $9 billion.

AI model risk is the new entrant — and the most complex. AI creates operational risks that didn't exist five years ago: hallucination (models producing confident but wrong outputs), model poisoning (corrupted training data producing biased results), deepfake attacks (Deloitte's 2024 survey found 26% of executives had already experienced deepfake incidents), and prompt injection (adversarial inputs that bypass model guardrails). The EU AI Act, published July 2024 with provisions applying from February 2025, introduces regulatory requirements for high-risk AI systems including those used in creditworthiness assessment and risk management.

Third-party and cloud concentration risk jumped 11 spots in Risk.net's 2024 survey. The CrowdStrike outage in July 2024 — a faulty security update that crashed 8.5 million systems globally — exposed how much of the financial system depends on a handful of technology providers. The Bank of England has warned that cloud concentration among a few providers could threaten financial stability. Basel published new third-party risk management principles in December 2025.

Geopolitical risk affects CIB operations through sanctions (rapidly changing lists requiring real-time screening), supply chain disruption to technology vendors, and data sovereignty requirements that constrain where systems can operate and where data can be stored.

The AI Paradox

AI is simultaneously the biggest emerging operational risk and the best available tool for managing operational risk. This is not a contradiction — it's a design constraint.

A bank deploying AI for fraud detection faces model risk from the AI itself: false negatives that miss real fraud, false positives that block legitimate transactions, adversarial attacks that exploit model weaknesses. A bank deploying GenAI for risk narrative generation faces hallucination risk — the model might produce plausible but inaccurate loss trend analysis. A bank deploying NLP for incident categorisation faces bias risk if the training data reflects historical categorisation errors.

The Deloitte/ORX report from December 2025 maps AI-related risks directly to the ORX operational risk taxonomy. The conclusion: existing OpRisk frameworks — risk identification, control assessment, incident management, KRI monitoring — apply to AI deployments. Banks don't need a fundamentally new framework for AI risk. They need to apply the existing framework rigorously to a new risk source.

This means:

  • AI models need RCSAs — what can go wrong, what controls are in place, what is the residual risk
  • AI failures need incident reporting — when the model produces a wrong output or an unexpected outcome, it gets logged and categorised like any other operational risk event
  • AI deployments need KRIs — model drift, accuracy degradation, data quality metrics, latency (a drift-monitoring sketch follows this list)
  • AI development needs control testing — validation of training data, testing of model outputs against known benchmarks, monitoring of production performance
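
As a concrete illustration of the KRI bullet above, here is a small sketch of a drift KRI using the population stability index (PSI), a common distribution-shift measure. The bucket count, the 0.1 / 0.25 thresholds, and the synthetic data are illustrative assumptions, not a prescribed standard.

```python
# Hedged sketch of a model-drift KRI using the population stability index (PSI).
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a baseline (training) sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # floor the proportions to avoid log(0) on empty buckets
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)        # feature distribution at training time
production = rng.normal(0.4, 1.2, 10_000)  # shifted distribution in production

drift = psi(baseline, production)
# common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 breach (escalate as a KRI)
status = "red" if drift > 0.25 else "amber" if drift > 0.1 else "green"
print(f"PSI = {drift:.3f} -> KRI status: {status}")
```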

The paradox resolves when you stop treating AI as something separate from the operational risk framework and start treating it as another process that generates operational risk — one that also happens to be unusually powerful at reducing operational risk in other processes.


Automation Opportunity — The Harder Win: AI-driven continuous control monitoring represents a step change from periodic testing. Instead of sampling 25 transactions per quarter to verify that dual-approval controls are working, an automated system checks every transaction in real time and flags exceptions as they occur. The setup cost is significant — you need clean data feeds from source systems, well-defined control logic, and an exception handling workflow. But the payoff is moving from "we tested a sample and it looked fine" to "we verified 100% of transactions and found three exceptions, all of which have been remediated." Regulators notice the difference.
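
A minimal sketch of what the control logic could look like for such a check, assuming a transaction feed with maker and approver fields. The field names and the "at least two independent approvers" rule are illustrative; the real rule comes from the control definition, and the real feed from the booking or payment system.

```python
# Sketch of continuous control monitoring for a dual-approval control.
# Field names (trade_id, maker, approvers) are hypothetical.

def check_dual_approval(transactions):
    """Yield an exception record for every transaction that fails the control."""
    for tx in transactions:
        approvers = set(tx.get("approvers", []))
        approvers.discard(tx["maker"])  # the maker cannot approve their own work
        if len(approvers) < 2:
            yield {"trade_id": tx["trade_id"],
                   "reason": "fewer than 2 independent approvers"}

feed = [
    {"trade_id": "T-1001", "maker": "alice", "approvers": ["bob", "carol"]},
    {"trade_id": "T-1002", "maker": "dave",  "approvers": ["dave", "erin"]},   # self-approval
    {"trade_id": "T-1003", "maker": "frank", "approvers": ["grace"]},          # only one approver
]

for exc in check_dual_approval(feed):
    print(exc)  # in production these would feed an exception-handling workflow
```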

Operational Resilience and Technology

Operational resilience — the ability to keep delivering critical services through disruption — is driving a wave of technology investment. Two regulatory frameworks set the pace: the UK PRA/FCA operational resilience framework (compliance deadline March 2025) and the EU's Digital Operational Resilience Act (DORA, effective January 2025).

Both frameworks require capabilities that most banks could not deliver manually: mapping dependencies for important business services across internal systems and third-party providers, setting and monitoring impact tolerances, running severe-but-plausible disruption scenarios, and reporting on resilience status to regulators.

Technology requirements flowing from these frameworks include vendor dependency mapping tools that trace which third parties support which critical services, impact tolerance monitoring dashboards that track whether recovery times are within acceptable limits, and scenario testing platforms that simulate disruptions across the end-to-end service chain. McKinsey's vision goes further — digital twins of critical operations that allow banks to run synthetic simulations of cyberattacks, vendor failures, or infrastructure outages before they happen.
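
Dependency mapping does not require exotic technology to get started. A simple sketch, with made-up service and vendor names, shows the core question such a tool answers: which important business services are exposed if one third party fails?

```python
# Illustrative vendor dependency map. Service and vendor names are invented;
# a real map would be sourced from the CMDB and the contract register.
service_dependencies = {
    "FX settlement":   {"cloud_provider_A", "market_data_vendor_B", "swift_bureau_C"},
    "Client payments": {"cloud_provider_A", "payments_processor_D"},
    "Trade reporting": {"reg_reporting_vendor_E", "cloud_provider_A"},
}

def impacted_services(vendor, dependencies):
    """Return the important business services that depend on a given vendor."""
    return sorted(svc for svc, vendors in dependencies.items() if vendor in vendors)

print(impacted_services("cloud_provider_A", service_dependencies))
# -> ['Client payments', 'FX settlement', 'Trade reporting']  (a concentration flag)
```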

For a deeper discussion of operational resilience, how it differs from traditional operational risk management, and the specific regulatory requirements, see the What is Operational Risk? page.

What to Build: Practical Guidance for Data and Engineering Teams

If you're a data engineer, product manager, or developer entering the operational risk space, here is where to focus — in order.

Start with data quality. Clean loss data with a consistent taxonomy matters more than any model. If the same type of event is categorised three different ways across business units, no dashboard or ML model will produce reliable results. Invest in data standardisation, validation rules, and taxonomy enforcement before building analytics. One global bank eliminated approximately 35,000 investigative hours by using ML to improve the data quality behind its AML alerts, whose false positive rate had been running at 96%.
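
Taxonomy enforcement can start as simply as validation at the point of capture. The sketch below uses the actual Basel Level 1 event types, but the field names, the 50-character narrative rule, and the sample record are illustrative assumptions.

```python
# Minimal sketch of taxonomy enforcement and validation rules for loss events.
BASEL_EVENT_TYPES = {
    "Internal Fraud", "External Fraud",
    "Employment Practices and Workplace Safety",
    "Clients, Products and Business Practices",
    "Damage to Physical Assets",
    "Business Disruption and System Failures",
    "Execution, Delivery and Process Management",
}

def validate_loss_event(event):
    """Return a list of data-quality issues; an empty list means the record passes."""
    issues = []
    if event.get("event_type") not in BASEL_EVENT_TYPES:
        issues.append(f"unknown event_type: {event.get('event_type')!r}")
    if not isinstance(event.get("gross_loss"), (int, float)) or event["gross_loss"] < 0:
        issues.append("gross_loss must be a non-negative number")
    if not event.get("narrative") or len(event["narrative"]) < 50:
        issues.append("narrative too short to be useful for analytics")
    return issues

record = {"event_type": "Sys failure", "gross_loss": 120_000, "narrative": "Outage."}
print(validate_loss_event(record))  # flags the free-text category and the thin narrative
```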

Build dashboards before AI. Most operational risk teams lack real-time visibility into their own risk profile. They assemble the picture once a month for the committee meeting. A well-designed dashboard that integrates loss data, KRI trends, RCSA results, and open action items — updated daily — delivers more value than a predictive model running on incomplete data. Visibility first. Prediction later.

Automate collection before analysis. The single biggest time sink in operational risk is manual data gathering. RCSA responses collected via email. KRI data pulled from source systems and re-entered into spreadsheets. Control evidence saved in shared drives. Automate the collection layer — pull data from source systems via APIs, build intake forms that write directly to the risk database, auto-populate KRI values from operational systems. This reduces errors, saves time, and creates the clean data foundation that analytics requires.
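
A hedged sketch of what collection automation can look like: pull a KRI value from a hypothetical internal API with requests and write it directly to a risk database, here a local SQLite file standing in for whatever store the bank actually uses. The URL, JSON shape, and table schema are all illustrative.

```python
# Sketch of automated KRI collection: source system API -> risk database,
# with no spreadsheet re-keying in between. Endpoint and schema are assumptions.
import sqlite3
import requests

def collect_kri(kri_id, source_url, db_path="oprisk.db"):
    resp = requests.get(source_url, timeout=30)
    resp.raise_for_status()
    value = resp.json()["value"]  # assumed response shape

    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS kri_values (kri_id TEXT, observed_at TEXT, value REAL)"
    )
    con.execute(
        "INSERT INTO kri_values VALUES (?, datetime('now'), ?)", (kri_id, value)
    )
    con.commit()
    con.close()

# collect_kri("SETTLEMENT_FAILS_PCT", "https://internal.example/api/kri/settlement-fails")
```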

Then layer in intelligence. Once the data is clean, integrated, and flowing, the AI use cases become viable. NLP for incident categorisation. Anomaly detection on KRI trends. GenAI for narrative generation. Predictive models for breach forecasting. Each of these requires the foundation to be in place.
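
As a flavour of the incident-categorisation use case, here is a toy TF-IDF plus logistic regression baseline in scikit-learn. The four narratives and labels are invented; a production classifier would be trained on years of labelled incident history and properly validated before anyone trusted its categorisations.

```python
# Toy baseline for NLP incident categorisation. Training data is made up and
# far too small for real use; this only shows the shape of the approach.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

narratives = [
    "Payment sent with stale settlement instructions, funds returned next day",
    "Phishing email led to credential compromise on the payments gateway",
    "Trading platform outage after vendor pushed a faulty update",
    "Trade booked in the wrong legal entity, corrected after reconciliation break",
]
labels = [
    "Execution, Delivery and Process Management",
    "External Fraud",
    "Business Disruption and System Failures",
    "Execution, Delivery and Process Management",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(narratives, labels)

print(model.predict(["Settlement failed because the SSI on file was out of date"]))
```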

McKinsey quantifies the benefit: 10–20% FTE productivity gain from risk automation, 15–25% cost reduction through well-structured risk transformation. A realistic timeline is 18–24 months to move from spreadsheets to an integrated platform with basic analytics, and another 12–18 months to layer in predictive capabilities.

Common Pitfalls

Building AI models on bad data. If loss events are inconsistently categorised, incident narratives are sparse, and RCSA ratings are subjective, an ML model trained on this data will produce confident but unreliable outputs. The model learns the noise, not the signal. Fix the data before building the model.

Jumping to prediction without mastering reporting. A risk function that cannot produce an accurate, timely description of its current risk profile has no business deploying predictive models. Descriptive analytics — reliable, automated, real-time reporting — is not a stepping stone to skip. It's the foundation.

Buying a GRC platform and expecting it to solve cultural problems. A GRC platform solves data fragmentation, workflow inefficiency, and reporting complexity. It does not solve a culture where the first line treats RCSA as a tick-box exercise, incidents go unreported to avoid scrutiny, and risk management is seen as overhead rather than value. The platform will surface these cultural problems — which is useful — but it won't fix them.

Ignoring the regulatory dimension of AI. The EU AI Act classifies AI systems used in creditworthiness assessment and risk management as high-risk, triggering requirements around transparency, human oversight, accuracy, and documentation. Banks deploying AI in operational risk functions need to account for these requirements from the design phase, not as an afterthought. Regulatory expectations for AI explainability conflict directly with black-box deep learning models — a tension that affects model selection.

Underestimating integration complexity. OpRisk touches every part of the bank. Building a system that integrates with trading platforms, HR systems, IT monitoring tools, compliance databases, and regulatory reporting requires sustained engineering effort and organisational alignment. The technology is rarely the hardest part — getting business units to agree on data standards and adopt new workflows is.

Sources

  • McKinsey — Global Risk Productivity Survey (February 2026)
  • McKinsey — The Future of Operational Risk Management in Financial Services
  • McKinsey — How Generative AI Can Help Banks Manage Risk and Compliance (March 2024)
  • Deloitte/ORX — CTRL + Operational Risk: ORX Perspectives on AI in Banking (December 2025)
  • Deloitte — Using Predictive Operational Risk Analytics
  • ORX — Operational Risk Horizon 2024
  • ORX — Annual Banking Operational Risk Loss Data Report 2025
  • Celent — GenAI in Risk and Compliance Survey (2024)
  • Moody's — GenAI Research Assistant
  • Risk.net — Top 10 Operational Risks for 2025
  • IBM — Automate RCSA with Generative AI
  • EU AI Act (published July 2024)
  • Basel Committee — Third-Party Risk Management Principles (December 2025)
  • UK PRA — PS21/3 Operational Resilience; EU DORA (January 2025)
  • RMA/PwC — RCSA Survey (2024)
  • IMF — Global Financial Stability Report (cyber risk data, 2023)
  • CrowdStrike / ICBC incident public reporting