What Breaks If This Data Is Wrong

A single diagnostic question cuts through every data governance framework, every data quality initiative, and every ownership debate: what would break if this data were wrong? The answer tells you whether a piece of data is critical, and whether your organisation actually governs it.

The question is not about completeness scores or SLA thresholds or metadata fields, yet it settles every Critical Data Element register, every data quality initiative, and every ownership debate faster than any of them. It is this:

What would break if this data were wrong?

If nothing meaningful would break, the data is not tied to a critical decision. It may be useful. It may be interesting. But it is not critical, and spending governance resources on it is a misallocation.

If something material would break, but no one in the room can name who owns the risk, then governance is weak. Not incomplete. Not immature. Weak. The data is making real decisions and no one is accountable for its correctness.

This question is not rhetorical. It is a diagnostic instrument. Applied systematically, it separates critical data from decorative data, and it exposes governance gaps faster than any maturity assessment framework.


Why most data governance misses the point

The standard approach to data governance begins with inventory. Catalogue everything. Assign stewards. Define data quality dimensions: accuracy, completeness, consistency, timeliness, uniqueness, validity. Measure them. Build a dashboard.

This is not wrong. But it is incomplete in a way that matters.

The problem is that measuring quality across all data treats every field as equally important. It produces high scores for columns that feed no decision, and low scores for columns that feed every decision. The governance effort spreads thin across the entire estate rather than concentrating on the handful of elements that actually determine outcomes.

Industry estimates routinely put the failure rate of data governance initiatives at around 80%. Not because of tools, not because of frameworks, but because the business is not involved and no one agrees on what data truly matters. The "what breaks" question forces that agreement. It is the mechanism by which a data quality programme earns business credibility instead of being treated as an IT housekeeping exercise.

Critical Data Elements are specific data points vital for core business functions. Unlike the traditional "all data matters" approach, CDEs recognise a key truth: some data elements are much more important than others. They are the backbone of operations, the foundation of decision-making, and often the difference between business success and failure.

The "what breaks" question is simply the fastest way to find them.


The two-by-two reality

Plot any piece of data across two axes: whether something material breaks if it is wrong, and whether there is a named owner who accepts accountability for it.

                      NAMED OWNER?
                   No             Yes
               ┌──────────────┬──────────────────┐
 MATERIAL  Yes │  GOVERNANCE  │   GOVERNED CDE   │
 BREAKS?       │  GAP ⚠       │   (target state) │
               ├──────────────┼──────────────────┤
           No  │  DECORATIVE  │  OVER-GOVERNED   │
               │  DATA        │  DATA            │
               └──────────────┴──────────────────┘

Governed CDE (top-right): A named person or team owns it. Quality rules are defined. Monitoring is automated. If it degrades, someone gets paged. This is the target state for every field that feeds a material decision.

Governance gap (top-left): This is the most dangerous quadrant. The data is live in production. It is driving pricing models, credit decisions, regulatory reports, or trading signals. But nobody owns the risk if it is wrong. When something breaks, the post-mortem begins with three hours of people pointing at each other before anyone investigates the data.

Decorative data (bottom-left): Nobody owns it and nothing breaks if it is wrong. This is the majority of most organisations' data estates. It should be low-priority for governance investment.

Over-governed data (bottom-right): Someone owns it, there are quality rules, monitoring is in place, but nothing material depends on it. This represents misallocated governance effort. The owner's time would be better spent on the governance gap quadrant.
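The matrix is simple enough to encode directly. A minimal sketch in plain Python, assuming each dataset has been assessed with the two booleans on the axes (names and labels are illustrative):

```python
def classify(material_breaks: bool, has_named_owner: bool) -> str:
    """Place a dataset in the two-by-two governance matrix."""
    if material_breaks and has_named_owner:
        return "governed CDE"            # target state
    if material_breaks and not has_named_owner:
        return "governance gap"          # most dangerous: fix first
    if not material_breaks and has_named_owner:
        return "over-governed data"      # effort better spent elsewhere
    return "decorative data"             # document and deprioritise

print(classify(True, False))             # prints "governance gap"
```

Trivial as it is, running every dataset in an estate review through this function produces the prioritised worklist the matrix implies: the "governance gap" bucket first, the "decorative data" bucket last.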

The single hardest conversation in any governance programme is telling a business unit that their carefully maintained dataset is in the bottom row. The second hardest is telling a data engineering team that their proudly instrumented pipeline is in the top-left.


What "material" actually means

In financial services, the word material has a precise meaning. A misstatement is material if a reasonable person would have made a different decision knowing the correct figure. The same standard applies to data quality.

There are four categories of materiality in a data-driven organisation.

Financial materiality: The data directly feeds a number that flows into financial statements, regulatory capital calculations, or risk-weighted asset computations. If the NAV calculation for a fund pulls from a corrupted pricing feed, the reported NAV is wrong. Investors make redemption and subscription decisions on a wrong number. That is financial materiality.
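The arithmetic behind financial materiality is worth making explicit. A sketch with illustrative numbers (the holding weight, the price error, and the function name are all assumptions for the example): the NAV misstatement from a single mispriced holding is approximately the holding's weight times the price error.

```python
def nav_error_bps(position_weight: float, price_error_pct: float) -> float:
    """Approximate NAV misstatement in basis points caused by one
    mispriced holding: weight of the holding times the price error."""
    return position_weight * price_error_pct * 10_000

# A holding at 8% of NAV priced 5% too high overstates NAV by
# roughly 40 bps -- material or not depends on the tolerance the
# fund's pricing error policy sets, which varies by fund type.
error = nav_error_bps(0.08, 0.05)
```

The point of the sketch is the shape of the question, not the numbers: for each pricing input, governance needs to know the weight of what it prices and the error tolerance of the decision it feeds.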

Regulatory materiality: The data feeds a submission to a regulator. Under BCBS 239, banks are required to produce timely, accurate, complete, and integrated risk data for both normal and stress conditions. Under MiFID II best execution requirements, transaction data must be accurate enough to demonstrate that execution was genuinely in the client's interest. If the data is wrong and the submission is wrong, the institution is non-compliant regardless of intent.

Operational materiality: The data controls a real-time system. In a trading context, this includes reference data (instrument identifiers, contract specifications, settlement dates), position data (current holdings), and market data (prices, rates, volatility surfaces). If any of these are wrong, the system makes wrong decisions continuously until someone notices. The damage accumulates with every transaction. According to BaseCap Analytics, bad data can cause businesses to lose up to 15% of their revenue. In the financial services industry, where precision is crucial, the cost of bad data extends beyond financial loss and impacts decision-making, customer trust, and compliance.

Model materiality: The data is a feature in a model that makes decisions at scale. A credit scoring model trained on a systematically biased data source does not produce one bad decision. It produces biased decisions across every applicant evaluated using that model until the bias is detected and corrected. The EU AI Act, passed in 2024, classifies financial services models used in credit scoring, KYC, and fraud detection as high-risk, triggering stricter requirements on training data governance and audit logging.

The "what breaks" test works across all four categories. If the answer to "what breaks" sits in any of these categories, the data is a Critical Data Element.


Mapping it to the quant stack

For practitioners building systematic trading infrastructure, the "what breaks" question has concrete answers at every layer of the pipeline.

DATA FIELD                  WHAT BREAKS IF WRONG
─────────────────────────────────────────────────────────────────────────────
available_at timestamp      Look-ahead bias. Backtest Sharpe is fictional.
                            Live system trades on information it did not have.

instrument identifier       Orders route to the wrong security. Fills land
                            in the wrong position. P&L attribution is wrong.

position quantity           Risk calculations understate or overstate exposure.
                            Guardian risk checks pass when they should fail.

settlement date             Cash management fails. Margin calls arrive on the
                            wrong day. Collateral is in the wrong account.

market price (mid)          MM quotes are stale. Stat-arb spread z-scores are
                            wrong. Entry and exit signals fire incorrectly.

VaR threshold               Guardian allows orders that breach actual limits.
                            Risk is taken that the system believes it is not.

slippage_bps (fill record)  Backtest cost model is never calibrated to reality.
                            Strategy appears profitable but is not in live trading.

feature computed_at         Research and production use different definitions.
                            Live Sharpe diverges from backtest Sharpe silently.
─────────────────────────────────────────────────────────────────────────────

Every field in the top section of this table has a clear, material answer to "what breaks." Every one of them should have a named owner and automated quality monitoring. The available_at timestamp on the tick table is arguably the single most important field in the entire data model, precisely because the consequences of it being wrong are invisible until a strategy fails in live trading and someone traces it back to inflated backtest performance.
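The available_at failure mode can be guarded mechanically: every backtest read must filter on availability time, never on event time. A minimal pure-Python sketch of the point-in-time rule (in production this would be a filter on the tick table itself; the record layout here is an assumption for the example):

```python
from datetime import datetime

def point_in_time(ticks: list[dict], as_of: datetime) -> list[dict]:
    """Return only the ticks the system could actually have seen at
    `as_of`: filter on available_at (arrival time), not on ts (event
    time). Filtering on ts is exactly how look-ahead bias enters."""
    return [t for t in ticks if t["available_at"] <= as_of]
```

The discipline the function encodes is the whole point: the backtest is only honest if every read path goes through a filter like this one.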

Data quality underpins core processes in banking. It spans domains, enabling functions that range from credit-risk modelling and operational-risk controls to accurate underwriting and pricing. Yet even when data quality is not the stated focus, it still plays a major role in decision-making, compliance and financial outcomes.


What governance weakness actually looks like

Governance weakness is not the absence of policies. Most organisations have policies. Governance weakness is the gap between policy and accountability.

Consider a scenario that plays out in financial services organisations regularly. A regulatory report contains an error. The regulator identifies it. The institution begins an internal investigation. The investigation reveals that the figure in the report was produced by an automated pipeline that reads from a database table. The table is populated by an ETL job. The ETL job was written two years ago and the original engineer has left. The table has no data owner in the governance catalogue because the governance catalogue was implemented after the table was created and the migration of legacy assets was never completed. The business unit that uses the report believes the data team owns the quality. The data team believes the business unit defines the requirements and therefore owns the quality. Both are partially right. Neither owns the risk.

Many companies encounter significant barriers to developing a comprehensive and consistent view of data risk. These challenges include a lack of accountability within the enterprise risk framework, unclear ownership of data risks, limited understanding of data risks within the first line of defence, and low visibility into risk.

This is governance weakness. The data was making real decisions. Nobody owned the risk. When it broke, nobody knew whose problem it was.

The "what breaks" question, asked prospectively, prevents this scenario. If a field feeds a regulatory report, someone in that conversation should be able to say: if this field is wrong by five percent, our submission is non-compliant, and the person accountable for that submission is this named individual. That individual owns the CDE. Not because a governance policy says so, but because they are the person who would have to explain the error to the regulator.

Accountability follows consequence. The "what breaks" question makes the consequence visible, which surfaces the accountability.


The BCBS 239 lesson

BCBS 239, the Basel Committee on Banking Supervision's principles for effective risk data aggregation and risk reporting, is the most explicit regulatory articulation of the data quality problem in financial services. It was published in 2013 in direct response to the 2008 financial crisis, during which many institutions discovered that they could not aggregate their own risk exposures quickly or accurately enough to inform decisions.

The core finding was not that the data was wrong. It was that nobody knew how wrong it was, and nobody was accountable for finding out.

Principle 2 of BCBS 239 requires data integrity: risk data should be accurate and reliable. Principle 3 requires completeness. Principle 4 requires timeliness. Principle 5 requires adaptability under stress conditions. But threading through all of them is an implicit demand that sits in the "what breaks" question: these requirements only have teeth if someone is responsible for meeting them and held accountable when they are not met.

Regulatory initiatives like BCBS 239 highlight banks' obligation to achieve timely, accurate, complete and integrated risk data for both normal and stress conditions. Meeting these standards enables faster risk aggregation and sound capital decisions, while failure exposes firms to supervisory findings, financial penalties and slower crisis response.

One major bank received a fine of $400 million from the Office of the Comptroller of the Currency in 2020 for inadequate data governance, internal controls, and risk management. Additional fines of $136 million followed in 2024 for insufficient progress on the same issues, specifically around regulatory reporting. The total remediation and penalty cost substantially exceeded what a properly funded governance programme would have cost over the same period.

The lesson is not that data governance is expensive. It is that the absence of governance is more expensive.


Applying the test in practice

The "what breaks" test works as both a prospective governance tool and a retrospective diagnostic.

Prospectively, when onboarding a new data source: Before a new data feed or table enters production, the review should include this question explicitly. What decisions does this data inform? If those decisions are material, who owns the quality of this source? What quality rules need to be defined before it can be used in a production system? This is not bureaucracy. It prevents the "no owner" scenario from being inherited into the production estate.

When reviewing an existing data estate: Apply the two-by-two matrix. For each dataset, identify whether anything material depends on it and whether there is a named owner. The top-left quadrant items are the immediate priority. Assign owners, define quality rules, and automate monitoring. The bottom-left quadrant items can be documented and deprioritised.

When investigating a data quality incident: Start from the consequence and trace backwards. What broke? What data caused it? What were the quality rules in place for that data? Who owned those rules? The answers expose the governance gap precisely, without requiring a full estate audit to find the source of the problem.

When assessing a vendor or third-party data source: The standard vendor assessment focuses on uptime, latency, and price. It should also include: what happens to your decisions if this data is wrong for four hours? For four days? If the answer is material, the contract should include explicit data quality SLAs, lineage documentation, and incident response obligations. Many institutional data contracts do not.


What the test does not do

The "what breaks" question is necessary but not sufficient.

It does not tell you how to fix bad data. It tells you which bad data is urgent to fix. The remediation work, the data quality rules, the validation pipelines, the stewardship processes, these are separate problems that require separate solutions.

It does not replace technical data quality measurement. You still need profiling, completeness monitoring, freshness checks, and anomaly detection. What the test does is determine which datasets those tools should be applied to first.
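Of those technical measurements, freshness is the simplest to state: compare the newest record's timestamp with the clock. A minimal sketch (the default threshold is an illustrative assumption; the real SLA belongs in the CDE's quality rule):

```python
from datetime import datetime, timedelta

def is_stale(latest_record_ts: datetime, now: datetime,
             max_age: timedelta = timedelta(minutes=15)) -> bool:
    """A dataset is stale if its newest record is older than the
    freshness SLA agreed for that CDE."""
    return now - latest_record_ts > max_age
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check testable and replayable against historical incidents.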

It does not resolve the technical complexity of ownership in a distributed system. In a microservices architecture with dozens of teams producing and consuming data, ownership can be genuinely ambiguous at the boundary of two systems. The test surfaces that ambiguity. Resolving it requires explicit conversation and organisational decisions about team boundaries.

It does not account for the accumulation of risk over time. A dataset that feeds a low-stakes decision today may feed a high-stakes decision in two years because a new model was built on top of it. Governance is not a one-time audit. It is a continuous process of re-asking the same question as the data estate evolves.


A practical schema for ownership

Once the test identifies a Critical Data Element, the ownership record needs to be minimal, unambiguous, and kept current. Elaborate governance catalogues with dozens of metadata fields are often filled in once and never maintained. Four fields carry the substance: the field path, the decision impact, the business owner, and the quality rule; the remainder is operational bookkeeping.

CREATE TABLE critical_data_elements (
    cde_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    field_path      TEXT NOT NULL,         -- e.g. 'ticks.available_at'
    decision_impact TEXT NOT NULL,         -- what breaks if this is wrong
    business_owner  TEXT NOT NULL,         -- named individual, not a team
    quality_rule    TEXT NOT NULL,         -- one-line definition of "correct"
    monitor_query   TEXT,                  -- SQL or alert definition
    last_reviewed   DATE NOT NULL,
    status          TEXT NOT NULL          -- 'governed' | 'gap' | 'remediation'
);

field_path identifies the data precisely. decision_impact records the answer to the "what breaks" question, which is what justifies the element being in this table. business_owner must be a person, not a team name. Teams do not receive pages at 2am. People do. quality_rule should be expressible in one sentence: the value must be a valid ISIN, the timestamp must not be in the future, the balance must reconcile to the settlement instruction within one basis point. If it cannot be expressed in one sentence, it is not specific enough to be monitored automatically.

# Automated quality check pattern in Polars
import polars as pl
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class CDECheck:
    cde_id: str
    description: str
    owner: str
    rule: Callable[[pl.LazyFrame], pl.LazyFrame]   # returns failing rows

def run_cde_checks(df: pl.LazyFrame, checks: list[CDECheck]) -> dict:
    results = {}
    for check in checks:
        failing = check.rule(df).collect()
        results[check.cde_id] = {
            "owner":       check.owner,
            "description": check.description,
            "failures":    len(failing),
            "passed":      len(failing) == 0,
        }
    return results

# Example check definitions
checks = [
    CDECheck(
        cde_id      = "tick.available_at",
        description = "available_at must not be earlier than ts (event time)",
        owner       = "data.engineering@firm.com",
        rule        = lambda df: df.filter(pl.col("available_at") < pl.col("ts")),
    ),
    CDECheck(
        cde_id      = "orders.fill_price",
        description = "fill_price must be positive and within 10% of mid at submission",
        owner       = "execution.desk@firm.com",
        rule        = lambda df: df.filter(
            (pl.col("fill_price") <= 0) |
            (
                (pl.col("fill_price") - pl.col("arrival_mid")).abs()
                / pl.col("arrival_mid") > 0.10
            )
        ),
    ),
]

The check functions return failing rows, not boolean flags. Returning the rows means the alert contains enough information for the owner to investigate without querying the database again.
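That design choice can be carried through to the alert itself. A sketch of an alert payload builder, in plain Python, where the failing rows are assumed to arrive as a list of dicts (the field names mirror the CDECheck dataclass above; the sample size is an illustrative default):

```python
def build_alert(cde_id: str, owner: str, description: str,
                failing_rows: list[dict], sample_size: int = 5) -> dict:
    """Package a failed check into an alert that carries a sample of
    the offending rows, so the owner can begin investigating without
    re-querying the source system."""
    return {
        "cde_id": cde_id,
        "owner": owner,
        "rule": description,
        "failure_count": len(failing_rows),
        "sample": failing_rows[:sample_size],
    }
```

Capping the sample keeps the alert deliverable over paging channels while still giving the owner concrete rows to start from.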


The governance maturity signal

The final diagnostic value of the "what breaks" question is that it reveals governance maturity without requiring a formal assessment.

Organisations in the early stages of governance maturity cannot answer the question at all. Data is not mapped to decisions. Nobody knows which datasets feed which processes. When asked what breaks if a particular table is wrong, the answer is a meeting to find out.

Organisations in the middle stages can answer the first half. They know what breaks. But the second half, who owns the risk, surfaces a gap. They have data catalogues and stewardship roles on paper. But when pressed, the steward owns the metadata, not the quality, and the quality has no named owner who would be accountable for a failure.

Organisations with mature governance can answer both halves immediately. The answer exists in a living document that was reviewed in the last quarter. The owner is a named individual who has accepted accountability for the quality rule and whose performance objectives include it.

In an FSI forum 2025 survey, data quality was cited as the main challenge for AI implementation by 55% of financial-services respondents, far ahead of regulatory complexity, skills gaps, and budget constraints. The reason it ranks so highly is not that data quality is technically unsolvable. It is that in most organisations, the governance structures that would make quality a matter of accountability rather than aspiration are not in place. The "what breaks" test is the beginning of building those structures.


For the data engineer specifically

The "what breaks" question is often framed as a business concern. This is a mistake. It is equally a technical concern, and data engineers are often better positioned than business stakeholders to identify the answers.

A data engineer building a pipeline knows exactly which downstream tables depend on their output. They know which fields are joins, which are filters, which are model features, and which are stored in a summary table that feeds the CEO's dashboard. They know where the data quality checks are and where they are not.

The problem is that this knowledge is in people's heads rather than in governance artefacts. When that engineer leaves, the knowledge leaves with them. The orphaned pipeline running on a cron job with no owner is the natural end state of governance that exists only as institutional memory.

Data lineage tooling, whether embedded in an orchestrator like Airflow or Dagster, or managed through a dedicated catalogue, converts that implicit knowledge into explicit documentation. But the lineage graph only becomes a governance instrument when it is annotated with the "what breaks" information: which nodes in the graph are CDEs, who owns them, and what the quality rules are.

# Dagster asset annotation pattern

from dagster import asset, AssetIn

@asset(
    ins={"ticks": AssetIn()},
    metadata={
        "cde": True,
        "decision_impact": "feeds VaR calculation and Guardian position checks",
        "business_owner": "risk.team@firm.com",
        "quality_rule": "no null fill_price, slippage_bps within -50 to 200 bps",
    },
    tags={"governed": "true", "criticality": "high"},
)
def positions(ticks):
    """Derived positions table. Tagged CDE: any quality issue here
    propagates directly to live risk calculations."""
    ...

Annotating assets with CDE metadata at the definition level means the governance information travels with the code rather than living in a separate catalogue that is never consulted during development.
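Once assets carry that metadata, the CDE register itself can be generated rather than hand-maintained. A sketch, assuming the asset definitions have already been collected into plain dicts (Dagster exposes asset metadata programmatically; the extraction step and the dict layout here are assumptions for the example):

```python
def build_cde_register(assets: list[dict]) -> list[dict]:
    """Derive the CDE register from asset-level metadata: every asset
    tagged cde=True contributes one row with its impact and owner."""
    return [
        {
            "field_path":      a["name"],
            "decision_impact": a["metadata"]["decision_impact"],
            "business_owner":  a["metadata"]["business_owner"],
        }
        for a in assets
        if a["metadata"].get("cde")
    ]
```

Generating the register from code means it cannot silently drift from production: an asset with no metadata simply does not appear, which is itself a visible governance gap.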


References and further reading

For the foundational data management framework that contextualises CDE governance within a broader data management function: DAMA International, DAMA-DMBOK: Data Management Body of Knowledge, 2nd edition (2017). The canonical reference for data governance roles, responsibilities, and practice areas including data quality management and metadata management.

On the regulatory requirements for risk data governance in banking: Basel Committee on Banking Supervision, Principles for effective risk data aggregation and risk reporting (BCBS 239), January 2013. Available at bis.org. The most precise articulation of what regulators expect from financial institutions regarding data quality, ownership, and aggregation capability.

For a practitioner account of what governance failure costs in financial services: the OCC consent order against a major US bank (2020) and the subsequent $136 million supplement (2024) for insufficient remediation progress, specifically around regulatory reporting data quality. Public documents available through the OCC enforcement actions archive.

On Critical Data Elements and governance automation in 2025: Alation, Critical Data Elements Best Practices for Data Governance (2025), available at alation.com. Useful for the implementation detail around CDE catalogues, ownership assignment patterns, and quality monitoring integration.

On the broader cost of poor data quality in financial services: SAP Fioneer, The true cost of poor data quality in banking and insurance (2025). The FSI forum 2025 survey cited data quality and integration as the main barriers to enabling and scaling AI, with 55% of financial-services respondents identifying it as the primary challenge.

On data risk management in financial services: Deloitte, Managing Data Risk in Financial Services. Useful for the enterprise risk framework perspective on how data risk sits alongside credit, market, and operational risk.