Building a Centralised ICD-10-CM Dataset: Daily Extraction from CMS

Clinical data is built on codes. Nearly every hospital admission, outpatient visit, insurance claim, and research dataset in the United States uses ICD-10-CM — the International Classification of Diseases, 10th Revision, Clinical Modification — as its primary language for describing what is wrong with a patient. If you are working in healthcare analytics, population health, actuarial modelling, or medical NLP, you will encounter these codes constantly.

The problem is that the authoritative source — the CMS (Centers for Medicare & Medicaid Services) annual release — is published as a ZIP file containing fixed-width flat text files. It is accurate and complete, but it is not particularly easy to query. Each year a new release drops (effective October 1), and any downstream datasets have to be refreshed to stay current.

This post describes how we extract the full ICD-10-CM code set into a structured, centralised JSON dataset, and automate that extraction on a daily schedule so it tracks the current fiscal year release without manual intervention.

What is ICD-10-CM

ICD-10-CM is the US clinical modification of the World Health Organization's ICD-10 standard. It has a three-level hierarchy:

Chapter: the broadest grouping, defined by a letter-number range. There are 22 chapters in the current release; a selection is shown here:

Chapter	Range	Title
I	A00–B99	Certain infectious and parasitic diseases
II	C00–D49	Neoplasms
IX	I00–I99	Diseases of the circulatory system
X	J00–J99	Diseases of the respiratory system
XIX	S00–T88	Injury, poisoning and certain other consequences of external causes
XXII	U00–U85	Codes for special purposes

Chapter XXII is worth noting: it was added specifically to accommodate COVID-19 (U07.1) and post-COVID condition (U09.9), illustrating how the classification evolves with clinical reality.

Block / category: a three-character code that groups related conditions. A00 is cholera, I10 is essential hypertension, J45 covers asthma. Most categories are non-billable headers that organise the hierarchy but cannot appear on a claim; a few, such as I10, have no subdivisions and are billable as they stand.

Billable code: a three-to-seven character code (with a decimal point after the third character whenever there are more than three) representing a specific diagnosis. A00.0 is cholera due to Vibrio cholerae 01, biovar cholerae. I48.91 is unspecified atrial fibrillation. S72.001A is a fracture of an unspecified part of the neck of the right femur, initial encounter for closed fracture.

The full FY2026 release contains approximately 72,750 codes, of which roughly 70,000 are billable leaf codes. The remaining ~2,750 are category headers.

The Source: CMS Official Release

CMS publishes ICD-10-CM data annually on their coding page. The release contains several files; the one we care about is the tabular-order flat file, typically named icd10cm_order_YYYY.txt. This file has a fixed-width layout:

Positions  Content
---------  -------
0–4        Sequence number (5 chars)
5          Space
6–12       ICD-10-CM code, space-padded to 7 chars
13         Space
14         Billable flag (1 = billable, 0 = non-billable header)
15         Space
16–75      Short description (60 chars, space-padded)
76         Space
77+        Long description (variable length, to end of line)

A few rows from the file look like this:

00001 A00     0 Cholera                                                      Cholera
00002 A000    1 Cholera due to Vibrio cholerae 01, biovar cholerae           Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001    1 Cholera due to Vibrio cholerae 01, biovar eltor              Cholera due to Vibrio cholerae 01, biovar eltor

Notice that codes are stored without the decimal point: A000 rather than A00.0. The decimal is inserted after the third character when formatting for display or storage. The two description columns are identical in these rows because the text fits within 60 characters.

Extraction Approach

The notebook uses a cascading strategy with three levels:

Level 1 — Scrape the CMS landing page

The CMS ICD-10 coding page lists the current year's download links. We scrape it with requests and look for ZIP file URLs matching known patterns:

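import re
from urllib.parse import urljoin

import requests

# HEADERS (the request headers) is defined earlier in the notebook.

def _scrape_cms_zip_url() -> str | None:
    try:
        r = requests.get(
            "https://www.cms.gov/medicare/coding-billing/icd-10-codes",
            headers=HEADERS, timeout=30
        )
        r.raise_for_status()
    except requests.RequestException:
        return None  # network or HTTP error: let the caller fall through to Level 2
    patterns = [
        # Absolute links to an ICD-10-CM ZIP
        r'(https://www\.cms\.gov/files/zip/[^"\'\ ]*icd.10.cm[^"\'\ ]*\.zip)',
        # Relative links to any ICD-10 ZIP (looser; may also match non-CM archives)
        r'href="(/files/zip/[^"\'\ ]*icd.10[^"\'\ ]*\.zip)',
    ]
    for pat in patterns:
        hits = re.findall(pat, r.text, re.IGNORECASE)
        if hits:
            # First match wins; urljoin leaves absolute URLs unchanged.
            return urljoin("https://www.cms.gov", hits[0])
    return None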

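This keeps the exact file name out of the code: if CMS renames the archive (which they do periodically), the scraper still finds it as long as the link sits under /files/zip/ and matches one of the patterns. The first match on the page wins, and a change that breaks both patterns falls through to Level 2.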

Level 2 — Try known URL patterns

If the scrape fails, we try a set of known URL templates for the current and adjacent fiscal years:

candidates = []
for year in [fy, fy - 1, fy + 1]:
    candidates += [
        f"https://www.cms.gov/files/zip/{year}-icd-10-cm-code-descriptions-tabular-order.zip",
        f"https://www.cms.gov/files/zip/{year}-icd-10-cm-codes.zip",
    ]

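The fiscal year logic accounts for the October 1 rollover. A CMS fiscal year is named for the calendar year in which it ends, so FY2026 starts on October 1, 2025:

from datetime import datetime, timezone

now = datetime.now(timezone.utc)
fy = now.year + 1 if now.month >= 10 else now.year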
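As a rough, testable sketch of the two pieces above, the rollover rule and the candidate list can be written as pure functions. The function names fiscal_year and candidate_urls are illustrative and not taken from the notebook; the URL templates are the ones quoted above.

```python
from datetime import date


def fiscal_year(day: date) -> int:
    """Return the CMS fiscal year containing `day` (a fiscal year starts on October 1)."""
    return day.year + 1 if day.month >= 10 else day.year


def candidate_urls(fy: int) -> list[str]:
    """Build fallback URLs for the current, previous, and next fiscal year, in that order."""
    templates = [
        "https://www.cms.gov/files/zip/{year}-icd-10-cm-code-descriptions-tabular-order.zip",
        "https://www.cms.gov/files/zip/{year}-icd-10-cm-codes.zip",
    ]
    return [t.format(year=y) for y in (fy, fy - 1, fy + 1) for t in templates]


print(fiscal_year(date(2025, 9, 30)))   # 2025
print(fiscal_year(date(2025, 10, 1)))   # 2026
print(len(candidate_urls(2026)))        # 6
```

Keeping the date as a parameter, rather than calling datetime.now() inside the function, makes the boundary on October 1 easy to check.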

Level 3 — Static fallback

If all network sources fail — for example, in a CI environment with restricted egress — we load a curated static dataset that covers all 22 chapters with representative codes. The meta.partial flag is set to true in the output so downstream consumers know this is a sample, not the full release.

The static fallback is maintained directly in the notebook and serves as both a safety net and a documentation layer: it shows exactly what the code structure looks like without requiring a live download.
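The three levels can be chained with a small helper along these lines. This is a minimal sketch rather than the notebook's actual code: first_available and the stub sources are hypothetical, while the partial flag and errors list mirror the meta fields this post describes.

```python
from typing import Callable, Optional


def first_available(
    sources: list[tuple[str, Callable[[], Optional[bytes]]]],
    fallback: bytes,
) -> tuple[bytes, dict]:
    """Try each source in order; use the static fallback if every one fails."""
    errors = []
    for name, fetch in sources:
        try:
            payload = fetch()
        except Exception as exc:  # record the failure and try the next level
            errors.append(f"{name}: {exc}")
            continue
        if payload:
            return payload, {"data_source": name, "partial": False, "errors": errors}
        errors.append(f"{name}: no data returned")
    # All network levels failed: flag the output as a sample, not the full release.
    return fallback, {"data_source": "static fallback", "partial": True, "errors": errors}


def scrape() -> bytes:
    raise ConnectionError("egress blocked")  # stub for Level 1


def known_urls() -> Optional[bytes]:
    return None  # stub for Level 2: no candidate URL responded


data, meta = first_available([("scrape", scrape), ("known-urls", known_urls)], b"static")
print(meta["partial"], len(meta["errors"]))  # True 2
```

Collecting the error strings instead of raising keeps the failure visible in the output file even when the fallback succeeds.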

Parsing and Enrichment

Once the flat file is downloaded and extracted from the ZIP, we parse it into a pandas DataFrame:

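import pandas as pd

def _parse_order_file(raw: bytes) -> pd.DataFrame:
    # Offsets are 0-based. In the CMS order file, columns 16-75 hold the
    # short description (60 chars) and the long description starts at 77.
    lines = raw.decode("latin-1").splitlines()
    records = []
    for line in lines:
        if len(line) < 16:
            continue  # blank or truncated line
        code_raw   = line[6:13].strip()
        billable   = line[14] == "1"
        short_desc = line[16:76].strip()
        long_desc  = line[77:].strip() or short_desc
        records.append({
            "code_raw":    code_raw,
            "code":        format_icd10(code_raw),  # inserts the decimal
            "billable":    billable,
            "description": long_desc,
        })
    return pd.DataFrame(records)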

The decimal insertion follows a simple rule: take the first three characters as the category, then append a dot and the remainder:

def format_icd10(raw: str) -> str:
    raw = raw.strip()
    if len(raw) <= 3:
        return raw        # three characters or fewer: no decimal to insert
    return f"{raw[:3]}.{raw[3:]}"
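A few worked cases make the rule concrete. The snippet below repeats format_icd10 so it runs on its own; strip_decimal is a hypothetical inverse helper added for illustration and is not part of the notebook.

```python
def format_icd10(raw: str) -> str:
    """Insert the decimal point after the third character of a CMS file code."""
    raw = raw.strip()
    if len(raw) <= 3:
        return raw
    return f"{raw[:3]}.{raw[3:]}"


def strip_decimal(code: str) -> str:
    """Hypothetical inverse: turn a display code back into the CMS file form."""
    return code.replace(".", "").strip()


for raw in ["A000", "I10", "I4891", "S72001A"]:
    print(raw, "->", format_icd10(raw))
# A000 -> A00.0
# I10 -> I10
# I4891 -> I48.91
# S72001A -> S72.001A
```

Note that I10 passes through unchanged even though it is billable: a three-character code has nothing after the category to separate.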

After parsing, we enrich each row with:

chapter_num: derived by comparing the 3-character prefix against the chapter range table (e.g., A00 through B99 maps to Chapter 1)
chapter_range: the human-readable range string (A00-B99)
block_code: the 3-character category prefix
level: raw code length without the decimal (3 to 7). Length alone does not determine billability: I10 is billable at length 3, while S72.00 is a non-billable header at length 5, so filter on the billable flag rather than on level

The chapter definitions are hardcoded in the notebook. The chapter structure itself is largely stable across annual releases (Chapter XXII is the recent exception); what changes between fiscal years is mostly the set of billable codes within each chapter (new codes added, old ones deleted or revised).
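The chapter lookup can be sketched as a plain string comparison against the range table. This is an illustrative version under stated assumptions: the table below is abbreviated to the six chapters quoted earlier in this post, and chapter_for is a hypothetical name, not the notebook's.

```python
# Abbreviated table: only the chapters quoted earlier in this post.
CHAPTERS = [
    (1, "A00", "B99"),
    (2, "C00", "D49"),
    (9, "I00", "I99"),
    (10, "J00", "J99"),
    (19, "S00", "T88"),
    (22, "U00", "U85"),
]


def chapter_for(code: str):
    """Map a code (with or without decimal) to its chapter fields, or None if not in the table."""
    prefix = code.replace(".", "")[:3].upper()
    for num, start, end in CHAPTERS:
        # Plain string comparison: the letter is compared first, then each following character.
        if start <= prefix <= end:
            return {
                "chapter_num": num,
                "chapter_range": f"{start}-{end}",
                "block_code": prefix,
            }
    return None


print(chapter_for("S72.001A"))
print(chapter_for("E11.9"))  # None here, because the abbreviated table omits that chapter
```

String comparison is enough because every category is one letter followed by two characters. Digits sort before letters, so a category ending in a letter needs care when it sits right at a chapter boundary.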

Output Schema

The notebook writes to data/analytics/icd10-data-extraction.json following the same pattern as the other analytics reports on this site. The file has four top-level keys:

meta — provenance and methodology:

{
  "title": "ICD-10-CM Code Dataset",
  "slug": "icd10-data-extraction",
  "kind": "reference-data",
  "data_source": "CMS ICD-10-CM FY2026 tabular-order ZIP (live download)",
  "fiscal_year": "FY2026",
  "effective_date": "2025-10-01",
  "coverage": "72,750 codes across 22 chapters",
  "partial": false,
  "generated_at": "2026-05-15T01:02:34.567890+00:00"
}

summary — aggregate statistics:

{
  "total_codes": 72750,
  "billable_codes": 70012,
  "header_codes": 2738,
  "chapters": 22,
  "blocks": 2103,
  "sample_size": 990
}

chapters — per-chapter statistics with a code sample:

[
  {
    "num": 1,
    "range": "A00-B99",
    "title": "Certain infectious and parasitic diseases",
    "total_codes": 1247,
    "billable_codes": 1209,
    "header_codes": 38,
    "unique_blocks": 112,
    "sample_codes": [
      { "code": "A00.0", "description": "Cholera due to Vibrio cholerae 01, biovar cholerae" },
      ...
    ]
  }
]

results is a stratified sample of billable codes (up to 45 per chapter, so at most 990 in total; a chapter with fewer billable codes contributes all of them):

[
  {
    "code": "A00.0",
    "description": "Cholera due to Vibrio cholerae 01, biovar cholerae",
    "chapter_num": 1,
    "chapter_range": "A00-B99",
    "block_code": "A00",
    "billable": true,
    "level": 4
  }
]

The results array is deliberately capped. The full 72,000-code dataset at ~120 bytes per record would produce an ~8.7 MB JSON file — too large for a static page load. The chapter statistics in chapters are complete; the results sample is for illustration and search-index purposes. If you need the full dataset, run the notebook locally and drop the MAX_SAMPLE_CODES limit.
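The cap and the size estimate above can be reproduced with a short sketch. The function name stratified_sample, the per_chapter parameter, and the fixed seed are assumptions made for illustration; the notebook's own sampling code is not shown in this post.

```python
import random
from collections import defaultdict

MAX_PER_CHAPTER = 45  # per-chapter cap quoted above


def stratified_sample(records, per_chapter=MAX_PER_CHAPTER, seed=0):
    """Pick up to `per_chapter` billable codes from each chapter, deterministically."""
    by_chapter = defaultdict(list)
    for rec in records:
        if rec["billable"]:
            by_chapter[rec["chapter_num"]].append(rec)
    rng = random.Random(seed)  # fixed seed: the same input gives the same sample
    sample = []
    for num in sorted(by_chapter):
        rows = sorted(by_chapter[num], key=lambda r: r["code"])
        picked = rows if len(rows) <= per_chapter else rng.sample(rows, per_chapter)
        sample.extend(sorted(picked, key=lambda r: r["code"]))
    return sample


# Back-of-envelope size of the uncapped file, using the figures above.
full_size_mb = 72_750 * 120 / 1_000_000
print(f"{full_size_mb:.2f} MB")  # 8.73 MB
```

A seeded generator matters for the nightly job: without it, every run would reshuffle the sample and produce a spurious diff.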

Daily Refresh via GitHub Actions

The extraction runs nightly at 01:00 UTC as part of the existing analytics.yml workflow. The step is identical in structure to the other notebook steps:

- name: Run ICD-10 data extraction notebook
  continue-on-error: true
  run: |
    jupyter nbconvert \
      --to notebook \
      --execute \
      --ExecutePreprocessor.timeout=1200 \
      --output icd10-data-extraction.executed.ipynb \
      notebooks/analytics/icd10-data-extraction.ipynb

continue-on-error: true ensures a transient network failure (e.g., CMS servers returning a 503 at 01:00 UTC) does not block the other screeners from running or prevent the commit step from pushing the JSON that was refreshed. The meta.errors array in the output captures any failures so they are visible in the data, not just lost in CI logs.

When the notebook completes, the workflow commits the updated data/analytics/icd10-data-extraction.json to main. The build.yml workflow watches data/analytics/**.json and triggers a full Astro rebuild, so the analytics page is updated within minutes of the nightly run.

Querying the Dataset

The most common operations against this dataset in analytics work:

Look up a single code:

import json

with open("data/analytics/icd10-data-extraction.json") as f:
    data = json.load(f)
code_index = {r["code"]: r for r in data["results"]}
code_index.get("I10")
# {'code': 'I10', 'description': 'Essential (primary) hypertension', ...}
# Returns None if I10 is not in the published sample: results is not the full code set.

Filter by chapter:

chapter9 = [r for r in data["results"] if r["chapter_num"] == 9]

Find all codes in a block:

dm_codes = [r for r in data["results"] if r["block_code"] in ("E10", "E11", "E13")]

Chapter distribution:

import pandas as pd
df = pd.DataFrame(data["chapters"])
print(df[["num", "range", "title", "billable_codes"]].sort_values("billable_codes", ascending=False))

For the full code set (not just the sample), load the notebook runtime output from a local execution or use the full CMS ZIP directly. The notebook's parse_cms_zip function returns a complete DataFrame with all 72,000+ rows — you can write that to a local CSV or SQLite database for heavy querying.
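For the SQLite route, a minimal sketch using only the standard library might look like this. The table name icd10cm and the helper load_into_sqlite are hypothetical; the record fields follow the results schema shown earlier, and a DataFrame can be turned into such records with df.to_dict("records").

```python
import sqlite3


def load_into_sqlite(records, conn):
    """Create the table if needed and upsert one row per code."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS icd10cm ("
        "code TEXT PRIMARY KEY, description TEXT, chapter_num INTEGER, "
        "block_code TEXT, billable INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO icd10cm VALUES "
        "(:code, :description, :chapter_num, :block_code, :billable)",
        records,
    )
    conn.commit()


records = [
    {"code": "E11.9", "description": "Type 2 diabetes mellitus without complications",
     "chapter_num": 4, "block_code": "E11", "billable": True},
    {"code": "E11.65", "description": "Type 2 diabetes mellitus with hyperglycemia",
     "chapter_num": 4, "block_code": "E11", "billable": True},
    {"code": "I10", "description": "Essential (primary) hypertension",
     "chapter_num": 9, "block_code": "I10", "billable": True},
]

conn = sqlite3.connect(":memory:")  # pass a file path instead for a persistent database
load_into_sqlite(records, conn)
rows = conn.execute(
    "SELECT code FROM icd10cm WHERE block_code = ? AND billable = 1 ORDER BY code",
    ("E11",),
).fetchall()
print(rows)  # [('E11.65',), ('E11.9',)]
```

INSERT OR REPLACE keeps the load repeatable: running it again after a new release updates descriptions in place instead of failing on the primary key.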

Why Daily?

ICD-10-CM is updated once per fiscal year (October 1). So why run daily?

A few reasons:

Catch mid-year addenda. CMS occasionally publishes addenda (typically April 1) with corrections or new codes for emerging conditions. The notebook picks these up automatically on the next nightly run.
Detect source changes. If CMS changes the URL structure, file format, or adds a second archive, the scraping layer notices on the next run and logs an error rather than silently serving stale data.
Idempotent operations are cheap. If nothing has changed, the notebook runs in under two minutes and produces a JSON file that is identical apart from generated_at. As long as that timestamp is ignored when deciding whether the file changed, there is no commit and no rebuild. The cost of checking is low; the cost of missing a mid-year update is high.
Consistency with the other screeners. The Shariah screener and Islamic finance heatmap also run daily. A unified nightly schedule is operationally simpler than per-dataset cron expressions.
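One way to get the "no change, no commit" behaviour is to compare the new payload with the file on disk while ignoring meta.generated_at, and skip the write when they match. The helper below is a hypothetical sketch of that idea, not the workflow's actual logic.

```python
import json
import tempfile
from pathlib import Path


def write_if_changed(path: Path, payload: dict) -> bool:
    """Write `payload` only if it differs from the file on disk, ignoring meta.generated_at."""
    def comparable(doc: dict) -> dict:
        meta = {k: v for k, v in doc.get("meta", {}).items() if k != "generated_at"}
        return {**doc, "meta": meta}

    if path.exists():
        existing = json.loads(path.read_text(encoding="utf-8"))
        if comparable(existing) == comparable(payload):
            return False  # nothing substantive changed; leave the file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "icd10-data-extraction.json"
    first = write_if_changed(target, {"meta": {"generated_at": "t1"}, "summary": {"total_codes": 3}})
    second = write_if_changed(target, {"meta": {"generated_at": "t2"}, "summary": {"total_codes": 3}})
    third = write_if_changed(target, {"meta": {"generated_at": "t3"}, "summary": {"total_codes": 4}})
print(first, second, third)  # True False True
```

Because the file is left untouched when only the timestamp differs, a later git diff is genuinely empty and the rebuild is never triggered.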

Key Statistics: FY2026 at a Glance

The current release illustrates how the code set is distributed across clinical domains:

The injury and poisoning chapter (XIX, S00-T88) is by far the largest, reflecting the combinatorial explosion of laterality (left/right), encounter type (initial/subsequent/sequela), and fracture specificity that the clinical modification introduces. A single fracture type can generate a dozen or more codes once all these dimensions are combined.

The neoplasm chapter (II, C00-D49) is large for a different reason, site specificity: each primary site has codes for malignant, benign, in situ, uncertain, and unspecified behaviour.

Contrast this with the special-purposes chapter (XXII, U00-U85), which has fewer than 20 codes in the current release. It exists for rapid addition of new disease entities: U07.1 (COVID-19) took effect on April 1, 2020, midway through FY2020 and on roughly two weeks' notice, rather than waiting for the normal October 1 update.

Next Steps

A few extensions worth exploring from this foundation:

ICD-10-PCS — the Procedure Coding System, used in inpatient settings alongside ICD-10-CM. Also published annually by CMS. Its structure is entirely different (7-character alphanumeric codes with a positional meaning system), but the same download + parse approach applies.

Cross-walk tables — CMS publishes mapping files between ICD-9-CM and ICD-10-CM for historical comparability. Adding these to the dataset would allow queries to span pre-2015 and post-2015 records.

Frequency weighting — joining the code set to public claims data (e.g., CMS Medicare Part B public use files) to annotate each code with actual utilisation volume. The static code list tells you what codes exist; the claims data tells you which ones actually matter in practice.

Embedding + semantic search — ICD-10-CM descriptions are short, structured natural-language strings. A small text embedding model can cluster semantically related codes across chapters and power a "find similar codes" search that goes beyond prefix matching.

All of these build on the same foundation: a clean, current, centralised ICD-10-CM code dataset — which is what this notebook produces nightly.

Data source: CMS ICD-10-CM FY2026 official release.