The dominant narrative in AI right now is scale. More parameters. More compute. More data. Bigger context windows. The implicit assumption is that intelligence, or at least capability, is a function of size.
CERN is running a quiet counter-experiment to that assumption, and the results are extraordinary.
At the Large Hadron Collider in Geneva, CERN is deploying AI models so small they can be physically compiled into the circuit logic of a silicon chip. These models make decisions in under 50 nanoseconds (not milliseconds, not microseconds: nanoseconds) on one of the most extreme data streams ever produced by human technology. And they do it with no GPU in sight.
I find this genuinely fascinating. Not just as a curiosity from particle physics, but as a signal about where real-world AI engineering is going and what it tells us about the gap between what the industry builds and what production systems actually need.
The Scale of the Problem
The LHC generates approximately 40,000 exabytes of raw data per year.
Let me put that in perspective. The entire current internet (every webpage, video, message, database, and cloud storage bucket in existence) is estimated at roughly 120 to 150 exabytes of traffic per day. Spread across a year, the LHC's raw output works out to around 110 exabytes per day, the same order of magnitude as the traffic of the entire internet, all produced by a particle detector the size of a small building.
At peak luminosity, the data rate hits hundreds of terabytes per second.
Inside the 27-kilometre ring, proton bunches cross each other roughly every 25 nanoseconds. Each hard collision between protons generates a particle shower that produces several megabytes of detector data. The problem is that storing or even transmitting all of this is physically impossible: not impractical, not expensive, impossible.
So CERN has to throw almost all of it away. Only about 0.02% of all collision events are retained for analysis. The rest is discarded permanently, in real time, at the detector level.
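Those figures can be cross-checked with back-of-envelope arithmetic. The sketch below uses the 25-nanosecond crossing interval and 0.02% retention from the text; the 3 MB event size is an assumption standing in for "several megabytes", so the results are orders of magnitude, not precise values:

```python
# Back-of-envelope check of the LHC data rates, using round numbers.
BUNCH_CROSSING_NS = 25       # proton bunches cross every ~25 ns (from the text)
EVENT_SIZE_BYTES = 3e6       # assumed ~3 MB per collision ("several megabytes")
RETAINED_FRACTION = 0.0002   # ~0.02% of events survive the triggers

crossings_per_second = 1e9 / BUNCH_CROSSING_NS            # 40 million per second
raw_rate_bytes_per_s = crossings_per_second * EVENT_SIZE_BYTES

seconds_per_day = 86_400
raw_per_day_eb = raw_rate_bytes_per_s * seconds_per_day / 1e18

# Retained volume: same order as the ~1 PB/day the article quotes
# after both trigger stages.
retained_pb_per_day = raw_per_day_eb * 1e18 * RETAINED_FRACTION / 1e15

print(f"raw rate:     {raw_rate_bytes_per_s / 1e12:.0f} TB/s")
print(f"raw per day:  {raw_per_day_eb:.1f} EB")
print(f"kept per day: {retained_pb_per_day:.1f} PB")
```

On these assumptions the raw rate lands at roughly 120 TB/s, squarely in the "hundreds of terabytes per second" range quoted for peak luminosity, and around ten exabytes of raw data per day that simply cannot be stored.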
The question, the only question that matters, is: which 0.02%?
The Level-1 Trigger: Where AI Meets Physics at Nanosecond Speed
The first filtering stage is called the Level-1 Trigger. It consists of approximately 1,000 FPGAs (field-programmable gate arrays: reconfigurable silicon chips that can be compiled into arbitrary logic circuits) arranged around the detector.
These chips have a budget of less than 50 nanoseconds to decide whether a collision event is worth keeping. If they say no, the data is gone forever. There is no second chance, no human review, no recovery.
The algorithm running on these chips is called AXOL1TL. It's a neural network, but not anything you'd recognise from the standard deep learning playbook. It's a highly compressed anomaly detection model, trained to identify collision signatures that deviate from known physics in ways that might indicate something new and scientifically valuable.
What makes it possible to run a neural network in 50 nanoseconds on an FPGA?
The answer is the toolchain: an open-source framework called HLS4ML (High-Level Synthesis for Machine Learning). It takes a model written in PyTorch or TensorFlow and compiles it not into CUDA kernels or TensorFlow graphs, but into synthesisable C++ code that describes the model as a circuit. That circuit is then physically implemented in the FPGA's reconfigurable logic fabric.
The model doesn't run on a processor. It is the processor: every layer, every activation function, every weight implemented as a combination of logic gates, multipliers, and lookup tables burned into silicon.
And a critical architectural choice: CERN doesn't use all the available chip area for the neural network itself. A substantial portion is reserved for precomputed lookup tables, essentially a cache of common input patterns and their outputs. For the vast majority of typical detector signals, the chip never performs a calculation. It looks up the answer. That's how you get to nanosecond latency.
The Full Filtering Pipeline
After the Level-1 Trigger does its brutal first cut, from hundreds of terabytes per second down to something manageable, the surviving data moves to the second stage: the High-Level Trigger.
This is a conventional computing farm: 25,600 CPUs and 400 GPUs that process terabytes per second of pre-filtered data. At this stage, the latency budget is measured in milliseconds rather than nanoseconds, so conventional deep learning inference becomes viable.
After both stages, the retained data is approximately one petabyte per day: the scientifically valuable fraction of everything the LHC produces. That petabyte gets stored, distributed to research institutions worldwide, and analysed over months and years by thousands of physicists.
The architecture, end to end, looks like this:
LHC Detector Output
↓ hundreds of TB/sec
↓ decision window < 50 ns (bunch crossings every ~25 ns)
[ Level-1 Trigger ] ← 1,000 FPGAs running AXOL1TL
↓ 99.98% discarded
↓ ~microsecond to millisecond processing
[ High-Level Trigger ] ← 25,600 CPUs + 400 GPUs
↓ further reduction
↓
[ ~1 petabyte/day retained ]
↓
Physics analysis
What strikes me about this architecture is how familiar the pattern looks from a data engineering perspective. It's a tiered filtering pipeline: bronze is everything; gold is the tiny fraction that actually matters. The challenge is doing the triage correctly at extreme speed, because you can't afford to store bronze.
The LHC just has to do that triage in 50 nanoseconds instead of overnight.
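As a toy illustration of that tiered triage, here is a two-stage filter over synthetic events. The event fields, thresholds, and scoring functions are all invented for the sketch and bear no relation to real trigger logic; only the shape of the pipeline is the point:

```python
import random

random.seed(0)

def level1_trigger(event):
    """Cheap, fast first cut: a single threshold on one feature.
    Stands in for the FPGA stage; must reject the vast majority."""
    return event["energy"] > 0.99

def high_level_trigger(event):
    """Richer, slower second cut applied only to survivors.
    Stands in for the CPU/GPU farm."""
    return event["energy"] * event["weirdness"] > 0.9

# A stream of synthetic "events" (uniform random features, purely illustrative).
events = [{"energy": random.random(), "weirdness": random.random()}
          for _ in range(100_000)]

survivors_l1 = [e for e in events if level1_trigger(e)]
retained = [e for e in survivors_l1 if high_level_trigger(e)]

print(f"L1 keeps:    {len(survivors_l1) / len(events):.2%}")
print(f"final keep:  {len(retained) / len(events):.3%}")
```

The economics are the same as at CERN, just slower: the first stage must be cheap enough to run on everything, and only the tiny surviving fraction can afford expensive analysis.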
HLS4ML: The Compiler That Makes It Possible
For anyone building data systems, the most interesting part of CERN's stack is HLS4ML.
The workflow is:
- Train a model normally in PyTorch or TensorFlow
- Run it through HLS4ML, which applies aggressive quantisation (reducing weights from 32-bit floats to 4- or 8-bit fixed-point or integer representations), pruning, and layer fusion
- HLS4ML generates synthesisable C++ (using Vivado HLS or similar tools)
- The C++ is synthesised into an FPGA bitstream, the actual circuit configuration
- The bitstream is loaded onto the chip, where the model literally becomes the hardware
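HLS4ML's real quantisation is driven by per-layer fixed-point configuration, but the basic idea in the second step can be sketched in plain Python. This is a minimal symmetric 8-bit round-trip, not HLS4ML's actual algorithm:

```python
def quantize_int8(weights):
    """Symmetric linear quantisation of a weight list to int8.
    Returns the integer weights plus the scale needed to map back."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.7, 0.003, 0.99, -0.31]   # toy FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error is bounded by half a quantisation step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max error {max_err:.4f}")
```

Each weight now fits in one byte instead of four, and on an FPGA the multiplications become narrow integer operations that map directly onto logic and DSP resources.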
The resulting models are tiny by any conventional standard. The AXOL1TL architecture is a variational autoencoder: a model designed to reconstruct "normal" collision events and flag anomalies where reconstruction fails. Implementing this in FPGA logic with 50-nanosecond latency requires aggressive compression of everything: the network architecture, the numerical precision, and the inference path.
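AXOL1TL's actual network is far beyond a toy, but the scoring principle (compress the event through a bottleneck, reconstruct it, and treat large reconstruction error as an anomaly) can be sketched in a few lines. The "encoder", "decoder", events, and threshold below are all hand-picked for illustration, not trained:

```python
def encode(x):
    """Toy 'encoder': compress an event to a single latent value.
    Here simply the mean; a trained VAE would learn this mapping."""
    return sum(x) / len(x)

def decode(z, n):
    """Toy 'decoder': reconstruct the event from the latent value."""
    return [z] * n

def anomaly_score(x):
    """Mean squared reconstruction error: large when the event
    doesn't fit the compressed picture of 'normal' physics."""
    reconstruction = decode(encode(x), len(x))
    return sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)

normal_event = [0.9, 1.0, 1.1, 1.0]   # features the bottleneck explains well
weird_event = [0.1, 3.0, 0.2, 2.5]    # features it cannot reconstruct

THRESHOLD = 0.05                       # hypothetical trigger threshold
print(anomaly_score(normal_event))     # small: discard
print(anomaly_score(weird_event))      # large: keep this event
```

The appeal for physics is that the model never needs labelled examples of new phenomena; anything "normal" reconstructs cheaply, and whatever doesn't is, by construction, interesting.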
The precomputed lookup table approach is particularly interesting. In standard neural network inference, every forward pass involves floating-point multiplications and additions. At nanosecond timescales, even a handful of sequential multiplications is too slow. By precomputing outputs for common input patterns, CERN converts inference from computation to memory lookup, trading silicon area for speed.
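The trick is easy to demonstrate in miniature: quantise the inputs, enumerate every possible input offline, and store the model's answer. The "model" and the 4-bit input width here are invented for the sketch:

```python
from itertools import product

def tiny_model(a, b):
    """Stand-in for a small trigger network on two 4-bit inputs."""
    return 1 if (a * b + a - b) > 40 else 0

# Offline: evaluate the model once for every possible quantised input
# pair. Two 4-bit inputs -> only 16 * 16 = 256 table entries.
LUT = {(a, b): tiny_model(a, b) for a, b in product(range(16), repeat=2)}

# Online: "inference" is a single memory lookup, no arithmetic at all.
def infer(a, b):
    return LUT[(a, b)]

print(infer(3, 2), infer(15, 3))
```

On an FPGA the dictionary becomes a block of RAM or logic-implemented lookup tables addressed directly by the input bits, which is why the answer can come back within a clock cycle or two.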
This is a principle that applies far beyond FPGAs. The same trade-off (precompute at rest, serve at speed) underlies database materialised views, feature stores in ML systems, and lookup-table-based scoring in fraud detection. CERN is just doing it at extreme resolution.
What Comes Next: The High-Luminosity LHC
The current LHC is scheduled for a major upgrade in 2031: the High-Luminosity LHC (HL-LHC). This upgrade will increase luminosity by roughly a factor of ten, meaning ten times more collision events, significantly larger event sizes, and a proportionally larger data stream.
CERN is already building the next generation of its FPGA AI pipeline to handle this. The challenge is not just "more of the same"; it's a qualitative step change. The Level-1 Trigger will need to become even faster, or more selective, or both. The models will need to be even smaller and even more accurate.
This is where the research frontier sits: can you build a neural network that fits in under 1,000 logic gates, infers in under 25 nanoseconds, and still catches the signals that matter?
The answer, apparently, is yes, but it requires rethinking what a neural network is and how it's deployed.
Why This Matters Outside Physics
I want to step back and say something about why I'm writing about a particle physics experiment on a data engineering blog.
The insight that CERN is demonstrating is not about particle physics. It's about what happens when you take latency and resource constraints seriously as first-class engineering requirements: not afterthoughts, not candidates for a later optimisation pass, but hard constraints that shape the entire system design from the ground up.
The AI industry's instinct, when faced with a hard problem, is to add compute. More GPUs. Bigger clusters. Longer training runs. This works up to a point. But it produces systems that are fragile outside their intended deployment envelope. A model that requires an A100 GPU to run can't run at the edge. A model that needs 200 milliseconds can't run in a real-time trading system. A model that needs 16GB of VRAM can't run on embedded hardware.
CERN's approach is the opposite instinct: define the constraint first, then build toward it. 50 nanoseconds. 1,000 logic elements. Fit the model into that envelope or don't deploy it.
In high-frequency trading, the same constraint logic applies. Latency in order execution is measured in microseconds. Risk models that flag anomalous positions need to evaluate in nanoseconds, not seconds. FPGAs are already used extensively in HFT infrastructure for exactly this reason, and the CERN work suggests that learned models, not just handcrafted rules, can live in that envelope.
In insurance and financial services risk systems, real-time fraud detection and claims triage systems are increasingly under pressure to make decisions at transaction speed. The current generation is mostly rule-based or uses lightweight ML models running on CPUs. The CERN architecture (tiny learned models compiled to hardware) points toward a future where more sophisticated anomaly detection can happen at the same speed as the transaction itself.
In medical imaging and diagnostics, real-time analysis during procedures (ultrasound, intraoperative imaging) requires sub-millisecond inference on embedded hardware with limited power budgets. The same FPGA-compiled neural network approach is being explored here, and CERN's toolchain (HLS4ML is open source) provides a ready starting point.
In edge IoT and industrial sensing, the constraint is often power, not just speed. A model compiled into an ASIC or FPGA consumes orders of magnitude less power than the same model running on a GPU. For battery-powered sensors or remote monitoring equipment, this is the difference between viable and unviable.
The Architectural Lesson I Keep Coming Back To
There's a pattern in how the best data systems are designed that the CERN architecture exemplifies cleanly:
Understand what you actually need to keep, then build the minimum system that keeps it.
This sounds obvious. It isn't. The default engineering instinct is to keep everything store it all, filter later, decide downstream. The LHC makes that impossible. You have 25 nanoseconds. You keep 0.02%. You build a system that makes that decision correctly, at that speed, or you lose the signal forever.
Most data systems I work with don't face anything remotely approaching those constraints. We have time. We have storage. We can afford to be sloppy at ingest and clean later. The medallion architecture exists precisely because of that luxury: land raw, refine downstream.
But there are contexts (real-time fraud scoring, market risk calculation, claims triage) where the luxury evaporates; where the event happens once and the decision has to be made now. And in those contexts, the design philosophy CERN has been forced to adopt (small, fast, hardware-first, decision at the edge) is the right one.
The LHC is an extreme case. But it's extreme in a direction that more systems are going to find themselves moving toward, as real-time data volumes grow and the expectation of immediate intelligent responses increases.
Closing Thought
The most counterintuitive thing about CERN's AI work is this: they are running some of the most consequential AI inference in the world (literally deciding which signals from the fundamental structure of matter are worth keeping) on models so small they fit in a chip floorplan you could draw by hand.
Meanwhile, elsewhere in the AI industry, the conversation is about whether GPT-6 will need a nuclear power plant.
Both directions are real. Both have their applications. But CERN's work is a reminder that the most impressive engineering is often not the biggest; it's the most precisely fit to the problem.
40,000 exabytes in. One petabyte out. 50 nanoseconds per decision. Zero second chances.
That's a data engineering problem. And they've solved it with a neural network smaller than a thumbnail.
Source: "CERN Uses Tiny AI Models Burned into Silicon for Real-Time LHC Data Filtering" TheOpenReader, March 28, 2026. Primary technical sources: AXOL1TL V5 architecture (CERN Twiki), arXiv:2411.19506 (Real-time Anomaly Detection at the L1 Trigger of CMS Experiment).