Conformal E-Prediction | Matt Y. Cheung

If you deploy machine learning models in the real world, you know a fundamental truth: data is rarely static. The environment your model was trained in is almost never the exact environment it operates in months later. Traditional Conformal Prediction (CP) is a powerful tool for quantifying uncertainty and providing reliable error bars around AI predictions. However, it relies on a fragile assumption of exchangeability: the idea that your calibration data and your future, incoming data are statistically interchangeable, so that their joint distribution does not change when the points are reordered.

In practice, exchangeability is routinely violated. A computer vision model trained on sunny summer driving data will inevitably face winter blizzards. An algorithmic trading bot trained in a bull market will suddenly encounter a crash. If you peek at your data stream early to check on performance, or if your environment fundamentally shifts, the mathematical guarantees of classical CP completely fall apart, leaving you with false confidence.

A recent breakthrough paper, “E-Values Expand the Scope of Conformal Prediction” by Etienne Gauthier, Francis Bach, and Michael I. Jordan, offers a powerful, elegant solution to this exact problem. By replacing traditional, rank-based p-values with a dynamic concept called e-values, they have built a framework that allows engineers to monitor uncertainty dynamically, stop experiments at any arbitrary time, and rigorously handle non-stationary data streams.

Here is a deep dive into the math, the statistical mechanics, and the practical engineering of this new paradigm.

The Problem: Why “Peeking” Breaks Traditional Statistics

In traditional statistics and classical Conformal Prediction, uncertainty is quantified using p-values. To understand why they fail in streaming environments, you must realize that a p-value is like a finish-line photo in a marathon. Its mathematical validity relies entirely on calculating it at a single, pre-determined stopping point defined before the experiment even began.

Imagine you are running an A/B test for a new e-commerce checkout flow. You decide to run the test for 10,000 visitors. If you wait until exactly 10,000 visitors have passed and then calculate your p-value, your error guarantees are perfectly intact.

But what if you monitor a real-time dashboard? At visitor 400, by pure random luck, the new checkout flow happens to look incredibly successful. The p-value temporarily dips to $0.04$. If you say, “Aha! Statistical significance! Let’s stop the test and deploy,” you are committing a statistical sin known as p-hacking, optional stopping, or data snooping.

Because data is inherently noisy, random walks will fluctuate wildly. If you check an experiment continuously, you give yourself multiple chances to get lucky with that random noise. A famous statistical theorem, the Law of the Iterated Logarithm, implies that if you wait long enough and check often enough, the test statistic of a pure-noise random walk will eventually cross any fixed significance threshold. Even a modest amount of peeking is costly: if you check your test at 20 evenly spaced points, your true chance of accidentally finding a “significant” result rises from your intended $5\%$ target to roughly $25\%$. Traditional CP simply cannot handle dynamic, real-time decision-making because moving the finish line breaks the foundational math.
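
To make the inflation concrete, here is a minimal simulation sketch. It assumes a two-sided z-test at the nominal 5% level, no real effect at all, and evenly spaced looks; the function name and its parameters are illustrative choices, not something taken from the paper.

```python
import math
import random

def peeking_false_positive_rate(n_looks=20, n_trials=4000, z_crit=1.96, seed=0):
    """Estimate how often a pure-noise experiment is declared 'significant'
    when a z-test is run at each of n_looks evenly spaced checkpoints."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            # Each batch is summarised by one standard normal increment,
            # so the z-statistic after k batches is running_sum / sqrt(k).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(k)
            if abs(z) > z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "significant" look
    return false_positives / n_trials

single_look = peeking_false_positive_rate(n_looks=1)
twenty_looks = peeking_false_positive_rate(n_looks=20)
print(f"1 look:   {single_look:.3f}")
print(f"20 looks: {twenty_looks:.3f}")
```

Up to simulation noise, the single-look rate should land near the intended 5%, while the 20-look rate should land near 25%.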

The Mathematical Foundation: Filtrations and Martingales

To fix the peeking problem, the authors completely change the mathematical foundation, borrowing heavy machinery from the study of sequential probability and stochastic processes.

Filtrations and $\sigma$-algebras

In everyday life, information accumulates. We don’t learn everything at once; we learn sequentially. In measure-theoretic probability, we define “what is knowable” at a specific moment using a $\sigma$-algebra (denoted as $\mathcal{F}_t$). Think of $\mathcal{F}_t$ as a rigorous, mathematical database of all the events and outcomes your current state of knowledge allows you to measure with a definitive “Yes” or “No”.

As time moves forward, your database of facts grows. A filtration is the mathematical representation of this accumulating, unchangeable history:

\[\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \dots\]

Why does the set of $\sigma$-algebras increase over time? Because time is an arrow that accumulates facts. At $t=2$, the past information from $t=1$ hasn’t changed—it is permanently frozen—but your overall knowledge base has expanded to include the new data from $t=2$. Your field of vision simply gets wider.

Martingales and Supermartingales

A martingale represents a perfectly fair casino game. It models a sequence of random variables where your expected future wealth is exactly your current wealth, given everything you know today:

\[\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = M_t\]

In this paper (specifically Theorem 2.5), the authors define their tracking metric not as a strict martingale, but as a supermartingale, meaning the expected future value is equal to or less than the current value ($\leq M_t$). Why use a supermartingale?

The Safety Cushion: A supermartingale represents a fair or slightly disadvantageous game (like betting against the house). If you can prove that mathematical guarantees hold for a disadvantageous game, they automatically and perfectly protect you against worst-case scenarios in the real world.

Asymmetric Error Control: In Conformal Prediction, making your intervals too wide is annoying, but making them too narrow is catastrophic (you lose your coverage guarantee). If the supermartingale drifts downward faster than expected, it mathematically forces the prediction intervals to widen safely, guaranteeing you never accidentally exceed your target error rate $\alpha$.

By framing uncertainty as a supermartingale, the authors unlock Ville’s Inequality. While standard statistics uses Markov’s inequality for single snapshots, Ville’s Inequality is a sequential law: for a non-negative supermartingale that starts at 1, the probability of a gambler’s wealth ever reaching a factor of $k$ by pure luck across an entire timeline is at most $1/k$. This is the magic bullet: you can peek at your data as often as you like, and your statistical safety guarantee still holds.
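
For reference, the inequality can be written out as follows, for a non-negative supermartingale $(M_t)$ with $M_0 = 1$ (the starting wealth of 1 is the usual normalization and is assumed here):

\[\mathbb{P}\left(\exists\, t \geq 0 : M_t \geq k\right) \leq \frac{1}{k}\]

Setting $k = 1/\alpha$ shows that the chance of the wealth ever reaching $1/\alpha$ under the null hypothesis is at most $\alpha$, no matter how many times the process is inspected.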

The Engine: The E-Variable Formulation

Instead of ranking data from 1 to $N$ to create p-values, the authors use e-values (expectation values). An e-value is a non-negative random variable whose expected value is at most 1 under the null hypothesis. It measures evidence against that null hypothesis in the form of a betting payout: large values are unlikely unless the null is false.

While there are many ways to construct e-values (like likelihood ratios), this paper relies on a beautiful, tuning-free formulation (Equation 4) that evaluates a proposed candidate label $y$:

\[E_{\text{test}}(y) = \frac{S(X_{\text{test}}, y)}{\frac{1}{n+1} \left( \sum_{i=1}^n S(X_i, Y_i) + S(X_{\text{test}}, y) \right)}\]

This equation simply asks: “How large is the error score ($S$) of my current guess compared to the average error of everything we’ve seen so far, including this new guess?”

This specific formulation is brilliant for three reasons:

Scale-Invariance: Because it is a relative ratio, it is totally immune to arbitrary scaling. If you suddenly switch your neural network’s loss function to output values 100x larger, the 100 cancels out in the numerator and denominator.

No Hyperparameters: It requires absolutely no tuning or parameter guessing.

Guaranteed Validity: By including the test point’s score in the denominator, the math forces a beautiful symmetry. If every one of the $n+1$ points takes its turn as the test point, the resulting ratios sum to exactly $n+1$. Under the null hypothesis of exchangeable data, each ratio therefore has an expected value of at most $1$.
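
The three properties above can be checked numerically with a small sketch of Equation 4. The function name and the toy scores are made up for illustration, and the sketch assumes non-negative scores with a positive total.

```python
import math

def conformal_e_value(calibration_scores, test_score):
    """Equation 4: the test score divided by the average of all n + 1 scores
    (the n calibration scores plus the test score itself)."""
    n = len(calibration_scores)
    total = sum(calibration_scores) + test_score
    if total <= 0:
        # Assumption of this sketch: scores are non-negative and not all zero.
        raise ValueError("scores must be non-negative with a positive total")
    return (n + 1) * test_score / total

calibration = [0.2, 0.5, 0.1, 0.4]   # toy calibration scores (illustrative)
test_score = 0.8                     # score of one candidate label

e_value = conformal_e_value(calibration, test_score)

# Scale-invariance: multiplying every score by 100 leaves the e-value unchanged.
e_value_scaled = conformal_e_value([100 * s for s in calibration], 100 * test_score)

# Symmetry: let each of the n + 1 points take its turn as the test point.
all_scores = calibration + [test_score]
turn_e_values = [
    conformal_e_value(all_scores[:i] + all_scores[i + 1:], s)
    for i, s in enumerate(all_scores)
]

print(e_value, e_value_scaled, sum(turn_e_values))
```

The sum of the turn-by-turn e-values equals the number of points, which is the symmetry that yields an expected value of at most 1 under exchangeability.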

Building Prediction Sets (Corollary 2.6)

How do we actually build an “anytime-valid” prediction set, $\hat{C}_t$, around a completely fresh data point? We use a baseline martingale ($M_{t-1}$) that tracks our historical performance across all previous time steps.

For a new data point, we perform a hypothetical trial for every possible candidate label $y$ in our label space. We calculate what the e-value $E_t(y)$ would be if that label were the truth, and multiply it by our accumulated history. If this hypothetical product breaches our safety wall (defined by our error tolerance $\alpha$), we reject the label from the set:

\[\hat{C}_t(X_t) = \left\{ y \in \mathcal{Y} : M_{t-1} \times E_t(y) < \frac{1}{\alpha} \right\}\]

For example, if you want a $95\%$ confidence set, your $\alpha$ is $0.05$, making your threshold $1/0.05 = 20$. If multiplying your historical martingale by the candidate e-value pushes the total above 20, that label is deemed statistically implausible and discarded.
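
The set-building rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors’ implementation: the function name, the label names, and the toy scores are hypothetical, and all scores are assumed to be positive.

```python
def e_prediction_set(candidate_scores, calibration_scores, wealth_prev, alpha):
    """Keep every label y whose hypothetical product wealth_prev * E_t(y)
    stays strictly below the 1/alpha threshold."""
    n = len(calibration_scores)
    calibration_total = sum(calibration_scores)
    threshold = 1.0 / alpha
    kept = []
    for label, score in candidate_scores.items():
        # Equation 4, evaluated as if this candidate label were the truth.
        e_value = (n + 1) * score / (calibration_total + score)
        if wealth_prev * e_value < threshold:
            kept.append(label)
    return kept

calibration = [0.1] * 49                                # toy past scores
candidates = {"cat": 0.05, "dog": 0.3, "car": 5.0}      # toy scores per label

# With wealth 1.0, "car" has an e-value of about 25, which breaches 20.
set_at_wealth_one = e_prediction_set(candidates, calibration, wealth_prev=1.0, alpha=0.05)

# With wealth 0.5, the same e-value gives a product of about 12.6, so "car" stays.
set_at_wealth_half = e_prediction_set(candidates, calibration, wealth_prev=0.5, alpha=0.05)

print(set_at_wealth_one)
print(set_at_wealth_half)
```

Comparing the two calls shows how the accumulated history enters the decision: the same candidate e-value is rejected at one wealth level and retained at a lower one.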

Why Sets Stay Constant While Martingales Move

If you run code simulations of this math, you will notice something fascinating: the baseline martingale $M_t$ fluctuates wildly at every single step, but the prediction set $\hat{C}_t$ often remains completely frozen for long periods.

This is not a bug; it is the threshold effect. The prediction set only changes if the combination of a new candidate and the history crosses the strict $1/\alpha$ boundary.

The Burn-In Phase: During the early steps of a stream (e.g., $t < 30$), $M_{t-1}$ is hovering near its starting value of 1. Based on Equation 4, with so few historical samples in the denominator, it is mathematically impossible for any single $E_t(y)$ to spike high enough to cross a threshold like 20. As a result, the set conservatively accepts all labels. Practitioners must be aware that e-conformal systems require a warm-up period.
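
The bound behind the burn-in can be worked out directly from Equation 4, assuming all scores are non-negative and $S(X_{\text{test}}, y) > 0$:

\[E_{\text{test}}(y) = \frac{(n+1)\, S(X_{\text{test}}, y)}{\sum_{i=1}^n S(X_i, Y_i) + S(X_{\text{test}}, y)} \leq \frac{(n+1)\, S(X_{\text{test}}, y)}{S(X_{\text{test}}, y)} = n+1\]

With $M_{t-1} \approx 1$ and $\alpha = 0.05$, excluding a label requires $E_t(y) \geq 20$, which is only possible once $n + 1 \geq 20$, that is, once at least 19 past scores sit in the denominator.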

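The Stable Phase: Once the model proves itself over hundreds of steps, the e-values of the true labels hover slightly below 1, causing $M_{t-1}$ to shrink to a small fraction (e.g., $0.01$). A candidate label is excluded only when $M_{t-1} \times E_t(y)$ reaches $1/\alpha$, so with $M_{t-1} = 0.01$ and a threshold of 20 it takes an e-value of at least 2000 to reject a label. Ordinary step-to-step movement in $M_{t-1}$ rarely changes which candidates clear that bar, so the set stays locked at a constant size until a major data shock occurs. Note the direction of the effect: a smaller $M_{t-1}$ makes exclusion harder, so how tight the locked set is depends on how large the e-values of wrong labels are relative to $1/(\alpha M_{t-1})$.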

Optimizing the System: Online Gradient Descent

As noted in the paper’s Remark 2.7, simply multiplying e-values together ($M_t = \prod_{s \leq t} E_s$) gives every single data point equal weight. This is a blunt instrument. A bad prediction from three months ago carries the same weight as a great prediction from yesterday, making the system sluggish to react to shifting environments.

To maximize efficiency, modern sequential statistics borrows heavily from game theory and online machine learning by applying Online Gradient Descent (OGD) to martingale design. By introducing a tunable betting fraction $\lambda_t \in [0, 1]$ for each time step, chosen before $E_t$ is observed, the martingale updates as:

\[M_t = M_{t-1} (1 + \lambda_t (E_t - 1))\]

Instead of passively accepting the e-value, we are acting like a gambler choosing how much of our bankroll ($\lambda_t$) to bet on the model’s accuracy. We can treat the log-growth of this martingale as a standard convex optimization problem.

OGD acts as a dynamic betting engine: when recent e-values have been consistently above 1, the gradient of the log-wealth is positive and OGD raises $\lambda_t$ to grow the martingale faster, which shrinks the prediction sets. When the e-values turn noisy or fall below 1, the gradient pushes $\lambda_t$ back toward zero. This adaptive “learning rate” limits how much wealth can be wiped out by sudden, heavy-tailed shocks.
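
As a rough illustration of the betting update, here is a generic projected gradient step on the log-wealth. It is a sketch, not the scheme from the paper: the function name, `learning_rate`, and the cap `lambda_max` are hypothetical choices made for this example.

```python
def ogd_betting_wealth(e_values, learning_rate=0.1, lambda_max=0.5):
    """Run M_t = M_{t-1} * (1 + lambda_t * (E_t - 1)), updating lambda_t by
    projected online gradient ascent on the log-wealth."""
    wealth = 1.0
    lam = 0.0  # betting fraction for the next step, fixed before E_t is seen
    history = []
    for e in e_values:
        # Place the bet with the fraction chosen on the previous step.
        wealth *= 1.0 + lam * (e - 1.0)
        # Gradient of log(1 + lam * (e - 1)) with respect to lam.
        grad = (e - 1.0) / (1.0 + lam * (e - 1.0))
        # Gradient step, then projection back onto [0, lambda_max].
        lam = min(max(lam + learning_rate * grad, 0.0), lambda_max)
        history.append((wealth, lam))
    return history

# Toy stream: 20 e-values above 1, followed by 20 e-values below 1.
e_stream = [1.5] * 20 + [0.5] * 20
history = ogd_betting_wealth(e_stream)

lambda_after_growth = history[19][1]   # betting fraction after the first phase
lambda_after_shock = history[-1][1]    # betting fraction after the second phase
print(lambda_after_growth, lambda_after_shock)
```

Capping the fraction at `lambda_max` below 1 keeps each multiplier strictly positive, so the wealth can never hit zero and the gradient is always well defined.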

Real-World Systems: PID Control vs. E-Values

In advanced, enterprise-grade ML production pipelines, engineers often use Conformal PID Control—a feedback loop that continuously widens or narrows prediction intervals based on recent error rates, operating exactly like a household thermostat.

If PID control successfully adapts to distribution shifts automatically, why do we need the complexity of e-values and martingales at all?

Think of it this way: PID control is a thermostat, but an E-Value martingale is a fire alarm.

PID Control smoothly adapts to keep coverage stable. If an automated loan approval model slowly degrades over six months, the PID controller will quietly widen the prediction intervals to maintain $95\%$ coverage. However, to the business, the system appears to be working fine, masking the underlying model decay. It cannot tell you if the furnace is broken.

E-Value Martingales act as statistical watchdogs. While the PID controller manages the day-to-day fluctuations, the martingale monitors the foundational null hypothesis. Because of their rigorous, mathematically proven Type-I error bounds, e-values provide the definitive authorization required to trigger expensive, high-stakes interventions—like halting automated trading, waking up an on-call engineer at 3:00 AM, or triggering a massive GPU cluster to completely retrain a foundation model.
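
The “fire alarm” role can be sketched as a simple monitor that multiplies incoming e-values and fires once the running wealth reaches $1/\alpha$. The function name and the toy streams are illustrative, and the sketch assumes each incoming value is a valid e-value given the past.

```python
def first_alarm_time(e_values, alpha=0.05):
    """Return the first step (1-indexed) at which the running product of
    e-values reaches 1/alpha, or None if it never does."""
    threshold = 1.0 / alpha
    wealth = 1.0
    for t, e in enumerate(e_values, start=1):
        wealth *= e
        if wealth >= threshold:
            return t
    return None

# Toy streams: a quiet period, then the same period followed by a shock.
quiet_stream = [0.9, 1.1, 1.0, 0.95] * 5
shock_stream = quiet_stream + [3.0] * 5

print(first_alarm_time(quiet_stream))
print(first_alarm_time(shock_stream))
```

By Ville’s Inequality, a monitor of this kind raises a false alarm with probability at most $\alpha$ over the whole stream, which is what makes it suitable as a trigger for costly interventions.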

Innovative Applications

Where does this complex math actually pay off in practice? The paper highlights several areas where classical CP structurally fails:

Streaming Data with Sudden Shifts: In cybersecurity or fraud detection, adversaries constantly change tactics. E-values naturally absorb these distribution shifts. When the model’s assumptions fail, the e-values organically spike, triggering automated circuit breakers.

Ambiguous Ground Truth: In tasks like medical imaging, you rarely have a single clean label. Three doctors might diagnose a scan as “malignant,” while two say “benign.” Traditional rank-based CP breaks down because there is no single scalar to rank. E-values, being continuous ratios, can seamlessly integrate soft, probabilistic labels, weighting the numerator and denominator by expert consensus.

Fixed-Hardware Constraints: Imagine deploying an ML model to an edge device, an IoT sensor, or an automated lab pipette machine that physically holds exactly 5 candidate drugs. Classical CP cannot guarantee a set size of exactly 5 without destroying its error bounds. E-values allow you to flip the problem: you fix the set size to 5, and the algorithm outputs a data-dependent error level ($\alpha_t$) for that specific batch, which adapts to how difficult the current data point is.

Pros and Cons of Conformal E-Values

The Pros

True Anytime-Validity: You can stop experiments, pause data streams, peek at results, and intervene at any millisecond without ever corrupting your statistical safety.

Dynamic Budgeting: It allows for highly efficient data-dependent coverage, scaling your statistical confidence based on the inherent difficulty of the example.

Model Agnostic & Structurally Flexible: It makes absolutely no assumptions about the underlying machine learning model, and handles non-stationary streams and probabilistic labels natively.

The Cons

The “Conservativeness” Penalty: There is no free lunch in statistics. Because Ville’s Inequality has to protect against an adversary stopping the game at any moment across an infinite horizon, conformal e-value intervals are inherently wider and more conservative than classical CP when deployed in perfectly static, clean environments.

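The Burn-in Requirement: The sequential nature requires a warm-up period. A single e-value from Equation 4 cannot exceed $n+1$ when $n$ past scores sit in the denominator, so practitioners should expect roughly the first $1/\alpha$ steps to produce entirely uninformative prediction sets. Only after that can any candidate’s e-value become large enough for the label to be excluded.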

Computational Overhead: For classification tasks with massive label spaces (like the vocabulary of a Large Language Model), calculating a hypothetical denominator for every single possible candidate token at every time step introduces significant computational complexity compared to simple p-value sorting.

Conclusion

The Gauthier, Bach, and Jordan paper is a landmark contribution to uncertainty quantification. It proves that e-values aren’t meant to simply beat classical Conformal Prediction in static, perfectly controlled laboratory settings. Instead, e-values are the key to expanding CP’s scope, bringing rigorous, mathematically unshakeable uncertainty quantification into the messy, dynamic, and non-stationary realities of production machine learning. As AI systems are deployed in increasingly high-stakes environments, the transition from rigid p-values to dynamic e-values will become a foundational pillar of trustworthy AI.
