Your worklist shows a notification: "Possible PE detected—prioritize for review." You click on the study. The AI triage tool has flagged it, but provides no probability, no confidence score, no indication of which images triggered the alert. Just a binary flag: positive or negative. Finding present or not.
This isn't a research prototype. This is the reality of most FDA-cleared AI devices deployed in radiology today. They're triage tools—designed to flag studies for priority review or route them to specific worklists. And the vast majority provide nothing beyond a binary decision because that is what the FDA clearance allows.
When the AI is wrong—and it will be wrong—you have no way to know how confident it was about that wrong answer. A growth plate or nutrient canal misclassified as a fracture gets the same binary "fracture detected" flag as an obvious displaced femur fracture. The triage system treats them identically.
"An AI that can't communicate uncertainty is like a doctor who says you are disease free or not—whether they're certain, concerned, a bit puzzled or just guessing."
The Binary Triage Problem
Most FDA-cleared AI devices function as triage tools. They analyze imaging studies in the background and generate binary alerts:
- "Intracranial hemorrhage detected"
- "Large vessel occlusion"
- "Pneumothorax detected"
- "Rib fracture identified"
The study moves up on your worklist, but critically, the alert carries no indication of confidence. You don't get a probability. You don't get an uncertainty measure. You often don't even get to see which image or which finding triggered the alert.
Behind the scenes, the model computed some internal score that crossed a threshold. But that threshold is hidden from you. A case that barely exceeded the threshold (model internal score: 51%) gets the same "detected" flag as an obvious case (model internal score: 99%).
Result: Every positive alert is treated as equally urgent. You have no way to know which alerts are reliable and which are edge cases. When the AI flags a growth plate as a fracture, you get the same urgent notification as you do for a displaced femur fracture.
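The collapse from continuous score to binary flag can be shown in a few lines. This is a hypothetical sketch (the function name and threshold are illustrative, not any vendor's actual logic), but it is structurally what a binary triage pipeline does:

```python
def triage_flag(internal_score: float, threshold: float = 0.5) -> str:
    """Collapse a continuous model score into the binary flag the
    radiologist actually sees. All score information is discarded."""
    return "detected" if internal_score >= threshold else "not detected"

# A borderline case and an obvious case produce identical alerts.
borderline = triage_flag(0.51)  # barely crossed the threshold
obvious = triage_flag(0.99)     # unambiguous finding
negative = triage_flag(0.49)    # one point below threshold: silence
```

Note the cliff at the threshold: a 0.49 and a 0.51 case differ by two percentage points internally but produce opposite alerts, while 0.51 and 0.99 look identical to the reader.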
Why This Matters Clinically
Consider two scenarios with a pneumothorax triage tool:
- Case A: Large tension pneumothorax on an upright chest X-ray. Obvious to any human observer. The AI model's internal score: 0.95 (highly confident).
- Case B: Skin fold artifact on a supine portable chest X-ray that vaguely resembles a pneumothorax. The AI model's internal score: 0.52 (barely above threshold, highly uncertain).
What you see in both cases: "Pneumothorax detected—exam moved to top of list."
Without uncertainty information, you treat both identically. You might immediately interrupt a procedure to review Case B, only to find it's an artifact. Meanwhile, the alert fatigue those false positives create means you might delay reviewing Case A.
What Uncertainty Quantification Actually Means
Uncertainty quantification (UQ) is the AI equivalent of a radiologist saying "I'm not sure about this one."
Instead of a simple binary flag, a UQ-enabled triage system could report how trustworthy each decision is. Uncertainty represents the trustworthiness of the model's decision: high uncertainty means the model doesn't have enough information to make a reliable call; low uncertainty means the prediction is likely trustworthy.1
Most importantly, UQ would allow triage systems to communicate: "I flagged this, but I'm not certain—please review carefully" versus "I flagged this and I'm highly confident—this is urgent."
Two Types of Uncertainty
Understanding where uncertainty comes from helps us reduce it:
- Aleatoric uncertainty (data uncertainty): Irreducible noise in the data itself. Some chest X-rays genuinely look identical but have different diagnoses based on patient history. No amount of training data will eliminate this—it's inherent ambiguity.
- Epistemic uncertainty (knowledge uncertainty): Uncertainty due to gaps in the model's training. The model hasn't seen enough diverse examples. This can be reduced by training on more varied data.1
When a model flags high epistemic uncertainty, it's telling you: "I haven't seen enough cases like this to be confident." That's valuable information—it tells you exactly when to intervene.
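The aleatoric/epistemic split has a standard operational form for ensembles: total predictive entropy decomposes into the average entropy of the members (aleatoric) plus their disagreement (epistemic). A minimal stdlib sketch, with toy two-class distributions as assumptions:

```python
import math

def entropy(p):
    """Shannon entropy of a categorical distribution (nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty: total = aleatoric + epistemic.
    member_probs: one class-probability list per ensemble member."""
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[c] for p in member_probs) / n for c in range(k)]
    total = entropy(mean_p)                                # predictive entropy
    aleatoric = sum(entropy(p) for p in member_probs) / n  # data noise
    epistemic = total - aleatoric                          # model disagreement
    return total, aleatoric, epistemic

# Members agree: epistemic uncertainty is near zero.
agree = [[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]]
# Members disagree: epistemic uncertainty is large ("I haven't seen this").
disagree = [[0.95, 0.05], [0.50, 0.50], [0.10, 0.90]]
```

High epistemic values here are exactly the "not enough cases like this" signal described above.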
Beyond the Black Box: A 2026 Perspective on Uncertainty Quantification in AI
1. The Reliability Crisis: Why "Accuracy" Is No Longer Enough
For the better part of a decade, the artificial intelligence community has been locked in a singular race: the pursuit of state-of-the-art (SOTA) accuracy. Whether pushing ImageNet scores by a fraction of a percentage or dominating the MMLU leaderboard, the metric of success was predictive performance.
As we settle into 2026, the focus is shifting, and I believe that shift is critical for adoption in healthcare. The deployment of Large Language Models (LLMs) and autonomous vision systems into high-stakes environments like healthcare has exposed a critical shortcoming. These models are very good at some tasks and sound convincing, but they are also confident sycophants: they frequently hallucinate in persuasive language, fail to recognize when data lies outside their training distribution, and struggle to say "I don't know."
This post provides a deep dive into the current state of Uncertainty Quantification (UQ). Drawing from the latest research, it explores how the field is moving beyond simple probability scores toward rigorous, distribution-free guarantees and semantic-aware reliability.
2. The Computer Vision Revolution: Conformal Prediction Takes Center Stage
In the realm of computer vision, the era of heuristic calibration (like Temperature Scaling) is fading. A strong candidate for quantifying uncertainty is Conformal Prediction (CP). Unlike traditional Bayesian methods, which often rely on unverifiable priors, CP offers mathematically rigorous, finite-sample coverage guarantees.
However, standard CP relies on the exchangeability assumption—the idea that tomorrow's test data will look statistically identical to yesterday's calibration data. In the real world, this assumption rarely holds.
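To make the CP idea concrete, here is a minimal split conformal sketch for classification (the calibration scores and probabilities are made-up assumptions): the nonconformity score is 1 minus the probability assigned to the true class, and the prediction set contains every class whose score stays under a finite-sample-adjusted quantile. Uncertain inputs yield larger sets.

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-adjusted (1 - alpha) quantile of calibration
    nonconformity scores (split conformal prediction)."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # conservative rank
    return sorted(cal_scores)[min(rank, n) - 1]

def prediction_set(probs, qhat):
    """All classes whose nonconformity score (1 - p) falls below qhat."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]

# Calibration: nonconformity = 1 - probability of the true class.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
qhat = conformal_quantile(cal_scores, alpha=0.2)

confident = prediction_set([0.90, 0.07, 0.03], qhat)  # singleton set
ambiguous = prediction_set([0.40, 0.35, 0.25], qhat)  # multiple candidates
```

The coverage guarantee holds only under exchangeability, which is exactly the assumption that distribution shift breaks.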
Breaking the Exchangeability Barrier: WQLCP
One recent approach to solving this "distribution shift" problem is Weighted Quantile Loss-scaled Conformal Prediction (WQLCP).1
Standard Conformal Prediction fails when the test distribution drifts (e.g., a self-driving car moving from sunny California to snowy Toronto). WQLCP addresses this by dynamically scaling the prediction sets based on the "weirdness" of the input.
Why Calibration Isn't Enough
Some might argue: "Just calibrate the model. If it says 80%, make sure 80% of those cases are actually positive." Calibration is valuable—but it's not sufficient. Here's why:
Calibration ensures probabilities match observed frequencies in the calibration data set. If your model predicts 70% pneumonia for 100 cases, and 70 of those actually have pneumonia, the model is calibrated.
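Checking this is straightforward: bin predictions by probability and compare each bin's mean prediction to its observed positive rate. A crude reliability-gap sketch on made-up data (real evaluations use proper ECE with larger samples):

```python
def calibration_gap(probs, labels, bins=5):
    """Worst-bin gap between mean predicted probability and observed
    positive rate: zero for a perfectly calibrated model."""
    gaps = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if not idx:
            continue  # skip empty bins
        mean_p = sum(probs[i] for i in idx) / len(idx)
        obs = sum(labels[i] for i in idx) / len(idx)
        gaps.append(abs(mean_p - obs))
    return max(gaps) if gaps else 0.0

# 100%-toy example: predict 0.7 for ten cases, seven truly positive.
probs = [0.7] * 10
labels = [1] * 7 + [0] * 3
```

This model passes the calibration check perfectly, yet the check says nothing about any single case, which is the point of the next paragraph.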
But calibration is population-level. It doesn't tell you whether this specific prediction is reliable. Even a perfectly calibrated model can be highly uncertain about some individual cases and very confident about others, and the reported probability alone doesn't distinguish them.1
Consider two well-calibrated pneumonia models:
- Model A: Trained on many cases of Klebsiella pneumonia. Sees a Klebsiella case, predicts 70% pneumonia with low uncertainty (0.05).
- Model B: Never trained on Klebsiella pneumonia. Sees the same case, predicts 70% pneumonia with high uncertainty (0.6).
Both are calibrated. Both output 70%. But Model B is far less reliable for this particular case because it had not seen Klebsiella before. Only uncertainty quantification reveals that.
Calibration Depends on Prevalence
Here's another problem: calibration isn't portable. A model calibrated on a population with 20% disease prevalence needs recalibration if you deploy it in a setting with 60% prevalence.
This means you can't just trust vendor claims about calibration. You need to validate calibration on your own population—and even then, you still need uncertainty quantification to know which individual predictions are trustworthy.
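The prevalence effect follows directly from Bayes' rule. This sketch (sensitivity and specificity values are illustrative assumptions, not any real device's figures) shows how the same model's alerts mean different things at different sites:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule: the meaning of an
    alert depends on disease prevalence, not just model quality."""
    tp = sensitivity * prevalence                # true positives
    fp = (1 - specificity) * (1 - prevalence)    # false positives
    return tp / (tp + fp)

# Same hypothetical model (95% sensitivity, 90% specificity):
low_prev = ppv(0.95, 0.90, 0.20)   # screening-like population
high_prev = ppv(0.95, 0.90, 0.60)  # sicker referral population
```

At 20% prevalence roughly 70% of alerts are true, while at 60% prevalence it exceeds 90%: identical model, very different alert reliability, so vendor calibration claims cannot transfer.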
How Uncertainty Quantification Works
There are several approaches to quantifying uncertainty in deep learning models. The most common include:
1. Ensemble Methods
Train multiple models (with different initializations or architectures) on the same task. For each new image, pass it through all models. If they all agree (e.g., all predict 80-85% pneumonia), uncertainty is low. If they disagree widely (predictions range from 30% to 90%), uncertainty is high.
Pros: Intuitive, effective
Cons: Computationally expensive—you need to train and run multiple models
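The agree/disagree logic reduces to looking at the spread of member predictions. A minimal sketch with made-up member outputs (real systems would use variance or the entropy decomposition instead of the raw range):

```python
def ensemble_uncertainty(predictions):
    """Summarize an ensemble: mean prediction plus spread.
    Wide disagreement between members signals high uncertainty."""
    mean = sum(predictions) / len(predictions)
    spread = max(predictions) - min(predictions)
    return mean, spread

# Members agree closely -> low uncertainty, trust the flag.
mean_a, spread_a = ensemble_uncertainty([0.80, 0.83, 0.85, 0.82])
# Members disagree widely -> high uncertainty, flag for review.
mean_b, spread_b = ensemble_uncertainty([0.30, 0.90, 0.55, 0.70])
```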
2. Bayesian and Dropout Methods (Monte Carlo Dropout)
Run the same model multiple times on the same image, but randomly "drop out" different neurons each time. This simulates having an ensemble without training multiple models. The variance in predictions reflects uncertainty.
Pros: Single model, relatively efficient
Cons: Still requires multiple forward passes at inference time1
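A toy Monte Carlo dropout sketch, assuming a one-layer linear model so it fits in a few lines (real implementations keep a network's dropout layers active at inference): random masks perturb each forward pass, and the spread of outputs is the uncertainty estimate.

```python
import math
import random
import statistics

def mc_dropout_predict(weights, features, passes=100, drop_p=0.5, seed=0):
    """Rerun the same model with random neuron masks; the mean is the
    prediction, the standard deviation is the uncertainty."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(passes):
        # Drop each weight with probability drop_p; rescale survivors.
        kept = [w / (1 - drop_p) if rng.random() > drop_p else 0.0
                for w in weights]
        logit = sum(w * x for w, x in zip(kept, features))
        outputs.append(1 / (1 + math.exp(-logit)))  # sigmoid
    return statistics.mean(outputs), statistics.stdev(outputs)

mean, std = mc_dropout_predict([1.2, -0.8, 0.5], [1.0, 0.3, 0.7])
```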
3. Evidential Deep Learning
Train the model to collect "evidence" for each class from the image features. The more evidence collected, the lower the uncertainty. This method can output uncertainty in a single forward pass.
Pros: Fast at inference, theoretically principled
Cons: More complex to implement, less tested in medical imaging1
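A minimal sketch of the evidential idea in the subjective-logic style (the evidence values are made-up assumptions; a real model would learn them from image features): per-class evidence maps to Dirichlet concentration parameters, and uncertainty shrinks as total evidence grows, all in one pass.

```python
def evidential_uncertainty(evidence):
    """Map per-class evidence to Dirichlet parameters. Uncertainty
    u = K / S falls as total evidence strength S grows."""
    k = len(evidence)
    alphas = [e + 1 for e in evidence]       # Dirichlet concentration
    strength = sum(alphas)
    probs = [a / strength for a in alphas]   # expected class probabilities
    uncertainty = k / strength               # vacuity: little evidence seen
    return probs, uncertainty

# Lots of evidence for class 0 -> confident, low uncertainty.
_, low_u = evidential_uncertainty([40.0, 2.0])
# Almost no evidence either way -> high uncertainty, single forward pass.
_, high_u = evidential_uncertainty([0.5, 0.5])
```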
Practical Applications of Uncertainty Quantification
Flagging Cases for Expert Review
The most straightforward use: if uncertainty exceeds a threshold, route the case to a radiologist for careful review. High-certainty cases can be auto-validated or given lower priority.
This creates a hybrid workflow where AI handles routine cases confidently, but defers to human expertise for edge cases—exactly the collaboration we want.
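The routing rule itself is trivial once an uncertainty score exists, which is the point: the hard part is producing the score, not using it. A hypothetical sketch (the threshold and route names are illustrative):

```python
def route_alert(flagged: bool, uncertainty: float,
                review_threshold: float = 0.3) -> str:
    """Hybrid workflow: confident positives interrupt immediately,
    uncertain positives queue for careful human review."""
    if not flagged:
        return "routine"
    return "expert-review" if uncertainty > review_threshold else "urgent"

urgent = route_alert(True, 0.05)    # confident positive
review = route_alert(True, 0.60)    # flagged but uncertain
routine = route_alert(False, 0.90)  # negative, regardless of uncertainty
```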
Improving Model Performance
Uncertainty quantification can directly improve diagnostic accuracy. For segmentation tasks, knowing which pixels the model is uncertain about allows for refinement.
One study showed that incorporating uncertainty into brain tumor segmentation improved Dice coefficients by 3.15% for enhancing tumor and 0.58% for necrotic tumor—clinically meaningful improvements for treatment planning.3
Efficient Active Learning
As new data becomes available, you want to retrain your model. But labeling data is expensive. Which new cases should radiologists annotate?
Answer: The ones the model is most uncertain about. These are the cases that will teach the model the most. UQ enables efficient, targeted data collection instead of randomly labeling everything.1
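This selection strategy is known as uncertainty sampling. A sketch with hypothetical case IDs and scores: rank the unlabeled pool by uncertainty and spend the annotation budget at the top.

```python
def select_for_annotation(cases, budget=2):
    """Uncertainty sampling: spend the labeling budget on the cases
    the model is least sure about. cases: (case_id, uncertainty)."""
    ranked = sorted(cases, key=lambda c: c[1], reverse=True)
    return [case_id for case_id, _ in ranked[:budget]]

# Hypothetical unlabeled pool with model uncertainty per case.
pool = [("ex1", 0.05), ("ex2", 0.62), ("ex3", 0.40), ("ex4", 0.10)]
to_label = select_for_annotation(pool, budget=2)  # most uncertain first
```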
Detecting Bias and Distribution Shift
High uncertainty can reveal when a model is being applied outside its training distribution—which often indicates bias.
Example: A pediatric pneumonia model deployed on adult patients will show high epistemic uncertainty on adult cases. That's a red flag that the model is unreliable for this population. Without UQ, you'd only discover the problem through clinical errors.
As discussed in our AI bias article, models trained on non-representative populations propagate that bias downstream. UQ can help detect when a model is uncertain specifically because it's seeing a demographic it wasn't trained on.
What to Demand from Vendors
If you're evaluating AI triage tools for clinical use, here are essential questions about uncertainty quantification:
1. Does the triage tool provide any confidence or uncertainty information with alerts?
Current FDA-cleared devices don't. They provide binary output: "positive" or "negative."
Minimum acceptable: Confidence levels (high/medium/low) displayed with each alert
Better: Quantitative uncertainty scores that you can use to set routing rules
Best: UQ methods that also show which image features drove the alert
Red flag: Vendor says "the model has 95% sensitivity, and all alerts are equally urgent." That's population-level performance. It doesn't tell you which individual alerts are reliable.
2. How is uncertainty quantified?
If they provide uncertainty at all, ask which method they use (ensemble, Bayesian, evidential, etc.). Often the vendor will present the output of the penultimate layer as a probability (it isn't), or refer to the calibrated probability as the confidence (it isn't). Be sure you understand this.
3. What's the positive predictive value stratified by confidence level?
Don't accept overall PPV. Ask: among "high confidence" alerts, what's the PPV? Among "low confidence" alerts? If they can't provide this, they haven't properly validated their uncertainty quantification.
Example acceptable answer: "High confidence alerts have 65% PPV, low confidence have 12% PPV in our validation study."
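Stratified PPV is simple to compute from a validation log of alerts and outcomes. A sketch on made-up data chosen to reproduce the hypothetical vendor answer above:

```python
def stratified_ppv(alerts):
    """PPV broken down by confidence tier.
    alerts: list of (confidence_tier, was_true_positive) pairs."""
    totals, hits = {}, {}
    for tier, tp in alerts:
        totals[tier] = totals.get(tier, 0) + 1
        hits[tier] = hits.get(tier, 0) + (1 if tp else 0)
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Hypothetical validation log: 20 high-confidence, 25 low-confidence alerts.
alerts = ([("high", True)] * 13 + [("high", False)] * 7
          + [("low", True)] * 3 + [("low", False)] * 22)
by_tier = stratified_ppv(alerts)  # {"high": 0.65, "low": 0.12}
```

If a vendor cannot produce this table from their validation data, their confidence tiers are decoration.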
4. Can I configure alert behavior based on confidence?
Ideally, you should be able to set rules like: "High confidence alerts → urgent notification. Low confidence → add to worklist but don't interrupt workflow." Vendors should provide tools to customize this for your practice.
5. How does the tool handle out-of-distribution cases?
Ask: if the model sees a pediatric study but was trained on adults, does it flag high uncertainty? If it encounters artifact or unusual positioning, does uncertainty increase?
If the vendor says "the model handles all cases equally well," that's impossible and suggests they haven't thought about uncertainty at all.
The Path Forward: Building Trust Through Uncertainty
The paradox of AI in medicine is that models need to be both accurate and humble. An AI that confidently misdiagnoses is worse than useless—it's dangerous and trust-destroying.
Uncertainty quantification solves this by giving AI systems a way to communicate doubt. When the model says "I'm uncertain," it's not admitting failure—it's demonstrating trustworthiness. It's saying: "This case is outside my expertise. A human should look at this."
That kind of honest communication is what allows AI to integrate into clinical workflows sustainably. Radiologists don't expect perfection—they expect reliability and transparency. UQ provides both.
What This Means for Deployment
As we discussed in our article on AI deployment failures, most medical AI never makes it to the bedside. One major reason: lack of trust.
Models that can't communicate uncertainty fail in deployment because:
- They generate too many false positives (eroding trust)
- Radiologists can't distinguish reliable from unreliable predictions
- There's no mechanism to route uncertain cases appropriately
- Each confident error damages credibility
UQ-enabled models, by contrast, can:
- Identify their own limitations proactively
- Route cases intelligently based on confidence
- Maintain trust by acknowledging uncertainty
- Enable targeted review of edge cases
The Bottom Line
Most FDA-cleared AI triage tools today provide binary alerts: "detected" or "not detected." No confidence scores. No uncertainty measures. Often, not even a display of what was detected. Every alert is treated as equally urgent, whether the model is highly confident or barely crossed the detection threshold.
This binary approach is fundamentally flawed. When you can't distinguish confident predictions from uncertain guesses, alert fatigue is inevitable. Radiologists learn to ignore flags or treat them all with equal skepticism, undermining the tool's value.
Uncertainty quantification is the solution. It's technically feasible—the methods exist and have been validated in research. What's needed is adoption: demanding that vendors add confidence levels to triage alerts, and refusing to tolerate systems that treat every flag as equally reliable.
A triage tool that says "ICH detected—highly confident" versus "ICH detected—low confidence" enables smart workflow routing. High-confidence alerts get immediate attention. Low-confidence alerts are reviewed but don't trigger urgent interruptions. Trust is preserved because the AI acknowledges its limitations.
The most dangerous AI isn't the one that's sometimes wrong—it's the one that's wrong with the same conviction it uses when it's right. The best triage tools are the ones that know when to say "I'm not sure about this one."
Key Takeaways
- Most Triage Tools Are Binary: FDA-cleared AI devices typically provide only "detected" or "not detected" flags—no probabilities, no confidence scores, often no image display.
- Binary Alerts Treat All Cases Equally: A barely-above-threshold detection (internal score: 51%) gets the same urgent flag as an obvious case (internal score: 99%). You can't tell which to trust.
- Alert Fatigue Is a Risk: When 85% of flags are false positives and all look equally urgent, radiologists learn to ignore or deprioritize alerts, undermining the tool's value.
- UQ Would Enable Smart Triage: Confidence levels (high/medium/low) would let you prioritize truly urgent alerts while treating uncertain flags appropriately.
- Two Types of Uncertainty: Aleatoric (data ambiguity, irreducible) and epistemic (knowledge gaps like out-of-distribution cases, reducible with more diverse training data).
- Trust Requires Humility: A triage tool that can flag "high confidence" vs. "low confidence" is more useful than one that treats every alert identically. The best AI knows when it's uncertain.
- Demand Relevant Performance Data from Vendors: Ask: Does it show confidence? Can I see what was detected? What's the PPV stratified by confidence level? Can I configure alert behavior?
References & Further Reading
- Faghani S, Moassefi M, Rouzrokh P, Khosravi B, Baffour FI, Ringler MD, Erickson BJ. Quantifying Uncertainty in Deep Learning of Radiologic Images. Radiology. 2023;308(2):e222217. doi:10.1148/radiol.222217
- Dohopolski M, Chen L, Sher D, Wang J. Predicting Lymph Node Metastasis in Patients with Oropharyngeal Cancer by Using a Convolutional Neural Network with Associated Epistemic and Aleatoric Uncertainty. Phys Med Biol. 2020;65(22):225002. doi:10.1088/1361-6560/abc2d0
- Lee J, Shin D, Oh SH, Kim H. Method to Minimize the Errors of AI: Quantifying and Exploiting Uncertainty of Deep Learning in Brain Tumor Segmentation. Sensors (Basel). 2022;22(6):2406. doi:10.3390/s22062406
- Guo C, Pleiss G, Sun Y, Weinberger KQ. On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning. 2017;1321-1330.
- Abdar M, Pourpanah F, Hussain S, et al. A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges. Inf Fusion. 2021;76:243-297. doi:10.1016/j.inffus.2021.05.008
- Ozdemir O, Russell RL, Berlin AA. A 3D Probabilistic Deep Learning System for Detection and Diagnosis of Lung Cancer Using Low-Dose CT Scans. IEEE Trans Med Imaging. 2020;39(5):1419-1429. doi:10.1109/TMI.2019.2947595
- Rajaraman S, Ganesan P, Antani S. Deep Learning Model Calibration for Improving Performance in Class-Imbalanced Medical Image Classification Tasks. PLoS One. 2022;17(1):e0262838. doi:10.1371/journal.pone.0262838