Survey of Conformal Predictions for LLMs

The field has split into four main directions.

  1. Discrete-answer settings (multiple-choice QA), where CP looks closest to classical classification.
  2. Open-ended generation and factuality filtering, where the output is free-form text and the guarantee is usually on whether the final response, or its claims, are correct.
  3. Decoding-time CP at the token or sequence level, where CP is integrated into sampling or beam search itself.
  4. Deployment-oriented extensions: domain shift, abstention, OOD detection, RAG, and evaluation of LLM judges.

MCQA

  1. Conformal Prediction with Large Language Models for Multi-Choice Question Answering (Kumar et al., 2023), ICML 2023 Neural Conversational AI TEACH workshop:
    • Motivation: UQ is needed for LLM MCQA
    • Contribution: First work to apply CP to MCQA
    • Method: Treats the answer choices as a finite label set and constructs prediction sets of options with coverage guarantees, as in classical classification. Uses model scores over the answer choices, calibrates on held-out QA instances, and returns a set of plausible choices rather than a single answer.
    • Experiments: Calibrates on same-topic data and on OOD datasets.
    • Findings: Softmax outputs are reasonably calibrated on average, but models suffer from underconfidence and overconfidence, especially at the tail ends of the probability distribution.
  2. API Is Enough: Conformal Prediction for LLMs Without Logit-Access (Su et al., 2024), EMNLP 2024:
    • Motivation: Many APIs provide only responses, not logits, and logits are known to be miscalibrated anyway.
    • Contribution: First CP work dedicated to LLMs without logit-access that provides a coverage guarantee for the prediction set.
    • Method: Samples a fixed number of responses (e.g., 30) for each input and uses the frequency of each response as a coarse-grained uncertainty notion, which reduces sampling requirements. The nonconformity measure combines this coarse-grained notion with two fine-grained ones. Normalized entropy (NE) is the entropy of the response frequencies for a prompt; it measures prompt-wise self-consistency, i.e., how uncertain or diverse the model’s responses to that prompt are, and alleviates concentration issues across different prompts. Semantic similarity (SS) measures how similar each non-top-1 response is to the most frequent (top-1) response within the same prompt, which mitigates concentration issues internal to a prompt. The nonconformity score is a weighted sum of the negative frequency, NE, and SS terms.
    • Findings: Superior performance compared to logit-based and logit-free baselines.
  3. Prune ’n Predict: Optimizing LLM Decision-making with Conformal Prediction (Vishwakarma et al., 2025), ICML 2025:
    • Motivation: A commonly taught strategy for human test takers facing multi-choice questions (MCQs) is the process of elimination (pruning) of incorrect (distractor) answer choices. The expectation is that LLMs are likewise more accurate on revised questions with fewer choices.
    • Contribution: Uses CP not only for uncertainty quantification but to improve downstream task accuracy. One of the clearest examples where CP becomes part of a decision loop, not just an uncertainty wrapper.
    • Method(s): Proposes two methods. Conformal revision of questions (CROQ) uses standard split CP to build a prediction set, revises the question by narrowing the available choices to those in the set, and asks the LLM the revised question. Commonly used logit scores often lead to large sets, however, which diminishes CROQ’s effectiveness. CP-OPT is an optimization framework that learns scores minimizing set size while maintaining coverage. It casts set size and coverage in differentiable form, as in conformal risk training, and adds regularization. The score function is learned from data: the input is the logits and the penultimate layer’s representations, the model is a simple MLP, and the output is the score. The thresholds are then set by calibration.
    • Findings: Reducing the number of response options leads to an improvement in accuracy, and this improvement is very nearly monotone. Empirical evaluation shows that this approach consistently improves accuracy compared to prompting the LLM with the original MCQ.
  4. Do Large Language Models Know When Not to Answer in Medical QA?, UncertaiNLP Workshop ACL 2025:
    • Contribution: Early results suggest a positive link between uncertainty estimates and abstention decisions, with this effect amplified under higher difficulty and adversarial perturbations.
    • Method: Tests three settings:
      • Abstention: Introduces an explicit abstention option to each question, allowing the model to refrain from answering when uncertain.
      • No-Abstention+Perturb: Aims to assess the model’s confidence when essential information is missing.
      • Abstention+Perturb: Combines both abstention and perturbation. The model is presented with questions where some necessary information has been removed, along with the option to abstain from answering.
    • Findings: A strong, positive uncertainty–abstention relationship and a consistently negative association between both APS- and LAC-based uncertainty and accuracy.
    • Follow-up benchmark work: Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty (Machcha et al., 2026). Experiments on open- and closed-source LLMs reveal that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and leads to safer abstention, far more than input perturbations do, while scaling model size or using advanced prompting brings little improvement.
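
The split-CP recipe that the MCQA papers above share, followed by a CROQ-style pruning step, can be sketched roughly as follows. This is a minimal illustration and not any of the authors' code: the nonconformity score is assumed to be one minus the probability of the true choice (the LAC-style score), the calibration probabilities are made-up toy numbers, and the function names and the revised-prompt format are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Order statistic at rank ceil((n + 1) * (1 - alpha)) of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every choice
    return sorted(scores)[rank - 1]

def calibrate(cal_probs, cal_labels, alpha):
    """Assumed score: 1 - probability the model assigns to the true choice."""
    scores = [1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels)]
    return conformal_quantile(scores, alpha)

def prediction_set(probs, qhat):
    """Indices of all choices whose score does not exceed the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]

def revise_question(stem, choices, kept):
    """Hypothetical CROQ-style revision: re-list only the retained choices."""
    letters = "ABCDEFGHIJ"
    lines = [f"{letters[new]}. {choices[old]}" for new, old in enumerate(kept)]
    return stem + "\n" + "\n".join(lines)

# Toy calibration data: per-question choice probabilities and true-choice indices.
cal_probs = [
    [0.70, 0.10, 0.10, 0.10], [0.20, 0.50, 0.20, 0.10], [0.10, 0.10, 0.60, 0.20],
    [0.25, 0.25, 0.10, 0.40], [0.05, 0.05, 0.10, 0.80], [0.90, 0.04, 0.03, 0.03],
    [0.30, 0.40, 0.20, 0.10], [0.15, 0.55, 0.20, 0.10], [0.10, 0.15, 0.65, 0.10],
]
cal_labels = [0, 1, 2, 0, 3, 0, 1, 1, 2]

qhat = calibrate(cal_probs, cal_labels, alpha=0.25)

stem = "Which gas makes up most of Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Argon", "Helium"]
test_probs = [0.42, 0.50, 0.05, 0.03]

kept = prediction_set(test_probs, qhat)
revised = revise_question(stem, choices, kept)
print(kept)
print(revised)
```

The revised prompt would then be sent back to the LLM. Under exchangeability of calibration and test questions, the set built this way contains the true choice with probability at least 1 - alpha; the same calibration code applies to any other per-choice score, provided the score used at test time is the one used for calibration.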

Open-ended generation, factuality, and claim filtering

  1. Conformal Language Modeling (Quach et al., 2024), ICLR 2024:
    • Contribution: Landmark paper for open-ended generation. Extends CP from finite labels to sets of generated sequences by calibrating a stopping rule for sampling candidate responses from an LM, together with a rejection rule for noisy candidates. The guarantee is that the sampled candidate set contains at least one acceptable response with high probability.
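
To make the idea of a calibrated stopping rule concrete, here is a heavily simplified sketch. It assumes each sampled candidate comes with a scalar quality score and, on calibration data, a binary "acceptable" label; sampling stops once a candidate's score reaches a threshold `lam`. The threshold is picked with a Learn-then-Test style fixed-sequence test using an exact binomial tail bound, which is one possible calibration procedure and an assumption here. The rejection rule is omitted, and all names and the toy data are hypothetical.

```python
import math

def build_set(candidates, lam):
    """Take candidates in sampling order; stop once one scores at least lam."""
    chosen = []
    for score, ok in candidates:
        chosen.append((score, ok))
        if score >= lam:
            break
    return chosen

def miss(candidates, lam):
    """True if the candidate set contains no acceptable response."""
    return not any(ok for _, ok in build_set(candidates, lam))

def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), computed exactly."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def calibrate_lambda(cal, grid, eps, delta):
    """Fixed-sequence testing from the most conservative threshold downward.

    Larger lam means later stopping and larger sets, so the miss rate is
    non-increasing in lam. Returns the smallest lam whose miss rate is
    certified to be at most eps, or None if none is certified.
    """
    chosen = None
    for lam in sorted(grid, reverse=True):
        k = sum(miss(c, lam) for c in cal)
        if binom_cdf(k, len(cal), eps) > delta:  # cannot reject "risk > eps"
            break
        chosen = lam
    return chosen

# Toy calibration prompts: lists of (quality score, acceptable?) in sampling order.
easy = [(0.9, True), (0.5, False), (0.4, False)]
medium = [(0.6, False), (0.8, True), (0.3, False)]
hard = [(0.4, False), (0.5, False), (0.7, True)]
hopeless = [(0.7, False), (0.6, False), (0.2, False)]
cal = [easy] * 10 + [medium] * 6 + [hard] * 3 + [hopeless]

lam_hat = calibrate_lambda(cal, grid=[0.5, 0.65, 0.75, 0.85, 1.0], eps=0.3, delta=0.1)
print(lam_hat)
```

With probability at least 1 - delta over the calibration draw, sets built with the returned threshold miss an acceptable response on at most a fraction eps of new prompts, assuming calibration and test prompts are i.i.d. Choosing the smallest certified threshold keeps the candidate sets as small as the guarantee allows.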


