Probability calibration is a fundamental step in modern machine learning pipelines, ensuring that predicted probabilities faithfully reflect observed frequencies. While classical methods such as Platt scaling and isotonic regression are widely adopted, they come with important limitations, including restrictive assumptions and vulnerability to overfitting, particularly when calibration data are scarce. This comprehensive review examines the evolution of calibration methods, from early histogram binning approaches to modern neural calibration techniques and conformal prediction methods. We highlight the theoretical foundations, practical challenges, and empirical performance of various approaches, with particular attention to Venn–ABERS predictors, which provide interval-based probability estimates with provable validity guarantees. The article also covers evaluation metrics, domain-specific applications, and emerging trends in uncertainty quantification, providing practitioners with a thorough guide to selecting and implementing appropriate calibration methods for their specific contexts.
1. Introduction
1.1 The Calibration Problem
Machine learning models often produce probability estimates as part of their output. For example, a binary classifier may predict that a sample belongs to the positive class with probability 0.8. For downstream tasks such as decision-making, ranking, cost-sensitive learning, or risk assessment, it is crucial that this probability is calibrated: across many such predictions with confidence 0.8, approximately 80% of the samples should indeed be positive.
Formally, a predictor is said to be calibrated if, for any predicted probability p, the conditional probability of the positive class given the prediction equals p:
P(Y = 1 | f(X) = p) = p
where f(X) represents the model’s probability prediction for input X, and Y is the true binary label.
1.2 The Miscalibration Crisis
Unfortunately, many modern learners, particularly high-capacity models such as gradient-boosted trees, random forests, and deep neural networks, are known to be poorly calibrated out of the box (Guo et al., 2017; Minderer et al., 2021). This miscalibration manifests in several ways. Overconfidence is common in modern neural networks, especially those with high capacity, which tend to produce overly confident predictions with predicted probabilities clustering near 0 and 1. Conversely, underconfidence can occur in some scenarios, particularly with ensemble methods or when using techniques like dropout, where models may be systematically underconfident. Additionally, non-monotonic miscalibration can arise where the relationship between predicted probabilities and actual frequencies becomes complex and non-linear.
1.3 Why Calibration Matters
Proper calibration is essential for numerous applications. In medical diagnosis, probability estimates guide treatment decisions and risk stratification of patients. Financial risk assessment relies on accurate probability estimates for loan approval and investment decisions. Autonomous systems require well-calibrated uncertainty estimates for safe operation in unpredictable environments. Scientific discovery applications benefit from honest uncertainty quantification that reflects true epistemic limitations. Additionally, model selection and ensemble weighting processes can leverage calibrated probabilities to enable better meta-learning approaches.
2. Theoretical Foundations
2.1 Perfect Calibration vs. Refinement
The quality of probabilistic predictions can be decomposed into two components: calibration and refinement. Calibration measures how well predicted probabilities match observed frequencies, while refinement captures how far the conditional class probabilities P(Y | p) deviate from the base rate.
The Brier score, a popular proper scoring rule, decomposes as: Brier Score = Reliability - Resolution + Uncertainty
where Reliability measures miscalibration, Resolution measures refinement, and Uncertainty is the inherent randomness in the data.
2.2 Types of Calibration
Marginal Calibration: The overall predicted probabilities match observed frequencies across the entire dataset.
Conditional Calibration: Probabilities are calibrated within subgroups defined by additional covariates or features.
Distributional Calibration: The entire predictive distribution is well-calibrated, not just point estimates.
3. Classical Calibrators and Their Limitations
3.1 Histogram Binning
The earliest calibration method involves partitioning predictions into bins and replacing each prediction with the empirical frequency within its bin.
Algorithm:
- Sort calibration data by predicted probability
- Divide into B equal-width or equal-frequency bins
- Replace predictions in each bin with the bin’s empirical frequency
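A minimal sketch of equal-width histogram binning with NumPy is shown below; the function names and the default of ten bins are illustrative choices, not taken from any particular library:
import numpy as np

def fit_histogram_binning(cal_scores, cal_labels, n_bins=10):
    # Equal-width bins on [0, 1]; each bin stores its empirical positive frequency
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(cal_scores, edges[1:-1]), 0, n_bins - 1)
    bin_freq = np.array([
        cal_labels[bin_ids == b].mean() if np.any(bin_ids == b) else 0.5
        for b in range(n_bins)
    ])
    return edges, bin_freq

def apply_histogram_binning(scores, edges, bin_freq):
    # Replace each score with the empirical frequency of its bin
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, len(bin_freq) - 1)
    return bin_freq[bin_ids]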
Advantages: Histogram binning is simple, non-parametric, and highly interpretable, making it accessible to practitioners across different domains.
Pitfalls: The method is sensitive to bin boundaries and requires careful selection of the number of bins. It is also prone to overfitting with small datasets, where individual bins may not contain sufficient samples for reliable frequency estimation.
3.2 Platt Scaling
Platt scaling (Platt, 1999) applies a logistic regression transformation to the uncalibrated scores of a base classifier:
P_calibrated = 1 / (1 + exp(A × s + B))
where s represents the uncalibrated score, and A, B are learned parameters.
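A minimal sketch of Platt scaling using scikit-learn's logistic regression on the one-dimensional score is given below (Platt's original formulation also smooths the targets, which is omitted here):
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(cal_scores, cal_labels):
    # Fit the sigmoid 1 / (1 + exp(A*s + B)) via (essentially unregularized) logistic regression
    lr = LogisticRegression(C=1e6)
    lr.fit(np.asarray(cal_scores).reshape(-1, 1), cal_labels)
    return lr

def apply_platt(lr, scores):
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]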
Advantages: Platt scaling is simple and computationally efficient, making it well-suited for support vector machines and other margin-based methods. It provides smooth, monotonic calibration curves and requires minimal calibration data to achieve reasonable performance.
Pitfalls: The method suffers from an overly restrictive assumption of a sigmoid shape for the calibration function. Underfitting occurs when the true calibration curve deviates significantly from sigmoid behavior, and the method may not perform well with highly non-linear miscalibrations. Additionally, it assumes the uncalibrated scores follow a particular distributional form, which may not hold in practice.
3.3 Isotonic Regression
Isotonic regression (Zadrozny & Elkan, 2002) offers a non-parametric alternative, learning a monotonic stepwise function that maps scores to probabilities using the Pool Adjacent Violators (PAV) algorithm.
Algorithm:
- Sort calibration data by predicted probability
- Apply PAV algorithm to find the isotonic regression
- Use the resulting step function for calibration
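In practice the PAV step does not need to be hand-coded; a minimal sketch using scikit-learn's IsotonicRegression, with outputs constrained to [0, 1], looks as follows:
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(cal_scores, cal_labels):
    # PAV-based monotonic fit; scores outside the calibration range are clipped
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(cal_scores, cal_labels)
    return iso

# Usage: calibrated = fit_isotonic(cal_scores, cal_labels).predict(test_scores)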
Advantages: Isotonic regression is flexible and capable of capturing arbitrary monotonic calibration functions. Being non-parametric, it makes no distributional assumptions about the underlying data. The method is guaranteed to produce monotonic outputs and can handle complex calibration curves that deviate significantly from simple parametric forms.
Pitfalls: The method is prone to overfitting, particularly with small calibration sets, and may produce sharp discontinuities that degrade generalization performance. A phenomenon known as “step overfitting” occurs where small fluctuations in calibration data cause large changes in the learned mapping function. Furthermore, isotonic regression provides no theoretical guarantees about the quality of the resulting calibration.
3.4 Temperature Scaling
Temperature scaling (Guo et al., 2017) has gained popularity in deep learning. It rescales logits by a learned temperature parameter τ before applying the softmax:
P_calibrated = softmax(z / τ)
where z represents the model’s logits.
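A minimal sketch of fitting the temperature by minimizing the negative log-likelihood on held-out logits is shown below (the search bounds for τ are an illustrative choice):
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(cal_logits, cal_labels):
    # cal_logits: (n, k) array of logits; cal_labels: integer class indices
    def nll(tau):
        probs = softmax(cal_logits / tau)
        return -np.mean(np.log(probs[np.arange(len(cal_labels)), cal_labels] + 1e-12))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Usage: calibrated = softmax(test_logits / fit_temperature(cal_logits, cal_labels))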
Advantages: Temperature scaling is extremely simple, requiring only one parameter to learn, and proves particularly effective for neural networks trained with cross-entropy loss. The method preserves the relative ordering of predictions and does not change the model’s accuracy since argmax predictions remain unchanged. It is also fast to compute and apply in production settings.
Pitfalls: Like Platt scaling, temperature scaling is parametric and limited in flexibility, assuming a uniform miscalibration across all confidence levels. The method is unsuitable for highly non-linear or class-dependent miscalibrations and may not work well for models not trained with cross-entropy loss or other standard objective functions.
3.5 Matrix and Vector Scaling
Extensions of temperature scaling include:
Matrix Scaling: Applies a linear transformation to the logits: softmax(Wz + b)
Vector Scaling: Uses class-dependent temperatures: softmax(z ⊙ t + b)
These methods offer more flexibility while maintaining computational efficiency.
4. Advanced Calibration Methods
4.1 Beta Calibration
Beta calibration (Kull et al., 2017) assumes that the calibration curve follows a beta distribution, offering more flexibility than Platt scaling while maintaining parametric efficiency.
The method fits three parameters (a, b, c) to the model: P_calibrated = sigmoid(a × ln(p) − b × ln(1 − p) + c)
where p is the uncalibrated probability.
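Kull et al. show that this fit reduces to logistic regression on the transformed features ln(p) and −ln(1 − p); a minimal sketch, which omits the non-negativity constraints on a and b enforced by the full method, is:
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(cal_probs, cal_labels, eps=1e-12):
    # Coefficients correspond to (a, b); the intercept corresponds to c
    p = np.clip(cal_probs, eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1 - p)])
    return LogisticRegression(C=1e6).fit(X, cal_labels)

def apply_beta_calibration(model, probs, eps=1e-12):
    p = np.clip(probs, eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1 - p)])
    return model.predict_proba(X)[:, 1]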
4.2 Spline-Based Calibration
Spline calibration uses piecewise polynomial functions to create smooth, flexible calibration curves. This approach balances the flexibility of isotonic regression with the smoothness of parametric methods.
4.3 Bayesian Binning into Quantiles (BBQ)
BBQ (Naeini et al., 2015) provides a Bayesian approach to histogram binning, averaging over multiple binning schemes via Bayesian model averaging rather than committing to a single number of bins, and providing uncertainty estimates about the calibration function itself.
4.4 Ensemble Temperature Scaling
For ensemble models, specialized calibration methods account for the correlation structure between ensemble members, typically requiring separate temperature parameters for each ensemble component.
5. Neural Network Calibration
5.1 Why Neural Networks Are Miscalibrated
Several factors contribute to poor calibration in neural networks. Overfitting in high-capacity models leads to memorization of training data, resulting in overconfident predictions. Model size plays a role, as larger networks tend to be more miscalibrated. Training procedures including modern practices like data augmentation and batch normalization can affect calibration properties. Furthermore, different architecture choices exhibit varying calibration characteristics, with some designs being inherently better calibrated than others.
5.2 Regularization-Based Approaches
Label Smoothing: Replaces hard targets with soft distributions, reducing overconfidence during training.
Dropout: Using dropout at inference time can improve calibration by providing uncertainty estimates.
Batch Normalization: Can affect calibration properties, sometimes requiring specialized handling.
5.3 Multi-Class Calibration
Extending calibration to multi-class settings presents additional challenges:
One-vs-All: Apply binary calibration to each class separately
Matrix Scaling: Learn a full linear transformation of logits
Dirichlet Calibration: Model the calibration function as a Dirichlet distribution
6. Overfitting and Miscalibration in Small Data Regimes
6.1 The Small Data Challenge
The problem becomes particularly acute when calibration must be performed with limited data. This scenario is common in medical applications with rare diseases, industrial quality control with few defects, specialized scientific domains with expensive data collection, and real-time systems requiring quick adaptation to new conditions.
6.2 Manifestations of Small Data Problems
Isotonic Regression Overfitting: Small fluctuations in the calibration set can induce disproportionate changes in the learned mapping, creating jagged, unreliable calibration curves.
Bin Selection Sensitivity: Histogram binning becomes extremely sensitive to the number and boundaries of bins.
Parameter Instability: Even simple methods like Platt scaling can exhibit high variance in parameter estimates.
6.3 Diagnostic Tools
Bootstrap Analysis: Assess calibration stability by resampling the calibration set
Cross-Validation: Use nested cross-validation to select calibration hyperparameters
Learning Curves: Plot calibration performance vs. calibration set size
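As an example of the bootstrap diagnostic above, the following sketch refits Platt scaling on bootstrap resamples of the calibration set and reports the spread of the learned parameters (all names are illustrative):
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_platt_stability(cal_scores, cal_labels, n_boot=200, seed=0):
    # Large standard deviations of the fitted slope/intercept across resamples
    # indicate that the calibration set is too small for stable calibration
    rng = np.random.default_rng(seed)
    n, params = len(cal_scores), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(cal_labels[idx])) < 2:
            continue  # skip degenerate resamples containing a single class
        lr = LogisticRegression(C=1e6).fit(cal_scores[idx].reshape(-1, 1), cal_labels[idx])
        params.append((lr.coef_[0, 0], lr.intercept_[0]))
    return np.asarray(params).std(axis=0)  # std of (slope, intercept)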
7. Venn–ABERS Predictors
7.1 Theoretical Foundations
Venn–ABERS predictors are rooted in the framework of Venn prediction theory (Vovk et al., 2015) and conformal prediction. Unlike classical calibrators that map scores directly to single probabilities, Venn predictors partition the calibration data into categories and compute conditional probabilities within each partition.
The method is based on several key principles:
- Exchangeability: Assumes calibration data and test data are exchangeable
- Validity: Provides theoretical guarantees about long-run calibration
- Efficiency: Aims to provide tight probability intervals
7.2 The ABERS Algorithm
The acronym ABERS comes from the initials of Ayer, Brunk, Ewing, Reid, and Silverman, the authors of the isotonic regression (Pool Adjacent Violators) procedure on which the method is built. The inductive Venn–ABERS predictor works as follows:
- Scoring: Train the base model on a proper training set and score a held-out calibration set
- Hypothetical Labeling: For a test object with score s, form two augmented calibration sets, one adding (s, 0) and one adding (s, 1)
- Isotonic Fitting: Run isotonic regression on each augmented set and evaluate the fitted function at s, giving p0 and p1
- Interval Output: Return the pair (p0, p1) as an interval-valued probability of the positive class
7.3 Interval-Valued Probabilities
A distinctive feature is that Venn predictors output two probabilities, effectively an interval estimate [p_lower, p_upper]. This interval provides several benefits:
- Uncertainty Quantification: Wider intervals indicate greater uncertainty
- Validity Guarantees: Under exchangeability, one of the two output probabilities is guaranteed to be perfectly calibrated in the long run
- Risk-Aware Decisions: Decision-makers can account for uncertainty in their choices
The interval can be reduced to a single point prediction when required, for example by simple averaging or by the log-loss-oriented merge p = p1 / (1 − p0 + p1) suggested by Vovk et al., but the interval itself provides valuable uncertainty information.
7.4 Validity Properties
Venn–ABERS predictors are valid under the exchangeability assumption alone: among the probabilities they output, at least one is perfectly calibrated in the long run, so predicted probabilities match observed frequencies without further distributional assumptions. This validity also holds within the categories defined by the underlying Venn taxonomy, giving a form of conditional calibration.
7.5 Empirical Performance
Studies have reported that Venn–ABERS calibration is more robust than isotonic regression and Platt scaling in several settings: small-sample regimes, where performance degrades gracefully as the calibration set shrinks; imbalanced datasets, where calibration quality is maintained even under severe class imbalance; distribution shift, where the method is more robust to changes in the data distribution; and noisy labels, where it is less sensitive to label noise in the calibration data.
Performance gains are often quantified through lower Expected Calibration Error (ECE), improved Brier scores, better reliability diagram behavior, and more stable performance across different data splits.
8. Evaluation Metrics for Calibration
8.1 Expected Calibration Error (ECE)
ECE measures the weighted average of the absolute differences between accuracy and confidence:
ECE = Σ (n_b / n) | acc(b) - conf(b) |
where b indexes bins, n_b is the number of samples in bin b, acc(b) is the accuracy in bin b, and conf(b) is the average confidence in bin b.
Variants:
- Static ECE: Uses fixed bin boundaries
- Adaptive ECE: Uses quantile-based binning
- Class-wise ECE: Computes ECE separately for each class
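A minimal sketch computing the static (equal-width-bin) ECE, together with the MCE described in the next subsection, is given below:
import numpy as np

def ece_mce(probs, labels, n_bins=15):
    # probs: predicted probability of the positive class; labels: 0/1 outcomes
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ece, gaps = 0.0, []
    for b in range(n_bins):
        mask = bin_ids == b
        if not np.any(mask):
            continue
        conf = probs[mask].mean()    # average confidence in the bin
        acc = labels[mask].mean()    # observed positive frequency in the bin
        gap = abs(acc - conf)
        ece += mask.mean() * gap     # mask.mean() equals n_b / n
        gaps.append(gap)
    return ece, max(gaps, default=0.0)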
8.2 Maximum Calibration Error (MCE)
MCE measures the worst-case calibration error across all bins:
MCE = max_b | acc(b) - conf(b) |
8.3 Brier Score
The Brier score measures the mean squared difference between predicted probabilities and binary outcomes:
BS = (1/n) Σ (p_i - y_i)²
It decomposes into reliability (miscalibration), resolution (refinement), and uncertainty components.
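The score and a binned version of its decomposition can be computed as follows; the decomposition is exact only when forecasts take finitely many values, so with continuous probabilities the binned version is an approximation:
import numpy as np

def brier_score(probs, labels):
    return np.mean((probs - labels) ** 2)

def brier_decomposition(probs, labels, n_bins=10):
    # Returns (reliability, resolution, uncertainty); BS ≈ rel - res + unc
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    base_rate = labels.mean()
    unc = base_rate * (1.0 - base_rate)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not np.any(mask):
            continue
        w = mask.mean()                 # n_b / n
        f_b = probs[mask].mean()        # mean forecast in the bin
        o_b = labels[mask].mean()       # observed frequency in the bin
        rel += w * (f_b - o_b) ** 2
        res += w * (o_b - base_rate) ** 2
    return rel, res, unc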
8.4 Reliability Diagrams
Visual representations plotting predicted probability vs. observed frequency, typically with bins. Well-calibrated models should show points near the diagonal.
8.5 Calibration Plots and Histograms
Calibration Plots: Show the relationship between predicted and actual probabilities
Confidence Histograms: Display the distribution of predicted probabilities
Gap Plots: Visualize the difference between predicted and observed frequencies
8.6 Proper Scoring Rules
Log Loss: Measures the negative log-likelihood of predictions
Brier Score: As described above
Spherical Score: A proper scoring rule that divides the probability assigned to the observed outcome by the Euclidean norm of the forecast vector
8.7 Statistical Tests
Hosmer-Lemeshow Test: Chi-square test for goodness of calibration
Spiegelhalter’s Z-test: Tests calibration in probabilistic models
Bootstrap Confidence Intervals: Provide uncertainty estimates for calibration metrics
9. Practical Implementation Guidelines
9.1 Standard Calibration Pipeline
- Data Splitting: Reserve a held-out calibration set (typically 10-20% of available data)
- Base Model Training: Train the primary model on the training set
- Calibration Method Selection: Choose appropriate calibration method based on data characteristics
- Hyperparameter Tuning: Use nested cross-validation for calibration hyperparameters
- Final Calibration: Apply chosen method to calibration set
- Evaluation: Assess calibration quality on a separate test set
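A minimal end-to-end sketch of this pipeline, using a random forest as the base model and isotonic regression as the (arbitrarily chosen) calibrator, might look like this:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# 1. Data splitting: 60% train, 20% calibration, 20% test
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# 2. Base model training on the training set only
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 3-5. Fit the chosen calibrator on held-out calibration scores
cal_scores = model.predict_proba(X_cal)[:, 1]
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(cal_scores, y_cal)

# 6. Evaluate calibrated probabilities on the untouched test set
test_probs = calibrator.predict(model.predict_proba(X_test)[:, 1])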
9.2 Method Selection Guidelines
Small Datasets (< 1000 samples):
- Venn–ABERS predictors
- Platt scaling for stable results
- Avoid isotonic regression
Medium Datasets (1000-10000 samples):
- Isotonic regression often performs well
- Temperature scaling for neural networks
- Beta calibration for complex curves
Large Datasets (> 10000 samples):
- All methods typically viable
- Isotonic regression and spline methods excel
- Consider computational efficiency
Neural Networks:
- Start with temperature scaling
- Consider ensemble temperature scaling for ensembles
- Matrix/vector scaling for multi-class problems
9.3 Implementation Considerations
Cross-Validation Strategies: Use stratified k-fold CV to maintain class balance across folds
Computational Efficiency: Consider the trade-off between calibration quality and inference speed
Memory Requirements: Some methods (like isotonic regression) require storing calibration data
Hyperparameter Sensitivity: Assess robustness to hyperparameter choices using bootstrap sampling
9.4 Common Pitfalls and Best Practices
Data Leakage: Ensure strict separation between training, calibration, and test sets
Insufficient Calibration Data: Reserve adequate data for calibration (at least 100-1000 samples when possible)
Evaluation Bias: Use proper cross-validation schemes to avoid overly optimistic calibration estimates
Method Overfitting: Don’t over-optimize calibration hyperparameters on the test set
10. Venn–ABERS: Detailed Implementation
10.1 Step-by-Step Algorithm
Input: Calibration set {(x_i, y_i, s_i)} where s_i is the uncalibrated score
Step 1: Sorting
Sort calibration examples by score s_i (this ordering is what the isotonic regression fits operate on)
Step 2: Probability Calculation
For each test example with score s:
Fit isotonic regression to the calibration set augmented with the pair (s, 0); let p0 be the fitted value at s
Fit isotonic regression to the calibration set augmented with the pair (s, 1); let p1 be the fitted value at s
Return the interval [p0, p1]
Step 3: Point Estimate (if needed)
Return (p0 + p1) / 2, or the log-loss-oriented merge p1 / (1 − p0 + p1), as a single probability estimate
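A direct, if computationally naive, sketch of this procedure using scikit-learn's isotonic regression is shown below; efficient implementations avoid refitting for every test point:
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers_interval(cal_scores, cal_labels, test_score):
    # Fit isotonic regression twice: once with the test point hypothetically
    # labelled 0 and once labelled 1, reading off the fitted value at its score
    def fitted_value(hypothetical_label):
        s = np.append(cal_scores, test_score)
        y = np.append(cal_labels, hypothetical_label)
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        return float(iso.fit(s, y).predict([test_score])[0])
    p0, p1 = fitted_value(0), fitted_value(1)
    return p0, p1

# p0, p1 = venn_abers_interval(cal_scores, cal_labels, s)
# point = p1 / (1.0 - p0 + p1)   # log-loss-oriented merge (Vovk et al.)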
10.2 Hyperparameter Selection
Taxonomy and Data Splitting: General Venn predictors require choosing a taxonomy that groups calibration examples into categories (for example, by binning scores); Venn–ABERS fixes this choice through isotonic regression, so the main remaining decision is how to use the available data
- Inductive Venn–ABERS uses a single proper-training/calibration split
- Cross Venn–ABERS (Vovk et al., 2015) averages predictions over multiple folds to use the data more efficiently
Aggregation Method: Different ways to reduce the interval to a point estimate
- Simple averaging: (p_lower + p_upper) / 2
- The log-loss-oriented merge: p_upper / (1 − p_lower + p_upper)
- Conservative or optimistic choices using one of the interval bounds
10.3 Software Implementations
Open-source implementations are available in multiple languages:
- Python: the venn-abers package, plus scikit-learn integration
- R: CRAN packages for Venn prediction
- Julia: Native implementations in the MLJ.jl ecosystem
Example Usage (Python; class and method names are illustrative and may differ from the exact API of a given package version):
from venn_abers import VennAbersPredictor
# Initialize predictor
va_predictor = VennAbersPredictor()
# Fit on calibration data
va_predictor.fit(cal_scores, cal_labels)
# Predict intervals
p_lower, p_upper = va_predictor.predict_proba(test_scores)
# Get point estimates
p_point = (p_lower + p_upper) / 2
11. Domain-Specific Applications
11.1 Medical Diagnosis
In healthcare, calibrated probabilities are crucial for:
- Risk Stratification: Patients with different risk scores should have genuinely different risks
- Treatment Decisions: Probability thresholds guide intervention choices
- Regulatory Approval: Medical devices must demonstrate calibration for FDA approval
Case Study: COVID-19 severity prediction models required careful calibration to guide ICU admission decisions during the pandemic.
11.2 Financial Risk Assessment
Credit Scoring: Loan default probabilities must be accurate for regulatory capital calculations
Algorithmic Trading: Portfolio optimization relies on well-calibrated return predictions
Insurance: Premium calculations require accurate probability estimates
Regulatory Considerations: Basel III requirements mandate calibrated risk models for banks.
11.3 Autonomous Systems
Safety-Critical Decisions: Self-driving cars must have well-calibrated confidence in their perceptions
Anomaly Detection: Industrial systems need calibrated uncertainty for maintenance scheduling
Human-AI Collaboration: Calibrated confidence enables appropriate reliance on AI systems
11.4 Scientific Applications
Climate Modeling: Uncertainty quantification in climate predictions
Drug Discovery: Calibrated models for molecular property prediction
Astronomy: Calibrated probabilities for celestial object classification
12. Multi-Class and Structured Calibration
12.1 Multi-Class Extensions
Extending calibration to multi-class problems introduces additional complexity:
One-vs-Rest Approach: Apply binary calibration to each class separately
- Simple to implement
- May not preserve probability simplex constraints
- Can lead to probabilities that don’t sum to 1
Matrix Scaling: Learn a full transformation matrix for logits
- Maintains simplex constraints
- More parameters to learn
- Better theoretical properties
Dirichlet Calibration: Model the calibration as a Dirichlet distribution
- Principled probabilistic approach
- Natural handling of multi-class case
- Computational complexity for parameter estimation
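A minimal sketch of the one-vs-rest approach with per-class isotonic calibrators and an explicit renormalization step (restoring the simplex constraint noted above) is:
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_one_vs_rest(cal_probs, cal_labels, test_probs):
    # cal_probs, test_probs: (n_samples, n_classes) uncalibrated probabilities;
    # cal_labels: integer class labels; one calibrator per class, then renormalize
    n_classes = cal_probs.shape[1]
    calibrated = np.zeros_like(test_probs, dtype=float)
    for k in range(n_classes):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(cal_probs[:, k], (cal_labels == k).astype(float))
        calibrated[:, k] = iso.predict(test_probs[:, k])
    calibrated /= np.clip(calibrated.sum(axis=1, keepdims=True), 1e-12, None)
    return calibrated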
12.2 Regression Calibration
For continuous outputs, calibration focuses on prediction intervals rather than point probabilities:
Quantile Regression: Calibrate specific quantiles of the predictive distribution
Conformal Prediction: Provide prediction intervals with coverage guarantees
Distributional Calibration: Ensure the entire predictive distribution is well-calibrated
12.3 Structured Output Calibration
For complex outputs (sequences, trees, graphs):
Sequence Calibration: Calibrate probabilities for entire sequences in NLP tasks
Hierarchical Calibration: Handle structured label spaces with taxonomic relationships
Graph Calibration: Calibrate edge and node predictions in graph neural networks
13. Theoretical Advances and Recent Developments
13.1 Conformal Prediction
Conformal prediction provides a general framework for uncertainty quantification with finite-sample guarantees:
Coverage Guarantee: Prediction sets contain the true label with probability ≥ 1-α
Distribution-Free: Works under minimal assumptions about the data distribution
Efficiency: Aims to produce small prediction sets while maintaining coverage
Connection to Calibration: Conformal and Venn prediction are closely related frameworks; Venn predictors in particular output calibrated, interval-valued probability estimates, while standard conformal predictors provide set-valued predictions with coverage guarantees.
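For concreteness, a minimal sketch of split conformal classification with the simple nonconformity score 1 − p(true class) is shown below; it returns, for each test example, the set of class indices covered at level 1 − α:
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # cal_probs, test_probs: (n, k) predicted probabilities; cal_labels: integer labels
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]          # nonconformity scores
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)      # finite-sample correction
    q_hat = np.quantile(scores, q_level, method="higher")       # method= requires NumPy >= 1.22
    return [np.where(1.0 - row <= q_hat)[0] for row in test_probs]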
13.2 Bayesian Calibration
Bayesian approaches to calibration treat the calibration function as uncertain:
Gaussian Process Calibration: Model the calibration function as a GP
Bayesian Neural Networks: Account for weight uncertainty in neural calibration
Hierarchical Models: Share information across related calibration tasks
13.3 Online and Adaptive Calibration
For streaming data scenarios:
Online Isotonic Regression: Update calibration mappings as new data arrives
Adaptive Temperature Scaling: Continuously adjust temperature parameters
Change Point Detection: Identify when recalibration is needed
13.4 Fairness and Calibration
Intersection of calibration with algorithmic fairness:
Group-Wise Calibration: Ensure calibration within demographic groups
Multi-Calibration: Satisfy calibration constraints for multiple overlapping groups
Trade-offs: Balance between overall calibration and fairness constraints
14. Challenges and Future Directions
14.1 Current Limitations
Distribution Shift: Most calibration methods assume training and test distributions are identical
Label Noise: Noisy calibration labels can severely degrade calibration quality
Computational Scalability: Some methods don’t scale to very large datasets
Multi-Task Calibration: Limited work on calibrating across multiple related tasks
14.2 Emerging Research Directions
Adversarial Calibration: Robustness to adversarial perturbations
Meta-Learning for Calibration: Learning to calibrate across multiple tasks
Causal Calibration: Accounting for causal relationships in calibration
Quantum Machine Learning Calibration: Calibration methods for quantum algorithms
14.3 Standardization Efforts
Benchmark Datasets: Need for standardized calibration evaluation benchmarks
Metric Standardization: Consensus on preferred calibration evaluation metrics
Regulatory Guidelines: Development of calibration requirements for regulated industries
14.4 Integration with Modern ML
AutoML Integration: Automated calibration method selection
Neural Architecture Search: Architectures designed for better calibration
Foundation Model Calibration: Calibrating large pre-trained models
Federated Learning Calibration: Distributed calibration across multiple parties
15. Software Ecosystem and Tools
15.1 Python Libraries
scikit-learn: Basic calibration methods (Platt scaling, isotonic regression)
netcal: Comprehensive calibration library with multiple methods and metrics
uncertainty-toolbox: Toolkit for uncertainty quantification and calibration
calibration-library: Research-focused library with recent methods
15.2 R Packages
pROC: ROC analysis with calibration diagnostics
CalibrationCurves: Comprehensive calibration curve analysis
rms: Regression modeling strategies with calibration tools
15.3 Specialized Frameworks
TensorFlow Probability: Uncertainty quantification for neural networks
Pyro/NumPyro: Bayesian probabilistic programming with calibration
MAPIE: Conformal prediction library with calibration methods
15.4 Evaluation Platforms
Calibration Benchmark: Standardized benchmarks for calibration evaluation
OpenML: Open machine learning platform with calibration tasks
Papers with Code: Tracking state-of-the-art calibration methods
16. Conclusion
Probability calibration remains a critical component of trustworthy machine learning systems. While classical methods like Platt scaling and isotonic regression have provided practical solutions for over two decades, their limitations (rigid assumptions, vulnerability to overfitting, and lack of validity guarantees) pose real challenges in modern settings.
The evolution from simple histogram binning to sophisticated methods like Venn–ABERS predictors represents significant progress in addressing these challenges. Modern calibration methods offer:
- Theoretical Rigor: Methods with provable guarantees about calibration quality
- Robustness: Better performance in challenging scenarios (small data, distribution shift, class imbalance)
- Uncertainty Quantification: Explicit modeling of calibration uncertainty
- Scalability: Methods that work efficiently with large-scale data
As machine learning continues to be deployed in safety-critical and high-stakes domains, the importance of well-calibrated uncertainty estimates will only grow. Future developments in calibration will likely focus on:
- Adaptive Methods: Calibration techniques that automatically adjust to changing conditions
- Multi-Modal Integration: Calibration across different data modalities and tasks
- Real-Time Calibration: Methods suitable for online and streaming scenarios
- Fairness Integration: Ensuring calibration while maintaining algorithmic fairness
Venn–ABERS predictors and other conformal prediction methods represent a promising direction: theoretically grounded, empirically validated, and robust to the challenging conditions common in real-world applications. However, the choice of calibration method should always be guided by the specific characteristics of the problem domain, available data, and performance requirements.
The field continues to evolve rapidly, with new theoretical insights, practical methods, and application domains emerging regularly. Practitioners are encouraged to stay current with developments and to evaluate multiple calibration approaches for their specific use cases, always with careful attention to proper evaluation methodology and the unique requirements of their application domains.
References
- Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2010). Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends.
- Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643), 1512-1519.
- Dawid, A. P. (1982). The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379), 605-610.
- DeGroot, M. H., & Fienberg, S. E. (1983). The comparison and evaluation of forecasters. Journal of the Royal Statistical Society, 32(1), 12-22.
- Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B, 69(2), 243-268.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
- Hébert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the computationally-identifiable masses. In International Conference on Machine Learning (pp. 1939-1948).
- Kull, M., Silva Filho, T., & Flach, P. (2017). Beta calibration: A well-founded and easily implemented improvement on Platt scaling for binary SVM classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (pp. 623-631).
- Kumar, A., Liang, P. S., & Ma, T. (2019). Verified uncertainty calibration. In Advances in Neural Information Processing Systems (pp. 3787-3798).
- Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (pp. 6402-6413).
- Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., … & Lucic, M. (2021). Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems (pp. 15682-15694).
- Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H., & Dokania, P. K. (2020). Calibrating uncertainties in object localization task. In Advances in Neural Information Processing Systems (pp. 15334-15345).
- Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 2901-2907).
- Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019). Measuring calibration in deep learning. In CVPR Workshops (pp. 38-41).
- Nouretdinov, I., Luo, Z., & Gammerman, A. (2021). Probability calibration in machine learning: The case of Venn–ABERS predictors. Entropy, 23(8), 1061.
- Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., … & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (pp. 13991-14002).
- Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers (pp. 61-74).
- Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371-421.
- Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., & Schön, T. (2019). Evaluating model calibration in classification. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467).
- Vovk, V., Petej, I., & Fedorova, V. (2015). Large-scale probabilistic predictors with and without guarantees of validity. In Advances in Neural Information Processing Systems 28 (NIPS 2015).
- Wenger, J., Kjellström, H., & Tomczak, J. M. (2020). Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics (pp. 178-190).