Probability calibration is a fundamental step in modern machine learning pipelines, ensuring that predicted probabilities faithfully reflect observed frequencies. While classical methods such as Platt scaling and isotonic regression are widely adopted, they come with important limitations, including restrictive assumptions and vulnerability to overfitting, particularly when calibration data are scarce. This comprehensive review examines the evolution of calibration methods, from early histogram binning approaches to modern neural calibration techniques and conformal prediction methods. We highlight the theoretical foundations, practical challenges, and empirical performance of various approaches, with particular attention to Venn–ABERS predictors, which provide interval-based probability estimates with provable validity guarantees. The article also covers evaluation metrics, domain-specific applications, and emerging trends in uncertainty quantification, providing practitioners with a thorough guide to selecting and implementing appropriate calibration methods for their specific contexts.
1. Introduction
1.1 The Calibration Problem
Machine learning models often produce probability estimates as part of their output. For example, a binary classifier may predict that a sample belongs to the positive class with probability 0.8. For downstream tasks such as decision-making, ranking, cost-sensitive learning, or risk assessment, it is crucial that this probability is calibrated: across many such predictions with confidence 0.8, approximately 80% of the samples should indeed be positive.
Formally, a predictor is said to be calibrated if, for any predicted probability p, the conditional probability of the positive class given the prediction equals p:
P(Y = 1 | f(X) = p) = p
where f(X) represents the model’s probability prediction for input X, and Y is the true binary label.
1.2 The Miscalibration Crisis
Unfortunately, many modern learners, particularly high-capacity models such as gradient-boosted trees, random forests, and deep neural networks, are known to be poorly calibrated out of the box (Guo et al., 2017; Minderer et al., 2021). This miscalibration manifests in several ways. Overconfidence is common in modern neural networks, especially those with high capacity, which tend to produce overly confident predictions with predicted probabilities clustering near 0 and 1. Conversely, underconfidence can occur in some scenarios, particularly with ensemble methods or when using techniques like dropout, where models may be systematically underconfident. Additionally, non-monotonic miscalibration can arise where the relationship between predicted probabilities and actual frequencies becomes complex and non-linear.
1.3 Why Calibration Matters
Proper calibration is essential for numerous applications. In medical diagnosis, probability estimates guide treatment decisions and risk stratification of patients. Financial risk assessment relies on accurate probability estimates for loan approval and investment decisions. Autonomous systems require well-calibrated uncertainty estimates for safe operation in unpredictable environments. Scientific discovery applications benefit from honest uncertainty quantification that reflects true epistemic limitations. Additionally, model selection and ensemble weighting processes can leverage calibrated probabilities to enable better meta-learning approaches.
2. Theoretical Foundations
2.1 Perfect Calibration vs. Refinement
The quality of probabilistic predictions can be decomposed into two components: calibration and refinement. Calibration measures how well predicted probabilities match observed frequencies, while refinement captures how far the conditional class probabilities P(Y | p) deviate from the base rate.
The Brier score, a popular proper scoring rule, decomposes as: Brier Score = Reliability - Resolution + Uncertainty
where Reliability measures miscalibration, Resolution measures refinement, and Uncertainty is the inherent randomness in the data.
2.2 Types of Calibration
Marginal Calibration: The overall predicted probabilities match observed frequencies across the entire dataset.
Conditional Calibration: Probabilities are calibrated within subgroups defined by additional covariates or features.
Distributional Calibration: The entire predictive distribution is well-calibrated, not just point estimates.
3. Classical Calibrators and Their Limitations
3.1 Histogram Binning
The earliest calibration method involves partitioning predictions into bins and replacing each prediction with the empirical frequency within its bin.
Algorithm:
- Sort calibration data by predicted probability
- Divide into B equal-width or equal-frequency bins
- Replace predictions in each bin with the bin’s empirical frequency
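A minimal sketch of equal-width histogram binning with NumPy is shown below; the function names and the default of ten bins are illustrative choices, not taken from any particular library:
import numpy as np

def fit_histogram_binning(cal_scores, cal_labels, n_bins=10):
    # Equal-width bins on [0, 1]; each bin stores its empirical positive frequency
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(cal_scores, edges[1:-1]), 0, n_bins - 1)
    bin_freq = np.array([
        cal_labels[bin_ids == b].mean() if np.any(bin_ids == b) else 0.5
        for b in range(n_bins)
    ])
    return edges, bin_freq

def apply_histogram_binning(scores, edges, bin_freq):
    # Replace each score with the empirical frequency of its bin
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, len(bin_freq) - 1)
    return bin_freq[bin_ids]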
Advantages: Histogram binning is simple, non-parametric, and highly interpretable, making it accessible to practitioners across different domains.
Pitfalls: The method is sensitive to bin boundaries and requires careful selection of the number of bins. It is also prone to overfitting with small datasets, where individual bins may not contain sufficient samples for reliable frequency estimation.
3.2 Platt Scaling
Platt scaling (Platt, 1999) applies a logistic regression transformation to the uncalibrated scores of a base classifier:
P_calibrated = 1 / (1 + exp(A × s + B))
where s represents the uncalibrated score, and A, B are learned parameters.
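A minimal sketch of Platt scaling using scikit-learn's logistic regression on the one-dimensional score is given below (Platt's original formulation also smooths the targets, which is omitted here):
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(cal_scores, cal_labels):
    # Fit the sigmoid 1 / (1 + exp(A*s + B)) via (essentially unregularized) logistic regression
    lr = LogisticRegression(C=1e6)
    lr.fit(np.asarray(cal_scores).reshape(-1, 1), cal_labels)
    return lr

def apply_platt(lr, scores):
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]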
Advantages: Platt scaling is simple and computationally efficient, making it well-suited for support vector machines and other margin-based methods. It provides smooth, monotonic calibration curves and requires minimal calibration data to achieve reasonable performance.
Pitfalls: The method suffers from an overly restrictive assumption of a sigmoid shape for the calibration function. Underfitting occurs when the true calibration curve deviates significantly from sigmoid behavior, and the method may not perform well with highly non-linear miscalibrations. Additionally, it assumes the uncalibrated scores follow a particular distributional form, which may not hold in practice.
3.3 Isotonic Regression
Isotonic regression (Zadrozny & Elkan, 2002) offers a non-parametric alternative, learning a monotonic stepwise function that maps scores to probabilities using the Pool Adjacent Violators (PAV) algorithm.
Algorithm:
- Sort calibration data by predicted probability
- Apply PAV algorithm to find the isotonic regression
- Use the resulting step function for calibration
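In practice the PAV step does not need to be hand-coded; a minimal sketch using scikit-learn's IsotonicRegression, with outputs constrained to [0, 1], looks as follows:
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(cal_scores, cal_labels):
    # PAV-based monotonic fit; scores outside the calibration range are clipped
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(cal_scores, cal_labels)
    return iso

# Usage: calibrated = fit_isotonic(cal_scores, cal_labels).predict(test_scores)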
Advantages: Isotonic regression is flexible and capable of capturing arbitrary monotonic calibration functions. Being non-parametric, it makes no distributional assumptions about the underlying data. The method is guaranteed to produce monotonic outputs and can handle complex calibration curves that deviate significantly from simple parametric forms.
Pitfalls: The method is prone to overfitting, particularly with small calibration sets, and may produce sharp discontinuities that degrade generalization performance. A phenomenon known as “step overfitting” occurs where small fluctuations in calibration data cause large changes in the learned mapping function. Furthermore, isotonic regression provides no theoretical guarantees about the quality of the resulting calibration.
3.4 Temperature Scaling
Temperature scaling (Guo et al., 2017) has gained popularity in deep learning. It rescales logits by a learned temperature parameter τ before applying the softmax:
P_calibrated = softmax(z / τ)
where z represents the model’s logits.
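A minimal sketch of fitting the temperature by minimizing the negative log-likelihood on held-out logits is shown below (the search bounds for τ are an illustrative choice):
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(cal_logits, cal_labels):
    # cal_logits: (n, k) array of logits; cal_labels: integer class indices
    def nll(tau):
        probs = softmax(cal_logits / tau)
        return -np.mean(np.log(probs[np.arange(len(cal_labels)), cal_labels] + 1e-12))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Usage: calibrated = softmax(test_logits / fit_temperature(cal_logits, cal_labels))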
Advantages: Temperature scaling is extremely simple, requiring only one parameter to learn, and proves particularly effective for neural networks trained with cross-entropy loss. The method preserves the relative ordering of predictions and does not change the model’s accuracy since argmax predictions remain unchanged. It is also fast to compute and apply in production settings.
Pitfalls: Like Platt scaling, temperature scaling is parametric and limited in flexibility, assuming a uniform miscalibration across all confidence levels. The method is unsuitable for highly non-linear or class-dependent miscalibrations and may not work well for models not trained with cross-entropy loss or other standard objective functions.
3.5 Matrix and Vector Scaling
Extensions of temperature scaling include:
Matrix Scaling: Applies a linear transformation to the logits: softmax(Wz + b)
Vector Scaling: Uses class-dependent temperatures: softmax(z ⊙ t + b)
These methods offer more flexibility while maintaining computational efficiency.
4. Advanced Calibration Methods
4.1 Beta Calibration
Beta calibration (Kull et al., 2017) assumes that the calibration curve follows a beta distribution, offering more flexibility than Platt scaling while maintaining parametric efficiency.
The method fits three parameters (a, b, c) to the model: P_calibrated = sigmoid(a × ln(p) − b × ln(1 − p) + c)
where p is the uncalibrated probability.
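Kull et al. show that this fit reduces to logistic regression on the transformed features ln(p) and −ln(1 − p); a minimal sketch, which omits the non-negativity constraints on a and b enforced by the full method, is:
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(cal_probs, cal_labels, eps=1e-12):
    # Coefficients correspond to (a, b); the intercept corresponds to c
    p = np.clip(cal_probs, eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1 - p)])
    return LogisticRegression(C=1e6).fit(X, cal_labels)

def apply_beta_calibration(model, probs, eps=1e-12):
    p = np.clip(probs, eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1 - p)])
    return model.predict_proba(X)[:, 1]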
4.2 Spline-Based Calibration
Spline calibration uses piecewise polynomial functions to create smooth, flexible calibration curves. This approach balances the flexibility of isotonic regression with the smoothness of parametric methods.
4.3 Bayesian Binning into Quantiles (BBQ)
BBQ (Naeini et al., 2015) provides a Bayesian approach to histogram binning, averaging over multiple binning schemes via Bayesian model averaging rather than committing to a single number of bins, and providing uncertainty estimates about the calibration function itself.
4.4 Ensemble Temperature Scaling
For ensemble models, specialized calibration methods account for the correlation structure between ensemble members, typically requiring separate temperature parameters for each ensemble component.
5. Neural Network Calibration
5.1 Why Neural Networks Are Miscalibrated
Several factors contribute to poor calibration in neural networks. Overfitting in high-capacity models leads to memorization of training data, resulting in overconfident predictions. Model size plays a role, as larger networks tend to be more miscalibrated. Training procedures including modern practices like data augmentation and batch normalization can affect calibration properties. Furthermore, different architecture choices exhibit varying calibration characteristics, with some designs being inherently better calibrated than others.
5.2 Regularization-Based Approaches
Label Smoothing: Replaces hard targets with soft distributions, reducing overconfidence during training.
Dropout: Using dropout at inference time can improve calibration by providing uncertainty estimates.
Batch Normalization: Can affect calibration properties, sometimes requiring specialized handling.
5.3 Multi-Class Calibration
Extending calibration to multi-class settings presents additional challenges:
One-vs-All: Apply binary calibration to each class separately
Matrix Scaling: Learn a full linear transformation of logits
Dirichlet Calibration: Model the calibration function as a Dirichlet distribution
6. Overfitting and Miscalibration in Small Data Regimes
6.1 The Small Data Challenge
The problem becomes particularly acute when calibration must be performed with limited data. This scenario is common in medical applications with rare diseases, industrial quality control with few defects, specialized scientific domains with expensive data collection, and real-time systems requiring quick adaptation to new conditions.
6.2 Manifestations of Small Data Problems
Isotonic Regression Overfitting: Small fluctuations in the calibration set can induce disproportionate changes in the learned mapping, creating jagged, unreliable calibration curves.
Bin Selection Sensitivity: Histogram binning becomes extremely sensitive to the number and boundaries of bins.
Parameter Instability: Even simple methods like Platt scaling can exhibit high variance in parameter estimates.
6.3 Diagnostic Tools
Bootstrap Analysis: Assess calibration stability by resampling the calibration set
Cross-Validation: Use nested cross-validation to select calibration hyperparameters
Learning Curves: Plot calibration performance vs. calibration set size
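As an example of the bootstrap diagnostic above, the following sketch refits Platt scaling on bootstrap resamples of the calibration set and reports the spread of the learned parameters (all names are illustrative):
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_platt_stability(cal_scores, cal_labels, n_boot=200, seed=0):
    # Large standard deviations of the fitted slope/intercept across resamples
    # indicate that the calibration set is too small for stable calibration
    rng = np.random.default_rng(seed)
    n, params = len(cal_scores), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(cal_labels[idx])) < 2:
            continue  # skip degenerate resamples containing a single class
        lr = LogisticRegression(C=1e6).fit(cal_scores[idx].reshape(-1, 1), cal_labels[idx])
        params.append((lr.coef_[0, 0], lr.intercept_[0]))
    return np.asarray(params).std(axis=0)  # std of (slope, intercept)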
7. Venn–ABERS Predictors
7.1 Theoretical Foundations
Venn–ABERS predictors are rooted in the framework of Venn prediction theory (Vovk et al., 2015) and conformal prediction. Unlike classical calibrators that map scores directly to single probabilities, Venn predictors partition the calibration data into categories and compute conditional probabilities within each partition.
The method is based on several key principles:
- Exchangeability: Assumes calibration data and test data are exchangeable
- Validity: Provides theoretical guarantees about long-run calibration
- Efficiency: Aims to provide tight probability intervals
7.2 The ABERS Algorithm
The acronym ABERS comes from the initials of Ayer, Brunk, Ewing, Reid, and Silverman, the authors of the isotonic regression (Pool Adjacent Violators) procedure on which the method is built. The inductive Venn–ABERS predictor works as follows:
- Scoring: Train the base model on a proper training set and score a held-out calibration set
- Hypothetical Labeling: For a test object with score s, form two augmented calibration sets, one adding (s, 0) and one adding (s, 1)
- Isotonic Fitting: Run isotonic regression on each augmented set and evaluate the fitted function at s, giving p0 and p1
- Interval Output: Return the pair (p0, p1) as an interval-valued probability of the positive class
7.3 Interval-Valued Probabilities
A distinctive feature is that Venn predictors output two probabilities, effectively an interval estimate [p_lower, p_upper]. This interval provides several benefits:
- Uncertainty Quantification: Wider intervals indicate greater uncertainty
- Validity Guarantees: Under exchangeability, one of the two output probabilities is guaranteed to be perfectly calibrated in the long run
- Risk-Aware Decisions: Decision-makers can account for uncertainty in their choices
The interval can be reduced to a single point prediction when required, for example by simple averaging or by the log-loss-oriented merge p = p1 / (1 − p0 + p1) suggested by Vovk et al., but the interval itself provides valuable uncertainty information.
7.4 Validity Properties
Venn–ABERS predictors are valid under the exchangeability assumption alone: among the probabilities they output, at least one is perfectly calibrated in the long run, so predicted probabilities match observed frequencies without further distributional assumptions. This validity also holds within the categories defined by the underlying Venn taxonomy, giving a form of conditional calibration.
7.5 Empirical Performance
Studies have reported that Venn–ABERS calibration is more robust than isotonic regression and Platt scaling in several settings: small-sample regimes, where performance degrades gracefully as the calibration set shrinks; imbalanced datasets, where calibration quality is maintained even under severe class imbalance; distribution shift, where the method is more robust to changes in the data distribution; and noisy labels, where it is less sensitive to label noise in the calibration data.
Performance gains are often quantified through lower Expected Calibration Error (ECE), improved Brier scores, better reliability diagram behavior, and more stable performance across different data splits.
8. Evaluation Metrics for Calibration
8.1 Expected Calibration Error (ECE)
ECE measures the weighted average of the absolute differences between accuracy and confidence:
ECE = Σ (n_b / n) | acc(b) - conf(b) |
where b indexes bins, n_b is the number of samples in bin b, acc(b) is the accuracy in bin b, and conf(b) is the average confidence in bin b.
Variants:
- Static ECE: Uses fixed bin boundaries
- Adaptive ECE: Uses quantile-based binning
- Class-wise ECE: Computes ECE separately for each class
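A minimal sketch computing the static (equal-width-bin) ECE, together with the MCE described in the next subsection, is given below:
import numpy as np

def ece_mce(probs, labels, n_bins=15):
    # probs: predicted probability of the positive class; labels: 0/1 outcomes
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ece, gaps = 0.0, []
    for b in range(n_bins):
        mask = bin_ids == b
        if not np.any(mask):
            continue
        conf = probs[mask].mean()    # average confidence in the bin
        acc = labels[mask].mean()    # observed positive frequency in the bin
        gap = abs(acc - conf)
        ece += mask.mean() * gap     # mask.mean() equals n_b / n
        gaps.append(gap)
    return ece, max(gaps, default=0.0)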
8.2 Maximum Calibration Error (MCE)
MCE measures the worst-case calibration error across all bins:
MCE = max_b | acc(b) - conf(b) |
8.3 Brier Score
The Brier score measures the mean squared difference between predicted probabilities and binary outcomes:
BS = (1/n) Σ (p_i - y_i)²
It decomposes into reliability (miscalibration), resolution (refinement), and uncertainty components.
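The score and a binned version of its decomposition can be computed as follows; the decomposition is exact only when forecasts take finitely many values, so with continuous probabilities the binned version is an approximation:
import numpy as np

def brier_score(probs, labels):
    return np.mean((probs - labels) ** 2)

def brier_decomposition(probs, labels, n_bins=10):
    # Returns (reliability, resolution, uncertainty); BS ≈ rel - res + unc
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    base_rate = labels.mean()
    unc = base_rate * (1.0 - base_rate)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not np.any(mask):
            continue
        w = mask.mean()                 # n_b / n
        f_b = probs[mask].mean()        # mean forecast in the bin
        o_b = labels[mask].mean()       # observed frequency in the bin
        rel += w * (f_b - o_b) ** 2
        res += w * (o_b - base_rate) ** 2
    return rel, res, unc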
8.4 Reliability Diagrams
Visual representations plotting predicted probability vs. observed frequency, typically with bins. Well-calibrated models should show points near the diagonal.
8.5 Calibration Plots and Histograms
Calibration Plots: Show the relationship between predicted and actual probabilities
Confidence Histograms: Display the distribution of predicted probabilities
Gap Plots: Visualize the difference between predicted and observed frequencies
8.6 Proper Scoring Rules
Log Loss: Measures the negative log-likelihood of predictions
Brier Score: As described above
Spherical Score: A proper scoring rule that divides the probability assigned to the observed outcome by the Euclidean norm of the forecast vector
8.7 Statistical Tests
Hosmer-Lemeshow Test: Chi-square test for goodness of calibration
Spiegelhalter’s Z-test: Tests calibration in probabilistic models
Bootstrap Confidence Intervals: Provide uncertainty estimates for calibration metrics
9. Practical Implementation Guidelines
9.1 Standard Calibration Pipeline
- Data Splitting: Reserve a held-out calibration set (typically 10-20% of available data)
- Base Model Training: Train the primary model on the training set
- Calibration Method Selection: Choose appropriate calibration method based on data characteristics
- Hyperparameter Tuning: Use nested cross-validation for calibration hyperparameters
- Final Calibration: Apply chosen method to calibration set
- Evaluation: Assess calibration quality on a separate test set
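A minimal end-to-end sketch of this pipeline, using a random forest as the base model and isotonic regression as the (arbitrarily chosen) calibrator, might look like this:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# 1. Data splitting: 60% train, 20% calibration, 20% test
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# 2. Base model training on the training set only
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 3-5. Fit the chosen calibrator on held-out calibration scores
cal_scores = model.predict_proba(X_cal)[:, 1]
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(cal_scores, y_cal)

# 6. Evaluate calibrated probabilities on the untouched test set
test_probs = calibrator.predict(model.predict_proba(X_test)[:, 1])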
9.2 Method Selection Guidelines
Small Datasets (< 1000 samples):
- Venn–ABERS predictors
- Platt scaling for stable results
- Avoid isotonic regression
Medium Datasets (1000-10000 samples):
- Isotonic regression often performs well
- Temperature scaling for neural networks
- Beta calibration for complex curves
Large Datasets (> 10000 samples):
- All methods typically viable
- Isotonic regression and spline methods excel
- Consider computational efficiency
Neural Networks:
- Start with temperature scaling
- Consider ensemble temperature scaling for ensembles
- Matrix/vector scaling for multi-class problems
9.3 Implementation Considerations
Cross-Validation Strategies: Use stratified k-fold CV to maintain class balance across folds
Computational Efficiency: Consider the trade-off between calibration quality and inference speed
Memory Requirements: Some methods (like isotonic regression) require storing calibration data
Hyperparameter Sensitivity: Assess robustness to hyperparameter choices using bootstrap sampling
9.4 Common Pitfalls and Best Practices
Data Leakage: Ensure strict separation between training, calibration, and test sets
Insufficient Calibration Data: Reserve adequate data for calibration (at least 100-1000 samples when possible)
Evaluation Bias: Use proper cross-validation schemes to avoid overly optimistic calibration estimates
Method Overfitting: Don’t over-optimize calibration hyperparameters on the test set
10. Venn–ABERS: Detailed Implementation
10.1 Step-by-Step Algorithm
Input: Calibration set {(x_i, y_i, s_i)} where s_i is the uncalibrated score
Step 1: Sorting
Sort calibration examples by score s_i (this ordering is what the isotonic regression fits operate on)
Step 2: Probability Calculation
For each test example with score s:
Fit isotonic regression to the calibration set augmented with the pair (s, 0); let p0 be the fitted value at s
Fit isotonic regression to the calibration set augmented with the pair (s, 1); let p1 be the fitted value at s
Return the interval [p0, p1]
Step 3: Point Estimate (if needed)
Return (p0 + p1) / 2, or the log-loss-oriented merge p1 / (1 − p0 + p1), as a single probability estimate
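A direct, if computationally naive, sketch of this procedure using scikit-learn's isotonic regression is shown below; efficient implementations avoid refitting for every test point:
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers_interval(cal_scores, cal_labels, test_score):
    # Fit isotonic regression twice: once with the test point hypothetically
    # labelled 0 and once labelled 1, reading off the fitted value at its score
    def fitted_value(hypothetical_label):
        s = np.append(cal_scores, test_score)
        y = np.append(cal_labels, hypothetical_label)
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        return float(iso.fit(s, y).predict([test_score])[0])
    p0, p1 = fitted_value(0), fitted_value(1)
    return p0, p1

# p0, p1 = venn_abers_interval(cal_scores, cal_labels, s)
# point = p1 / (1.0 - p0 + p1)   # log-loss-oriented merge (Vovk et al.)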
10.2 Hyperparameter Selection
Taxonomy and Data Splitting: General Venn predictors require choosing a taxonomy that groups calibration examples into categories (for example, by binning scores); Venn–ABERS fixes this choice through isotonic regression, so the main remaining decision is how to use the available data
- Inductive Venn–ABERS uses a single proper-training/calibration split
- Cross Venn–ABERS (Vovk et al., 2015) averages predictions over multiple folds to use the data more efficiently
Aggregation Method: Different ways to reduce the interval to a point estimate
- Simple averaging: (p_lower + p_upper) / 2
- The log-loss-oriented merge: p_upper / (1 − p_lower + p_upper)
- Conservative or optimistic choices using one of the interval bounds
10.3 Software Implementations
Open-source implementations are available in multiple languages:
- Python: the venn-abers package, plus scikit-learn integration
- R: CRAN packages for Venn prediction
- Julia: Native implementations in the MLJ.jl ecosystem
Example Usage (Python; class and method names are illustrative and may differ from the exact API of a given package version):
from venn_abers import VennAbersPredictor
# Initialize predictor
va_predictor = VennAbersPredictor()
# Fit on calibration data
va_predictor.fit(cal_scores, cal_labels)
# Predict intervals
p_lower, p_upper = va_predictor.predict_proba(test_scores)
# Get point estimates
p_point = (p_lower + p_upper) / 2
11. Domain-Specific Applications
11.1 Medical Diagnosis
In healthcare, calibrated probabilities are crucial for:
- Risk Stratification: Patients with different risk scores should have genuinely different risks
- Treatment Decisions: Probability thresholds guide intervention choices
- Regulatory Approval: Medical devices must demonstrate calibration for FDA approval
Case Study: COVID-19 severity prediction models required careful calibration to guide ICU admission decisions during the pandemic.
11.2 Financial Risk Assessment
Credit Scoring: Loan default probabilities must be accurate for regulatory capital calculations
Algorithmic Trading: Portfolio optimization relies on well-calibrated return predictions
Insurance: Premium calculations require accurate probability estimates
Regulatory Considerations: Basel III requirements mandate calibrated risk models for banks.
11.3 Autonomous Systems
Safety-Critical Decisions: Self-driving cars must have well-calibrated confidence in their perceptions
Anomaly Detection: Industrial systems need calibrated uncertainty for maintenance scheduling
Human-AI Collaboration: Calibrated confidence enables appropriate reliance on AI systems
11.4 Scientific Applications
Climate Modeling: Uncertainty quantification in climate predictions
Drug Discovery: Calibrated models for molecular property prediction
Astronomy: Calibrated probabilities for celestial object classification
12. Multi-Class and Structured Calibration
12.1 Multi-Class Extensions
Extending calibration to multi-class problems introduces additional complexity:
One-vs-Rest Approach: Apply binary calibration to each class separately
- Simple to implement
- May not preserve probability simplex constraints
- Can lead to probabilities that don’t sum to 1
Matrix Scaling: Learn a full transformation matrix for logits
- Maintains simplex constraints
- More parameters to learn
- Better theoretical properties
Dirichlet Calibration: Model the calibration as a Dirichlet distribution
- Principled probabilistic approach
- Natural handling of multi-class case
- Computational complexity for parameter estimation
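A minimal sketch of the one-vs-rest approach with per-class isotonic calibrators and an explicit renormalization step (restoring the simplex constraint noted above) is:
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_one_vs_rest(cal_probs, cal_labels, test_probs):
    # cal_probs, test_probs: (n_samples, n_classes) uncalibrated probabilities;
    # cal_labels: integer class labels; one calibrator per class, then renormalize
    n_classes = cal_probs.shape[1]
    calibrated = np.zeros_like(test_probs, dtype=float)
    for k in range(n_classes):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(cal_probs[:, k], (cal_labels == k).astype(float))
        calibrated[:, k] = iso.predict(test_probs[:, k])
    calibrated /= np.clip(calibrated.sum(axis=1, keepdims=True), 1e-12, None)
    return calibrated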
12.2 Regression Calibration
For continuous outputs, calibration focuses on prediction intervals rather than point probabilities:
Quantile Regression: Calibrate specific quantiles of the predictive distribution
Conformal Prediction: Provide prediction intervals with coverage guarantees
Distributional Calibration: Ensure the entire predictive distribution is well-calibrated
12.3 Structured Output Calibration
For complex outputs (sequences, trees, graphs):
Sequence Calibration: Calibrate probabilities for entire sequences in NLP tasks
Hierarchical Calibration: Handle structured label spaces with taxonomic relationships
Graph Calibration: Calibrate edge and node predictions in graph neural networks
13. Theoretical Advances and Recent Developments
13.1 Conformal Prediction
Conformal prediction provides a general framework for uncertainty quantification with finite-sample guarantees:
Coverage Guarantee: Prediction sets contain the true label with probability ≥ 1-α
Distribution-Free: Works under minimal assumptions about the data distribution
Efficiency: Aims to produce small prediction sets while maintaining coverage
Connection to Calibration: Conformal and Venn prediction are closely related frameworks; Venn predictors in particular output calibrated, interval-valued probability estimates, while standard conformal predictors provide set-valued predictions with coverage guarantees.
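For concreteness, a minimal sketch of split conformal classification with the simple nonconformity score 1 − p(true class) is shown below; it returns, for each test example, the set of class indices covered at level 1 − α:
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # cal_probs, test_probs: (n, k) predicted probabilities; cal_labels: integer labels
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]          # nonconformity scores
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)      # finite-sample correction
    q_hat = np.quantile(scores, q_level, method="higher")       # method= requires NumPy >= 1.22
    return [np.where(1.0 - row <= q_hat)[0] for row in test_probs]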
13.2 Bayesian Calibration
Bayesian approaches to calibration treat the calibration function as uncertain:
Gaussian Process Calibration: Model the calibration function as a GP
Bayesian Neural Networks: Account for weight uncertainty in neural calibration
Hierarchical Models: Share information across related calibration tasks
13.3 Online and Adaptive Calibration
For streaming data scenarios:
Online Isotonic Regression: Update calibration mappings as new data arrives
Adaptive Temperature Scaling: Continuously adjust temperature parameters
Change Point Detection: Identify when recalibration is needed
13.4 Fairness and Calibration
Intersection of calibration with algorithmic fairness:
Group-Wise Calibration: Ensure calibration within demographic groups
Multi-Calibration: Satisfy calibration constraints for multiple overlapping groups
Trade-offs: Balance between overall calibration and fairness constraints
14. Challenges and Future Directions
14.1 Current Limitations
Distribution Shift: Most calibration methods assume training and test distributions are identical
Label Noise: Noisy calibration labels can severely degrade calibration quality
Computational Scalability: Some methods don’t scale to very large datasets
Multi-Task Calibration: Limited work on calibrating across multiple related tasks
14.2 Emerging Research Directions
Adversarial Calibration: Robustness to adversarial perturbations
Meta-Learning for Calibration: Learning to calibrate across multiple tasks
Causal Calibration: Accounting for causal relationships in calibration
Quantum Machine Learning Calibration: Calibration methods for quantum algorithms
14.3 Standardization Efforts
Benchmark Datasets: Need for standardized calibration evaluation benchmarks
Metric Standardization: Consensus on preferred calibration evaluation metrics
Regulatory Guidelines: Development of calibration requirements for regulated industries
14.4 Integration with Modern ML
AutoML Integration: Automated calibration method selection
Neural Architecture Search: Architectures designed for better calibration
Foundation Model Calibration: Calibrating large pre-trained models
Federated Learning Calibration: Distributed calibration across multiple parties
15. Software Ecosystem and Tools
15.1 Python Libraries
scikit-learn: Basic calibration methods (Platt scaling, isotonic regression)
netcal: Comprehensive calibration library with multiple methods and metrics
uncertainty-toolbox: Toolkit for uncertainty quantification and calibration
calibration-library: Research-focused library with recent methods
15.2 R Packages
pROC: ROC analysis with calibration diagnostics
CalibrationCurves: Comprehensive calibration curve analysis
rms: Regression modeling strategies with calibration tools
15.3 Specialized Frameworks
TensorFlow Probability: Uncertainty quantification for neural networks
Pyro/NumPyro: Bayesian probabilistic programming with calibration
MAPIE: Conformal prediction library with calibration methods
15.4 Evaluation Platforms
Calibration Benchmark: Standardized benchmarks for calibration evaluation
OpenML: Open machine learning platform with calibration tasks
Papers with Code: Tracking state-of-the-art calibration methods
16. Conclusion
Probability calibration remains a critical component of trustworthy machine learning systems. While classical methods like Platt scaling and isotonic regression have provided practical solutions for over two decades, their limitations (rigid assumptions, vulnerability to overfitting, and lack of validity guarantees) pose real challenges in modern settings.
The evolution from simple histogram binning to sophisticated methods like Venn–ABERS predictors represents significant progress in addressing these challenges. Modern calibration methods offer:
- Theoretical Rigor: Methods with provable guarantees about calibration quality
- Robustness: Better performance in challenging scenarios (small data, distribution shift, class imbalance)
- Uncertainty Quantification: Explicit modeling of calibration uncertainty
- Scalability: Methods that work efficiently with large-scale data
As machine learning continues to be deployed in safety-critical and high-stakes domains, the importance of well-calibrated uncertainty estimates will only grow. Future developments in calibration will likely focus on:
- Adaptive Methods: Calibration techniques that automatically adjust to changing conditions
- Multi-Modal Integration: Calibration across different data modalities and tasks
- Real-Time Calibration: Methods suitable for online and streaming scenarios
- Fairness Integration: Ensuring calibration while maintaining algorithmic fairness
Venn–ABERS predictors and other conformal prediction methods represent a promising direction: theoretically grounded, empirically validated, and robust to the challenging conditions common in real-world applications. However, the choice of calibration method should always be guided by the specific characteristics of the problem domain, available data, and performance requirements.
The field continues to evolve rapidly, with new theoretical insights, practical methods, and application domains emerging regularly. Practitioners are encouraged to stay current with developments and to evaluate multiple calibration approaches for their specific use cases, always with careful attention to proper evaluation methodology and the unique requirements of their application domains.
References
- Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2010). Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends.
- Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643), 1512-1519.
- Dawid, A. P. (1982). The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379), 605-610.
- DeGroot, M. H., & Fienberg, S. E. (1983). The comparison and evaluation of forecasters. Journal of the Royal Statistical Society, 32(1), 12-22.
- Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B, 69(2), 243-268.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
- Hébert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the computationally-identifiable masses. In International Conference on Machine Learning (pp. 1939-1948).
- Kull, M., Silva Filho, T., & Flach, P. (2017). Beta calibration: A well-founded and easily implemented improvement on Platt scaling for binary SVM classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (pp. 623-631).
- Kumar, A., Liang, P. S., & Ma, T. (2019). Verified uncertainty calibration. In Advances in Neural Information Processing Systems (pp. 3787-3798).
- Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (pp. 6402-6413).
- Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., … & Lucic, M. (2021). Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems (pp. 15682-15694).
- Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H., & Dokania, P. K. (2020). Calibrating uncertainties in object localization task. In Advances in Neural Information Processing Systems (pp. 15334-15345).
- Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 2901-2907).
- Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019). Measuring calibration in deep learning. In CVPR Workshops (pp. 38-41).
- Nouretdinov, I., Luo, Z., & Gammerman, A. (2021). Probability calibration in machine learning: The case of Venn–ABERS predictors. Entropy, 23(8), 1061.
- Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., … & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (pp. 13991-14002).
- Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers (pp. 61-74).
- Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371-421.
- Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., & Schön, T. (2019). Evaluating model calibration in classification. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467).
- Vovk, V., Petej, I., & Fedorova, V. (2015). Large-scale probabilistic predictors with and without guarantees of validity. In Advances in Neural Information Processing Systems 28 (NIPS 2015).
- Wenger, J., Kjellström, H., & Tomczak, J. M. (2020). Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics (pp. 178-190).