Expected calibration error (ECE)
Definition
A common calibration measure is the so-called expected calibration error (ECE). In its most general form, the ECE with respect to a distance measure $d(p, p')$ is defined [WLZ21] as
\[\mathrm{ECE}_d := \mathbb{E} d\big(P_X, \mathrm{law}(Y \,|\, P_X)\big).\]
As implied by its name, the ECE is the expected distance, measured by $d$, between the left- and right-hand sides of the calibration definition.
Usually, the ECE is used to analyze classification models [GPSW17, VWALRS19]. In this case, $P_X$ and $\mathrm{law}(Y \,|\, P_X)$ can be identified with vectors in the probability simplex, and $d$ can be chosen as the cityblock distance, the total variation distance, or the squared Euclidean distance.
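For intuition, these three distances between two vectors in the probability simplex can be computed directly. The following is a plain Python sketch for illustration only (it is not part of the package, which uses distance types from Distances.jl):

```python
import numpy as np

def cityblock(p, q):
    # cityblock (L1) distance between two probability vectors
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def total_variation(p, q):
    # for discrete distributions, total variation is half the cityblock distance
    return 0.5 * cityblock(p, q)

def sq_euclidean(p, q):
    # squared Euclidean (L2^2) distance
    d = np.asarray(p) - np.asarray(q)
    return float(d @ d)

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
# cityblock(p, q) ≈ 0.4, total_variation(p, q) ≈ 0.2, sq_euclidean(p, q) ≈ 0.06
```

Note that all three distances are largest when the two distributions put their mass on disjoint classes, and zero exactly when the prediction matches the conditional distribution of the targets.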
For other probabilistic predictive models such as regression models, one has to choose a more general distance measure $d$ between probability distributions on the target space since the conditional distributions $\mathrm{law}(Y \,|\, P_X)$ can be arbitrarily complex in general.
Estimators
The main challenge in the estimation of the ECE is the estimation of the conditional distribution $\mathrm{law}(Y \,|\, P_X)$ from a finite data set of predictions and corresponding targets. Typically, the predictions are binned and empirical estimates of the conditional distributions are calculated for each bin. Such estimators can be constructed with ECE.

CalibrationErrors.ECE — Type

ECE(binning[, distance = TotalVariation()])

Estimator of the expected calibration error (ECE) for a classification model with respect to the given distance function, using the binning algorithm.
For classification models, the predictions $P_{X_i}$ and targets $Y_i$ are identified with vectors in the probability simplex. The estimator of the ECE is defined as
\[\frac{1}{B} \sum_{i=1}^B d\big(\overline{P}_i, \overline{Y}_i\big),\]
where $B$ is the number of non-empty bins, $d$ is the distance function, and $\overline{P}_i$ and $\overline{Y}_i$ are the average vector of the predictions and the average vector of targets in the $i$th bin. By default, the total variation distance is used.
The distance has to be a function of the form distance(pbar::Vector{<:Real}, ybar::Vector{<:Real}). In particular, distance measures from the package Distances.jl are supported.
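To make the estimator formula concrete, here is a minimal Python sketch of such a binned estimate (an illustrative re-implementation, not the package's code; the bin assignment is taken as given, and the total variation distance plays the role of the default distance):

```python
import numpy as np

def total_variation(p, q):
    # total variation distance between two discrete distributions
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def binned_ece(predictions, targets_onehot, bin_ids, distance=total_variation):
    """Average distance between the mean prediction vector and the mean
    target vector over the non-empty bins, with uniform weight 1/B per bin,
    matching the formula above."""
    predictions = np.asarray(predictions, dtype=float)
    targets_onehot = np.asarray(targets_onehot, dtype=float)
    bin_ids = np.asarray(bin_ids)
    dists = []
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        pbar = predictions[mask].mean(axis=0)     # average prediction in the bin
        ybar = targets_onehot[mask].mean(axis=0)  # empirical class frequencies in the bin
        dists.append(distance(pbar, ybar))
    return float(np.mean(dists))

# toy example: three predictions on two classes, split into two bins
predictions = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
targets = [[1, 0], [1, 0], [0, 1]]
ece = binned_ece(predictions, targets, bin_ids=[0, 0, 1])  # ≈ 0.225
```

In the toy example, the first bin contributes $d([0.85, 0.15], [1, 0]) = 0.15$ and the second $d([0.3, 0.7], [0, 1]) = 0.3$, so the estimate is their unweighted mean.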
Binning algorithms
Currently, two binning algorithms are supported. UniformBinning is a binning scheme with a fixed number of bins of uniform size, whereas MedianVarianceBinning splits the validation data set of predictions and targets dynamically to reduce the variance of the predictions within each bin.
CalibrationErrors.UniformBinning — Type

UniformBinning(nbins::Int)

Binning scheme of the probability simplex with nbins bins of uniform width for each component.
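The per-component indexing can be sketched as follows (a Python illustration under the assumption that each component of $[0, 1]$ is divided into nbins equal-width intervals; the package's exact interval conventions, e.g. half-open vs. closed endpoints, may differ):

```python
import numpy as np

def uniform_bin_index(p, nbins):
    """Bin of one probability vector: each component is assigned to one of
    `nbins` equal-width intervals of [0, 1]; the tuple of per-component
    indices identifies the bin of the simplex."""
    p = np.asarray(p, dtype=float)
    # floor(p * nbins), with p == 1.0 clamped into the last interval
    idx = np.minimum((p * nbins).astype(int), nbins - 1)
    return tuple(idx)

# e.g. with nbins = 10: [0.25, 0.75] falls into bin (2, 7)
```

Predictions sharing the same index tuple land in the same bin, so the number of potential bins grows quickly with the number of classes, while many of them remain empty in practice.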
CalibrationErrors.MedianVarianceBinning — Type

MedianVarianceBinning([minsize::Int = 10, maxbins::Int = typemax(Int)])

Dynamic binning scheme of the probability simplex with at most maxbins bins that each contain at least minsize samples.
The data set is split recursively as long as it is possible to split the bins while satisfying these conditions. In each step, the bin with the maximum variance of predicted probabilities for any component is selected and split at the median of the predicted probability of the component with the largest variance.
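The recursive procedure can be sketched as follows (a simplified Python re-implementation for illustration only; tie handling and the exact stopping rule are assumptions, not the package's actual algorithm):

```python
import numpy as np

def median_variance_binning(predictions, minsize=10, maxbins=None):
    """Return a list of index arrays, one per bin. Greedy sketch:
    repeatedly pick the bin whose predictions have the largest
    per-component variance and split it at the median of that component,
    as long as both halves keep at least `minsize` samples and the bin
    count stays at most `maxbins`."""
    predictions = np.asarray(predictions, dtype=float)
    bins = [np.arange(len(predictions))]
    if maxbins is None:
        maxbins = float("inf")
    while len(bins) < maxbins:
        # find the bin and component with the maximum variance
        best = None
        for i, idx in enumerate(bins):
            if len(idx) < 2 * minsize:
                continue  # splitting would violate the minimum bin size
            variances = predictions[idx].var(axis=0)
            c = int(np.argmax(variances))
            if best is None or variances[c] > best[0]:
                best = (variances[c], i, c)
        if best is None:
            break  # no bin can be split any further
        _, i, c = best
        idx = bins[i]
        med = np.median(predictions[idx, c])
        left = idx[predictions[idx, c] <= med]
        right = idx[predictions[idx, c] > med]
        if len(left) < minsize or len(right) < minsize:
            break  # the median split would violate the minimum bin size
        bins[i] = left
        bins.append(right)
    return bins
```

For example, 40 one-dimensional predictions spread evenly over $[0, 1]$ with minsize = 10 are split into four bins of ten samples each: one split at the overall median, then one split inside each half.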
- [GPSW17] Guo, C., et al. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (pp. 1321-1330).
- [VWALRS19] Vaicenavicius, J., et al. (2019). Evaluating model calibration in classification. In Proceedings of Machine Learning Research (AISTATS 2019) (pp. 3459-3467).
- [WLZ21] Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. To be presented at ICLR 2021.