Expected calibration error (ECE)
Definition
A common calibration measure is the so-called expected calibration error (ECE). In its most general form, the ECE with respect to a distance measure $d(p, p')$ is defined [WLZ21] as
\[\mathrm{ECE}_d := \mathbb{E} d\big(P_X, \mathrm{law}(Y \,|\, P_X)\big).\]
As implied by its name, the ECE is the expected distance, measured by $d$, between the left- and right-hand sides of the calibration definition.
Usually, the ECE is used to analyze classification models [GPSW17, VWALRS19]. In this case, $P_X$ and $\mathrm{law}(Y \,|\, P_X)$ can be identified with vectors in the probability simplex, and $d$ can be chosen as the cityblock distance, the total variation distance, or the squared Euclidean distance.
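For intuition, these three distances between two vectors in the probability simplex can be computed directly. The following is a plain Python sketch for illustration only (it is not part of the package, which uses distance types from Distances.jl):

```python
import numpy as np

def cityblock(p, q):
    # cityblock (L1) distance between two probability vectors
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def total_variation(p, q):
    # for discrete distributions, total variation is half the cityblock distance
    return 0.5 * cityblock(p, q)

def sq_euclidean(p, q):
    # squared Euclidean (L2^2) distance
    d = np.asarray(p) - np.asarray(q)
    return float(d @ d)

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
# cityblock(p, q) ≈ 0.4, total_variation(p, q) ≈ 0.2, sq_euclidean(p, q) ≈ 0.06
```

Note that all three distances are largest when the two distributions put their mass on disjoint classes, and zero exactly when the prediction matches the conditional distribution of the targets.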
For other probabilistic predictive models such as regression models, one has to choose a more general distance measure $d$ between probability distributions on the target space since the conditional distributions $\mathrm{law}(Y \,|\, P_X)$ can be arbitrarily complex in general.
Estimators
The main challenge in the estimation of the ECE is the estimation of the conditional distribution $\mathrm{law}(Y \,|\, P_X)$ from a finite data set of predictions and corresponding targets. Typically, the predictions are binned and empirical estimates of the conditional distributions are calculated for each bin. Such estimators can be constructed with ECE.

CalibrationErrors.ECE — Type

ECE(binning[, distance = TotalVariation()])

Estimator of the expected calibration error (ECE) for a classification model with respect to the given distance function, using the binning algorithm.
For classification models, the predictions $P_{X_i}$ and targets $Y_i$ are identified with vectors in the probability simplex. The estimator of the ECE is defined as
\[\frac{1}{B} \sum_{i=1}^B d\big(\overline{P}_i, \overline{Y}_i\big),\]
where $B$ is the number of non-empty bins, $d$ is the distance function, and $\overline{P}_i$ and $\overline{Y}_i$ are the average vector of the predictions and the average vector of targets in the $i$th bin. By default, the total variation distance is used.
The distance has to be a function of the form distance(pbar::Vector{<:Real}, ybar::Vector{<:Real}). In particular, distance measures from the package Distances.jl are supported.
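To make the estimator formula concrete, here is a minimal Python sketch of such a binned estimate (an illustrative re-implementation, not the package's code; the bin assignment is taken as given, and the total variation distance plays the role of the default distance):

```python
import numpy as np

def total_variation(p, q):
    # total variation distance between two discrete distributions
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def binned_ece(predictions, targets_onehot, bin_ids, distance=total_variation):
    """Average distance between the mean prediction vector and the mean
    target vector over the non-empty bins, with uniform weight 1/B per bin,
    matching the formula above."""
    predictions = np.asarray(predictions, dtype=float)
    targets_onehot = np.asarray(targets_onehot, dtype=float)
    bin_ids = np.asarray(bin_ids)
    dists = []
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        pbar = predictions[mask].mean(axis=0)     # average prediction in the bin
        ybar = targets_onehot[mask].mean(axis=0)  # empirical class frequencies in the bin
        dists.append(distance(pbar, ybar))
    return float(np.mean(dists))

# toy example: three predictions on two classes, split into two bins
predictions = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
targets = [[1, 0], [1, 0], [0, 1]]
ece = binned_ece(predictions, targets, bin_ids=[0, 0, 1])  # ≈ 0.225
```

In the toy example, the first bin contributes $d([0.85, 0.15], [1, 0]) = 0.15$ and the second $d([0.3, 0.7], [0, 1]) = 0.3$, so the estimate is their unweighted mean.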
Binning algorithms
Currently, two binning algorithms are supported. UniformBinning is a binning scheme with a fixed number of bins of uniform size, whereas MedianVarianceBinning splits the validation data set of predictions and targets dynamically to reduce the variance of the predictions within each bin.
CalibrationErrors.UniformBinning — Type

UniformBinning(nbins::Int)

Binning scheme of the probability simplex with nbins bins of uniform width for each component.
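The per-component indexing can be sketched as follows (a Python illustration under the assumption that each component of $[0, 1]$ is divided into nbins equal-width intervals; the package's exact interval conventions, e.g. half-open vs. closed endpoints, may differ):

```python
import numpy as np

def uniform_bin_index(p, nbins):
    """Bin of one probability vector: each component is assigned to one of
    `nbins` equal-width intervals of [0, 1]; the tuple of per-component
    indices identifies the bin of the simplex."""
    p = np.asarray(p, dtype=float)
    # floor(p * nbins), with p == 1.0 clamped into the last interval
    idx = np.minimum((p * nbins).astype(int), nbins - 1)
    return tuple(idx)

# e.g. with nbins = 10: [0.25, 0.75] falls into bin (2, 7)
```

Predictions sharing the same index tuple land in the same bin, so the number of potential bins grows quickly with the number of classes, while many of them remain empty in practice.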
CalibrationErrors.MedianVarianceBinning — Type

MedianVarianceBinning([minsize::Int = 10, maxbins::Int = typemax(Int)])

Dynamic binning scheme of the probability simplex with at most maxbins bins that each contain at least minsize samples.
The data set is split recursively as long as it is possible to split the bins while satisfying these conditions. In each step, the bin with the maximum variance of predicted probabilities for any component is selected and split at the median of the predicted probability of the component with the largest variance.
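The recursive procedure can be sketched as follows (a simplified Python re-implementation for illustration only; tie handling and the exact stopping rule are assumptions, not the package's actual algorithm):

```python
import numpy as np

def median_variance_binning(predictions, minsize=10, maxbins=None):
    """Return a list of index arrays, one per bin. Greedy sketch:
    repeatedly pick the bin whose predictions have the largest
    per-component variance and split it at the median of that component,
    as long as both halves keep at least `minsize` samples and the bin
    count stays at most `maxbins`."""
    predictions = np.asarray(predictions, dtype=float)
    bins = [np.arange(len(predictions))]
    if maxbins is None:
        maxbins = float("inf")
    while len(bins) < maxbins:
        # find the bin and component with the maximum variance
        best = None
        for i, idx in enumerate(bins):
            if len(idx) < 2 * minsize:
                continue  # splitting would violate the minimum bin size
            variances = predictions[idx].var(axis=0)
            c = int(np.argmax(variances))
            if best is None or variances[c] > best[0]:
                best = (variances[c], i, c)
        if best is None:
            break  # no bin can be split any further
        _, i, c = best
        idx = bins[i]
        med = np.median(predictions[idx, c])
        left = idx[predictions[idx, c] <= med]
        right = idx[predictions[idx, c] > med]
        if len(left) < minsize or len(right) < minsize:
            break  # the median split would violate the minimum bin size
        bins[i] = left
        bins.append(right)
    return bins
```

For example, 40 one-dimensional predictions spread evenly over $[0, 1]$ with minsize = 10 are split into four bins of ten samples each: one split at the overall median, then one split inside each half.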
- [GPSW17] Guo, C., et al. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (pp. 1321-1330).
- [VWALRS19] Vaicenavicius, J., et al. (2019). Evaluating model calibration in classification. In Proceedings of Machine Learning Research (AISTATS 2019) (pp. 3459-3467).
- [WLZ21] Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. To be presented at ICLR 2021.