Kernel calibration error (KCE)

Definition

The kernel calibration error (KCE) is another calibration error. It is based on real-valued kernels on the product space $\mathcal{P} \times \mathcal{Y}$ of predictions and targets.

The KCE with respect to a real-valued kernel $k \colon (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$ is defined[WLZ21] as

\[\mathrm{KCE}_k := \sup_{f \in \mathcal{B}_k} \bigg| \mathbb{E}_{Y,P_X} f(P_X, Y) - \mathbb{E}_{Z_X,P_X} f(P_X, Z_X)\bigg|,\]

where $\mathcal{B}_{k}$ is the unit ball in the reproducing kernel Hilbert space (RKHS) of $k$ and $Z_X$ is an artificial random variable on the target space $\mathcal{Y}$ whose conditional law is given by

\[Z_X \,|\, P_X = \mu \sim \mu.\]

The RKHS of the kernel $k$, and hence also its unit ball $\mathcal{B}_k$, consists of real-valued functions of the form $f \colon \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$.
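
For example, for a classification model with $m$ classes, a prediction is a probability vector $\mu = (\mu_1, \ldots, \mu_m)$ and $Z_X$ is simply a class label drawn from the predicted distribution:

\[\Pr(Z_X = i \,|\, P_X = \mu) = \mu_i \quad (i = 1, \ldots, m).\]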

For classification models with $m$ classes, there exists an equivalent formulation of the KCE based on a matrix-valued kernel $\tilde{k} \colon \mathcal{P} \times \mathcal{P} \to \mathbb{R}^{m \times m}$ on the space $\mathcal{P}$ of predictions.[WLZ19] The definition above can be rewritten as

\[\mathrm{KCE}_{\tilde{k}} := \sup_{f \in \mathcal{B}_{\tilde{k}}} \bigg| \mathbb{E}_{P_X} \big(\mathrm{law}(Y \,|\, P_X) - P_X\big)^\mathsf{T} f(P_X) \bigg|,\]

where the matrix-valued kernel $\tilde{k}$ is given by

\[\tilde{k}_{i,j}(p, q) = k((p, i), (q, j)) \quad (i,j=1,\ldots,m),\]

and $\mathcal{B}_{\tilde{k}}$ is the unit ball in the RKHS of $\tilde{k}$, consisting of vector-valued functions $f \colon \mathcal{P} \to \mathbb{R}^m$. However, this formulation applies only to classification models whereas the general definition above covers all probabilistic predictive models.
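
As a concrete illustration (with a kernel chosen purely for this example), take $k\big((p, y), (q, z)\big) = \exp\big(-\tfrac{1}{2}\|p - q\|^2\big)\, \delta_{y,z}$, the product of a Gaussian kernel on the predictions and a discrete delta kernel on the targets. The corresponding matrix-valued kernel is

\[\tilde{k}(p, q) = \exp\big(-\tfrac{1}{2}\|p - q\|^2\big) \, I_m,\]

where $I_m$ denotes the $m \times m$ identity matrix.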

For a large class of kernels the KCE is zero if and only if the model is calibrated.[WLZ21] Moreover, the squared KCE (SKCE) can be formulated in terms of the kernel $k$ as

\[\begin{aligned} \mathrm{SKCE}_{k} := \mathrm{KCE}_k^2 &= \int k(u, v) \, \big(\mathrm{law}(P_X, Y) - \mathrm{law}(P_X, Z_X)\big)(\mathrm{d}u) \, \big(\mathrm{law}(P_X, Y) - \mathrm{law}(P_X, Z_X)\big)(\mathrm{d}v) \\ &= \mathbb{E} h_k\big((P_X, Y), (P_{X'}, Y')\big), \end{aligned}\]

where $(X',Y')$ is an independent copy of $(X,Y)$ and

\[\begin{aligned} h_k\big((\mu, y), (\mu', y')\big) :={}& k\big((\mu, y), (\mu', y')\big) - \mathbb{E}_{Z \sim \mu} k\big((\mu, Z), (\mu', y')\big) \\ &- \mathbb{E}_{Z' \sim \mu'} k\big((\mu, y), (\mu', Z')\big) + \mathbb{E}_{Z \sim \mu, Z' \sim \mu'} k\big((\mu, Z), (\mu', Z')\big). \end{aligned}\]
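
For classification models, the expectations over $Z \sim \mu$ and $Z' \sim \mu'$ are finite sums over the classes, so $h_k$ has a simple closed form for kernels such as the Gaussian-times-delta kernel from the example above. A minimal sketch in Julia (the function name skce_kernel_term is purely illustrative and not part of any package):

# h_k for the kernel k((p, y), (q, z)) = exp(-‖p - q‖² / 2) δ_{y,z}:
# the expectations over Z ∼ μ and Z' ∼ ν reduce to sums over the classes.
function skce_kernel_term(μ::AbstractVector, y::Int, ν::AbstractVector, z::Int)
    base = exp(-sum(abs2, μ - ν) / 2)  # kernel evaluated on the two predictions
    # δ_{y,z} - E_{Z ∼ μ} δ_{Z,z} - E_{Z' ∼ ν} δ_{y,Z'} + E_{Z ∼ μ, Z' ∼ ν} δ_{Z,Z'}
    return base * ((y == z) - μ[z] - ν[y] + sum(μ .* ν))
end

# Example: two predictions on three classes with observed targets 1 and 2.
skce_kernel_term([0.8, 0.1, 0.1], 1, [0.3, 0.5, 0.2], 2)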

The KCE is actually a special case of calibration errors that are formulated as integral probability metrics of the form

\[\sup_{f \in \mathcal{F}} \big| \mathbb{E}_{Y,P_X} f(P_X, Y) - \mathbb{E}_{Z_X,P_X} f(P_X, Z_X)\big|,\]

where $\mathcal{F}$ is a space of real-valued functions of the form $f \colon \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$.[WLZ21] For classification models, the ECE with respect to common distances such as the total variation distance or the squared Euclidean distance can be formulated in this way.[WLZ19]

The maximum mean calibration error (MMCE)[KSJ] can be viewed as a special case of the KCE, in which only the most-confident predictions are considered.[WLZ19]

Estimator

For the SKCE, both biased and unbiased estimators exist. In CalibrationErrors.jl, SKCE lets you construct unbiased and biased estimators with quadratic and sub-quadratic sample complexity.

CalibrationErrors.SKCE (Type)
SKCE(k; unbiased::Bool=true, blocksize=identity)

Estimator of the squared kernel calibration error (SKCE) with kernel k.

The kernel k on the product space of predictions and targets has to be a Kernel from the Julia package KernelFunctions.jl that can be evaluated for inputs that are tuples of predictions and targets.

One can choose an unbiased or a biased variant with unbiased=true or unbiased=false, respectively (see details below).

The SKCE is estimated as the average of the estimates on different blocks of samples. The number of samples per block is set by blocksize:

  • If blocksize is a function blocksize(n::Int), then the number of samples per block is set to blocksize(n) where n is the total number of samples.
  • If blocksize is an integer, then the number of samples per block is set to blocksize, independent of the total number of samples.

The default setting blocksize=identity implies that a single block with all samples is used.

The number of samples per block must be at least 1 if unbiased=false and 2 if unbiased=true. Additionally, it must be at most the total number of samples. Note that the last block is neglected if it is incomplete (see details below).
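
A brief usage sketch, assuming a tensor product of a Gaussian kernel on the predicted probability vectors and a white-noise kernel on the discrete targets from KernelFunctions.jl; the kernel choice, the toy data, and the callable evaluation at the end are illustrative assumptions rather than prescriptions:

using CalibrationErrors, KernelFunctions

# Kernel on the product space of predictions and targets: Gaussian kernel on the
# probability vectors combined with a white-noise (delta) kernel on the class labels.
k = GaussianKernel() ⊗ WhiteKernel()

skce = SKCE(k)                          # unbiased, single block with all samples
skce_biased = SKCE(k; unbiased=false)   # biased variant
skce_blocks = SKCE(k; blocksize=2)      # unbiased, blocks of two samples
skce_sqrt = SKCE(k; blocksize=n -> floor(Int, sqrt(n)))  # block size grows with n

# Toy data: probability vectors over three classes and corresponding targets.
predictions = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]]
targets = [1, 2, 3, 3]

# Assumed callable interface of the estimators (as in recent package versions).
skce(predictions, targets)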

Details

The unbiased estimator is not guaranteed to be non-negative whereas the biased estimator is always non-negative.

The sample complexity of the estimator is $O(mn)$, where $m$ is the block size and $n$ is the total number of samples. In particular, with the default setting blocksize=identity the estimator has a quadratic sample complexity.

Let $(P_{X_i}, Y_i)_{i=1,\ldots,n}$ be a data set of predictions and corresponding targets. The estimator with block size $m$ is defined as

\[{\bigg\lfloor \frac{n}{m} \bigg\rfloor}^{-1} \sum_{b=1}^{\lfloor n/m \rfloor} |B_b|^{-1} \sum_{(i, j) \in B_b} h_k\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big),\]

where

\[\begin{aligned} h_k\big((\mu, y), (\mu', y')\big) ={}& k\big((\mu, y), (\mu', y')\big) - \mathbb{E}_{Z \sim \mu} k\big((\mu, Z), (\mu', y')\big) \\ & - \mathbb{E}_{Z' \sim \mu'} k\big((\mu, y), (\mu', Z')\big) + \mathbb{E}_{Z \sim \mu, Z' \sim \mu'} k\big((\mu, Z), (\mu', Z')\big) \end{aligned}\]

and blocks $B_b$ ($b = 1, \ldots, \lfloor n/m \rfloor$) are defined as

\[B_b = \begin{cases} \{(i, j): (b - 1) m < i < j \leq bm \} & \text{(unbiased)}, \\ \{(i, j): (b - 1) m < i, j \leq bm \} & \text{(biased)}. \end{cases}\]
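
For concreteness, a minimal sketch of this block estimator in plain Julia, assuming the closed-form $h_k$ of the Gaussian-times-delta kernel from the earlier example; the function block_skce and its keyword arguments are illustrative and not part of the package API:

# Block estimator of the SKCE for the kernel k((p, y), (q, z)) = exp(-‖p - q‖²/2) δ_{y,z}.
# `predictions` is a vector of probability vectors, `targets` a vector of class indices,
# `blocksize` the number m of samples per block; an incomplete last block is discarded.
function block_skce(predictions, targets; blocksize::Int=length(predictions), unbiased::Bool=true)
    h(μ, y, ν, z) = exp(-sum(abs2, μ - ν) / 2) * ((y == z) - μ[z] - ν[y] + sum(μ .* ν))
    m = blocksize
    nblocks = div(length(predictions), m)
    estimate = 0.0
    for b in 1:nblocks
        offset = (b - 1) * m
        blocksum = 0.0
        npairs = 0
        # unbiased: pairs i < j within the block; biased: all pairs, including i = j
        for i in 1:m, j in (unbiased ? i + 1 : 1):m
            blocksum += h(predictions[offset + i], targets[offset + i],
                          predictions[offset + j], targets[offset + j])
            npairs += 1
        end
        estimate += blocksum / npairs
    end
    return estimate / nblocks
end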

References

Kumar, A., Sarawagi, S., & Jain, U. (2018). Trainable calibration measures for neural networks from kernel mean embeddings. In: Proceedings of the 35th International Conference on Machine Learning (ICML 2018).

Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In: Advances in Neural Information Processing Systems (NeurIPS 2019) (pp. 12257–12267).

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. In: International Conference on Learning Representations (ICLR 2021).
