The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of all events in an entire class from one and the same sample. Moreover, it is required that the relative frequencies of the events converge to their probabilities uniformly over the entire class of events [1]. The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event family is sufficiently simple (its VC dimension is sufficiently small), then uniform convergence holds.
For a class of predicates $H$ defined on a set $X$ and a set of samples $x=(x_1,x_2,\dots,x_m)$, where $x_i\in X$, the empirical frequency of $h\in H$ on $x$ is

$$\widehat{Q}_x(h)=\frac{1}{m}\left|\{i: 1\le i\le m,\ h(x_i)=1\}\right|.$$

The theoretical probability of $h\in H$ is defined as

$$Q_P(h)=P\{y\in X: h(y)=1\}.$$
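The two quantities above can be illustrated with a short sketch. The names, the choice $X=[0,1)$ with $P$ uniform, and the predicate are all illustrative assumptions, not part of the theorem:

```python
import random

# A minimal sketch (names and the predicate are illustrative): on the set
# X = [0, 1) with P the uniform distribution, take the single predicate
# h(y) = 1 iff y < 0.3, so the theoretical probability Q_P(h) is 0.3.
def h(y):
    return 1 if y < 0.3 else 0

def empirical_frequency(h, sample):
    # \hat{Q}_x(h) = (1/m) |{i : 1 <= i <= m, h(x_i) = 1}|
    return sum(h(x_i) for x_i in sample) / len(sample)

random.seed(0)
sample = [random.random() for _ in range(100_000)]
# By the law of large numbers this gap is small for large m.
print(abs(empirical_frequency(h, sample) - 0.3))
```

For a single predicate this is just the law of large numbers; the theorem below is about controlling this gap for every $h\in H$ at once.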
The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$, then with high probability the empirical frequency will be close to its expected value, which is the theoretical probability.[2]
Here "simple" means that the Vapnik–Chervonenkis dimension of the class is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function.
The statement of the Uniform Convergence Theorem is as follows:[3]
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon>0$ and $m$ a positive integer, we have:

$$P^m\{|Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon \text{ for some } h\in H\}\le 4\Pi_H(2m)e^{-\varepsilon^2 m/8}.$$

In the above, $\widehat{Q}_x(h)$ is the empirical frequency of $h$ on the sample $x$ for any $x\in X^m$, and $P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
Finally, the growth function $\Pi_H$ is defined in the following way, for any class of $\{0,1\}$-valued functions $H$ over $X$ and for any natural number $m$:

$$\Pi_H(m)=\max\left\{\left|\{(h(x_1),h(x_2),\dots,h(x_m)): h\in H\}\right| : x_1,x_2,\dots,x_m\in X\right\}.$$
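The inner set in this definition counts the distinct dichotomies (behaviour patterns) that $H$ induces on a sample, which can be enumerated directly when a finite subfamily of predicates suffices. The sketch below (function names and the threshold class are illustrative) does this for threshold predicates $h_t(y)=1$ iff $y\le t$, where only $m+1$ patterns are possible on $m$ distinct points even though the class itself is infinite:

```python
# Count the distinct dichotomies a set of predicates induces on a sample,
# i.e. the inner set in the definition of the growth function.
def num_dichotomies(predicates, points):
    return len({tuple(h(x) for x in points) for h in predicates})

# Threshold class on the reals: h_t(y) = 1 iff y <= t.  On m distinct
# points, thresholds just below/above each point realise every pattern,
# so Pi_H(m) = m + 1.
points = [0.1, 0.4, 0.7, 0.9]                      # m = 4
thresholds = [-1.0] + list(points) + [2.0]         # enough to realise all patterns
predicates = [(lambda t: (lambda y: int(y <= t)))(t) for t in thresholds]
print(num_dichotomies(predicates, points))  # 5 = m + 1
```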
From the point of view of Learning Theory, one can consider $H$ to be the Concept/Hypothesis class defined over the instance set $X$. Crucially, the Sauer–Shelah lemma implies that $\Pi_H(m)\le O(m^d)$, where $d$ is the VC dimension of $H$; this polynomial growth is what makes the bound above nontrivial, since the exponential factor eventually dominates.
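Combining the theorem with the Sauer–Shelah lemma gives a fully explicit bound. The sketch below (function names are illustrative) evaluates $\sum_{i=0}^{d}\binom{m}{i}$, the exact Sauer–Shelah bound on $\Pi_H(m)$, and plugs it into the right-hand side of the theorem:

```python
import math

def sauer_shelah_bound(m, d):
    # Pi_H(m) <= sum_{i=0}^{d} C(m, i); this is O(m^d) for fixed d.
    return sum(math.comb(m, i) for i in range(d + 1))

def vc_bound(m, d, eps):
    # RHS of the theorem with the growth function replaced by the
    # Sauer-Shelah bound: 4 * Pi_H(2m) * exp(-eps^2 * m / 8).
    return 4 * sauer_shelah_bound(2 * m, d) * math.exp(-eps**2 * m / 8)

# For a class of VC dimension 1 (e.g. thresholds), the bound is already
# below 1 at m = 10,000 samples and eps = 0.1.
print(vc_bound(10_000, d=1, eps=0.1))
```

The polynomial factor $4\Pi_H(2m)$ loses to the exponential decay $e^{-\varepsilon^2 m/8}$ once $m$ is large enough relative to $d$ and $1/\varepsilon^2$.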
References [1] and [3] are the sources of the proof below. Before we get into its details, we present a high-level overview of the proof.
Symmetrization: We transform the problem of analyzing $|Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$.
Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So, we will try to bound the probability of $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$ for some $h\in H$ by considering the effect of a specific collection of permutations of the joint sample $x=r||s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ for each $i$ in some subset of $\{1,2,\dots,m\}$. The symbol $r||s$ means the concatenation of $r$ and $s$.
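The permutation step can be made concrete for a single fixed predicate $h$ on a toy joint sample. The sketch below (function names and the sample are illustrative) enumerates all $2^m$ swapping permutations and measures, for each, the gap $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|$ on the permuted sample; the proof bounds the fraction of swaps for which this gap stays large:

```python
from itertools import product

# For each of the 2^m swapping permutations, recompute the gap
# |Q_r(h) - Q_s(h)| after swapping x_i and x_{m+i} in the chosen subset.
def gap_after_swaps(bits, m):
    # bits[i] = h(x_i) on the joint sample x = r || s of length 2m.
    gaps = []
    for swaps in product([0, 1], repeat=m):
        r = [bits[i + m] if s else bits[i] for i, s in enumerate(swaps)]
        s_ = [bits[i] if s else bits[i + m] for i, s in enumerate(swaps)]
        gaps.append(abs(sum(r) - sum(s_)) / m)
    return gaps

m = 4
bits = [1, 1, 0, 1, 0, 0, 1, 0]  # h-values on a toy joint sample r || s
gaps = gap_after_swaps(bits, m)
# Fraction of the 2^m swapping permutations with a gap of at least 1/2:
print(sum(g >= 0.5 for g in gaps) / len(gaps))
```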
Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
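The phenomenon the proof establishes can also be observed numerically. The sketch below (assuming $X=[0,1)$ with $P$ uniform) uses the threshold class $h_t(y)=1$ iff $y\le t$, which has VC dimension 1 and $Q_P(h_t)=t$; for this class the supremum over all $t$ of $|Q_P(h_t)-\widehat{Q}_x(h_t)|$ can be computed exactly from the sorted sample (it is the one-sample Kolmogorov–Smirnov statistic), and it shrinks as $m$ grows:

```python
import random

# Exact uniform deviation sup_t |Q_P(h_t) - \hat{Q}_x(h_t)| for the
# threshold class under the uniform distribution on [0, 1).
def uniform_deviation(sample):
    xs = sorted(sample)
    m = len(xs)
    dev = 0.0
    for i, x in enumerate(xs):
        # The empirical frequency jumps from i/m to (i+1)/m at x, while
        # the theoretical probability there is x itself.
        dev = max(dev, abs(x - i / m), abs(x - (i + 1) / m))
    return dev

random.seed(1)
for m in (100, 10_000):
    print(m, uniform_deviation([random.random() for _ in range(m)]))
```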
We present the technical details of the proof. It should be stressed that this proof glosses over details such as the measurability of the events involved; measurability is guaranteed when $H$ is finite or countable, but this is not normally the case in standard applications of the theorem (e.g. in statistical learning theory or in proving the Glivenko–Cantelli theorem). To obtain measurability, one needs to use a notion of separability of the underlying space; see [4].
Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025.
This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968.
The translation was reproduced as:
Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
Krapp, Lothar Sebastian; Wirth, Laura (26 September 2025). Measurability in the Fundamental Theorem of Statistical Learning, v3. arXiv:2410.10243. doi:10.48550/arXiv.2410.10243.