1 Bernstein concentration inequality

The standard Bernstein inequality for real random variables demonstrates that a sum of independent bounded random variables exhibits normal concentration near its mean on a scale determined by the variance of the sum.

Theorem 1 (Bernstein inequality)

Let $\{S_1,\dots,S_n\}$ be independent and centered real random variables satisfying $|S_i| \le R$ a.s. for $i=1,\dots,n$. Let $Z=\sum_{i=1}^n S_i$ and assume $\mathbb{E}[Z^2]=\sum_{i=1}^n \mathbb{E}[S_i^2] \le \sigma^2$. Then for any $t>0$,

\[
\mathbb{P}\bigl(|Z| \ge t\bigr) \le 2\exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right).
\]
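
As a quick illustration (not part of the original note), the following Python sketch compares the empirical two-sided tail of a sum of centered uniform variables with the Bernstein bound; the choice of summands, with $R=1/2$ and $\sigma^2=n/12$, is my own.

```python
# Monte Carlo sanity check of the scalar Bernstein bound.
# Summands: S_i = U_i - 1/2 with U_i ~ Uniform(0, 1), so |S_i| <= R = 1/2
# and Var(Z) = n * Var(U_i) = n / 12 =: sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 100_000
R, sigma2 = 0.5, n / 12.0

Z = (rng.random((trials, n)) - 0.5).sum(axis=1)   # independent copies of Z

for t in [2.0, 4.0, 6.0, 8.0]:
    empirical = np.mean(np.abs(Z) >= t)
    bound = 2 * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
    print(f"t = {t:4.1f}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```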

Sums of independent random matrices exhibit the same type of behavior, where the normal concentration depends on a matrix generalization of the variance and the tails are controlled by a uniform bound on the spectral norm of each summand [3].

Theorem 2 (Matrix Bernstein concentration)

Let $\{S_1,\dots,S_n\}$ be independent random real symmetric matrices of dimension $d$. Assume each random matrix satisfies

\[
\mathbb{E}[S_i]=\mathbf{0} \quad\text{and}\quad \|S_i\|\le R \quad \text{a.s.},
\]

where $\|\cdot\|$ is the matrix spectral norm, and $\bigl\|\sum_{i=1}^n \mathbb{E}[S_i^2]\bigr\| \le \sigma^2$. Write $Z=\sum_{i=1}^n S_i$. Then for any $t>0$,

\[
\mathbb{P}\bigl(\|Z\| \ge t\bigr) \le d\,\exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right). \tag{1}
\]
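
For concreteness, here is a small numerical sketch (my own construction) with rank-one summands for which $R$ and $\sigma^2$ are available in closed form; the empirical tail of $\|Z\|$ is then compared against the right-hand side of (1).

```python
# Matrix Bernstein illustration with S_i = g_i g_i^T - I/d, g_i uniform on the
# unit sphere of R^d.  Then ||S_i|| <= 1 - 1/d =: R almost surely and
# sum_i E[S_i^2] = n (d - 1)/d^2 * I, so sigma^2 = n (d - 1)/d^2.
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 10, 500, 2000
R = 1.0 - 1.0 / d
sigma2 = n * (d - 1) / d**2

def spectral_norm_Z():
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)     # rows uniform on the sphere
    Z = g.T @ g - n * np.eye(d) / d                   # Z = sum_i S_i
    return np.linalg.norm(Z, 2)

norms = np.array([spectral_norm_Z() for _ in range(trials)])
for t in [20.0, 25.0, 30.0]:
    empirical = np.mean(norms >= t)
    bound = d * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
    print(f"t = {t:4.1f}: empirical {empirical:.4f} <= bound (1) {bound:.4f}")
```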

A remarkable feature is that the above bound depends on the ambient dimension $d$ of the matrices. On the other hand, in [2, Chapter 7], Tropp points out that $d$ can be replaced by the so-called “intrinsic dimension” of a matrix.

2 Operator Bernstein concentration

For a positive semidefinite Hilbert–Schmidt operator $A$ that is also trace class, we can likewise define its intrinsic dimension as $r(A) = \operatorname{Tr}(A)/\|A\|$. The following result is from [1].
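
The point of $r(A)$ is that it can be far smaller than the ambient dimension when the spectrum decays quickly; a short check, with an eigenvalue profile of my own choosing:

```python
# Intrinsic dimension r(A) = Tr(A) / ||A|| of a PSD matrix with geometrically
# decaying eigenvalues: r(A) stays near 2 no matter how large d is.
import numpy as np

d = 500
A = np.diag(2.0 ** -np.arange(d))                 # eigenvalues 1, 1/2, 1/4, ...
r = np.trace(A) / np.linalg.norm(A, 2)
print(f"ambient dimension d = {d}, intrinsic dimension r(A) = {r:.4f}")
```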

Theorem 3 (Operator Bernstein concentration)

Let $\{S_1,\dots,S_n\}$ be independent random self-adjoint Hilbert–Schmidt operators, where each $S_i : H \to H$ acts on a separable Hilbert space $H$. Assume each random operator satisfies

\[
\mathbb{E}[S_i]=\mathbf{0} \quad\text{and}\quad \|S_i\|\le R \quad \text{a.s.},
\]

where $\|\cdot\|$ is the operator norm, and $\bigl\|\sum_{i=1}^n \mathbb{E}[S_i^2]\bigr\| \le \sigma^2$. Write $Z=\sum_{i=1}^n S_i$. Then for any $t \ge \frac{1}{6}\bigl(R + (R^2 + 36\sigma^2)^{1/2}\bigr)$,

\[
\mathbb{P}\bigl(\|Z\| \ge t\bigr) \le r\!\left(\sum_{i=1}^n \mathbb{E}[S_i^2]\right)\exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right). \tag{2}
\]
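
To see how the pieces of (2) fit together, the sketch below evaluates the threshold $t_0 = \frac{1}{6}\bigl(R + (R^2+36\sigma^2)^{1/2}\bigr)$, the intrinsic-dimension prefactor, and the bound itself for an illustrative variance operator $V = \sum_i \mathbb{E}[S_i^2]$ of my own choosing.

```python
# Evaluating the operator Bernstein bound (2) for an illustrative variance proxy V.
import numpy as np

V = np.diag(50.0 * 0.5 ** np.arange(30))      # stand-in for sum_i E[S_i^2]
sigma2 = np.linalg.norm(V, 2)                 # sigma^2 = ||V|| = 50
r = np.trace(V) / sigma2                      # intrinsic dimension r(V), about 2
R = 1.0                                       # assumed a.s. bound on ||S_i||

t0 = (R + np.sqrt(R**2 + 36 * sigma2)) / 6    # smallest t for which (2) applies
for t in [t0, 2 * t0, 4 * t0]:
    bound = r * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
    print(f"t = {t:6.2f}: prefactor r = {r:.2f}, bound = {bound:.4f}")
```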

In statistical learning, random self-adjoint operators frequently emerge in the study of approximation errors. Let $x$ be a random data point taking values in $X$. Rigorously speaking, let $(\Omega,\mathcal{A},\mathbb{P})$ be a probability space and $x:(\Omega,\mathcal{A})\to(X,\mathcal{B})$ be measurable, so that the distribution of $x$ on $X$ is $\nu = \mathbb{P}\circ x^{-1}$. Let $K(\cdot,\cdot): X\times X\to\mathbb{R}$ be a symmetric positive definite kernel satisfying $\sup_{x\in X} K(x,x) \le \kappa < \infty$. Define the kernel integral operator $T$ on $L^2(X,\mathcal{B},\nu)$ by

\[
(Tf)(x) = \int_X K(x,y)\,f(y)\,\nu(dy). \tag{3}
\]
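
A minimal numerical sketch of (3), with a Gaussian kernel on $X=[0,1]$, $\nu$ the uniform distribution, and a test function of my choosing; the integral is approximated by Monte Carlo.

```python
# Monte Carlo approximation of (T f)(x) = E_{y ~ nu}[K(x, y) f(y)].
import numpy as np

rng = np.random.default_rng(2)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)    # bounded kernel, kappa = K(x, x) = 1
f = lambda y: np.sin(2 * np.pi * y)             # test function in L^2([0, 1])

y = rng.random(100_000)                          # sample from nu = Uniform(0, 1)
Tf = lambda x: np.mean(K(x, y) * f(y))           # pointwise evaluation of T f
print([round(Tf(x), 4) for x in (0.25, 0.5, 0.75)])
```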

Then $T$ is a Hilbert–Schmidt operator, with Hilbert–Schmidt norm satisfying

\[
\|T\|_{\mathrm{HS}}^2 = \int_{X\times X} |K(x,y)|^2 \,(\nu\otimes\nu)\bigl(d(x,y)\bigr) \le \kappa^2,
\]

where we have used $K(x,y)^2 \le K(x,x)\,K(y,y)$.
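
A quick Monte Carlo check of this bound for the Gaussian kernel from the previous sketch (where $\kappa = 1$):

```python
# Estimate ||T||_HS^2 = E_{(x, y) ~ nu x nu}[K(x, y)^2] and compare with kappa^2.
import numpy as np

rng = np.random.default_rng(3)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)
x, y = rng.random(200_000), rng.random(200_000)   # independent samples from nu
hs_sq = np.mean(K(x, y) ** 2)
print(f"||T||_HS^2 ~= {hs_sq:.4f} <= kappa^2 = 1")
```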

Let $\mathcal{H}$ be the Reproducing Kernel Hilbert Space (RKHS) with kernel $K$. If $X$ is a locally compact Hausdorff space, $\nu$ a $\sigma$-finite Radon measure, and $K\in L^2(X\times X,\nu\otimes\nu)$, then $\mathcal{H}$ is separable. In fact, the countable collection of eigenfunctions of $T$ corresponding to positive eigenvalues forms a complete orthogonal basis of $\mathcal{H}$.

Suppose we have $n$ i.i.d. samples of $x$; that is, $\{x_1,\dots,x_n\}$ are i.i.d. random variables with the same distribution as $x$. Define the empirical operator on $\mathcal{H}$ as

\[
T_n = \frac{1}{n}\sum_{i=1}^n K_{x_i}\otimes K_{x_i}, \tag{4}
\]

where $K_x = K(x,\cdot)\in\mathcal{H}$ and $K_x\otimes K_x$ denotes the rank-one operator $f\mapsto \langle f, K_x\rangle_{\mathcal{H}}\,K_x = f(x)\,K_x$.
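
In computations, $T_n$ is usually handled through the $n\times n$ Gram matrix: the nonzero eigenvalues of $T_n$ coincide with those of $G/n$, where $G_{ij}=K(x_i,x_j)$. A short sketch, reusing the Gaussian kernel and uniform samples from above:

```python
# The nonzero spectrum of T_n equals the spectrum of G / n, G_ij = K(x_i, x_j).
import numpy as np

rng = np.random.default_rng(4)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)
n = 500
xs = rng.random(n)
G = K(xs[:, None], xs[None, :])                  # Gram matrix
top = np.linalg.eigvalsh(G / n)[::-1][:5]        # leading eigenvalues of T_n
print(np.round(top, 4))
```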

Now $T_n$ is a random self-adjoint operator, and it approximates $T$ as $n\to\infty$. Notice that $T$ is also a Hilbert–Schmidt operator on $\mathcal{H}$.

Let $S_i = K_{x_i}\otimes K_{x_i} - T$. Then $\frac{1}{n}\sum_{i=1}^n S_i = T_n - T$, and

\[
\mathbb{E}[S_i] = \int_X K_x\otimes K_x\,\nu(dx) - T = 0,
\]

where the Bochner integral $\int_X K_x\otimes K_x\,\nu(dx)$ converges in the Hilbert–Schmidt norm. Therefore, we can apply the operator Bernstein concentration inequality to control the sum of the i.i.d. mean-zero random operators $\{S_1,\dots,S_n\}$.
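
To close the loop, here is a self-contained simulation of this application for a finite-rank kernel with a known Mercer decomposition, so that $T_n - T$ can be computed exactly in feature coordinates; all concrete choices (cosine features, geometric eigenvalues) are mine. The medians of $\|T_n - T\|$ shrink roughly at the $n^{-1/2}$ rate suggested by applying (2) to $\sum_i S_i$.

```python
# Finite-rank kernel K(x, y) = sum_k lam_k phi_k(x) phi_k(y) on X = [0, 1] with
# nu = Uniform(0, 1) and phi_k(x) = sqrt(2) cos(k pi x) orthonormal in L^2(nu).
# In feature coordinates T is diag(lam) and T_n is an empirical second moment.
import numpy as np

rng = np.random.default_rng(5)
m = 8
lam = 2.0 ** -np.arange(1, m + 1)                  # eigenvalues of T
ks = np.arange(1, m + 1)
feat = lambda x: np.sqrt(2 * lam) * np.cos(np.pi * np.outer(x, ks))  # feature map
C = np.diag(lam)                                   # T in feature coordinates

for n in [100, 400, 1600]:
    errs = []
    for _ in range(200):
        F = feat(rng.random(n))                    # n x m feature matrix
        Tn = F.T @ F / n                           # empirical operator T_n
        errs.append(np.linalg.norm(Tn - C, 2))
    print(f"n = {n:5d}: median ||T_n - T|| = {np.median(errs):.4f}")
```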

References

  • [1] S. Minsker (2017). On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics and Probability Letters 127, pp. 111–119.
  • [2] J. A. Tropp (2015). An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning 8 (1–2), pp. 1–230.
  • [3] J. A. Tropp (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12 (4), pp. 389–434.