1 Bernstein concentration inequality

The standard Bernstein inequality for real random variables demonstrates that a sum of independent bounded random variables exhibits normal concentration near its mean on a scale determined by the variance of the sum.

Theorem 1 (Bernstein inequality)

Let $\{S_1,\dots,S_n\}$ be independent and centered real random variables satisfying $|S_i| \le R$ a.s. for $i=1,\dots,n$. Let $Z=\sum_{i=1}^n S_i$ and assume $\mathbb{E}[Z^2]=\sum_{i=1}^n \mathbb{E}[S_i^2] \le \sigma^2$. Then for any $t>0$,

\[
\mathbb{P}\bigl(|Z| \ge t\bigr) \le 2\exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right).
\]
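
As a quick illustration (not part of the original note), the following Python sketch compares the empirical two-sided tail of a sum of centered uniform variables with the Bernstein bound; the choice of summands, with $R=1/2$ and $\sigma^2=n/12$, is my own.

```python
# Monte Carlo sanity check of the scalar Bernstein bound.
# Summands: S_i = U_i - 1/2 with U_i ~ Uniform(0, 1), so |S_i| <= R = 1/2
# and Var(Z) = n * Var(U_i) = n / 12 =: sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 100_000
R, sigma2 = 0.5, n / 12.0

Z = (rng.random((trials, n)) - 0.5).sum(axis=1)   # independent copies of Z

for t in [2.0, 4.0, 6.0, 8.0]:
    empirical = np.mean(np.abs(Z) >= t)
    bound = 2 * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
    print(f"t = {t:4.1f}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```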

Sums of independent random matrices exhibit the same type of behavior, where the normal concentration depends on a matrix generalization of the variance and the tails are controlled by a uniform bound on the spectral norm of each summand [3].

Theorem 2 (Matrix Bernstein concentration)

Let $\{S_1,\dots,S_n\}$ be independent random real symmetric matrices of dimension $d$. Assume each random matrix satisfies

\[
\mathbb{E}[S_i]=\mathbf{0} \quad\text{and}\quad \|S_i\|\le R \quad \text{a.s.},
\]

where $\|\cdot\|$ is the matrix spectral norm, and $\bigl\|\sum_{i=1}^n \mathbb{E}[S_i^2]\bigr\| \le \sigma^2$. Write $Z=\sum_{i=1}^n S_i$. Then for any $t>0$,

\[
\mathbb{P}\bigl(\|Z\| \ge t\bigr) \le d\,\exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right). \tag{1}
\]
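
For concreteness, here is a small numerical sketch (my own construction) with rank-one summands for which $R$ and $\sigma^2$ are available in closed form; the empirical tail of $\|Z\|$ is then compared against the right-hand side of (1).

```python
# Matrix Bernstein illustration with S_i = g_i g_i^T - I/d, g_i uniform on the
# unit sphere of R^d.  Then ||S_i|| <= 1 - 1/d =: R almost surely and
# sum_i E[S_i^2] = n (d - 1)/d^2 * I, so sigma^2 = n (d - 1)/d^2.
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 10, 500, 2000
R = 1.0 - 1.0 / d
sigma2 = n * (d - 1) / d**2

def spectral_norm_Z():
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)     # rows uniform on the sphere
    Z = g.T @ g - n * np.eye(d) / d                   # Z = sum_i S_i
    return np.linalg.norm(Z, 2)

norms = np.array([spectral_norm_Z() for _ in range(trials)])
for t in [20.0, 25.0, 30.0]:
    empirical = np.mean(norms >= t)
    bound = d * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
    print(f"t = {t:4.1f}: empirical {empirical:.4f} <= bound (1) {bound:.4f}")
```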

A remarkable feature is that the above bound depends on the ambient dimension $d$ of the matrices. On the other hand, in [2, Chapter 7], Tropp points out that $d$ can be replaced by the so-called “intrinsic dimension” of a matrix.

2 Operator Bernstein concentration

For a positive semidefinite Hilbert–Schmidt operator $A$ that is also trace class, we can likewise define its intrinsic dimension as $r(A) = \operatorname{Tr}(A)/\|A\|$. The following result is from [1].
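
The point of $r(A)$ is that it can be far smaller than the ambient dimension when the spectrum decays quickly; a short check, with an eigenvalue profile of my own choosing:

```python
# Intrinsic dimension r(A) = Tr(A) / ||A|| of a PSD matrix with geometrically
# decaying eigenvalues: r(A) stays near 2 no matter how large d is.
import numpy as np

d = 500
A = np.diag(2.0 ** -np.arange(d))                 # eigenvalues 1, 1/2, 1/4, ...
r = np.trace(A) / np.linalg.norm(A, 2)
print(f"ambient dimension d = {d}, intrinsic dimension r(A) = {r:.4f}")
```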

Theorem 3 (Operator Bernstein concentration)

Let $\{S_1,\dots,S_n\}$ be independent random self-adjoint Hilbert–Schmidt operators, where each $S_i : H \to H$ acts on a separable Hilbert space $H$. Assume each random operator satisfies

\[
\mathbb{E}[S_i]=\mathbf{0} \quad\text{and}\quad \|S_i\|\le R \quad \text{a.s.},
\]

where $\|\cdot\|$ is the operator norm, and $\bigl\|\sum_{i=1}^n \mathbb{E}[S_i^2]\bigr\| \le \sigma^2$. Write $Z=\sum_{i=1}^n S_i$. Then for any $t \ge \frac{1}{6}\bigl(R + (R^2 + 36\sigma^2)^{1/2}\bigr)$,

\[
\mathbb{P}\bigl(\|Z\| \ge t\bigr) \le r\!\left(\sum_{i=1}^n \mathbb{E}[S_i^2]\right)\exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right). \tag{2}
\]
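
To see how the pieces of (2) fit together, the sketch below evaluates the threshold $t_0 = \frac{1}{6}\bigl(R + (R^2+36\sigma^2)^{1/2}\bigr)$, the intrinsic-dimension prefactor, and the bound itself for an illustrative variance operator $V = \sum_i \mathbb{E}[S_i^2]$ of my own choosing.

```python
# Evaluating the operator Bernstein bound (2) for an illustrative variance proxy V.
import numpy as np

V = np.diag(50.0 * 0.5 ** np.arange(30))      # stand-in for sum_i E[S_i^2]
sigma2 = np.linalg.norm(V, 2)                 # sigma^2 = ||V|| = 50
r = np.trace(V) / sigma2                      # intrinsic dimension r(V), about 2
R = 1.0                                       # assumed a.s. bound on ||S_i||

t0 = (R + np.sqrt(R**2 + 36 * sigma2)) / 6    # smallest t for which (2) applies
for t in [t0, 2 * t0, 4 * t0]:
    bound = r * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
    print(f"t = {t:6.2f}: prefactor r = {r:.2f}, bound = {bound:.4f}")
```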

In statistical learning, random self-adjoint operators frequently emerge in the study of approximation errors. Let $x$ be a random data point taking values in $X$. Rigorously speaking, let $(\Omega,\mathcal{A},\mathbb{P})$ be a probability space and $x:(\Omega,\mathcal{A})\to(X,\mathcal{B})$ be measurable, so that the distribution of $x$ on $X$ is $\nu = \mathbb{P}\circ x^{-1}$. Let $K(\cdot,\cdot): X\times X\to\mathbb{R}$ be a symmetric positive definite kernel satisfying $\sup_{x\in X} K(x,x) \le \kappa < \infty$. Define the kernel integral operator $T$ on $L^2(X,\mathcal{B},\nu)$ by

\[
(Tf)(x) = \int_X K(x,y)\,f(y)\,\nu(dy). \tag{3}
\]
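
A minimal numerical sketch of (3), with a Gaussian kernel on $X=[0,1]$, $\nu$ the uniform distribution, and a test function of my choosing; the integral is approximated by Monte Carlo.

```python
# Monte Carlo approximation of (T f)(x) = E_{y ~ nu}[K(x, y) f(y)].
import numpy as np

rng = np.random.default_rng(2)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)    # bounded kernel, kappa = K(x, x) = 1
f = lambda y: np.sin(2 * np.pi * y)             # test function in L^2([0, 1])

y = rng.random(100_000)                          # sample from nu = Uniform(0, 1)
Tf = lambda x: np.mean(K(x, y) * f(y))           # pointwise evaluation of T f
print([round(Tf(x), 4) for x in (0.25, 0.5, 0.75)])
```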

Then $T$ is a Hilbert–Schmidt operator, with Hilbert–Schmidt norm satisfying

\[
\|T\|_{\mathrm{HS}}^2 = \int_{X\times X} |K(x,y)|^2 \,(\nu\otimes\nu)\bigl(d(x,y)\bigr) \le \kappa^2,
\]

where we have used $K(x,y)^2 \le K(x,x)\,K(y,y)$.
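
A quick Monte Carlo check of this bound for the Gaussian kernel from the previous sketch (where $\kappa = 1$):

```python
# Estimate ||T||_HS^2 = E_{(x, y) ~ nu x nu}[K(x, y)^2] and compare with kappa^2.
import numpy as np

rng = np.random.default_rng(3)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)
x, y = rng.random(200_000), rng.random(200_000)   # independent samples from nu
hs_sq = np.mean(K(x, y) ** 2)
print(f"||T||_HS^2 ~= {hs_sq:.4f} <= kappa^2 = 1")
```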

Let $\mathcal{H}$ be the Reproducing Kernel Hilbert Space (RKHS) with kernel $K$. If $X$ is a locally compact Hausdorff space, $\nu$ a $\sigma$-finite Radon measure, and $K\in L^2(X\times X,\nu\otimes\nu)$, then $\mathcal{H}$ is separable. In fact, the countable collection of eigenfunctions of $T$ corresponding to positive eigenvalues forms a complete orthogonal basis of $\mathcal{H}$.

Suppose we have $n$ i.i.d. samples of $x$; that is, $\{x_1,\dots,x_n\}$ are i.i.d. random variables with the same distribution as $x$. Define the empirical operator on $\mathcal{H}$ as

\[
T_n = \frac{1}{n}\sum_{i=1}^n K_{x_i}\otimes K_{x_i}, \tag{4}
\]

where $K_x = K(x,\cdot)\in\mathcal{H}$ and $K_x\otimes K_x$ denotes the rank-one operator $f\mapsto \langle f, K_x\rangle_{\mathcal{H}}\,K_x = f(x)\,K_x$.
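
In computations, $T_n$ is usually handled through the $n\times n$ Gram matrix: the nonzero eigenvalues of $T_n$ coincide with those of $G/n$, where $G_{ij}=K(x_i,x_j)$. A short sketch, reusing the Gaussian kernel and uniform samples from above:

```python
# The nonzero spectrum of T_n equals the spectrum of G / n, G_ij = K(x_i, x_j).
import numpy as np

rng = np.random.default_rng(4)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)
n = 500
xs = rng.random(n)
G = K(xs[:, None], xs[None, :])                  # Gram matrix
top = np.linalg.eigvalsh(G / n)[::-1][:5]        # leading eigenvalues of T_n
print(np.round(top, 4))
```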

Now $T_n$ is a random self-adjoint operator, and it approximates $T$ as $n\to\infty$. Notice that $T$ is also a Hilbert–Schmidt operator on $\mathcal{H}$.

Let $S_i = K_{x_i}\otimes K_{x_i} - T$. Then $\frac{1}{n}\sum_{i=1}^n S_i = T_n - T$, and

\[
\mathbb{E}[S_i] = \int_X K_x\otimes K_x\,\nu(dx) - T = 0,
\]

where the Bochner integral $\int_X K_x\otimes K_x\,\nu(dx)$ converges in the Hilbert–Schmidt norm. Therefore, we can apply the operator Bernstein concentration inequality to control the sum of the i.i.d. mean-zero random operators $\{S_1,\dots,S_n\}$.
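
To close the loop, here is a self-contained simulation of this application for a finite-rank kernel with a known Mercer decomposition, so that $T_n - T$ can be computed exactly in feature coordinates; all concrete choices (cosine features, geometric eigenvalues) are mine. The medians of $\|T_n - T\|$ shrink roughly at the $n^{-1/2}$ rate suggested by applying (2) to $\sum_i S_i$.

```python
# Finite-rank kernel K(x, y) = sum_k lam_k phi_k(x) phi_k(y) on X = [0, 1] with
# nu = Uniform(0, 1) and phi_k(x) = sqrt(2) cos(k pi x) orthonormal in L^2(nu).
# In feature coordinates T is diag(lam) and T_n is an empirical second moment.
import numpy as np

rng = np.random.default_rng(5)
m = 8
lam = 2.0 ** -np.arange(1, m + 1)                  # eigenvalues of T
ks = np.arange(1, m + 1)
feat = lambda x: np.sqrt(2 * lam) * np.cos(np.pi * np.outer(x, ks))  # feature map
C = np.diag(lam)                                   # T in feature coordinates

for n in [100, 400, 1600]:
    errs = []
    for _ in range(200):
        F = feat(rng.random(n))                    # n x m feature matrix
        Tn = F.T @ F / n                           # empirical operator T_n
        errs.append(np.linalg.norm(Tn - C, 2))
    print(f"n = {n:5d}: median ||T_n - T|| = {np.median(errs):.4f}")
```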

References

  • [1] S. Minsker (2017). On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics and Probability Letters 127, pp. 111–119.
  • [2] J. A. Tropp (2015). An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning 8 (1–2), pp. 1–230.
  • [3] J. A. Tropp (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12 (4), pp. 389–434.