Proof of the day


TS Contributor
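In this thread we post one (or more if you can't wait) proof a day. I'll start by proving that

\( \operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1} \)

in the linear regression model. We can write the beta estimator as

\( \hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon \)

Then we have that

\( \operatorname{Var}(\hat{\beta}) = E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'] = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = \sigma^2 (X'X)^{-1} \)

where \( E(\varepsilon\varepsilon') = \sigma^2 I \) follows from one of the assumptions of the classical linear model: the spherical disturbances assumption.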
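
A small simulation can make the role of the spherical disturbances assumption concrete. The sketch below assumes the result in question is the standard OLS variance formula \( \operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1} \), and it only covers the special case of a single regressor without intercept, where the formula reduces to \( \sigma^2 / \sum x_i^2 \). The function names and the data are made up for illustration.

```python
import random

def ols_slope(xs, ys):
    """OLS estimate for y = beta * x + error (one regressor, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def simulate_slope_variance(xs, beta, sigma, reps, seed=0):
    """Sample variance of the OLS slope over repeated draws of i.i.d. N(0, sigma^2) errors."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
        estimates.append(ols_slope(xs, ys))
    mean = sum(estimates) / reps
    return sum((b - mean) ** 2 for b in estimates) / (reps - 1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]                      # fixed design (hypothetical)
sigma = 2.0
theoretical = sigma ** 2 / sum(x * x for x in xs)   # sigma^2 (X'X)^{-1} with k = 1
empirical = simulate_slope_variance(xs, beta=1.5, sigma=sigma, reps=20_000)
```

With homoskedastic, uncorrelated errors the two numbers should agree up to simulation noise; the agreement is not expected to hold if the errors are drawn with unequal variances.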


Smelly poop man with doo doo pants.
me likes this. most of my proofs would come from the field of psychometrics or quantitative psychology though (mostly factor analysis and structural equation modelling).

here i'm doing the (rather simple) proof of how the linear factor analysis model can be parameterised as a covariance structure model. it's relevant because as a linear factor model it is unsolvable, but as a covariance structure model it is possible to obtain parameter estimates.

let the observed score \(x\) be defined by the linear factor model \(x = \Lambda F+\epsilon\). since it is known that (in the case of multivariate normality) \(E(xx')=\Sigma\), it trivially follows that:

\(xx' = (\Lambda F+\epsilon)(\Lambda F+\epsilon)'\)
\(xx' = (\Lambda F+\epsilon)(F'\Lambda' + \epsilon')\)
\(xx' = \Lambda FF'\Lambda' + \epsilon F'\Lambda'+ \Lambda F \epsilon' + \epsilon\epsilon'\)

so taking the expectation of both sides:

\(E(xx') = E(\Lambda FF'\Lambda') + 0 + 0 + E(\epsilon\epsilon')\)

the two cross terms vanish because the errors are random and assumed uncorrelated with the factors, while the loadings are constants. now by linearity of expectation and substituting the covariance matrix of the factors, \(\Phi = E(FF')\), and of the errors, \(\Psi = E(\epsilon\epsilon')\), we can see that:

\(E(xx') = \Lambda E(FF')\Lambda' + E(\epsilon\epsilon')\)

\(\Sigma=\Lambda \Phi \Lambda' + \Psi\)

which is known as the fundamental equation of Factor Analysis.
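
As a rough numerical illustration of the covariance structure \(\Sigma=\Lambda \Phi \Lambda' + \Psi\), the sketch below compares the model-implied covariance matrix with the average of \(xx'\) over simulated scores. It assumes a single normally distributed factor, independent normal errors, and made-up loadings and unique variances; none of these values come from the post.

```python
import random

def implied_cov(lam, phi, psi):
    """Sigma = Lambda Phi Lambda' + Psi for one factor.

    lam: list of loadings, phi: factor variance, psi: list of unique variances.
    """
    p = len(lam)
    return [[lam[i] * phi * lam[j] + (psi[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def simulated_cov(lam, phi, psi, n, seed=0):
    """Average of x x' over n draws of x = Lambda F + epsilon (all means zero)."""
    rng = random.Random(seed)
    p = len(lam)
    acc = [[0.0] * p for _ in range(p)]
    for _ in range(n):
        f = rng.gauss(0.0, phi ** 0.5)
        x = [lam[i] * f + rng.gauss(0.0, psi[i] ** 0.5) for i in range(p)]
        for i in range(p):
            for j in range(p):
                acc[i][j] += x[i] * x[j]
    return [[acc[i][j] / n for j in range(p)] for i in range(p)]

lam = [0.8, 0.6, 0.5]        # hypothetical loadings
phi = 1.0                    # hypothetical factor variance
psi = [0.36, 0.64, 0.75]     # hypothetical unique variances
sigma = implied_cov(lam, phi, psi)
sigma_hat = simulated_cov(lam, phi, psi, n=50_000)
max_gap = max(abs(sigma[i][j] - sigma_hat[i][j])
              for i in range(3) for j in range(3))
```

The gap between the two matrices should shrink as `n` grows, which is the sense in which the covariance structure can be matched to data even though the factor scores themselves are never observed.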


TS Contributor
Okay, since this day is soon over (at least according to Swedish time) and no one posted a proof yet today, I'll post another proof. I'll give a very simple, and possibly boring, proof this time. I'll prove that \(\bar{x}\) is the value that minimizes the sum \(\sum_{i=1}^n{(x_i-a)^2}\) (1).

By taking the first derivative with respect to a and setting it equal to zero, we get \(\sum_{i=1}^n{-2(x_i-a)}=0 \Leftrightarrow -2\sum_{i=1}^n{x_i}+2na=0 \Leftrightarrow \sum_{i=1}^n{x_i}=na \Leftrightarrow \bar{x}=a\).

By checking the second order condition we see that the second derivative equals 2n, which is always positive, so now we know that \(\bar{x}\) is at least a local minimum. Since (1) is a convex quadratic function of a, it is easily seen that it is also a global minimum.


Ambassador to the humans
I prefer the version that doesn't require the use of calculus.

\(\sum (x_i - a)^2 = \sum (x_i - \bar{x} + \bar{x} - a)^2 = \sum \left[ (x_i - \bar{x})^2 + (\bar{x} - a)^2 + 2(x_i - \bar{x})(\bar{x} - a) \right]\)

\( = \sum (x_i - \bar{x})^2 + \sum(\bar{x} - a)^2 + 2\sum(x_i - \bar{x})(\bar{x} - a)\)

Now consider the last summation. Note that in the sum both \(a\) and \(\bar{x}\) are constant so we can pull them out

\( = 2(\bar{x} - a) \sum (x_i - \bar{x})\)
We know that that sum is equal to 0 so this shows the third summation disappears.

We are left with

\(\sum (x_i - a)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - a)^2\)

The first summation we can't control and the second term is always non-negative so the minimum would occur if we can make it equal to 0 - which happens when \(a=\bar{x}\).

Now clearly I need a few more details to make it more rigorous but I like that version a little bit more because it also gives hints at what we do in ANOVA when decomposing the sums of squares.
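
For anyone who wants to check the decomposition \(\sum (x_i - a)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - a)^2\) numerically, here is a minimal sketch. The data and the helper names are made up for illustration.

```python
from statistics import fmean

def sum_sq(xs, a):
    """Sum of squared deviations of xs from the point a."""
    return sum((x - a) ** 2 for x in xs)

def decomposition_gap(xs, a):
    """Left side minus right side of the decomposition; should be ~0 for any a."""
    n = len(xs)
    xbar = fmean(xs)
    return sum_sq(xs, a) - (sum_sq(xs, xbar) + n * (xbar - a) ** 2)

xs = [2.0, 3.0, 5.0, 10.0]   # hypothetical data
xbar = fmean(xs)
gap = decomposition_gap(xs, 1.5)
# Moving away from the mean in either direction can only increase the sum.
never_better = all(sum_sq(xs, xbar) <= sum_sq(xs, xbar + d)
                   for d in (-1.0, -0.1, 0.1, 1.0))
```

The term `n * (xbar - a) ** 2` is the only part that depends on `a`, which is the whole argument in one line.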


Smelly poop man with doo doo pants.
a while ago (before Englund became an MVC) I posted a proof about another result in factor analysis. I thought it would be nice to resurrect it (briefly) and add it here to our small (but growing) compendium of proofs. the original thread is here

and the proof goes like this:

Let \(\bf{S}\) be a covariance matrix with eigenvalue-eigenvector pairs (\(\lambda_1, \mathbf{e}_1\)), (\(\lambda_2, \mathbf{e}_2\)), ..., (\(\lambda_p, \mathbf{e}_p\)), where
\(\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p\). Let \(m<p\) and define:

\(\bf{L} = \{l_{ij}\} = \left[\sqrt{\lambda_1 }\mathbf{e}_1\ |\ \sqrt{\lambda_2} \mathbf{e}_2\ |\ ...\ |\ \sqrt{\lambda_m} \mathbf{e}_m \right] \)


\(\mathbf\Psi = \begin{bmatrix}
\psi_1 & 0 & ... & 0 \\
0 & \psi_2 & ... & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & ... & \psi_p \\
\end{bmatrix}
\text{ with } \psi_i = s_{ii} - \sum_{j=1}^{m} l_{ij}^2\)

Then, PROVE:

\(\text{Sum of squared entries of } (\mathbf{S} - (\mathbf{LL'} + \mathbf{\Psi})) \le \lambda_{m+1}^2 + \cdots + \lambda_p^2\)

Spunky's attempt of a proof:

By definition of \(\psi_i\), we know that the diagonal of \((\mathbf{S} - (\mathbf{LL'} + \mathbf{\Psi}))\) is all zeroes. Since
\((\mathbf{S} - (\mathbf{LL'} + \mathbf{\Psi}))\) and \((\mathbf{S} - \mathbf{LL'})\) have the same elements except on the diagonal, we know that

\(\text{Sum of squared entries of } (\mathbf{S} - (\mathbf{LL'} + \mathbf{\Psi})) \leq \text{ Sum of squared entries of } (\mathbf{S} - \mathbf{LL'}) \)

Since \(\mathbf{S} = \lambda_1 \mathbf{e}_1 \mathbf{e}'_1 + \cdots + \lambda_p \mathbf{e}_p \mathbf{e}'_p \)
and \(\mathbf{LL'} = \lambda_1 \mathbf{e}_1 \mathbf{e}'_1 + \cdots + \lambda_m \mathbf{e}_m \mathbf{e}'_m \), then it follows that
\(\mathbf{S} - \mathbf{LL'} = \lambda_{m+1} \mathbf{e}_{m+1} \mathbf{e}'_{m+1} + \cdots + \lambda_p \mathbf{e}_p \mathbf{e}'_p\)

Writing it in matrix form, this is saying \(\mathbf{S} - \mathbf{LL'} = \mathbf{P}_2 \mathbf{\Lambda}_2 \mathbf{P}'_2\) where
\(\mathbf{P}_2 = [ \mathbf{e}_{m+1} | \cdots | \mathbf{e}_p ]\) and \(\mathbf{\Lambda}_2 = Diag(\lambda_{m+1}, \cdots, \lambda_{p})\)

Then, the following is true:

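\(\text{Sum of squared entries of }(\mathbf{S}- \mathbf{LL'})= \text{tr}((\mathbf{S} - \mathbf{LL'}) (\mathbf{S} - \mathbf{LL'})')=\)

\(\text{tr} (( \mathbf{P}_2 \mathbf{\Lambda}_2 \mathbf{P}'_2)( \mathbf{P}_2 \mathbf{\Lambda}_2 \mathbf{P}'_2)')=\text{tr}( \mathbf{P}_2 \mathbf{\Lambda}_2 \mathbf{P}'_2 \mathbf{P}_2 \mathbf{\Lambda}_2 \mathbf{P}'_2)=\text{tr}( \mathbf{P}_2 \mathbf{\Lambda}_2\mathbf{\Lambda}_2 \mathbf{P}'_2)=\)

\(\text{tr}(\mathbf{\Lambda}_2\mathbf{\Lambda}_2 \mathbf{P}'_2 \mathbf{P}_2)=\text{tr}(\mathbf{\Lambda}_2\mathbf{\Lambda}_2)=\lambda_{m+1}^2 + \cdots + \lambda_p^2.\)

All the \(\mathbf{P}_2\) disappear because the eigenvectors are orthonormal, so by the definition of \(\mathbf{P}_2\) we know that \(\mathbf{P}_2 '\mathbf{P}_2=\mathbf{I}\), and because the trace is invariant under cyclic permutations.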
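
The bound can be checked by hand in the smallest non-trivial case. The sketch below assumes \(p=2\) variables and \(m=1\) factor, uses the closed-form eigen-decomposition of a symmetric 2x2 matrix, and the helper names are invented for illustration. It is only a sanity check of the statement, not a general implementation.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs of [[a, b], [b, d]] with b != 0, largest eigenvalue first."""
    mid, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = (b, lam - a)                      # solves (S - lam I) v = 0 when b != 0
        norm = math.hypot(*v)
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def residual_sums(a, b, d):
    """Return (SS of S - LL' - Psi, SS of S - LL', lambda_2^2) for m = 1, p = 2."""
    (lam1, e1), (lam2, _) = eig_sym_2x2(a, b, d)
    S = [[a, b], [b, d]]
    L = [math.sqrt(lam1) * e1[0], math.sqrt(lam1) * e1[1]]
    R = [[S[i][j] - L[i] * L[j] for j in range(2)] for i in range(2)]   # S - LL'
    ss_without_psi = sum(R[i][j] ** 2 for i in range(2) for j in range(2))
    # Subtracting Psi zeroes the diagonal, so only off-diagonal entries remain.
    ss_with_psi = sum(R[i][j] ** 2 for i in range(2) for j in range(2) if i != j)
    return ss_with_psi, ss_without_psi, lam2 ** 2

ss_with_psi, ss_without_psi, bound = residual_sums(2.0, 1.0, 2.0)
```

For this matrix the eigenvalues are 3 and 1, the residual without \(\mathbf{\Psi}\) attains the bound exactly, and removing the diagonal makes the sum strictly smaller.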


TS Contributor
"a while ago (before Englund became an MVC)"
Time wasn't even defined before I became MVC, so that's per definition impossible ;)
"I posted a proof about another result in factor analysis. I thought it would be nice to resurrect it (briefly) and add it here to our small (but growing) compendium of proofs.

and the proof goes like this:"
Very nice. If you keep posting stuff on FA I'll be forced to get more familiar with it, which is good :)


Smelly poop man with doo doo pants.
"If you keep posting stuff on FA I'll be forced to get more familiar with it, which is good :)"
i don't quite understand why but pretty much NO ONE in the Statistics world even touches on Factor Analysis. when it comes to dimension reduction techniques almost all of the undergrad stats textbooks i've seen that deal with intro to multivariate analysis stop at principal components. there may be like some small subsection in some nameless appendix that says something about Factor Analysis... but that's it!

Nice thread so I make my debut here: the derivation of the ridge estimator in the linear regression model. Consider

\( \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{u}, \quad \mathbf{u} \sim N_n(\mathbf{0},\sigma_u^2\mathbf{I}_n) \)

with strong correlation patterns among the columns of the data matrix \( \mathbf{X} \in Mat_{n,k}(\mathbb{R}) \). The problem with multicollinearity is that single components of the parameter vector \( \boldsymbol{\beta} \in \mathbb{R}^k \) can take absurdly large values. So the general idea is to restrict the length of said vector to a prespecified positive real number. Let this restriction be denoted by \( \left\| \boldsymbol{\beta} \right\|_2^2=c \), where \( \left\|\cdot \right\|_2 \) is just the Euclidean norm on \( \mathbb{R}^k\).

Eventually one faces the restricted least squares problem

\( Q_n(\boldsymbol{\beta},\lambda) := \left\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right\|_2^2 + \lambda (\left\|\boldsymbol{\beta}\right\|_2^2-c) \rightarrow \min_{\boldsymbol{\theta} \in \mathbf{\Theta}} \)

where \( \boldsymbol{\theta} = (\boldsymbol{\beta}', \lambda)' \), the Lagrange parameter \( \lambda \) is assumed to be positive and \(\mathbf{\Theta} \subseteq \mathbb{R}^k \times \mathbb{R}_{>0}\) is the associated parameter space. The optimization problem is equivalent to

\( Q_n(\boldsymbol{\beta},\lambda):= (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) + \lambda (\boldsymbol{\beta}'\boldsymbol{\beta}-c) \rightarrow \min_{\boldsymbol{\theta} \in \mathbf{\Theta}} \)

Taking the derivative with respect to \( \boldsymbol{\beta} \) yields

\( \displaystyle \frac{\partial}{\partial \boldsymbol{\beta}} Q_n(\boldsymbol{\beta},\lambda) = -2\mathbf{X}'(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})+ 2\lambda \hat{\boldsymbol{\beta}} \)

This leads to the first order condition (note that the hats can already be set here, because the first order condition implicitly defines the potential minimizers of the problem above)

\( -\mathbf{X}'\mathbf{y} + \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} + \lambda \hat{\boldsymbol{\beta}} = \mathbf{0} \)

Arranging terms leads to the modified normal equations

\( (\mathbf{X}'\mathbf{X}+ \lambda \mathbf{I}_k)\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} \)

Since \( \mathbf{X}'\mathbf{X} \) is at least positive semidefinite and \( \lambda \mathbf{I}_k \) is positive definite one obtains*

\( \det(\mathbf{X}'\mathbf{X}+ \lambda \mathbf{I}_k) \geq \det(\mathbf{X}'\mathbf{X})+\det(\lambda\mathbf{I}_k) = \det(\mathbf{X}'\mathbf{X}) + \lambda^k >0 \)

so that \( (\mathbf{X}'\mathbf{X}+ \lambda \mathbf{I}_k) \) is an invertible matrix even if the data matrix is of less than full column rank. This finally yields the ridge estimator in its known form

\( \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X}+ \lambda \mathbf{I}_k)^{-1}\mathbf{X}'\mathbf{y} \)

Also, for fixed \( \lambda \) this is the unique global minimizer of \( Q_n \) in \( \boldsymbol{\beta} \): the problem under consideration is just a sum of convex functions and \( \hat{\boldsymbol{\beta}} \) is the only stationary point, so one doesn't need to check the second order condition and the associated Hessians.

*One can find a good proof for that inequality in Magnus, J.R. & Neudecker, H. (1999). Matrix Differential Calculus. Wiley and Sons on page 227 theorem 28.
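
To see the invertibility argument at work, the sketch below computes the ridge estimator for a deliberately rank-deficient design with \(k=2\) perfectly collinear columns, where \( \mathbf{X}'\mathbf{X} \) itself is singular. It is a toy example under stated assumptions: the data, the value \(\lambda = 1\) and the function names are made up, and the 2x2 inverse is written out by hand instead of calling a linear algebra library.

```python
def gram(X):
    """X'X for a data matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)]
            for i in range(k)]

def ridge_2(X, y, lam):
    """Ridge estimate (X'X + lam I)^{-1} X'y for k = 2 regressors."""
    G = gram(X)
    a, b, d = G[0][0] + lam, G[0][1], G[1][1] + lam
    det = a * d - b * b                       # positive whenever lam > 0
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(2)]
    # Closed-form inverse of the 2x2 matrix [[a, b], [b, d]] applied to X'y.
    return [(d * xty[0] - b * xty[1]) / det,
            (a * xty[1] - b * xty[0]) / det]

X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]      # second column = 2 * first column
y = [1.0, 2.0, 3.0]
G = gram(X)
det_plain = G[0][0] * G[1][1] - G[0][1] ** 2  # determinant of X'X
beta = ridge_2(X, y, lam=1.0)
```

Here `det_plain` is zero, so ordinary least squares has no unique solution, while the ridge system still has exactly one.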


Super Moderator
The Pearson product-moment coefficient of correlation can be interpreted as the cosine of the angle between variable vectors in \(n\) dimensional space. Here, I will show the relationship between the Pearson and Spearman (rank-based) correlation coefficients for the bivariate normal distribution through the following series:

\(\sum_{n=1}^{\infty }\frac{\cos nx}{n} \).

If we let \( z=\cos x+i\sin x\), then

\(\sum_{n=1}^{m}y^{n-1}z^{n}=\frac{z\left \{ 1-\left ( yz \right )^{m} \right \}}{1-yz} \)

where it follows for \( \left | y \right |<1 \),

\( \sum_{n=1}^{\infty }y^{n-1}\left ( \cos nx+i\sin nx \right )=\frac{\cos x+i\sin x}{1-y\cos x-yi\sin x} \)

\( =\frac{\left ( \cos x-y \right )+i\sin x}{1-2y\cos x+y^{2}} \), so that, taking real parts,

\( \sum_{n=1}^{\infty }y^{n-1}\cos nx=\frac{\cos x-y}{1-2y\cos x+y^{2}} \).

This series is uniformly convergent for all values of \(x\) and for \( \left | y \right |\leq p<1 \). Hence, integrating term by term with respect to \(y \), where \( 0<y<1 \), gives

\( \sum_{n=1}^{\infty }y^{n}\frac{\cos nx}{n} \)

\( =\int_{0}^{y}\frac{\cos x-t}{1-2t\cos x+t^{2}}dt \)

\( =-\frac{1}{2}\ln \left ( 1-2y\cos x+y^{2} \right ) \).

Suppose that \( x \) is neither zero nor a multiple of \( 2\pi \).

Then the series \(\sum_{n=1}^{\infty }\frac{\cos nx}{n} \) is convergent, and, for \( 0\leq y\leq 1 \), \( y^{n} \) is non-negative, monotonically decreasing in \( n \) and bounded. By Abel's test, the series

\( \sum_{n=1}^{\infty }y^{n}\frac{\cos nx}{n} \)

is therefore uniformly convergent on the interval \( 0\leq y\leq 1 \).

Subsequently letting \( y\rightarrow 1 \), it follows that if \( x \) is neither \( 0 \) nor a multiple of \( 2\pi \) we have

\( \sum_{n=1}^{\infty }\frac{\cos nx}{n} =-\frac{1}{2}\ln \left ( 2-2\cos x \right ) \)

\( =-\ln \left | 2\sin \frac{1}{2} x\right | \).

Setting \( x=\frac{\pi }{3}r_{s} \) and exponentiating the negative of this sum gives the relationship (for large sample sizes) between the Pearson and Spearman correlation coefficients as:

\( r_{p}=2\sin\left ( \frac{\pi }{6}r _{s}\right ) \)

for the bivariate normal distribution.
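
The closed form of the damped series and the resulting conversion formula are easy to check numerically. The sketch below is only an illustration: the function names are made up, the series is truncated at a fixed number of terms, and the check uses \(y<1\) because the undamped series converges too slowly for a naive partial sum to be informative.

```python
import math

def damped_series(x, y, terms=2000):
    """Partial sum of sum_{n>=1} y^n cos(n x) / n, intended for 0 <= y < 1."""
    return sum(y ** n * math.cos(n * x) / n for n in range(1, terms + 1))

def closed_form(x, y):
    """-(1/2) ln(1 - 2 y cos x + y^2), the claimed value of the damped series."""
    return -0.5 * math.log(1.0 - 2.0 * y * math.cos(x) + y * y)

def pearson_from_spearman(r_s):
    """Large-sample bivariate-normal relation r_p = 2 sin(pi r_s / 6)."""
    return 2.0 * math.sin(math.pi * r_s / 6.0)

series_gap = abs(damped_series(1.0, 0.9) - closed_form(1.0, 0.9))
# At y = 1 the closed form reduces to -ln(2 sin(x / 2)) for 0 < x < 2 pi.
limit_gap = abs(closed_form(1.0, 1.0) + math.log(2.0 * math.sin(0.5)))
```

Note that `pearson_from_spearman` maps -1, 0 and 1 to themselves, as any sensible conversion between the two coefficients should.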


Ambassador to the humans
It's been over three years since we've had a post. If somebody doesn't post a proof in the next two days I'm going to unsticky this.