$$ \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

While learning about estimation in the space of square-integrable random variables, I noticed that the estimators looked similar to the predictions made by Gaussian Process regression, which motivated me to work out the connection, if there was one. The upshot is that predictions in a Gaussian Process can be viewed as a particular case of linear least-squares estimation in the Hilbert space of square-integrable random variables. I will gloss over some technical details in an attempt to keep this post concise.

Let's begin with Gaussian Processes (GPs). A GP is an infinite collection of random variables $\left\{Y(x): x \in \chi \right\}$ such that any finite collection of the variables is jointly normally distributed. $\chi$ is the index set of the process, and each random variable in the GP is uniquely associated with an element of the index set. We will assume the mean of every random variable in this GP is zero. Then, the behaviour of the process is fully specified by its covariance function: for any two points $x_1, x_2 \in \chi$, the covariance between the corresponding random variables is $C(x_1,x_2) = \text{cov}\left(Y(x_1),Y(x_2)\right)$.
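To make the finite-collection picture concrete, here is a small sketch in code. The squared-exponential covariance function and the three index points are my own choices for illustration; the post itself does not fix a particular kernel.

```python
import numpy as np

def sq_exp_cov(x1, x2, lengthscale=1.0):
    """Squared-exponential covariance: C(x1, x2) = exp(-||x1 - x2||^2 / (2 l^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

# Any finite collection (Y(x_1), ..., Y(x_n)) is jointly Gaussian with
# mean zero and covariance K[i, j] = C(x_i, x_j).
xs = np.array([[0.0], [0.5], [1.0]])   # three index points in R^1
K = np.array([[sq_exp_cov(a, b) for b in xs] for a in xs])

rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mean=np.zeros(len(xs)), cov=K)
print(sample)  # one draw of (Y(x_1), Y(x_2), Y(x_3))
```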

In the regression setting, we assume the data is drawn from a GP with a known covariance function. Suppose we are given a dataset $\{(x_1,f(x_1)), \ldots, (x_n, f(x_n))\}$, where $x_i \in \mathbb{R}^d$ and $f(x_i) \in \mathbb{R}$ for every $i$. The goal is to predict the function value $f(x_*)$ at a new point $x_*$. The GP assumption means the index set is $\mathbb{R}^d$, there is a Gaussian random variable $Y(x_i)$ at each $x_i$, and the variables are correlated through the covariance function. The given data are observed values of these Gaussian random variables: $Y(x_i) = f(x_i)$ for $i = 1, \ldots, n$.

Now, there is a Gaussian random variable $Y_* = Y(x_*)$ at $x_*$ as well. Writing $Y_i = Y(x_i)$ for brevity, we are interested in the distribution of $Y_*|Y_1, \ldots, Y_n$, i.e. the conditional distribution of the random variable at the new point $x_*$ given the data. By the conditioning property of multivariate Gaussian distributions, this conditional distribution is also Gaussian! Consider the joint Gaussian distribution of $\left(Y_1, \ldots, Y_n, Y_*\right)$ with mean $\mathbf{0} \in \mathbb{R}^{n+1}$ and covariance $K \in \mathbb{R}^{(n+1) \times (n+1)}$, whose entries are given by the covariance function. Let $K$ be partitioned into $K_+ \in \mathbb{R}^{n \times n}$, $\mathbf{k} \in \mathbb{R}^n$, and $k_* \in \mathbb{R}$: $$ K = \begin{pmatrix} K_+ & \mathbf{k}\\ \mathbf{k}^T & k_* \end{pmatrix}, $$ where $(K_+)_{ij} = C(x_i, x_j)$, $k_i = C(x_i, x_*)$, and $k_* = C(x_*, x_*)$. Given that we observe $Y_1 = f(x_1), \ldots, Y_n = f(x_n)$ and letting $\mathbf{y} = \left[f(x_1), \ldots, f(x_n)\right]^T \in \mathbb{R}^n$, the conditional distribution is $$ \begin{align} \label{eq:gp_pred} Y_*|Y_1, \ldots, Y_n \sim \mathcal{N}\left(\mathbf{k}^T K_{+}^{-1} \mathbf{y},\ k_* - \mathbf{k}^T K_+^{-1} \mathbf{k}\right). \end{align} $$ If you're interested in deriving this distribution from the joint density, take a look here. Our prediction of the value $f(x_*)$ is the mean of this conditional distribution, i.e. $\hat{f}(x_*) = \mathbf{k}^T K_+^{-1} \mathbf{y}$.
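Here is a minimal sketch of the prediction equation above in code, again assuming a squared-exponential covariance function and a toy dataset (the observations are $\sin(x_i)$, a stand-in for $f$; none of these choices come from the derivation itself):

```python
import numpy as np

def sq_exp_cov(x1, x2, lengthscale=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

def gp_predict(X_train, y_train, x_star):
    """Conditional mean and variance of Y_* given Y_i = y_i for a zero-mean GP."""
    K_plus = np.array([[sq_exp_cov(a, b) for b in X_train] for a in X_train])
    k = np.array([sq_exp_cov(a, x_star) for a in X_train])
    k_star = sq_exp_cov(x_star, x_star)
    mean = k @ np.linalg.solve(K_plus, y_train)    # k^T K_+^{-1} y
    var = k_star - k @ np.linalg.solve(K_plus, k)  # k_* - k^T K_+^{-1} k
    return mean, var

X_train = np.array([[0.0], [0.5], [1.0]])
y_train = np.sin(X_train).ravel()   # stand-in observations f(x_i)
mean, var = gp_predict(X_train, y_train, np.array([0.75]))
print(mean, var)
```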

Now we will establish the relationship to linear least-squares estimation in Hilbert spaces. First, we need to state the projection theorem.

Theorem: Let $\mathcal{H}$ be a Hilbert space and $\mathcal{M}$ be a closed linear subspace of $\mathcal{H}$. Then, each $x \in \mathcal{H}$ has the unique decomposition $$ x = y + z, $$ where $y \in \mathcal{M}$ and $\langle z,v \rangle = 0$, $\forall v \in \mathcal{M}$. Moreover, $$ \lVert x-y \rVert = \underset{v \in \mathcal{M}}{\text{min}}\ \lVert x - v \rVert. $$ Note: $y$ is known as the projection of $x$ onto $\mathcal{M}$.
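A quick finite-dimensional example (my own, purely to fix intuition): take $\mathcal{H} = \mathbb{R}^2$ with the usual dot product, $\mathcal{M} = \text{span}\{(1,0)\}$, and $x = (1,2)$. Then $$ x = \underbrace{(1,0)}_{y \,\in\, \mathcal{M}} + \underbrace{(0,2)}_{z}, \qquad \langle z, (a,0) \rangle = 0 \ \ \forall a \in \mathbb{R}, \qquad \lVert x - y \rVert = 2 \leq \sqrt{(1-a)^2 + 4} = \lVert x - (a,0) \rVert. $$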

Consider a probability space $\left(\Omega, \mathcal{A}, \mathbb{P}\right)$. The Riesz-Fischer theorem ensures the space of square-integrable random variables, $L_2(\Omega, \mathcal{A}, \mathbb{P}) = \left\{X: X\ \text{is a random variable on}\ (\Omega, \mathcal{A}, \mathbb{P})\ \text{and}\ \mathbb{E}(X^2) < \infty\right\}$, is a Hilbert space (this set should contain equivalence classes of random variables but I will skip that detail).

We define an inner product on this space: $\langle X,Y \rangle = \mathbb{E}[XY]$ for any $X, Y \in L_2(\Omega, \mathcal{A}, \mathbb{P})$. I will leave it to the reader to verify that $\mathbb{E}[XY]$ satisfies the properties of an inner product. The associated norm is $\lVert X \rVert = \sqrt{\langle X,X \rangle} = \sqrt{\mathbb{E}[X^2]}$, and for zero-mean random variables the inner product is simply the covariance, $\langle X,Y \rangle = \text{cov}(X,Y)$. Now, we can define the linear least-squares estimator.

Definition: Let $X, Y_1, \ldots, Y_n \in L_2(\Omega, \mathcal{A}, \mathbb{P})$ with $\mathbb{E}[X] = \mathbb{E}[Y_1] = \ldots = \mathbb{E}[Y_n] = 0$. The linear least-squares estimator $\hat{X}$ of $X$ given $Y_1, \ldots, Y_n$ is the projection of $X$ onto $L(Y_1, \ldots, Y_n) = \text{span}(Y_1, \ldots, Y_n)$.
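As a quick sanity check (not part of the original definition), take $n = 1$. Projecting $X$ onto $\text{span}(Y_1)$ gives $$ \hat{X} = \frac{\langle X, Y_1 \rangle}{\langle Y_1, Y_1 \rangle}\, Y_1 = \frac{\mathbb{E}[XY_1]}{\mathbb{E}[Y_1^2]}\, Y_1 = \frac{\text{cov}(X, Y_1)}{\text{Var}(Y_1)}\, Y_1, $$ which is simple linear regression of $X$ on $Y_1$ through the origin.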

Since $L(Y_1, \ldots, Y_n)$ is a closed linear subspace of $L_2(\Omega, \mathcal{A}, \mathbb{P})$, the projection theorem guarantees that $$ \mathbb{E}\left[(X - \hat{X})^2 \right] = \underset{V \in L(Y_1, \ldots, Y_n)}{\text{min}}\ \mathbb{E}\left[(X - V)^2 \right]. $$

This characterizes the mean-squared error of estimating $X$ using $Y_1, \ldots, Y_n$: the projection theorem guarantees that the linear least-squares estimator attains the smallest mean-squared error among all linear combinations of $Y_1, \ldots, Y_n$, i.e. it is optimal in the class of linear predictors.
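To see this optimality numerically, here is a small Monte Carlo sketch. The covariance matrix of $(X, Y_1, Y_2)$ and the competing linear predictor are arbitrary choices of mine, not anything from the argument above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary zero-mean joint distribution for (X, Y_1, Y_2); illustrative only.
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.0, 0.3],
                  [0.5, 0.3, 1.5]])
samples = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
X, Y = samples[:, 0], samples[:, 1:]

# LLSE weights computed from the population covariances.
alpha = np.linalg.solve(Sigma[1:, 1:], Sigma[0, 1:])

mse_llse = np.mean((X - Y @ alpha) ** 2)
mse_other = np.mean((X - Y @ (alpha + np.array([0.3, -0.2]))) ** 2)  # some other linear predictor
print(mse_llse, mse_other)   # mse_llse should be the smaller of the two
```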

Proposition: Let $X, Y_1, \ldots, Y_n \in L_2(\Omega, \mathcal{A}, \mathbb{P})$ with $\mathbb{E}[X] = \mathbb{E}[Y_1] = \ldots = \mathbb{E}[Y_n] = 0$. The linear least-squares estimator $\hat{X}$ of $X$ given $Y_1, \ldots, Y_n$ is given by $$\begin{align} \label{eq:llse} \hat{X} = \mathbb{E}\left[X\mathbf{y}\right]^T \mathbb{E}\left[\mathbf{y}\mathbf{y}^T\right]^{-1} \mathbf{y}, \end{align} $$ where $$ \mathbf{y} = \left[\begin{array}{c} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{array}\right] \quad \text{and} \quad \mathbb{E}(X \mathbf{y}) =\left[\begin{array}{c} \mathbb{E}\left(X Y_{1}\right) \\ \mathbb{E}\left(X Y_{2}\right) \\ \vdots \\ \mathbb{E}\left(X Y_{n}\right) \end{array}\right]. $$

Proof: Since $\hat{X} \in L(Y_1, \ldots, Y_n)$, $$ \begin{align} \label{eq:lincomb} \hat{X} = \alpha_1 Y_1 + \ldots + \alpha_n Y_n, \end{align} $$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$. By the projection theorem, $X - \hat{X}$ is orthogonal to every element of $L(Y_1, \ldots, Y_n)$; in particular, for $j = 1, \ldots, n$, $$ \mathbb{E}[(X-\hat{X})Y_j] = 0 \implies \mathbb{E}[XY_j] = \mathbb{E}[\hat{X}Y_j] = \sum_{i=1}^n \alpha_i \mathbb{E}[Y_iY_j], $$ where the last equality uses Equation (\ref{eq:lincomb}). In matrix form, $$ \left[\begin{array}{c} \mathbb{E}\left(X Y_{1}\right) \\ \mathbb{E}\left(X Y_{2}\right) \\ \vdots \\ \mathbb{E}\left(X Y_{n}\right) \end{array}\right] = \left[\begin{array}{ccc} \mathbb{E}\left(Y_{1} Y_1\right) & \cdots & \mathbb{E}\left(Y_{1} Y_n\right) \\ \vdots & \ddots & \vdots \\ \mathbb{E}\left(Y_{n} Y_1\right) & \cdots & \mathbb{E}\left(Y_{n} Y_n\right) \end{array}\right] \left[\begin{array}{c} \alpha_1\\ \vdots \\ \alpha_n \end{array}\right], $$ i.e. $\mathbb{E}[X\mathbf{y}] = \mathbb{E}[\mathbf{y}\mathbf{y}^T]\, \boldsymbol{\alpha}$. Solving for $\boldsymbol{\alpha} = \mathbb{E}[\mathbf{y}\mathbf{y}^T]^{-1}\mathbb{E}[X\mathbf{y}]$ and plugging it into Equation (\ref{eq:lincomb}), $\hat{X} = \boldsymbol{\alpha}^T \mathbf{y}$, gives us the required result.
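The following sketch solves the normal equations from the proof for a hand-picked covariance matrix (the same arbitrary one as in the earlier Monte Carlo sketch) and checks the orthogonality condition:

```python
import numpy as np

# Hand-picked covariance of the zero-mean vector (X, Y_1, Y_2); illustrative only.
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.0, 0.3],
                  [0.5, 0.3, 1.5]])
cov_Xy = Sigma[0, 1:]      # E[X y]
cov_yy = Sigma[1:, 1:]     # E[y y^T]

# Normal equations from the proof: E[y y^T] alpha = E[X y].
alpha = np.linalg.solve(cov_yy, cov_Xy)

# Orthogonality check: E[(X - X_hat) Y_j] = E[X Y_j] - sum_i alpha_i E[Y_i Y_j] = 0.
residual_cov = cov_Xy - cov_yy @ alpha
print(alpha)         # LLSE weights, so X_hat = alpha[0] * Y_1 + alpha[1] * Y_2
print(residual_cov)  # ~ [0, 0]
```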

You may have noticed that Equations (\ref{eq:gp_pred}) and (\ref{eq:llse}) have the same form, and this is no coincidence. Take $X = Y_*$ and $\mathbf{y} = [Y_1, \ldots, Y_n]^T$: since the variables are zero-mean, $\mathbb{E}[X\mathbf{y}] = \mathbf{k}$ and $\mathbb{E}[\mathbf{y}\mathbf{y}^T] = K_+$, so the GP prediction $\mathbf{k}^T K_+^{-1} \mathbf{y}$ is exactly the LLSE. In other words, the GP prediction is the special case of the LLSE in which the random variables $X, Y_1, \ldots, Y_n$ are jointly normally distributed. This also shows the predictions from a GP are optimal in the class of linear predictors. It turns out that GP predictions are minimum mean-squared error predictors among all predictors, not just linear ones. Maybe I will cover that in another blog post.
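For completeness, a short numerical check of this identification, using the same assumed squared-exponential kernel and toy data as the earlier GP sketch: the GP posterior mean and the LLSE computed with $\mathbb{E}[X\mathbf{y}] = \mathbf{k}$ and $\mathbb{E}[\mathbf{y}\mathbf{y}^T] = K_+$ agree.

```python
import numpy as np

def sq_exp_cov(x1, x2, lengthscale=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

X_train = np.array([[0.0], [0.5], [1.0]])
y_obs = np.sin(X_train).ravel()        # observed f(x_i), a stand-in target
x_star = np.array([0.75])

K_plus = np.array([[sq_exp_cov(a, b) for b in X_train] for a in X_train])
k = np.array([sq_exp_cov(a, x_star) for a in X_train])

# GP posterior mean: k^T K_+^{-1} y.
gp_mean = k @ np.linalg.solve(K_plus, y_obs)

# LLSE: E[X y]^T E[y y^T]^{-1} y, with E[X y] = k and E[y y^T] = K_+.
llse = np.linalg.solve(K_plus, k) @ y_obs

print(gp_mean, llse)   # identical up to floating-point error
```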