Gaussian Processes (GP)

A probabilistic model that delivers both predictions and calibrated uncertainty — perfect for guiding experiment optimisation.

1. What & Why of Gaussian Processes

A Gaussian Process (GP) is a powerful probabilistic model for regression and classification. Unlike most models that learn the parameters of a single function, a GP learns a distribution over functions. This Bayesian approach makes it exceptionally useful for scientific applications where quantifying uncertainty is as important as the prediction itself.

1.1 The Function-Space View: A Distribution Over Functions

The most intuitive way to understand a GP is to think of it as defining a "prior" over functions. Before seeing any data, the GP assumes that any function drawn from it is possible. When we provide observed data, we constrain this distribution. Functions that don't pass through our data points are ruled out, and we are left with a "posterior" distribution over functions that are consistent with our observations.

[Prior vs. Posterior Diagram: A two-panel plot. Left panel ("Prior") shows many random, wavy functions sampled from the GP. Right panel ("Posterior") shows a few data points, with the sampled functions now all passing through these points, and a shaded 95% confidence interval that is tightest around the observed data.]

This posterior gives us a mean prediction (the average of all plausible functions) and a variance (how much the plausible functions disagree). This variance is our measure of uncertainty: it's low where we have data and high where we don't.
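This conditioning can be written out directly with the standard Gaussian formulas. Below is a minimal NumPy sketch — the synthetic observations, the RBF kernel choice, and the jitter term are all illustrative assumptions — that draws functions from the prior and computes the posterior mean and standard deviation:

```python
# Minimal sketch: sample from a GP prior and compute the GP posterior in NumPy.
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)
x_star = np.linspace(-5, 5, 100)                      # test inputs

# Prior: zero mean, covariance K(X*, X*); jitter keeps the matrix PSD
K_ss = rbf_kernel(x_star, x_star)
prior_samples = rng.multivariate_normal(
    np.zeros(100), K_ss + 1e-8 * np.eye(100), size=5)

# Posterior: condition on three noisy observations (X, y)
x_obs = np.array([-4.0, -1.0, 2.0])
y_obs = np.sin(x_obs)
noise = 1e-4
K = rbf_kernel(x_obs, x_obs) + noise * np.eye(3)
K_s = rbf_kernel(x_obs, x_star)
K_inv = np.linalg.inv(K)
post_mean = K_s.T @ K_inv @ y_obs                     # average of plausible functions
post_cov = K_ss - K_s.T @ K_inv @ K_s                 # how much they disagree
post_std = np.sqrt(np.clip(np.diag(post_cov), 0, None))
```

Inspecting `post_std` shows exactly the behaviour described above: it is near zero at the observed inputs and approaches the prior standard deviation far from the data.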

1.2 The Weight-Space View: Bayesian Linear Regression on Steroids

A GP can also be seen as the infinite-dimensional limit of Bayesian linear regression. A linear regression model is \(y = \mathbf{w}^T\mathbf{x}\). A Bayesian approach puts a prior distribution on the weights \(\mathbf{w}\). If we use an infinite number of basis functions to project our inputs \(\mathbf{x}\) into a higher-dimensional space, this Bayesian linear regression model converges to a Gaussian Process. This connection is what allows the "kernel trick" to work, letting us operate in this infinite-dimensional space without ever defining it.

[GP vs. Linear Regression Diagram: A plot showing a non-linear dataset. A straight line from linear regression is shown cutting through the data, clearly missing the trend. A GP's mean prediction curve is shown perfectly capturing the non-linear trend, with its uncertainty bands.]

1.3 Key Components and Assumptions

A GP is fully specified by a mean function and a covariance (kernel) function: \(f(x) \sim \mathcal{GP}(m(x), k(x,x'))\). In practice, the mean function \(m(x)\) is often set to zero (after standardizing the targets), so all modeling assumptions — smoothness, amplitude, periodicity — are carried by the kernel \(k(x,x')\). The defining assumption is that any finite collection of function values is jointly Gaussian.

1.4 The Scalability Challenge

The primary drawback of a standard GP is its computational complexity. To make a prediction, a GP must invert an \(n \times n\) covariance matrix, where \(n\) is the number of data points. This inversion, typically done via Cholesky decomposition, has a computational cost of \(\mathcal{O}(n^3)\) and a memory cost of \(\mathcal{O}(n^2)\). This makes standard GPs impractical for datasets with more than a few thousand data points (\(n \approx 2000-5000\)). For larger datasets, approximation methods are necessary.

2. Kernels — Encoding Prior Knowledge

The kernel (or covariance function) is the heart of the GP. It encodes our prior beliefs about the function we are modeling by defining the covariance, or similarity, between function values at any two points \(x\) and \(x'\). The choice of kernel is the most important modeling decision when using a GP.

2.1 A Gallery of Common Kernels

Different kernels correspond to different assumptions about the underlying function.

[Three-panel plot of sample functions from different kernels. Left ("RBF"): very smooth functions. Middle ("Matérn"): less smooth, more jagged functions. Right ("Periodic"): functions that repeat in a wave-like pattern.]
| Kernel | Equation | Characteristics & Use Case |
|---|---|---|
| Squared-Exponential (RBF) | \(k(x, x')=\sigma^2\exp\!\left(-\tfrac{(x-x')^2}{2\ell^2}\right)\) | A universal default. Assumes the function is very smooth (infinitely differentiable). |
| Matérn (\(\nu=3/2,\ 5/2\)) | Complex form, based on modified Bessel functions. | Excellent general-purpose kernel for physical processes that are not infinitely smooth. \(\nu\) controls the differentiability. |
| Periodic | \(k(x, x')=\sigma^2\exp\!\left(-\tfrac{2\sin^2(\pi\,\lVert x-x'\rVert/p)}{\ell^2}\right)\) | Captures repeating patterns with period \(p\). Useful for seasonal or cyclic data. |
| Rational Quadratic | \(k(x, x')=\sigma^2\left(1+\tfrac{(x-x')^2}{2\alpha\ell^2}\right)^{-\alpha}\) | Can be seen as a mixture of RBF kernels with different length-scales, allowing it to capture variations at multiple scales. |
[Kernel Matrix Heatmap: A side-by-side comparison of two covariance matrices visualized as heatmaps. The RBF kernel matrix shows a smooth decay from the diagonal. The Periodic kernel matrix shows a repeating, blocky pattern.]

2.2 Combining Kernels for Complex Models

The real power of kernels comes from combining them. Simple kernels can be composed using addition and multiplication to model more complex phenomena: adding an RBF and a Periodic kernel models a smooth trend with a seasonal oscillation on top, while multiplying a Periodic kernel by an RBF yields a repeating pattern whose shape drifts slowly over time.
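In scikit-learn, this composition is literal operator algebra on kernel objects. A short sketch (the length-scales and variances here are illustrative placeholders, not recommendations):

```python
# Sketch: composing kernels with scikit-learn's kernel algebra.
# '+' adds covariance structures; '*' modulates one kernel by another.
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# Smooth long-term trend + periodic seasonal component + observation noise
trend = 1.0 * RBF(length_scale=50.0)
seasonal = 0.5 * ExpSineSquared(length_scale=1.0, periodicity=1.0)
noise = WhiteKernel(noise_level=0.1)
composite = trend + seasonal + noise

# A product kernel: a periodic pattern whose amplitude drifts over time
drifting_seasonal = RBF(length_scale=100.0) * ExpSineSquared(periodicity=1.0)
```

Passing `composite` to `GaussianProcessRegressor` optimizes all component hyperparameters jointly.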

2.3 Automatic Relevance Determination (ARD)

When dealing with multiple input features, we can use a separate length-scale parameter \(\ell_d\) for each feature dimension \(d\). This is known as an ARD kernel:

\[k(\mathbf{x}, \mathbf{x'}) = \sigma^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\ell_d^2}\right)\]

After optimizing the hyperparameters, if a length-scale \(\ell_d\) is large, it means the function is nearly constant along that dimension, so the feature is not very relevant. If \(\ell_d\) is small, the function varies rapidly along that dimension, indicating an important feature. This provides a powerful, built-in method for feature selection.
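In scikit-learn, an ARD kernel is obtained simply by passing a vector of length-scales to `RBF`. A sketch on synthetic data (the data-generating function is an assumption chosen so that feature 0 matters and feature 1 does not):

```python
# Sketch: ARD in scikit-learn — one length-scale per input feature.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 2))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(80)  # feature 1 is irrelevant

kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=3).fit(X, y)

ell = gp.kernel_.k1.length_scale   # optimized per-feature length-scales
# Expect ell[0] (relevant, rapidly varying) to be much shorter than ell[1]
```

After fitting, `ell[1]` is driven toward a very large value, flagging feature 1 as irrelevant exactly as described above.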

[ARD Diagram: An input vector [x1, x2, x3] is shown. Arrows point from each feature to a separate length-scale parameter (l1, l2, l3) inside the kernel function, illustrating that each dimension is treated differently.]

2.4 Advanced and Automated Kernel Design

2.5 A Practical Necessity: Feature Scaling

It is crucial to scale input features before training a GP, typically to have zero mean and unit variance (`StandardScaler`). Because kernels rely on distance metrics, features with larger scales will dominate the distance calculation. Proper scaling ensures that all features are on a level playing field and makes the hyperparameter optimization process more stable and reliable.
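A `Pipeline` keeps the scaling and the GP together, so the identical transformation is applied at train and predict time. A minimal sketch with synthetic features on very different scales (the data is illustrative):

```python
# Sketch: standardize features before GP fitting via a scikit-learn Pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 50),       # feature on scale ~1
                     rng.uniform(0, 1000, 50)])   # feature on scale ~1000
y = np.sin(6 * X[:, 0]) + 0.001 * X[:, 1]

model = make_pipeline(StandardScaler(), GaussianProcessRegressor(kernel=RBF()))
model.fit(X, y)                 # scaler is fit on training data only
y_pred = model.predict(X)       # the same scaling is reapplied automatically
```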

3. Hyperparameters & Evidence Maximisation

Kernel hyperparameters \(\theta\) (e.g., length-scale \(\ell\), signal variance \(\sigma^2\)) are not set manually. Instead, they are learned from the data by maximizing the log-marginal likelihood (LML), also known as the "evidence". This process finds the kernel parameters that make the observed data most plausible under the GP model.

3.1 The Optimization Process

The LML objective function is:

\[\log p(\mathbf{y}|\mathbf{X}, \theta) = -\tfrac{1}{2} \mathbf{y}^T (K_{\theta} + \sigma_n^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log|K_{\theta} + \sigma_n^2 I| - \tfrac{n}{2}\log 2\pi\]

This function is typically optimized using gradient-based methods like L-BFGS-B. While the LML itself involves an \(\mathcal{O}(n^3)\) matrix inversion, its gradient \(\partial \log p / \partial \theta\) can be calculated efficiently (\(\mathcal{O}(n^2)\)) by reusing the inverted matrix. This makes gradient-based optimization feasible.
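scikit-learn runs exactly this procedure inside `.fit()`, using L-BFGS-B with optional random restarts. A sketch on synthetic data showing that the optimized hyperparameters achieve a higher LML than the initial guess:

```python
# Sketch: LML maximization in scikit-learn (L-BFGS-B runs inside .fit()).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

lml_initial = gp.log_marginal_likelihood(gp.kernel.theta)     # at the initial guess
lml_optimized = gp.log_marginal_likelihood(gp.kernel_.theta)  # at the optimum found
print(gp.kernel_)   # the optimized hyperparameters
```

Restarts (`n_restarts_optimizer`) matter because the LML surface is generally multi-modal.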

[A 2D contour plot showing the LML surface as a function of two hyperparameters (e.g., length-scale and signal variance), with a clear peak indicating the optimal values.]
[A line graph showing the Log-Marginal Likelihood increasing over optimization iterations (e.g., L-BFGS-B steps) and converging to a maximum value.]

3.2 Overfitting Risks and The Role of Priors

Maximizing the LML can sometimes lead to overfitting. A common failure mode is the noise variance hyperparameter \(\sigma_n^2\) being driven to zero, causing the GP to interpolate the noisy data perfectly. To combat this, one can place a hyperprior on the noise parameter (e.g., an inverse-gamma prior) to prevent it from becoming too small. This adds a regularization effect to the optimization.
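scikit-learn does not expose hyperpriors directly (libraries such as GPyTorch do), but a crude version of the same regularization is to clamp the noise level away from zero via its optimization bounds. A sketch:

```python
# Sketch: prevent the noise variance from collapsing to zero. A lower bound on
# WhiteKernel's noise level acts as a crude stand-in for a hyperprior.
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(
    noise_level=0.1,
    noise_level_bounds=(1e-3, 1e1),  # forbid sigma_n^2 -> 0 (no perfect interpolation of noise)
)
```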

3.3 Interpreting ARD Kernels

As mentioned, an ARD kernel learns a separate length-scale \(\ell_d\) for each feature. After optimization, these length-scales provide a powerful way to interpret feature relevance. A long length-scale implies the function changes slowly along that feature's axis, making it less important. A short length-scale implies high importance.

[A bar chart of optimized ARD length-scales for several features. Features with short bars are labeled "Important," and features with long bars are labeled "Less Important."]

3.4 Beyond Point Estimates: Full Bayesian Treatment

While maximizing the LML gives a single point estimate for the best hyperparameters, a fully Bayesian approach would integrate over the uncertainty in the hyperparameters themselves. This is computationally expensive but can be approximated with methods like Markov Chain Monte Carlo (MCMC), such as Hamiltonian Monte Carlo (HMC), which is particularly useful when the posterior uncertainty of the hyperparameters is important for decision-making (e.g., in robust experimental design).

4. Bayesian Optimisation: Guiding Experiments Intelligently

Bayesian Optimisation (BO) is a powerful global optimization strategy for finding the maximum or minimum of "black-box" functions that are expensive to evaluate (e.g., running a physical experiment). It is one of the most important applications of GPs in science.

4.1 The BO Loop

The process works in a closed loop, intelligently trading off exploration and exploitation:

  1. Model: Fit a GP to all the experimental data collected so far. This creates a probabilistic surrogate model of the objective function.
  2. Acquire: Use an acquisition function to identify the most promising point to sample next. This function balances exploring uncertain regions (exploration) and sampling near known good values (exploitation).
  3. Evaluate: Run the expensive experiment at the point suggested by the acquisition function.
  4. Repeat: Add the new data point to the dataset and refit the GP, updating your beliefs about the objective function.
[GP Posterior + Acquisition Function Diagram: A 1D plot showing the GP's mean prediction and shaded uncertainty. Below it, the acquisition function (e.g., UCB) is plotted, with its peak clearly marked as the "Next Point to Sample".]
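One iteration of steps 1–2 can be sketched in a few lines: fit a GP to the data so far, evaluate the UCB acquisition \(\mu(x) + \kappa\,\sigma(x)\) on a candidate grid, and propose its maximizer. The observations, kernel choice, and \(\kappa\) value below are illustrative assumptions:

```python
# Sketch of one BO iteration: model, then acquire via Upper Confidence Bound.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X_obs = np.array([[0.1], [0.4], [0.9]])   # experiments run so far
y_obs = np.array([0.2, 0.8, 0.3])         # measured objective values

# 1. Model: probabilistic surrogate of the objective
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_obs, y_obs)

# 2. Acquire: UCB = mu + kappa * sigma over a candidate grid
candidates = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
kappa = 2.0                                # exploration weight (assumed)
ucb = mu + kappa * sigma
x_next = candidates[np.argmax(ucb)]        # 3. Evaluate the experiment here next
```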

4.2 Tuning the Exploration-Exploitation Trade-off

The behavior of the acquisition function is controlled by a key parameter. For Upper Confidence Bound (UCB), \(\alpha(x) = \mu(x) + \kappa\,\sigma(x)\), that parameter is \(\kappa\): a large \(\kappa\) weights the uncertainty term heavily and favors exploration, while a small \(\kappa\) favors exploitation near the current best. (Expected Improvement has an analogous trade-off parameter, often written \(\xi\).) Adjusting it allows you to tailor the BO strategy:

4.3 Advanced BO for Real-World Scenarios

Standard BO can be extended to handle more complex experimental realities.

4.4 The Curse of Dimensionality in BO

BO is most effective in low-dimensional spaces (typically <20 dimensions). In high-dimensional spaces, the volume to search becomes vast, and the performance of standard acquisition functions can degrade to be no better than random search. For these problems, dimensionality reduction techniques or specialized BO algorithms that use trust regions are often required.

5. Scalable & Sparse Gaussian Processes

This section addresses the primary limitation of GPs—their poor scalability—and introduces methods to apply them to large datasets.

5.1 Motivation: The \(\mathcal{O}(n^3)\) Barrier

As we've seen, a standard GP requires inverting an \(n \times n\) covariance matrix, a step with \(\mathcal{O}(n^3)\) time and \(\mathcal{O}(n^2)\) memory complexity due to Cholesky decomposition. This is infeasible for modern scientific datasets (e.g., spectra, images, simulations) which can contain tens of thousands to millions of samples. To overcome this, we turn to approximation methods.

5.2 Inducing-Point Approximations

The core idea behind most sparse GP methods is to select a smaller set of \(m \ll n\) inducing points that effectively summarize the full dataset. The GP is then conditioned on these inducing points instead of the entire dataset, reducing the complexity. Different methods vary in how they relate these inducing points back to the full data.

[Inducing Points Diagram: A scatter plot of the original n data points (in gray). A smaller subset of m inducing points (in blue) is shown, strategically placed to capture the overall trend of the data. The GP posterior is then approximated based on these m points.]
| Method | Core Idea | Complexity (n, m) | Characteristics |
|---|---|---|---|
| DTC / PP | A simple low-rank approximation using \(m\) inducing points. | \(\mathcal{O}(nm^2)\) | Fast but known to underestimate the posterior variance. |
| FITC | Adds independent noise terms for each training point to correct the variance. | \(\mathcal{O}(nm^2)\) | Better variance estimates than DTC, but can be prone to overfitting. |
| VFE | Optimizes a variational free-energy bound, providing better theoretical guarantees. | \(\mathcal{O}(nm^2)\) | The foundation for modern Stochastic Variational GPs (SVGP). |
| SVGP | Combines VFE with mini-batching and stochastic variational inference. | \(\mathcal{O}(m^3)\) per step | Enables GP training on massive, streaming datasets. |

Choosing Inducing Points: The locations of the \(m\) points are critical. They can be placed on a regular grid (for low-dimensional data), chosen via K-means clustering (for high-dimensional data), or even treated as hyperparameters to be optimized. A common rule of thumb is to start with \(m \approx \sqrt{n}\).
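The K-means option is a one-liner in scikit-learn. A sketch applying the \(m \approx \sqrt{n}\) rule of thumb to synthetic high-dimensional inputs:

```python
# Sketch: choosing m ~ sqrt(n) inducing-point locations via K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((2500, 4))        # n = 2500 training inputs, 4 features
m = int(np.sqrt(len(X)))                  # rule of thumb: m ~ sqrt(n) = 50

kmeans = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
Z = kmeans.cluster_centers_               # (m, 4) inducing-point locations
```

The centers `Z` would then be passed to a sparse GP (e.g., as the `inducing_points` tensor in the GPyTorch snippet of Section 6.6), optionally with `learn_inducing_locations=True` to refine them.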

5.3 Structured Kernel Interpolation (SKI / KISS-GP)

For datasets with low-dimensional grid-like structure, SKI provides a powerful alternative. It approximates the full kernel matrix \(K_{nn}\) by rewriting it as an interpolation from a smaller kernel matrix \(K_{mm}\) defined on a grid of \(m\) inducing points: \(K_{nn} \approx W_{nm} K_{mm} W_{nm}^{\top}\), where \(W\) is a sparse interpolation matrix. By leveraging fast linear algebra techniques for structured matrices (Toeplitz, Kronecker), KISS-GP can reduce complexity dramatically.

This makes it possible to train near-exact GPs on millions of data points, provided the data has low-dimensional grid structure.

5.4 Scalable Exact GPs on GPU (GPyTorch)

The GPyTorch library, built on PyTorch, enables training of exact GPs on massive datasets by leveraging GPU acceleration. It uses advanced numerical linear algebra techniques like Lanczos Variance Estimation (LOVE) and Black-box Matrix-Matrix multiplication (BBMM) to perform the key computations without ever materializing the full \(n \times n\) matrix. This allows for training exact GPs on up to a million points on a single GPU.


# GPyTorch example for large-scale exact GP regression on GPU
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# train_x, train_y: your training tensors. Define a Gaussian likelihood and
# move data, likelihood, and model to the GPU.
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
model = ExactGPModel(train_x.cuda(), train_y.cuda(), likelihood).cuda()

# Use fast BBMM-based computations for the marginal log-likelihood
with gpytorch.settings.fast_computations(log_prob=True):
    ...  # standard training loop: Adam on the exact marginal log-likelihood
      

5.5 Practical Checklist for Scalable GPs

| Scenario | Recommended Technique | Tips |
|---|---|---|
| \(n < 2{,}000\) | Full GP | Standard Cholesky decomposition on a CPU is fine. |
| \(2{,}000 \le n \le 50{,}000\) | FITC / VFE / SVGP | Use \(m = 500\)–\(1000\) inducing points. Optimize with Adam. |
| \(n > 50{,}000\), low-dim inputs | KISS-GP | Place inducing points on a grid. Leverage Toeplitz algebra. |
| \(n > 50{,}000\), high-dim or streaming | SVGP | Use mini-batches (e.g., size 512) and \(m \le 1024\) inducing points. |
| GPU is available | GPyTorch | Enable BBMM-based fast computations and LOVE (`gpytorch.settings.fast_pred_var`) for fast predictive variances. |

6. GP Classification & Non-Gaussian Likelihoods

While GPs are naturally suited for regression with Gaussian noise, their probabilistic framework can be extended to other data types like binary classification or count data. This requires moving beyond the standard Gaussian likelihood.

6.1 Motivation: Why a Gaussian Likelihood Isn't Always Enough

A Gaussian likelihood assumes the target variable \(y\) is continuous and unbounded. For many scientific problems, this assumption is invalid.

| Data Type | Appropriate Likelihood \(p(y \mid f)\) | Example (Research Context) |
|---|---|---|
| Binary \(y \in \{0, 1\}\) | Bernoulli (via logit/probit link) | Classifying whether an electrochemical sensor detects a target analyte or not. |
| Counts \(y \in \mathbb{N}\) | Poisson (or Negative-Binomial) | Counting the number of current pulses generated in a single-molecule experiment. |
| Bounded continuous \(0 < y < 1\) | Beta (via logit link) | Modeling the faradaic efficiency or yield of a reaction. |

The core idea is to keep the Gaussian Process prior over the latent function \(f(x)\), but combine it with a more appropriate likelihood. However, this comes at a cost: the posterior distribution \(p(f|y)\) is no longer a Gaussian and cannot be computed in closed form. This necessitates the use of approximate inference algorithms.

6.2 Likelihoods and Link Functions

To connect the unbounded output of the latent GP, \(f(x)\), to a constrained space (like probabilities [0, 1] or positive counts), we use a link function.

[Image of a sigmoid link function, showing how the latent function f(x) on the x-axis is mapped to a probability between 0 and 1 on the y-axis.]
Fig 6-1. The sigmoid link function \(\sigma(f)=1/(1+e^{-f})\) maps the latent function \(f(x)\) to a probability space, so that \(p(y=1|x)=\sigma(f(x))\).

6.3 Approximate Inference Algorithms

Since the true posterior is intractable, we must approximate it. The choice of algorithm involves a trade-off between accuracy, speed, and implementation complexity.

| Method | Core Idea | Complexity* | Prediction Quality | Implementation Difficulty |
|---|---|---|---|---|
| Laplace Approximation | Finds the posterior mode (MAP) of \(f\) and fits a Gaussian centered there using a second-order Taylor expansion. | \(\mathcal{O}(n^3)\) | ★★☆☆☆ | ★☆☆☆☆ |
| Expectation Propagation (EP) | Iteratively refines local Gaussian "site" approximations for each data point by matching moments of the true posterior. | \(\sim 3\times\) Laplace | ★★★★★ | ★★★★☆ |
| Variational Inference (VI) | Finds the "closest" Gaussian \(q(f)\) to the true posterior by minimizing their KL-divergence (equivalent to maximizing the ELBO). | \(\mathcal{O}(m^3)\) per step | ★★★☆☆ | ★★★☆☆ |

* n = data points, m = inducing points (for SVGP/VI).

6.4 Decision Boundary and Uncertainty

For GP classification, the output is a probability for each class. The decision boundary is the contour where \(p(y=1|x) = 0.5\). The model's uncertainty is highest near this boundary and in regions far from any training data. This uncertainty is crucial for active learning, where we can choose to sample the next point in the most uncertain region to gain the most information.
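The active-learning rule described above reduces to a one-line query strategy: among a pool of candidates, pick the one whose predicted probability is closest to 0.5. A tiny sketch (the probabilities are illustrative stand-ins for a classifier's output):

```python
# Sketch: uncertainty sampling — query the point nearest the decision boundary.
import numpy as np

def most_uncertain(probs):
    """Index of the candidate whose predicted p(y=1|x) is closest to 0.5."""
    return int(np.argmin(np.abs(np.asarray(probs) - 0.5)))

probs = [0.95, 0.52, 0.10, 0.80]   # predictions on a candidate pool (assumed)
next_idx = most_uncertain(probs)    # -> 1: the 0.52 candidate is queried next
```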

[Image of a GP classification result on 2D data. Blue and red regions show the predicted class. A purple band in the middle shows the uncertainty region (e.g., p=0.5 ± 0.1), which is widest far from the data.]

6.5 Probability Calibration

The raw probabilities from a GP classifier are not always "calibrated" — meaning a predicted probability of 0.8 might not correspond to an 80% chance of the event actually happening. It's important to check and, if necessary, correct this.
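scikit-learn's reliability-curve utility makes this check easy: bin the predictions and compare each bin's mean predicted probability to its observed event frequency. The data below is synthetic and constructed to be calibrated, purely to show the mechanics:

```python
# Sketch: checking calibration with scikit-learn's calibration_curve.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 2000)                          # predicted probabilities
y_true = (rng.uniform(0, 1, 2000) < y_prob).astype(int)   # calibrated by construction

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
# For a calibrated model, frac_pos ~ mean_pred in every bin
max_gap = np.max(np.abs(frac_pos - mean_pred))
```

Large gaps on real classifier output would motivate a post-hoc correction such as Platt scaling or isotonic regression.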

6.6 Implementation Snippet — Bernoulli GP with GPyTorch


# GPyTorch requires a GP model and a likelihood
import gpytorch

class GPClassifier(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # inducing_points: an (m, d) tensor, e.g. a subset of the training inputs
        variational_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_dist, learn_inducing_locations=True)
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5))

    def forward(self, x):
        # Returns the distribution of the latent function f(x)
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# A Bernoulli likelihood maps the latent GP to binary class probabilities
likelihood = gpytorch.likelihoods.BernoulliLikelihood()
model = GPClassifier(inducing_points)
        

6.7 Practical Checklist

| Scenario | Recommended Algorithm | Hyperparameter Tips |
|---|---|---|
| Small data (n ≲ 1k) | Laplace / EP | Set Newton iterations ≤ 10; add jitter of 1e-6. |
| Medium data (1k – 50k) | VFE / SVGP | Use \(m \sim \sqrt{n}\) inducing points. |
| Large-scale & online | SVGP | Use mini-batches (e.g., 512) and the Adam optimizer with lr = 3e-3. |
| Calibration is critical | EP or VI + Platt scaling | Aim for Brier score < 0.05. |
| Multi-class classification | One-vs-Rest wrapper around binary GP classifiers | Account for class imbalance with `class_weight`. |

7. Hands-On Python Examples

7.1 GP Regression (scikit-learn)

This example shows how to fit a GP to data, making predictions and visualizing the uncertainty.
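A minimal version of such an example — the synthetic data and kernel choices are illustrative assumptions; the plotting step (e.g., `fill_between` on mean ± 1.96·std) is omitted:

```python
# Sketch: GP regression with scikit-learn — fit, predict, extract the 95% band.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(25, 1))
y = X[:, 0] * np.sin(X[:, 0]) + 0.5 * rng.standard_normal(25)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.25)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std   # 95% confidence band
```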

7.2 One-line Bayesian Optimisation (scikit-optimize)

The `scikit-optimize` library provides a convenient wrapper for Bayesian Optimization.

8. Lab: Next‑Experiment Suggestion

Scenario: An automated electrochemistry bench can synthesize and test catalysts at different temperatures. We want to find the temperature that maximizes catalytic yield with as few runs as possible.

Approach: We use Bayesian Optimization. Starting with two initial experiments, we fit a GP, use the UCB acquisition function to propose the next experiment, run the experiment, and repeat the cycle. This closed-loop process rapidly converges to an optimum while explicitly quantifying uncertainty.
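The closed loop can be sketched end to end. Here a simulated yield curve (peaking near 65 °C, an assumption for illustration) stands in for the physical bench, and kernel hyperparameters are held fixed for reproducibility; on a real bench, `run_experiment` would trigger an actual synthesis-and-test run:

```python
# Closed-loop sketch for the lab: simulated bench + UCB-driven GP optimisation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def run_experiment(temp_c):
    """Simulated bench: catalytic yield peaks near 65 C, with small noise."""
    return np.exp(-0.5 * ((temp_c - 65.0) / 15.0) ** 2) + 0.01 * rng.standard_normal()

rng = np.random.default_rng(0)
temps = [30.0, 90.0]                         # two initial experiments
yields = [run_experiment(t) for t in temps]
grid = np.linspace(20, 100, 161).reshape(-1, 1)

for _ in range(8):                           # BO loop: model -> acquire -> evaluate
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5, length_scale=10.0) + WhiteKernel(noise_level=1e-2),
        optimizer=None,                      # hyperparameters fixed (assumed values)
        normalize_y=True,
    ).fit(np.array(temps).reshape(-1, 1), yields)
    mu, sigma = gp.predict(grid, return_std=True)
    t_next = float(grid[np.argmax(mu + 2.0 * sigma)])   # UCB with kappa = 2
    temps.append(t_next)
    yields.append(run_experiment(t_next))

best_temp = temps[int(np.argmax(yields))]
```

After ten total runs, the best observed temperature clusters around the simulated optimum, illustrating how the loop converges with far fewer experiments than a grid sweep.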

9. Practical Tips & Pitfalls

10. Key Takeaways