
Unsupervised Learning: PCA, Manifolds & Clustering

From eigen-decomposition to density-based discovery — mine hidden patterns in complex scientific data without a single label.

1. Motivation & Roadmap

In scientific research, we often collect vast amounts of data before we have clear, well-defined labels. Unsupervised learning provides the tools to explore this data, uncover its intrinsic structure, and generate new hypotheses. It primarily addresses two core questions:

  1. Compression & Denoising: "What are the dominant factors of variation in my data?" This is the domain of dimensionality reduction.
  2. Segmentation & Discovery: "Which of my samples behave similarly?" This is the domain of clustering.

1.1 The Scientific Discovery Workflow

Unsupervised learning is a key part of the broader scientific discovery process. This entire workflow can be visualized as a cycle of hypothesis and discovery.

[Flowchart: Experiment Design → Data Collection → Exploratory Data Analysis (EDA) → Unsupervised Modeling → Interpretation & Hypothesis Generation → (loops back to Experiment Design)]

Our focus is on the "Unsupervised Modeling" block, which itself is a detailed pipeline. Let's explore this with a concrete example.

1.2 A Practical Roadmap: From Raw Data to Insight

Imagine we have collected a dataset of 200 Raman spectra, each with 1024 data points (features). Our goal is to see if there are distinct material phases present in the samples. The path from this raw data (a 200x1024 matrix) to scientific insight follows a structured roadmap.

[Detailed Flowchart: Raw Spectra (200x1024) ↦ Preprocess ↦ Dimension Reduction (PCA/UMAP) ↦ Cluster (HDBSCAN/GMM) ↦ Validate (Silhouette/Stability) ↦ Interpret (Inspect Cluster-wise Spectra)]

Step 1: Preprocessing - The Crucial First Step

Raw data is rarely ready for modeling. Preprocessing ensures that the patterns found by the model are scientifically meaningful, not artifacts of data collection or scaling. A good checklist includes:

  1. Handle missing values and discard clearly corrupted measurements.
  2. Apply domain-specific corrections (for spectra: baseline subtraction, cosmic-ray removal).
  3. Normalize each sample if absolute intensity is not meaningful (e.g., area or vector normalization).
  4. Standardize features (e.g., with StandardScaler) so no feature dominates purely because of its scale.

[Two-panel plot. Left: "Before Scaling" - two histograms of features with vastly different scales (e.g., one from 0-1, another from 1000-5000). Right: "After StandardScaler" - both histograms are now centered at 0 with similar spread.]
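A minimal sketch of the standardization step, using synthetic features (stand-ins for real measurements) on deliberately mismatched scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on wildly different scales, as in the figure above
X_raw = np.column_stack([rng.uniform(0, 1, 200),
                         rng.uniform(1000, 5000, 200)])

X_scaled = StandardScaler().fit_transform(X_raw)

# After scaling, every feature has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```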

Step 2: Dimensionality Reduction

Working with 1024 features directly is computationally expensive and prone to the "curse of dimensionality." We use techniques like PCA or UMAP to project the data into a low-dimensional space (e.g., 2 or 3 dimensions) that captures the most important variance. This allows for visualization and often improves clustering performance.

Step 3: Clustering

On these low-dimensional embeddings, we apply a clustering algorithm (like K-Means, HDBSCAN, or GMM) to find natural groups in our 200 spectra. The choice of algorithm depends on the expected geometry of the data (e.g., spherical vs. arbitrary shapes).

Step 4: Validation

How good are the clusters? We use internal metrics like the Silhouette score to quantify how dense and well-separated our clusters are. We can also assess their stability by seeing if the same clusters emerge when we re-run the analysis on bootstrapped samples of our data.

Step 5: Interpretation

This is where data science becomes science. If we found 3 stable clusters, we must go back to the original, high-dimensional data. We can plot the average spectrum for each of the 3 groups. Do they correspond to known material phases? Does one cluster show a new, unexpected peak? This is the step that generates new hypotheses to be tested in the lab.
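The "average spectrum per cluster" step can be sketched as follows; the spectra and labels here are random placeholders standing in for real preprocessed data and real cluster assignments:

```python
import numpy as np

rng = np.random.default_rng(1)
spectra = rng.normal(size=(200, 1024))   # stand-in for 200 preprocessed spectra
labels = rng.integers(0, 3, size=200)    # stand-in for 3-cluster assignments

# Average spectrum per cluster: the object a domain scientist actually inspects
mean_spectra = {int(k): spectra[labels == k].mean(axis=0)
                for k in np.unique(labels)}

for k, m in mean_spectra.items():
    print(k, m.shape)
```

Comparing these mean spectra against reference spectra of known phases is what turns a cluster label into a scientific claim.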

2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a cornerstone of unsupervised learning. It performs a linear transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This new coordinate system is chosen such that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

2.1 The Mathematics of PCA: Two Perspectives

Understanding PCA can be approached from two equivalent mathematical viewpoints: eigenvalue decomposition and singular value decomposition (SVD).

Perspective 1: Eigenvalue Decomposition of the Covariance Matrix

This is the traditional view. For a centered data matrix \(X \in \mathbb{R}^{n \times d}\) (where \(n\) is samples, \(d\) is features), PCA seeks to find the directions of maximum variance. This is framed as an eigenvalue problem on the covariance matrix \(C = \frac{1}{n-1}X^T X\):

\[C w_i = \lambda_i w_i\]

The eigenvectors \(w_i\) (the principal components) form the axes of the new feature space, and the corresponding eigenvalues \(\lambda_i\) represent the variance captured by each component. The eigenvectors are ordered by their eigenvalues, from largest to smallest.

Perspective 2: Singular Value Decomposition (SVD)

A more numerically stable and general approach is to use SVD on the data matrix \(X\). SVD decomposes \(X\) into three matrices:

\[X = U \Sigma V^T\]

Here the columns of \(V\) are the principal components (identical to the eigenvectors \(w_i\)), and the singular values \(\sigma_i\) relate to the eigenvalues via \(\lambda_i = \sigma_i^2 / (n-1)\). This view is often preferred in practice due to its numerical stability: it avoids forming \(X^T X\), which squares the condition number. The computational cost of SVD-based PCA is typically \(\mathcal{O}(nd^2)\) when \(n \ge d\), or \(\mathcal{O}(n^2 d)\) when \(d > n\).
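A quick numerical check on synthetic data confirms that the two perspectives agree (eigenvectors match the right singular vectors up to sign, and \(\lambda_i = \sigma_i^2/(n-1)\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)        # PCA requires centered data
n = Xc.shape[0]

# Perspective 1: eigen-decomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# Perspective 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues equal squared singular values divided by (n - 1)
print(np.allclose(eigvals, s**2 / (n - 1)))

# Eigenvectors match rows of V^T up to an overall sign flip
print(all(np.allclose(np.abs(eigvecs[:, i]), np.abs(Vt[i])) for i in range(5)))
```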

2.2 The Toolkit for PCA Interpretation

Performing PCA is easy, but interpreting the results to extract scientific insight is the real skill. Several tools are essential for this.

Scree Plot and Cumulative Variance

How many principal components should we keep? The scree plot helps us decide. It plots the variance (eigenvalue) explained by each principal component, in descending order. We look for an "elbow" in the plot, a point where the explained variance begins to level off. A cumulative variance plot is often overlaid to see the total percentage of variance captured by the first \(k\) components.

[Scree Plot: A bar chart showing the variance explained by each PC (e.g., PC1: 45%, PC2: 20%, PC3: 8%...). An overlaid line plot shows the cumulative sum (e.g., PC1: 45%, PC1+PC2: 65%, PC1+PC2+PC3: 73%...). An "elbow" is visible around PC3.]

Loading Plot

The loadings are the coefficients of the linear combination of the original variables from which the principal components are constructed. A loading plot (often a bar chart) shows the contribution of each original feature to a given principal component. This is the key to understanding the physical meaning of the PCs.

[Loading Bar Chart for PC1: A bar chart where the x-axis represents original features (e.g., Raman shifts 100cm⁻¹, 150cm⁻¹, etc.) and the y-axis shows the loading value. Large positive or negative bars indicate features that strongly influence PC1.]

Biplot

A biplot is a powerful visualization that overlays the scores (the projected data points) and the loadings (the original feature directions) in the same plot, typically for PC1 vs. PC2. It allows us to see the relationships between samples and how they relate to the original variables simultaneously.

[2D Biplot: A scatter plot of samples projected onto PC1 and PC2. Arrows originating from the center represent the original features (loadings). Samples that are far along the direction of an arrow have high values for that feature.]
# Python pseudocode for PCA interpretation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assume X is our n x d data matrix
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)

# 1. Scree Plot
plt.figure()
plt.bar(range(1, 11), pca.explained_variance_ratio_)
plt.plot(range(1, 11), np.cumsum(pca.explained_variance_ratio_), 'r-o')
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Scree Plot")
plt.show()

# 2. Loading Plot for PC1
loadings_pc1 = pca.components_[0, :]
plt.figure()
plt.bar(range(len(loadings_pc1)), loadings_pc1)
plt.xlabel("Original Feature Index")
plt.ylabel("Loading Value")
plt.title("Loadings for PC1")
plt.show()

# 3. Biplot (conceptual)
# feature_names is assumed to be a list of the d original feature labels.
# The factor of 5 simply scales the loading arrows for visibility.
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1])  # scores: samples projected onto PC1/PC2
for i, feature in enumerate(feature_names):
    plt.arrow(0, 0, pca.components_[0, i]*5, pca.components_[1, i]*5, color='r', head_width=0.1)
    plt.text(pca.components_[0, i]*5.5, pca.components_[1, i]*5.5, feature, color='r')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Biplot")
plt.grid()
plt.show()

2.3 PCA Whitening

An optional but useful transformation in PCA is whitening. After projecting the data onto the principal components, this step rescales each component to have unit variance. The resulting data matrix has a covariance matrix that is the identity matrix, meaning the features are uncorrelated and all have the same variance. This can be a useful preprocessing step for other algorithms and is common in signal processing for decorrelating signals.
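A minimal sketch using scikit-learn's built-in whitening option (the synthetic correlated data is illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 2-D data: a shear matrix induces strong feature covariance
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])

Z = PCA(whiten=True).fit_transform(X)

# After whitening, the transformed features are uncorrelated with unit variance,
# so the sample covariance matrix is (numerically) the identity
print(np.cov(Z, rowvar=False).round(3))
```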

3. Advanced PCA: Kernel & Robust PCA

While powerful, standard PCA has two key limitations: it's linear and it's sensitive to outliers. Advanced PCA variants address these issues.

3.1 Kernel PCA

When the variance in data follows a non-linear structure (e.g., curves, spirals), standard PCA fails to capture it effectively. Kernel PCA overcomes this using the "kernel trick." Instead of explicitly mapping the data with a non-linear function, it replaces the dot products in the standard PCA algorithm with a non-linear kernel function (e.g., RBF, polynomial). This implicitly maps the data into a higher-dimensional feature space where, hopefully, the structure becomes linearly separable.

[Two-panel comparison on spiral data. Left: Standard PCA fails to separate the spiral, projecting it into an overlapping blob. Right: Kernel PCA with an RBF kernel successfully "unrolls" the spirals into two distinct, linearly separated clusters.]

Choosing Kernel PCA Parameters

The performance of Kernel PCA is critically dependent on the choice of kernel and its parameters.

Kernel                      | Use Case                                                                 | Key Parameter(s)
Linear                      | Recovers standard PCA. Good for baseline comparison.                     | None
Polynomial                  | Models polynomial relationships.                                         | degree, gamma, coef0
RBF (Radial Basis Function) | Highly flexible, can capture complex structures. A great default choice. | gamma (\(\gamma\))
Sigmoid                     | Similar to activation functions in neural networks.                      | gamma, coef0

The gamma (\(\gamma\)) parameter for the RBF kernel is crucial. It controls the width of the kernel. If \(\gamma\) is too large, the model can overfit, treating each point as its own cluster. If it's too small, the model underfits and fails to capture the non-linear structure. A common heuristic for a starting value is 1 / (n_features * X.var()).
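The heuristic above can be sketched with scikit-learn's KernelPCA on a classic non-linear dataset (concentric circles standing in for curved scientific data; the parameter values are illustrative starting points, not tuned choices):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: a structure that linear PCA cannot separate
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Heuristic starting value for gamma, as given in the text
gamma = 1 / (X.shape[1] * X.var())

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape, round(float(gamma), 3))
```

From here, gamma would typically be refined by grid search, judging each embedding by how well downstream clustering or visualization works.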

3.2 Robust PCA

Because it tries to maximize variance, standard PCA is extremely sensitive to large outliers. A single corrupted measurement can completely skew the direction of the principal components. Robust PCA addresses this by decomposing the data matrix \(X\) into two separate matrices: a low-rank matrix \(L\), representing the underlying structure, and a sparse matrix \(S\), representing the outliers or sparse noise.

\[ X = L + S \]

This decomposition is achieved by solving the optimization problem: \(\min_{L,S} \|L\|_* + \lambda\|S\|_1\), where \(\|L\|_*\) is the nuclear norm (sum of singular values) and \(\|S\|_1\) is the L1-norm (sum of absolute values of elements). This is like separating the background (low-rank) from the foreground (sparse).

[Three-panel decomposition of a face image. Left: Original image of a face with sunglasses. Center: The low-rank component L, showing only the clean face without sunglasses. Right: The sparse component S, showing only the sunglasses against a black background.]

Practical Applications and Parameters

Robust PCA is highly effective for tasks like removing background in video surveillance or removing instrumental glitches, like cosmic ray hits, from a set of spectra. The regularization parameter \(\lambda\) balances the trade-off between the low-rank and sparse components. A good rule of thumb for a starting value is \(\lambda = 1 / \sqrt{\max(n, d)}\), where \(n\) and \(d\) are the dimensions of the data matrix.

# Pseudocode for Robust PCA using the third-party r_pca library
# (a small open-source implementation of Principal Component Pursuit)
from r_pca import R_pca
import numpy as np

# X is our n x d data matrix (e.g., video frames or spectra)
# Set the lambda parameter
lambda_param = 1 / np.sqrt(max(X.shape))
rpca = R_pca(X, lmbda=lambda_param)
L, S = rpca.fit(max_iter=10000, iter_print=100)

# L is the clean background / underlying structure
# S is the sparse outliers / moving objects

4. Non-linear Manifold Learning

While Kernel PCA can find non-linear structures, a broader class of algorithms known as manifold learning is specifically designed for this purpose, primarily for visualization. The core idea is the "manifold hypothesis": that high-dimensional data often lies on or near a much lower-dimensional, non-linear manifold embedded within the high-dimensional space. These algorithms try to "unroll" this manifold to see its true geometry.

[Multi-panel collage. Top-left: A 3D Swiss-roll dataset. Other panels show the 2D embeddings from different algorithms: PCA (fails, squashes layers), Isomap (unrolls successfully), LLE (unrolls but can have distortions), t-SNE (separates clusters but tears the manifold), UMAP (preserves both local and global structure well).]

4.1 A Tour of Manifold Algorithms

Different algorithms make different assumptions about the manifold's structure:

  1. Isomap: preserves geodesic (along-the-manifold) distances, estimated via shortest paths on a nearest-neighbor graph.
  2. Locally Linear Embedding (LLE): reconstructs each point as a weighted combination of its neighbors and preserves those weights in the embedding.
  3. t-SNE: converts pairwise distances into neighbor probabilities and matches them in low dimension; excellent at revealing local clusters, but can tear the manifold and distort global geometry.
  4. UMAP: models the data as a fuzzy topological graph; typically faster than t-SNE and better at preserving global structure.

4.2 The Impact of Hyperparameters

The output of these algorithms is highly sensitive to their hyperparameters. Understanding them is key to producing meaningful visualizations.

[Three-panel comparison of t-SNE on the same data. Left: perplexity=5 (data is shattered into many small, meaningless clumps). Center: perplexity=30 (a good balance, revealing clear clusters). Right: perplexity=100 (clusters are merged, global structure is distorted).]
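A minimal t-SNE sketch on synthetic blobs (a stand-in for real embeddings); perplexity roughly sets the effective number of neighbors each point attends to, and the figure above shows how strongly it shapes the result:

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# perplexity must be smaller than the number of samples;
# ~30 is a common default for datasets of a few hundred points
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X)
print(emb.shape)
```

In practice you would repeat this for several perplexity values and compare the resulting maps, as in the three-panel figure above.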

4.3 Evaluating Embedding Quality

Since these are visualization tools, visual inspection is the primary evaluation method. However, quantitative metrics can help assess how well the embedding preserves the original data's structure.

[Conceptual diagram of Local vs. Global preservation. Shows a high-D space with points A, B, C, D. A and B are close neighbors. C and D are far apart. A good embedding preserves the A-B neighborhood (local) and the C-D separation (global).]

Because these algorithms often have a random component, it's good practice to assess their stability by running them multiple times with different random seeds and evaluating the average quality and variance of the results.

[Bar chart of Trustworthiness and Continuity scores for different algorithms (e.g., UMAP, t-SNE, Isomap) on a test dataset. Error bars indicate the standard deviation over 10 random seeds.]
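scikit-learn ships a trustworthiness metric that quantifies local-neighborhood preservation; here is a sketch scoring a simple PCA embedding of synthetic high-dimensional blobs (a score of 1.0 means every low-dimensional neighbor was also a high-dimensional neighbor):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

emb = PCA(n_components=2).fit_transform(X)

# Fraction of each point's low-D neighbors that were also high-D neighbors
score = trustworthiness(X, emb, n_neighbors=10)
print(round(float(score), 3))
```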

5. Clustering Algorithms

Clustering algorithms seek to partition data into groups, or "clusters," where samples within a group are more similar to each other than to those in other groups. The choice of algorithm is critical as it imposes a certain structure on the data. No single algorithm is best for all datasets.

5.1 A Map of Clustering Algorithms

We can broadly categorize clustering methods based on their underlying principles:

  1. Prototype-based (e.g., k-Means): each cluster is summarized by a centroid.
  2. Hierarchical (e.g., agglomerative clustering): clusters are built by successive merges or splits.
  3. Density-based (e.g., DBSCAN, HDBSCAN): clusters are dense regions separated by sparse ones.
  4. Graph-based (e.g., spectral clustering): clusters follow the connectivity of a similarity graph.
  5. Probabilistic (e.g., Gaussian Mixture Models): clusters are components of a generative mixture model.

5.2 Prototype-Based Clustering

k-Means

The goal of k-Means is to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid). It assumes clusters are convex and spherical, which is a major limitation in scientific data.

[Two-panel comparison on a 'two moons' dataset. Left: k-Means fails, drawing a linear boundary and incorrectly splitting the two moons. Right: DBSCAN succeeds, correctly identifying the two non-spherical clusters.]

5.3 Hierarchical Clustering

Agglomerative Clustering

This method starts with each point as its own cluster and progressively merges the closest pair of clusters until only one cluster (or a specified number) remains. The result is a dendrogram, a tree diagram that visualizes the merge hierarchy.

Interpreting a Dendrogram: The y-axis of the dendrogram represents the distance or dissimilarity between clusters. To obtain a specific number of clusters, you "cut" the dendrogram horizontally. The number of vertical lines your horizontal cut intersects is the number of clusters you get.

[Example of a dendrogram. A horizontal dashed line is shown cutting across the tree, intersecting three vertical lines, indicating a choice of 3 clusters.]
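The "cut" in the figure corresponds to SciPy's fcluster: building the merge hierarchy with linkage and then extracting a fixed number of clusters (synthetic blobs stand in for real data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

Z = linkage(X, method="ward")   # the merge hierarchy behind the dendrogram

# "Cutting" the tree into 3 clusters: equivalent to a horizontal cut
# that crosses exactly 3 vertical branches
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))   # → [1 2 3]
```

The same Z matrix can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree itself.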

5.4 Density-Based Clustering

These methods are often a superior choice for complex scientific data as they make fewer assumptions about cluster shape.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups together points that are closely packed, marking as outliers points that lie alone in low-density regions. It requires two key parameters:

  1. eps: the radius of the neighborhood considered around each point.
  2. min_samples: the minimum number of points within eps for a point to count as a "core" point of a dense region.

Parameter Selection: Choosing eps is critical. A common heuristic is to use a k-distance plot. For a chosen min_samples (e.g., 5), calculate the distance to the 5th nearest neighbor for every point. Sort these distances and plot them. The "elbow" of this curve is a good estimate for an optimal eps value.

[A k-distance plot. The y-axis shows the sorted distance to the k-th neighbor. The plot is initially flat, then rises sharply. The "elbow" is the point where this sharp increase begins.]
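The k-distance curve behind that plot can be computed with NearestNeighbors; a sketch on synthetic blobs (the 95th-percentile shortcut at the end is a crude stand-in for reading the elbow by eye):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

k = 5  # matches the chosen min_samples
# +1 neighbor because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th true neighbor for every point, sorted ascending:
# this is exactly the curve shown in the k-distance plot
k_dist = np.sort(distances[:, -1])

eps_candidate = np.quantile(k_dist, 0.95)
print(round(float(eps_candidate), 3))
```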

HDBSCAN and OPTICS

HDBSCAN improves upon DBSCAN by converting it into a hierarchical clustering algorithm, effectively allowing for variable-density clusters and eliminating the need to choose eps. OPTICS is another variant that can also find clusters in data of varying density. For most applications, HDBSCAN is a powerful and easy-to-use starting point.

5.5 Graph-Based Clustering

Spectral Clustering

This technique uses the connectivity information of the data to perform dimensionality reduction before clustering in a lower-dimensional space. It constructs a similarity graph and uses the eigenvectors of the graph Laplacian to create embeddings. K-Means is then applied to these embeddings. It is excellent for non-convex clusters like the "two moons" example but can be computationally expensive for large datasets.

# Pseudocode for key clustering algorithms
# X is assumed to be the preprocessed (or dimension-reduced) data matrix
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering

# k-Means
kmeans = KMeans(n_clusters=3, random_state=0)
labels_kmeans = kmeans.fit_predict(X)

# Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_agg = agg.fit_predict(X)

# DBSCAN
# Find eps from k-distance plot first
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_dbscan = dbscan.fit_predict(X)

# Spectral Clustering
spectral = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=0)
labels_spectral = spectral.fit_predict(X)

6. Probabilistic Clustering: Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a powerful probabilistic algorithm that assumes the data is generated from a mixture of a finite number of Gaussian distributions (i.e., "elliptical clouds") with unknown parameters. It's a generalization of k-Means that offers more flexibility.

6.1 Core Concepts

[Two-panel comparison. Left: k-Means fails to correctly cluster two overlapping, elliptical clouds of data. Right: GMM successfully models the two elliptical clusters, capturing their shape and orientation.]

6.2 Model Selection: How many clusters?

The optimal number of Gaussian components (clusters) is often chosen by fitting GMMs with a range of components and selecting the one that minimizes an information criterion. These criteria balance model fit with model complexity.

[A plot of AIC and BIC scores vs. the number of Gaussian components. Both curves decrease and then increase, and the minimum point (e.g., at k=3) suggests the optimal number of clusters.]
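The sweep behind that plot is a short loop; a sketch on synthetic blobs (the range of 1 to 6 components is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Fit a GMM for each candidate number of components and record its BIC
bics = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

best_k = int(np.argmin(bics)) + 1
print(best_k)
```

The same loop with gmm.aic(X) gives the AIC curve; when the two criteria disagree, BIC tends to favor the more parsimonious model.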

7. Cluster Validation & Stability

In unsupervised learning, there is no "ground truth" label to measure against. Therefore, we rely on internal validation metrics to assess the quality of the clustering results. A good clustering result is one where the clusters are dense (points are close to each other) and well-separated (points are far from other clusters).

7.1 Key Validation Metrics

No single metric is perfect; it's best to use a combination to get a complete picture.

Metric                  | Core Question                                                                  | Interpretation
Silhouette Score        | "How much better is a point's own cluster than the next-best alternative?"     | Ranges from -1 to 1. Scores > 0.5 are good; close to 0 means overlapping clusters.
Davies-Bouldin Index    | "What is the average similarity between each cluster and its most similar one?" | Lower scores are better (closer to 0), indicating dense and well-separated clusters.
Calinski-Harabasz Index | "What is the ratio of between-cluster dispersion to within-cluster dispersion?" | Higher scores are better. Also known as the Variance Ratio Criterion.
[A diagram explaining the Silhouette Score. It shows point 'i', the average distance to other points in its own cluster (a), and the average distance to points in the nearest neighboring cluster (b). The score is (b-a)/max(a,b).]
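All three metrics are one-liners in scikit-learn; a sketch on synthetic, well-separated blobs (real data will rarely score this cleanly):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(round(silhouette_score(X, labels), 3))         # > 0.5 suggests good structure
print(round(davies_bouldin_score(X, labels), 3))     # lower is better
print(round(calinski_harabasz_score(X, labels), 1))  # higher is better
```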

7.2 The Importance of Stability Analysis

Perhaps more important than any single metric is the stability of the clustering solution. If the clusters are real, they should be found consistently even if the data is slightly perturbed.

Bootstrapping for Stability:

  1. Create multiple "bootstrap" samples of your data by sampling with replacement.
  2. Run your clustering algorithm on each bootstrap sample.
  3. Compare the clustering results between the original data and each bootstrap sample using a metric like the Adjusted Rand Index (ARI). ARI measures the similarity between two clusterings, where 1 is a perfect match.

A solution that is consistently found across many bootstrap samples (e.g., average ARI > 0.85) is highly likely to be a meaningful and robust feature of the data.
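The three bootstrap steps above can be sketched as follows; k-Means on synthetic blobs stands in for whatever algorithm and data you are validating, and 20 resamples is an arbitrary illustrative count:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
ref_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
aris = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
    boot_labels = KMeans(n_clusters=3, n_init=10,
                         random_state=0).fit_predict(X[idx])
    # Compare the bootstrap clustering against the reference labels
    # of the very same resampled points (ARI is permutation-invariant)
    aris.append(adjusted_rand_score(ref_labels[idx], boot_labels))

print(round(float(np.mean(aris)), 3))   # near 1.0 ⇒ highly stable clusters
```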

8. The Full Unsupervised Workflow: A Decision Guide

The combination of these techniques forms a powerful, standard workflow for scientific data exploration. It's not a rigid pipeline but a series of decisions guided by your data and your scientific goals.

[A visual decision tree for the unsupervised workflow:
Start → "What is my goal?"
  - "Visualize/Explore" → Manifold Learning (UMAP, t-SNE)
  - "Group/Segment" → Clustering
  - "Compress/Denoise" → PCA / Robust PCA
From "Group/Segment" → "What is the data geometry?"
  - "Spherical/Convex" → k-Means
  - "Non-spherical/Complex" → Density-based (HDBSCAN) or Graph-based (Spectral)
  - "Elliptical/Probabilistic" → GMM
All paths lead to "Validate & Interpret", which loops back to the start ("Iterate").]

The Iterative Cycle of Discovery

This workflow is rarely linear. The final step, "Interpretation," is the most critical. If the discovered clusters don't make scientific sense or are not stable, you must loop back. This might involve:

  1. Revisiting preprocessing (different scaling, baseline correction, or feature selection).
  2. Trying a different dimensionality-reduction method or a different number of components.
  3. Switching clustering algorithms or tuning their hyperparameters.
  4. Collecting more data, or designing a new experiment to test the emerging hypothesis.

This iterative process of modeling, validating, and interpreting is the core engine of data-driven scientific discovery.

9. Lab: Automatic Grouping of CV Curves

Goal: We have a large, unlabeled dataset of Cyclic Voltammograms. Can we automatically group them by their underlying electrochemical mechanism without any prior knowledge?

10. Practical Tips & Pitfalls

11. Key Takeaways