Dimensionality Reduction

High-dimensional molecular embeddings are rich in information but difficult to visualize and interpret directly. The Dimensionality Reduction module in ChemXploreML provides several algorithms that project these embeddings into a lower-dimensional space (typically 2D or 3D), so that you can visualize the chemical space, identify clusters, and examine the relationships between molecules.

Overview

[Figure: Dimensionality Reduction Overview]

Workflow

The process of applying dimensionality reduction involves the following steps:

Select Embedder: Choose the molecular embedding that you want to reduce. These are the embeddings you generated in the "Embed Molecule" section.
Choose a Method: Select a dimensionality reduction algorithm from the available tabs. Each method has its own strengths and is suited for different types of analysis.
Configure Parameters: Adjust the parameters for the selected algorithm. You can hover over each parameter to get a description of what it does.
Save Parameters: Before running the reduction, you must save your parameter configuration. This creates a reusable settings file and ensures your workflow is reproducible.
Run Reduction: Click the "Run" button to start the dimensionality reduction process. The application will save the resulting low-dimensional embeddings as a new .npy file.
Visualize: The reduced embeddings can then be used for visualization and further analysis.
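
The steps above can be sketched outside the application as follows. This is a minimal illustration, not ChemXploreML's own code: the file names, the JSON settings format, the `run_reduction` helper, and the use of scikit-learn's PCA are all assumptions made for the example, and random numbers stand in for real molecular embeddings.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA


def run_reduction(embedding_path, params_path, output_path):
    """Load embeddings and saved parameters, reduce, and write a new .npy file."""
    embeddings = np.load(embedding_path)
    params = json.loads(Path(params_path).read_text())
    reduced = PCA(**params).fit_transform(embeddings)
    np.save(output_path, reduced)
    return reduced


# Hypothetical stand-in for saved embeddings: 100 molecules x 300 dimensions.
workdir = Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)
np.save(workdir / "embeddings.npy", rng.normal(size=(100, 300)))

# "Save Parameters": persist the configuration so the run can be repeated.
params = {"n_components": 2, "random_state": 42}
(workdir / "pca_params.json").write_text(json.dumps(params, indent=2))

# "Run Reduction": the result is written as a new .npy file.
reduced = run_reduction(
    workdir / "embeddings.npy",
    workdir / "pca_params.json",
    workdir / "embeddings_pca.npy",
)
print(reduced.shape)
```

Keeping the parameters in a separate file means the same reduction can be rerun later on the same embeddings with identical settings.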

Supported Methods

Below is a detailed description of each supported dimensionality reduction method and its configurable parameters.

PCA (Principal Component Analysis)

Description: A linear technique that transforms the data into a new coordinate system whose axes (principal components) are orthogonal and ordered by the amount of variance they capture.
Local Structure: ⚠️ Moderate
Global Structure: ✅ Excellent
Scalability: ✅ Fast

Parameters:

n_components: The number of principal components to keep. Default: 70.
random_state: Seed for the random number generator. Default: 42.
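
As a rough illustration of these parameters, the sketch below runs PCA with the defaults listed above. It assumes scikit-learn's `PCA` as the implementation (the application's backend is not stated here) and uses random data in place of real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for embeddings: 200 molecules x 300 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))

pca = PCA(n_components=70, random_state=42)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance retained by the 70 kept components.
retained = float(pca.explained_variance_ratio_.sum())
print(X_reduced.shape, round(retained, 3))
```

Checking the retained variance is a quick way to judge whether `n_components` is large enough for downstream modeling; for plotting, 2 or 3 components are the usual choice.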

UMAP (Uniform Manifold Approximation and Projection)

Description: A non-linear technique that preserves local structure well and retains some global structure. It is often used to visualize clusters in high-dimensional data.
Local Structure: ✅ Excellent
Global Structure: ⚠️ Moderate
Scalability: ✅ Fast

Parameters:

n_neighbors: The size of the local neighborhood used to approximate the manifold. Smaller values emphasize local structure; larger values emphasize global structure. Default: 15.
min_dist: The minimum distance between embedded points. Default: 0.1.
n_components: The dimension of the space to embed into. Default: 2.
metric: The distance metric to use. Default: 'euclidean'.
random_state: Seed for the random number generator. Default: 42.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Description: A non-linear technique that is particularly good at revealing the underlying cluster structure in the data. It excels at preserving local structure.
Local Structure: ✅ Excellent
Global Structure: ❌ Poor
Scalability: ❌ Slow

Parameters:

n_components: The dimension of the space to embed into. Default: 2.
perplexity: Related to the number of nearest neighbors considered for each point. Default: 30.
random_state: Seed for the random number generator. Default: 42.
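
The sketch below applies t-SNE with the defaults listed above. It assumes scikit-learn's `TSNE` as the implementation and uses two synthetic clusters in place of real embeddings. In scikit-learn, `perplexity` must be smaller than the number of samples, so very small datasets need a lower value than 30.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic clusters of 60 points each in 50 dimensions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, size=(60, 50))
cluster_b = rng.normal(loc=5.0, size=(60, 50))
X = np.vstack([cluster_a, cluster_b])

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
Y = tsne.fit_transform(X)
print(Y.shape)
```

Distances between well-separated clusters in a t-SNE plot should not be read quantitatively, which is consistent with the "Poor" global-structure rating above.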

KernelPCA

Description: An extension of PCA that uses kernel methods to perform non-linear dimensionality reduction.
Local Structure: ✅ Good
Global Structure: ✅ Good
Scalability: ⚠️ Slower

Parameters:

n_components: Number of components to keep. Default: 2.
kernel: The kernel function to use ('linear', 'poly', 'rbf', 'sigmoid', 'cosine'). Default: 'rbf'.
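gamma: Kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels; it has no effect on the 'linear' and 'cosine' kernels. Default: null.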
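
A minimal sketch of these settings, assuming scikit-learn's `KernelPCA` as the implementation and random data in place of real embeddings. In scikit-learn, a `gamma` of `None` (the `null` default above) is replaced by `1 / n_features`.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Random stand-in for embeddings: 150 molecules x 50 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 50))

# gamma=None lets scikit-learn fall back to 1 / n_features (here 1 / 50).
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=None)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```

Because the kernel matrix has one row and one column per sample, memory and run time grow quickly with dataset size, which matches the "Slower" scalability rating.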

PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding)

Description: A visualization method that captures both local and global nonlinear structure using a heat-diffusion-based affinity metric. It is particularly well-suited for visualizing trajectories and progressions in data.
Local Structure: ✅ Good
Global Structure: ✅ Excellent
Scalability: ⚠️ Medium

Parameters:

n_components: Number of dimensions to reduce to. Default: 2.
knn: Number of nearest neighbors. Default: 5.
decay: Decay rate of the kernel tails; larger values make the affinity between points fall off more sharply with distance. Default: 40.
t: Diffusion time scale. Default: 'auto'.
random_state: Seed for reproducibility. Default: 42.

ISOMAP

Description: A non-linear technique that preserves the geodesic distances between points on a manifold.
Local Structure: ✅ Excellent
Global Structure: ✅ Good
Scalability: ⚠️ Medium

Parameters:

n_components: Number of coordinates for the manifold. Default: 2.
n_neighbors: Number of neighbors to consider for each point. Default: 5.
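
A minimal sketch with the defaults listed above, assuming scikit-learn's `Isomap` as the implementation. A synthetic S-shaped surface stands in for real embeddings because it is a standard case where geodesic distances differ from straight-line distances.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# 500 points sampled from a 2D S-shaped surface embedded in 3D.
X, _ = make_s_curve(n_samples=500, random_state=42)

iso = Isomap(n_components=2, n_neighbors=5)
X_iso = iso.fit_transform(X)
print(X_iso.shape)
```

If `n_neighbors` is too small the neighborhood graph can break into disconnected pieces; if it is too large the graph can shortcut across folds of the manifold and distort the geodesic distances.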

Laplacian Eigenmaps

Description: A spectral method that preserves local manifold information by constructing a graph from the data and embedding it in a lower-dimensional space.
Local Structure: ✅ Excellent
Global Structure: ❌ Poor
Scalability: ✅ Fast

Parameters:

n_components: Dimension of the embedding space. Default: 2.
n_neighbors: Number of neighbors for constructing the neighborhood graph. Default: 10.
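
A minimal sketch with the defaults listed above. It assumes scikit-learn's `SpectralEmbedding`, which implements Laplacian Eigenmaps, with a nearest-neighbor affinity graph; the synthetic data and the `random_state` value are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import SpectralEmbedding

# 300 points sampled from a 2D S-shaped surface embedded in 3D.
X, _ = make_s_curve(n_samples=300, random_state=42)

# Build a 10-nearest-neighbor graph and embed it with the graph Laplacian.
embedder = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=42,
)
X_le = embedder.fit_transform(X)
print(X_le.shape)
```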

TriMap

Description: A non-linear method that uses triplet constraints to preserve both local and global structure in the data.
Local Structure: ✅ Excellent
Global Structure: ✅ Good
Scalability: ✅ Fast

Parameters:

n_dims: Output dimensionality. Default: 2.
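n_inliers: Number of nearest neighbors (inliers) used to form the nearest-neighbor triplets for each point. Default: 10.
n_outliers: Number of outliers used to form the nearest-neighbor triplets for each point. Default: 5.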
n_random: Number of random triplets per point. Default: 5.
distance: The distance metric to use. Default: 'euclidean'.

Factor Analysis

Description: A linear statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
Local Structure: ⚠️ Limited
Global Structure: ✅ Good
Scalability: ✅ Fast

Parameters:

n_components: Number of latent factors to extract. Default: 2.
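
A minimal sketch, assuming scikit-learn's `FactorAnalysis` as the implementation. The data are generated from two hidden factors plus noise, which is the model that factor analysis assumes; the `random_state` value is a choice made for this example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# 300 samples of 40 observed variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 40))
X = latent @ loadings + 0.1 * rng.normal(size=(300, 40))

fa = FactorAnalysis(n_components=2, random_state=42)
factors = fa.fit_transform(X)

print(factors.shape)             # estimated factor scores per sample
print(fa.components_.shape)      # estimated loadings per factor
print(fa.noise_variance_.shape)  # estimated noise variance per observed variable
```

Unlike PCA, the model estimates a separate noise variance for each observed variable, so variables with very different noise levels are handled explicitly.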

Next Steps

After reducing the dimensionality of your embeddings, you can:

Use the 2D or 3D embeddings to create scatter plots and visualize your chemical space.
Use the reduced-dimension embeddings as features for training a machine learning model.
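
As an illustration of the second option, the sketch below trains a regression model on reduced features. Everything in it is an assumption made for the example: the synthetic low-rank data and target, the choice of PCA with 5 components, and the Ridge model from scikit-learn. Here the reduction is fitted inside a pipeline on the training split only, which avoids leaking test data into the projection; with embeddings reduced in the application, you would instead load the saved .npy file as the feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 100-dimensional "embeddings" driven by 5 latent variables,
# and a target property that depends linearly on those latent variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 100)) + 0.01 * rng.normal(size=(300, 100))
y = latent @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Reduce to 5 components, then fit a linear model on the reduced features.
model = make_pipeline(PCA(n_components=5, random_state=42), Ridge(alpha=1.0))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
print(predictions.shape, round(float(r2), 3))
```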
