Load and Analyze Data

The "Load File" section in ChemXploreML is the starting point for most workflows. It allows you to import your molecular data, preprocess it, and perform an initial analysis to understand its characteristics. This feature is divided into two main tabs: Load Data and Analyse Data.

Load Data Tab

This tab is for loading your dataset and configuring how it should be read and processed.

File Loading

1. File Selection

Browse File: Click this to open a file dialog and select your dataset. The application supports common formats like CSV and SDF.
Use Dask: Enable this option if your dataset is too large to fit into memory. Dask then processes the data in chunks instead of loading it all at once.
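To illustrate what processing in chunks means, here is a minimal sketch that uses only the Python standard library. It is not Dask and not ChemXploreML's actual loading code; the helper `iter_chunks`, the sample data, and the column names `smiles` and `solubility` are assumptions made for the example. It computes the mean of a target column while holding only a few rows in memory at a time.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` rows (as dicts) from a CSV file handle."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory data standing in for a large file on disk.
SAMPLE_CSV = "smiles,solubility\nCCO,1.0\nCCC,2.0\nc1ccccc1,3.0\nCO,4.0\nCC,5.0\n"

total = 0.0
count = 0
chunk_sizes = []
for chunk in iter_chunks(io.StringIO(SAMPLE_CSV), chunk_size=2):
    # Only the current chunk is held in memory; running totals carry the result.
    chunk_sizes.append(len(chunk))
    total += sum(float(row["solubility"]) for row in chunk)
    count += len(chunk)

mean_solubility = total / count
```

Dask applies the same idea at a larger scale by splitting a dataset into partitions and combining the partial results.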

2. Column Configuration

Once a file is loaded, you need to specify the key columns:

Column X: Select the column containing the molecular representations, typically SMILES strings.
Column Y: Select the column containing the target property you want to predict (e.g., bioactivity, solubility).
Index Column: Specify the column that uniquely identifies each molecule. If your dataset has no such column, create one using the "Make INDEX and save file" option described in step 3.

3. Indexing and Saving

Make INDEX and save file: If your dataset lacks a unique index column, you can create one here. This will save a new version of your file with an added index column.
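The effect of adding an index column can be sketched as follows. This is an illustration under assumptions, not the application's implementation: the function name `add_index_column`, the zero-based numbering, and the default column name `INDEX` are chosen for the example only.

```python
import csv
import io


def add_index_column(src, dst, index_name="INDEX"):
    """Copy CSV rows from `src` to `dst`, prepending a running integer index.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    original_fields = list(reader.fieldnames or [])
    if index_name in original_fields:
        raise ValueError(f"column {index_name!r} already exists")

    writer = csv.DictWriter(
        dst, fieldnames=[index_name] + original_fields, lineterminator="\n"
    )
    writer.writeheader()

    count = 0
    for i, row in enumerate(reader):
        # Zero-based numbering is an assumption; any unique value would do.
        writer.writerow({index_name: i, **row})
        count += 1
    return count


# Hypothetical input; in practice `src` and `dst` would be files on disk.
source = io.StringIO("smiles,solubility\nCCO,1.0\nCCC,2.0\n")
indexed = io.StringIO()
rows_written = add_index_column(source, indexed)
```

Writing to a separate output keeps the original file untouched, which matches the described behaviour of saving a new version of the file.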

4. State Management

Save State: Saves the current configuration (file path, column selections, etc.) to a JSON file. This is useful for saving your work and ensuring reproducibility.
Load State: Loads a previously saved configuration file.
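A saved state can be thought of as a small JSON document that round-trips the configuration. The sketch below shows the idea. The key names (`file_path`, `column_x`, and so on) and the file name `load_state.json` are hypothetical, since the actual schema written by ChemXploreML is not documented here.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the configuration dictionary to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2, sort_keys=True)


def load_state(path):
    """Read a configuration dictionary previously written by `save_state`."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical configuration; the key names are assumptions for illustration.
state = {
    "file_path": "molecules.csv",
    "column_x": "smiles",
    "column_y": "solubility",
    "index_column": "INDEX",
    "use_dask": False,
}

with tempfile.TemporaryDirectory() as tmp:
    state_file = os.path.join(tmp, "load_state.json")
    save_state(state_file, state)
    restored = load_state(state_file)
```

Because the state is plain JSON, it can be kept under version control next to the dataset to document how a result was produced.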

Analyse Data Tab

After configuring your dataset in the "Load Data" tab, switch to the "Analyse Data" tab to perform a detailed analysis.

Analysis Overview

1. Duplicate Removal

Remove duplicates on X column: This function checks for and removes any duplicate entries based on the SMILES column (Column X). It will create a new, deduplicated file for you to use in subsequent steps.
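Deduplication on a single column can be sketched as below. This is a simplified illustration, not the application's code: it compares SMILES as plain strings and keeps the first occurrence, both of which are assumptions. Two different SMILES strings can describe the same molecule, so a string comparison catches those only if the SMILES are canonicalized first (for example with a cheminformatics toolkit); whether ChemXploreML does so is not stated here.

```python
def drop_duplicates(rows, key):
    """Return rows with duplicate `key` values removed, keeping the first seen."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key].strip()
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


# Hypothetical rows; "OCC" is ethanol written differently from "CCO".
rows = [
    {"smiles": "CCO", "solubility": "1.0"},
    {"smiles": "OCC", "solubility": "1.1"},
    {"smiles": "CCO", "solubility": "1.2"},
    {"smiles": "CCC", "solubility": "2.0"},
]

deduplicated = drop_duplicates(rows, key="smiles")
unique_smiles = [row["smiles"] for row in deduplicated]
```

Here the exact repeat of "CCO" is dropped, while "OCC" survives because it differs as a string.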

2. Data Visualization and Filtering

The application generates a series of plots to help you visualize the distribution of various molecular properties in your dataset.

Analysis Plots

For each plot, you can apply filters to select a subset of your data. For example, you can filter molecules based on the number of atoms, the presence of certain elements, or specific structural features.

Apply Filters: Once you have defined your filters, apply them to create a new, filtered dataset. Use this to exclude molecules that fall outside the chemical space you want your model to cover before training.
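Combining several filters amounts to keeping only the records that pass every condition. The following sketch assumes each molecule already has precomputed properties; the field names `n_atoms` and `elements`, the function `make_filter`, and the sample records are hypothetical and do not reflect the application's internal data model.

```python
def make_filter(min_atoms=None, max_atoms=None,
                required_elements=(), excluded_elements=()):
    """Build a predicate that returns True for records passing all conditions."""
    required = set(required_elements)
    excluded = set(excluded_elements)

    def keep(record):
        n_atoms = record["n_atoms"]
        elements = set(record["elements"])
        if min_atoms is not None and n_atoms < min_atoms:
            return False
        if max_atoms is not None and n_atoms > max_atoms:
            return False
        if not required <= elements:   # every required element must be present
            return False
        if excluded & elements:        # no excluded element may be present
            return False
        return True

    return keep


# Hypothetical records with precomputed heavy-atom counts and element lists.
records = [
    {"smiles": "CCO", "n_atoms": 3, "elements": ["C", "O"]},
    {"smiles": "CCCl", "n_atoms": 3, "elements": ["C", "Cl"]},
    {"smiles": "c1ccccc1O", "n_atoms": 7, "elements": ["C", "O"]},
    {"smiles": "C", "n_atoms": 1, "elements": ["C"]},
]

keep = make_filter(min_atoms=2, max_atoms=5,
                   required_elements=["O"], excluded_elements=["Cl"])
filtered = [record for record in records if keep(record)]
filtered_smiles = [record["smiles"] for record in filtered]
```

Building the filter as a single predicate keeps each condition independent, so individual conditions can be added or removed without touching the others.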

Next Steps

After loading, analyzing, and filtering your data, you are ready to proceed with the core machine learning tasks.
