SMILES workflow with atomic properties

Overview

This is a modified version of the workflow used in the ROBERT manuscript for Vaska's catalysts, with a smaller database size.

Reproducibility warning

Warning

Update to AQME v1.6.0 or higher to obtain fully reproducible results! You can upgrade with pip install aqme --upgrade. With older versions, it may not be possible to reproduce the results exactly because of subtle differences in the generated xTB descriptors (changes of 0.1% in most cases).
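Before running the workflow, the installed AQME version can be checked from Python. The snippet below is a minimal sketch, not part of ROBERT or AQME: the helper names parse_version and aqme_is_recent_enough are hypothetical, and it assumes the package is registered under the distribution name "aqme" (the name used by pip install aqme).

```python
import re
from importlib import metadata

MINIMUM_AQME = (1, 6, 0)  # minimum version named in the warning above


def parse_version(text):
    """Turn a string such as '1.6.1' into a tuple such as (1, 6, 1)."""
    parts = []
    for chunk in text.split(".")[:3]:
        match = re.match(r"\d+", chunk)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad short versions such as '1.6'
    return tuple(parts)


def aqme_is_recent_enough(version_text, minimum=MINIMUM_AQME):
    """Return True when version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("aqme")  # assumed distribution name
    except metadata.PackageNotFoundError:
        print("AQME is not installed in this environment")
    else:
        status = "OK" if aqme_is_recent_enough(installed) else "too old, upgrade"
        print(f"AQME {installed}: {status}")
```

Comparing tuples of integers rather than raw strings avoids ranking a version such as 1.10.0 below 1.6.0.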

Required inputs

vaska_short.csv: CSV file with the SMILES strings from which the descriptors are generated; the resulting data are used as the training and validation sets. The full CSV file can be downloaded here:

code_name	smiles	barrier	charge	mult	complex_type	geom
ir_tbp_1_dft-pet3_1_dft-sime_1_dft-co_1_dft-icn_1_smi1_1_s_1	[Ir]([P+](CC)(CC)CC)([C-]1N(C)CCN(C)1)(C#[O+])(N=C=O)	14.8	0	1	squareplanar	["Ir_squareplanar"]
ir_tbp_1_dft-nme3_1_dft-sime_1_dft-hicn_1_dft-itcn_1_smi1_1_s_1	[Ir]([N+](C)(C)C)([C-]1N(C)CCN(C)1)(C#[N+][H])(N=C=S)	12.4	0	1	squareplanar	["Ir_squareplanar"]
...
ir_tbp_1_dft-ime_1_dft-nme3_1_dft-hicn_1_dft-icn_1_smi1_1_s_1	[Ir]([C-]1N(C)C=CN(C)1)([N+](C)(C)C)(C#[N+][H])(N=C=O)	18.1	0	1	squareplanar	["Ir_squareplanar"]
ir_tbp_1_dft-pyz_1_dft-pyz_1_dft-co_1_dft-icn_1_smi1_1_s_1	[Ir]([n+]1ccncc1)([n+]1ccncc1)(C#[O+])(N=C=O)	22	0	1	squareplanar	["Ir_squareplanar"]

The CSV database contains parameters used in the AQME workflow:

code_name: compound names.
smiles: SMILES strings of the Ir complexes.
barrier: activation energy barriers.
charge: charge of the Ir complexes.
mult: multiplicity of the Ir complexes.
complex_type: defines squareplanar geometry during conformer generation.
geom: filters out conformers based on ligand positions of Ir squareplanar complexes.
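A quick check of the database before launching the job can catch missing columns or non-numeric entries early. The following is a rough sketch using only the standard library. The helpers build_csv and check_database are hypothetical names, the two sample rows are copied from the table above, and the file is assumed to be comma-separated.

```python
import csv
import io

REQUIRED_COLUMNS = ["code_name", "smiles", "barrier", "charge", "mult",
                    "complex_type", "geom"]


def build_csv(header, rows):
    """Write header and rows into an in-memory CSV with correct quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


def check_database(handle):
    """Return a list of problems found in the CSV; empty if none are found."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {name}"
                for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):
        try:
            float(row["barrier"])  # activation barrier must be numeric
            int(row["charge"])     # charge and multiplicity must be integers
            int(row["mult"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric barrier, charge or mult")
    return problems


SAMPLE_ROWS = [
    ["ir_tbp_1_dft-pet3_1_dft-sime_1_dft-co_1_dft-icn_1_smi1_1_s_1",
     "[Ir]([P+](CC)(CC)CC)([C-]1N(C)CCN(C)1)(C#[O+])(N=C=O)",
     "14.8", "0", "1", "squareplanar", '["Ir_squareplanar"]'],
    ["ir_tbp_1_dft-pyz_1_dft-pyz_1_dft-co_1_dft-icn_1_smi1_1_s_1",
     "[Ir]([n+]1ccncc1)([n+]1ccncc1)(C#[O+])(N=C=O)",
     "22", "0", "1", "squareplanar", '["Ir_squareplanar"]'],
]

print(check_database(build_csv(REQUIRED_COLUMNS, SAMPLE_ROWS)))  # prints []
```

For the real file, the same function can be called on a handle opened with open("vaska_short.csv", newline="").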

Required packages

Openbabel: Install Openbabel with conda-forge:

conda install -y -c conda-forge openbabel=3.1.1

AQME: Install (or update) AQME with pip (or follow the instructions from their Read the Docs page):

pip install aqme

xTB: Install xTB with conda-forge (or follow the instructions from their documentation):

conda install -y -c conda-forge xtb

Warning

Due to an update in the libgfortran library, xTB and CREST may encounter issues during optimizations. If you plan to use them, please make sure to run the following command after installing them:

conda install conda-forge::libgfortran=14.2.0

Executing the job

Instructions:

Install the programs specified in Required packages.
Download the vaska_short.csv file specified in Required inputs.
Go to the folder containing the CSV file in your terminal (using the "cd" command, e.g. cd C:/Users/test_robert).
Activate the conda environment where ROBERT was installed (conda activate robert).
Run the following command line:

python -m robert --aqme --y "barrier" --csv_name "vaska_short.csv" --qdescp_keywords "--qdescp_atoms ['Ir']"

Options used:

--aqme: Calls the AQME module to convert SMILES into RDKit and xTB descriptors, producing a new CSV database.
--y barrier: Name of the column containing the response y values.
--csv_name vaska_short.csv: CSV with the SMILES strings.
--qdescp_keywords "--qdescp_atoms ['Ir']": activates the generation of atomic descriptors with xTB using SMARTS patterns.

Note

In this example, the SMARTS pattern used is 'Ir', which specifies Ir atoms. The program allows the use of multiple SMARTS patterns simultaneously, using commas as separators, and it accepts atoms, bonds, and other structural motifs. For example:

Atomic descriptors at Zn and Ir centers: "--qdescp_atoms ['Zn','Ir']"
At the two C atoms of a triple bond: "--qdescp_atoms ['C#C']"
At the C and Zn atoms from a C-Zn bond: "--qdescp_atoms ['[C][Zn]']"
At the C and Zn atoms from a C-Zn bond, and at the two C atoms of a triple bond: "--qdescp_atoms ['[C][Zn]','C#C']"
At a Zn atom and at the two C atoms of a triple bond: "--qdescp_atoms ['Zn','C#C']"

For more information about SMARTS patterns, follow this link.
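When several SMARTS patterns are needed, the value passed through --qdescp_keywords can be assembled programmatically to avoid quoting mistakes. The helper below is only an illustrative sketch: qdescp_atoms_keyword is a hypothetical name, and the output format simply mirrors the examples listed above.

```python
def qdescp_atoms_keyword(patterns):
    """Format SMARTS patterns as the --qdescp_atoms value shown in the examples."""
    if not patterns:
        raise ValueError("at least one SMARTS pattern is required")
    for pattern in patterns:
        # Quotes inside a pattern would break the surrounding quoting.
        if "'" in pattern or '"' in pattern:
            raise ValueError(f"quotes are not supported in this sketch: {pattern!r}")
    quoted = ",".join(f"'{pattern}'" for pattern in patterns)
    return f"--qdescp_atoms [{quoted}]"


print(qdescp_atoms_keyword(["Zn", "Ir"]))        # --qdescp_atoms ['Zn','Ir']
print(qdescp_atoms_keyword(["[C][Zn]", "C#C"]))  # --qdescp_atoms ['[C][Zn]','C#C']
```

The returned string is what goes inside the double quotes after --qdescp_keywords on the command line.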

Warning

When --qdescp_keywords "--qdescp_atoms ['Ir']" is used, any molecule in the database that does not contain Ir atoms is excluded from the workflow.
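To anticipate which entries would be dropped, the SMILES column can be pre-screened before running the job. The sketch below uses a plain text search for a bracketed Ir atom, which is a crude heuristic and not a real SMARTS match (the actual matching is done by AQME). The function name names_without_iridium and the ethanol entry are hypothetical.

```python
import re

# Matches a bracket atom such as [Ir], [Ir+] or [193Ir]; text heuristic only.
IRIDIUM_ATOM = re.compile(r"\[\d*Ir(?![a-z])")


def names_without_iridium(records):
    """Return the names of (name, smiles) pairs whose SMILES has no Ir atom."""
    return [name for name, smiles in records if not IRIDIUM_ATOM.search(smiles)]


records = [
    ("ir_tbp_1_dft-pyz_1_dft-pyz_1_dft-co_1_dft-icn_1_smi1_1_s_1",
     "[Ir]([n+]1ccncc1)([n+]1ccncc1)(C#[O+])(N=C=O)"),
    ("hypothetical_ethanol", "CCO"),  # made-up entry without Ir
]

print(names_without_iridium(records))  # ['hypothetical_ethanol']
```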

By default, the workflow sets:

--ignore "[code_name]" (variables ignored in the model)
--discard "[smiles,charge,mult,complex_type,geom]" (variables discarded after descriptor generation)
--names code_name (name of the column containing the names of the datapoints)
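For scripted runs, the same command can be built as an argument list with the default options written out explicitly. This is a sketch under two assumptions: that passing the defaults explicitly is equivalent to omitting them, and that the job is launched with the Python interpreter of the environment where ROBERT is installed. The function name build_robert_argv is hypothetical.

```python
import sys


def build_robert_argv(csv_name="vaska_short.csv", y="barrier", atoms="['Ir']"):
    """Return the ROBERT command from this example as a list of arguments."""
    return [
        sys.executable, "-m", "robert",
        "--aqme",
        "--y", y,
        "--csv_name", csv_name,
        "--qdescp_keywords", f"--qdescp_atoms {atoms}",
        # Defaults listed above, written out explicitly:
        "--ignore", "[code_name]",
        "--discard", "[smiles,charge,mult,complex_type,geom]",
        "--names", "code_name",
    ]


argv = build_robert_argv()
print("\n".join(argv[1:]))

# To launch the job from Python instead of the terminal (not executed here):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list to subprocess.run avoids shell quoting altogether, which is convenient for the nested quotes in the --qdescp_keywords value.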

Execution time and versions

Time: ~3 min

System: 4 processors (Intel Xeon Ice Lake 8352Y) using 8.0 GB of RAM

ROBERT version: 1.2.0

scikit-learn-intelex version: 2024.5.0

AQME version: 1.6.1

xTB version: 6.6.1

Results

Initial AQME workflow

The workflow starts with a CSEARCH-RDKit conformer sampling (using RDKit by default, although CREST is also available if --csearch_keywords "--program crest" is added).
Then, QDESCP is used to generate more than 200 RDKit and xTB Boltzmann-averaged molecular descriptors (using xTB geometry optimizations and different single-point calculations).

A CSV file called AQME-ROBERT_vaska_short.csv should be created in the folder where ROBERT was executed. The CSV file can be downloaded here:
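Once the file exists, a short script can list which columns were added by the descriptor generation. The sketch below makes no assumption about the actual descriptor names: it only filters out the columns of the input database. The helpers read_header and descriptor_columns are hypothetical names.

```python
import csv
from pathlib import Path

INPUT_COLUMNS = {"code_name", "smiles", "barrier", "charge", "mult",
                 "complex_type", "geom"}


def read_header(csv_path):
    """Return the column names from the first line of a CSV file."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle), [])


def descriptor_columns(header):
    """Return the columns that were not part of the input database."""
    return [name for name in header if name not in INPUT_COLUMNS]


path = Path("AQME-ROBERT_vaska_short.csv")
if path.exists():  # only meaningful after the AQME step has finished
    new_columns = descriptor_columns(read_header(path))
    print(f"{len(new_columns)} descriptor columns found")
```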

Following ROBERT workflow

A PDF file called ROBERT_report.pdf should be created in the folder where ROBERT was executed. The PDF file can be visualized here:

The PDF report contains all the results of the workflow. In this case, a Random Forest (RF) model with 60% training size and a Neural Network (NN) with 70% training size were the best models found among the combinations of:

Four different models (Gradient Boosting GB, MultiVariate Linear MVL, Neural Network NN, Random Forest RF)

Two different partition sizes (60%, 70%)
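The size of the search space follows directly from these two lists: four model types times two partition sizes gives eight candidate combinations. The snippet below only enumerates them for illustration; it does not reproduce how ROBERT evaluates or selects models.

```python
from itertools import product

MODELS = ["GB", "MVL", "NN", "RF"]  # model types listed above
TRAIN_SIZES = [60, 70]              # training partition sizes in percent

candidates = list(product(MODELS, TRAIN_SIZES))
print(len(candidates))  # 8

for model, size in candidates:
    print(f"{model} with {size}% training size")
```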

The first part of the PDF file is shown below as a preview: