GENERATE (screening of ML models)

Overview

[Figure: overview of the GENERATE module]

Required inputs

  • Robert_example_CURATE.csv: Curated CSV file with data to use as the training and validation sets. It is created in a previous CURATE job.

    Name  Target_values  x2           x5           x6            x7  x8  x9  x10  x11  Csub-Csub  Csub-H  Csub-O  H-O
    1     1.854766065    110.9270401  89.87553406  49.77253406   1   0   0   0    1    FALSE      TRUE    FALSE   FALSE
    2     2.034511341    110.6553116  78.65235138  55.53135138   1   0   0   0    1    TRUE       FALSE   FALSE   FALSE
    ...
    36    0.321084552    110.7593079  59.81459808  -76.60640192  0   2   1   3    3    FALSE      FALSE   TRUE    FALSE
    37    0.329517076    115.2292938  70.45233154  -65.96866846  0   2   1   3    3    FALSE      FALSE   TRUE    FALSE
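Before launching GENERATE, it can help to confirm that the curated CSV has the expected columns and number of rows. The snippet below is a minimal sketch using only the Python standard library. It assumes the file is comma-delimited with a single header line, and the helper name summarize_curated_csv is hypothetical (it is not part of ROBERT).

```python
import csv
from pathlib import Path


def summarize_curated_csv(csv_path):
    """Return (column names, number of data rows) for a curated CSV."""
    with Path(csv_path).open(newline="") as handle:
        reader = csv.reader(handle)       # assumes a comma-delimited file
        header = next(reader)             # first line holds the column names
        n_rows = sum(1 for row in reader if row)  # count non-empty data rows
    return header, n_rows


if __name__ == "__main__":
    path = Path("CURATE/Robert_example_CURATE.csv")
    if path.exists():
        columns, n_rows = summarize_curated_csv(path)
        print(f"{n_rows} rows, {len(columns)} columns")
        print("Columns:", ", ".join(columns))
    else:
        print(f"{path} not found; run CURATE first.")
```

For the example database, the column list should start with Name and Target_values, followed by the descriptors kept by CURATE.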

Executing the job

Instructions:

  1. In your terminal, go to the folder where CURATE was previously run. It should contain a folder called CURATE.

  2. Run the following command line:

python -m robert --csv_name CURATE/Robert_example_CURATE.csv --generate

Options used:

  • --csv_name CURATE/Robert_example_CURATE.csv: CSV with the curated database.

  • --generate: Use only the GENERATE module.
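If the job is launched from a Python script rather than typed in the terminal, the same command line can be wrapped with subprocess. This is only a sketch: build_generate_command and run_generate are hypothetical helper names, and sys.executable is used in place of python on the assumption that ROBERT is installed in the interpreter running the script. The options passed are exactly the two listed above.

```python
import subprocess
import sys
from pathlib import Path

CSV_NAME = "CURATE/Robert_example_CURATE.csv"


def build_generate_command(csv_name=CSV_NAME):
    """Build the command line from step 2 as an argument list."""
    # sys.executable stands in for "python" (assumption: same environment).
    return [sys.executable, "-m", "robert", "--csv_name", csv_name, "--generate"]


def run_generate(workdir=".", csv_name=CSV_NAME):
    """Run GENERATE from workdir after checking that the curated CSV exists."""
    workdir = Path(workdir)
    if not (workdir / csv_name).is_file():
        raise FileNotFoundError(
            f"{csv_name} not found in {workdir.resolve()}; run CURATE first."
        )
    # check=True raises CalledProcessError if the command exits with an error.
    return subprocess.run(build_generate_command(csv_name), cwd=workdir, check=True)
```

Calling run_generate() from the folder that contains the CURATE folder is then equivalent to typing the command by hand.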

Execution time

Time: ~2 min

System: 4 processors (Intel Xeon Ice Lake 8352Y) using 8.0 GB of RAM

Results

  • Four CSV files for each combination of ML model and training size (in /GENERATE/Raw_data). Half of the CSVs correspond to models that use all the variables (No_PFI folder) and the other half to models that use only the most important features based on permutation feature importance, PFI (PFI folder). For each pair of CSVs, one contains the parameters of the model and the other contains the database already split into training/validation sets (_db suffix).

  • The two best models, for No_PFI (all descriptors) and for PFI (only important descriptors), are stored in /GENERATE/Best_model.

  • Two heatmaps with a summary of the results for all the models, created in /GENERATE/Raw_data.
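The output layout described above can be checked with a short script that pairs each parameters CSV with its database CSV. This is a sketch under stated assumptions: the folder names (GENERATE/Raw_data, No_PFI, PFI) and the _db suffix come from the description above, but the exact file names are not specified here, so the code assumes that each database file is named like its parameters file with _db added before the extension. The function name pair_raw_data is hypothetical.

```python
from pathlib import Path


def pair_raw_data(generate_dir="GENERATE"):
    """Map each folder (No_PFI, PFI) to a list of (parameters CSV, database CSV) pairs."""
    pairs = {}
    for subset in ("No_PFI", "PFI"):
        folder = Path(generate_dir) / "Raw_data" / subset
        found = []
        if folder.is_dir():
            # rglob also looks inside any subfolders.
            for db_csv in sorted(folder.rglob("*_db.csv")):
                # Assumed naming: "<base>_db.csv" goes with "<base>.csv".
                base = db_csv.name[: -len("_db.csv")]
                params_csv = db_csv.with_name(base + ".csv")
                if params_csv.is_file():
                    found.append((params_csv, db_csv))
        pairs[subset] = found
    return pairs


if __name__ == "__main__":
    for subset, found in pair_raw_data().items():
        print(f"{subset}: {len(found)} model/database pairs")
```

If the job finished correctly, both folders should report the same number of pairs, one per combination of ML model and training size.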