GENERATE (screening of ML models)

Overview

Robert_example_CURATE.csv: Curated CSV file with data to use as the training and validation sets. It is created in a previous CURATE job.

Name	Target_values	x2	x5	x6	x7	x8	x9	x10	x11	Csub-Csub	Csub-H	Csub-O	H-O
1	1.854766065	110.9270401	89.87553406	49.77253406	1	0	0	0	1	FALSE	TRUE	FALSE	FALSE
2	2.034511341	110.6553116	78.65235138	55.53135138	1	0	0	0	1	TRUE	FALSE	FALSE	FALSE
...
36	0.321084552	110.7593079	59.81459808	-76.60640192	0	2	1	3	3	FALSE	FALSE	TRUE	FALSE
37	0.329517076	115.2292938	70.45233154	-65.96866846	0	2	1	3	3	FALSE	FALSE	TRUE	FALSE
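
Before launching GENERATE, it can help to confirm that the curated file has the expected layout. The following is a minimal sketch, not part of ROBERT: the helper name summarize_curated_csv is hypothetical, and it assumes a comma-delimited file with the Name and Target_values columns shown above. The in-memory sample reuses a few values from the first two rows of the example.

```python
import csv
import io

def summarize_curated_csv(handle, target="Target_values", name="Name"):
    """Return (number of rows, descriptor column names, target values)."""
    reader = csv.DictReader(handle)  # assumes comma-delimited text
    rows = list(reader)
    # Every column other than the name and target columns is a descriptor
    descriptors = [c for c in reader.fieldnames if c not in (name, target)]
    targets = [float(r[target]) for r in rows]
    return len(rows), descriptors, targets

# Small in-memory sample with a subset of the columns shown above
sample = io.StringIO(
    "Name,Target_values,x2,x5,Csub-H\n"
    "1,1.854766065,110.9270401,89.87553406,TRUE\n"
    "2,2.034511341,110.6553116,78.65235138,FALSE\n"
)
n_rows, descriptors, targets = summarize_curated_csv(sample)
print(n_rows, descriptors)
```

For the real file, the same helper could be called on a handle opened with open("CURATE/Robert_example_CURATE.csv", newline="").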

Instructions:

First, in your terminal, go to the folder where CURATE was previously run. It should contain a folder called CURATE.
Run the following command line:

python -m robert --csv_name CURATE/Robert_example_CURATE.csv --generate
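
If the job needs to be launched from a script instead of the terminal, the same command line can be assembled in Python. This is only a sketch: build_generate_command and run_generate are hypothetical helper names, and the only ROBERT options used are the two that appear in the command above (--csv_name and --generate).

```python
import subprocess
import sys

def build_generate_command(csv_path):
    """Build the argument list equivalent to the command line shown above."""
    # sys.executable points at the current interpreter, so ROBERT is looked up
    # in the same environment that runs this script
    return [sys.executable, "-m", "robert", "--csv_name", csv_path, "--generate"]

def run_generate(csv_path):
    """Run GENERATE and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_generate_command(csv_path), check=True)

cmd = build_generate_command("CURATE/Robert_example_CURATE.csv")
print(" ".join(cmd[1:]))
```

Calling run_generate("CURATE/Robert_example_CURATE.csv") from the folder that contains CURATE would then start the job.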

Options used:

Time: ~2 min

System: 4 processors (Intel Xeon Ice Lake 8352Y) using 8.0 GB of RAM

Results:

Four CSV files for each combination of ML model/training size (in /GENERATE/Raw_data). Half of the CSVs correspond to models that use all the variables (No_PFI folder) and the other half to models that use only the most important features based on permutation feature importance, PFI (PFI folder). For each pair of CSVs, one contains the parameters of the model and the other contains the database already split into training/validation sets (_db suffix).
The two best models, one for No_PFI (all descriptors) and one for PFI (only the most important descriptors), are stored in /GENERATE/Best_model.
Two heatmaps with a summary of the results for all the models, created in /GENERATE/Raw_data.
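
The folder layout described above can be checked with a short script. The sketch below is an illustration under stated assumptions: it assumes the No_PFI and PFI folders sit directly inside Raw_data, the helper collect_generate_outputs is hypothetical, and the demo runs on a temporary directory with a made-up file stem (RF_60) instead of real ROBERT output.

```python
import tempfile
from pathlib import Path

def collect_generate_outputs(root):
    """Group the CSVs in <root>/Raw_data by folder and by the _db suffix."""
    raw = Path(root) / "Raw_data"
    summary = {}
    for folder in ("No_PFI", "PFI"):
        csvs = sorted((raw / folder).glob("*.csv"))
        summary[folder] = {
            # Files without the _db suffix hold the model parameters
            "parameters": [p.name for p in csvs if not p.stem.endswith("_db")],
            # Files with the _db suffix hold the split database
            "databases": [p.name for p in csvs if p.stem.endswith("_db")],
        }
    return summary

# Demo on a throwaway directory tree; "RF_60" is a hypothetical file stem
with tempfile.TemporaryDirectory() as tmp:
    for folder in ("No_PFI", "PFI"):
        target = Path(tmp) / "GENERATE" / "Raw_data" / folder
        target.mkdir(parents=True)
        (target / "RF_60.csv").touch()
        (target / "RF_60_db.csv").touch()
    outputs = collect_generate_outputs(Path(tmp) / "GENERATE")

print(outputs["PFI"])
```

On a real run, collect_generate_outputs("GENERATE") would be called from the folder where the command was executed, and each parameters file should have a matching _db file.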