GENERATE

class generate(**kwargs)

Class containing all the functions from the GENERATE module.

Parameters

kwargs : argument class

Specify any arguments from the GENERATE module (for a complete list of variables, visit the ROBERT documentation)
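As a minimal sketch of how such keyword arguments might be assembled, the dictionary below uses parameter names from this section with the example values quoted in it. The import path and the call are assumptions and are left commented out, so the snippet runs without ROBERT installed.

```python
# Keyword arguments named after parameters documented in this section.
generate_kwargs = {
    'csv_name': 'FILE.csv',   # input database
    'y': 'solubility',        # response variable column
    'discard': ['SMILES'],    # columns dropped from the curated CSV file
    'ignore': ['name'],       # columns kept but skipped during curation
    'type': 'reg',            # regression problem
    'seed': 0,                # random seed
}

# Assumed import path and call (not confirmed by this section):
# from robert.generate import generate
# generate(**generate_kwargs)

print(sorted(generate_kwargs))
```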

Parameters

csv_name : str, default=''

Name of the CSV file containing the database. A path can be provided (e.g. 'C:/Users/FOLDER/FILE.csv').

y : str, default=''

Name of the column containing the response variable in the input CSV file (e.g. 'solubility').

discard : list, default=[]

List containing the columns of the input CSV file that will not be included as descriptors in the curated CSV file (e.g. ['name','SMILES']).

ignore : list, default=[]

List containing the columns of the input CSV file that will be ignored during the curation process (e.g. ['name','SMILES']). Unlike discarded columns, these columns are still included in the curated CSV file. The y column is automatically ignored.
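The difference between discard and ignore can be illustrated with a small sketch. The helper plan_columns below is hypothetical and not part of ROBERT; it only mirrors the behavior described above, assuming that every column that is not discarded ends up in the curated file.

```python
def plan_columns(columns, y, discard=(), ignore=()):
    """Hypothetical helper (not part of ROBERT) mirroring the text above."""
    # Discarded columns never reach the curated CSV file.
    kept = [c for c in columns if c not in discard]
    # Ignored columns and the y column stay in the file but skip curation.
    skipped = set(ignore) | {y}
    curated = [c for c in kept if c not in skipped]
    return kept, curated


columns = ['name', 'SMILES', 'logP', 'mass', 'solubility']
kept, curated = plan_columns(
    columns, y='solubility', discard=['SMILES'], ignore=['name']
)
print(kept)     # columns written to the curated CSV file
print(curated)  # columns examined during the curation process
```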

destination : str, default=None

Directory in which the output file(s) are created.

varfile : str, default=None

Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).

auto_type : bool, default=True

If the y column contains only two distinct values, the program automatically changes the type of problem to classification.
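The rule above can be sketched in a few lines. The function infer_type is a hypothetical illustration, not ROBERT's code; it assumes the check is a simple count of distinct y values and uses the 'reg' and 'clas' labels of the type parameter.

```python
def infer_type(y_values, requested='reg', auto_type=True):
    """Hypothetical helper: switch to classification for two-valued targets."""
    if auto_type and len(set(y_values)) == 2:
        return 'clas'
    return requested


print(infer_type([0, 1, 1, 0, 1]))   # two distinct y values
print(infer_type([0.3, 1.2, 2.7]))   # more than two distinct y values
```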

model : list, default=['RF','GB','NN','MVL'] (regression) or ['RF','GB','NN','AdaB'] (classification)

ML models available:
1. 'RF' (Random forest)
2. 'MVL' (Multivariate linear models)
3. 'GB' (Gradient boosting)
4. 'NN' (MLP neural network)
5. 'GP' (Gaussian Process)
6. 'AdaB' (AdaBoost)

custom_params : str, default=None

Define new parameters for the ML models used in the hyperoptimization workflow. The path to the folder containing all the yaml files should be specified (e.g. custom_params='YAML_FOLDER').

type : str, default='reg'

Type of the predictions. Options:
1. 'reg' (Regressor)
2. 'clas' (Classifier)

seed : int, default=0

Random seed used in the ML predictor models and other protocols.

error_type : str, default='rmse' (regression) or 'mcc' (classification)

Error metric targeted during the hyperoptimization. Options:
Regression:
1. rmse (root-mean-square error)
2. mae (mean absolute error)
3. r2 (R-squared; not recommended, since R2 can look good even with high errors in small datasets)
Classification:
1. mcc (Matthews correlation coefficient)
2. f1 (F1 score)
3. acc (accuracy, fraction of correct predictions)
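For reference, the metrics listed above follow standard textbook definitions, sketched here in plain Python. The function names are illustrative and not part of ROBERT, whose own implementation may differ; the classification helpers assume binary labels encoded as 0 and 1.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def confusion_counts(y_true, y_pred):
    # Counts for binary labels, with 1 as the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def acc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def f1(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true, y_pred = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))

labels_true, labels_pred = [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]
print(mcc(labels_true, labels_pred), f1(labels_true, labels_pred), acc(labels_true, labels_pred))
```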

init_points : int, default=10

Number of initial points for Bayesian optimization (exploration).

n_iter : int, default=10

Number of iterations for Bayesian optimization (exploitation).

expect_improv : float, default=0.05

Expected improvement for Bayesian optimization.

pfi_filter : bool, default=True

Activate the PFI filter of descriptors.

pfi_epochs : int, default=5

Sets the number of times a feature is randomly shuffled during the PFI analysis (5 is the standard value given in the scikit-learn documentation).

pfi_threshold : float, default=0.2

Threshold of the PFI filter, expressed as a fraction of the model's score (0.2 = 20% of the total score during PFI).

pfi_max : int, default=0

Number of features to keep after the PFI filter. If pfi_max is 0, all the features that pass the PFI filter are used.
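The interplay of pfi_epochs, pfi_threshold and pfi_max can be sketched with a toy permutation feature importance (PFI) routine in plain Python. This is an illustration under stated assumptions, not ROBERT's implementation: the scoring function (R2), the stand-in toy_model and the rule "keep a feature when its mean score drop exceeds pfi_threshold times the model's score" are all assumptions made for the example.

```python
import random


def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def toy_model(row):
    # Stand-in for a fitted model: feature 2 is never used.
    return 3.0 * row[0] + 0.1 * row[1]


def pfi_filter(X, y, predict, pfi_epochs=5, pfi_threshold=0.2, pfi_max=0, seed=0):
    rng = random.Random(seed)
    base = r2_score(y, [predict(r) for r in X])
    importances = {}
    for j in range(len(X[0])):
        drops = []
        for _ in range(pfi_epochs):
            # Shuffle one feature column and measure the drop in score.
            col = [r[j] for r in X]
            rng.shuffle(col)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            drops.append(base - r2_score(y, [predict(r) for r in shuffled]))
        importances[j] = sum(drops) / pfi_epochs
    # Assumed rule: keep features whose mean drop exceeds threshold * score.
    kept = [j for j in importances if importances[j] > pfi_threshold * base]
    kept.sort(key=lambda j: importances[j], reverse=True)
    # pfi_max = 0 keeps every feature that passed the filter.
    if pfi_max > 0:
        kept = kept[:pfi_max]
    return kept, importances


X = [[float(i), float((7 * i) % 20), float((3 * i) % 20)] for i in range(20)]
y = [toy_model(r) for r in X]
kept, importances = pfi_filter(X, y, toy_model)
print(kept)  # indices of the features that pass the filter
```

In this toy dataset only the first feature carries most of the signal, so the weakly contributing and unused features fall below the threshold.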

auto_test : bool, default=True

Raises the test set size to 20% of the data points if test_set is lower than that.

test_set : float, default=0.2

Fraction of data points to separate as an external test set (0.2 = 20%). These points will not be used during the hyperoptimization, and PREDICT will use them as the test set during ROBERT workflows. Select --test_set 0 to use only training and validation.

kfold : int, default=5

Number of random data splits for the cross-validation of the models.

repeat_kfolds : int, default=10

Number of repetitions for the k-fold cross-validation of the models.
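A generic repeated k-fold scheme, sketched below in plain Python, shows how kfold and repeat_kfolds combine. The generator repeated_kfold is hypothetical and only illustrates the general technique; ROBERT's own splitting code may differ.

```python
import random


def repeated_kfold(n_samples, kfold=5, repeat_kfolds=10, seed=0):
    """Yield (train, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeat_kfolds):
        # Each repetition reshuffles the data before cutting it into folds.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::kfold] for i in range(kfold)]
        for k in range(kfold):
            valid = folds[k]
            train = [i for f, fold in enumerate(folds) if f != k for i in fold]
            yield train, valid


splits = list(repeated_kfold(20))
print(len(splits))  # kfold * repeat_kfolds train/validation pairs
```

With the default values, each model is therefore evaluated on 5 x 10 = 50 train/validation pairs in this scheme.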

split : str, default='even' (regression) or 'rnd' (classification)

Specifies how the data is split into training and test sets. Options:
1. 'even': splits the data evenly into training and test sets.
2. 'RND': randomly splits the data.
3. 'stratified': splits the data while preserving the distribution of the target variable.
4. 'KN': uses a k-means approach to select representative samples for training (good for interpolation, bad for extrapolation).
5. 'extra_q1': selects the 20% lowest values.
6. 'extra_q5': selects the 20% highest values.
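The two extrapolation options can be sketched as a selection on sorted y values. The helper extrapolation_split is hypothetical; it assumes that the selected 20% of points are the ones held out, which this section does not state explicitly.

```python
def extrapolation_split(y_values, split='extra_q1', fraction=0.2):
    """Hypothetical helper: return (selected, remaining) index lists."""
    # Indices ordered from the lowest to the highest y value.
    order = sorted(range(len(y_values)), key=lambda i: y_values[i])
    n_sel = round(len(y_values) * fraction)
    if split == 'extra_q1':
        selected = order[:n_sel]                    # lowest y values
    elif split == 'extra_q5':
        selected = order[len(order) - n_sel:]       # highest y values
    else:
        raise ValueError("this sketch only covers 'extra_q1' and 'extra_q5'")
    remaining = [i for i in order if i not in selected]
    return sorted(selected), sorted(remaining)


y = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0]
low, rest_low = extrapolation_split(y, 'extra_q1')
high, rest_high = extrapolation_split(y, 'extra_q5')
print(low, high)  # indices of the two lowest and two highest y values
```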