Default parameters
This document details the default parameters used in the ROBERT program.
AQME
Parameters
- csv_name : str, default=''
Name of the CSV file containing the database with SMILES and code_name columns. A path can be provided (e.g. 'C:/Users/FOLDER/FILE.csv').
- destination : str, default=None
Directory to create the output file(s).
- varfile : str, default=None
Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).
- y : str, default=''
Name of the column containing the response variable in the input CSV file (e.g. 'solubility').
- qdescp_keywords : str, default=''
Add extra keywords to the AQME-QDESCP run (e.g. qdescp_keywords="--qdescp_atoms ['Ir']").
- descp_lvl : str, default='interpret'
Type of descriptor to be used in the AQME-ROBERT workflow. Options are 'interpret', 'denovo' or 'full'.
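As a rough illustration of the varfile option, the Python sketch below renders a few of the AQME parameters listed above as YAML text. The flat "key: 'value'" layout, the helper name render_varfile and the example values are assumptions made for this example; they are not taken from the ROBERT source, so check the layout against a varfile that is known to work.

```python
def render_varfile(params):
    """Render parameters as flat "key: 'value'" YAML lines (assumed layout)."""
    return "".join(f"{key}: '{value}'\n" for key, value in params.items())


# Hypothetical values; the parameter names come from the AQME list above.
aqme_params = {
    "csv_name": "FILE.csv",
    "y": "solubility",
    "descp_lvl": "interpret",
}

print(render_varfile(aqme_params), end="")
```

Saving the rendered text as FILE.yaml would give a file that can be referenced with varfile=FILE.yaml.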
CURATE
Parameters
- csv_name : str, default=''
Name of the CSV file containing the database. A path can be provided (e.g. 'C:/Users/FOLDER/FILE.csv').
- y : str, default=''
Name of the column containing the response variable in the input CSV file (e.g. 'solubility').
- discard : list, default=[]
List containing the columns of the input CSV file that will not be included as descriptors in the curated CSV file (e.g. "['name','SMILES']").
- ignore : list, default=[]
List containing the columns of the input CSV file that will be ignored during the curation process (e.g. "['name','SMILES']"). The descriptors will be included in the curated CSV file. The y value is automatically ignored.
- names : str, default=''
Column containing the name of each datapoint. Names are used to print outliers.
- destination : str, default=None
Directory to create the output file(s).
- varfile : str, default=None
Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).
- categorical : str, default='onehot'
Mode to convert data from columns with categorical variables. As an example, a column describing 4 types of C atoms (i.e. primary, secondary, tertiary, quaternary) is a categorical variable that will be converted into numerical descriptors. Options:
  1. 'onehot' (one-hot encoding: ROBERT will create a descriptor for each type of C atom using 0s and 1s to indicate whether the C type is present)
  2. 'numbers' (describes the C atoms with numbers: 1, 2, 3, 4)
- corr_filter_x : bool, default=True
Activate the correlation filters of descriptors, based on the correlation of the descriptors with other descriptors (x filter).
- corr_filter_y : bool, default=False
Activate the correlation filters of descriptors, based on the correlation of the descriptors with the y values (y filter, for noise). This filter is only suggested for MVL.
- desc_thres : float, default=25
Threshold for the datapoints-to-descriptors ratio used to loosen the correlation filter. By default, the correlation filter is loosened if there are 25 times more datapoints than descriptors.
- thres_x : float, default=0.7
Threshold to discard descriptors based on high R**2 correlation with other descriptors (e.g. if thres_x=0.7, variables that show R**2 > 0.7 will be discarded).
- thres_y : float, default=0.001
Threshold to discard descriptors with poor correlation with the y values based on R**2 (e.g. if thres_y=0.001, variables that show R**2 < 0.001 will be discarded).
- seed : int, default=0
Random seed used in the RFECV feature selector and other protocols.
- kfold : int, default=5
Number of random data splits for the cross-validation of the RFECV feature selector.
- repeat_kfolds : int, default=10
Number of repetitions for the k-fold cross-validation of the RFECV feature selector.
- auto_type : bool, default=True
If there are only two y values, the program automatically changes the type of problem to classification.
- auto_fill : bool, default=True
Complete missing values in columns with descriptors of "float" type using a KNN imputer.
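To make the roles of thres_x and thres_y more concrete, here is a minimal pure-Python sketch of a correlation filter that follows the descriptions above. It is not ROBERT's implementation: the helper names (r_squared, correlation_filter), the toy data and the rule for which member of a correlated pair is dropped (the one seen later) are all assumptions for illustration.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov ** 2 / (var_a * var_b)


def correlation_filter(descriptors, y, thres_x=0.7, thres_y=0.001,
                       corr_filter_x=True, corr_filter_y=False):
    """Return the names of the descriptors that pass the x and y filters."""
    kept = {}
    for name, values in descriptors.items():
        # y filter: discard descriptors with R**2 < thres_y against the y values
        if corr_filter_y and r_squared(values, y) < thres_y:
            continue
        # x filter: discard descriptors with R**2 > thres_x against a kept one
        # (assumption: the first descriptor of a correlated pair is retained)
        if corr_filter_x and any(
            r_squared(values, other) > thres_x for other in kept.values()
        ):
            continue
        kept[name] = values
    return list(kept)


# Toy data: d2 is almost a multiple of d1, d3 is only weakly related to d1
descriptors = {
    "d1": [1, 2, 3, 4, 5],
    "d2": [2, 4, 6, 8, 10.5],
    "d3": [5, 1, 4, 2, 3],
}
y = [1.1, 1.9, 3.2, 3.9, 5.1]

print(correlation_filter(descriptors, y))
```

With the default thres_x=0.7, d2 is discarded because its R**2 with d1 exceeds the threshold, while d3 is kept.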
GENERATE
Parameters
- csv_name : str, default=''
Name of the CSV file containing the database. A path can be provided (e.g. 'C:/Users/FOLDER/FILE.csv').
- y : str, default=''
Name of the column containing the response variable in the input CSV file (e.g. 'solubility').
- discard : list, default=[]
List containing the columns of the input CSV file that will not be included as descriptors in the curated CSV file (e.g. ['name','SMILES']).
- ignore : list, default=[]
List containing the columns of the input CSV file that will be ignored during the curation process (e.g. ['name','SMILES']). The descriptors will be included in the curated CSV file. The y value is automatically ignored.
- destination : str, default=None
Directory to create the output file(s).
- varfile : str, default=None
Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).
- auto_type : bool, default=True
If there are only two y values, the program automatically changes the type of problem to classification.
- model : list, default=['RF','GB','NN','MVL'] (regression) or ['RF','GB','NN','AdaB'] (classification)
ML models available:
  1. 'RF' (Random forest)
  2. 'MVL' (Multivariate linear models)
  3. 'GB' (Gradient boosting)
  4. 'NN' (MLP neural network)
  5. 'GP' (Gaussian Process)
  6. 'AdaB' (AdaBoost)
- custom_params : str, default=None
Define new parameters for the ML models used in the hyperoptimization workflow. The path to the folder containing all the yaml files should be specified (e.g. custom_params='YAML_FOLDER').
- type : str, default='reg'
Type of the predictions. Options:
  1. 'reg' (Regressor)
  2. 'clas' (Classifier)
- seed : int, default=0
Random seed used in the ML predictor models and other protocols.
- error_type : str, default=rmse (regression) or mcc (classification)
Target metric used during the hyperoptimization. Options:
Regression:
  1. rmse (root-mean-square error)
  2. mae (mean absolute error)
  3. r2 (R-squared, not recommended since R2 might be good even with high errors in small datasets)
Classification:
  1. mcc (Matthews correlation coefficient)
  2. f1 (F1 score)
  3. acc (accuracy, fraction of correct predictions)
- init_points : int, default=10
Number of initial points for Bayesian optimization (exploration).
- n_iter : int, default=10
Number of iterations for Bayesian optimization (exploitation).
- expect_improv : float, default=0.05
Expected improvement for Bayesian optimization.
- pfi_filter : bool, default=True
Activate the PFI filter of descriptors.
- pfi_epochs : int, default=5
Sets the number of times a feature is randomly shuffled during the PFI analysis (standard value from the sklearn webpage: 5).
- pfi_threshold : float, default=0.2
Threshold of the PFI filter, expressed as a fraction of the model's score (0.2 = 20% of the total score during PFI).
- pfi_max : int, default=0
Number of features to keep after the PFI filter. If pfi_max is 0, all the features that pass the PFI filter are used.
- auto_test : bool, default=True
Raises the percentage of test points to 20% if test_set is lower than that.
- test_set : float, default=0.2
Fraction of datapoints to separate as external test set (0.2 = 20%). These points will not be used during the hyperoptimization, and PREDICT will use the points as test set during ROBERT workflows. Select --test_set 0 to use only training and validation.
- kfold : int, default=5
Number of random data splits for the cross-validation of the models.
- repeat_kfolds : int, default=10
Number of repetitions for the k-fold cross-validation of the models.
- split : str, default='even' (regression) or 'rnd' (classification)
Specifies how the data is split into training and test sets. Options:
  1. 'even': splits the data evenly into training and test sets.
  2. 'RND': randomly splits the data.
  3. 'stratified': splits the data while preserving the distribution of the target variable.
  4. 'KN': uses a k-means approach to select representative samples for training (good for interpolation, bad for extrapolation).
  5. 'extra_q1': selects the 20% lowest values.
  6. 'extra_q5': selects the 20% highest values.
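The 'RND', 'extra_q1' and 'extra_q5' options of split can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: the function name split_indices is hypothetical, the selected points are assumed to form the test set, and the size of the selection is taken from test_set instead of a fixed 20%. The 'even', 'stratified' and 'KN' options are not covered.

```python
import random


def split_indices(y, test_set=0.2, split="RND", seed=0):
    """Return (train, test) index lists for a few of the split options."""
    n_points = len(y)
    n_test = round(n_points * test_set)
    order = sorted(range(n_points), key=lambda i: y[i])  # indices sorted by y

    if split == "extra_q1":
        test = order[:n_test]  # lowest y values
    elif split == "extra_q5":
        test = order[n_points - n_test:]  # highest y values
    elif split == "RND":
        test = random.Random(seed).sample(range(n_points), n_test)
    else:
        raise ValueError(f"split option not covered by this sketch: {split}")

    test_lookup = set(test)
    train = [i for i in range(n_points) if i not in test_lookup]
    return train, sorted(test)


# Toy y values for ten datapoints
y_values = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0]

for option in ("RND", "extra_q1", "extra_q5"):
    train, test = split_indices(y_values, test_set=0.2, split=option)
    print(option, "test indices:", test)
```

With 'extra_q1' and 'extra_q5' the test points lie at one end of the y range, so the model is evaluated on values outside most of its training range.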
PREDICT
Parameters
- destination : str, default=None
Directory to create the output file(s).
- varfile : str, default=None
Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).
- params_dir : str, default=''
Folder containing the database and parameters of the ML model.
- csv_test : str, default=''
Name of the CSV file containing the test set (if any). A path can be provided (e.g. 'C:/Users/FOLDER/FILE.csv').
- t_value : float, default=2
t-value that will be the threshold to identify outliers (check tables for t-values elsewhere). The higher the t-value, the more restrictive the analysis will be (e.g. there will be more outliers with t_value=1 than with t_value=4).
- alpha : float, default=0.05
Significance level, or probability of making a wrong decision. This parameter is related to the confidence intervals (i.e. 1-alpha is the confidence level). By default, an alpha value of 0.05 is used, which corresponds to a confidence level of 95%.
- shap_show : int, default=10
Number of descriptors shown in the plot of the SHAP analysis.
- pfi_show : int, default=10
Number of descriptors shown in the plot of the PFI analysis.
- pfi_epochs : int, default=5
Sets the number of times a feature is randomly shuffled during the PFI analysis (standard value from the sklearn webpage: 5).
- names : str, default=''
Column containing the name of each datapoint. Names are used to print outliers.
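The effect of t_value can be sketched as follows. The list above does not spell out how the outlier statistic is computed, so this example assumes one common choice: the prediction errors are standardized with their mean and sample standard deviation, and a point is flagged when the absolute standardized error exceeds t_value. The function name find_outliers and the data are hypothetical.

```python
from statistics import mean, stdev


def find_outliers(names, y_true, y_pred, t_value=2):
    """Return the names whose standardized error exceeds t_value (assumed rule)."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mu, sigma = mean(errors), stdev(errors)
    return [
        name
        for name, err in zip(names, errors)
        if abs((err - mu) / sigma) > t_value
    ]


# Toy data: the last prediction is far off, the others are close
names = [f"mol_{i}" for i in range(1, 9)]
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0, 6.1, 6.9, 11.0]

print(find_outliers(names, y_true, y_pred, t_value=2))
print(find_outliers(names, y_true, y_pred, t_value=4))
```

In this toy set only the last point is flagged with t_value=2, and nothing is flagged with t_value=4, in line with the note that lower t-values give more outliers.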
VERIFY
Parameters
- destination : str, default=None
Directory to create the output file(s).
- varfile : str, default=None
Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).
- params_dir : str, default=''
Folder containing the database and parameters of the ML model to analyze.
- seed : int, default=0
Random seed used in the ML predictor models and other protocols.
- kfold : int, default=5
Number of random data splits for the cross-validation of the models.
- repeat_kfolds : int, default=10
Number of repetitions for the k-fold cross-validation of the models.
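How seed, kfold and repeat_kfolds fit together can be illustrated with a small pure-Python sketch of repeated k-fold splitting. This is not taken from the ROBERT source; the generator name repeated_kfold and the way folds are cut from the shuffled indices are assumptions. scikit-learn provides RepeatedKFold with analogous n_splits, n_repeats and random_state arguments, although this page does not state which implementation ROBERT uses.

```python
import random


def repeated_kfold(n_points, kfold=5, repeat_kfolds=10, seed=0):
    """Yield (train, valid) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # one generator, so a given seed is reproducible
    for _ in range(repeat_kfolds):
        indices = list(range(n_points))
        rng.shuffle(indices)  # new random partition for every repetition
        for fold in range(kfold):
            valid = indices[fold::kfold]  # every kfold-th shuffled index
            valid_lookup = set(valid)
            train = [i for i in indices if i not in valid_lookup]
            yield train, valid


splits = list(repeated_kfold(20, kfold=5, repeat_kfolds=10, seed=0))
print(len(splits))  # kfold * repeat_kfolds train/validation pairs
```

With the defaults, each model is therefore evaluated on 5 x 10 = 50 train/validation pairs, and reusing the same seed reproduces the same pairs.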
REPORT
Parameters
- destination : str, default=None
Directory to create the output file(s).
- varfile : str, default=None
Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).
- report_modules : list of str, default=['CURATE','GENERATE','VERIFY','PREDICT']
List of the modules to include in the report.
- debug_report : bool, default=False
Debug mode used during the pytest runs of report.py.