CURATE
- class curate(**kwargs)
Class containing all the functions from the CURATE module.
Parameters
- kwargs : argument class
Specify any arguments from the CURATE module (for a complete list of variables, visit the ROBERT documentation)
- dup_filter(csv_df_dup)
Removes duplicate datapoints and descriptors.
- save_curate(csv_df)
Saves the curated database and the options used in CURATE.
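As an illustration of what removing duplicated datapoints and descriptors can mean in practice, here is a minimal sketch in plain Python. It is not ROBERT's implementation: the helper name `drop_duplicates_sketch` and the dictionary-of-columns input are assumptions made for this example, and the exact criteria used by `dup_filter` are not described here.

```python
def drop_duplicates_sketch(table):
    """Hypothetical helper: `table` maps a column name to its list of values."""
    names = list(table)

    # Datapoints: keep only the first occurrence of each distinct row.
    seen_rows, kept_rows = set(), []
    for row in zip(*(table[name] for name in names)):
        if row not in seen_rows:
            seen_rows.add(row)
            kept_rows.append(row)

    # Rebuild the columns from the surviving rows.
    columns = {name: [row[i] for row in kept_rows] for i, name in enumerate(names)}

    # Descriptors: keep only the first column for each distinct sequence of values.
    seen_cols, result = set(), {}
    for name, values in columns.items():
        key = tuple(values)
        if key not in seen_cols:
            seen_cols.add(key)
            result[name] = values
    return result


example = {
    "x1": [1.0, 2.0, 2.0, 3.0],
    "x2": [1.0, 2.0, 2.0, 3.0],  # same values as x1, so it is a duplicated descriptor
    "y": [0.5, 0.7, 0.7, 0.9],   # the third row repeats the second row
}
cleaned = drop_duplicates_sketch(example)
```

In this example the repeated third row and the redundant `x2` column are removed, leaving three datapoints described by `x1` and `y`.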
Parameters
- csv_name : str, default=''
Name of the CSV file containing the database. A path can be provided (e.g. 'C:/Users/FOLDER/FILE.csv').
- y : str, default=''
Name of the column containing the response variable in the input CSV file (e.g. 'solubility').
- discard : list, default=[]
List of the columns of the input CSV file that will not be included as descriptors in the curated CSV file (e.g. "['name','SMILES']").
- ignore : list, default=[]
List of the columns of the input CSV file that will be ignored during the curation process (e.g. "['name','SMILES']"). Unlike discarded columns, these columns are still included in the curated CSV file. The y column is ignored automatically.
- names : str, default=''
Column containing the name of each datapoint. Names are used when printing outliers.
- destination : str, default=None
Directory in which the output file(s) are created.
- varfile : str, default=None
Option to parse the variables from a YAML file (specify the filename, e.g. varfile=FILE.yaml).
- categorical : str, default='onehot'
Mode used to convert columns that contain categorical variables into numerical descriptors. As an example, take a variable with 4 types of C atoms (primary, secondary, tertiary, quaternary). Options: 1. 'onehot' (one-hot encoding: ROBERT creates a descriptor for each type of C atom, using 0s and 1s to indicate whether that C type is present). 2. 'numbers' (the types of C atoms are described with numbers: 1, 2, 3, 4).
- corr_filter_x : bool, default=True
Activates the correlation filter that discards descriptors based on their correlation with other descriptors (x filter).
- corr_filter_y : bool, default=False
Activates the correlation filter that discards descriptors based on their correlation with the y values (y filter, for noise). This filter is only suggested for MVL.
- desc_thres : float, default=25
Threshold of the datapoints-to-descriptors ratio above which the correlation filter is loosened. By default, the correlation filter is loosened if there are 25 times more datapoints than descriptors.
- thres_x : float, default=0.7
Threshold to discard descriptors based on high R**2 correlation with other descriptors (e.g. if thres_x=0.7, variables that show R**2 > 0.7 will be discarded).
- thres_y : float, default=0.001
Threshold to discard descriptors with poor correlation with the y values, based on R**2 (e.g. if thres_y=0.001, variables that show R**2 < 0.001 will be discarded).
- seed : int, default=0
Random seed used in the RFECV feature selector and other protocols.
- kfold : int, default=5
Number of random data splits for the cross-validation of the RFECV feature selector.
- repeat_kfolds : int, default=10
Number of repetitions for the k-fold cross-validation of the RFECV feature selector.
- auto_type : bool, default=True
If the y column contains only two different values, the program automatically changes the type of problem to classification.
- auto_fill : bool, default=True
Fills in missing values in descriptor columns of "float" type using a KNN imputer.
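To make the roles of corr_filter_x, corr_filter_y, thres_x and thres_y more concrete, the following is a minimal sketch of R**2-based filtering in plain Python. It is not ROBERT's implementation: the helper names `r_squared` and `correlation_filter_sketch` are hypothetical, the sketch assumes that the first column of a correlated pair is the one kept (the text above does not say which one CURATE keeps), and it does not model the loosening controlled by desc_thres.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov ** 2 / (var_a * var_b)


def correlation_filter_sketch(descriptors, y, thres_x=0.7, thres_y=0.001,
                              corr_filter_x=True, corr_filter_y=False):
    """Hypothetical helper: return the names of the descriptors that survive."""
    kept = dict(descriptors)
    if corr_filter_y:
        # y filter: discard descriptors with R**2 < thres_y against the y values.
        kept = {name: col for name, col in kept.items()
                if r_squared(col, y) >= thres_y}
    if corr_filter_x:
        # x filter: discard a descriptor if its R**2 with an already selected
        # descriptor exceeds thres_x (assumption: the earlier column wins).
        selected = {}
        for name, col in kept.items():
            if all(r_squared(col, other) <= thres_x for other in selected.values()):
                selected[name] = col
        kept = selected
    return list(kept)


y_values = [1.0, 2.0, 3.0, 4.0, 5.0]
descriptors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.5],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.5],  # almost proportional to x1
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0],
    "x4": [1.0, 0.0, 0.0, 0.0, 1.0],   # essentially uncorrelated with y
}
x_only = correlation_filter_sketch(descriptors, y_values)
x_and_y = correlation_filter_sketch(descriptors, y_values, corr_filter_y=True)
```

With the default settings only the x filter runs, so `x2` is discarded because of its high R**2 with `x1`. Enabling the y filter additionally discards `x4`, whose R**2 with the y values falls below thres_y.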