utils
- BO_iteration(self, bo_data, Xy_data, **params)
Evaluate a model with given parameters using cross-validation. Returns the mean negative root mean squared error (higher is better).
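The scoring convention described for BO_iteration can be sketched in plain Python. This is only an illustration: `neg_rmse` and `mean_cv_score` are hypothetical helper names, and the folds are made-up numbers. The sign is flipped so that a maximizer (such as a Bayesian optimizer) prefers smaller errors.

```python
import math

def neg_rmse(y_true, y_pred):
    """Negative root mean squared error: values closer to 0 are better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)

def mean_cv_score(fold_results):
    """Average the negative RMSE over (y_true, y_pred) pairs, one pair per fold."""
    scores = [neg_rmse(y_true, y_pred) for y_true, y_pred in fold_results]
    return sum(scores) / len(scores)

# Hypothetical predictions from two CV folds
folds = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect fold, RMSE = 0
    ([1.0, 2.0], [2.0, 3.0]),            # every prediction off by 1, RMSE = 1
]
score = mean_cv_score(folds)  # mean of 0 and -1
```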
- BO_metrics(self, bo_data, Xy_data)
Get combined score for repeated k-fold and top-bottom sorted CVs (used in BO)
- class Logger(filein, append, suffix='dat')
Class that wraps a file object to abstract the logging.
- finalize()
Closes the file
- PFI_filter(self, Xy_data, model_data)
Performs the PFI calculation and returns a list of the descriptors that are not important
- PFI_plot(self, Xy_data, model_data, path_n_suffix)
Plots and prints the results of the PFI analysis
- Xy_split(csv_df, csv_X, X_scaled_df, csv_y, csv_external_df, csv_X_external, X_scaled_external_df, csv_y_external, test_points, column_names)
Returns a dictionary with the database divided into train and validation
- categorical_transform(self, csv_df, module)
Converts all columns with strings into categorical values (one-hot encoding by default; can be set to numerical 1, 2, 3... with categorical = True). Troubleshooting: for one-hot encoding, don't use variable names that are also column headers. For example, the descriptor "C_atom" might contain C2 as a value while C2 is already the header of a different column in the database. The same applies to multiple columns containing the same variable names.
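The header clash mentioned in the troubleshooting note can be illustrated with a small stand-alone sketch. It assumes that each one-hot column is named after the category value it encodes; `one_hot_sketch` is a hypothetical helper that works on a list of row dictionaries, not the module's own code.

```python
def one_hot_sketch(rows, column):
    """One-hot encode `column`, naming each new column after its category value."""
    categories = sorted({row[column] for row in rows})
    existing_headers = set(rows[0]) - {column}
    clashes = existing_headers.intersection(categories)
    if clashes:
        # A category value equals an existing header, so the new column
        # would silently overwrite (or be confused with) the old one.
        raise ValueError(f"category values clash with headers: {sorted(clashes)}")
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

rows = [{"C_atom": "C1", "x": 1.0}, {"C_atom": "C2", "x": 2.0}]
encoded_rows = one_hot_sketch(rows, "C_atom")
```

Calling the same helper on rows that already have a header named "C2" raises the `ValueError`, which is the situation the note warns about.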
- check_clas_problem(self, csv_df)
Changes type to classification if there are only two different y values. Automatically converts any pair of values (strings or numbers) to 0 and 1. Stores the original labels for later reconversion in outputs.
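A minimal sketch of the label conversion described for check_clas_problem. The helper name `binarize_labels` and the rule used to decide which label becomes 0 (sorting by string representation) are assumptions made for illustration only.

```python
def binarize_labels(y_values):
    """Map exactly two distinct labels to 0/1 and keep the reverse mapping."""
    unique = sorted(set(y_values), key=str)  # assumed ordering rule
    if len(unique) != 2:
        # Not a two-class problem: leave the values untouched
        return list(y_values), None
    to_int = {label: index for index, label in enumerate(unique)}
    encoded = [to_int[value] for value in y_values]
    original_labels = {index: label for label, index in to_int.items()}
    return encoded, original_labels

encoded, original_labels = binarize_labels(["yes", "no", "yes"])
# Reconvert predictions to the original labels for the outputs
restored = [original_labels[value] for value in encoded]
```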
- check_csv_option(self, csv_option, print_err)
Checks missing values in input CSV options
- command_line_args(exe_type, sys_args)
Load default and user-defined arguments specified through the command line. Arguments are loaded as a dictionary
Correct for a problem with the 'hidden_layer_sizes' parameter when loading arrays from JSON
- correlation_filter(self, csv_df)
a) Discards correlated variables, b) discards variables that do not correlate with the y values (both based on R**2 values), and c) reduces the number of descriptors to one third of the number of datapoints using RFECV.
REPRODUCIBILITY GUARANTEES:
  - Columns are sorted alphabetically before any operation
  - Rows are sorted by y value to ensure consistent ordering
  - Tie-breaking in correlations uses alphabetical order
  - RFECV descriptor selection uses sorted feature importances with alphabetical tie-breaking
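The idea of discarding correlated descriptors with alphabetical tie-breaking can be sketched as follows. This is not the module's implementation: the greedy strategy, the helper names `r_squared` and `drop_correlated`, and the 0.9 threshold are all assumptions chosen for illustration.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no correlation information
    return cov ** 2 / (var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Keep a descriptor only if it is not too correlated with one already kept.

    Columns are visited in alphabetical order, so when two descriptors are
    correlated the alphabetically first one always survives.
    """
    kept = []
    for name in sorted(columns):
        if all(r_squared(columns[name], columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept

descriptors = {
    "b": [2, 4, 6, 8],    # exactly 2 * a, so R**2 with "a" is 1
    "a": [1, 2, 3, 4],
    "c": [1, -1, 1, -1],  # weakly correlated with "a"
}
kept = drop_correlated(descriptors)
```

Because the loop iterates over sorted names, the result does not depend on the order in which the columns appear in the input dictionary.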
- create_heatmap(self, csv_df, suffix, path_raw)
Graph the heatmap
- detect_outliers(self, outliers_scaled, name_points, naming_detect, set_type)
Detects and stores outliers with their corresponding datapoint names
- dict_formating(dict_csv)
Adapt format of dictionaries that come from dataframes loaded from CSV
- distribution_plot(self, Xy_data, path_n_suffix, params_dict)
Plots histogram (reg) or bin plot (clas).
- format_lists(value)
Transforms strings into a list
- generate_lhs_points(pbounds, n_points, random_state=None)
Generate initial points using Latin Hypercube Sampling for better space coverage. LHS ensures uniform distribution across all dimensions of the hyperparameter space.
Args:
  pbounds: Dictionary with parameter bounds from BO_hyperparams
  n_points: Number of initial points to generate
  random_state: Random seed for reproducibility
Returns:
  List of dictionaries with parameter values
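Latin Hypercube Sampling, as described for generate_lhs_points, can be sketched with the standard library alone. The helper name `lhs_sketch` and the assumption that each entry of `pbounds` is a `(low, high)` tuple are illustrative; the actual function may rely on a dedicated sampling library.

```python
import random

def lhs_sketch(pbounds, n_points, random_state=None):
    """Return n_points dictionaries, one value per parameter.

    Each parameter range is cut into n_points equal strata and exactly one
    sample is drawn from every stratum. The strata are then shuffled
    independently per parameter.
    """
    rng = random.Random(random_state)
    names = sorted(pbounds)  # fixed order keeps the result reproducible
    columns = {}
    for name in names:
        low, high = pbounds[name]
        # One uniform draw inside each stratum of the unit interval
        unit_samples = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(unit_samples)
        columns[name] = [low + u * (high - low) for u in unit_samples]
    return [{name: columns[name][i] for name in names} for i in range(n_points)]

points = lhs_sketch({"a": (0.0, 10.0), "b": (-1.0, 1.0)}, n_points=5, random_state=0)
```

With five points, each parameter receives exactly one value in each fifth of its range, which is what gives LHS better coverage than independent uniform draws.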
- get_graph_style()
Retrieves the graph style for regression plots
- get_prediction_results(model_data, y, y_pred_all)
Calculate metrics based on y and y_pred
- get_scoring_key(problem_type, error_type)
Load scoring function for evaluating models
- graph_clas(self, Xy_data, params_dict, set_type, path_n_suffix, csv_test=False, print_fun=True)
Plot a confusion matrix with the prediction vs actual values
- graph_reg(self, Xy_data, params_dict, set_types, path_n_suffix, graph_style, csv_test=False, print_fun=True, sd_graph=False)
Plot regression graphs of predicted vs actual values for train, validation and test sets
- graph_title(self, csv_test, sd_graph, error_bars)
Retrieves the corresponding graph title.
- graph_vars(Xy_data, set_types, csv_test, path_n_suffix, sd_graph)
Set axis limits for regression plots and PATH to save the graphs
- k_means(self, X_scaled, csv_y, size, seed, idx_list)
Uses k-means clustering to select the test points to be as diverse as possible, but it returns the test points. Returns the data points that will be used as training set based on the k-means clustering
- kfold_cv(y_global, y_pred_global, y_pred_global_test, y_pred_global_external, model_data, loaded_model, Xy_data, random_state, BO_opt=False, shuffle=True, kfold_cv_type='repeated')
Performs a k-fold CV. Uses StratifiedKFold for classification problems to maintain class distribution
- load_database(self, csv_load, module, print_info=True, external_test=False)
Loads either an Xy (params=False) or a parameter (params=True) database from a CSV file
- load_db_n_params(self, params_dir, suffix, suffix_title, module, print_load)
Loads the parameters and Xy databases from a folder, adds scaled X data and prints information about the databases
- load_dfs(self, folder_model, module, sanity_check=False, print_info=True)
Loads the parameters and Xy databases from the GENERATE folder as dataframes
- load_from_yaml(self)
Loads the parameters for the calculation from a yaml if specified. Otherwise does nothing.
- load_minimal_model(model)
Load the parameters of the minimalist models used for RFECV
- load_model(self, model_name, **params)
Load models with their corresponding parameters.
- load_n_predict(self, model_data, Xy_data, BO_opt=False, verify_job=False)
Load model and calculate errors/precision and predicted values of the ML models
- load_params(self, path_csv)
Load parameters from a CSV and adjust the format
- load_print(self, params_name, suffix, model_data, Xy_data)
Print information of the database loaded and type of model used
- load_variables(kwargs, robert_module)
Load default and user-defined variables
- locate_csv(self, csv_input, curate_valid)
Assesses whether the input CSV databases can be located
- mcc_scorer_clf(y_true, y_pred)
Forces classification predictions to integer for MCC.
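As a rough illustration of why predictions are cast to integers before computing the Matthews correlation coefficient (MCC), the sketch below implements the binary MCC formula directly. The helper name `mcc_binary` and the use of `int()` for the cast are assumptions; the module's scorer may delegate to a library implementation instead.

```python
import math

def mcc_binary(y_true, y_pred):
    """Binary MCC after casting both inputs to int (e.g. 1.0 -> 1)."""
    y_true = [int(v) for v in y_true]
    y_pred = [int(v) for v in y_pred]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Predictions stored as floats are compared as integer classes
mcc_value = mcc_binary([1, 1, 0, 0], [1.0, 0.0, 0.0, 0.0])
```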
- missing_inputs(self, module, print_err=False)
Gives the option to input missing variables in the terminal
- model_adjust_params(self, model_name, params)
Add seed and convert parameters to integers, since they come as floats with decimals in the iterations
- outlier_analysis(print_outliers, outliers_data, outliers_set)
Analyzes the outlier results
- outlier_filter(self, Xy_data, name_points)
Calculates and stores absolute errors in SD units for all the sets
- outlier_plot(self, Xy_data, path_n_suffix, name_points, graph_style)
Plots and prints the results of the outlier analysis
- pearson_map(self, csv_df_pearson, module, params_dir=None)
Creates Pearson heatmap
- plot_metrics(model_data, suffix_title, verify_metrics, verify_results)
Creates a plot with the results of the flawed models in VERIFY
- plot_quartiles(y_combined, ax)
Plot histogram, quartile lines and the points in each quartile.
- plot_y_count(y_combined, ax)
Plot a bar plot with the count of each y type.
- prepare_sets(self, csv_df, csv_X, csv_y, test_points, column_names, csv_external_df, csv_X_external, csv_y_external, BO_opt=False)
Standardizes the data and separates the test set
- repeated_kfold_cv(model_data, loaded_model, Xy_data, BO_opt)
Performs a repeated k-fold cross-validation on the Xy dataset
- sanity_checks(self, type_checks, module, columns_csv)
Check that different variables are set correctly
- scale_df(csv_X, csv_X_external)
Scale the X matrix for the training set and the external test set (if any)
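One common way to scale a training set together with an external test set is to compute the statistics on the training data only and reuse them for the external data. The sketch below assumes that convention and zero-mean, unit-variance standardization; both are assumptions, and the helper names are hypothetical.

```python
import statistics

def fit_standardizer(column):
    """Return the mean and population standard deviation of one descriptor."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # avoid division by zero for constant descriptors
    return mean, std

def apply_standardizer(column, mean, std):
    """Scale values with previously fitted statistics."""
    return [(value - mean) / std for value in column]

train_column = [1.0, 2.0, 3.0]
external_column = [2.0, 4.0]

mean, std = fit_standardizer(train_column)             # fitted on training data only
train_scaled = apply_standardizer(train_column, mean, std)
external_scaled = apply_standardizer(external_column, mean, std)
```

Reusing the training statistics keeps the external set on the same scale as the data the model was fitted on.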
- scoring_n_score(self, model_data, Xy_data, loaded_model)
Get scoring system and score of the original model with CV
Build hidden layer structure from provided parameters
- shap_analysis(self, Xy_data, model_data, path_n_suffix)
Plots and prints the results of the SHAP analysis
- sort_n_load(Xy_data)
Sort Xy data values to enhance reproducibility in cases where same databases are loaded with different row order, ensuring stable sorting across OS with kind='stable'.
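The goal stated for sort_n_load, getting the same row order regardless of how the database was originally ordered, can be illustrated with Python's built-in `sorted`, which is a stable sort. The helper `canonical_order` and the choice of sorting on every listed column are assumptions for this sketch; the module itself refers to `kind='stable'`, which suggests a dataframe-level sort.

```python
def canonical_order(rows, columns):
    """Sort rows on all listed columns so the result ignores the input order."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))

rows_a = [{"y": 2, "x": 1}, {"y": 1, "x": 5}, {"y": 1, "x": 3}]
rows_b = list(reversed(rows_a))  # same database, different row order

sorted_a = canonical_order(rows_a, ["y", "x"])
sorted_b = canonical_order(rows_b, ["y", "x"])
```

Sorting on `y` alone would leave the two rows with `y = 1` in their input order, so a secondary key (here `x`) is what makes the result independent of the original ordering.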
- sorted_kfold_cv(loaded_model, model_data, Xy_data, error_labels)
Performs a sorted k-fold cross-validation on the Xy dataset. Returns the average of the two results
- test_select(self, X_scaled, csv_y)
Selection of test set (if any)