utils

BO_iteration(self, bo_data, Xy_data, **params): Evaluate a model with given parameters using cross-validation. Returns the mean negative root mean squared error (higher is better).

BO_metrics(self, bo_data, Xy_data): Get combined score for repeated k-fold and top-bottom sorted CVs (used in BO)

class Logger(filein, append, suffix='dat')

Class that wraps a file object to abstract the logging.

finalize(): Closes the file

write(message)

Appends a newline character to the message and writes it into the file.

Parameters

message (str): Text to be written in the log file.
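
A minimal sketch of a file-wrapping logger exposing the same write and finalize methods. The class name SketchLogger and the way filein, append and suffix are combined into a file name (filein_append.suffix) are assumptions made for illustration; the actual Logger may build its path differently.

```python
import os
import tempfile


class SketchLogger:
    """Hypothetical stand-in for Logger: wraps a file object for logging."""

    def __init__(self, filein, append, suffix="dat"):
        # Assumed naming scheme: <filein>_<append>.<suffix>
        self.path = f"{filein}_{append}.{suffix}"
        self.log = open(self.path, "w", encoding="utf-8")

    def write(self, message):
        # Append a newline character to the message and write it to the file
        self.log.write(f"{message}\n")

    def finalize(self):
        # Close the underlying file
        self.log.close()


tmp_dir = tempfile.mkdtemp()
logger = SketchLogger(os.path.join(tmp_dir, "ROBERT"), "data")
logger.write("Starting calculation")
logger.write("Done")
logger.finalize()

with open(logger.path, encoding="utf-8") as handle:
    logged_lines = handle.read().splitlines()
```

Calling finalize() once at the end of a run makes sure buffered messages reach the disk before the file is read back.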

PFI_filter(self, Xy_data, model_data): Performs the PFI calculation and returns a list of the descriptors that are not important

PFI_plot(self, Xy_data, model_data, path_n_suffix): Plots and prints the results of the PFI analysis

Xy_split(csv_df, csv_X, X_scaled_df, csv_y, csv_external_df, csv_X_external, X_scaled_external_df, csv_y_external, test_points, column_names): Returns a dictionary with the database divided into train and validation

categorical_transform(self, csv_df, module): Converts all columns containing strings into categorical values (one-hot encoding by default; can be set to numerical 1, 2, 3... with categorical = True). Troubleshooting: for one-hot encoding, do not use values that are also column headers. For example, if the descriptor "C_atom" contains C2 as a value but C2 is already the header of a different column in the database, the resulting column names collide. The same applies when multiple columns contain the same values.
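
The header collision described in the troubleshooting note can be reproduced with a small sketch. It assumes one-hot columns are named after the raw values with no prefix (emulated here with pandas get_dummies and empty prefix arguments); the DataFrame contents are invented, and the real function may name its columns differently.

```python
import pandas as pd

# Hypothetical database: "C2" is a numeric descriptor, "C_atom" holds strings
df = pd.DataFrame({"C2": [0.1, 0.2, 0.3], "C_atom": ["C1", "C2", "C1"]})

# One-hot encoding without a prefix: each new column is named after a value
one_hot = pd.get_dummies(df, columns=["C_atom"], prefix="", prefix_sep="")
has_duplicates = bool(one_hot.columns.duplicated().any())

# Numerical alternative: map each distinct string to 1, 2, 3...
numeric_codes = (df["C_atom"].astype("category").cat.codes + 1).tolist()
```

Here the value C2 produces a second column called C2, so has_duplicates is True. Renaming either the value or the existing header avoids the clash.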

check_clas_problem(self, csv_df): Changes type to classification if there are only two different y values. Automatically converts any pair of values (strings or numbers) to 0 and 1. Stores the original labels for later reconversion in outputs.
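
A minimal sketch of the two-label conversion, assuming labels are sorted to make the 0/1 assignment deterministic. The helper name binarize_labels and the sorting rule are hypothetical; only the idea of mapping two distinct values to 0 and 1 and keeping the original labels comes from the description above.

```python
def binarize_labels(y_values):
    """Map exactly two distinct labels to 0 and 1; otherwise leave the data as is."""
    # Sorted by string form so the mapping is deterministic (assumption)
    labels = sorted(set(y_values), key=str)
    if len(labels) != 2:
        return list(y_values), None  # not a two-class problem
    label_to_int = {label: idx for idx, label in enumerate(labels)}
    int_to_label = {idx: label for label, idx in label_to_int.items()}
    return [label_to_int[value] for value in y_values], int_to_label


y = ["active", "inactive", "active", "active"]
y_binary, original_labels = binarize_labels(y)

# The stored mapping allows reconverting predictions to the original labels
y_restored = [original_labels[value] for value in y_binary]
```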

check_csv_option(self, csv_option, print_err): Checks missing values in input CSV options

command_line_args(exe_type, sys_args): Load default and user-defined arguments specified through command lines. Arguments are loaded as a dictionary

correct_hidden_layers(params): Correct for a problem with the 'hidden_layer_sizes' parameter when loading arrays from JSON

correlation_filter(self, csv_df)

Discards a) correlated variables and b) variables that do not correlate with the y values, based on R**2 values, and c) reduces the number of descriptors to one third of the datapoints using RFECV.

REPRODUCIBILITY GUARANTEES:

- Columns are sorted alphabetically before any operation
- Rows are sorted by y value to ensure consistent ordering
- Tie-breaking in correlations uses alphabetical order
- RFECV descriptor selection uses sorted feature importances with alphabetical tie-breaking
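
Steps a) and b) can be sketched with NumPy as follows. The function name, the thresholds thres_x and thres_y, and the rule of keeping the descriptor that correlates better with y are assumptions for illustration; step c) (RFECV) is omitted.

```python
import numpy as np


def sketch_correlation_filter(X, y, names, thres_x=0.7, thres_y=0.001):
    """Return the descriptor names kept after two R**2-based filters."""
    # Process columns alphabetically so the result does not depend on input order
    order = sorted(range(len(names)), key=lambda i: names[i])
    r2_y = {names[i]: np.corrcoef(X[:, i], y)[0, 1] ** 2 for i in order}

    # b) drop descriptors that do not correlate with y
    kept = [i for i in order if r2_y[names[i]] >= thres_y]

    # a) for each correlated pair, drop the descriptor with the lower R**2 vs y
    discarded = set()
    for pos, i in enumerate(kept):
        for j in kept[pos + 1:]:
            if i in discarded or j in discarded:
                continue
            r2_pair = np.corrcoef(X[:, i], X[:, j])[0, 1] ** 2
            if r2_pair > thres_x:
                # On ties the alphabetically first name (i) is kept
                discarded.add(j if r2_y[names[i]] >= r2_y[names[j]] else i)
    return [names[i] for i in kept if i not in discarded]


# Hypothetical data: b_dup nearly duplicates a_size, noise is uncorrelated with y
names = ["noise", "c_charge", "b_dup", "a_size"]
X = np.array([
    [1, -1, 1, -1, 1, -1],   # noise
    [0, 1, 0, 1, 2, 1],      # c_charge
    [1, 2, 3, 4, 5, 7],      # b_dup
    [1, 2, 3, 4, 5, 6],      # a_size
], dtype=float).T
y = np.array([3, 3, 7, 7, 11, 11], dtype=float)

selected = sketch_correlation_filter(X, y, names)
```

Because the columns are visited in alphabetical order, shuffling the input columns leaves the selected descriptors unchanged.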

create_heatmap(self, csv_df, suffix, path_raw): Graph the heatmap

detect_outliers(self, outliers_scaled, name_points, naming_detect, set_type): Detects and stores outliers with their corresponding datapoint names

dict_formating(dict_csv): Adapt format of dictionaries that come from dataframes loaded from CSV

distribution_plot(self, Xy_data, path_n_suffix, params_dict): Plots histogram (reg) or bin plot (clas).

format_lists(value): Transforms strings into a list

generate_lhs_points(pbounds, n_points, random_state=None)

Generate initial points using Latin Hypercube Sampling for better space coverage. LHS ensures uniform distribution across all dimensions of the hyperparameter space.

Args:

pbounds: Dictionary with parameter bounds from BO_hyperparams.
n_points: Number of initial points to generate.
random_state: Random seed for reproducibility.

Returns:

List of dictionaries with parameter values.
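
A minimal pure-Python sketch of Latin Hypercube Sampling with the same inputs and output shape. It assumes pbounds maps each parameter name to a (low, high) tuple; the function name and the example bounds are hypothetical, and the library may rely on a different sampler.

```python
import random


def sketch_lhs_points(pbounds, n_points, random_state=None):
    """Latin Hypercube Sampling: one point per stratum in every dimension."""
    rng = random.Random(random_state)
    points = [{} for _ in range(n_points)]
    for name in sorted(pbounds):  # fixed parameter order for reproducibility
        low, high = pbounds[name]
        # One uniform draw inside each of n_points equal-width strata of [0, 1)
        unit_samples = [(stratum + rng.random()) / n_points for stratum in range(n_points)]
        rng.shuffle(unit_samples)  # decouple the strata across dimensions
        for point, unit in zip(points, unit_samples):
            point[name] = low + unit * (high - low)
    return points


bounds = {"n_estimators": (10, 100), "max_depth": (2, 12)}  # hypothetical bounds
initial_points = sketch_lhs_points(bounds, n_points=5, random_state=0)
```

Each parameter range is split into n_points intervals and every interval is used exactly once, which is what gives LHS better coverage than independent uniform draws.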

get_graph_style(): Retrieves the graph style for regression plots

get_prediction_results(model_data, y, y_pred_all): Calculate metrics based on y and y_pred

get_scoring_key(problem_type, error_type): Load scoring function for evaluating models

graph_clas(self, Xy_data, params_dict, set_type, path_n_suffix, csv_test=False, print_fun=True): Plot a confusion matrix with the prediction vs actual values

graph_reg(self, Xy_data, params_dict, set_types, path_n_suffix, graph_style, csv_test=False, print_fun=True, sd_graph=False): Plot regression graphs of predicted vs actual values for train, validation and test sets

graph_title(self, csv_test, sd_graph, error_bars): Retrieves the corresponding graph title.

graph_vars(Xy_data, set_types, csv_test, path_n_suffix, sd_graph): Set axis limits for regression plots and PATH to save the graphs

k_means(self, X_scaled, csv_y, size, seed, idx_list): Uses k-means clustering to select the test points to be as diverse as possible, but it returns the test points. Returns the data points that will be used as training set based on the k-means clustering

kfold_cv(y_global, y_pred_global, y_pred_global_test, y_pred_global_external, model_data, loaded_model, Xy_data, random_state, BO_opt=False, shuffle=True, kfold_cv_type='repeated'): Performs a k-fold CV. Uses StratifiedKFold for classification problems to maintain class distribution

load_database(self, csv_load, module, print_info=True, external_test=False): Loads either a Xy (params=False) or a parameter (params=True) database from a CSV file

load_db_n_params(self, params_dir, suffix, suffix_title, module, print_load): Loads the parameters and Xy databases from a folder, adds scaled X data and prints information about the databases

load_dfs(self, folder_model, module, sanity_check=False, print_info=True): Loads the parameters and Xy databases from the GENERATE folder as dataframes

load_from_yaml(self): Loads the parameters for the calculation from a yaml if specified. Otherwise does nothing.

load_minimal_model(model): Load the parameters of the minimalist models used for RFECV

load_model(self, model_name, **params): Load models with their corresponding parameters.

load_n_predict(self, model_data, Xy_data, BO_opt=False, verify_job=False): Load model and calculate errors/precision and predicted values of the ML models

load_params(self, path_csv): Load parameters from a CSV and adjust the format

load_print(self, params_name, suffix, model_data, Xy_data): Print information of the database loaded and type of model used

load_variables(kwargs, robert_module): Load default and user-defined variables

locate_csv(self, csv_input, curate_valid): Assesses whether the input CSV databases can be located

mcc_scorer_clf(y_true, y_pred): Forces classification predictions to integer for MCC.
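
A minimal sketch of the idea for the binary case: predictions are rounded to integers before the Matthews correlation coefficient is computed from the confusion counts. The function name and the use of rounding (rather than another integer cast) are assumptions.

```python
import math


def sketch_mcc(y_true, y_pred):
    """Binary MCC after forcing predictions to integers (0 or 1)."""
    y_pred_int = [int(round(value)) for value in y_pred]  # e.g. 1.0 -> 1
    pairs = list(zip(y_true, y_pred_int))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is zero
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]  # float predictions, as a model may return
mcc = sketch_mcc(y_true, y_pred)
```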

missing_inputs(self, module, print_err=False): Gives the option to input missing variables in the terminal

model_adjust_params(self, model_name, params): Add seed and convert parameters to integers, since they come as floats with decimals in the iterations
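
A minimal sketch of this adjustment, assuming a fixed list of integer-valued hyperparameters and a seed stored under random_state. The function name, the key names, and the choice of rounding over truncation are hypothetical.

```python
def sketch_adjust_params(params, seed, int_keys=("n_estimators", "max_depth")):
    """Return a copy of params with integer-valued keys cast to int and a seed added."""
    adjusted = dict(params)  # do not mutate the caller's dictionary
    for key in int_keys:
        if key in adjusted:
            # The optimizer proposes floats such as 87.6; the model expects integers
            adjusted[key] = int(round(adjusted[key]))
    adjusted["random_state"] = seed
    return adjusted


proposed = {"n_estimators": 87.6, "max_depth": 5.2, "max_features": 0.5}
model_params = sketch_adjust_params(proposed, seed=0)
```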

outlier_analysis(print_outliers, outliers_data, outliers_set): Analyzes the outlier results

outlier_filter(self, Xy_data, name_points): Calculates and stores absolute errors in SD units for all the sets
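
A minimal sketch of expressing absolute errors in SD units. It assumes the mean and standard deviation of the training errors serve as the reference for every set and that points beyond a threshold t_value are flagged; these choices, the helper name, and the data are assumptions, not the library's confirmed procedure.

```python
import statistics


def errors_in_sd_units(y_true, y_pred, ref_mean, ref_sd):
    """Absolute prediction errors expressed as multiples of a reference SD."""
    return [abs((pred - true - ref_mean) / ref_sd) for true, pred in zip(y_true, y_pred)]


y_train, pred_train = [1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5]
y_test, pred_test = [2.0, 3.0], [3.5, 3.0]

# Reference statistics come from the training errors (assumption)
train_errors = [pred - true for true, pred in zip(y_train, pred_train)]
ref_mean = statistics.mean(train_errors)
ref_sd = statistics.pstdev(train_errors)

train_scaled = errors_in_sd_units(y_train, pred_train, ref_mean, ref_sd)
test_scaled = errors_in_sd_units(y_test, pred_test, ref_mean, ref_sd)

t_value = 2  # hypothetical threshold, in SD units
test_outliers = [idx for idx, err in enumerate(test_scaled) if err > t_value]
```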

outlier_plot(self, Xy_data, path_n_suffix, name_points, graph_style): Plots and prints the results of the outlier analysis

pearson_map(self, csv_df_pearson, module, params_dir=None): Creates Pearson heatmap

plot_metrics(model_data, suffix_title, verify_metrics, verify_results): Creates a plot with the results of the flawed models in VERIFY

plot_quartiles(y_combined, ax): Plot histogram, quartile lines and the points in each quartile.

plot_y_count(y_combined, ax): Plot a bar plot with the count of each y type.

prepare_sets(self, csv_df, csv_X, csv_y, test_points, column_names, csv_external_df, csv_X_external, csv_y_external, BO_opt=False): Standardizes the data and separates the test set

repeated_kfold_cv(model_data, loaded_model, Xy_data, BO_opt): Performs a repeated k-fold cross-validation on the Xy dataset

sanity_checks(self, type_checks, module, columns_csv): Check that different variables are set correctly

scale_df(csv_X, csv_X_external): Scale the X matrix for the training set and the external test set (if any)
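
A minimal NumPy sketch of the scaling step, assuming standardization (zero mean, unit variance) where the statistics are computed on the training X matrix and reused for the external test set. The function name is hypothetical and the library may use a different scaler.

```python
import numpy as np


def sketch_scale(X_train, X_external=None):
    """Standardize X_train and apply the same transformation to X_external."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    X_train_scaled = (X_train - mean) / std
    # The external set reuses the training statistics so no information leaks from it
    X_external_scaled = None if X_external is None else (X_external - mean) / std
    return X_train_scaled, X_external_scaled


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_external = np.array([[5.0, 20.0]])
X_train_scaled, X_external_scaled = sketch_scale(X_train, X_external)
```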

scoring_n_score(self, model_data, Xy_data, loaded_model): Get scoring system and score of the original model with CV

setup_hidden_layers(params): Build hidden layer structure from provided parameters

shap_analysis(self, Xy_data, model_data, path_n_suffix): Plots and prints the results of the SHAP analysis

sort_n_load(Xy_data): Sort Xy data values to enhance reproducibility in cases where same databases are loaded with different row order, ensuring stable sorting across OS with kind='stable'.
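
The role of kind='stable' can be illustrated with NumPy's argsort. The arrays below are invented; the point is only that a stable algorithm keeps tied values in their input order, so the result does not depend on platform-specific sort behavior.

```python
import numpy as np

y = np.array([2.0, 1.0, 2.0, 1.0])
names = np.array(["c", "a", "d", "b"])

# Stable sort: rows with equal y keep their original relative order
order = np.argsort(y, kind="stable")
sorted_names = names[order].tolist()
sorted_y = y[order].tolist()
```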

sorted_kfold_cv(loaded_model, model_data, Xy_data, error_labels): Performs a sorted k-fold cross-validation on the Xy dataset. Returns the average of the two results

test_select(self, X_scaled, csv_y): Selection of test set (if any)