lrtree.discretization

Implements data preprocessing (merging of categorical levels and discretization of continuous features).

class lrtree.discretization.Processing(target: str, discretize: bool = False, merge_threshold: float = 0.2)

Encapsulates the information needed to discretize and/or merge features, and to reapply the same preprocessing to new data

Methods

fit(X, categorical)

Fits the preprocessing on X

fit_transform(X, categorical)

Calls fit and then transform

transform(X)

Preprocesses validation/test data in the same way as the previously seen training data

fit(X: DataFrame, categorical: list)

Fits the preprocessing on X

Parameters
  • X (pandas.DataFrame) – dataframe to merge and/or discretize (discretization only if self.discretize is set)

  • categorical (list) – list of categorical features’ names

fit_transform(X: DataFrame, categorical: list)

Calls fit and then transform

transform(X: DataFrame)

Preprocesses validation/test data in the same way as the previously seen training data
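
The fit / transform split described above can be illustrated with a toy, standard-library-only sketch. It is not lrtree's implementation: ToyDiscretizer is a hypothetical class, it works on plain lists rather than DataFrames, and it learns quantile cut values instead of the entropy-based ones used by this module. The point is only that cut values are estimated once on training data and then reapplied unchanged to new data.

```python
from bisect import bisect_right
from statistics import quantiles


class ToyDiscretizer:
    """Hypothetical stand-in for a fit/transform preprocessor (not lrtree's code)."""

    def __init__(self, n_bins=4):
        self.n_bins = n_bins
        self.cut_values = None  # learned in fit, reused in transform

    def fit(self, values):
        # Learn cut values on the training data only (quantiles, for simplicity).
        self.cut_values = quantiles(values, n=self.n_bins)
        return self

    def transform(self, values):
        if self.cut_values is None:
            raise RuntimeError("call fit before transform")
        # Reapply the stored cut values; nothing is re-estimated on new data.
        return [bisect_right(self.cut_values, v) for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
test = [0.5, 4.4, 9.0]

disc = ToyDiscretizer(n_bins=4)
train_bins = disc.fit_transform(train)  # bin index of each training value
test_bins = disc.transform(test)        # same cut values applied to new data
```

Values in the new data that fall outside the training range simply land in the first or last bin, which is why the cut values must come from the training data alone.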

Functions

apply_discretization(X, var, cut_values)

bin_data_cate_test(data_val, enc, categorical)

bin_data_cate_train(data, var_cible[, ...])

categorie_data_bin_train_test(data, data_val)

categorie_data_labels(data, data_val[, ...])

chi2_test(liste)

create_reduction_matrix(reductions, ...)

cut_points(x, y)

Computes the cut values on x that minimize the entropy of y

discretize_feature(X, var, var_predite)

Discretizes the continuous variable X[var] against the target var_predite, using MDLP (Minimum Description Length Principle)

entropy(variable)

Computes the entropy of the variable

extreme_values(data[, missing])

Deals with extreme values (e.g. NaN or unfilled entries). Optionally creates a column signaling which values were missing

find_cut_index(x, y)

Finds the best place to split x

get_index(xo, yo, low, upp, depth)

green_clust(X, var, var_predite, num_clusters)

GreenClust algorithm to group modalities (levels of a categorical variable)

grouping(X, var, var_predite[, seuil])

Groups modalities using chi-squared (Chi2) independence tests

part(xo, yo, low, upp, cut_points, depth)

Recursive function that returns the cut points

stopping_criterion(cut_idx, target, ent, depth)

Decides whether target should be cut at cut_idx, given the entropy anticipated after the cut

traitement_train(data, target[, ...])

Processes the data by handling extreme values and categorical variables, and by normalizing

traitement_train_val(X, X_val)

Processes the data and the test data by handling extreme values and categorical variables, and by normalizing. Returns the processed data and the column labels

traitement_val(data, enc, scaler, ...)

Processes the test data by handling extreme values and categorical variables, and by normalizing. Returns the processed data

update_reduction_matrix(df, var, ...)
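
The entropy-based helpers listed above (entropy, find_cut_index, stopping_criterion) can be sketched in plain Python. The code below is an illustrative re-implementation under stated assumptions, not lrtree's source: the function bodies are written from the one-line descriptions, inputs are plain lists, and mdlp_accepts is a hypothetical name for the standard Fayyad and Irani MDLP acceptance test, which is assumed (not confirmed) to be what the stopping criterion resembles.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def find_cut_index(x, y):
    """Index i such that cutting sorted x into x[:i] / x[i:] minimizes the
    size-weighted entropy of y; None if all x values are identical."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    best_idx, best_ent = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between identical values
        ent = (i / n) * entropy(ys[:i]) + ((n - i) / n) * entropy(ys[i:])
        if ent < best_ent:
            best_idx, best_ent = i, ent
    return best_idx


def mdlp_accepts(ys, i):
    """Fayyad-Irani MDLP test: accept the cut at index i of the sorted
    labels ys only if the information gain exceeds the coding cost."""
    n = len(ys)
    left, right = ys[:i], ys[i:]
    ent, ent_l, ent_r = entropy(ys), entropy(left), entropy(right)
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k_l, k_r = len(set(ys)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (log2(n - 1) + delta) / n


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]

idx = find_cut_index(x, y)
xs_sorted = sorted(x)
cut_value = (xs_sorted[idx - 1] + xs_sorted[idx]) / 2  # midpoint between neighbours
accepted = mdlp_accepts(y, idx)  # y is already ordered by x in this example
```

A recursive discretizer would repeat this on each side of an accepted cut and stop on a side as soon as the acceptance test fails there, which is the role part and stopping_criterion appear to play in this module.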

Classes

Processing(target[, discretize, merge_threshold])

Encapsulates the information needed to discretize and/or merge features, and to reapply the same preprocessing to new data