lrtree.discretization

Implements data preprocessing (merging and discretization).

class lrtree.discretization.Processing(target: str, discretize: bool = False, merge_threshold: float = 0.2)

Encapsulates the information needed to discretize / merge the training data and to reapply the same preprocessing to new data

Methods

`fit`(X, categorical)	Fits the preprocessing
`fit_transform`(X, categorical)	Calls fit and then transform
`transform`(X)	Preprocesses validation/test data in the same way as the previously seen training data

fit(X: DataFrame, categorical: list)

Fits the preprocessing

Parameters

X (pandas.DataFrame) – dataframe to merge and / or discretize (if self.discretize)
categorical (list) – list of categorical features’ names

fit_transform(X: DataFrame, categorical: list): Calls fit and then transform

transform(X: DataFrame): Preprocesses validation/test data in the same way as the previously seen training data
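
The fit / transform split described above can be illustrated with a small stand-in. This is only a sketch of the contract (learn the preprocessing on training data, then reapply it unchanged to new data). `MedianBinner` is a hypothetical class, not part of lrtree; it works on plain dicts of columns rather than a DataFrame, and its median cut is an assumption chosen for brevity, not the way lrtree computes its cut values.

```python
from bisect import bisect_right
from statistics import median


class MedianBinner:
    """Hypothetical stand-in, NOT lrtree's Processing: it only mirrors the
    fit / transform / fit_transform contract on plain dicts of columns."""

    def __init__(self):
        self.cut_values = {}  # cut values learned per column during fit

    def fit(self, X, categorical):
        # Learn one cut value (the median) per non-categorical column.
        # Illustration only; lrtree learns its cuts differently.
        for name, values in X.items():
            if name not in categorical:
                self.cut_values[name] = [median(values)]
        return self

    def transform(self, X):
        # Reapply the cuts learned in fit; nothing is re-learned on new data.
        out = {}
        for name, values in X.items():
            cuts = self.cut_values.get(name)
            if cuts is None:
                out[name] = list(values)  # column left untouched
            else:
                out[name] = [bisect_right(cuts, v) for v in values]
        return out

    def fit_transform(self, X, categorical):
        return self.fit(X, categorical).transform(X)


train = {"age": [20, 30, 40, 50], "city": ["a", "b", "a", "b"]}
test = {"age": [25, 60], "city": ["b", "a"]}

binner = MedianBinner()
train_binned = binner.fit_transform(train, categorical=["city"])
test_binned = binner.transform(test)  # uses the cut learned on train (35.0)
```

The point of the design is that validation/test data never influences the learned cut values: `transform` only reads state stored by `fit`.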

Functions

`apply_discretization`(X, var, cut_values)
`bin_data_cate_test`(data_val, enc, categorical)
`bin_data_cate_train`(data, var_cible[, ...])
`categorie_data_bin_train_test`(data, data_val)
`categorie_data_labels`(data, data_val[, ...])
`chi2_test`(liste)
`create_reduction_matrix`(reductions, ...)
`cut_points`(x, y)	Computes the cut values on x to minimize the entropy (on y)
`discretize_feature`(X, var, var_predite)	Discretizes the continuous variable X[var], using the target value var_predite and the MDLP (Minimum Description Length Principle) criterion
`entropy`(variable)	Computes the entropy of the variable
`extreme_values`(data[, missing])	Deals with extreme values (e.g. NaN, or not filled). Optionally creates a column signaling which values were missing
`find_cut_index`(x, y)	Finds the best place to split x
`get_index`(xo, yo, low, upp, depth)
`green_clust`(X, var, var_predite, num_clusters)	GreenClust algorithm to group modalities
`grouping`(X, var, var_predite[, seuil])	Chi2 independence algorithm to group modalities
`part`(xo, yo, low, upp, cut_points, depth)	Recursive function which returns the cut points
`stopping_criterion`(cut_idx, target, ent, depth)	Decides whether we should cut target at cut_idx, given that the new entropy would be ent
`traitement_train`(data, target[, ...])	Processes the data by handling extreme values and categorical variables, and by normalizing
`traitement_train_val`(X, X_val)	Processes the data and the test data by handling extreme values and categorical variables, and by normalizing. Returns the processed data and the column labels
`traitement_val`(data, enc, scaler, ...)	Processes the test data by handling extreme values and categorical variables, and by normalizing. Returns the processed data
`update_reduction_matrix`(df, var, ...)
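
As a rough illustration of what the entropy-based helpers listed above (`entropy`, `find_cut_index`, `stopping_criterion`) are described as doing, here is a minimal pure-Python sketch. The function names `shannon_entropy`, `best_cut_index` and `mdlp_accepts_cut` are hypothetical and their signatures do not match lrtree's. The acceptance test is the standard Fayyad-Irani MDLP criterion, which is assumed here to be the one the module refers to.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def best_cut_index(x, y):
    """Return (index, weighted entropy) of the split of sorted x that
    minimizes the size-weighted entropy of y on both sides.

    A split at index i separates the i smallest x values from the rest;
    positions between equal x values are skipped because no threshold
    can separate them.
    """
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    best_idx, best_ent = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue
        ent = (i / n) * shannon_entropy(ys[:i]) + ((n - i) / n) * shannon_entropy(ys[i:])
        if ent < best_ent:
            best_idx, best_ent = i, ent
    return best_idx, best_ent


def mdlp_accepts_cut(y_sorted, cut_idx):
    """Fayyad-Irani MDLP test: accept the cut only if the information
    gain exceeds the description-length cost of encoding the split."""
    n = len(y_sorted)
    left, right = y_sorted[:cut_idx], y_sorted[cut_idx:]
    ent = shannon_entropy(y_sorted)
    ent_l, ent_r = shannon_entropy(left), shannon_entropy(right)
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k_l, k_r = len(set(y_sorted)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (math.log2(n - 1) + delta) / n


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
idx, ent_after = best_cut_index(x, y)  # the split falls between x=4 and x=5
accepted = mdlp_accepts_cut(y, idx)    # a clean separation passes the MDLP test
```

A recursive discretizer would apply the same two steps to each side of an accepted cut and stop on a side as soon as the MDLP test rejects its best candidate.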

Classes

Processing(target[, discretize, merge_threshold])

Encapsulates the information needed to discretize / merge the training data and to reapply the same preprocessing to new data