glmdisc.Glmdisc
-
class glmdisc.Glmdisc(algorithm='SEM', test=True, validation=True, criterion='bic', m_start=20)
This class implements supervised multivariate discretization, factor-level grouping and interaction discovery for logistic regression.
-
test
Boolean specifying whether a test set is required. If True, 20% of the provided observations are held out as a test set and the reported performance is the Gini index on that test set.
- Type
bool
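As an illustration of the hold-out logic described for test (and, likewise, validation), here is a minimal sketch in plain Python. It assumes that each requested hold-out set takes 20% of the provided observations, so that enabling both leaves 60% for training. The function name `split_indices` and its `seed` argument are hypothetical and are not part of glmdisc, and the library's own splitting code may differ.

```python
import random


def split_indices(n, test=True, validation=True, seed=0):
    """Return (train, validation, test) index lists for n observations.

    Assumption: each requested hold-out set takes 20% of the n provided
    observations. This is an illustration, not glmdisc's own code.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    holdout = n // 5  # 20% of the observations
    test_idx = indices[:holdout] if test else []
    start = len(test_idx)
    val_idx = indices[start:start + holdout] if validation else []
    # Everything that was not held out is used for training.
    train_idx = indices[start + len(val_idx):]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

With `test=False` and `validation=False` the sketch returns all indices as training data, which corresponds to the setting where "bic" or "aic" is used as the criterion.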
-
validation
Boolean specifying whether a validation set is required. If True, 20% of the provided observations are held out as a validation set. The quality of the discretization at each step is then evaluated using the Gini index on the validation set, so criterion must be set to “gini”. If test=False, the reported performance is this Gini index on the validation set.
- Type
bool
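The Gini index mentioned above is commonly defined in credit scoring as 2 × AUC − 1, where AUC is the area under the ROC curve. The sketch below uses that usual definition in plain Python. It is an assumption that glmdisc computes the same quantity, and the helper name `gini_index` is hypothetical.

```python
def gini_index(labels, scores):
    """Gini index of a binary scorer, computed as 2 * AUC - 1."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # AUC is the probability that a random positive outscores a random
    # negative, with ties counted as one half.
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    auc = wins / (len(positives) * len(negatives))
    return 2.0 * auc - 1.0


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(gini_index(labels, scores))  # 0.5 (AUC is 0.75)
```

A value of 1 means the scores separate the two classes perfectly, and 0 corresponds to random ranking. The pairwise loop is quadratic in the sample size, which is fine for a small validation set but not for large data.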
-
criterion
The criterion used to assess the goodness of fit of the discretization: “bic” or “aic” if no validation set is used, otherwise “gini”.
- Type
str
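For reference, the textbook definitions of the two information criteria named above can be sketched as follows for a logistic regression with predicted probabilities. The function names are hypothetical, and the sign convention (lower is better here) is the standard one, while the library may report these values with a different sign or scaling.

```python
import math


def log_likelihood(labels, probabilities):
    """Bernoulli log-likelihood of binary labels given P(y = 1)."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(labels, probabilities)
    )


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * loglik


example_labels = [1, 0, 1, 0]
example_probabilities = [0.5, 0.5, 0.5, 0.5]
example_loglik = log_likelihood(example_labels, example_probabilities)
print(aic(example_loglik, n_params=2))
print(bic(example_loglik, n_params=2, n_obs=len(example_labels)))
```

Both criteria penalize the number of parameters, which in this setting grows with the number of intervals and factor levels kept by the discretization. BIC penalizes each parameter more heavily than AIC once there are at least eight observations, since ln(8) > 2.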
-
iter
Number of MCMC steps to perform. More steps generally give better results, but it may be wiser to run several MCMC chains instead. Computation time can increase dramatically with this value.
- Type
int
-
m_start
Number of initial discretization intervals for all variables. If m_start is larger than the number of factor levels of a given variable in predictors_qual, m_start is set (for this variable only) to that variable’s number of factor levels.
- Type
int
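The per-variable cap on m_start described above amounts to taking a minimum. The sketch below illustrates it on categorical columns given as plain Python lists. The helper name `initial_levels` is hypothetical and this is not the library's code.

```python
def initial_levels(m_start, qualitative_columns):
    """Number of initial groups per categorical column.

    Each column starts with m_start groups unless it has fewer distinct
    factor levels, in which case that number of levels is used instead.
    """
    return [min(m_start, len(set(column))) for column in qualitative_columns]


columns = [
    ["a", "b", "c", "a"],                    # 3 distinct levels
    ["level_%d" % i for i in range(30)],     # 30 distinct levels
]
print(initial_levels(20, columns))  # [3, 20]
```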
-
best_link
The best link function between the original features and their quantized counterparts, which makes it possible to quantize new data after learning.
- Type
-
best_reglog
The best logistic regression on quantized data found with best_link.
-
affectations
The label encoder of each original feature.
-
best_encoder_emap
The label encoder of each of the best_link.
- Type
list
-
performance
The best value of the chosen criterion obtained.
- Type
Methods
Returns the best quantization found by the MCMC and prints it.
Returns the best discrete data (train, validation or test) found by the MCMC.
discretize(predictors_cont, predictors_qual): Discretizes new continuous and categorical features using a previously fitted glmdisc object.
discretize_dummy(predictors_cont, …): Discretizes new continuous and categorical features using a previously fitted glmdisc object as dummy variables usable with the best_reglog object.
fit(predictors_cont, predictors_qual, labels): Fits the Glmdisc object.
generate_data(n, d[, theta, plot]): Generates some toy continuous data that gets discretized, and a label is drawn from a logistic regression given the discretized features.
plot([predictors_cont_number, …]): Plots the stepwise function associating the continuous features to their discretization, the groupings made and the interactions.
predict(predictors_cont, predictors_qual): Predicts the label values with new continuous and categorical features using a previously fitted glmdisc object.
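To make the idea behind discretize and discretize_dummy concrete, here is a minimal, library-independent sketch. It maps a continuous feature to interval indices using fixed cut points, then expands those indices into dummy (one-hot) variables. The cut points and helper names are hypothetical. In glmdisc the mapping is learned during fit and stored in best_link, whose internal format is not described on this page.

```python
from bisect import bisect_right


def discretize_column(values, cut_points):
    """Map each value to the index of the interval it falls in.

    cut_points must be sorted. With cut points c_0 < c_1 < ..., interval 0
    is (-inf, c_0), interval 1 is [c_0, c_1), and so on.
    """
    return [bisect_right(cut_points, v) for v in values]


def to_dummies(levels, n_levels):
    """One-hot encode interval indices as rows of 0/1 indicators."""
    return [[1 if level == j else 0 for j in range(n_levels)] for level in levels]


cut_points = [0.3, 0.7]  # hypothetical cut points, giving three intervals
values = [0.1, 0.3, 0.5, 0.9]
levels = discretize_column(values, cut_points)
print(levels)                                     # [0, 1, 1, 2]
print(to_dummies(levels, len(cut_points) + 1))
```

The resulting function from values to levels is a step function, which is the shape the plot method is described as drawing for continuous features.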
-
__init__(algorithm='SEM', test=True, validation=True, criterion='bic', m_start=20)
Initializes self by checking that its arguments are appropriately specified.
- Parameters
algorithm (str) – Algorithm to use (“SEM” or “NN”).
test (bool) – Boolean specifying whether a test set is required. If True, 20% of the provided observations are held out as a test set and the reported performance is the Gini index on that test set.
validation (bool) – Boolean specifying whether a validation set is required. If True, 20% of the provided observations are held out as a validation set. The quality of the discretization at each step is then evaluated using the Gini index on the validation set, so criterion must be set to “gini”. If test=False, the reported performance is this Gini index on the validation set.
criterion (str) – The criterion used to assess the goodness of fit of the discretization: “bic” or “aic” if no validation set is used, otherwise “gini”.
iter (int) – Number of MCMC steps to perform. More steps generally give better results, but it may be wiser to run several MCMC chains instead. Computation time can increase dramatically with this value. Defaults to 100.
m_start (int) – Number of initial discretization intervals for all variables. If m_start is larger than the number of factor levels of a given variable in predictors_qual, m_start is set (for this variable only) to that variable’s number of factor levels. Defaults to 20.
Todo
Handle a try/catch for warm start?