glmdisc.Glmdisc

class glmdisc.Glmdisc(algorithm='SEM', test=True, validation=True, criterion='bic', m_start=20)

This class implements supervised multivariate discretization, factor-level grouping, and interaction discovery for logistic regression.

test

Whether to hold out a test set. If True, 20% of the provided observations are set aside as a test set, and the reported performance is the Gini index on the test set.

Type: bool

validation

Whether to hold out a validation set. If True, 20% of the provided observations are set aside as a validation set, and the reported performance is the Gini index on the validation set (when test=False). The quality of the discretization at each step is then evaluated using the Gini index on the validation set, so criterion must be set to “gini”.

Type: bool
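
The Gini index mentioned above is commonly defined in credit scoring as 2 × AUC − 1. The sketch below computes it by pairwise comparison, assuming that definition applies here. The function name `gini_index` is illustrative and is not part of glmdisc.

```python
def gini_index(labels, scores):
    """Gini index = 2 * AUC - 1, with AUC computed by pairwise comparison.

    Assumption: labels are 0/1 and a higher score means class 1 is more likely.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; ties count for one half.
    concordant = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    auc = concordant / (len(positives) * len(negatives))
    return 2.0 * auc - 1.0


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(gini_index(labels, scores))  # 3 of 4 pairs are concordant, so AUC = 0.75 and Gini = 0.5
```

A Gini index of 1 means perfect ranking, and 0 means the scores carry no ranking information.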

criterion

The criterion used to assess the goodness of fit of the discretization: “bic” or “aic” when no validation set is used, otherwise “gini”.

Type: str
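
For reference, the textbook definitions of the two information criteria can be sketched as follows. This uses the standard formulas AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L. The sign convention the library uses internally is not stated on this page, and the helper names are illustrative.

```python
import math


def log_likelihood(labels, probabilities):
    """Bernoulli log-likelihood of 0/1 labels under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(labels, probabilities)
    )


def aic(log_lik, n_params):
    # AIC = 2k - 2 ln(L); lower is better under this sign convention.
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik, n_params, n_obs):
    # BIC = k ln(n) - 2 ln(L); the penalty exceeds AIC's once n_obs > e^2 (about 7.4).
    return n_params * math.log(n_obs) - 2.0 * log_lik


labels = [0, 1, 1, 0]
probabilities = [0.2, 0.7, 0.9, 0.4]
ll = log_likelihood(labels, probabilities)
print(aic(ll, n_params=2), bic(ll, n_params=2, n_obs=len(labels)))
```

Both criteria trade goodness of fit against the number of coefficients, which grows with the number of intervals and interactions kept.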

iter

Number of MCMC steps to perform. More steps generally give better results, but running several MCMC chains may be preferable. Computation time can increase dramatically with the number of steps.

Type: int

m_start

Number of initial discretization intervals for all variables. If m_start is greater than the number of factor levels of a given variable in predictors_qual, m_start is set (for this variable only) to that variable’s number of factor levels.

Type: int
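
The capping rule for categorical variables can be illustrated with a small sketch. `initial_levels` is a hypothetical helper, not a glmdisc function, and each inner list stands for the observed values of one categorical variable.

```python
def initial_levels(m_start, qualitative_columns):
    """Return the effective number of starting levels for each categorical column.

    Hypothetical helper: caps m_start at the number of distinct factor levels.
    """
    return [min(m_start, len(set(column))) for column in qualitative_columns]


predictors_qual = [
    ["a", "b", "c"] * 8 + ["a", "b"],    # 26 observations, 3 distinct levels
    list("abcdefghijklmnopqrstuvwxyz"),  # 26 observations, 26 distinct levels
]
print(initial_levels(20, predictors_qual))  # [3, 20]
```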

criterion_iter

The value of the optimized criterion at each iteration.

Type: list

best_link

The best link functions between the original features and their quantized counterparts, used to quantize new data after fitting.

Type: list

best_reglog

The best logistic regression on quantized data found with best_link.

Type: sklearn.linear_model.LogisticRegression

affectations

The label encoder of each original feature.

Type: list

best_encoder_emap

The label encoder for each link function in best_link.

Type: list

performance

The best value of the criterion obtained.

Type: list

splitting

The row indices corresponding to the splits.

Type: list
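
As an illustration of what such a split could look like, the sketch below shuffles row indices and carves out the optional 20% test and validation sets described above. It assumes each held-out set takes 20% of all observations. The function `split_rows` and its return structure are hypothetical and do not reflect the library's internal code.

```python
import random


def split_rows(n_obs, test=True, validation=True, fraction=0.2, seed=0):
    """Shuffle row indices and carve out optional test and validation sets.

    Assumption: each held-out set takes `fraction` of all observations.
    """
    rows = list(range(n_obs))
    random.Random(seed).shuffle(rows)
    held_out = int(round(fraction * n_obs))
    splits = {}
    if test:
        splits["test"], rows = rows[:held_out], rows[held_out:]
    if validation:
        splits["validation"], rows = rows[:held_out], rows[held_out:]
    splits["train"] = rows  # whatever remains is used for training
    return splits


splits = split_rows(100)
print({name: len(indices) for name, indices in splits.items()})
```

With both flags set, 100 observations yield 20 test rows, 20 validation rows, and 60 training rows.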

Methods

`best_formula`()	Returns the best quantization found by the MCMC and prints it.
`discrete_data`()	Returns the best discrete data (train, validation or test) found by the MCMC.
`discretize`(predictors_cont, predictors_qual)	Discretizes new continuous and categorical features using a previously fitted glmdisc object.
`discretize_dummy`(predictors_cont, …)	Discretizes new continuous and categorical features using a previously fitted glmdisc object and returns them as dummy variables usable with the best_reglog object.
`fit`(predictors_cont, predictors_qual, labels)	Fits the Glmdisc object.
`generate_data`(n, d[, theta, plot])	Generates toy continuous data, discretizes it, and draws a label from a logistic regression on the discretized features.
`plot`([predictors_cont_number, …])	Plots the stepwise function associating the continuous features to their discretization, the groupings made and the interactions.
`predict`(predictors_cont, predictors_qual)	Predicts the label values with new continuous and categorical features using a previously fitted glmdisc object.
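
To make the ideas behind `discretize` and `discretize_dummy` concrete, here is a minimal sketch of a stepwise mapping from a continuous value to an interval index, followed by a dummy (one-hot) encoding. It uses fixed, hypothetical cut points, whereas the fitted object learns its own link functions. The helper names are illustrative, and the library's actual dummy encoding may differ, for instance by dropping a reference level.

```python
from bisect import bisect_right


def discretize_value(value, cut_points):
    """Map a continuous value to the index of its interval.

    cut_points must be sorted; k cut points define k + 1 intervals.
    """
    return bisect_right(cut_points, value)


def to_dummies(level, n_levels):
    """One-hot encode an interval index as a list of 0/1 indicators."""
    return [1 if i == level else 0 for i in range(n_levels)]


cut_points = [0.3, 0.7]  # hypothetical cut points for one feature
values = [0.1, 0.3, 0.5, 0.95]
levels = [discretize_value(v, cut_points) for v in values]
print(levels)  # [0, 1, 1, 2]
print([to_dummies(level, 3) for level in levels])
```

With `bisect_right`, a value equal to a cut point falls into the interval to its right, so the intervals here are (−∞, 0.3), [0.3, 0.7), and [0.7, +∞).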

__init__(algorithm='SEM', test=True, validation=True, criterion='bic', m_start=20)

Initializes self by checking if its arguments are appropriately specified.

Parameters

algorithm (str) – Algorithm to use (“SEM” or “NN”).
test (bool) – Whether to hold out a test set. If True, 20% of the provided observations are set aside as a test set, and the reported performance is the Gini index on the test set.
validation (bool) – Whether to hold out a validation set. If True, 20% of the provided observations are set aside as a validation set, and the reported performance is the Gini index on the validation set (when test=False). The quality of the discretization at each step is then evaluated using the Gini index on the validation set, so criterion must be set to “gini”.
criterion (str) – The criterion used to assess the goodness of fit of the discretization: “bic” or “aic” when no validation set is used, otherwise “gini”.
iter (int) – Number of MCMC steps to perform. More steps generally give better results, but running several MCMC chains may be preferable. Computation time can increase dramatically with the number of steps. Defaults to 100.
m_start (int) – Number of initial discretization intervals for all variables. If m_start is greater than the number of factor levels of a given variable in predictors_qual, m_start is set (for this variable only) to that variable’s number of factor levels. Defaults to 20.

Todo

Handle a try/except for warm start?
