glmdisc.Glmdisc

class glmdisc.Glmdisc(algorithm='SEM', test=True, validation=True, criterion='bic', m_start=20)

This class implements supervised multivariate discretization, grouping of factor levels, and interaction discovery for logistic regression.

test

Boolean specifying whether a test set is required. If True, the provided data is split so that 20% of the observations form a test set, and the reported performance is the Gini index on the test set.

Type

bool

validation

Boolean specifying whether a validation set is required. If True, the provided data is split so that 20% of the observations form a validation set, and the reported performance is the Gini index on the validation set (if test=False). The quality of the discretization at each step is then evaluated using the Gini index on the validation set, so criterion must be set to “gini”.

Type

bool
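The split described by test and validation can be pictured with a minimal sketch. It assumes a uniformly random split with a fixed seed; the helper name three_way_split and that sampling scheme are illustrative assumptions, not the library's documented behaviour.

```python
import numpy as np


def three_way_split(n, test=True, validation=True, seed=0):
    """Return (train, validation, test) row-index arrays for n observations.

    Hypothetical helper: 20% of the rows go to each requested hold-out set,
    the remainder is used for training.
    """
    rng = np.random.default_rng(seed)
    rows = rng.permutation(n)           # shuffled row indices 0..n-1
    n_test = n // 5 if test else 0      # 20% for the test set, if requested
    n_val = n // 5 if validation else 0  # 20% for the validation set, if requested
    test_rows = rows[:n_test]
    val_rows = rows[n_test:n_test + n_val]
    train_rows = rows[n_test + n_val:]
    return train_rows, val_rows, test_rows


train_rows, val_rows, test_rows = three_way_split(100)
print(len(train_rows), len(val_rows), len(test_rows))
```

With both flags set to True, 60% of the rows remain for training; with both set to False, every row is used for training.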

criterion

The criterion to be used to assess the goodness-of-fit of the discretization: “bic” or “aic” if no validation set, else “gini”.

Type

str
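For reference, the three criteria can be computed as in the sketch below. It assumes the textbook definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, Gini = 2 AUC - 1); the function names are illustrative, and the sign convention the library uses internally is not stated here.

```python
import numpy as np


def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def aic(loglik, n_params):
    """Akaike information criterion (lower is better in this convention)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (lower is better in this convention)."""
    return n_params * np.log(n_obs) - 2 * loglik


def gini(y, scores):
    """Gini index as 2 * AUC - 1, with the AUC computed over all
    (positive, negative) pairs; ties count for one half."""
    y = np.asarray(y)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y == 1]
    neg = scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    return float(2 * auc - 1)


y = [0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.9]
ll = log_likelihood(y, p)
print(aic(ll, n_params=2), bic(ll, n_params=2, n_obs=len(y)), gini(y, p))
```

AIC and BIC penalize the number of parameters and need no held-out data, whereas the Gini index is a ranking measure that is only meaningful on observations not used for fitting, which matches the rule that “gini” goes with a validation set.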

iter

Number of MCMC steps to perform. More steps generally help, but computation time can increase dramatically, so running several shorter MCMC chains may be a better choice.

Type

int

m_start

Number of initial discretization intervals for all variables. If m_start is bigger than the number of factor levels for a given variable in predictors_qual, m_start is set (for this variable only) to this variable’s number of factor levels.

Type

int
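The capping rule for categorical features can be written as a short sketch. The helper effective_m_start and the toy columns are illustrative assumptions; only the rule itself (m_start is reduced to the number of factor levels when it exceeds it) comes from the description above.

```python
import numpy as np


def effective_m_start(m_start, predictors_qual):
    """Return, for each categorical column, min(m_start, number of levels).

    Hypothetical helper: predictors_qual is a 2-D array with one
    categorical feature per column.
    """
    predictors_qual = np.asarray(predictors_qual)
    n_levels = [len(np.unique(predictors_qual[:, j]))
                for j in range(predictors_qual.shape[1])]
    return [min(m_start, k) for k in n_levels]


# Toy data: the first column has 3 levels, the second has 25 levels.
col_a = np.arange(50) % 3
col_b = np.arange(50) % 25
toy_qual = np.column_stack([col_a, col_b])

print(effective_m_start(20, toy_qual))
```

Here the first feature starts with 3 groups (one per level), while the second starts with the requested 20.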

criterion_iter

The values of the optimized criterion over the iterations.

Type

list

best_link

The best link function between the original features and their quantized counterparts, which makes it possible to quantize new data after learning.

Type

list

best_reglog

The best logistic regression on quantized data found with best_link.

Type

sklearn.linear_model.LogisticRegression

affectations

The label encoder of each original feature.

Type

list

best_encoder_emap

The label encoder of each link function in best_link.

Type

list

performance

The best value of the chosen criterion obtained.

Type

list

splitting

The row indices corresponding to the splits.

Type

list

Methods

best_formula()

Returns the best quantization found by the MCMC and prints it.

discrete_data()

Returns the best discrete data (train, validation or test) found by the MCMC.

discretize(predictors_cont, predictors_qual)

Discretizes new continuous and categorical features using a previously fitted glmdisc object.

discretize_dummy(predictors_cont, …)

Discretizes new continuous and categorical features using a previously fitted glmdisc object and returns them as dummy variables usable with the best_reglog object.

fit(predictors_cont, predictors_qual, labels)

Fits the Glmdisc object.

generate_data(n, d[, theta, plot])

Generates some toy continuous data that gets discretized, and a label is drawn from a logistic regression given the discretized features.

plot([predictors_cont_number, …])

Plots the stepwise function associating the continuous features to their discretization, the groupings made and the interactions.

predict(predictors_cont, predictors_qual)

Predicts the label values with new continuous and categorical features using a previously fitted glmdisc object.
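The data-generating process described for generate_data, together with the dummy encoding produced by discretize_dummy, can be mimicked with NumPy alone. This is a sketch under stated assumptions, not the library's implementation: the uniform features, the cut points at 1/3 and 2/3, the normally distributed theta and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_bins = 200, 2, 3          # observations, features, intervals (assumed)

# 1. Continuous features, here drawn uniformly on [0, 1].
x = rng.uniform(size=(n, d))

# 2. Discretize every feature with the same (assumed) cut points.
cuts = np.array([1 / 3, 2 / 3])
q = np.digitize(x, cuts)          # interval index 0, 1 or 2 for each entry

# 3. Dummy (one-hot) encoding: n_bins indicator columns per feature.
dummies = np.concatenate([np.eye(n_bins)[q[:, j]] for j in range(d)], axis=1)

# 4. Draw labels from a logistic regression on the discretized features.
theta = rng.normal(size=d * n_bins)          # one coefficient per interval
prob = 1.0 / (1.0 + np.exp(-dummies @ theta))
y = rng.binomial(1, prob)

print(dummies.shape, y.mean())
```

Because the labels depend on x only through the interval indices, a method that recovers the cut points can in principle fit this data with a plain logistic regression on the dummy variables.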

__init__(algorithm='SEM', test=True, validation=True, criterion='bic', m_start=20)

Initializes self by checking if its arguments are appropriately specified.

Parameters
  • algorithm (str) – Algorithm to use (SEM or NN).

  • test (bool) – Boolean specifying whether a test set is required. If True, the provided data is split so that 20% of the observations form a test set, and the reported performance is the Gini index on the test set.

  • validation (bool) – Boolean specifying whether a validation set is required. If True, the provided data is split so that 20% of the observations form a validation set, and the reported performance is the Gini index on the validation set (if test=False). The quality of the discretization at each step is then evaluated using the Gini index on the validation set, so criterion must be set to “gini”.

  • criterion (str) – The criterion to be used to assess the goodness-of-fit of the discretization: “bic” or “aic” if no validation set, else “gini”.

  • iter (int) – Number of MCMC steps to perform. More steps generally help, but computation time can increase dramatically, so running several shorter MCMC chains may be a better choice. Defaults to 100.

  • m_start (int) – Number of initial discretization intervals for all variables. If m_start is bigger than the number of factor levels for a given variable in predictors_qual, m_start is set (for this variable only) to this variable’s number of factor levels. Defaults to 20.
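The rule linking validation and criterion in the parameter list can be expressed as a small consistency check. The function criterion_is_consistent is a hypothetical helper written for illustration; it is not part of glmdisc and says nothing about how the class itself reacts to an inconsistent pair.

```python
def criterion_is_consistent(validation, criterion):
    """Check a (validation, criterion) pair against the documented rule:
    "gini" with a validation set, "aic" or "bic" without one.

    Hypothetical helper, not the library's own validation code.
    """
    criterion = criterion.lower()
    if criterion not in {"gini", "aic", "bic"}:
        raise ValueError("criterion must be 'gini', 'aic' or 'bic'")
    if validation:
        return criterion == "gini"
    return criterion in {"aic", "bic"}


print(criterion_is_consistent(True, "gini"))
print(criterion_is_consistent(False, "bic"))
print(criterion_is_consistent(True, "bic"))
```

Under this reading, the default arguments in the signature (validation=True, criterion='bic') do not satisfy the rule; the text here does not say how the class resolves that case.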

Todo

Handle a try/except for warm start?
