lrtree

This module is dedicated to logistic regression trees.

class lrtree.Lrtree(algo: str = 'SEM', test: bool = False, validation: bool = False, criterion: str = 'bic', ratios: tuple = (0.7,), class_num: int = 10, max_iter: int = 100, data_treatment: bool = False, discretization: bool = False, leaves_as_segment: bool = False, early_stopping=False, burn_in: int = 30)

The class implements a supervised method based on logistic regression trees. Its attributes:

  • test (bool) – Specifies whether a test set is required. If True, 20% of the provided observations are held out as a test set and the reported performance is the Gini index on that test set.

  • validation (bool) – Specifies whether a validation set is required. If True, 20% of the provided observations are held out as a validation set and, when test is False, the reported performance is the Gini index on the validation set. The quality of the model at each step is evaluated using the Gini index on the validation set, so criterion must be set to “gini”.

  • criterion (str) – The criterion used to assess the goodness-of-fit of the model: “bic” or “aic” when there is no validation set, “gini” otherwise.

  • max_iter (int) – Number of MCMC steps to perform. More iterations generally improve the result, but running several separate MCMC chains may be preferable; computation time can increase dramatically.

  • num_clas (int) – Number of initial segments.

  • criterion_iter (list) – The values of the optimized criterion over the iterations.

  • best_link (sklearn.tree.DecisionTreeClassifier) – The best decision tree.

  • best_reglog (list) – The list of the best logistic regressions, one per segment (corresponding to best_link).

  • ratios (tuple) – The float ratios used to split the dataset into test and validation sets.
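
A minimal construction sketch, based only on the signature shown above; the keyword values are illustrative and the variable name model is arbitrary:

    from lrtree import Lrtree

    # With a validation set, model selection is driven by the Gini index,
    # so criterion must be set to "gini" (see the validation attribute above).
    model = Lrtree(
        algo="SEM",          # default algorithm
        validation=True,     # hold out 20% of the observations for validation
        criterion="gini",    # required when validation=True
        class_num=10,        # number of initial segments
        max_iter=100,        # number of MCMC steps
    )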

Methods

fit(X, y[, solver, nb_init, tree_depth, ...])

Fits the Lrtree object.

generate_data(n, d[, seed, theta])

Generates some toy continuous data that gets discretized, and a label is drawn from a logistic regression given the discretized features.

precision(X_test, y_test)

Scores the precision of the prediction on the test set.

predict(X)

Predicts the labels for new values using a previously fitted Lrtree object.

predict_proba(X)

Predicts the probability of the labels for new values using a previously fitted Lrtree object.

fit(X, y, solver: str = 'lbfgs', nb_init: int = 1, tree_depth: int = 10, min_impurity_decrease: float = 0.0, optimal_size: bool = True, tol: float = 0.005, categorical=None)

Fits the Lrtree object.

Parameters
  • X (numpy.ndarray) – array_like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (numpy.ndarray) – Boolean (0/1) labels of the observations. Must be of the same length as X (numpy “numeric” array).

  • solver (str) – sklearn’s solver for LogisticRegression (default ‘lbfgs’)

  • nb_init (int) – Number of different random initializations

  • tree_depth (int) – Maximum depth of the tree used

  • min_impurity_decrease (float) – Parameter used to split (or not) the decision Tree

  • optimal_size (bool) – Whether to use the provided tree parameters or to select the optimal tree instead (only used with a validation set)

  • tol (float) – Tolerance on the observed improvement of the criterion, used for early stopping

  • categorical (list) – List of names of categorical features
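
A hedged sketch of a fit call on synthetic NumPy arrays; the data, the variable names and the hyperparameter values are illustrative assumptions rather than package defaults or recommendations:

    import numpy as np
    from lrtree import Lrtree

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))        # (n_samples, n_features) design matrix
    y = rng.integers(0, 2, size=1000)     # binary 0/1 labels, same length as X

    model = Lrtree(criterion="bic", class_num=4, max_iter=80)
    model.fit(
        X,
        y,
        solver="lbfgs",               # forwarded to sklearn's LogisticRegression
        nb_init=2,                    # number of random initializations
        tree_depth=4,                 # maximum depth of the candidate trees
        min_impurity_decrease=0.0,    # splitting threshold of the decision tree
        tol=0.005,                    # early-stopping tolerance
    )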

static generate_data(n: int, d: int, seed=None, theta: Optional[ndarray] = None) → Tuple[ndarray, ndarray, ndarray, float]

Generates some toy continuous data that gets discretized, and a label is drawn from a logistic regression given the discretized features.

Parameters
  • n (int) – Number of observations to draw.

  • d (int) – Number of features to draw.

  • theta (numpy.ndarray) – Logistic regression coefficients to use (if None, a default coefficient vector is used).

  • seed (int) – numpy random seed

Returns

The generated data X and y, the coefficient vector theta, and the BIC of the generating model.

Return type

numpy.ndarray, numpy.ndarray, numpy.ndarray, float
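
A short sketch of the static toy-data helper, following the documented signature; the sizes and the seed are arbitrary:

    from lrtree import Lrtree

    # Draw 500 observations with 3 continuous features; theta=None lets the
    # helper fall back to its default coefficient vector.
    X, y, theta, bic = Lrtree.generate_data(n=500, d=3, seed=42)

    print(X.shape, y.shape)   # inspect the generated design matrix and labels
    print(theta)              # coefficients of the generating logistic regression
    print(bic)                # BIC of the generating model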

precision(X_test: ndarray, y_test: ndarray) → float

Scores the precision of the prediction on the test set (see the end-to-end sketch after predict_proba below).

Parameters
  • X_test (numpy.ndarray) – array_like of shape (n_samples, n_features). Vector used to predict the values of y.

  • y_test (numpy.ndarray) – array_like of shape (n_samples, 1). Vector of the true values to be predicted.

Returns

precision

Return type

float

predict(X: ndarray) → ndarray

Predicts the labels for new values using a previously fitted Lrtree object.

Parameters

X (numpy.ndarray) – array_like of shape (n_samples, n_features). Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

predict_proba(X: ndarray) → ndarray

Predicts the probability of the labels for new values using a previously fitted Lrtree object.

Parameters

X (numpy.ndarray) – array_like of shape (n_samples, n_features). Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
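
An end-to-end sketch tying predict, predict_proba and precision together; it reuses the generate_data helper documented above, and the hyperparameters and array names are illustrative assumptions:

    from lrtree import Lrtree

    # Toy training set, keeping the generating coefficients theta.
    X_train, y_train, theta, _ = Lrtree.generate_data(n=800, d=3, seed=1)
    # Held-out data drawn with the same coefficients.
    X_new, y_new, _, _ = Lrtree.generate_data(n=200, d=3, seed=2, theta=theta)

    model = Lrtree(criterion="bic", class_num=4, max_iter=50)
    model.fit(X_train, y_train, tree_depth=3)

    labels = model.predict(X_new)         # hard 0/1 labels
    scores = model.predict_proba(X_new)   # predicted label probabilities
    print("precision:", model.precision(X_new, y_new))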

Classes

Lrtree([algo, test, validation, criterion, ...])

The class implements a supervised method based on logistic regression trees.

Exceptions

NotFittedError

Exception class to raise if estimator is used before fitting.
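
A small sketch of this guard, assuming NotFittedError is importable from the lrtree package as listed here:

    import numpy as np
    from lrtree import Lrtree, NotFittedError

    model = Lrtree()
    try:
        model.predict(np.zeros((5, 3)))   # predicting before fit should raise
    except NotFittedError:
        print("call fit(X, y) before predict, predict_proba or precision")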

Modules

lrtree.discretization

Implements data preprocessing (merging and discretization).

lrtree.fit

fit module for the Lrtree class.

lrtree.generate_data

generate_data module for the Lrtree class: generating some data to test the algorithm on.

lrtree.logreg

Implements segment-specific, possibly single-class, logistic regression.

lrtree.predict

Predict, predict_proba and precision methods for the Lrtree class.