lrtree

This module is dedicated to logistic regression trees.

class lrtree.Lrtree(algo: str = 'SEM', test: bool = False, validation: bool = False, criterion: str = 'bic', ratios: tuple = (0.7,), class_num: int = 10, max_iter: int = 100, data_treatment: bool = False, discretization: bool = False, leaves_as_segment: bool = False, early_stopping=False, burn_in: int = 30)

The class implements a supervised method based on logistic regression trees. Its attributes:

  • test (bool) – Specifies whether a test set is required. If True, 20% of the provided observations are held out as a test set and the reported performance is the Gini index on that test set.

  • validation (bool) – Specifies whether a validation set is required. If True, 20% of the provided observations are held out as a validation set and, when test is False, the reported performance is the Gini index on the validation set. The quality of the model at each step is evaluated using the Gini index on the validation set, so criterion must be set to “gini”.

  • criterion (str) – The criterion used to assess the goodness-of-fit of the model: “bic” or “aic” when there is no validation set, “gini” otherwise.

  • max_iter (int) – Number of MCMC steps to perform. More iterations generally improve the result, but running several separate MCMC chains may be preferable; computation time can increase dramatically.

  • num_clas (int) – Number of initial segments.

  • criterion_iter (list) – The values of the optimized criterion over the iterations.

  • best_link (sklearn.tree.DecisionTreeClassifier) – The best decision tree.

  • best_reglog (list) – The list of the best logistic regressions, one per segment (corresponding to best_link).

  • ratios (tuple) – The float ratios used to split the dataset into test and validation sets.
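
A minimal construction sketch, based only on the signature shown above; the keyword values are illustrative and the variable name model is arbitrary:

    from lrtree import Lrtree

    # With a validation set, model selection is driven by the Gini index,
    # so criterion must be set to "gini" (see the validation attribute above).
    model = Lrtree(
        algo="SEM",          # default algorithm
        validation=True,     # hold out 20% of the observations for validation
        criterion="gini",    # required when validation=True
        class_num=10,        # number of initial segments
        max_iter=100,        # number of MCMC steps
    )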

Methods

fit(X, y[, solver, nb_init, tree_depth, ...])

Fits the Lrtree object.

generate_data(n, d[, seed, theta])

Generates some toy continuous data that gets discretized, and a label is drawn from a logistic regression given the discretized features.

precision(X_test, y_test)

Scores the precision of the prediction on the test set.

predict(X)

Predicts the labels for new values using a previously fitted Lrtree object.

predict_proba(X)

Predicts the probability of the labels for new values using a previously fitted Lrtree object.

fit(X, y, solver: str = 'lbfgs', nb_init: int = 1, tree_depth: int = 10, min_impurity_decrease: float = 0.0, optimal_size: bool = True, tol: float = 0.005, categorical=None)

Fits the Lrtree object.

Parameters
  • X (numpy.ndarray) – array_like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (numpy.ndarray) – Boolean (0/1) labels of the observations. Must be of the same length as X (numpy “numeric” array).

  • solver (str) – sklearn’s solver for LogisticRegression (default ‘lbfgs’)

  • nb_init (int) – Number of different random initializations

  • tree_depth (int) – Maximum depth of the tree used

  • min_impurity_decrease (float) – Parameter used to split (or not) the decision Tree

  • optimal_size (bool) – Whether to use the provided tree parameters or to select the optimal tree instead (only used with a validation set)

  • tol (float) – Tolerance on the observed improvement of the criterion, used for early stopping

  • categorical (list) – List of names of categorical features
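
A hedged sketch of a fit call on synthetic NumPy arrays; the data, the variable names and the hyperparameter values are illustrative assumptions rather than package defaults or recommendations:

    import numpy as np
    from lrtree import Lrtree

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))        # (n_samples, n_features) design matrix
    y = rng.integers(0, 2, size=1000)     # binary 0/1 labels, same length as X

    model = Lrtree(criterion="bic", class_num=4, max_iter=80)
    model.fit(
        X,
        y,
        solver="lbfgs",               # forwarded to sklearn's LogisticRegression
        nb_init=2,                    # number of random initializations
        tree_depth=4,                 # maximum depth of the candidate trees
        min_impurity_decrease=0.0,    # splitting threshold of the decision tree
        tol=0.005,                    # early-stopping tolerance
    )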

static generate_data(n: int, d: int, seed=None, theta: Optional[ndarray] = None) → Tuple[ndarray, ndarray, ndarray, float]

Generates some toy continuous data that gets discretized, and a label is drawn from a logistic regression given the discretized features.

Parameters
  • n (int) – Number of observations to draw.

  • d (int) – Number of features to draw.

  • theta (numpy.ndarray) – Logistic regression coefficients to use (if None, a default coefficient vector is used).

  • seed (int) – numpy random seed

Returns

The generated data X and y, the coefficient vector theta, and the BIC of the generating model.

Return type

numpy.ndarray, numpy.ndarray, numpy.ndarray, float
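
A short sketch of the static toy-data helper, following the documented signature; the sizes and the seed are arbitrary:

    from lrtree import Lrtree

    # Draw 500 observations with 3 continuous features; theta=None lets the
    # helper fall back to its default coefficient vector.
    X, y, theta, bic = Lrtree.generate_data(n=500, d=3, seed=42)

    print(X.shape, y.shape)   # inspect the generated design matrix and labels
    print(theta)              # coefficients of the generating logistic regression
    print(bic)                # BIC of the generating model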

precision(X_test: ndarray, y_test: ndarray) → float

Scores the precision of the prediction on the test set (see the end-to-end sketch after predict_proba below).

Parameters
  • X_test (numpy.ndarray) – array_like of shape (n_samples, n_features). Vector used to predict the values of y.

  • y_test (numpy.ndarray) – array_like of shape (n_samples, 1). Vector of the true values to be predicted.

Returns

precision

Return type

float

predict(X: ndarray) → ndarray

Predicts the labels for new values using a previously fitted Lrtree object.

Parameters

X (numpy.ndarray) – array_like of shape (n_samples, n_features). Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

predict_proba(X: ndarray) → ndarray

Predicts the probability of the labels for new values using a previously fitted Lrtree object.

Parameters

X (numpy.ndarray) – array_like of shape (n_samples, n_features). Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
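
An end-to-end sketch tying predict, predict_proba and precision together; it reuses the generate_data helper documented above, and the hyperparameters and array names are illustrative assumptions:

    from lrtree import Lrtree

    # Toy training set, keeping the generating coefficients theta.
    X_train, y_train, theta, _ = Lrtree.generate_data(n=800, d=3, seed=1)
    # Held-out data drawn with the same coefficients.
    X_new, y_new, _, _ = Lrtree.generate_data(n=200, d=3, seed=2, theta=theta)

    model = Lrtree(criterion="bic", class_num=4, max_iter=50)
    model.fit(X_train, y_train, tree_depth=3)

    labels = model.predict(X_new)         # hard 0/1 labels
    scores = model.predict_proba(X_new)   # predicted label probabilities
    print("precision:", model.precision(X_new, y_new))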

Classes

Lrtree([algo, test, validation, criterion, ...])

The class implements a supervised method based on logistic regression trees.

Exceptions

NotFittedError

Exception class to raise if estimator is used before fitting.
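
A small sketch of this guard, assuming NotFittedError is importable from the lrtree package as listed here:

    import numpy as np
    from lrtree import Lrtree, NotFittedError

    model = Lrtree()
    try:
        model.predict(np.zeros((5, 3)))   # predicting before fit should raise
    except NotFittedError:
        print("call fit(X, y) before predict, predict_proba or precision")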

Modules

lrtree.discretization

Implements data preprocessing (merging and discretization).

lrtree.fit

fit module for the Lrtree class.

lrtree.generate_data

generate_data module for the Lrtree class: generating some data to test the algorithm on.

lrtree.logreg

Implements segment-specific, possibly single-class, logistic regression.

lrtree.predict

Predict, predict_proba and precision methods for the Lrtree class.