Welcome to lrtree’s documentation!

lrtree

This module is dedicated to logistic regression trees

Logistic regression trees

Motivation

The goal of lrtree is to build decision trees with logistic regressions at their leaves. The resulting model combines a non-parametric, stepwise approach (the tree) with a parametric, linear one (the logistic regressions), aiming for the best predictive performance while remaining interpretable.

This is the implementation of glmtree as described in Formalization and study of statistical problems in Credit Scoring, Ehrhardt A. (see manuscript or web article)

Getting started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

This code is supported on Python 3.8, 3.9, 3.10.
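
As a quick sanity check before installing, the version constraint above can be tested from Python. This is only a sketch: the helper name is_supported is hypothetical and not part of lrtree, and other Python versions may still work even though they are not listed as supported.

```python
import sys

# Versions listed as supported in this documentation.
SUPPORTED_VERSIONS = {(3, 8), (3, 9), (3, 10)}


def is_supported(version_info=None):
    """Return True when the (major, minor) pair is one of the listed versions."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) in SUPPORTED_VERSIONS


if not is_supported():
    print("Warning: this Python version is not in the list of supported versions.")
```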

Installing the package

Installing the development version

If git is installed on your machine, you can use:

pipenv install git+https://github.com/adimajo/lrtree.git

If git is not installed, you can also use:

pipenv install --upgrade https://github.com/adimajo/lrtree/archive/master.tar.gz

Installing through the pip command

You can install a stable version from PyPI by using:

pip install lrtree

To run the provided scripts, lrtree-consistency and lrtree-realdata, you need a few additional dependencies:

pip install lrtree[scripts]

Installation guide for Anaconda

The installation with the pip or pipenv command should work. If not, please raise an issue.

For people behind proxy(ies)…

A lot of people, myself included, work behind a corporate proxy…

A simple solution to get the package is to use the --proxy option of pip:

pip --proxy=http://username:password@server:port install lrtree

where username, password, server and port should be replaced by your own values.

If the environment variables http_proxy and/or https_proxy (or, unfortunately depending on the application, their upper-case variants HTTP_PROXY and HTTPS_PROXY) are set, pip should pick up the proxy settings automatically.
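
Two details of the proxy setup are easy to get wrong: special characters in the credentials must be percent-encoded inside the URL, and the variables may be defined in either spelling. The sketch below uses only the standard library. The function names and the example host proxy.example.com are hypothetical and only serve as illustration.

```python
import os
from urllib.parse import quote


def build_proxy_url(username, password, server, port, scheme="http"):
    """Return scheme://username:password@server:port with encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{server}:{port}"


def find_proxy_settings(environ=None):
    """Return the proxy variables that are set, checking both spellings."""
    environ = os.environ if environ is None else environ
    names = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")
    return {name: environ[name] for name in names if environ.get(name)}


# Placeholder credentials: '@' and ':' in the password get percent-encoded.
url = build_proxy_url("jdoe", "p@ss:word", "proxy.example.com", 3128)
print(url)
print(find_proxy_settings())
```

The resulting URL is what would be passed to the --proxy option of pip.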

Over the years, I’ve found CNTLM to be a great tool in this regard.

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This research has been financed by Crédit Agricole Consumer Finance through a CIFRE PhD.

This research was supported by Inria Lille - Nord-Europe and Lille University as part of a PhD.

References

Ehrhardt, A. (2019), Formalization and study of statistical problems in Credit Scoring: Reject inference, discretization and pairwise interactions, logistic regression trees (PhD thesis).

Contribute

You can clone this project using:

git clone https://github.com/adimajo/lrtree.git

You can install all dependencies, including development dependencies, using (note that this command requires pipenv which can be installed by typing pip install pipenv):

pipenv install -d

You can build the documentation by going into the docs directory and typing make html.

You can run the tests by typing coverage run -m pytest, which relies on packages coverage and pytest.

To run the tests in several environments (one per Python version), install pyenv (see its installation instructions), then install every version you want to test (see tox.ini), e.g. with pyenv install 3.7.0. Next, run pipenv run pyenv local 3.7.0 [...] (listing all the other versions as well), followed by pipenv run tox.

Python Environment

The project uses pipenv to manage its virtual environment and dependencies.

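To download all the project dependencies so that they can be ported to a machine with limited internet access, use the command pipenv lock -r > requirements.txt, which converts the Pipfile into a requirements.txt file. Recent pipenv releases removed the -r option of pipenv lock; with those, pipenv requirements > requirements.txt produces the same kind of file.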

Installation

To create a virtual environment with all the necessary dependencies, use the pipenv install command for production use or pipenv install -d for development use.

Tests

The tests are based on pytest and are stored in the tests folder. They can all be launched with the command pytest at the root of the project. Test coverage can be measured with the coverage package, which then also launches the tests: the command to use is coverage run -m pytest. A graphical summary can be generated as an HTML page with the coverage html command, which creates or updates the htmlcov folder; open its index.html file to view the report.

Utilization

The package provides a scikit-learn-like interface.

Loading sample data for a binary classification task:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Binary target (0/1), as expected by a logistic regression tree with class_num=2
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

The trained model consists of a fitted sklearn.tree.DecisionTreeClassifier, which segments the data, and a Python list of sklearn.linear_model.LogisticRegression models, one for each leaf of the tree.
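
To make this structure concrete, here is a minimal sketch built with scikit-learn alone: a shallow tree segments the data, then one logistic regression is fitted per leaf. It does not reproduce lrtree's fitting procedure or its internal attributes. The names leaf_models and predict, the synthetic data and the hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)

# Step 1: a shallow tree segments the data; min_samples_leaf keeps leaves large.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50, random_state=0).fit(X, y)
leaf_of = tree.apply(X)  # index of the leaf each sample falls into

# Step 2: one logistic regression per leaf, fitted on that leaf's samples only.
leaf_models = {}
for leaf in np.unique(leaf_of):
    mask = leaf_of == leaf
    if np.unique(y[mask]).size == 2:
        leaf_models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    else:
        # A pure leaf cannot support a logistic regression: keep its single class.
        leaf_models[leaf] = int(y[mask][0])


def predict(X_new):
    """Route each row to its leaf, then predict with that leaf's model."""
    X_new = np.asarray(X_new)
    leaves = tree.apply(X_new)
    out = np.empty(len(X_new), dtype=int)
    for leaf in np.unique(leaves):
        mask = leaves == leaf
        model = leaf_models[leaf]
        out[mask] = model if isinstance(model, int) else model.predict(X_new[mask])
    return out


predictions = predict(X)
print("training accuracy:", (predictions == y).mean())
```

Each segment keeps its own readable set of coefficients, which is where the interpretability of this kind of model comes from.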

The snippet to train the model and make a prediction:

from lrtree import Lrtree

model = Lrtree(criterion="bic", ratios=(0.7,), class_num=2, max_iter=100)

# Fitting the model
model.fit(X_train, y_train)

# Make a prediction on a fitted model
model.predict(X_test)

If you installed the additional dependencies for scripts, you can also run directly from the command line:

LOGURU_LEVEL="ERROR" DEBUG="True" lrtree-consistency

or

LOGURU_LEVEL="ERROR" TQDM_DISABLE="1" lrtree-realdata

Beware: if you don’t set LOGURU_LEVEL, it defaults to DEBUG, which produces a lot of output. Also, both scripts take a very long time to complete: lrtree-consistency tests the consistency of the method for various hyperparameters, and lrtree-realdata runs cross-validation on 3 real datasets.
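
The same environment variables can also be set when launching the scripts from Python, for instance from a scheduler. The sketch below uses only the standard library. The helpers quiet_env and run_script are hypothetical names, and the self-check runs a tiny child process rather than the long-running scripts themselves.

```python
import os
import subprocess
import sys


def quiet_env(level="ERROR", **extra):
    """Copy the current environment, overriding LOGURU_LEVEL and any extras."""
    env = dict(os.environ)
    env["LOGURU_LEVEL"] = level
    env.update(extra)
    return env


def run_script(command, **env_overrides):
    """Run a command with the adjusted environment and return its exit code."""
    return subprocess.run(command, env=quiet_env(**env_overrides), check=False).returncode


# Self-check: the child exits with 0 only if it sees LOGURU_LEVEL=ERROR.
check = [sys.executable, "-c",
         "import os, sys; sys.exit(os.environ['LOGURU_LEVEL'] != 'ERROR')"]
exit_code = run_script(check)
print("child exit code:", exit_code)
```

With the scripts installed, run_script(["lrtree-realdata"], TQDM_DISABLE="1") would mirror the second command line above.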