This function discretizes a training set using the ChiMerge method with the user-provided parameters, then selects the best of the resulting discretization schemes based on a user-provided criterion and, possibly, a test set.

chiM_iter(
predictors,
labels,
test = FALSE,
validation = FALSE,
proportions = c(0.3, 0.3),
criterion = "gini",
param = list(alpha = 0.05)
)

## Arguments

- `predictors`: The matrix containing the numeric attributes to discretize.
- `labels`: The actual labels of the provided predictors (0/1).
- `test`: Boolean: TRUE if the algorithm should use `predictors` to construct a test set on which to search for the best discretization scheme using the provided criterion (default: FALSE).
- `validation`: Boolean: TRUE if the algorithm should use `predictors` to construct a validation set on which to compute the provided criterion for the best discretization scheme, chosen via the provided criterion on either the test set (if `test` is TRUE) or the training set (otherwise) (default: FALSE).
- `proportions`: The vector of the two proportions wanted for the test and validation sets. Only the first is used when only one of `test` and `validation` is TRUE; an error is produced when their sum is greater than one; ignored when both `test` and `validation` are FALSE (default: c(0.3, 0.3)).
- `criterion`: The criterion ('gini', 'aic', 'bic') used to choose the best discretization scheme among the generated ones (default: 'gini'). Nota bene: it is best to use 'gini' only when `test` is TRUE, and 'aic' or 'bic' when it is not. When 'aic' or 'bic' is used with a test set, the likelihood is returned instead, as there is no need to penalize for generalization purposes.
- `param`: List providing the parameters to test (see ?discretization::chiM; default: list(alpha = 0.05)).
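One plausible reading of `proportions = c(0.3, 0.3)` is a random partition of the rows into test, validation, and training sets. This is only a sketch of that interpretation; the package may shuffle and split differently internally.

```r
set.seed(123)
n <- 100                      # number of rows in predictors
proportions <- c(0.3, 0.3)    # shares for the test and validation sets
idx <- sample(n)              # random permutation of the row indices
n_test <- floor(proportions[1] * n)
n_val  <- floor(proportions[2] * n)
test_rows  <- idx[seq_len(n_test)]
val_rows   <- idx[n_test + seq_len(n_val)]
train_rows <- idx[-seq_len(n_test + n_val)]
c(train = length(train_rows), test = length(test_rows), val = length(val_rows))
```

With both `test` and `validation` set to TRUE, 40% of the rows would remain for training under this split.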

## Details

This function discretizes a dataset containing continuous features $$X$$ in a supervised way, i.e. knowing observations of a binomial random variable $$Y$$ which we would like to predict based on the discretization of $$X$$. To do so, the ChiMerge algorithm starts by putting each unique value of $$X$$ in a separate level of the "discretized" categorical feature $$E$$. It then tests whether two adjacent levels of $$E$$ are significantly different using the $$\chi^2$$-test, and merges them when they are not. In the context of Credit Scoring, a logistic regression is fitted between the "discretized" features $$E$$ and the response feature $$Y$$. Consequently, the output of this function is the discretized features $$E$$, the logistic regression model of $$E$$ on $$Y$$, and the parameters used to obtain this fit.
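The merging decision at the heart of ChiMerge can be illustrated with base R (a toy sketch, not the package's internal code): build a contingency table of two adjacent intervals of $$X$$ against $$Y$$ and test it with `chisq.test`.

```r
set.seed(1)
x <- runif(200)
y <- rbinom(200, 1, ifelse(x < 0.5, 0.2, 0.8))  # Y depends on X only through x < 0.5
interval <- cut(x, c(0, 0.25, 0.5, 1))          # three provisional intervals of X
tab <- table(interval, y)[1:2, ]                # two leftmost adjacent intervals vs Y
# A large p-value means the class distributions of the two intervals are not
# significantly different, so ChiMerge would merge them:
chisq.test(tab)$p.value
```

Here the two leftmost intervals share the same true success probability, so the test should usually not reject and the intervals would be merged into one level of $$E$$.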

## References

Enea, M. (2015). speedglm: Fitting Linear and Generalized Linear Models to Large Data Sets. https://CRAN.R-project.org/package=speedglm

Kim, H. (2012). discretization: Data Preprocessing, Discretization for Classification. R package version 1.0-1. https://CRAN.R-project.org/package=discretization

Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.

## See also

glm, speedglm, discretization

## Examples

# Simulation of a discretized logit model
# 100 observations of 3 uniform predictors
x <- matrix(runif(300), nrow = 100, ncol = 3)
# True discretization: 3 equal-width bins per predictor
cuts <- seq(0, 1, length.out = 4)
xd <- apply(x, 2, function(col) as.numeric(cut(col, cuts)))
# theta[b, j] is the coefficient of bin b of predictor j
theta <- t(matrix(c(0, 0, 0, 2, 2, 2, -2, -2, -2), ncol = 3, nrow = 3))
# Log-odds of each observation given the bins of its predictors
log_odd <- rowSums(t(sapply(seq_len(nrow(xd)), function(row_id) {
  sapply(
    seq_len(ncol(xd)),
    function(element) theta[xd[row_id, element], element]
  )
})))
# Draw the binary response from the corresponding probabilities
y <- stats::rbinom(100, 1, 1 / (1 + exp(-log_odd)))

chiM_iter(x, y)
#> New names:
#> *  -> ...1
#> *  -> ...2
#> *  -> ...3
#> New names:
#> *  -> ...1
#> *  -> ...2
#> *  -> ...3
#> New names:
#> *  -> ...1
#> *  -> ...2
#> *  -> ...3
#> Generalized Linear Model of class 'speedglm':
#>
#> Call:  speedglm::speedglm(formula = stats::formula("labels ~ ."), data = Filter(function(x) (length(unique(x)) >      1), cbind(data.frame(sapply(disc$Disc.data, as.factor), stringsAsFactors = TRUE),      data_train[, sapply(data_train, is.factor), drop = FALSE])),      family = stats::binomial(link = "logit"), weights = NULL,      fitted = TRUE)
#>
#> Coefficients:
#> (Intercept)         X110          X12          X13          X14          X15
#>       20.94       -42.68        65.52       -21.72        18.20       -22.58
#>         X16          X17          X18          X19         X210         X211
#>       42.72       -21.93       -20.52       -63.76       -63.66      -108.11
#>        X212         X213         X214         X215          X22          X23
#>      -66.57       -43.91       -39.08      -133.31       -42.70        45.24
#>         X24          X25          X26          X27          X28          X29
#>      -24.59       -40.81       -24.08       -41.00       -45.52         3.23
#>        X310         X311         X312         X313         X314          X32
#>      -83.34        -1.25       -38.10       -87.48       -40.62        27.68
#>         X33          X34          X35          X36          X37          X38
#>       46.46        43.96        -2.80        24.53       -23.13        21.03
#>         X39
#>       -0.51
#>
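For reference, the 'gini' criterion in Credit Scoring is conventionally the Gini index derived from the AUC (Gini = 2 * AUC - 1); whether chiM_iter computes it exactly this way is an assumption, but a base-R sketch of that convention is:

```r
# Gini index from predicted probabilities, via the rank-based AUC formula
# (assumption: this matches the 'gini' criterion used for model selection)
gini <- function(pred, actual) {
  r  <- rank(pred)        # ranks of the predicted scores
  n1 <- sum(actual == 1)  # number of positives
  n0 <- sum(actual == 0)  # number of negatives
  auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * auc - 1
}
gini(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))  # 0.5
```

A perfect ranking gives Gini = 1, a random one 0, so among candidate discretization schemes the one with the largest Gini on the test set would be retained.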