This function discretizes a training set using an SEM-Gibbs based method (see the References section). It detects the numerical features of the dataset and discretizes them; the values of categorical features (of type factor) are regrouped. This is done in a multivariate, supervised way. Selection of the best model is done via AIC, BIC or test-set error (see the criterion parameter). Second-order interactions can optionally be searched for through the interact parameter, using a Metropolis-Hastings algorithm (see the References section).

glmdisc(
  predictors,
  labels,
  interact = TRUE,
  validation = TRUE,
  test = TRUE,
  criterion = "gini",
  iter = 1000,
  m_start = 20,
  reg_type = "poly",
  proportions = c(0.2, 0.2),
  verbose = FALSE,
  fast = FALSE
)

Arguments

predictors

The matrix or data frame containing the numeric or factor attributes to discretize.

labels

The binary (0/1) labels associated with the rows of predictors.

interact

Boolean: TRUE (default) if interaction detection is wanted (warning: this may be very memory- and time-consuming).

validation

Boolean: TRUE if the algorithm should set aside part of predictors as a validation set on which to search for the best discretization scheme using the provided criterion (default: TRUE).

test

Boolean: TRUE if the algorithm should set aside part of predictors as a test set on which to calculate the provided criterion using the best discretization scheme (itself chosen thanks to the provided criterion on either the validation set, if validation is TRUE, or the training set otherwise) (default: TRUE).

criterion

The criterion ('gini', 'aic', 'bic') used to choose the best discretization scheme among those generated (default: 'gini'). Nota bene: it is best to use 'gini' only when test is TRUE, and 'aic' or 'bic' when it is not (a short usage sketch follows this argument list). When using 'aic' or 'bic' with a test set, the likelihood is returned, as there is no need to penalize for generalization purposes.

iter

The number of iterations of the SEM algorithm (default: 1000).

m_start

The maximum number of resulting categories wanted for each variable (default: 20).

reg_type

The model to use between discretized and continuous features (currently, only multinomial logistic regression ('poly') and ordered logistic regression ('polr') are supported; default: 'poly'). Warning: 'poly' requires the mnlogit package; 'polr' requires the MASS package.

proportions

The vector of proportions wanted for the test and validation sets. Not used when both test and validation are FALSE. Only the first element is used when exactly one of test or validation is TRUE. Produces an error when the sum is greater than one. Default: c(0.2, 0.2), so that the training set keeps 60% of the input observations.

verbose

Print information while fitting (default: FALSE).

fast

Boolean: whether to use the Rfast and NPflow packages (C++ implementations of sampling functions; somewhat unstable, but speeds computation up by a factor of 2 to 3; default: FALSE).
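
As referenced above, here is a minimal sketch of how test, validation, criterion and proportions interact. It assumes a predictor matrix x and 0/1 labels y as in the Examples section below; the exact internal sampling of the splits is up to glmdisc.

# Penalized criterion, no held-out data: all input rows are used for training.
fit_bic <- glmdisc(x, y,
  interact = FALSE, test = FALSE, validation = FALSE,
  criterion = "bic", iter = 50, m_start = 4
)

# Gini on held-out data: with proportions = c(0.2, 0.2), roughly 20% of the
# rows go to the test set, 20% to the validation set, and 60% to training.
fit_gini <- glmdisc(x, y,
  interact = FALSE, test = TRUE, validation = TRUE,
  criterion = "gini", proportions = c(0.2, 0.2),
  iter = 50, m_start = 4
)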

Details

This function finds the most appropriate discretization scheme for logistic regression. When provided with a continuous variable \(X\), it tries to convert it to a categorical variable \(Q\) whose values uniquely correspond to intervals of the continuous variable \(X\). When provided with a categorical variable \(X\), it tries to find the best regrouping of its values and subsequently creates the categorical variable \(Q\). The goal is to perform supervised learning with logistic regression, so a target variable \(Y\), denoted by labels, must be specified. The ‘‘discretization'' process, i.e. the transformation of \(X\) into \(Q\), is done so as to achieve the best logistic regression model \(p(y|q;\theta)\). It can be interpreted as a special case of a feature engineering algorithm. Its outputs are the optimal discretization scheme and the logistic regression model associated with it. The parameters passed to the function and the evolution of the criterion across the algorithm's iterations are also returned.
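
As a minimal base-R illustration of the \(X\) to \(Q\) mapping described above (this only mimics the shape of the transformation with hand-picked cut points and groupings; glmdisc estimates these itself):

# Continuous case: each value of Q indexes an interval of X.
x_cont <- runif(10)
q_cont <- cut(x_cont, breaks = c(0, 0.3, 0.7, 1), labels = FALSE)

# Categorical case: discretization amounts to regrouping factor levels,
# e.g. merging levels "a" and "b" into a single level "a_b".
x_cat <- factor(c("a", "b", "c", "a", "c"))
q_cat <- factor(ifelse(x_cat %in% c("a", "b"), "a_b", "c"))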

References

Celeux, G., Chauveau, D., Diebolt, J. (1995), On Stochastic Versions of the EM Algorithm. [Research Report] RR-2514, INRIA. <inria-00074164>

Agresti, A. (2002), Categorical Data Analysis, Second edition, Wiley.

See also

glm, multinom, polr

Author

Adrien Ehrhardt.

Examples

# Simulation of a discretized logit model
x <- matrix(runif(300), nrow = 100, ncol = 3)  # 3 continuous features
cuts <- seq(0, 1, length.out = 4)              # 3 intervals per feature
xd <- apply(x, 2, function(col) as.numeric(cut(col, cuts)))  # true discretization
theta <- t(matrix(c(0, 0, 0, 2, 2, 2, -2, -2, -2), ncol = 3, nrow = 3))  # per-interval coefficients
# Log-odds of each observation under the discretized logit model
log_odd <- rowSums(t(sapply(seq_along(xd[, 1]), function(row_id) {
  sapply(
    seq_along(xd[row_id, ]),
    function(element) theta[xd[row_id, element], element]
  )
})))
y <- rbinom(100, 1, 1 / (1 + exp(-log_odd)))  # simulate the binary target
sem_disc <- glmdisc(x, y,
  iter = 50, m_start = 4, test = FALSE,
  validation = FALSE, criterion = "aic"
)
print(sem_disc)
#> $coefficients
#> [1]  0.12112317 -0.80141580  0.02678556 -1.45494641  0.74786822
#> 
#> $fitted.values
#>   [1] 0.20852766 0.20852766 0.53690992 0.53690992 0.70453578 0.09480209
#>   [7] 0.53690992 0.20852766 0.70453578 0.70453578 0.70453578 0.20852766
#>  [13] 0.09480209 0.70453578 0.70453578 0.53690992 0.70453578 0.70453578
#>  [19] 0.48661630 0.09480209 0.20852766 0.70453578 0.70453578 0.48661630
#>  [25] 0.48661630 0.48661630 0.70453578 0.70453578 0.70453578 0.48661630
#>  [31] 0.48661630 0.20852766 0.09480209 0.09480209 0.48661630 0.48661630
#>  [37] 0.53690992 0.70453578 0.70453578 0.48661630 0.70453578 0.53690992
#>  [43] 0.53690992 0.09480209 0.48661630 0.70453578 0.20852766 0.70453578
#>  [49] 0.53690992 0.70453578 0.20852766 0.48661630 0.70453578 0.53690992
#>  [55] 0.70453578 0.70453578 0.48661630 0.20852766 0.70453578 0.20852766
#>  [61] 0.70453578 0.70453578 0.20852766 0.31547835 0.70453578 0.20852766
#>  [67] 0.20852766 0.48661630 0.70453578 0.70453578 0.20852766 0.70453578
#>  [73] 0.48661630 0.48661630 0.48661630 0.48661630 0.31547835 0.53690992
#>  [79] 0.70453578 0.70453578 0.48661630 0.48661630 0.70453578 0.70453578
#>  [85] 0.70453578 0.70453578 0.48661630 0.20852766 0.20852766 0.09480209
#>  [91] 0.70453578 0.48661630 0.70453578 0.70453578 0.70453578 0.48661630
#>  [97] 0.70453578 0.70453578 0.70453578 0.70453578
#> 
#> $linear.predictors
#>   [1] -1.33382323 -1.33382323  0.14790874  0.14790874  0.86899139 -2.25636220
#>   [7]  0.14790874 -1.33382323  0.86899139  0.86899139  0.86899139 -1.33382323
#>  [13] -2.25636220  0.86899139  0.86899139  0.14790874  0.86899139  0.86899139
#>  [19] -0.05354758 -2.25636220 -1.33382323  0.86899139  0.86899139 -0.05354758
#>  [25] -0.05354758 -0.05354758  0.86899139  0.86899139  0.86899139 -0.05354758
#>  [31] -0.05354758 -1.33382323 -2.25636220 -2.25636220 -0.05354758 -0.05354758
#>  [37]  0.14790874  0.86899139  0.86899139 -0.05354758  0.86899139  0.14790874
#>  [43]  0.14790874 -2.25636220 -0.05354758  0.86899139 -1.33382323  0.86899139
#>  [49]  0.14790874  0.86899139 -1.33382323 -0.05354758  0.86899139  0.14790874
#>  [55]  0.86899139  0.86899139 -0.05354758 -1.33382323  0.86899139 -1.33382323
#>  [61]  0.86899139  0.86899139 -1.33382323 -0.77463024  0.86899139 -1.33382323
#>  [67] -1.33382323 -0.05354758  0.86899139  0.86899139 -1.33382323  0.86899139
#>  [73] -0.05354758 -0.05354758 -0.05354758 -0.05354758 -0.77463024  0.14790874
#>  [79]  0.86899139  0.86899139 -0.05354758 -0.05354758  0.86899139  0.86899139
#>  [85]  0.86899139  0.86899139 -0.05354758 -1.33382323 -1.33382323 -2.25636220
#>  [91]  0.86899139 -0.05354758  0.86899139  0.86899139  0.86899139 -0.05354758
#>  [97]  0.86899139  0.86899139  0.86899139  0.86899139
#> 
#> $loglikelihood
#> [1] -59.87907
#> 
#> $converged
#> [1] TRUE
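
As a follow-up, second-order interaction detection can be requested on the same simulated data through interact = TRUE. This is only a sketch: the Metropolis-Hastings search may be slow even on this small example.

# Same data as above, but letting the Metropolis-Hastings step look for
# pairwise interactions (memory/time-consuming on larger problems).
sem_disc_inter <- glmdisc(x, y,
  interact = TRUE, iter = 50, m_start = 4,
  test = FALSE, validation = FALSE, criterion = "aic"
)
print(sem_disc_inter)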