missForest: modified missForest with onehot probability

missForest is a modified version of the missForest function by Daniel Stekhoven. For detailed documentation, see the missForest package; only the modifications are explained on this page. The original missForest function returns the final imputation result after convergence or after maxiter iterations, with categorical columns returned as vectors. This modified version additionally returns, from the last iteration, the onehot probability of each category for the categorical columns.
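To make the two output forms concrete, the base R sketch below contrasts a categorical column stored as a vector with a onehot probability layout. The probability values and the column names (`color_blue` and so on) are invented for illustration; they are not the function's actual output or naming scheme.

```r
# A categorical column in vector form, as the original function returns it.
color <- factor(c("red", "blue", "green", "blue"),
                levels = c("blue", "green", "red"))

# Onehot encoding of the observed values: exactly one 1 per row.
onehot <- sapply(levels(color), function(l) as.numeric(color == l))
colnames(onehot) <- paste0("color_", levels(color))

# For an imputed entry, the onehot block would hold class probabilities
# instead (hypothetical values here); each row still sums to 1.
onehot_prob <- onehot
onehot_prob[2, ] <- c(0.6, 0.3, 0.1)

# The vector form corresponds to the most probable category in each row.
recovered <- factor(
  levels(color)[max.col(onehot_prob, ties.method = "first")],
  levels = levels(color)
)
print(onehot_prob)
print(recovered)
```

The probability form keeps the uncertainty of each imputed category, which the vector form discards.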

missForest(
  xmis,
  maxiter = 10,
  ntree = 100,
  variablewise = FALSE,
  decreasing = FALSE,
  verbose = FALSE,
  mtry = floor(sqrt(ncol(xmis))),
  replace = TRUE,
  classwt = NULL,
  cutoff = NULL,
  strata = NULL,
  sampsize = NULL,
  nodesize = NULL,
  maxnodes = NULL,
  xtrue = NA,
  parallelize = c("no", "variables", "forests"),
  col_cat = c()
)

Arguments

xmis: data matrix with missing values.
maxiter: maximum number of iterations to perform (default = 10).
ntree: number of trees to grow in each forest (default = 100).
variablewise: (boolean) return OOB errors for each variable separately.
decreasing: (boolean) if TRUE the columns are sorted with decreasing amount of missing values.
verbose: (boolean) if TRUE then missForest returns error estimates, runtime and if available true error during iterations.
mtry: number of variables randomly sampled at each node.
replace: (boolean) if TRUE bootstrap sampling (with replacement) is performed, else subsampling (without replacement).
classwt: list of priors of the classes in the categorical variables.
cutoff: list of class cutoffs for each categorical variable.
strata: list of (factor) variables used for stratified sampling.
sampsize: list of size(s) of sample to draw.
nodesize: minimum size of terminal nodes, vector of length 2, with number for continuous variables in the first entry and number for categorical variables in the second entry.
maxnodes: maximum number of terminal nodes for individual trees.
xtrue: complete data matrix.
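parallelize: whether and how to run in parallel; one of "no", "variables" or "forests" (see the missForest package documentation for details).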
col_cat: index of categorical columns.
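As a rough illustration of how the inputs might be prepared, the base R sketch below builds a small incomplete data set and derives `col_cat` and the default `mtry` from it. The toy data and variable names are hypothetical. It is also an assumption that `col_cat` takes 1-based column positions and that a data frame with factor columns is an acceptable `xmis`. The call itself is left commented out because it needs the package that provides this modified function.

```r
set.seed(1)

# Toy data: two numeric columns and one categorical column (hypothetical).
dat <- data.frame(
  height = rnorm(20, mean = 170, sd = 10),
  weight = rnorm(20, mean = 70, sd = 8),
  group  = factor(sample(c("a", "b", "c"), 20, replace = TRUE))
)

# Remove some entries at random to create the incomplete input.
xmis <- dat
xmis$height[sample(20, 3)] <- NA
xmis$group[sample(20, 4)] <- NA

# Positions of the categorical columns (assumed meaning of `col_cat`).
col_cat <- unname(which(sapply(xmis, is.factor)))

# Default value of `mtry` as given in the usage section.
mtry <- floor(sqrt(ncol(xmis)))

# Not run: requires the package providing the modified missForest.
# imp <- missForest(xmis, maxiter = 10, ntree = 100, col_cat = col_cat)
```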

Value

ximp: imputed data matrix of the same type as 'xmis'.
ximp.disj: imputed data matrix of the same type as 'xmis' for the numeric columns. For the categorical columns, the predicted probability of each category is given in onehot form.
OOBerror: estimated OOB imputation error. For the set of continuous variables in 'xmis' the NRMSE is returned, and for the set of categorical variables the proportion of falsely classified entries. See the Details section of the missForest package documentation for the exact definition of these error measures. If 'variablewise' is set to 'TRUE', this is a vector of length 'p', where 'p' is the number of variables, and the entries are the OOB error for each variable separately.
error: true imputation error. This is only available if 'xtrue' was supplied. The error measures are the same as for 'OOBerror'.
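The two error measures can be sketched in base R as follows. This assumes the definitions given in the original missForest documentation: NRMSE as the root of the mean squared error divided by the variance of the true values, and the proportion of falsely classified entries, both taken over the entries that were missing. The helper names `nrmse` and `pfc` and the toy vectors are hypothetical. The sketch corresponds to the true error computed against 'xtrue', not to the OOB estimate, which is obtained from out-of-bag predictions.

```r
# NRMSE over the entries that were missing (continuous variables).
nrmse <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}

# Proportion of falsely classified entries (categorical variables).
pfc <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(as.character(ximp[mis]) != as.character(xtrue[mis]))
}

# Continuous toy example: entries 2, 4 and 6 were missing.
xtrue_num <- c(1, 2, 3, 4, 5, 6)
xmis_num  <- c(1, NA, 3, NA, 5, NA)
ximp_num  <- c(1, 2.5, 3, 3.5, 5, 6)
err_num <- nrmse(ximp_num, xmis_num, xtrue_num)

# Categorical toy example: entries 2 and 4 were missing.
xtrue_cat <- factor(c("a", "b", "a", "c"))
xmis_cat  <- xtrue_cat
xmis_cat[c(2, 4)] <- NA
ximp_cat  <- factor(c("a", "b", "a", "a"), levels = levels(xtrue_cat))
err_cat <- pfc(ximp_cat, xmis_cat, xtrue_cat)

print(c(NRMSE = err_num, PFC = err_cat))
```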