MissImp: A package for imputing missing values

This package provides methods for generating missing data with a given mechanism and proportion, a range of single and multiple imputation methods that can be combined with bootstrap or jackknife resampling, and evaluation metrics for the imputation result.

MissImp is a function that imputes mixed-type missing values with a chosen imputation method (single or multiple imputation). It can impute continuous, integer and/or categorical data. Bootstrap or jackknife resampling can also be applied to estimate the variance of the predictions. The function returns not only the final imputation result, df_result, but also the estimated variance of each imputed value, df_result_var. For categorical variables, the result in factor form is returned in df_result, while the result in probability-vector (one-hot) form is returned in df_result_disj. If the original complete dataset is also given, an MSE calculated on the numerical columns and an F1-score calculated on the categorical columns are returned as well.

MissImp(
  df,
  imp_method = "missRanger",
  resample_method = "bootstrap",
  n_resample = 2 * round(log(nrow(df))),
  col_cat = c(),
  col_dis = c(),
  maxiter_tree = 10,
  maxiter_pca = 100,
  maxiter_mice = 10,
  ncp_pca = ncol(df)/2,
  learn_ncp = FALSE,
  cat_combine_by = "factor",
  var_cat = "wilcox_va",
  df_complete = NULL,
  num_mi = NULL
)
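
Two of the defaults in the signature depend on the shape of df. As a small worked illustration, the sketch below evaluates them for the built-in iris dataset, which is an arbitrary choice made here and not something the package prescribes.

```r
# Evaluate the shape-dependent defaults from the signature.
# 'iris' is used purely for illustration.
df <- iris

n_resample_default <- 2 * round(log(nrow(df)))  # log() is the natural logarithm
ncp_pca_default <- ncol(df) / 2

print(n_resample_default)  # 10
print(ncp_pca_default)     # 2.5
```

Note that the default ncp_pca is not an integer when df has an odd number of columns, so it may be safer to pass an explicit value in that case.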

Arguments

df

A data.frame with missing values to impute. Most of the preprocessing for categorical variables is handled by the package, but the levels of a categorical variable must not contain special characters (for example '+', '$' or '-').

imp_method

Imputation method to use.

missRanger: Random-forest based imputation method. This is a fast version of 'missForest'.
missForest: Random-forest based imputation method.
kNN: k-Nearest Neighbors based imputation method.
PCA: Principal Component Analysis based imputation method. Factor Analysis of Mixed Data (FAMD) is used for mixed-type data. With a large proportion of missing data, this method may produce convergence errors.
EM: Expectation-Maximization imputation method. This method cannot deal with a large number of categorical variables.
MI_EM: Multiple imputation with the EM method.
MI_PCA: Multiple imputation with the PCA-based method.
MICE: Multiple Imputation by Chained Equations (MICE). This is a Bayesian method.
MI_Ranger: Multiple imputation with the random-forest based method.

resample_method

Resampling method to use.

bootstrap: Generate n_resample incomplete datasets by drawing rows with replacement. Each incomplete dataset has the same number of rows as the original one.
jackknife: Generate n_resample incomplete datasets, each obtained by removing nrow(df)/n_resample rows from the original dataset.
none: No resampling.
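
The two resampling schemes can be sketched in base R as follows. This is only an illustration of the idea, not the package's internal code: the toy data frame is made up, and the way the jackknife picks the rows to drop (one shuffle, then disjoint blocks) is an assumption.

```r
set.seed(1)  # for reproducibility of the illustration only

# Toy incomplete data frame (hypothetical values).
df <- data.frame(x = c(1.2, NA, 3.4, 4.1, NA, 6.0),
                 g = factor(c("a", "b", NA, "a", "b", "a")))
n_resample <- 3

# Bootstrap: draw nrow(df) row indices with replacement, n_resample times.
boot_sets <- lapply(seq_len(n_resample), function(i) {
  df[sample(nrow(df), size = nrow(df), replace = TRUE), , drop = FALSE]
})

# Jackknife: drop a different block of nrow(df) / n_resample rows each time.
# Assumption: rows are shuffled once and split into disjoint blocks.
block_size <- floor(nrow(df) / n_resample)
shuffled <- sample(nrow(df))
jack_sets <- lapply(seq_len(n_resample), function(i) {
  dropped <- shuffled[((i - 1) * block_size + 1):(i * block_size)]
  df[-dropped, , drop = FALSE]
})

print(sapply(boot_sets, nrow))  # each bootstrap dataset keeps all 6 rows
print(sapply(jack_sets, nrow))  # each jackknife dataset keeps 6 - 2 = 4 rows
```

Each resampled dataset would then be imputed separately, and the spread of the imputed values across datasets gives the variance estimate.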

n_resample

Number of datasets created by resampling.

col_cat

Indices of the categorical columns.

col_dis

Indices of the discrete columns with integer values.

maxiter_tree

Max number of iterations with tree-based methods ("missRanger", "missForest", "MI_Ranger", "MI_Ranger_bis").

maxiter_pca

Max number of iterations with PCA-based methods ("PCA", "MI_PCA").

maxiter_mice

Max number of iterations with MICE imputation method.

ncp_pca

Number of components for PCA-based methods ("PCA", "MI_PCA").

learn_ncp

Whether ncp_pca should be learned. This can lead to long execution times.

cat_combine_by

Method used to combine the categorical part of the imputed resampled datasets.

factor: The mode of the predictions is taken as the final result.
onehot: The probability vectors are averaged, and the category with the highest average probability is taken as the final result.
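
The difference between the two options can be seen on a single imputed cell. The sketch below uses made-up predictions from three resampled datasets; the variable names are hypothetical and the code only illustrates the two rules described above, not the package's implementation.

```r
# Three resampled imputations of the same missing categorical cell.
levels_g <- c("a", "b", "c")

# "factor": each imputation yields a label; take the most frequent one.
labels <- factor(c("a", "b", "a"), levels = levels_g)
counts <- table(labels)
result_factor <- names(counts)[which.max(counts)]

# "onehot": each imputation yields a probability vector;
# average the vectors, then pick the category with the highest mean.
probs <- rbind(c(a = 0.40, b = 0.35, c = 0.25),
               c(a = 0.10, b = 0.80, c = 0.10),
               c(a = 0.45, b = 0.40, c = 0.15))
mean_probs <- colMeans(probs)
result_onehot <- names(mean_probs)[which.max(mean_probs)]

print(result_factor)  # "a"
print(result_onehot)  # "b"
```

Here the two rules disagree: "a" wins two of three votes, but the one confident prediction of "b" dominates the averaged probabilities. Ties are resolved by which.max in favour of the first level, which is a choice made for this sketch.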

var_cat

Method used to calculate the 'variance' of a prediction for a categorical variable. 'wilcox_va' selects Wilcox's VarNC index and 'unalike' selects unalikeability.
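
Both indices are commonly written in terms of the proportion of each category among the repeated predictions. The sketch below follows those textbook forms; the function names are hypothetical, and whether the package computes the indices in exactly this way is an assumption.

```r
# x: repeated predictions (a factor without NA) for one imputed categorical cell.

# Unalikeability: 1 - sum of squared category proportions.
unalikeability <- function(x) {
  p <- table(x) / length(x)  # proportion of each level, unused levels included
  1 - sum(p^2)
}

# Wilcox's VarNC: the same quantity rescaled by k / (k - 1),
# so that it reaches 1 when the k categories are equally frequent.
wilcox_varnc <- function(x) {
  k <- nlevels(x)            # number of possible categories (needs k >= 2)
  p <- table(x) / length(x)
  (k / (k - 1)) * (1 - sum(p^2))
}

preds <- factor(c("a", "a", "b", "c"), levels = c("a", "b", "c"))
print(unalikeability(preds))  # 0.625
print(wilcox_varnc(preds))    # 0.9375
```

Both indices are 0 when every prediction agrees, which matches the intuition of a zero-variance imputation.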

df_complete

Complete dataset without missing values (if known). With this dataset, the performance of the imputation method can be evaluated.

num_mi

Number of multiple imputations.

Value

df_result

The final imputation result.

df_result_disj

The imputation result with probability vector for the categorical columns.

df_result_var

The variance for each imputed value. These variances are calculated based on imputation results of the incomplete datasets after resampling.

df_result_var_disj

The variance for each imputed value and for the probability vector of the categorical columns.

MSE

MSE calculated with the numerical imputations.

F1

F1-score calculated with the categorical imputations.
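
To make the two evaluation measures concrete, here is a minimal sketch of how an MSE and an F1-score could be computed when df_complete is available. The toy data, the restriction to originally missing cells, and the macro averaging of F1 over categories are all assumptions of this sketch; the package may define the measures differently.

```r
# Toy complete data, the same data with holes, and a made-up imputation.
df_complete <- data.frame(x = c(1, 2, 3, 4),
                          g = factor(c("a", "b", "a", "b")))
df_missing <- df_complete
df_missing$x[c(2, 4)] <- NA
df_missing$g[c(1, 3, 4)] <- NA

df_result <- df_complete
df_result$x[c(2, 4)] <- c(2.5, 3.0)
df_result$g[c(1, 3, 4)] <- c("a", "b", "b")

# MSE over the imputed numerical cells (assumption: observed cells are excluded).
miss_x <- is.na(df_missing$x)
mse <- mean((df_result$x[miss_x] - df_complete$x[miss_x])^2)

# F1 over the imputed categorical cells (assumption: macro average over levels).
miss_g <- is.na(df_missing$g)
truth <- df_complete$g[miss_g]
pred <- df_result$g[miss_g]
f1_per_class <- sapply(levels(truth), function(lv) {
  tp <- sum(pred == lv & truth == lv)  # true positives for this level
  fp <- sum(pred == lv & truth != lv)  # false positives
  fn <- sum(pred != lv & truth == lv)  # false negatives
  if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
})
f1 <- mean(f1_per_class)

print(mse)  # 0.625
print(f1)   # 0.6666667
```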