MissImp imputes mixed-type missing values with a chosen imputation method (single or multiple imputation). It can be used to impute continuous, integer and/or categorical data. Bootstrap or jackknife resampling can also be applied to estimate the variance of each prediction. The function returns not only the final imputation result df_result, but also the estimated variance of each imputed value, df_result_var. For categorical variables, the result in factor form is returned in df_result, while the result in probability-vector (onehot) form is returned in df_result_disj. If the original complete dataset is also given, an MSE computed on the numerical columns and an F1-score computed on the categorical columns are returned as well.

MissImp(
  df,
  imp_method = "missRanger",
  resample_method = "bootstrap",
  n_resample = 2 * round(log(nrow(df))),
  col_cat = c(),
  col_dis = c(),
  maxiter_tree = 10,
  maxiter_pca = 100,
  maxiter_mice = 10,
  ncp_pca = ncol(df)/2,
  learn_ncp = FALSE,
  cat_combine_by = "factor",
  var_cat = "wilcox_va",
  df_complete = NULL,
  num_mi = NULL
)

Arguments

df

A data.frame with missing values to impute. For categorical variables, most of the preprocessing is handled by the package, but the levels of a categorical variable must not contain special characters (for example '+', '$' or '-').
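
Because the restriction on level names is easy to overlook, it can help to scan the data before calling MissImp. The sketch below uses base R only; find_bad_levels and df_demo are hypothetical names introduced for illustration and are not part of the package, and the pattern covers only the three characters named above.

```r
# Hypothetical helper (not part of the package): for each factor or character
# column, list the levels containing '+', '$' or '-'.
find_bad_levels <- function(df, pattern = "[+$-]") {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  bad <- lapply(df[is_cat], function(x) {
    lv <- unique(as.character(x[!is.na(x)]))
    lv[grepl(pattern, lv)]
  })
  # Keep only the columns that actually have offending levels
  bad[lengths(bad) > 0]
}

# Illustrative data with missing values
df_demo <- data.frame(
  grade = factor(c("A+", "B", NA, "A-")),
  city  = c("Paris", "Lyon", "Nice", NA),
  x     = c(1.2, NA, 3.4, 5.6)
)
bad_levels <- find_bad_levels(df_demo)
print(bad_levels)
```

Offending levels can then be renamed (for example with levels<-) before imputation.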

imp_method

Imputation method to be chosen.

  • missRanger: Random-forest based imputation method. This is a fast version of 'missForest'.

  • missForest: Random-forest based imputation method.

  • kNN: k-Nearest Neighbors based imputation method.

  • PCA: Principal Component Analysis based imputation method. Factor Analysis of Mixed Data (FAMD) is used for mixed-type data. With a large proportion of missing data, this method may produce convergence errors.

  • EM: Expectation-Maximization imputation method. This method cannot deal with a large number of categorical variables.

  • MI_EM: Multiple imputation with EM method.

  • MI_PCA: Multiple imputation with PCA-based method.

  • MICE: Multiple Imputation by Chained Equations (MICE). This is a Bayesian method.

  • MI_Ranger: Multiple imputation with Random Forest based method.

resample_method

Resampling method to be chosen.

  • bootstrap: Generate n_resample incomplete datasets by draws with replacement. Each incomplete dataset has the same number of rows as the original one.

  • jackknife: Generate n_resample incomplete datasets by removing nrows/n_resample rows of the original dataset.

  • none: No resampling.
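
The two resampling schemes can be sketched in base R as follows. This only illustrates the descriptions above and is not the package's code: the toy data, the seed and in particular the way jackknife groups are formed (disjoint random groups of rows) are assumptions.

```r
set.seed(1)

# Toy incomplete dataset (illustrative values)
df_demo <- data.frame(
  x = c(1, NA, 3, 4, NA, 6, 7, 8, 9, 10),
  y = c(NA, 2, 3, NA, 5, 6, 7, 8, NA, 10)
)
n <- nrow(df_demo)
n_resample <- 5

# Bootstrap: each resampled dataset draws n rows with replacement,
# so it has the same number of rows as the original.
boot_sets <- lapply(seq_len(n_resample), function(i) {
  df_demo[sample.int(n, size = n, replace = TRUE), , drop = FALSE]
})

# Jackknife: each resampled dataset omits n / n_resample rows.
# Assumption: the omitted rows form disjoint random groups.
groups <- split(sample.int(n), rep(seq_len(n_resample), length.out = n))
jack_sets <- lapply(groups, function(idx) df_demo[-idx, , drop = FALSE])

sapply(boot_sets, nrow)  # every bootstrap dataset has n rows
sapply(jack_sets, nrow)  # every jackknife dataset has n - n / n_resample rows
```

Each resampled dataset is still incomplete; it is imputed separately, and the spread of the imputed values across datasets is what the variance estimate is based on.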

n_resample

Number of datasets created by resampling.
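
The default shown in the usage, 2 * round(log(nrow(df))), grows slowly with the number of rows (log is the natural logarithm in R). A short worked example, where default_n_resample is a hypothetical wrapper introduced only for illustration:

```r
# Hypothetical wrapper around the default expression from the usage section
default_n_resample <- function(n_rows) 2 * round(log(n_rows))

n_150  <- default_n_resample(150)   # log(150) is about 5.01, rounded to 5
n_1000 <- default_n_resample(1000)  # log(1000) is about 6.91, rounded to 7
print(c(n_150, n_1000))
```

So a dataset of 150 rows gets 10 resampled datasets by default, and one of 1000 rows gets 14.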

col_cat

Index of categorical columns.

col_dis

Index of discrete columns with integer value.

maxiter_tree

Max number of iterations with tree-based methods ("missRanger", "missForest", "MI_Ranger", "MI_Ranger_bis").

maxiter_pca

Max number of iterations with PCA-based methods ("PCA", "MI_PCA").

maxiter_mice

Max number of iterations with MICE imputation method.

ncp_pca

Number of components for PCA-based methods ("PCA", "MI_PCA").

learn_ncp

Logical. If TRUE, ncp_pca is learned instead of being taken from the argument. This can lead to a long execution time.

cat_combine_by

Combine method for the categorical part of imputed resampled datasets.

  • factor: The mode of the several predictions is taken as the final result.

  • onehot: The probability vectors are averaged, and the category with the highest average probability is taken as the final result.
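
The two combination rules can give different answers for the same cell, as the following base R sketch shows. The probabilities are made up for illustration, and tie-breaking (here the first maximum, as returned by which.max) is an assumption rather than documented behaviour.

```r
# Three resampled imputations of one missing categorical cell (made-up values).
levels_cat <- c("a", "b", "c")

# "onehot" input: one probability vector per resampled imputation
pred_probs <- rbind(
  c(a = 0.40, b = 0.35, c = 0.25),
  c(a = 0.10, b = 0.80, c = 0.10),
  c(a = 0.45, b = 0.30, c = 0.25)
)

# "factor" input: the label each imputation would pick on its own
pred_labels <- factor(levels_cat[apply(pred_probs, 1, which.max)],
                      levels = levels_cat)

# factor: take the most frequent label across imputations
combined_factor <- names(which.max(table(pred_labels)))

# onehot: average the probability vectors first, then take the argmax
mean_probs <- colMeans(pred_probs)
combined_onehot <- names(which.max(mean_probs))

print(c(factor = combined_factor, onehot = combined_onehot))
```

Here two of the three imputations favour "a" by a small margin, so the mode is "a", while the one confident vote for "b" dominates the averaged probabilities.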

var_cat

Method used to calculate the 'variance' of a prediction for a categorical variable. 'wilcox_va' is Wilcox's VarNC index and 'unalike' is the coefficient of unalikeability.
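
For orientation, the usual textbook definitions of the two indices can be written in a few lines of base R. This is a sketch of those definitions, not the package's implementation, which may differ in details; wilcox_varnc and unalikeability are hypothetical function names, and counts is the number of resampled imputations falling in each category.

```r
# Wilcox's VarNC: 1 - sum((f_i - N/K)^2) / (N^2 * (K - 1) / K),
# with f_i the count in category i, N the total count, K the number of categories.
wilcox_varnc <- function(counts) {
  n <- sum(counts)
  k <- length(counts)
  1 - sum((counts - n / k)^2) / (n^2 * (k - 1) / k)
}

# Coefficient of unalikeability: 1 - sum(p_i^2), with p_i = f_i / N.
unalikeability <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

counts_demo <- c(a = 2, b = 1, c = 0)  # e.g. three imputations: a, b, a
v_wilcox  <- wilcox_varnc(counts_demo)
v_unalike <- unalikeability(counts_demo)
print(c(wilcox_va = v_wilcox, unalike = v_unalike))
```

Both indices are 0 when all imputations agree. Under these definitions VarNC equals the unalikeability multiplied by K / (K - 1), so it reaches 1 when the imputations are spread evenly over the categories.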

df_complete

Complete dataset without missing values (if known). With this dataset, the performance of each imputation method can be estimated.

num_mi

Number of imputations for the multiple imputation methods.

Value

df_result

The final imputation result.

df_result_disj

The imputation result with a probability vector for the categorical columns.

df_result_var

The variance of each imputed value. These variances are calculated from the imputation results of the incomplete datasets obtained by resampling.

df_result_var_disj

The variance of each imputed value and of the probability vector of the categorical columns.

MSE

MSE calculated on the numerical imputations.

F1

F1-score calculated on the categorical imputations.
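
When df_complete is available, the two scores can also be reproduced by hand, which helps when interpreting them. The sketch below is in base R and rests on assumptions the documentation does not settle: the scores are computed only on the cells that were missing, and the F1-score is macro-averaged over classes. All data frames are made-up illustrations.

```r
# Made-up example: complete data, the same data with missing values,
# and a hypothetical imputation result.
df_complete <- data.frame(x = c(1, 2, 3, 4),
                          g = factor(c("a", "b", "a", "b")))
df_missing  <- data.frame(x = c(1, NA, 3, NA),
                          g = factor(c("a", NA, NA, "b")))
df_imputed  <- data.frame(x = c(1, 2.5, 3, 3),
                          g = factor(c("a", "b", "b", "b")))

# MSE on the imputed numerical cells (assumption: observed cells are ignored)
mask_x <- is.na(df_missing$x)
mse <- mean((df_imputed$x[mask_x] - df_complete$x[mask_x])^2)

# Macro-averaged F1 on the imputed categorical cells (assumption)
mask_g <- is.na(df_missing$g)
truth  <- as.character(df_complete$g[mask_g])
pred   <- as.character(df_imputed$g[mask_g])
f1_per_class <- sapply(levels(df_complete$g), function(cl) {
  tp <- sum(pred == cl & truth == cl)
  fp <- sum(pred == cl & truth != cl)
  fn <- sum(pred != cl & truth == cl)
  # A class with no true positives gets an F1 of 0
  if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
})
f1_macro <- mean(f1_per_class)

print(c(MSE = mse, F1 = f1_macro))
```

A lower MSE and a higher F1-score indicate imputations closer to the true values.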