MissImp imputes mixed-type missing values with a chosen imputation method (single or multiple imputation). It can be used to impute continuous, integer and/or categorical data. A bootstrap or jackknife resampling method can also be applied to estimate the variance of each prediction. The function returns not only the final imputation result df_result, but also the estimated variance of each imputed value, df_result_var. For categorical variables, the result in factor form is returned in df_result, while the result in probability-vector (one-hot) form is returned in df_result_disj. If the original complete dataset is also given, an MSE calculated on the numerical columns is returned, as well as an F1-score calculated on the categorical columns.

MissImp(
  df,
  imp_method = "missRanger",
  resample_method = "bootstrap",
  n_resample = 2 * round(log(nrow(df))),
  col_cat = c(),
  col_dis = c(),
  maxiter_tree = 10,
  maxiter_pca = 100,
  maxiter_mice = 10,
  ncp_pca = ncol(df)/2,
  learn_ncp = FALSE,
  cat_combine_by = "factor",
  var_cat = "wilcox_va",
  df_complete = NULL,
  num_mi = NULL
)



A data.frame with missing values to impute. For categorical variables, most of the preprocessing is handled by the package, but no special characters (for example '+', '$', '-') are allowed in the levels of a categorical variable.
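The restriction on level names can be checked before imputation. The following is a minimal sketch in base R, not part of the package: the helper name find_special_levels is hypothetical, and its default pattern only covers the three characters given as examples above, since the full set of disallowed characters is not listed here.

```r
# Hypothetical helper (not part of MissImp): list the levels of each
# categorical column that contain '+', '$' or '-'.
find_special_levels <- function(df, pattern = "[+$-]") {
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col),
                   logical(1))
  flagged <- lapply(df[is_cat], function(col) {
    lv <- unique(as.character(col[!is.na(col)]))
    lv[grepl(pattern, lv)]
  })
  # Keep only the columns that have at least one offending level
  flagged[lengths(flagged) > 0]
}

toy <- data.frame(
  grade = factor(c("A+", "B", NA, "A-")),
  city  = factor(c("Paris", "Lyon", "Nice", NA)),
  x     = c(1.2, NA, 3.4, 5.6)
)
special <- find_special_levels(toy)
print(special)
```

Levels reported by such a check would need to be renamed before the data.frame is passed on.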


Imputation method to be chosen.

  • missRanger: Random-forest based imputation method. This is a fast version of 'missForest'.

  • missForest: Random-forest based imputation method.

  • kNN: k-Nearest Neighbors based imputation method.

  • PCA: Principal Component Analysis based imputation method. Factor Analysis of Mixed Data (FAMD) is used for mixed-type data. With a large proportion of missing data, this method may produce convergence errors.

  • EM: Expectation-Maximization imputation method. This method cannot handle a large number of categorical variables.

  • MI_EM: Multiple imputation with EM method.

  • MI_PCA: Multiple imputation with PCA-based method.

  • MICE: Multiple Imputation by Chained Equations (MICE). This is a Bayesian method.

  • MI_Ranger: Multiple imputation with Random Forest based method.


Resampling method to be chosen.

  • bootstrap: Generate n_resample incomplete datasets by draws with replacement. Each incomplete dataset has the same number of rows as the original one.

  • jackknife: Generate n_resample incomplete datasets, each by removing nrow(df)/n_resample rows from the original dataset.

  • none: No resampling.
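The two resampling schemes described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: in particular, how the removed rows are chosen for the jackknife is not specified here, so the sketch assumes the rows are randomly split into n_resample groups and each resampled dataset drops one group.

```r
set.seed(1)
df <- data.frame(
  x = c(1, NA, 3, 4, 5, 6, NA, 8, 9, 10),
  g = factor(c("a", "b", NA, "a", "b", "a", "b", NA, "a", "b"))
)
n_resample <- 5

# Bootstrap: draw rows with replacement, keeping the original row count
bootstrap_sets <- lapply(seq_len(n_resample), function(i) {
  df[sample(nrow(df), replace = TRUE), , drop = FALSE]
})

# Jackknife (assumed scheme): assign each row to one of n_resample groups,
# then build each dataset by leaving one group out
group <- sample(rep(seq_len(n_resample), length.out = nrow(df)))
jackknife_sets <- lapply(seq_len(n_resample), function(i) {
  df[group != i, , drop = FALSE]
})

sapply(bootstrap_sets, nrow)  # every bootstrap set has nrow(df) rows
sapply(jackknife_sets, nrow)  # every jackknife set has nrow(df) - nrow(df)/n_resample rows
```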


Number of datasets created by resampling.
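As a worked example of the default 2 * round(log(nrow(df))), the sketch below evaluates it for two dataset sizes. The wrapper name default_n_resample is introduced only for illustration.

```r
# Default number of resampled datasets as a function of the row count
default_n_resample <- function(n_rows) 2 * round(log(n_rows))

default_n_resample(150)   # log(150) is about 5.01, rounded to 5, so 10
default_n_resample(1000)  # log(1000) is about 6.91, rounded to 7, so 14

# With the jackknife, each dataset then leaves out nrow(df)/n_resample rows
150 / default_n_resample(150)  # 15 rows removed per dataset
```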


Index of categorical columns.


Index of discrete columns with integer value.
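The two index arguments can be derived from the column types. The sketch below assumes that the indices are column positions, that categorical columns are stored as factors, and that discrete columns are stored as integers; datasets that encode these types differently would need to list the positions by hand.

```r
df <- data.frame(
  height   = c(1.70, NA, 1.82, 1.65),
  children = c(0L, 2L, NA, 1L),
  colour   = factor(c("red", NA, "blue", "red"))
)

# Positions of factor columns (candidate value for col_cat)
col_cat <- unname(which(vapply(df, is.factor, logical(1))))
# Positions of integer columns (candidate value for col_dis)
col_dis <- unname(which(vapply(df, is.integer, logical(1))))

col_cat  # 3
col_dis  # 2
```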


Max number of iterations with tree-based methods ("missRanger", "missForest", "MI_Ranger", "MI_Ranger_bis").


Max number of iterations with PCA-based methods ("PCA", "MI_PCA").


Max number of iterations with MICE imputation method.


Number of components for PCA-based methods ("PCA", "MI_PCA").


Whether ncp_pca should be learned. Learning it can lead to a long execution time.


Combine method for the categorical part of imputed resampled datasets.

  • factor: The mode of the predictions is taken as the final result.

  • onehot: The probability vectors are averaged, then the category with the highest average probability is taken as the final result.
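The difference between the two combine methods can be illustrated for a single imputed cell. This is a minimal base R sketch with made-up probability vectors, not the package's code; it shows a case where the two rules disagree. Ties are resolved here by which.max, which picks the first maximum, and that choice is an assumption.

```r
lev <- c("a", "b", "c")

# One row per resampled dataset: predicted probability vector for the cell
probs <- rbind(c(0.40, 0.60, 0.00),
               c(0.45, 0.55, 0.00),
               c(0.90, 0.10, 0.00))
colnames(probs) <- lev

# "factor": take each dataset's predicted category, then the mode
preds <- factor(lev[apply(probs, 1, which.max)], levels = lev)
combined_factor <- names(which.max(table(preds)))

# "onehot": average the probability vectors, then take the most probable category
mean_probs <- colMeans(probs)
combined_onehot <- names(which.max(mean_probs))

combined_factor  # "b": two of the three datasets predict "b"
combined_onehot  # "a": the averaged probability of "a" is the largest
```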


Method used to calculate the 'variance' of a prediction for a categorical variable. 'wilcox_va' refers to Wilcox's VarNC index and 'unalike' refers to unalikeability.
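Both indices can be written in terms of the category proportions p_1, ..., p_K. The sketch below uses the usual textbook definitions, unalikeability = 1 - sum(p^2) and VarNC = K/(K-1) * (1 - sum(p^2)), applied to the categories predicted for one cell across resampled datasets. How the package applies them internally is not described here, so the function names and the input format are assumptions.

```r
# Proportion of each level, including levels that were never predicted
level_props <- function(x) as.numeric(prop.table(table(x)))

# Unalikeability: probability that two draws (with replacement) differ
unalikeability <- function(x) {
  p <- level_props(x)
  1 - sum(p^2)
}

# Wilcox's VarNC: the same quantity rescaled to [0, 1] for K categories
wilcox_varnc <- function(x) {
  p <- level_props(x)
  k <- length(p)
  if (k < 2) return(NA_real_)  # undefined with a single category
  k / (k - 1) * (1 - sum(p^2))
}

preds <- factor(c("a", "a", "b", "c"), levels = c("a", "b", "c"))
unalikeability(preds)  # 1 - (0.5^2 + 0.25^2 + 0.25^2) = 0.625
wilcox_varnc(preds)    # 3/2 * 0.625 = 0.9375
```

Both values are 0 when all predictions agree; VarNC reaches 1 when the predictions are spread evenly over the K categories.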


Complete dataset without missing values (if known). With this dataset, the performance of the imputation method can be estimated.


Number of multiple imputations.


  • df_result: The final imputation result.

  • df_result_disj: The imputation result with a probability vector for the categorical columns.

  • df_result_var: The variance of each imputed value. These variances are calculated from the imputation results of the incomplete datasets obtained by resampling.

  • df_result_var_disj: The variance of each imputed value and of the probability vector of the categorical columns.

  • MSE: MSE calculated on the numerical imputations.

  • F1: F1-score calculated on the categorical imputations.
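
A possible call is sketched below on the built-in iris data, with about 10% of the cells removed at random. The argument names are those listed above. Passing the data.frame as the first positional argument and reading the results as list elements (res$df_result, res$MSE, res$F1) are assumptions, and the call is only attempted when the MissImp package is installed.

```r
set.seed(42)
df_complete <- iris

# Remove roughly 10% of the cells completely at random
mask <- matrix(runif(nrow(iris) * ncol(iris)) < 0.1,
               nrow = nrow(iris), ncol = ncol(iris))
df_miss <- iris
df_miss[mask] <- NA

if (requireNamespace("MissImp", quietly = TRUE)) {
  res <- MissImp::MissImp(
    df_miss,
    imp_method      = "missRanger",
    resample_method = "bootstrap",
    n_resample      = 4,
    col_cat         = c(5),       # Species is the only categorical column
    df_complete     = df_complete # enables the MSE and F1 outputs
  )
  head(res$df_result)  # assumed access pattern for the returned values
  res$MSE
  res$F1
}
```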