MissImp.Rd
MissImp is a function that imputes mixed-type missing values with a chosen
imputation method (single or multiple imputation). It can be used to impute
continuous, integer and/or categorical data. A bootstrap or jackknife
resampling method can also be applied to estimate the variance of the
predictions.
The function returns not only the final imputation result df_result, but
also the estimated variance of each imputed value, df_result_var. For
categorical variables, the result in factor form is returned in df_result,
while the result in probability-vector (one-hot) form is returned in
df_result_disj. If the original complete dataset is also given, an MSE
calculated on the numerical columns is returned, as well as an F1-score
calculated on the categorical columns.
MissImp(
df,
imp_method = "missRanger",
resample_method = "bootstrap",
n_resample = 2 * round(log(nrow(df))),
col_cat = c(),
col_dis = c(),
maxiter_tree = 10,
maxiter_pca = 100,
maxiter_mice = 10,
ncp_pca = ncol(df)/2,
learn_ncp = FALSE,
cat_combine_by = "factor",
var_cat = "wilcox_va",
df_complete = NULL,
num_mi = NULL
)
df
A data.frame with missing values to impute. For categorical variables,
most of the preprocessing is handled by the package, but no special
character ('+', '$' or '-', for example) is allowed in the levels of any
categorical variable.
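Because of the restriction on special characters in factor levels, it can help to inspect the data before imputing. The following is a minimal base R sketch, not part of the package: the helper name find_bad_levels and the toy data are hypothetical, and the pattern only covers the three characters named above, since the full set of disallowed characters is not listed here.

```r
# Hypothetical helper (not provided by the package): list the levels of each
# categorical column that contain '+', '$' or '-'.
find_bad_levels <- function(df, pattern = "[+$-]") {
  # Treat factor and character columns as categorical
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  bad <- lapply(df[is_cat], function(x) {
    lev <- unique(as.character(x[!is.na(x)]))
    lev[grepl(pattern, lev)]
  })
  # Keep only the columns that have at least one offending level
  bad[lengths(bad) > 0]
}

# Toy data, for illustration only
toy <- data.frame(
  grade = factor(c("A+", "B", NA, "A-")),
  city  = factor(c("Paris", "Lyon", "Paris", NA)),
  x     = c(1.2, NA, 3.4, 5.6)
)

bad_levels <- find_bad_levels(toy)
print(bad_levels)
```

Columns reported by such a check would need their levels renamed before being passed as df.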
imp_method
Imputation method to be chosen.
missRanger: Random-forest-based imputation method. This is a fast version of 'missForest'.
missForest: Random-forest-based imputation method.
kNN: k-Nearest Neighbors based imputation method.
PCA: Principal Component Analysis based imputation method. Factor Analysis of Mixed Data (FAMD) is used for mixed-type data. With a large proportion of missing data, this method may produce convergence errors.
EM: Expectation-Maximization imputation method. This method cannot deal with a large number of categorical variables.
MI_EM: Multiple imputation with the EM method.
MI_PCA: Multiple imputation with the PCA-based method.
MICE: Multiple Imputation by Chained Equations (MICE). This is a Bayesian method.
MI_Ranger: Multiple imputation with a random-forest-based method.
resample_method
Resampling method to be chosen.
bootstrap: Generate n_resample incomplete datasets by drawing rows with replacement. Each incomplete dataset has the same number of rows as the original one.
jackknife: Generate n_resample incomplete datasets, each by removing nrows/n_resample rows of the original dataset.
none: No resampling.
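The two resampling schemes can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the toy data are made up, and for the jackknife it is assumed that the removed row groups are disjoint and chosen at random, which is not specified here. The number of resamples uses the default from the usage, 2 * round(log(nrow(df))).

```r
set.seed(1)

# Toy incomplete dataset, for illustration only
df <- data.frame(x = c(rnorm(19), NA), y = c(NA, runif(19)))
n  <- nrow(df)

# Default from the usage: with 20 rows, log(20) is about 3, so 2 * 3 = 6
n_resample <- 2 * round(log(n))

# Bootstrap: draw n rows with replacement, n_resample times
boot_sets <- lapply(seq_len(n_resample), function(i) {
  df[sample(n, size = n, replace = TRUE), , drop = FALSE]
})

# Jackknife (assumption: disjoint random groups of about n / n_resample rows);
# each resampled dataset leaves out one group
group <- sample(rep_len(seq_len(n_resample), n))
jack_sets <- lapply(seq_len(n_resample), function(g) {
  df[group != g, , drop = FALSE]
})

sapply(boot_sets, nrow)  # every bootstrap dataset keeps all n rows
sapply(jack_sets, nrow)  # every jackknife dataset is missing one group
```

Each resampled dataset would then be imputed separately, and the spread of the imputed values across datasets gives the variance estimate.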
n_resample
Number of datasets created by resampling.
col_cat
Indices of the categorical columns.
col_dis
Indices of the discrete columns with integer values.
maxiter_tree
Maximum number of iterations for tree-based methods ("missRanger", "missForest", "MI_Ranger", "MI_Ranger_bis").
maxiter_pca
Maximum number of iterations for PCA-based methods ("PCA", "MI_PCA").
maxiter_mice
Maximum number of iterations for the MICE imputation method.
ncp_pca
Number of components for PCA-based methods ("PCA", "MI_PCA").
learn_ncp
Whether ncp_pca should be learned rather than fixed. Learning it can lead
to a long execution time.
cat_combine_by
Method used to combine the categorical part of the imputed resampled datasets.
factor: The mode of the predictions is taken as the final result.
onehot: The probability vectors are averaged, and the category that maximises the averaged probability is taken as the final result.
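The difference between the two combine methods can be seen on a single imputed cell. The sketch below is a base R illustration with made-up probability vectors. The variable names are hypothetical, and it assumes that under "factor" each resampled imputation first votes for its most probable category.

```r
levels_cat <- c("a", "b", "c")

# Rows: resampled imputations of one cell; columns: category probabilities
# (hypothetical values, each row sums to 1)
prob <- rbind(
  c(0.40, 0.35, 0.25),
  c(0.40, 0.35, 0.25),
  c(0.05, 0.90, 0.05),
  c(0.45, 0.40, 0.15),
  c(0.10, 0.80, 0.10)
)
colnames(prob) <- levels_cat

# "factor": each imputation votes for its most probable category,
# then the mode of the votes is kept
votes     <- levels_cat[max.col(prob, ties.method = "first")]
by_factor <- names(which.max(table(factor(votes, levels = levels_cat))))

# "onehot": average the probability vectors first, then take the argmax
mean_prob <- colMeans(prob)
by_onehot <- names(which.max(mean_prob))

by_factor  # "a": three of the five votes
by_onehot  # "b": highest averaged probability (0.56)
```

With these values the two methods disagree: "a" wins the vote narrowly, while "b" has the larger averaged probability because two imputations are very confident about it.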
var_cat
Method used to calculate the 'variance' of a prediction for a categorical variable. 'wilcox_va' means Wilcox's VarNC index and 'unalike' means unalikeability.
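Both indices measure how much the resampled predictions for one cell disagree. The sketch below uses the usual textbook definitions, which are assumed here rather than taken from the package source: with p_i the proportion of predictions in category i and K the number of categories, unalikeability is 1 - sum(p_i^2), and Wilcox's VarNC is K / (K - 1) * (1 - sum(p_i^2)). The function name cat_dispersion is hypothetical.

```r
# Hypothetical helper: dispersion of categorical predictions for one cell.
# Assumes at least two categories (K >= 2).
cat_dispersion <- function(pred, lev, method = c("wilcox_va", "unalike")) {
  method  <- match.arg(method)
  p       <- as.numeric(table(factor(pred, levels = lev))) / length(pred)
  k       <- length(lev)
  unalike <- 1 - sum(p^2)            # unalikeability
  if (method == "unalike") {
    unalike
  } else {
    unalike * k / (k - 1)            # Wilcox's VarNC, rescaled to [0, 1]
  }
}

lev  <- c("a", "b", "c")
pred <- c("a", "a", "b", "a", "b")   # predictions from five resampled datasets

u <- cat_dispersion(pred, lev, "unalike")    # 1 - (0.6^2 + 0.4^2) = 0.48
v <- cat_dispersion(pred, lev, "wilcox_va")  # 0.48 * 3 / 2 = 0.72
z <- cat_dispersion(rep("a", 5), lev)        # full agreement gives 0
```

Under these definitions both indices are 0 when all predictions agree, and VarNC reaches 1 when the predictions are spread evenly over all categories.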
df_complete
Complete dataset without missing values (if known). With this dataset, the performance of each imputation method can be estimated.
num_mi
Number of multiple imputations.
df_result
The final imputation result.
df_result_disj
The imputation result with probability vectors for the categorical columns.
df_result_var
The variance of each imputed value. These variances are calculated from the
imputation results of the incomplete datasets obtained by resampling.
df_result_var_disj
The variance of each imputed value, including the variance of each entry of
the probability vectors for the categorical columns.
MSE
Mean squared error (MSE) calculated on the numerical imputations.
F1
F1-score calculated on the categorical imputations.
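For reference, the two performance measures can be sketched in base R as below. This is an illustration only. It assumes that the MSE is taken over the originally missing entries and that the F1-score is macro-averaged over categories. Neither choice is stated here, and the function names and toy values are hypothetical.

```r
# MSE over the entries that were missing (mask is TRUE where a value was imputed)
mse <- function(truth, imputed, mask) {
  mean((truth[mask] - imputed[mask])^2)
}

# Macro-averaged F1 (assumption): per-category F1, then the unweighted mean
macro_f1 <- function(truth, pred) {
  lev <- union(unique(truth), unique(pred))
  f1 <- vapply(lev, function(l) {
    tp <- sum(pred == l & truth == l)
    fp <- sum(pred == l & truth != l)
    fn <- sum(pred != l & truth == l)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
  mean(f1)
}

# Toy numerical column: entries 2 and 4 were missing and have been imputed
x_true <- c(1.0, 2.0, 3.0, 4.0)
x_imp  <- c(1.0, 2.5, 3.0, 3.0)
mask   <- c(FALSE, TRUE, FALSE, TRUE)
mse_val <- mse(x_true, x_imp, mask)   # (0.5^2 + 1^2) / 2 = 0.625

# Toy categorical column: true and imputed categories
g_true <- c("a", "b", "a", "b")
g_pred <- c("a", "b", "b", "b")
f1_val <- macro_f1(g_true, g_pred)    # mean of 2/3 (for "a") and 0.8 (for "b")
```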