Impute the missing values of a dataset with Multiple Factor Analysis (MFA). The variables are structured a priori into groups of variables. The variables can be continuous or categorical, but within a group all variables must be of the same nature. This function can be used as a preliminary step before performing MFA on an incomplete dataset.

This function is nearly identical to the imputeMFA function in the 'missMDA' package. The only difference is that this function calls impute_mod, which contains some changes made to avoid the convergence error.

imputeMFA_mod(
  X,
  group,
  ncp = 2,
  type = rep("s", length(group)),
  method = c("Regularized", "EM"),
  row.w = NULL,
  coeff.ridge = 1,
  threshold = 1e-06,
  ind.sup = NULL,
  num.group.sup = NULL,
  seed = NULL,
  maxiter = 1000,
  ...
)

Arguments

X

a data.frame with continuous and categorical variables containing missing values.

group

a vector indicating the number of variables in each group.

ncp

integer corresponding to the number of components used to predict the missing entries.

type

the type of variables in each group; three possibilities: "c" or "s" for continuous variables (for "c" the variables are centered and for "s" the variables are scaled to unit variance), "n" for categorical variables.
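
As an illustration of how group and type line up with the columns of X, the sketch below builds a small toy data frame with two continuous groups and one categorical group. The data and column names are invented for this example, and the call to imputeMFA_mod is guarded so that the block also runs when the function is not loaded in the session.

```r
# Toy data (hypothetical): 4 continuous columns followed by 2 categorical ones.
set.seed(1)
X <- data.frame(
  a1 = rnorm(10), a2 = rnorm(10),                   # group 1: continuous
  b1 = rnorm(10), b2 = rnorm(10),                   # group 2: continuous
  c1 = factor(sample(c("x", "y"), 10, TRUE)),       # group 3: categorical
  c2 = factor(sample(c("u", "v", "w"), 10, TRUE))
)
X[2, "a1"] <- NA
X[5, "c1"] <- NA

group <- c(2, 2, 2)        # number of variables in each group, in column order
type  <- c("s", "c", "n")  # one type per group: scaled, centered, categorical

# The group sizes must account for every column of X.
stopifnot(sum(group) == ncol(X), length(type) == length(group))

# Map each column to its group index to check the layout.
col_group <- rep(seq_along(group), times = group)

# Run the imputation only if the function is available in the session.
if (exists("imputeMFA_mod")) {
  res <- imputeMFA_mod(X, group = group, ncp = 2, type = type)
}
```

Because group is positional, the columns of X have to be ordered so that the variables of each group are contiguous.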

method

"Regularized" by default or "EM".

row.w

row weights (by default, uniform row weights).

coeff.ridge

1 by default to perform the regularized imputeMFA algorithm; useful only if method="Regularized". Other amounts of regularization can be obtained by setting the value to less than 1 in order to regularize less (to get closer to the results of the EM method) or to more than 1 in order to regularize more.
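
One way to picture the role of coeff.ridge is as a multiplier on a noise-variance estimate that is subtracted from the leading squared singular values before the data are reconstructed. The sketch below is a simplified illustration of that idea only; it is not the code used by impute_mod or 'missMDA'. The helper name shrink_sv and the crude noise estimate (the mean of the discarded squared singular values) are assumptions made for this example.

```r
# Hypothetical helper: shrink the first ncp singular values of d.
shrink_sv <- function(d, ncp, coeff.ridge = 1) {
  keep <- seq_len(ncp)
  # Crude noise estimate (assumption): mean of the discarded squared values,
  # multiplied by the regularization coefficient.
  sigma2 <- mean(d[-keep]^2) * coeff.ridge
  # Cap the estimate so that the shrunk values stay non-negative.
  sigma2 <- min(sigma2, d[ncp]^2)
  (d[keep]^2 - sigma2) / d[keep]
}

d <- c(5, 3, 1, 0.5)                      # toy singular values
shrink_sv(d, ncp = 2, coeff.ridge = 0)    # no shrinkage (unregularized case)
shrink_sv(d, ncp = 2, coeff.ridge = 1)    # default amount of regularization
shrink_sv(d, ncp = 2, coeff.ridge = 2)    # stronger shrinkage
```

In this toy version, a coefficient of 0 leaves the singular values untouched, and larger coefficients pull them further toward zero.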

threshold

the threshold for assessing convergence.

ind.sup

a vector indicating the indexes of the supplementary individuals.

num.group.sup

a vector indicating the groups of variables that are supplementary.

seed

integer; by default seed = NULL, which means that missing values are initially imputed by the mean of each variable for the continuous variables and by the proportion of each category for the categorical variables, which are coded with indicator matrices of dummy variables. Any other value leads to a random initialization.
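
The default initialization described above can be sketched in base R as follows. This is an illustration on invented toy vectors, not the package's internal code: the continuous variable receives the observed mean, and the missing row of the indicator matrix receives the observed category proportions.

```r
# Toy columns (hypothetical) with one missing entry each.
x <- c(1.0, 2.0, NA, 4.0)
f <- factor(c("a", "b", "a", NA))

# Continuous variable: replace NA by the mean of the observed values.
x_init <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Categorical variable: build the indicator (dummy) matrix.
# Rows with a missing category stay NA at this point.
levs <- levels(f)
ind <- sapply(levs, function(l) as.numeric(f == l))

# Fill the missing rows with the observed proportion of each category.
prop <- colMeans(ind, na.rm = TRUE)
miss <- is.na(f)
ind[miss, ] <- matrix(prop, nrow = sum(miss), ncol = length(levs), byrow = TRUE)
```

Every row of ind still sums to one after this step, which is the constraint the imputed indicator values keep throughout.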

maxiter

integer, maximum number of iterations for the algorithm.

...

further arguments passed to or from other methods.

Value

tab.disj

the imputed matrix; the observed values are kept for the non-missing entries and the missing values are replaced by the predicted ones. The categorical variables are coded with the indicator matrix of dummy variables. In this indicator matrix, the imputed values are real numbers, but they meet the constraint that the sum of the entries corresponding to one individual and one variable is equal to one. Consequently, they can be seen as degrees of membership to the corresponding categories.

completeObs

the mixed imputed dataset; the observed values are kept for the non-missing entries and the missing values are replaced by the predicted ones. For the continuous variables, the values are the same as in the tab.disj output; for the categorical variables, missing values are imputed with the most plausible categories according to the values in the tab.disj output.

call

the matched call.
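
To make the relation between tab.disj and completeObs concrete, the sketch below takes a hand-written slice of an indicator matrix for a single categorical variable and picks the category with the largest degree of membership in each row. The matrix values and level names are invented for this example, and the selection rule shown (largest value wins, first column on ties) is an assumption about what "most plausible" means.

```r
# Hypothetical slice of tab.disj for one categorical variable with levels a/b/c.
tab <- matrix(c(1.0, 0.0, 0.0,    # observed individual: category "a"
                0.2, 0.7, 0.1,    # imputed: degrees of membership
                0.4, 0.1, 0.5),   # imputed
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("a", "b", "c")))

# Each row sums to one (up to floating-point error).
stopifnot(all(abs(rowSums(tab) - 1) < 1e-12))

# Most plausible category = column with the largest membership value.
most_plausible <- factor(colnames(tab)[max.col(tab, ties.method = "first")],
                         levels = colnames(tab))
```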

References

Husson, F. and Josse, J. (2013). Handling missing values in multiple factor analysis. Food Quality and Preference, 30 (2), 77-85.

Josse, J. and Husson, F. (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70 (1), 1-31. <doi:10.18637/jss.v070.i01>