A_Generate_Missingness_and_Tests_MCAR.Rmd
library(MissImp)
In the first section, some data are generated. first, a compete mixed-type dataset is drawn with 5 continuous variables, 1 integer variable and 2 categorical variables. Then given the desired missing data proportion and the missing data mechanism, based on the complete dataset, an incomplete dataset with missing values is created.
In the second section of this document "Statistical tests for MCAR", three tests are performed to show if an incomplete dataset follows a Missing Completely At Random (MCAR) mechanism. These three tests are independently used on several generated incomplete datasets with different mechanisms, in order to show their capacity.
We generate a complete data set \((Y_j)_{1 \leq j \leq 8}\) with \((Y_1, Y_2, Y_3) \sim \mathcal{N}(\mu_1, \Sigma_1)\), \((Y_4, Y_5) \sim \mathcal{N}(\mu_2, \Sigma_2)\), \(Y_6 \sim P(\lambda)\) and \(Y_7, Y_8 \sim \text{Bin}(\gamma)\).
n <- 10000
complete_df_generator <- function(n) {
mu.X <- c(1, 2, 3)
Sigma.X <- matrix(c(
9, 3, 2,
3, 4, 0,
2, 0, 1
), nrow = 3)
# multivariate normal distribution
X.complete.cont <- MASS::mvrnorm(n, mu.X, Sigma.X)
mu1.X <- c(9, 8)
Sigma1.X <- matrix(c(
16, 14,
14, 25
), nrow = 2)
# multivariate normal distribution
X.complete.cont1 <- MASS::mvrnorm(n, mu1.X, Sigma1.X)
lambda <- 4.3
# poisson distribution
X.complete.discr <- stats::rpois(n, lambda)
# binomial
X.complete.cat <- stats::rbinom(n, size = 5, prob = 0.4)
# binomial
X.complete.cat2 <- stats::rbinom(n, size = 7, prob = 0.6)
X.complete <- data.frame(cbind(X.complete.cont, X.complete.cont1,
X.complete.discr,
... = X.complete.cat, X.complete.cat2
))
X.complete[, 7] <- as.factor(X.complete[, 7])
levels(X.complete[, 7]) <- c("F", "E", "D", "C", "B", "A")
X.complete[, 8] <- as.factor(X.complete[, 8])
colnames(X.complete) <- c("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8")
return(X.complete)
}
X.complete <- complete_df_generator(n)
We could generate incomplete dataframe from the complete dataframe, with a chosen percentage of missing data and a chosen missing mechanism. The difference between mechanisms is explained in the documentation of generate_miss
.
miss_perc <- 0.4
rs <- generate_miss(X.complete, miss_perc, mechanism = "MNAR1")
head(rs$X.incomp)
## Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8
## 1 NA 2.174226 1.766466 15.7526786 NA 8 <NA> <NA>
## 2 4.298325 NA NA 0.8051812 2.506792 NA F <NA>
## 3 NA 2.057108 NA NA NA NA <NA> <NA>
## 4 2.376023 3.586055 3.323690 NA 9.899501 3 E 4
## 5 -1.510226 -1.180591 2.927971 9.4324005 2.309095 1 D 3
## 6 3.320883 1.208749 4.283920 NA 12.171468 2 F 3
head(rs$R.mask)
## Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8
## 1 TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## 2 FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## 3 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## 4 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
print(rs$real_miss_perc)
## [1] 0.397875
Next, a list of incomplete dataframes is generated with every mechanism that is implemented in generate_miss
. First a list of incomplete dataframes is generated on the continuous variables \((Y_1, \dots, Y_5)\), with the result in rs.con
; then on the continous + discrete variables \((Y_1, \dots, Y_6)\), with the result in rs.con_dis
; and at last on continuous + discrete + categorical variables, with the result in rs.mix
.
miss_perc <- 0.3
rs.con <- generate_miss_ls(X.complete[, 1:5], miss_perc)
rs.con_dis <- generate_miss_ls(X.complete[, 1:6], miss_perc)
rs.mix <- generate_miss_ls(X.complete, miss_perc)
list_df.con <- list(rs.con$mcar$X.incomp, rs.con$mar1$X.incomp, rs.con$mar2$X.incomp, rs.con$mar3$X.incomp, rs.con$mnar1$X.incomp, rs.con$mnar2$X.incomp)
list_df.con_dis <- list(rs.con_dis$mcar$X.incomp, rs.con_dis$mar1$X.incomp, rs.con_dis$mar2$X.incomp, rs.con_dis$mar3$X.incomp, rs.con_dis$mnar1$X.incomp, rs.con_dis$mnar2$X.incomp)
list_df.mix <- list(rs.mix$mcar$X.incomp, rs.mix$mar1$X.incomp, rs.mix$mar2$X.incomp, rs.mix$mar3$X.incomp, rs.mix$mnar1$X.incomp, rs.mix$mnar2$X.incomp)
p_val <- 0.05
After the generation of incomplete datasets, we test if the missing data mechanism is MCAR by performing three classical statistic tests: the dummy variable test , Little's MCAR test from package and the normality tests (Hawkin's test and non-parametric test included) from package . Our null hypothesis H0 is that the missing data mechanism is MCAR. As the normality test is only written for numerical data, here the test is performed on only the numerical part of the input dataframe.
t <- mcar_test_combined(list_df.mix[[5]], col_cat = c(7, 8), p_val = p_val)
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## Warning: 97 Cases with all variables missing
## have been removed from the data.
print(t$test_results)
## Dummy.variable.test Little.s.MCAR.test MissMech.test
## 1 FALSE FALSE TRUE
# A dataframe that records the p-values for each test
test_p_vals <- data.frame(matrix(rep(0, 4), nrow = 1))
colnames(test_p_vals) <- c("Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")
ls_test <- list("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2")
i <- 1
t <- 1
col_cat <- c(7:8)
for (ls in list(list_df.con, list_df.con_dis, list_df.mix)) {
for (df in ls) {
if (i == 3) {
out <- suppressWarnings(mcar_test_combined(df, col_cat, p_val))
}
else {
out <- suppressWarnings(mcar_test_combined(df, c(), p_val))
}
test_p_vals[t, ] <- out$p_values[1, ]
t <- t + 1
}
i <- i + 1
}
## Warning: 157 Cases with all variables missing
## have been removed from the data.
## Warning: 39 Cases with all variables missing
## have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## Warning: 8 Cases with all variables missing
## have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## Warning: 75 Cases with all variables missing
## have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## Warning: 97 Cases with all variables missing
## have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n Here the test is performed on only the numerical part of the input\n dataframe.\n"
## Warning: 39 Cases with all variables missing
## have been removed from the data.
test_p_vals[["type"]] <- c(rep("continuous", 6), rep("continuous+discrete", 6), rep("mix", 6))
test_p_vals[["mechanism"]] <- rep(c("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2"), 3)
test_results <- data.frame(test_p_vals[, c(1, 2)] > p_val)
test_results[["Miss_Mech_test"]] <- c((test_p_vals[["Hawkin.s.test"]] < p_val) * (test_p_vals[["Non.parametric.test"]] < p_val) == 0)
test_p_vals <- test_p_vals[c("type", "mechanism", "Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")]
test_p_vals
## type mechanism Dummy.variable.test Little.s.MCAR.test
## 1 continuous MCAR 0.92128242 0.8632321
## 2 continuous MAR1 0.00000000 0.0000000
## 3 continuous MAR2 0.00000000 0.0000000
## 4 continuous MAR3 0.00000000 0.0000000
## 5 continuous MNAR1 0.00000000 0.0000000
## 6 continuous MNAR2 0.00000000 0.0000000
## 7 continuous+discrete MCAR 0.51841496 0.7363194
## 8 continuous+discrete MAR1 0.00000000 0.0000000
## 9 continuous+discrete MAR2 0.00000000 0.0000000
## 10 continuous+discrete MAR3 0.00000000 0.0000000
## 11 continuous+discrete MNAR1 0.00000000 0.0000000
## 12 continuous+discrete MNAR2 0.00000000 0.0000000
## 13 mix MCAR 0.04261469 0.6520939
## 14 mix MAR1 0.00000000 0.0000000
## 15 mix MAR2 0.00000000 0.0000000
## 16 mix MAR3 0.00000000 0.0000000
## 17 mix MNAR1 0.00000000 0.0000000
## 18 mix MNAR2 0.00000000 0.0000000
## Hawkin.s.test Non.parametric.test
## 1 4.939305e-01 7.300509e-01
## 2 3.180676e-01 9.472987e-02
## 3 6.875941e-10 1.369197e-221
## 4 1.201183e-64 0.000000e+00
## 5 7.947096e-01 5.922212e-01
## 6 6.436254e-48 8.224947e-224
## 7 1.030676e-01 2.987409e-01
## 8 4.515278e-01 5.065550e-01
## 9 1.360136e-09 1.278570e-123
## 10 4.355048e-56 0.000000e+00
## 11 5.731775e-02 1.685605e-02
## 12 1.286846e-30 1.277741e-61
## 13 1.288393e-02 2.407563e-02
## 14 5.815969e-01 3.483533e-01
## 15 1.360136e-09 1.278570e-123
## 16 3.543378e-63 0.000000e+00
## 17 9.929395e-01 8.705730e-01
## 18 1.286846e-30 1.277741e-61
test_results
## Dummy.variable.test Little.s.MCAR.test Miss_Mech_test
## 1 TRUE TRUE TRUE
## 2 FALSE FALSE TRUE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE TRUE
## 6 FALSE FALSE FALSE
## 7 TRUE TRUE TRUE
## 8 FALSE FALSE TRUE
## 9 FALSE FALSE FALSE
## 10 FALSE FALSE FALSE
## 11 FALSE FALSE TRUE
## 12 FALSE FALSE FALSE
## 13 FALSE TRUE FALSE
## 14 FALSE FALSE TRUE
## 15 FALSE FALSE FALSE
## 16 FALSE FALSE FALSE
## 17 FALSE FALSE TRUE
## 18 FALSE FALSE FALSE
The result of MissMech test is not very accurate. This may due to the problem of implementation in MissMech package, which is not available on CRAN anymore.