Generation of Missingness and Tests of MCAR

library(MissImp)

In the first section, some data are generated. first, a compete mixed-type dataset is drawn with 5 continuous variables, 1 integer variable and 2 categorical variables. Then given the desired missing data proportion and the missing data mechanism, based on the complete dataset, an incomplete dataset with missing values is created.

In the second section of this document "Statistical tests for MCAR", three tests are performed to show if an incomplete dataset follows a Missing Completely At Random (MCAR) mechanism. These three tests are independently used on several generated incomplete datasets with different mechanisms, in order to show their capacity.

Generate data

Complete data

We generate a complete data set \((Y_j)_{1 \leq j \leq 8}\) with \((Y_1, Y_2, Y_3) \sim \mathcal{N}(\mu_1, \Sigma_1)\), \((Y_4, Y_5) \sim \mathcal{N}(\mu_2, \Sigma_2)\), \(Y_6 \sim P(\lambda)\) and \(Y_7, Y_8 \sim \text{Bin}(\gamma)\).

n <- 10000
complete_df_generator <- function(n) {
  mu.X <- c(1, 2, 3)
  Sigma.X <- matrix(c(
    9, 3, 2,
    3, 4, 0,
    2, 0, 1
  ), nrow = 3)
  # multivariate normal distribution
  X.complete.cont <- MASS::mvrnorm(n, mu.X, Sigma.X)

  mu1.X <- c(9, 8)
  Sigma1.X <- matrix(c(
    16, 14,
    14, 25
  ), nrow = 2)
  # multivariate normal distribution
  X.complete.cont1 <- MASS::mvrnorm(n, mu1.X, Sigma1.X)

  lambda <- 4.3
  # poisson distribution
  X.complete.discr <- stats::rpois(n, lambda)
  # binomial
  X.complete.cat <- stats::rbinom(n, size = 5, prob = 0.4)
  # binomial
  X.complete.cat2 <- stats::rbinom(n, size = 7, prob = 0.6)

  X.complete <- data.frame(cbind(X.complete.cont, X.complete.cont1,
    X.complete.discr,
    ... = X.complete.cat, X.complete.cat2
  ))
  X.complete[, 7] <- as.factor(X.complete[, 7])
  levels(X.complete[, 7]) <- c("F", "E", "D", "C", "B", "A")
  X.complete[, 8] <- as.factor(X.complete[, 8])
  colnames(X.complete) <- c("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8")
  return(X.complete)
}
X.complete <- complete_df_generator(n)

Add missingness

We could generate incomplete dataframe from the complete dataframe, with a chosen percentage of missing data and a chosen missing mechanism. The difference between mechanisms is explained in the documentation of generate_miss.

miss_perc <- 0.4
rs <- generate_miss(X.complete, miss_perc, mechanism = "MNAR1")
head(rs$X.incomp)

##          Y1        Y2       Y3         Y4        Y5 Y6   Y7   Y8
## 1        NA  2.174226 1.766466 15.7526786        NA  8 <NA> <NA>
## 2  4.298325        NA       NA  0.8051812  2.506792 NA    F <NA>
## 3        NA  2.057108       NA         NA        NA NA <NA> <NA>
## 4  2.376023  3.586055 3.323690         NA  9.899501  3    E    4
## 5 -1.510226 -1.180591 2.927971  9.4324005  2.309095  1    D    3
## 6  3.320883  1.208749 4.283920         NA 12.171468  2    F    3

head(rs$R.mask)

##      Y1    Y2    Y3    Y4    Y5    Y6    Y7    Y8
## 1  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## 2 FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
## 3  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 4 FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

print(rs$real_miss_perc)

## [1] 0.397875

Next, a list of incomplete dataframes is generated with every mechanism that is implemented in generate_miss. First a list of incomplete dataframes is generated on the continuous variables \((Y_1, \dots, Y_5)\), with the result in rs.con; then on the continous + discrete variables \((Y_1, \dots, Y_6)\), with the result in rs.con_dis; and at last on continuous + discrete + categorical variables, with the result in rs.mix.

miss_perc <- 0.3
rs.con <- generate_miss_ls(X.complete[, 1:5], miss_perc)
rs.con_dis <- generate_miss_ls(X.complete[, 1:6], miss_perc)
rs.mix <- generate_miss_ls(X.complete, miss_perc)

list_df.con <- list(rs.con$mcar$X.incomp, rs.con$mar1$X.incomp, rs.con$mar2$X.incomp, rs.con$mar3$X.incomp, rs.con$mnar1$X.incomp, rs.con$mnar2$X.incomp)

list_df.con_dis <- list(rs.con_dis$mcar$X.incomp, rs.con_dis$mar1$X.incomp, rs.con_dis$mar2$X.incomp, rs.con_dis$mar3$X.incomp, rs.con_dis$mnar1$X.incomp, rs.con_dis$mnar2$X.incomp)

list_df.mix <- list(rs.mix$mcar$X.incomp, rs.mix$mar1$X.incomp, rs.mix$mar2$X.incomp, rs.mix$mar3$X.incomp, rs.mix$mnar1$X.incomp, rs.mix$mnar2$X.incomp)

Statistical tests for MCAR

p_val <- 0.05

After the generation of incomplete datasets, we test if the missing data mechanism is MCAR by performing three classical statistic tests: the dummy variable test , Little's MCAR test from package and the normality tests (Hawkin's test and non-parametric test included) from package . Our null hypothesis H0 is that the missing data mechanism is MCAR. As the normality test is only written for numerical data, here the test is performed on only the numerical part of the input dataframe.

t <- mcar_test_combined(list_df.mix[[5]], col_cat = c(7, 8), p_val = p_val)

## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## Warning: 97 Cases with all variables missing
##     have been removed from the data.

print(t$test_results)

##   Dummy.variable.test Little.s.MCAR.test MissMech.test
## 1               FALSE              FALSE          TRUE

# A dataframe that records the p-values for each test
test_p_vals <- data.frame(matrix(rep(0, 4), nrow = 1))
colnames(test_p_vals) <- c("Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")

ls_test <- list("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2")
i <- 1
t <- 1
col_cat <- c(7:8)
for (ls in list(list_df.con, list_df.con_dis, list_df.mix)) {
  for (df in ls) {
    if (i == 3) {
      out <- suppressWarnings(mcar_test_combined(df, col_cat, p_val))
    }
    else {
      out <- suppressWarnings(mcar_test_combined(df, c(), p_val))
    }
    test_p_vals[t, ] <- out$p_values[1, ]
    t <- t + 1
  }
  i <- i + 1
}

## Warning: 157 Cases with all variables missing
##     have been removed from the data.
## Warning: 39 Cases with all variables missing
##     have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## Warning: 8 Cases with all variables missing
##     have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## Warning: 75 Cases with all variables missing
##     have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## Warning: 97 Cases with all variables missing
##     have been removed from the data.
## [1] "The normality test is written for numerical dataframes.\n          Here the test is performed on only the numerical part of the input\n          dataframe.\n"
## Warning: 39 Cases with all variables missing
##     have been removed from the data.

test_p_vals[["type"]] <- c(rep("continuous", 6), rep("continuous+discrete", 6), rep("mix", 6))
test_p_vals[["mechanism"]] <- rep(c("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2"), 3)
test_results <- data.frame(test_p_vals[, c(1, 2)] > p_val)
test_results[["Miss_Mech_test"]] <- c((test_p_vals[["Hawkin.s.test"]] < p_val) * (test_p_vals[["Non.parametric.test"]] < p_val) == 0)
test_p_vals <- test_p_vals[c("type", "mechanism", "Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")]

test_p_vals

##                   type mechanism Dummy.variable.test Little.s.MCAR.test
## 1           continuous      MCAR          0.92128242          0.8632321
## 2           continuous      MAR1          0.00000000          0.0000000
## 3           continuous      MAR2          0.00000000          0.0000000
## 4           continuous      MAR3          0.00000000          0.0000000
## 5           continuous     MNAR1          0.00000000          0.0000000
## 6           continuous     MNAR2          0.00000000          0.0000000
## 7  continuous+discrete      MCAR          0.51841496          0.7363194
## 8  continuous+discrete      MAR1          0.00000000          0.0000000
## 9  continuous+discrete      MAR2          0.00000000          0.0000000
## 10 continuous+discrete      MAR3          0.00000000          0.0000000
## 11 continuous+discrete     MNAR1          0.00000000          0.0000000
## 12 continuous+discrete     MNAR2          0.00000000          0.0000000
## 13                 mix      MCAR          0.04261469          0.6520939
## 14                 mix      MAR1          0.00000000          0.0000000
## 15                 mix      MAR2          0.00000000          0.0000000
## 16                 mix      MAR3          0.00000000          0.0000000
## 17                 mix     MNAR1          0.00000000          0.0000000
## 18                 mix     MNAR2          0.00000000          0.0000000
##    Hawkin.s.test Non.parametric.test
## 1   4.939305e-01        7.300509e-01
## 2   3.180676e-01        9.472987e-02
## 3   6.875941e-10       1.369197e-221
## 4   1.201183e-64        0.000000e+00
## 5   7.947096e-01        5.922212e-01
## 6   6.436254e-48       8.224947e-224
## 7   1.030676e-01        2.987409e-01
## 8   4.515278e-01        5.065550e-01
## 9   1.360136e-09       1.278570e-123
## 10  4.355048e-56        0.000000e+00
## 11  5.731775e-02        1.685605e-02
## 12  1.286846e-30        1.277741e-61
## 13  1.288393e-02        2.407563e-02
## 14  5.815969e-01        3.483533e-01
## 15  1.360136e-09       1.278570e-123
## 16  3.543378e-63        0.000000e+00
## 17  9.929395e-01        8.705730e-01
## 18  1.286846e-30        1.277741e-61

test_results

##    Dummy.variable.test Little.s.MCAR.test Miss_Mech_test
## 1                 TRUE               TRUE           TRUE
## 2                FALSE              FALSE           TRUE
## 3                FALSE              FALSE          FALSE
## 4                FALSE              FALSE          FALSE
## 5                FALSE              FALSE           TRUE
## 6                FALSE              FALSE          FALSE
## 7                 TRUE               TRUE           TRUE
## 8                FALSE              FALSE           TRUE
## 9                FALSE              FALSE          FALSE
## 10               FALSE              FALSE          FALSE
## 11               FALSE              FALSE           TRUE
## 12               FALSE              FALSE          FALSE
## 13               FALSE               TRUE          FALSE
## 14               FALSE              FALSE           TRUE
## 15               FALSE              FALSE          FALSE
## 16               FALSE              FALSE          FALSE
## 17               FALSE              FALSE           TRUE
## 18               FALSE              FALSE          FALSE

The result of MissMech test is not very accurate. This may due to the problem of implementation in MissMech package, which is not available on CRAN anymore.

Xuwen Liu

2022-03-31

Generate data

Complete data

Add missingness

Statistical tests for MCAR