kNN — kNN • MissImp

k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continous variables. The original function is kNN in package VIM by Alexander Kowarik and Statistik Austria. Here only the difference will be explained. In kNN, not only the imputed result will be returned, but also the disjunctive imputed result. For each observation of one categorical column, the value of the Nearest Neighbors will be recorded, and with this values, the probability vector for each category is constructed.

kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = stats::median,
  catFun = VIM::maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE,
  col_cat = c()
)

Arguments

data: data.frame or matrix
variable: variables where missing values should be imputed
metric: metric to be used for calculating the distances between
k: number of Nearest Neighbours used
dist_var: names or variables to be used for distance calculation
weights: weights for the variables for distance calculation. If `weights = "auto"` weights will be selected based on variable importance from random forest regression, using function [ranger::ranger()]. Weights are calculated for each variable seperately.
numFun: function for aggregating the k Nearest Neighbours in the case of a numerical variable
catFun: function for aggregating the k Nearest Neighbours in the case of a categorical variable
makeNA: list of length equal to the number of variables, with values, that should be converted to NA for each variable
NAcond: list of length equal to the number of variables, with a condition for imputing a NA
impNA: TRUE/FALSE whether NA should be imputed
donorcond: condition for the donors e.g. list(">5"), must be NULL or a list of same length as variable
mixed: names of mixed variables
mixed.constant: vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability
trace: TRUE/FALSE if additional information about the imputation process should be printed
imp_var: TRUE/FALSE if a TRUE/FALSE variables for each imputed variable should be created show the imputation status
imp_suffix: suffix for the TRUE/FALSE variables showing the imputation status
addRF: TRUE/FALSE each variable will be modelled using random forest regression ([ranger::ranger()]) and used as additional distance variable.
onlyRF: TRUE/FALSE if TRUE only additional distance variables created from random forest regression will be used as distance variables.
addRandom: TRUE/FALSE if an additional random variable should be added for distance calculation
useImputedDist: TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables.
weightDist: TRUE/FALSE if the distances of the k nearest neighbors should be used as weights in the aggregation step
col_cat: Column indices for categorical columns

Value

ximp The imputed data set. ximp.disj The imputed data set with categorical columns in one-hot form. R.mask The indicator matrix of missingness/imputation.

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. *Journal of Statistical Software*, 74(7), 1-16.