Secondary data on 40 genotypes of soybean grown at Almora are used for this study. The data are available from the All India Coordinated Research Project on Soybean 2020-2021, ICAR-Indian Institute of Soybean Research, Indore. The dataset consists of three morphological quantitative characters: yield (kg/ha), plant height (cm) and 100-seed weight (g). To conduct this investigation, the complete records of the 40 genotypes with three morphological characters are used to create missing datasets, which are subsequently used in this study. For this purpose, a missing completely at random (MCAR) mechanism is used to create various proportions of missing values in the original data. The MCAR mechanism can be described as follows: to create a dataset with α% missing values, α% of the observations are deleted at random. Here, α = 10, 20 and 30, giving 10%, 20% and 30% missing (incomplete) datasets. The missing values are then imputed using the various imputation techniques, each with five iterations. Selection criteria, namely root mean square error (RMSE) and mean absolute error (MAE), are used to determine the best missing-data imputation technique. For this purpose, a program was written in RStudio to create the various proportions of missing data and to apply the imputation techniques.
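The authors' program was written in R; purely as an illustration of the MCAR deletion step described above, the following is a minimal Python sketch (the function name and the simulated 40 × 3 matrix are ours, not from the original analysis):

```python
import numpy as np

def make_mcar(data, alpha, seed=0):
    """Delete alpha% of the cells completely at random (MCAR),
    returning a copy with the deleted cells set to NaN."""
    rng = np.random.default_rng(seed)
    out = data.astype(float)          # astype copies by default
    n_missing = int(round(out.size * alpha / 100))
    # choose cell indices uniformly at random, without replacement
    idx = rng.choice(out.size, size=n_missing, replace=False)
    flat = out.ravel()
    flat[idx] = np.nan
    return flat.reshape(out.shape)

# stand-in for the 40 genotypes x 3 characters (yield, height, seed weight)
complete = np.random.default_rng(1).normal(size=(40, 3))
for alpha in (10, 20, 30):
    incomplete = make_mcar(complete, alpha)
    print(alpha, np.isnan(incomplete).sum())   # 12, 24 and 36 of 120 cells
```

Because deletion ignores both the values themselves and the other variables, the missingness it produces is MCAR by construction.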
Mechanisms of missing data
In any study, if we encounter missing-value problems, the first thing we should do is examine the type of missingness. According to Little and Rubin (1987), missingness generally falls into one of three types: "missing completely at random", "missing at random" and "missing not at random".
1. Missing completely at random (MCAR)
This means that the data's missingness is unrelated to both observed and unobserved variables. As described by Van Buuren (2012), data are MCAR when every value has the same probability of being missing. The MCAR assumption is ideal in that unbiased estimates can be obtained despite the missing values.
2. Missing at random (MAR)
Under MAR, missingness is related to the observed variables but not to unobserved ones. Estimates obtained from a dataset under the MAR assumption may or may not be biased. Van Buuren (2012) defines data as MAR when, given the observed data, every value has the same probability of being missing.
3. Missing not at random (MNAR)
Under MNAR, missingness is connected to unobserved variables: missing values result from events or unidentified factors that are not measured. Van Buuren (2012) defines MNAR as data that are neither MCAR nor MAR.
Missing dataset and patterns
The pattern simply defines which values in the dataset are observed and which are missing. For our analysis, the soybean data with three morphological characters, each observed on 40 genotypes, are chosen. The dataset's missing pattern is identified using the naniar function vis_miss(), and a visual picture of the data pattern is provided in Fig 1.
As Fig 1 shows, each morphological character of soybean carries 10%, 20% or 30% missing data in the respective dataset, with 90%, 80% and 70% of the values present. The missing portions are shown in black to indicate how many values are missing in each dataset. Since the amount of missing data fluctuates in real situations, the following techniques are used to impute missing values into all datasets with varying percentages of missingness.
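The naniar::vis_miss() plot essentially displays the per-variable fraction of missing cells; the same summary can be computed numerically. A small Python sketch (the variable names are illustrative, not from the original data file):

```python
import numpy as np

def missing_summary(data, names):
    """Fraction of missing (NaN) cells per variable, mirroring the
    percentages that a vis_miss()-style plot displays."""
    frac = np.isnan(data).mean(axis=0)
    return dict(zip(names, frac))

rng = np.random.default_rng(0)
data = rng.normal(size=(40, 3))
data[rng.random(data.shape) < 0.2] = np.nan   # roughly 20% MCAR holes
print(missing_summary(data, ["yield", "plant_height", "seed_weight"]))
```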
Imputation techniques
Multivariate imputation by chained equation (MICE)
The MICE technique is one of the most powerful imputation techniques. The first stage in MICE is to generate multiple imputed datasets. To fill in missing values, this method employs a set of regression models and works iteratively, imputing each variable in turn. The user provides a conditional model for each variable, using the other variables as predictors. By default, a linear regression model is used for continuous variables, a logistic regression model for binary variables and a polytomous logistic regression model for categorical variables with more than two levels. The algorithm iterates, imputing the missing values from the fitted conditional models, until a stopping criterion is met. In this respect it is quite similar to the missForest algorithm; the primary distinction is that missForest employs more flexible decision trees for each conditional model.
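The chained-equations cycle for continuous variables can be sketched compactly: each variable with missing values is regressed on all the others, its missing cells are refreshed from the fitted model, and the cycle repeats. This is a simplified numpy sketch of that iteration (plain conditional means only; the full MICE algorithm, as in the R mice package, draws from the posterior predictive distribution instead):

```python
import numpy as np

def mice_sketch(data, n_iter=10):
    """Chained-equations imputation for continuous variables:
    initialize missing cells with column means, then cycle,
    regressing each variable on all the others and overwriting
    its missing cells with the fitted values."""
    X = data.astype(float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # initial fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)     # current completed data
            A = np.column_stack([np.ones(len(X)), others])
            obs = ~miss[:, j]
            # least-squares fit of variable j on the other variables
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X
```

Observed cells are never touched; only the initially missing cells change from one sweep to the next.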
Multiple imputation with diagnostics (MI)
According to Su et al. (2011), the mi imputation technique is derived from MICE, but one significant difference is that it imputes from the conditional distribution of each variable given the other variables, whether those are imputed or observed. MI has an advantage over MICE in that it can handle data irregularities such as multicollinearity within a dataset.
The process of imputing a variable is broken down into four steps (Su et al., 2011). First, setup: pre-processing is performed, conditional models are identified and missing-data patterns are examined to detect problems with the dataset. Second, imputation: the algorithm iterates over MICE-style imputations with the specified conditional models, checking the imputed values for conditionality, acceptability and convergence. Third, analysis: several imputed complete datasets are gathered and combined for the complete-case analysis. Fourth, validation: cross-validation is performed, sensitivity is examined and compatibility is checked.
MissForest
Stekhoven and Bühlmann (2012) described this as a non-parametric technique in which the variables are pairwise independent. The random forest approach serves as the foundation of the algorithm. For each variable, missForest fits a random forest on the observed values and uses its predictions to impute the missing values. The algorithm iterates until the stopping criterion is met or the maximum number of iterations is reached. The random forest model has the advantages of handling both continuous and categorical responses, requiring little tuning and offering an internally cross-validated (out-of-bag) error estimate.
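The distinctive part of missForest is its stopping rule: the iteration halts as soon as the change between successive imputations starts to increase. The loop below sketches that structure in Python, with a plain linear model deliberately standing in for the random forest (in practice one would plug in a random-forest fit/predict step, e.g. from scikit-learn; the function names here are ours):

```python
import numpy as np

def iterative_impute(data, fit_predict, max_iter=10):
    """missForest-style loop: start from a mean fill, re-predict each
    variable's missing cells from the others, and stop once the change
    between successive imputations increases (or max_iter is reached).
    `fit_predict(X_obs, y_obs, X_mis)` stands in for the random forest."""
    X = data.astype(float).copy()
    miss = np.isnan(X)
    means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = means[j]
    prev_change = np.inf
    for _ in range(max_iter):
        X_old = X.copy()
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            obs = ~miss[:, j]
            X[miss[:, j], j] = fit_predict(others[obs], X[obs, j],
                                           others[miss[:, j]])
        change = ((X - X_old) ** 2)[miss].sum()
        if change > prev_change:       # imputation error started rising
            return X_old               # keep the previous (better) fill
        prev_change = change
    return X

def linear_fit_predict(X_obs, y_obs, X_mis):
    """Simple linear stand-in for the random-forest step."""
    A = np.column_stack([np.ones(len(X_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    return np.column_stack([np.ones(len(X_mis)), X_mis]) @ beta
```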
Amelia
Amelia assumes that the data follow a multivariate normal distribution and uses it to generate m imputed datasets from an incomplete dataset. The method first draws a bootstrapped version of the original data, estimates the required statistics (means and covariances) by expectation-maximization (EM) and then uses these estimates to impute the missing values in the original data. To create the m complete datasets, it repeats this process m times, with identical observed values and the unobserved values drawn from their posterior distributions.
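The EM step at the heart of this approach can be illustrated with a crude sketch: under a multivariate normal model, alternate between estimating the mean and covariance from the current completed data and replacing each row's missing entries by their conditional expectation given its observed entries. This Python sketch shows only that inner EM loop (Amelia additionally wraps it in a bootstrap and adds posterior draws to produce the m datasets):

```python
import numpy as np

def em_normal_impute(data, n_iter=20):
    """Crude EM under a multivariate normal model:
    (M) estimate mean and covariance from the completed data,
    (E) fill each row's missing cells with their conditional mean
        E[x_mis | x_obs] = mu_m + S_mo S_oo^{-1} (x_obs - mu_o)."""
    X = data.astype(float).copy()
    miss = np.isnan(X)
    means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = means[j]           # starting values
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            S_oo = S[np.ix_(o, o)]
            S_mo = S[np.ix_(m, o)]
            X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, o] - mu[o])
    return X
```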
Measures of performance for imputation techniques
For each morphological character in the soybean genotypic data, at each level of missingness, the root mean square error (RMSE) and mean absolute error (MAE) are calculated to evaluate the performance of the imputation techniques. The techniques are compared using RMSE and MAE, which measure the gap between the actual and imputed values. The formulas for RMSE and MAE for each variable are given as follows.
Root mean square error (RMSE)
The RMSE is the square root of the MSE; it is a valuable measure of accuracy and describes the standard deviation of the differences between the observed and imputed values.

\[ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2} \]

Where,
ŷ_i = imputed value,
y_i = true value,
m = number of observations in each variable.
Mean absolute error (MAE)
The MAE measures the average magnitude of the errors and helps show how well each imputation technique performs.

\[ \mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}_i - y_i\right| \]

In general, the more efficient imputation technique has lower RMSE and MAE values.
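Since the complete records are available, both criteria can be computed over exactly the cells that were deleted. A Python sketch of the two measures (the mask argument identifies the previously missing cells):

```python
import numpy as np

def rmse(true, imputed, miss_mask):
    """Root mean square error over the imputed (previously missing) cells."""
    d = (imputed - true)[miss_mask]
    return float(np.sqrt(np.mean(d ** 2)))

def mae(true, imputed, miss_mask):
    """Mean absolute error over the imputed cells."""
    return float(np.mean(np.abs((imputed - true)[miss_mask])))

true_vals = np.array([1.0, 2.0, 3.0, 4.0])
imputed_vals = np.array([1.0, 2.0, 4.0, 6.0])   # errors 1 and 2 on the holes
mask = np.array([False, False, True, True])
print(rmse(true_vals, imputed_vals, mask))      # sqrt(2.5) ≈ 1.581
print(mae(true_vals, imputed_vals, mask))       # 1.5
```

Restricting both measures to the deleted cells keeps the comparison fair: observed cells are identical across techniques and would only dilute the error.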