The descriptive statistics, including the mean and standard deviation (SD), for the sugar component of the chickpea samples are shown in Table 1.
Of the 237 samples, 184 showed reflectance for the sugar component that could be studied in the NIR spectra. Across these 184 samples, the mean value is 7.01. Given the modest sample size, determining the distribution of the sugar component was important. The Shapiro-Wilk test revealed that the distribution of the sugar component deviated considerably from normality (W = 0.96, p = 0.001).
Table 2 shows the effective wavelength ranges for sugar prediction. RMSECV and R² values were calculated over these wavelength ranges. The range with the lowest RMSECV and the highest R², 1800-1938 nm, was selected for developing models for the sugar component. This range is characterised by O-H and N-H combination bands.
Boxplots were used to detect outliers. Models were developed after preprocessing the spectra over the 1800-1938 nm wavelength range using various mathematical techniques. The model was developed with treatment 1,2,0,2, meaning the spectra were passed through a first derivative with a gap size of 2 nm and smoothed by a moving average with a 2 nm gap size. In addition, the spectra were passed through a standard normal variate (SNV) transformation for scatter correction. After preprocessing, there were 181 samples at 69 wavelengths.
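The 1,2,0,2 preprocessing chain described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the gap-segment derivative, moving-average window and SNV are implemented from their standard definitions, and the input spectra here are random stand-ins.

```python
import numpy as np

def first_derivative(spectra, gap=2):
    """Gap-segment first derivative along the wavelength axis."""
    return spectra[:, gap:] - spectra[:, :-gap]

def moving_average(spectra, window=2):
    """Smooth each spectrum with a simple moving average."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="valid"), 1, spectra)

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Toy spectra: 3 samples x 10 wavelengths (random stand-in data)
raw = np.random.default_rng(0).random((3, 10))
processed = snv(moving_average(first_derivative(raw, gap=2), window=2))
print(processed.shape)
```

After SNV, each spectrum has zero mean and unit standard deviation, which removes multiplicative scatter effects between samples.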
Multicollinearity was detected and dimensionality reduction was performed using PCA.
• A correlation matrix was created for the NIR spectral data, with 181 samples and 69 predictor wavelengths. It was found that 44 predictor variables had correlation coefficients greater than 0.9, indicating multicollinearity in the data.
• The preprocessed spectra were decomposed into latent vectors ranked by the amount of spectral variance explained by PCA. The first ten principal components (PCs) account for approximately 90% of the variation in the NIR spectra of the chickpea samples.
Prior to model development, the data were split into two sets: a calibration set comprising 80% of the data (145 samples) and a validation set comprising 20% (36 samples). The calibration set was further split into training and testing sets containing 75% (109 samples) and 25% (36 samples) of its data, respectively. Model development took place with the 10 PCs from PCA. The model generation process for each of the five algorithms produced a distinct and robust model.
In a linear regression model, the predictor and response variables are connected by an equation in which the exponent (power) of both variables is 1. When plotted, a linear relationship depicts a straight line. The general mathematical equation for linear regression is:
y = ax + b
y = Response variable.
x = Predictor variable.
a, b = Constants.
The mathematical equation for the LR model is given in the following equation:
y = 0.914 − 0.734x₁ + 0.072x₂ − 0.006x₃ − 0.048x₄ − 0.008x₅ + 0.013x₆ + 0.036x₇ − 0.111x₈ − 0.058x₉ + 0.039x₁₀ ....(1)
where:
y = Sugar component.
xᵢ = The 10 PCs from PCA.
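Eq. (1) can be evaluated directly once the 10 PC scores of a sample are known. The sketch below hard-codes the published coefficients; the PC score vector used here is illustrative.

```python
import numpy as np

# Coefficients of Eq. (1): intercept followed by the weights on PCs 1-10
intercept = 0.914
coefs = np.array([-0.734, 0.072, -0.006, -0.048, -0.008,
                  0.013, 0.036, -0.111, -0.058, 0.039])

def predict_sugar(pcs):
    """Evaluate Eq. (1) for a vector of 10 principal-component scores."""
    return intercept + np.dot(coefs, pcs)

# With all PC scores at zero, the prediction is just the intercept
print(predict_sugar(np.zeros(10)))  # → 0.914
```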
The RF algorithm combines a large ensemble of regression trees. The random forest model has two parameters: ntree, the number of trees, set to 500, and mtry, the number of input variables tried per node, set to 3.
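An equivalent configuration can be sketched with scikit-learn, where `n_estimators` and `max_features` play the roles of ntree and mtry. This is a stand-in for whatever implementation the study used, fitted here on simulated PC-score data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((60, 10))                         # stand-in PC scores
y = X @ rng.random(10) + 0.05 * rng.random(60)   # stand-in sugar values

# ntree = 500 and mtry = 3 map to n_estimators and max_features
rf = RandomForestRegressor(n_estimators=500, max_features=3, random_state=0)
rf.fit(X, y)
print(round(rf.score(X, y), 3))  # training R²
```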
Neurons are arranged in a neural network's three layers: input, hidden and output. Equation 2 represents the neural network, where W stands for the weights vector, X for the inputs vector and b for the bias. Equation 3 gives the sigmoid activation function that is applied.
y = f(W · X + b) .....(2)
f(z) = 1 / (1 + e^(−z)) .....(3)
where f(z) is the activation function, whose output ranges from 0 to 1. The parameters of the neural network were set to 2 hidden layers, 8 nodes per layer, a learning rate of 0.02, a momentum of 0.3 and 1500 iterations.
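Equations (2) and (3) describe the forward pass of a single neuron, which can be written out directly. This is a sketch of the equations only, not the trained 2 × 8-node network.

```python
import numpy as np

def sigmoid(z):
    """Eq. (3): logistic activation; output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(W, X, b):
    """Eq. (2): a single neuron's output y = f(W · X + b)."""
    return sigmoid(np.dot(W, X) + b)

print(sigmoid(0.0))                           # → 0.5
print(neuron(np.zeros(3), np.ones(3), 0.0))   # → 0.5
```

Stacking such neurons into layers, with the training loop adjusting W and b, yields the ANN described in the text.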
The decision tree model was developed using the rpart algorithm. rpart has two major processes: (1) tree growing and (2) tree pruning. Tree growing is the expansion of the tree at specific decision points, and tree pruning discards subtrees with poor decision scores. To develop the decision tree model, the command rpart(formula = Sugar ~ ., data = traindata) was used.
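A comparable grow-then-prune regression tree can be sketched in scikit-learn, where cost-complexity pruning (`ccp_alpha`) is the analogue of rpart's pruning step. The data and pruning strength here are illustrative, not the study's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((60, 10))                   # stand-in PC scores
y = X[:, 0] * 2 + 0.1 * rng.random(60)     # stand-in sugar values

# Grow a full tree, then prune weak subtrees via cost-complexity pruning
tree = DecisionTreeRegressor(ccp_alpha=0.001, random_state=0)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```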
The accuracy of SVR models depends on how well the loss function, the error penalty factor C and the SVR meta-parameters are configured. Additionally, the final models are significantly affected by the choice of kernel function. In this study, the commonly used radial basis function (RBF) kernel, K(x, x') = exp(−||x − x'||² / σ²), was used. The SVR model was created with C = 1, ε = 0.1 and 0.1 for the RBF kernel parameter. The number of support vectors for the SVR model is 89.
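The reported SVR configuration maps directly onto scikit-learn's `SVR`, used here as a stand-in implementation on simulated data; the number of support vectors it finds will differ from the 89 reported for the real spectra.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((100, 10))      # stand-in PC scores
y = X @ rng.random(10)         # stand-in sugar values

# RBF kernel with C = 1, epsilon = 0.1 and kernel parameter gamma = 0.1
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.1)
svr.fit(X, y)
print(len(svr.support_))       # number of support vectors
```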
The RSE values of all models lie in the range 0.06-0.09. To assess the randomness of the residuals obtained from the generated models, a residual plot was created; all residuals were roughly evenly distributed around zero with no apparent pattern, implying that the residuals are random. The Shapiro-Wilk test was used to assess the normality of the residuals of the developed models. Table 3 displays the p-value obtained for each model. All models' p-values are greater than 0.05, indicating that the data showed no evidence of non-normality.
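The residual normality check can be sketched with `scipy.stats.shapiro`; the residuals below are simulated with an SD inside the reported 0.06-0.09 RSE range, since the actual model residuals are not given.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 0.07, size=145)   # stand-in residuals

stat, p = shapiro(residuals)
# p > 0.05 would indicate no evidence against normality of the residuals
print(round(stat, 3), round(p, 3))
```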
It was discovered that the RF and ANN prediction models perform best in the 1800-1938 nm wavelength range for predicting the sugar concentration of chickpea. The model created by the RF algorithm had the lowest RMSE value, 0.053, and an RSE value of 0.062. Its R² and adjusted R² values of 0.953 and 0.937, respectively, are the highest of the five models. According to Sagar et al. (2018), the ensemble-based RF approach improves model accuracy. Owing to the randomization, the RF algorithm checks variables at each node to assess how well they split.
The RMSE, RSE, R² and adjusted R² values of the ANN model are 0.056, 0.066, 0.951 and 0.935, respectively, deviating only minimally from those of the RF model in terms of accuracy. Activation functions give neural networks their non-linearity and expressiveness. An ANN learns during the training phase by adjusting the weights to forecast the responses to the inputs. It is worth noting that the tests were carried out in the lab, and the models' dependability can only be proven once they have been applied to real-world procedures. The specific performance measures for each model are shown in Table 3.
All models were assessed using the validation set (36 samples). The accuracy of the random forest model, measured as RMSE and RSE, was confirmed to be 0.09 and 0.108, respectively. The calculated R² and adjusted R² values were 0.816 and 0.752, indicating that the RF model is effective at predicting chickpea sugar concentration in the chosen 1800-1938 nm range.
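For reference, adjusted R² follows from R², the sample size n and the number of predictors p via the standard formula below. The exact n and p underlying Table 3 are not stated, so with the illustrative choice n = 36 and p = 10 the result differs slightly from the reported 0.752.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative: validation R² = 0.816 with n = 36 samples, p = 10 PCs
print(round(adjusted_r2(0.816, 36, 10), 3))  # → 0.742
```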