To develop precision NIR models for predicting phosphorus concentration in chickpea flour, 237 chickpea germplasm accessions were selected from the National Gene Bank of ICAR-National Bureau of Plant Genetic Resources.
Collection of spectral reflectance data
For accurate analysis, a near-infrared scanning monochromator operating in reflectance mode was used. Chickpea samples were homogenised with a Foss Cyclotec grinder to ensure consistency and then packed into a circular cuvette. The Foss NIRS 6500 spinning-cuvette instrument collected spectra at 2 nm intervals over 400-2498 nm, referenced against white mica, averaging 32 scans per sample; data were recorded as the logarithm of inverse reflectance. Accurate laboratory measurements of phosphorus concentration in chickpea were critical for developing and testing the prediction models.
Spectra preprocessing
Interval Partial Least Squares (iPLS), introduced by Norgaard et al. (2000), enhances spectral data analysis by focusing on informative spectral regions. It divides the spectrum into intervals and develops a local PLS model for each, which often improves prediction accuracy over a full-spectrum model. iPLS has been successfully applied in various fields, including food science (Xiaobo et al., 2010), pharmaceuticals (Karande et al., 2010) and soil analysis (Peng et al., 2020).
Key considerations for iPLS implementation include:
1. Optimal interval size selection.
2. Effective interval selection strategies.
3. Rigorous model validation.
4. Comparison with other spectral analysis techniques.
iPLS offers improved predictive accuracy and model interpretability, making it a valuable tool for extracting insights from complex spectral data across diverse applications.
Chemometric analysis: A comprehensive framework for spectral data modeling
The development of robust chemometric models follows a structured procedural framework encompassing several critical stages: outlier identification, data preprocessing, dimensionality reduction, model development and refinement, and performance evaluation. Each stage is designed to enhance model efficacy and precision, ultimately leading to more reliable outcomes. Fig 1 illustrates this model development process through a comprehensive flowchart.
Outlier identification
Detecting outliers is a crucial step in data analysis, as aberrant data points can significantly distort model training during the supervised phase. Outliers, which deviate substantially from the rest of the sample, can prolong training and reduce model accuracy. In this study, box plot visualization was employed as an effective tool for outlier identification, providing a clear graphical representation of the data distribution and its anomalies.
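The box-plot criterion (Tukey's rule: points beyond 1.5 × IQR from the quartiles) can be sketched numerically as follows; the phosphorus values shown are hypothetical, not from the study:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside the box-plot whiskers (Tukey's rule):
    anything beyond k * IQR below Q1 or above Q3 is an outlier."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# hypothetical phosphorus concentrations with one aberrant sample
phosphorus = [0.31, 0.35, 0.33, 0.36, 0.34, 0.32, 0.90]
print(iqr_outliers(phosphorus))
```

Samples flagged this way would be inspected (and possibly removed) before model training.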
Spectra preprocessing
Spectral preprocessing is an essential component in material analysis, serving to enhance data quality and extract meaningful information. This step requires careful consideration of noise and light scattering effects, which can significantly impact data precision and accuracy. Near-infrared spectroscopy (NIRS) is particularly susceptible to scattering effects, necessitating the application of various correction techniques. Several preprocessing methods were employed to mitigate these effects:
· Multiplicative scatter correction (MSC).
· Standard normal variate (SNV).
· SNV-Detrend.
These techniques effectively minimize scattering effects and improve overall data quality. Additionally, spectral derivative techniques, such as the Savitzky-Golay polynomial derivative filter, were used to extract further information from the spectra. It is crucial that the selection of preprocessing techniques aligns with the subsequent modeling phase: while stacking multiple preprocessing steps can expose more information, it may also introduce unnecessary complexity and impair predictive accuracy.
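A minimal numerical sketch of these corrections (SNV, MSC and a Savitzky-Golay derivative), written in NumPy/SciPy rather than the chemometric software actually used, with synthetic two-row "spectra" standing in for real scans:

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row)
    to zero mean and unit standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum against
    a reference (mean spectrum by default), remove slope and offset."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)
        corrected[i] = (s - intercept) / slope
    return corrected

def sg_derivative(spectra, window=11, poly=2, deriv=1):
    """Savitzky-Golay polynomial derivative along the wavelength axis."""
    return savgol_filter(spectra, window, poly, deriv=deriv, axis=1)

# two copies of one underlying spectrum with different scatter (slope/offset)
base = np.sin(np.linspace(0, 3, 200))
spectra = np.stack([2.0 * base + 0.5, 0.5 * base - 0.2])
corrected = msc(spectra)  # both rows collapse onto the common shape
```

After MSC the two scatter-distorted rows coincide, which is exactly the effect these corrections aim for on real NIR spectra.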
Multicollinearity removal using Principal Component Analysis (PCA)
Principal component analysis (PCA) effectively addresses multicollinearity in datasets, particularly in regression analysis. It reduces dimensionality while preserving significant information by identifying principal components: linear combinations of the original variables that capture maximum variance. In multiple regression models, PCA enhances stability and reliability by excluding redundant variables. Combined with outlier detection and spectral preprocessing, this approach forms a robust framework for developing accurate spectral data models, improving their quality and interpretability.
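A brief sketch (synthetic data, scikit-learn; the study itself worked in R) of how PCA collapses hundreds of collinear wavelengths into a handful of uncorrelated scores suitable as regressors:

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical spectral matrix: 50 samples x 200 strongly collinear variables
rng = np.random.default_rng(2)
latent = rng.normal(size=(50, 3))          # three underlying factors
X = latent @ rng.normal(size=(3, 200)) + rng.normal(scale=0.01, size=(50, 200))

# retain the components needed to explain 99% of the variance
pca = PCA(n_components=0.99)
scores = pca.fit_transform(X)              # uncorrelated columns for regression
print(scores.shape)
```

Because the 200 variables derive from only three latent factors, a few components suffice, and the resulting score columns are mutually orthogonal, removing the multicollinearity that destabilizes multiple regression.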
Model development: A comprehensive approach using R packages
Recent work by Verma et al. (2023) in the Asian Journal of Dairy and Food Research validated the use of spectroscopic methods for rapid quality assessment in food products. Their findings showed strong correlations between spectral data and chemical composition, supporting the viability of NIR-based analysis methods.
This study employed diverse machine learning algorithms to predict target variables from preprocessed spectral data, using R packages:
1. Linear regression: lm() function (Chambers, 1992).
2. Neural networks: neuralnet() function from the neuralnet package, exploring tanh and sigmoid activations.
3. Random forest: randomForest() function from the randomForest package.
4. Support vector regression: svm() function from the e1071 package.
5. Decision tree regression: rpart() function from the rpart package.
This comprehensive approach enables thorough evaluation of different modeling techniques, facilitating identification of the most accurate and reliable method for spectral data analysis in our specific context. The systematic comparison of these models contributes to the broader field of chemometric analysis and spectroscopic prediction.
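The study implemented these five model families in R; the comparison loop can be sketched equivalently in Python with scikit-learn. The data, train/test split and hyperparameters below are arbitrary placeholders, not the study's settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# synthetic stand-in for the PCA scores and phosphorus values
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "neural net": MLPRegressor(hidden_layer_sizes=(16,), activation="tanh",
                               max_iter=2000, random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
    "svr": SVR(),
    "decision tree": DecisionTreeRegressor(random_state=0),
}
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5

for name in sorted(rmse, key=rmse.get):
    print(f"{name}: RMSE = {rmse[name]:.3f}")
```

Fitting every candidate on the same split and ranking by held-out RMSE mirrors the systematic comparison described above.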
Model evaluation: A comprehensive approach
Model evaluation is crucial for assessing predictive performance and reliability. This study employs a suite of statistical metrics to evaluate the regression models: Root Mean Square Error (RMSE), Residual Standard Error (RSE), Residual Prediction Deviation (RPD), coefficient of determination (R²) and Adjusted R². Mean Squared Error (MSE) is a fundamental metric for regression analysis, representing the average squared difference between predicted and actual values:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(P_i - r_i\right)^2$$

Where:
N: Total number of observations.
P_i: Observed values.
r_i: Predicted values.
However, MSE's sensitivity to outliers necessitates the use of more robust metrics. Root Mean Square Error (RMSE), the square root of MSE, provides an error measure in the same unit as the target variable:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - r_i\right)^2}$$
Lower RMSE values indicate superior predictive capability, making it a widely adopted metric in predictive modeling.
The Residual Prediction Deviation (RPD), introduced by Williams and Sobering (1995), offers an objective assessment of model validity by considering both prediction errors and data variability:

$$\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSE}}$$

Where:
SD (σ): Standard deviation of observed values.
RMSE: Root mean square error of prediction.
Higher RPD values indicate greater predictive capacity and facilitate comparison across model validation studies. Residual analysis is essential for evaluating regression model adequacy. The Residual Standard Error (RSE) quantifies the average deviation of residuals from the regression line, with lower values indicating a better model fit. In this study, the Shapiro-Wilk test (α = 0.05) was applied to assess residual normality, verifying the random nature of the residuals across all models.
The coefficient of determination (R²) quantifies the proportion of variance in the dependent variable explained by the independent variables:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:
SS_res: Sum of squared residuals.
SS_tot: Total sum of squares.
Adjusted R² accounts for the number of predictors in the model, providing a more conservative estimate of the model fit:

$$R^2_{adj} = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - k - 1}$$

Where:
n: Sample size.
k: Number of predictors.
Model selection in this study is based on minimizing RMSE and RSE while maximizing R² and Adjusted R². This multi-criteria approach ensures the identification of models that optimally balance predictive accuracy and model complexity.
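The metrics above can be computed together in a few lines; this sketch uses hypothetical observed and predicted values (not the study's data) and standard textbook formulas:

```python
import numpy as np
from scipy import stats

def evaluate(observed, predicted, k):
    """Compute RMSE, RPD, R², Adjusted R², RSE and a Shapiro-Wilk
    normality p-value for the residuals, given k predictors."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = observed.size
    residuals = observed - predicted
    rmse = np.sqrt(np.mean(residuals ** 2))
    rpd = observed.std(ddof=1) / rmse            # SD of observed / RMSE
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    rse = np.sqrt(ss_res / (n - k - 1))          # residual standard error
    shapiro_p = stats.shapiro(residuals).pvalue  # residual normality check
    return {"RMSE": rmse, "RPD": rpd, "R2": r2,
            "Adj_R2": adj_r2, "RSE": rse, "Shapiro_p": shapiro_p}

# hypothetical phosphorus values and near-perfect predictions
obs = np.array([0.30, 0.35, 0.40, 0.45, 0.50, 0.55])
pred = obs + np.array([0.01, -0.02, 0.00, 0.01, -0.01, 0.01])
results = evaluate(obs, pred, k=1)
print(results)
```

Running all candidate models through one such function makes the multi-criteria ranking (minimum RMSE/RSE, maximum R²/Adjusted R², high RPD) mechanical and reproducible.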
Table 1 presents comprehensive evaluation statistics for the five models, demonstrating their efficacy in predicting phosphorus content in chickpea flour. This multi-criteria validation underscores the reliability of the modeling approach and supports the selection of an optimal model for accurate phosphorus prediction.