Prediction based on univariate linear regression analysis
The univariate linear regression analysis revealed that peak milk yield gave the highest prediction accuracy, with an R² of 64.15% and an RMSE of 256.0293. Among the monthly test-day milk yield records, TD-5 showed the highest accuracy (R² = 62.53%), followed by TD-7 (R² = 62.30%). The records at the extremes of lactation (TD-1, TD-11, TD-10, TD-9, TD-2) showed lower prediction accuracy, whereas the mid-lactation records (TD-5, TD-7, TD-3, TD-4, TD-6) showed higher prediction accuracy. Furthermore, the regression coefficient (bᵢ) was highest for TD-7. These results suggest that mid-lactation monthly test-day milk yield records up to TD-7 have the potential to predict FL305DMY in conjunction with peak milk yield; they were therefore carried forward to the multiple linear regression analysis. Restricting the predictors to records up to TD-7 would also facilitate early prediction of FL305DMY. The results of the univariate linear regression analysis for the prediction of FL305DMY based on peak milk yield and monthly test-day milk yield records are presented in Table 1.
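As an illustration of how the two accuracy measures reported above are obtained, the following minimal sketch fits a one-predictor regression by ordinary least squares and returns R² (as a percentage) and RMSE. The function name `fit_univariate` and the small data set are hypothetical and are not taken from the study; the RMSE here divides by n, which is an assumption, since the text does not state which divisor was used.

```python
from math import sqrt

def fit_univariate(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, R2 in percent, RMSE)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # regression coefficient (slope)
    a = mean_y - b * mean_x       # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2_percent = 100.0 * (1.0 - ss_res / ss_tot)
    rmse = sqrt(ss_res / n)       # assumption: divisor n, not n - 2
    return a, b, r2_percent, rmse

# Hypothetical test-day yields (kg) and 305-day yields (kg), for illustration only.
td_yield = [6.0, 7.5, 8.0, 9.0, 10.5]
fl305dmy = [1500.0, 1800.0, 1850.0, 2100.0, 2300.0]
a, b, r2, rmse = fit_univariate(td_yield, fl305dmy)
print(f"FL305DMY = {a:.2f} + {b:.2f} * TD, R2 = {r2:.2f}%, RMSE = {rmse:.2f}")
```

Running the same function once per predictor (peak yield and each test-day record) and comparing the returned R² values reproduces the kind of ranking summarised in Table 1.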
Sharma et al. (2019) performed univariate linear regression analysis for the prediction of FL305DMY in 384 crossbred cattle and reported that peak milk yield alone gave an R² of 47.10%, which was lower than the estimate observed in the present study. They also reported that mid-lactation monthly test-day records gave higher prediction accuracy, with TD-7 showing the highest R² (75.15%). Similarly, Sah et al. (2013), in a study of 300 Kankrej cows, reported that peak milk yield alone gave 49.9% prediction accuracy and that the highest R² (67.10%) was exhibited by TD-8. Singh and Rana (2008) and Elmaghraby (2009) reported that the highest R² was exhibited by TD-6 in Murrah and Egyptian buffaloes, respectively.
Prediction based on multiple linear regression analysis
The regression analysis involving two independent variables (bivariate linear regression), with peak milk yield as one of the variables, generated a total of seven prediction equations (Table 2). The prediction accuracy of these equations ranged from 65.84% to 79.80%. The best prediction equation featured PY and TD-7, with an R² of 79.80% and an RMSE of 192.4454. PY alone had yielded a prediction accuracy of 64.15% (Table 1); adding one more variable to the equation raised the accuracy to 79.80%, a substantial increase of 15.65 percentage points. Furthermore, PY and TD-5 together yielded a prediction accuracy of 73.55% and ranked third, even though TD-5 alone was the most accurate predictor among the monthly test-day milk yields in the univariate linear regression analysis (Table 1). It could be inferred that the variable with the highest prediction accuracy in univariate regression analysis is not necessarily the best one when considered in combination with others. Among the top five prediction equations generated from the bivariate linear regression analysis, the variation in accuracy was evident; in contrast, no such marked variation in accuracy was found among the top five prediction equations generated from regression analyses with three, four and more independent variables.
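The bivariate screening described above (PY paired with each test-day record in turn, then ranked by R²) can be sketched as follows. This is an illustrative sketch only: the helper names `r2_percent` and `rank_bivariate` are hypothetical, the data are synthetic and generated so that TD-7 carries most of the signal, and NumPy's least-squares solver stands in for whatever statistical software was actually used.

```python
import numpy as np

def r2_percent(X, y):
    """R2 (in percent) of an OLS fit of y on the columns of X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(100.0 * (1.0 - ss_res / ss_tot))

def rank_bivariate(py, test_days, y):
    """Fit y ~ PY + TD for each test-day record; return (name, R2) pairs, best first."""
    scores = [
        (name, r2_percent(np.column_stack([py, td]), y))
        for name, td in test_days.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Synthetic data for illustration only; not the records analysed in the study.
rng = np.random.default_rng(42)
n = 300
py = rng.normal(12.0, 2.0, n)
test_days = {f"TD-{k}": rng.normal(9.0, 1.5, n) for k in range(1, 8)}
y = (50.0 * py + 150.0 * test_days["TD-7"] + 40.0 * test_days["TD-5"]
     + rng.normal(0.0, 100.0, n))

ranking = rank_bivariate(py, test_days, y)
for name, score in ranking:
    print(f"PY + {name}: R2 = {score:.2f}%")
```

Because each pair is refitted from scratch, the ranking can differ from the univariate ranking, which is the behaviour noted above for TD-5 and TD-7.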
Sharma et al. (2019), in a study of 384 crossbred cattle, reported that the best bivariate linear regression equation consisted of PY and TD-7, which was in agreement with the present study. They reported R² values ranging from 50.80% (PY and TD-1) to 85.20% (PY and TD-7). Interestingly, a very large increase in R² (38.10 percentage points) was observed on the introduction of an additional TD to the univariate peak milk yield equation, which was more than double the increase observed in the present study on Murrah buffalo. This difference could be attributed to genetic differences between the species. Similarly, Sah et al. (2013) also reported that the best bivariate linear regression equation consisted of PY and TD-7, with 75.10% prediction accuracy in Kankrej cows. Singh and Kumar (2007) studied peak milk yield along with the pre-peak period in 284 Karan Fries cows and reported that the two together gave 61.85% accuracy in the prediction of FL305DMY. Based on the magnitude of the regression coefficients, they also reported that the contribution of the pre-peak period to the observed prediction accuracy was negligible and that PY emerged as the key predictor of FL305DMY.
In-depth analysis of prediction equations
Stepwise backward multiple linear regression (MLR) analysis of the monthly test-day milk yield records was performed with and without peak milk yield, and the best equations for the prediction of FL305DMY are presented in Table 3 and Table 4, respectively. The best prediction equation with three independent variables including peak milk yield (PY, TD-4 and TD-7) showed an accuracy of 84.79%, whereas TD-4 and TD-7 alone (without peak milk yield) showed 81.24% accuracy. With four independent variables including PY, the best prediction equation (PY, TD-2, TD-4 and TD-7) yielded an R² of 87.24%, whereas the equation incorporating only TD-2, TD-4 and TD-7 showed an R² of 86.28%. The best prediction equation with five independent variables including PY (PY, TD-2, TD-4, TD-6 and TD-7) gave an accuracy of 89.30%, whereas the equation with only TD-2, TD-4, TD-6 and TD-7 (without PY) gave 88.87%. Similarly, for the analyses with six and seven independent variables including PY, the R² was 89.74% and 90.13%, respectively; the same equations without PY yielded accuracies of 89.54% and 90.04%, respectively. With PY and all seven TDs, the prediction accuracy was 90.45%, whereas the seven TDs without PY yielded an accuracy of 90.37%.
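The paired figures above can be differenced to show how much PY adds at each equation size. The short sketch below does this arithmetic using only R² values reported in the text. Two points are assumptions: the bivariate row pairs the PY and TD-7 equation (79.80%) with the univariate TD-7 result (62.30%), and the six- and seven-variable equations are labelled generically because their test-day composition is not listed in this section.

```python
# R2 (%) as reported in the text: (equation with PY, same test-day records without PY).
reported = {
    "PY + TD-7": (79.80, 62.30),  # assumption: paired with the univariate TD-7 result
    "PY + TD-4 + TD-7": (84.79, 81.24),
    "PY + TD-2 + TD-4 + TD-7": (87.24, 86.28),
    "PY + TD-2 + TD-4 + TD-6 + TD-7": (89.30, 88.87),
    "six variables incl. PY": (89.74, 89.54),
    "seven variables incl. PY": (90.13, 90.04),
    "PY + all seven TDs": (90.45, 90.37),
}

# Contribution of PY in percentage points of R2.
contribution = {
    label: round(with_py - without_py, 2)
    for label, (with_py, without_py) in reported.items()
}

for label, gain in contribution.items():
    print(f"{label}: PY adds {gain:.2f} percentage points")
```

The differences fall steadily as more test-day records enter the equation, from double digits in the bivariate case to well under one percentage point from four variables onward.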
The results are compared graphically in Fig 1. The illustration shows an increasing trend in prediction accuracy (R²) with each successive introduction of an additional TD into the regression equation. Although R² increased with each added TD, the size of the increase declined at each successive step. The largest increase in R² (15.65 percentage points) was observed when the analysis moved from univariate to bivariate regression. The R² increased appreciably up to the five-variable regression analysis; however, closer examination revealed that the contribution of PY to the prediction accuracy of the five-variable equation was relatively limited. Comparison of the best prediction equations with PY against the same equations without PY showed that the contribution of PY to prediction accuracy ranged between 0.08 and 17.50 percentage points. The largest contribution of PY (17.50 percentage points) was observed in the bivariate regression analysis. The contribution of PY to prediction accuracy was appreciable up to the three-variable regression analysis; thereafter no major contribution was observed. Therefore, it could be inferred that, for the prediction of FL305DMY, the best prediction equation including peak milk yield was the three-variable MLR equation (PY, TD-4 and TD-7), with an R² of 84.79%. Beyond this three-variable equation, including peak milk yield in the prediction of FL305DMY would add little accuracy and would unnecessarily increase the cost of data recording.
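The diminishing-returns pattern described above can be checked directly from the R² values reported for the best equation at each size (one to eight variables, including PY). The sketch below is plain arithmetic on those reported figures; the variable names are illustrative.

```python
# Best R2 (%) including PY for 1, 2, ..., 8 independent variables, as reported in the text.
best_r2 = [64.15, 79.80, 84.79, 87.24, 89.30, 89.74, 90.13, 90.45]

# Gain in R2 (percentage points) from adding one more variable.
gains = [round(after - before, 2) for before, after in zip(best_r2, best_r2[1:])]

for n_vars, gain in enumerate(gains, start=2):
    print(f"{n_vars - 1} -> {n_vars} variables: +{gain:.2f} percentage points")

# Each successive gain is smaller than the one before it.
is_diminishing = all(earlier > later for earlier, later in zip(gains, gains[1:]))
print("Diminishing gains:", is_diminishing)
```

The first step accounts for the 15.65-point jump noted in the text, while every step beyond five variables adds less than half a percentage point.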
Singh et al. (2022) reported that the optimal equation for the prediction of FL305DMY, consisting of age at first calving (AFC), PY, TD-1, TD-2, TD-3 and TD-4, showed 80.53% accuracy. However, the regression coefficient in this equation indicated that the role of AFC as a predictor of FL305DMY in Murrah buffalo was negligible. A higher R² estimate of 83.91% for the prediction equation consisting of TD-4 and TD-7 only (without PY) was reported by Sharma et al. (2019) in crossbred cattle. In addition, they reported that the best prediction equation incorporating peak milk yield (PY, TD-6 and TD-7) showed 86.30% accuracy, whereas early prediction could be achieved with PY, TD-5 and TD-6 at 80.30% accuracy. Sah et al. (2013) suggested that the best prediction equation including PY consisted of three variables (PY, TD-6 and TD-7) and exhibited 78.10% accuracy.