An attempt was made to develop a model to identify suitable morphometric variables that predict the seed yield of soybean germplasm accessions with high genetic variability (Shruthi et al., 2021). The results of the multiple linear regression (MLR) indicated that the VIF values of most of the variables exceeded 10 (Table 1) and that the independent variables were highly correlated (Fig 2), indicating a multicollinearity problem in the data set. This problem affects the MLR results and leads to wrong interpretation. Although the selected explanatory variables explained 85.2 per cent of the variation in seed yield (R² = 0.852), only a few variables, namely number of pods per plant (0.70**) and days to maturity (-0.28**), contributed significantly to changes in seed yield (Table 1). To overcome the multicollinearity observed in the biometric data and to identify the major factors of influence, principal component regression (PCR) was used (Goyal and Verma 2018). The eigenvalue corresponding to each principal component represents the variance associated with that component. The first four principal components had eigenvalues greater than 1 and together explained 80.28% of the variability present in the data. Therefore, the first four principal components were selected to build the principal component regression model. The rotated component factor loadings are presented in Table 1. The factor loadings represent the weights assigned to each variable in the linear combination corresponding to each eigenvalue.
The linear combination of these factor loadings with the corresponding variables gives the principal components. To assess the relationship between the principal components and seed yield, principal component regression was fitted with the principal components as independent variables and seed yield as the dependent variable. The regression coefficients of the first (5.78**) and fourth (6.30**) principal components contributed significantly to seed yield, so variables with factor loadings of more than 0.7 on the first and fourth principal components were considered important for seed yield improvement. As per Table 1, principal component regression showed that quantitative variables such as shoot length (0.87), root length (0.87), hypocotyl length (0.81), epicotyl length (0.74), plant height at 30 days (0.84), plant height at 40 days (0.87), plant height at harvest (0.83) and number of pods per plant (0.70) contributed significantly to seed yield. Although PCR copes better with multicollinearity, its prediction accuracy was low (R² = 0.582), so regression tree and random forest models, which work well under multicollinearity and give high prediction accuracy, were tried next.
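The VIF screen that motivated the move away from MLR can be illustrated with a minimal sketch. It assumes the usual definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors; the function names and the eight data values are hypothetical and are not taken from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of a least squares fit of y on the predictors plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    result = []
    for j, target in enumerate(columns):
        others = [c for k, c in enumerate(columns) if k != j]
        result.append(1.0 / (1.0 - r_squared(target, others)))
    return result


# Hypothetical illustration data: two nearly collinear heights and one unrelated trait.
height_30 = [20, 22, 25, 27, 30, 33, 35, 38]
height_40 = [31, 32, 36, 37, 41, 44, 45, 49]
seed_size = [5, 3, 6, 2, 5, 3, 6, 2]

vifs = vif([height_30, height_40, seed_size])
print([round(v, 2) for v in vifs])
```

In this toy data the two plant heights move almost in lockstep, so their VIF values exceed the threshold of 10 used above, while the unrelated trait stays close to 1.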
The regression tree ranks the variables by their contribution to predicting seed yield using the classification and regression tree (CRT) method; the rpart algorithm in R was used to build the model. Fig 3 presents the regression tree results on the importance of each morphological character for seed yield; the higher the importance score, the more important the variable. The variables ranked in the following order: number of pods per plant (9425.73) > number of branches per plant (2153.73) > plant height at harvest (1823.25), etc., as given in Fig 3. Seed size, seed thickness and days to maturity had importance scores near zero, so they never appeared as primary or surrogate splitters and the regression tree model eliminated these variables from the tree. Number of pods per plant, number of branches per plant, plant height at harvest, plant height at 30 days and plant height at 40 days were considered important traits based on their high importance scores (Fig 3). The overall prediction accuracy of this model (R² = 0.766) was much better than that of principal component regression (R² = 0.582), as given in Table 2; hence, the random forest model was tried next.
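The idea behind ranking variables by their splitting power can be sketched as follows. This is a simplified illustration, not the rpart procedure: it scores each variable only by the reduction in the sum of squared errors of its best single split at the root, whereas a full regression tree also accumulates contributions from deeper and surrogate splits. The function names and data values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, impurity decrease) of the best binary split x <= threshold."""
    parent = sse(y)
    best_threshold, best_gain = None, 0.0
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(levels, levels[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        gain = parent - sse(left) - sse(right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


def rank_variables(features, y):
    """Rank variables by the impurity decrease of their best root split."""
    scores = {name: best_split(column, y)[1] for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical illustration data, not values from the study.
features = {
    "pods_per_plant": [20, 25, 30, 35, 40, 45, 50, 55],
    "seed_size": [5, 3, 6, 2, 5, 3, 6, 2],
}
seed_yield = [15, 17, 20, 22, 33, 35, 38, 41]

ranking = rank_variables(features, seed_yield)
threshold, gain = best_split(features["pods_per_plant"], seed_yield)
print(ranking)
```

A variable whose best split barely reduces the squared error, like the seed size column here, receives a score near zero and would not be chosen as a splitter.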
The random forest model predicted seed yield using the random forest algorithm in R. A tuning grid was used to identify the optimal number of variables available for splitting at each tree node (mtry), the number of trees to grow (ntree) and the maximum number of terminal nodes per tree (maxnodes). Among all possible combinations, the optimal parameters, mtry = 10 (R² = 0.74, RMSE = 6.15, MSE = 4.95), ntree = 130 (R² = 0.79, RMSE = 5.15, MSE = 4.90) and maxnodes = 8 (R² = 0.71, RMSE = 6.12, MSE = 4.94), gave a high level of prediction accuracy. The overall prediction accuracy of this model (R² = 0.925) was much better than that of all other models, as given in Table 2. Fig 3 also shows the ranking of the relative importance of each morphological character for seed yield. A higher node purity value indicates a more important variable. Number of pods per plant was the most important variable, with the highest purity value (4765.41). The order of importance according to the purity values obtained by random forest was number of pods per plant (4765.41) > number of branches per plant (1265.36) > plant height at harvest (1113.83), etc., as given in Fig 3. Hence, number of pods per plant, plant height at harvest, number of branches per plant, plant height at 30 days and plant height at 40 days were considered important variables, as in the regression tree model, because of their high purity values; these parameters also have a significant positive relationship with seed yield, as given in Fig 1.
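The roles of ntree, mtry and the purity-based importance can be illustrated with a deliberately small sketch. It is not the R implementation: each tree is limited to a single split (two terminal nodes), the importance is simply the mean decrease in the sum of squared errors credited to each variable, and all names and data values are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Best threshold on one variable and the impurity decrease it achieves."""
    parent = sse(y)
    best_threshold, best_gain = None, 0.0
    levels = sorted(set(x))
    for low, high in zip(levels, levels[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        gain = parent - sse(left) - sse(right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


def fit_forest(features, y, ntree=50, mtry=2, seed=1):
    """Grow one-split trees on bootstrap samples, trying mtry random variables per tree."""
    rng = random.Random(seed)
    names = list(features)
    n = len(y)
    trees = []
    purity = {name: 0.0 for name in names}
    for _ in range(ntree):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        y_boot = [y[i] for i in rows]
        # Fallback tree: predict the bootstrap mean if no useful split is found.
        tree = {"name": None, "mean": sum(y_boot) / n, "gain": 0.0}
        for name in rng.sample(names, mtry):  # random subset of candidate variables
            x_boot = [features[name][i] for i in rows]
            threshold, gain = best_split(x_boot, y_boot)
            if threshold is not None and gain > tree["gain"]:
                left = [yi for xi, yi in zip(x_boot, y_boot) if xi <= threshold]
                right = [yi for xi, yi in zip(x_boot, y_boot) if xi > threshold]
                tree = {"name": name, "threshold": threshold, "gain": gain,
                        "left": sum(left) / len(left), "right": sum(right) / len(right)}
        if tree["name"] is not None:
            purity[tree["name"]] += tree["gain"] / ntree
        trees.append(tree)
    return trees, purity


def predict(trees, row):
    """Average the predictions of all trees for one observation (dict of values)."""
    total = 0.0
    for tree in trees:
        if tree["name"] is None:
            total += tree["mean"]
        elif row[tree["name"]] <= tree["threshold"]:
            total += tree["left"]
        else:
            total += tree["right"]
    return total / len(trees)


# Hypothetical illustration data, not values from the study.
features = {
    "pods_per_plant": [20, 25, 30, 35, 40, 45, 50, 55],
    "branches_per_plant": [3, 4, 4, 5, 6, 6, 7, 8],
    "seed_size": [5, 3, 6, 2, 5, 3, 6, 2],
}
seed_yield = [15, 17, 20, 22, 33, 35, 38, 41]

trees, purity = fit_forest(features, seed_yield)
high = predict(trees, {"pods_per_plant": 55, "branches_per_plant": 8, "seed_size": 2})
low = predict(trees, {"pods_per_plant": 20, "branches_per_plant": 3, "seed_size": 5})
print(purity, high, low)
```

Tuning in the sense described above would amount to repeating such a fit over a grid of ntree, mtry and tree-size values and keeping the combination with the best held-out accuracy.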
To check the capability of each model to predict seed yield, the data were divided into two sets, viz. training and testing data. Of the observations, 80% (i.e. 83 genotypes) were used for training and 20% (i.e. 20 genotypes) were used for testing the models. The models were trained separately, and the best model was selected on the basis of its prediction accuracy on the testing data. The comparative results for the multiple linear regression, principal component regression, regression tree and random forest models are given in Table 2. Prediction accuracy measures such as RMSE, MAPE and the R² statistic indicate the superiority of random forest for predicting the seed yield of soybean germplasm accessions. The model indicates that number of pods per plant, number of branches per plant, plant height at harvest, plant height at 30 days and plant height at 40 days are the most influential morphological characters for seed yield. Finally, cluster analysis was used to identify genotypes that are superior in these most influential characters. Fig 4 displays the k-means cluster analysis results across all genotypes, based on the major morphological characters identified by the best model (random forest), using the fviz_cluster function in R. The genotypes were grouped into three final clusters of 49, 2 and 47 genotypes, respectively. The mean seed yield (g/plant) of the clusters (Cluster 1: 18.999 ± 0.658, Cluster 2: 45.940 ± 3.920, Cluster 3: 34.170 ± 3.653) varied significantly, and the genotypes of the second cluster showed superiority in seed yield. The genotypes CAT-586 and JS-SH-1310 of the second cluster were superior in seed yield (45.94 g/plant), number of pods per plant (57.00), plant height at harvest (85.50 cm), plant height at 30 days (52.10 cm), plant height at 40 days (66.15 cm) and root length (12.89 cm) compared with the other two clusters.
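The hold-out evaluation used to compare the models can be sketched as follows. The split function and the metric definitions are the standard ones and are assumed here rather than taken from the study; the function names, the random seed and the observed and predicted values are hypothetical.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)  # rounded down
    return shuffled[n_test:], shuffled[:n_test]


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute percentage error, in per cent (observed values must be non-zero)."""
    n = len(observed)
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / n


def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# 103 genotype indices split 80:20, giving 83 training and 20 testing rows.
train_rows, test_rows = train_test_split(range(103))

# Hypothetical observed and predicted seed yields (g/plant) for four test genotypes.
observed = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 27.0, 41.0, 52.0]
print(rmse(observed, predicted), mape(observed, predicted), r2(observed, predicted))
```

Lower RMSE and MAPE together with a higher R² on the testing rows indicate the better model, which is the criterion applied in Table 2.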