Principal Component Analysis with Association Analysis and Clustering of Genotypes by K-means Clustering in Greengram [Vigna radiata (L.) Wilczek]

P. Shanthi1
M. Umadevi1,*
N. Manivannan1
M. Gunasekaran1
D. Sassikumar1
1Centre for Plant Breeding and Genetics, Tamil Nadu Agricultural University, Coimbatore-641 003, Tamil Nadu, India.
  • Submitted 06-03-2025

  • Accepted 03-04-2026

  • First Online 25-04-2026

  • doi 10.18805/LR-5488

Background: India is recognized as the leading global producer of greengram, which is grown in nearly all states. This legume is a rich source of protein, fiber and iron. The identification of superior genotypes depends primarily on genetic diversity. Analytical techniques, including correlation, principal component analysis (PCA) and cluster analysis, are essential for clarifying the diversity found among genotypes.

Methods: The study was carried out over two consecutive years, 2020 and 2021, during the Kharif season at the National Pulses Research Centre (NPRC) located in Vamban, Pudukkottai, Tamil Nadu, utilizing a randomized block design.

Result: The association analysis among the five yield-contributing traits revealed that two traits, namely plant height and total number of pods per plant, exhibited a significant positive association with grain yield. Additionally, 100-seed weight showed a significant positive correlation with plant height and a significant negative correlation with both days to 50 per cent flowering and the number of primary branches per plant. The principal components PC1 (Dim.1) and PC2 (Dim.2) were crucial in explaining the variability within the dataset. PC1, with an eigenvalue of 2.3, accounted for a substantial portion of the total variance (36.27 per cent) and mainly reflected plant height. PC2, with an eigenvalue of 1.75, contributed an additional 29.78 per cent of the total variance, with a significant loading for grain yield per hectare, which aligns with the observed positive correlations. K-means cluster analysis grouped the genotypes into four distinct clusters. The majority of the genotypes originated from diverse parental crosses and were allocated to different clusters, suggesting a lack of correlation between genetic and geographical diversity. The high-yielding genotypes identified in cluster III, namely VGG 20-010, VGG 20-066, VGG 20-069, VGG 20-068, VGG 20-071, VGG 20-088, VGG 20-091, VGG 20-092, VGG 20-230, VGG 20-232, VGG 20-233, VGG 20-234 and VBN(Gg)2, contributed significantly to the overall diversity in terms of the number of primary branches and pods per plant, indicating their potential utility in future breeding programs.

Greengram is an important pulse crop in India, extensively cultivated across various states and widely preferred by consumers. It occupies about 3.79 million hectares, with a total production of 2.92 million tonnes and an average productivity of 670 kg per hectare, as reported in the Crop Outlook Reports of Andhra Pradesh for greengram (Annual Report, 2023-24). This crop contributes nearly 11 per cent of India’s total pulse production. Furthermore, it is a popular legume that is well adapted to arid and semi-arid environments and plays a significant role in improving soil fertility by fixing approximately 30 to 40 kilograms of nitrogen per hectare (Majhi et al., 2020; Muchomba et al., 2023). As a result, the nitrogen requirement of succeeding crops is reduced by about 25 per cent of the recommended dose. To meet the growing demand for greengram, it is essential to overcome yield limitations through the development of high-yielding varieties with tolerance to biotic and abiotic stresses.
       
Yield is a complex trait influenced by several interrelated components. A clear understanding of the associations between yield and its contributing traits is necessary, as selection based solely on yield may not result in substantial improvement (Sadimantara et al., 2023). Yield enhancement can therefore be achieved by prioritizing the selection of traits that show a strong positive correlation with seed yield. Principal component analysis (PCA) is a useful statistical approach for identifying key traits that account for maximum variation by reducing high-dimensional data into a smaller set of uncorrelated variables (Nainu et al., 2020). When combined with cluster analysis, PCA effectively aids in identifying genetically diverse genotypes suitable for crop improvement programs (Duppala et al., 2017). Selecting highly diverse parents is crucial for generating superior segregants during the hybridization and selection processes of plant breeding (Manojkumar et al., 2018). K-means clustering is an effective analytical method for evaluating genetic diversity, as it groups germplasm accessions into genetically distinct clusters or heterotic groups based on measures of genetic distance. After these heterotic groups are established, clusters that are highly divergent from one another can be readily identified. The germplasm accessions within such widely separated clusters represent a high level of genetic variability (Kanavi et al., 2020). Accordingly, the present study was undertaken to assess genetic diversity and the factors contributing to seed yield in greengram through correlation analysis and PCA, with the objective of identifying promising genotypes for future breeding and improvement efforts.

The experiment was conducted on 107 advanced greengram [Vigna radiata (L.) Wilczek] genotypes, evaluated in a randomized block design (RBD) with two replications during the Kharif seasons of 2020 and 2021 at the National Pulses Research Centre (NPRC), Vamban, Pudukkottai, Tamil Nadu. Each entry was sown in a plot measuring 2 m in length and 2.4 m in width, with a spacing of 30 cm × 10 cm. Standard agronomic practices were followed to ensure optimal crop establishment. Biometric observations were recorded for six quantitative traits, namely days to 50 per cent flowering, plant height, number of primary branches per plant, number of pods per plant, 100-seed weight and single plant yield, based on five randomly selected plants from each entry. The averages of these five plants over the two seasons were used for statistical analysis. All data analyses were conducted using R statistical software version 4.4.2. Principal component analysis was performed to determine which plant traits contributed significantly to the observed variation among the genotypes, following the methodology of Upadhyaya et al. (2007). Furthermore, correlation and genetic diversity analyses were carried out to identify the most promising and diverse genotypes for the development of stable and well-adapted varieties. The advanced breeding lines were categorized using the k-means clustering model as described by MacQueen (1967) and Forgy (1965). K-means clustering is a machine learning technique that evaluates genetic diversity by grouping germplasm accessions based on genetic distances. The resulting heterotic groupings make genetically distant clusters easy to identify, so plant breeders can readily select genetically diverse accessions for further evaluation and use them as parental lines in crossing programs.

The analysis of variance (ANOVA) revealed significant differences among genotypes for all the characters in both seasons, confirming the presence of substantial variability in the experimental material. The correlation coefficient measures the interdependence between plant characters and identifies the component characters on which selection can be relied upon for genetic improvement of yield. The correlation results are presented in Fig 1. Among the five yield-contributing characters evaluated, plant height and total number of pods per plant showed a significant positive correlation with grain yield; similar results were reported by Sandhiya and Saravanan (2018). In contrast, primary branches per plant and test weight (100-seed weight) recorded positive but non-significant correlations with grain yield, which is in agreement with the findings of Sheetal et al. (2014).

Fig 1: Correlation coefficients among six quantitative traits in greengram genotypes.


       
With respect to inter-trait correlations among the yield-attributing traits, 100-seed weight showed a significant positive association with plant height, whereas it exhibited a significant negative correlation with days to 50 per cent flowering and primary branches per plant. These findings are in accordance with the results of Din et al. (2015). Furthermore, primary branches per plant and total number of pods per plant had a significant positive correlation with days to 50 per cent flowering, corroborating earlier reports by Punia et al. (2014) and Shanthi et al. (2024).
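
As a rough illustration of how such correlations are judged significant, the Python sketch below (not the authors' R code) computes Pearson's r, the corresponding t statistic and the smallest |r| that passes a two-tailed test. It assumes the correlations were computed across the 107 genotype means (df = 105) and uses an approximate tabulated critical t of 1.983 at the 5 per cent level. The function names and the toy data are hypothetical and are not the study's measurements.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom (|r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def critical_r(t_crit, n):
    """Smallest |r| that is significant for a given critical t value."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Hypothetical toy data (not from the study): plant height (cm) and pods per plant.
height = [42.0, 45.5, 47.0, 50.5, 53.0, 55.5]
pods = [18.0, 21.0, 20.0, 24.0, 27.0, 26.0]
r = pearson_r(height, pods)
print(f"r = {r:.3f}, t = {t_statistic(r, len(height)):.2f}")

# Assumption: 107 genotype means, approximate critical t (5%, df = 105) = 1.983.
print(f"minimum significant |r| for n = 107: {critical_r(1.983, 107):.3f}")
```

Under these assumptions, a coefficient of roughly 0.19 or larger in absolute value would be significant at the 5 per cent level, which is why fairly modest correlations can be reported as significant in a panel of this size.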
       
When evaluating a large number of advanced genotypes, a broad spectrum of genetic variability is essential for the effective identification of superior genotypes for further advancement. Multivariate analysis is a robust statistical approach for the comprehensive evaluation of advanced homozygous genotypes and facilitates the identification of promising genotypes based on key agronomic traits (Rabbani et al., 1998). The principal component analysis (PCA) biplot is a graphical technique that enables the simultaneous assessment of relationships among genotypes and their associated traits. The results of the PCA are presented in Table 1. The PCA revealed that the first two principal components, with eigenvalues greater than unity, accounted for 68.56 per cent of the total variation present in the dataset (Table 1; Fig 2). Of the two, PC1, with an eigenvalue of 2.3, explained the highest proportion of the total variance (36.27 per cent) and was primarily associated with plant height, indicating that this trait contributed substantially to variability. A similar result was reported by Thippani et al. (2017). The second principal component (PC2), with an eigenvalue of 1.75, accounted for 29.78 per cent of the total variance and reflected the significant loading of grain yield per hectare, which was also evident in its positive correlation with other yield-related traits. The remaining principal components (PC3 to PC6) contributed marginally and exhibited limited discriminatory power. Consequently, the most important yield and yield-contributing characters, particularly total number of pods per plant and plant height, were predominantly associated with the first two principal components. Earlier studies by Mahalingam et al. (2020), Mohan et al. (2021), Nayak et al. (2021), Jakhar and Kumar (2018) and Gayathridevi et al. (2023) similarly emphasized the importance of these traits in enhancing efficiency in future breeding programmes.
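
For a PCA on a correlation matrix, the eigenvalues sum to the number of traits, so each component's share of variance is its eigenvalue divided by that total. The Python sketch below applies this, together with the eigenvalue-greater-than-unity retention rule, to a set of hypothetical eigenvalues for six traits. The values and function names are illustrative assumptions only and are not those of Table 1.

```python
def variance_summary(eigenvalues):
    """Return (component, eigenvalue, percent variance, cumulative percent) rows."""
    total = sum(eigenvalues)
    rows = []
    cumulative = 0.0
    for index, value in enumerate(eigenvalues, start=1):
        percent = 100.0 * value / total
        cumulative += percent
        rows.append((f"PC{index}", value, percent, cumulative))
    return rows

def retained_components(eigenvalues, threshold=1.0):
    """Names of components whose eigenvalue exceeds the threshold."""
    return [f"PC{i}" for i, v in enumerate(eigenvalues, start=1) if v > threshold]

# Hypothetical eigenvalues for six standardized traits (they sum to 6).
hypothetical_eigenvalues = [2.4, 1.8, 0.9, 0.5, 0.3, 0.1]

for name, value, percent, cumulative in variance_summary(hypothetical_eigenvalues):
    print(f"{name}: eigenvalue = {value:.2f}, "
          f"variance = {percent:.2f}%, cumulative = {cumulative:.2f}%")

print("Retained:", retained_components(hypothetical_eigenvalues))
```

With these illustrative values, only the first two components have eigenvalues above one, and together they explain about 70 per cent of the variance.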

Table 1: Eigenvalues, percentage of variance and cumulative variance of six principal components of greengram.



Fig 2: Scree plot showing six principal components.


       
In the biplot, all variables included in the study grouped together and were positively associated with one another. Grain yield exhibited a significant positive association with total number of pods per plant and plant height, which corresponded to higher loadings of these traits on the first principal component (Fig 3). Traits located farther from the origin in the biplot, namely grain yield, total number of pods per plant and plant height, contributed more substantially to overall variability than primary branches per plant, days to 50% flowering and 100-seed weight. Notably, none of the variables displayed negative correlations in the biplot. PCA was conducted using a correlation matrix to transform correlated quantitative variables into a reduced set of uncorrelated principal components, thereby elucidating the underlying structure of trait relationships (Johnson and Wichern, 2007). A summary of the characters studied for correlation and principal component analysis is presented in Table 2.

Table 2: Summary of characters studied for correlation and principal component analysis.



Fig 3: Biplot of the variables with respect to the first two PCs.


       
The quality of representation of the variables on the factor map, referred to as cos2 (Fig 4), indicated that plant height, total number of pods per plant and grain yield made the greatest contribution to total variation, whereas 100-seed weight and primary branches per plant contributed the least to the principal components. This finding is corroborated by Mahalingam et al. (2020). The variables strongly associated with PC1 (Dim.1) and PC2 (Dim.2) are pivotal in explaining the variability within the dataset. Conversely, variables with weak or negligible association with the principal components contributed minimally and may be excluded to simplify the analysis without compromising interpretability.
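
In a correlation-matrix PCA, cos2 for a trait on a component is commonly computed as the squared correlation between the trait and that component, that is, the squared eigenvector entry multiplied by the eigenvalue. The Python sketch below works this out for a hypothetical two-trait case with correlation r = 0.6, where the eigenvalues (1 + r and 1 - r) and eigenvectors are known in closed form. The trait names, the value of r and the function name are assumptions for illustration, not results of the study.

```python
import math

def cos2_table(eigenvalues, eigenvectors):
    """cos2[j][k] = (eigenvector entry of trait j on PC k) ** 2 * eigenvalue k.

    eigenvectors is given with rows = traits and columns = components.
    """
    return [
        [(row[k] ** 2) * eigenvalues[k] for k in range(len(eigenvalues))]
        for row in eigenvectors
    ]

# Hypothetical correlation between two standardized traits.
r = 0.6
eigenvalues = [1 + r, 1 - r]

# Unit-length eigenvectors of [[1, r], [r, 1]]: (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
s = 1 / math.sqrt(2)
eigenvectors = [[s, s], [s, -s]]

traits = ["plant_height", "pods_per_plant"]  # hypothetical labels
table = cos2_table(eigenvalues, eigenvectors)
for trait, row in zip(traits, table):
    values = ", ".join(f"PC{k + 1} = {v:.2f}" for k, v in enumerate(row))
    print(f"{trait}: {values} (sum = {sum(row):.2f})")
```

Because each trait's cos2 values sum to one across all components, a trait with high cos2 on the first two components is well represented on the two-dimensional factor map, while a trait with low cos2 there is mostly explained by later components.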

Fig 4: Variables’ contribution to principal components.


       
In the present study, the integration of biplot analysis and attribute significance facilitated the construction of a unified biplot, in which attributes with similar cos2 scores were represented by the same color code (Fig 5). Attributes with high cos2 values, notably plant height and grain yield, were depicted in green, indicating their strong contribution to the principal components. Attributes with moderate cos2 values, including the total number of pods and 100-seed weight, were shown in orange, reflecting their intermediate influence. Finally, attributes with low cos2 values, namely primary branches per plant and days to 50% flowering, were represented in black, signifying their limited contribution to the principal components. Consequently, primary branches per plant and days to 50% flowering were identified as traits of lesser importance in explaining the overall variability. These findings are consistent with the results reported by Nayak et al. (2021).

Fig 5: Combination of biplot and cos2 score.


 
Genetic diversity analysis
 
K-means clustering is a centroid-based, non-hierarchical classification technique used to partition n genotypes into k distinct clusters, wherein each genotype is assigned to the cluster with the nearest mean. In this method, k represents a predefined number of clusters, and genotypes are grouped by minimizing within-cluster variance while maximizing inter-cluster divergence (Kanavi et al., 2020).
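
The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Lloyd-style iteration, not the authors' R code. The starting centroids are fixed by hand so the run is deterministic, the toy two-dimensional data are hypothetical, and whether or how traits were standardized before clustering in the study is not stated.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, max_iter=100):
    """Assign each point to its nearest centroid, then recompute the means."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def within_cluster_ss(points, labels, centroids):
    """Total within-cluster sum of squares, the quantity k-means minimizes."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical toy data: two traits measured on six genotypes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
wcss = within_cluster_ss(points, labels, centroids)
print("labels:", labels)
print("within-cluster sum of squares:", round(wcss, 4))
```

In practice the result depends on the starting centroids and on the chosen k, so analyses of this kind are usually repeated from several random starts.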
       
In the present investigation, K-means cluster analysis classified the genotypes into four distinct clusters, illustrated in Fig 6 and Fig 7. Earlier studies by Mohan et al. (2021) and Kanavi et al. (2020) reported the formation of four and seven cluster groups, respectively, in greengram. Cluster II comprised the highest number of genotypes (43), predominantly characterized by increased plant height. Cluster I included 36 genotypes, most of which exhibited moderate to high yield. Cluster III consisted of 13 genotypes, all identified as high-yielding, whereas cluster IV contained 15 genotypes that were primarily low-yielding (Table 3). The high-yielding genotypes in cluster III, namely VBN (Gg)2, VGG 20-088, VGG 20-232, VGG 20-233, VGG 20-069, VGG 20-230, VGG 20-092, VGG 20-068, VGG 20-234, VGG 20-071, VGG 20-010, VGG 20-091 and VGG 20-066, were identified as promising candidates for multi-location trials aimed at assessing their adaptability and suitability for variety release. To enhance crop adoption, it is crucial to select diverse parental lines based on component traits (Katiyar et al., 2009). The distribution of genotypes derived from diverse parental crosses across multiple clusters indicates a weak association between genetic divergence and geographical origin. Similar conclusions were drawn by Katiyar et al. (2009) and Singh et al. (2013), who emphasized that a high degree of genetic diversity is critical for generating substantial variability and achieving effective genetic gains through selection.

Fig 6: Cluster plot of distribution of 107 genotypes into four clusters.



Fig 7: Hnclust distribution of 107 genotypes into four clusters.



Table 3: Distribution of 107 genotypes into four clusters as per K-means clustering.

The correlation analysis among the five yield-contributing traits indicated that plant height and total number of pods per plant were significantly and positively correlated with yield. Selection for increased plant height and higher pod number per plant could therefore be used effectively to enhance grain yield. Among the yield-attributing traits, 100-seed weight showed a significant positive correlation with plant height, whereas it exhibited significant negative correlations with days to 50% flowering and number of primary branches per plant, indicating that taller plants with moderate primary branching tended to produce heavier seeds. The variables associated with the principal components, specifically PC1 (Dim.1) and PC2 (Dim.2), played a crucial role in explaining the majority of variability among genotypes. The K-means cluster analysis categorized the genotypes into four distinct clusters. Cluster III comprised the high-yielding genotypes VBN (Gg)2, VGG 20-234, VGG 20-230, VGG 20-232, VGG 20-233, VGG 20-092, VGG 20-091, VGG 20-088, VGG 20-066, VGG 20-069, VGG 20-068, VGG 20-071 and VGG 20-010. These genotypes contributed substantially to genetic diversity, particularly for primary branches and number of pods per plant, and thus represent promising candidates for future breeding programs. Further multi-location evaluation of these genotypes is recommended to assess their stability, adaptability and yield performance under diverse agro-climatic conditions prior to their consideration for varietal release.
 
Disclaimers
 
The views and conclusions expressed in this article are solely those of the authors and do not necessarily represent the views of their affiliated institutions. The authors are responsible for the accuracy and completeness of the information provided, but do not accept any liability for any direct or indirect losses resulting from the use of this content.
The authors declare that there are no conflicts of interest regarding the publication of this article. No funding or sponsorship influenced the design of the study, data collection, analysis, decision to publish, or preparation of the manuscript.

  1. Annual Report (2023-24). ANGRAU - Crop Outlook Reports of Andhra Pradesh Green gram-June to May, 2023-24. Pp 1-20. https://angrau.ac.in/downloads/AMIC/Outlook Reports/2023_24/ greengram%20outlook%20-June-july- 2023-24.pdf.

  2. Din, N., Rabani, G., Tariq, M., Naeem, M.K. and Iqbal, M.S. (2015). Character association and path analysis of yield and yield components in mungbean [Vigna radiata (L.) Wilczek]. J. Agric. Res. 53(2): 165-293.

  3. Duppala, M.K., Beena, N. and Gowtham, K.S. (2017). Evaluation of advanced lines of mustard spp. using hierarchical cluster analysis. The Bioscan. 10(1): 299-307.

  4. Forgy, E.W. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics. 21: 768-769.

  5. Gayathridevi, G., Shanthi, P., Suresh, R., Manonmani, S., Geetha, S., Sathyabama, K. and Geetha, P. (2023). Genetic variability, association and multivariate analysis for yield and yield parameters in rice (Oryza sativa L.) landraces. Electronic Journal of Plant Breeding. 14(3): 991-999.

  6. Jakhar, N.K. and Kumar, A. (2018). Principal component analysis and character association for yield components in green gram [Vigna radiata (L.) Wilczek] genotypes. Journal of Pharmacognosy and Phytochemistry. 7(2): 3665-3669.

  7. Johnson, R.A. and Wichern, D.W. (2007). Applied Multivariate Statistical Analysis (6th edition). London: Pearson International.

  8. Kanavi, M.S.P., Koler, P., Somu, G. and Marappa, N. (2020). Genetic diversity study through K-means clustering in germplasm accessions of green gram (Vigna radiata L.) under drought condition. International Journal of Bio-resource and Stress Management. 11(2): 138-147. https://doi.org/10.23910/IJBSM/2020.11.2.2078.

  9. Katiyar, P.K., Dixit, G.P., Singh, B.B., Ali, H. and Dubey, M.K. (2009). Non-hierarchical Euclidean cluster analysis for genetic divergence in mungbean cultivars. J. Food Legumes. 22: 34-36.

  10. MacQueen, J.B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. 1: 281-297.

  11. Mahalingam, A., Manivanan, N., Kumar, K.B., Ramakrishnan, P. and Vadivel, K. (2020). Character association and principal component analysis for seed yield and its contributing characters in greengram [Vigna radiata (L.) Wilczek]. Electronic Journal of Plant Breeding. 11(1): 259-262.

  12. Majhi, P.K., Mogali, S.C. and Abhisheka, L. (2020). Genetic variability, heritability, genetic advance and correlation studies for seed yield and yield components in early segregating lines (F3) of greengram [Vigna radiata (L.) Wilczek]. International Journal of Chemical Studies. 8(4): 1283-1288. doi: 10.22271/chemi.2020.v8.i4k.9779.

  13. Manojkumar, D., Beena, N., Gowtham, K.S. and Rathod, A. (2018). Evaluation of advanced breeding lines of Indian mustard using principal component analysis. Journal of Oilseed Brassica. 9(1): 45-48. https://epubs.icar.org.in/index.php/JOB/article/view/158615.

  14. Mohan, S., Sheeba, A. and Kalaimagal, T. (2021). Genetic diversity and association studies in greengram [Vigna radiata (L.) Wilczek]. Legume Research. 44(7): 779-784. doi: 10.18805/LR-4176.

  15. Muchomba, M.K., Muindi, E.M. and Mulinge, J.M. (2023). Overview of green gram (Vigna radiata L.) crop, its economic importance, ecological production constraints in Kenya. J. Agric. Ecol. Res. Int. 24(2): 1-11. doi: 10.9734/jaeri/2023/v24i2520.

  16. Nainu, A.J., Murugan, S., Kumar, N.S. and Christopher, D.J. (2020). Principal component analysis for yield contributing traits in mung bean [Vigna radiata (L.) Wilczek] genotypes. Plant Archives. 20(2): 3585-3590. e-ISSN: 2581-6063 (online), ISSN: 0972-5210.

  17. Nayak, G., Lenka, D., Dash, M. and Tripathy, K. (2021). Character association and principal component analysis for yield and its attributing characters in green gram. Journal of Crop and Weed. 17(3): 169-175.

  18. Punia, S.S., Gautam, N.K., Baldev, R., Verma, P., Meenakshi, H., Jain, N.K., Koli, N.R., Mahavar, R. and Jat, V.S. (2014). Genetic variability and correlation studies in urdbean (Vigna mungo L.). Legume Research. 37(6): 580-584. doi: 10.5958/0976-0571.2014.00680.8.

  19. Rabbani, M.A., Iwabuchi, A., Murakami, Y., Suzuki, T.  and Takayanagi, K. (1998). Phenotypic variation and the relationships among mustard (Brassica juncea L.) germplasm from Pakistan. Euphytica. 101: 357-366.

  20. Sadimantara, G.R., Muhidin and Nuraida, W. (2023). Evaluation of Yield Attributing Characters and Grain Yield in Some Promising Red Rice Lines (Oryza sativa L.) Grown in the Lowland Conditions. The 6th International Conference on Agriculture, Environment and Food Security. IOP Conf. Series: Earth and Environmental Science. IOP Publishing. 1241(2023): 012038. doi: 10.1088/1755-1315/1241/1/012038.

  21. Sandhiya, V. and Saravanan, S. (2018). Genetic variability and correlation studies in greengram [Vigna radiata (L.) Wilczek]. Electronic Journal of Plant Breeding. 9(3): 1094-1099. doi: 10.5958/0975-928X.2018.00136.9.

  22. Shanthi, P., Ramesh, P., Parameshwaran, M., Umadevi, M., Sakaravarthy, S.K. and Vivekananthan, T. (2024). Morphological and yield attribute of blackgram genotypes under different salinity stress conditions. Indian Journal of Agricultural Research. 58(3): 444-449. doi: 10.18805/IJARe.A-5697.

  23. Sheetal, R.P., Patel, K.K. and Hitiksha, K.P. (2014). Genetic variability, correlation and path analysis for seed yield and its components in green gram [Vigna radiata (L.) Wilczek]. The Bioscan. 9(4): 1847-1852.

  24. Singh, R., Ali, H. and Pathak, B. (2013). Non-hierarchical Euclidean cluster analysis in mungbean. Trends Biosci. 3: 135-136.

  25. Thippani, S., Eshwari, K.B. and Bhave, M.H.V. (2017). Principal component analysis for yield components in greengram accessions (Vigna radiata L.). Int. J. Pure App. Biosci. 5(4): 246-253.

  26. Upadhyaya, H.D., Dwivedi, S.L., Gowda, C.L.L. and Singh, S. (2007). Identification of diverse germplasm lines for agronomic traits in a chickpea (Cicer arietinum L.) core collection for use in crop improvement. Field Crops Res. 100: 320-326.

Principal Component Analysis with Association Analysis and Clustering of Genotypes by K-means Clustering in Greengram [Vigna radiata (L.) Wilczek]

P
P. Shanthi1
M
M. Umadevi1,*
N
N. Manivannan1
M
M. Gunasekaran1
D
D. Sassikumar1
1Centre for Plant Breeding and Genetics, Tamil Nadu Agricultural University, Coimbatore-641 003, Tamil Nadu, India.
  • Submitted06-03-2025|

  • Accepted03-04-2026|

  • First Online 25-04-2026|

  • doi 10.18805/LR-5488

Background: India is recognized as the prime global producer of greengram, which is grown in nearly across the states. This legume is a remarkable source of protein, fiber and iron. The identification of superior genotypes is primarily dependent on genetic diversity. Analytical techniques, including correlation, principal component analysis (PCA) and cluster analysis, are essential for clarifying the diversity found among various genotypes.

Methods: The study was carried out over two consecutive years, 2020 and 2021, during the Kharif season at the National Pulses Research Centre (NPRC) located in Vamban, Pudukkottai, Tamil Nadu, utilizing a randomized block design.

Result: The association analysis among the five yield-contributing traits revealed that the  two traits, namely plant height and total number of pods per plant, exhibited a significant positive association. Additionally, 100 seed weight demonstrated a significant positive correlation with plant height and showing a significant negative correlation with both 50 per cent flowering and the number of primary branches per plant. The principal components associated with the dataset, namely PC1 (Dim.1) and PC2 (Dim.2), are crucial in elucidating the variability within the dataset. PC1, with an eigen value of 2.3, accounted for a substantial portion of the total variance (36.27%), indicating a strong reflection of plant height. Conversely, PC2, with an eigen value of 1.75, contributed an additional 29.78 per cent to the total variance, highlighting a significant loading related to grain yield per hectare, which aligns with the observed positive correlations. A K-means cluster analysis was conducted; these genotypes were grouped into four distinct clusters. The majority of the genotypes originated from diverse parental crosses and was allocated to different clusters, suggesting a lack of correlation between genetic and geographical diversity. The high-yielding genotypes identified in cluster three, including VGG 20-010, VGG 20-066, VGG 20-069, VGG 20-068, VGG 20-071, VGG 20-088, VGG 20-091, VGG 20-092, VGG 20-230, VGG 20-232, VGG 20-233, VGG 20-234 and VBN(Gg)2 were significantly contributed to the overall diversity in terms of the number of primary branches and pods per plant, indicating their potential utility in future breeding programs.

Greengram is an important pulse crop in India, extensively cultivated across various states and widely preferred by consumers. It occupies about 3.79 million hectares, with a total production of 2.92 million tones and an average productivity of 670 kg per hectare, as reported in the (Annual Report 2023) Crop Outlook Reports of Andhra Pradesh Greengram. This crop contributes nearly 11 percent of India’s total pulse production. Furthermore, it is a popular legume that well adapted to arid and semi-arid environments and plays a significant role in imporving soil fertility by fixing approximately 30 to 40 kilograms of nitrogen per hectare (Majhi et al., 2020 and Muchomba et al., 2023). As a result, the nitrogen requirements for succeeding crops by about 25 percent of the recommended dose. To meet the growing demand for greengram, it is essential to overcome yield limitations through the development of high-yielding varieties with tolerant to  biotic and abiotic stresses.
       
Yield is a complex trait influenced by several interrelated components. A clear understanding of the associations between yield and its contributing traits is necessary, as selection based solely on yield may not result in substantial improvement (Sadimantara et al., 2023). Yield enhancement can be achieved by emphasizing traits that show a strong positive positively correlated with seed yield. Therefore, improving yield can be accomplished by prioritizing the selection of traits that are closely related to yield components. Principal component analysis (PCA) is a useful statistical approach for identifying key traits that account for maximum variation by reducing high- dimensional data into a smaller set of correlated variables (Nainu et al., 2020). When combined with cluster analysis, PCA effectively aids in identifying genetically diverse genotypes suitable for crop improvement programs (Duppala et al., 2017). Selecting highly diverse parents is crucial for generating superior segregants during hybridization and selection processes of plant breeding (Manojkumar et al., 2018). K-means clustering is an effective analytical method for evaluating genetic diversity, as it groups germplasm accessions into genetically distinct clusters or heterotic groups based on measures of genetic distance. After these heterotic groups are established, clusters that are highly divergent from one another can be readily identified. The germplasm accessions within such widely separated clusters represent a high level of genetic variability (Kanavi et al., 2020).  Accordingly, the present study was undertaken to assess genetic diversity and the factors contributing seed yield in greengram through correlation analysis and PCA, with the objective of identifying promising genotypes for future breeding and improvement efforts.
The experimental was conducted on 107 advanced greengram [Vigna radiata (L.) Wilczek] genotypes, evaluated through a randomized block design (RBD) with two replications during the Kharif seasons of  2020 and 2021 at the National Pulses Research Centre (NPRC) located in Vamban, Pudukkottai, Tamil Nadu. Each entry was sown in plots measuring 2 meters length and 2.4 meters width, with a planting density of 30 cm × 10 cm. Standard agronomic practices were employed to ensure optimal crop establishment. Biometric observations were recorded for six quantitative traits: namely days to 50 per cent flowering, plant height, number of primary branches per plant, number of pods per plant, 100-seed weight and single plant yield based on five randomly selected plants from each entry. The averages of these five plants over the two seasons were utilized for statistical analysis. All data analyses were conducted using R statistical software version 4.4.2. Principal Component Analysis was performed to ascertain which plant traits significantly contributed to the observed variation among the genotypes, adhering to the methodology established by Upadhyaya et al. (2007). Furthermore, correlation and genetic diversity analyses were executed to identify the most promising and diverse genotypes for the development of stable and well-adapted varieties. The advanced breeding lines were categorized using the ‘k-means clustering’ model as described by Macqueen (1967) and Forgy (1965). K-means clustering serves as a machine learning technique that evaluates genetic diversity by grouping germplasm accessions based on genetic distances. This method facilitates the creation of heterotic groupings, enabling the straightforward identification of genetically distant clusters among various germplasm accessions. Consequently, plant breeders can readily identify genetically diverse accessions for further evaluation and utilize them as parental lines in crossing programs.
The analysis of variance (ANOVA) revealed significant differences among genotypes for all the traits under study in both seasons, confirming the presence of substantial variability in the experimental material. The correlation coefficient measures the interdependence between plant characters and identifies the component characters on which selection can rely for genetic improvement of yield. The results of the correlation analysis are presented in Fig 1. Among the five yield-contributing characters evaluated, plant height and total number of pods per plant showed a significant positive correlation with grain yield; similar results were reported by Sandhiya and Saravanan (2018). Primary branches per plant and test weight (100-seed weight) showed positive but non-significant correlations with grain yield, which is in agreement with the findings of Sheetal et al. (2014).
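As an illustration of how a correlation coefficient and its significance can be assessed, the following minimal Python sketch computes Pearson's r and the associated t statistic, t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. It is not the authors' R workflow; the function names and the trait values are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length (at least 3 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        raise ValueError("t statistic is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical genotype means (NOT data from this study).
pods_per_plant = [18, 22, 25, 30, 27, 35]
grain_yield = [5.1, 6.0, 6.4, 7.9, 7.0, 8.8]

r = pearson_r(pods_per_plant, grain_yield)
t = t_statistic(r, len(pods_per_plant))
print(f"r = {r:.3f}, t = {t:.3f}, df = {len(pods_per_plant) - 2}")
```

The t value is then compared with the tabulated critical value for n − 2 degrees of freedom at the chosen probability level to decide whether the association is significant.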

Fig 1: Correlation coefficients among six quantitative traits in greengram genotypes.


       
With respect to inter-trait correlations among yield-attributing traits, 100-seed weight showed a significant positive association with plant height, whereas it exhibited a significant negative correlation with days to 50 per cent flowering and primary branches per plant. These findings are in accordance with the results of Din et al. (2015). Furthermore, primary branches per plant and total number of pods per plant had a significant positive correlation with days to 50 per cent flowering, corroborating earlier reports by Punia et al. (2014) and Shanthi et al. (2024).
       
When evaluating a large number of advanced genotypes, the presence of a broad spectrum of genetic variability is essential for the effective identification of superior genotypes for further advancement. Multivariate analysis serves as a robust statistical approach for the comprehensive evaluation of advanced homozygous genotypes and facilitates the identification of promising genotypes based on key agronomic traits (Rabbani et al., 1998). The principal component analysis (PCA) biplot is a graphical technique that enables the simultaneous assessment of relationships among genotypes and their associated traits. The results of the PCA are presented in Table 1. The PCA clearly revealed that the first two principal components, with eigenvalues greater than unity, accounted for 68.56 per cent of the total variation present in the dataset (Table 1; Fig 2). Of the two PCs, PC1, with an eigenvalue of 2.3, explained the highest proportion of the total variance (36.27 per cent) and was primarily associated with plant height, indicating that this trait contributed substantially to variability. A similar result was reported by Thippani et al. (2017). The second principal component (PC2), with an eigenvalue of 1.75, accounted for 29.78 per cent of the total variance and showed a significant loading of grain yield per hectare, which was also reflected in its positive correlation with other yield-related traits. The remaining principal components (PC3 to PC6) contributed marginally and exhibited limited discriminatory power. Consequently, the most important yield and yield-contributing characters, particularly total number of pods per plant and plant height, were predominantly associated with the first two principal components. Earlier studies by Mahalingam et al. (2020), Mohan et al. (2021), Nayak et al. (2021), Jakhar and Kumar (2018) and Gayathridevi et al. (2023) similarly emphasized the importance of these traits in enhancing efficiency in future breeding programmes.
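The way eigenvalues translate into percentage and cumulative variance, and how the "eigenvalue greater than unity" rule (Kaiser criterion) selects components, can be sketched as follows. The eigenvalues in this Python example are hypothetical and chosen only so that they sum to six, as expected for a correlation-matrix PCA of six traits; they are not the values in Table 1, and the function names are illustrative.

```python
def explained_variance(eigenvalues):
    """Return (percent, cumulative percent) of variance for each component."""
    total = sum(eigenvalues)
    percent = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative


def kaiser_retained(eigenvalues):
    """1-based indices of components whose eigenvalue exceeds 1."""
    return [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]


# Hypothetical eigenvalues for six standardized traits (illustration only).
hypothetical_eigenvalues = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]

percent, cumulative = explained_variance(hypothetical_eigenvalues)
for i, (ev, p, c) in enumerate(
    zip(hypothetical_eigenvalues, percent, cumulative), start=1
):
    print(f"PC{i}: eigenvalue={ev:.2f}  variance={p:.2f}%  cumulative={c:.2f}%")
print("Retained under eigenvalue > 1:", kaiser_retained(hypothetical_eigenvalues))
```

Because each standardized trait contributes one unit of variance, the share explained by a component is simply its eigenvalue divided by the number of traits.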

Table 1: Eigenvalues, percentage of variance and cumulative variance of six principal components of greengram.



Fig 2: Scree plot showing six principal components.


       
All variables included in the study were aggregated and exhibited positive correlations with one another. Grain yield exhibited a significant positive association with total number of pods per plant and plant height, which corresponded to higher loadings of these traits on the first principal component (Fig 3). Traits located farther from the origin in the biplot, namely grain yield, total number of pods per plant and plant height, contributed more substantially to overall variability than primary branches per plant, days to 50% flowering and 100-seed weight. Notably, none of the variables displayed negative correlations. PCA was conducted using a correlation matrix to transform correlated quantitative variables into a reduced set of uncorrelated principal components, thereby elucidating the underlying structure of trait relationships (Johnson and Wichern, 2007). The summary of the characters studied for correlation and principal component analysis is presented in Table 2.
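Running PCA on a correlation matrix is equivalent to standardizing each trait before analysis, so that traits measured in different units (cm, counts, grams) carry equal weight. A minimal Python sketch of this step is given below; the trait values and function names are hypothetical, and the eigen-decomposition itself is left to a statistical package.

```python
import math


def standardize(column):
    """Centre a trait on its mean and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(columns):
    """Correlation matrix of traits given as a list of columns."""
    z = [standardize(col) for col in columns]
    n = len(columns[0])
    return [
        [sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
        for zi in z
    ]


# Hypothetical genotype means for three traits (illustration only).
plant_height_cm = [48.0, 52.5, 60.1, 55.3, 63.0]
pods_per_plant = [20.0, 24.0, 31.0, 26.0, 33.0]
seed_weight_g = [3.9, 3.6, 4.2, 3.8, 4.4]

R = correlation_matrix([plant_height_cm, pods_per_plant, seed_weight_g])
for row in R:
    print("  ".join(f"{value:6.3f}" for value in row))

# The diagonal is 1 for every trait, so the total variance to be
# partitioned among the principal components equals the number of traits.
print("Total variance (trace):", round(sum(R[i][i] for i in range(len(R))), 6))
```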

Table 2: Summary of characters studied for correlation and Principal component analysis.



Fig 3: Biplot of the variables with respect to the first two PCs.


       
The quality of representation of the variables on the factor map, referred to as cos2 (Fig 4), indicated that plant height, total number of pods per plant and grain yield made the greatest contribution to total variation, whereas 100-seed weight and primary branches per plant contributed the least to the principal components. This finding is corroborated by Mahalingam et al. (2020). The variables strongly associated with PC1 (Dim.1) and PC2 (Dim.2) are pivotal in elucidating the variability within the dataset. Conversely, variables with weak or negligible association with the principal components contributed minimally and may be excluded to simplify the analysis without compromising interpretability.
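For a correlation-matrix PCA, the cos2 of a variable on a component is the squared variable coordinate, that is, the squared eigenvector weight multiplied by the eigenvalue, and the percentage contribution of a variable to a component is its cos2 divided by the sum of cos2 values on that component. The Python sketch below illustrates this with a deliberately small two-variable case whose eigen-decomposition is known exactly (correlation r gives eigenvalues 1 + r and 1 − r); the value r = 0.6 and the function names are assumptions made for illustration, not results from this study.

```python
import math


def variable_cos2(eigenvectors, eigenvalues):
    """cos2[j][k]: quality of representation of variable j on component k.

    eigenvectors[k][j] is the weight of variable j in unit-length eigenvector k.
    """
    n_vars = len(eigenvectors[0])
    return [
        [(eigenvectors[k][j] ** 2) * eigenvalues[k] for k in range(len(eigenvalues))]
        for j in range(n_vars)
    ]


def contributions(cos2):
    """Percentage contribution of each variable to each component."""
    n_comp = len(cos2[0])
    totals = [sum(row[k] for row in cos2) for k in range(n_comp)]
    return [[100.0 * row[k] / totals[k] for k in range(n_comp)] for row in cos2]


# Two standardized variables with a hypothetical correlation of 0.6.
r = 0.6
eigenvalues = [1.0 + r, 1.0 - r]
s = 1.0 / math.sqrt(2.0)
eigenvectors = [[s, s], [s, -s]]

cos2 = variable_cos2(eigenvectors, eigenvalues)
contrib = contributions(cos2)
for j, (q, c) in enumerate(zip(cos2, contrib), start=1):
    print(f"variable {j}: cos2 = {[round(v, 3) for v in q]}, "
          f"contribution % = {[round(v, 1) for v in c]}")
```

Summed over all components, the cos2 values of a variable equal 1, so a variable with a high cos2 on the first two dimensions is well represented on the factor map.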

Fig 4: Variables’ contribution to principal components.


       
In the present study, the integration of biplot analysis and attribute significance facilitated the construction of a unified biplot, wherein attributes exhibiting similar cos2 scores were represented by the same color code (Fig 5). Attributes with high cos2 values, notably plant height and grain yield, were depicted in green, indicating their strong contribution to the principal components. In contrast, attributes with moderate cos2 values, including the total number of pods and 100-seed weight, were depicted in orange, reflecting their intermediate influence. Finally, attributes with low cos2 values, namely primary branches per plant and days to 50% flowering, were represented in black, signifying their limited contribution to the principal components. Consequently, primary branches per plant and days to 50% flowering were identified as traits of lesser importance in explaining the overall variability. These findings are consistent with the results reported by Nayak et al. (2021).

Fig 5: Combination of biplot and cos2 score.


 
Genetic diversity analysis
 
K-means clustering is a centroid-based, non-hierarchical classification technique used to partition n genotypes into k distinct clusters, wherein each genotype is assigned to the cluster with the nearest mean. In this method, k represents the predefined number of clusters, and genotypes are grouped by minimizing within-cluster variance while maximizing inter-cluster divergence (Kanavi et al., 2020).
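The assign-then-update logic described above can be sketched in a few lines of Python. This is a basic batch iteration with user-supplied starting centroids, written for illustration; it is not the R routine used in this study, and the standardized trait values, the starting centroids and the function name are hypothetical.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Partition points into len(initial_centroids) clusters.

    Repeats two steps until the assignments stop changing:
    (1) assign each point to its nearest centroid (Euclidean distance),
    (2) move each centroid to the mean of the points assigned to it.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no genotype changed cluster
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical standardized values of two traits for six genotypes.
genotypes = [
    (1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
    (-1.0, -0.9), (-1.2, -1.1), (-0.9, -1.0),
]
labels, centroids = kmeans(genotypes, [genotypes[0], genotypes[3]])
print("Cluster labels:", labels)
print("Centroids:", [[round(v, 3) for v in c] for c in centroids])
```

Because the result depends on the starting centroids, statistical packages commonly repeat the procedure from several random starts and keep the solution with the smallest within-cluster sum of squares.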
       
In the present investigation, K-means cluster analysis was performed, resulting in the classification of genotypes into four distinct clusters, illustrated in Fig 6 and Fig 7. Earlier studies by Mohan et al. (2021) and Kanavi et al. (2020) reported the formation of four and seven cluster groups, respectively, in greengram. Cluster II comprised the highest number of genotypes (43), predominantly characterized by increased plant height. Cluster I included 36 genotypes, most of which exhibited moderate to high yield. Cluster III consisted of 13 genotypes, all identified as high-yielding, whereas cluster IV contained 15 genotypes that were primarily low-yielding (Table 3). The high-yielding genotypes in cluster III, namely VBN (Gg)2, VGG 20-088, VGG 20-232, VGG 20-233, VGG 20-069, VGG 20-230, VGG 20-092, VGG 20-068, VGG 20-234, VGG 20-071, VGG 20-010, VGG 20-091 and VGG 20-066, were identified as promising candidates for multi-location trials aimed at assessing their adaptability and suitability for variety release. To enhance crop adoption, it is crucial to select diverse parental lines based on component traits (Katiyar et al., 2020). The distribution of genotypes derived from diverse parental crosses across multiple clusters indicates a weak association between genetic divergence and geographical origin. Similar conclusions have been drawn by Katiyar et al. (2009) and Singh et al. (2013), who emphasized that a high degree of genetic diversity is critical for generating substantial variability and achieving effective genetic gains through selection.

Fig 6: Cluster plot of distribution of 107 genotypes into four clusters.



Fig 7: Hnclust distribution of 107 genotypes into four clusters.



Table 3: Distribution of 107 genotypes into four clusters as per K-means clustering.

The correlation analysis among the five yield-contributing traits indicated that plant height and total number of pods per plant had significant positive correlations with yield. Therefore, selection for increased plant height and higher pod number per plant could be effectively utilized to enhance grain yield. Among the yield-attributing traits, 100-seed weight showed a significant positive correlation with plant height, whereas it exhibited significant negative correlations with days to 50% flowering and number of primary branches per plant, indicating that taller plants with moderate primary branches tended to produce heavier seeds. The variables associated with the principal components, specifically PC1 (Dim.1) and PC2 (Dim.2), played a crucial role in explaining the majority of variability among genotypes. The K-means cluster analysis categorized the genotypes into four distinct clusters. Cluster III comprised high-yielding genotypes, namely VBN (Gg)2, VGG 20-234, VGG 20-230, VGG 20-232, VGG 20-233, VGG 20-092, VGG 20-091, VGG 20-088, VGG 20-066, VGG 20-069, VGG 20-068, VGG 20-071 and VGG 20-010. These genotypes contributed substantially to genetic diversity, particularly for primary branches and number of pods per plant, and thus represent promising candidates for future breeding programs. Further multi-location evaluation of these genotypes is recommended to assess their stability, adaptability and yield performance under diverse agro-climatic conditions prior to their consideration for varietal release.
 
Disclaimers
 
The views and conclusions expressed in this article are solely those of the authors and do not necessarily represent the views of their affiliated institutions. The authors are responsible for the accuracy and completeness of the information provided, but do not accept any liability for any direct or indirect losses resulting from the use of this content.
The authors declare that there are no conflicts of interest regarding the publication of this article. No funding or sponsorship influenced the design of the study, data collection, analysis, decision to publish, or preparation of the manuscript.

  1. Annual Report (2023-24). ANGRAU - Crop Outlook Reports of Andhra Pradesh Green gram-June to May, 2023-24. Pp 1-20. https://angrau.ac.in/downloads/AMIC/Outlook Reports/2023_24/ greengram%20outlook%20-June-july- 2023-24.pdf.

  2. Din, N., Rabani, G., Tariq, M., Naeem, M.K. and Iqbal, M.S. (2015). Character association and path analysis of yield and yield components in mungbean [Vigna radiata (L.) Wilczek]. J. Agric. Res. 53(2): 165-293.

  3. Duppala, M.K., Beena, N. and Gowtham, K.S. (2017). Evaluation of advanced lines of mustard spp. using hierarchical cluster analysis. The Bioscan. 10(1): 299-307.

  4. Forgy, E.W. (1965). Cluster analysis of multivariate data efficiency vs. interpretability of classifications. Biometrics. 21: 768-769.

  5. Gayathridevi, G., Shanthi, P., Suresh, R., Manonmani S., Geetha, S., Sathyabama, K. and Geetha, P. (2023). Genetic variability, association and multivariate analysis for yield and yield parameters in rice Oryza sativa L. landraces. Electronic Journal of Plant Breeding. 143: 991-999.

  6. Jakhar, N.K. and Kumar, A. (2018). Principal component analysis and character association for yield components in green gram [Vigna radiata (L.) Wilczek] genotypes. Journal of Pharmacognosy and Phytochemistry. 7(2): 3665-3669.

  7. Johnson, R.A. and Wichern, D.W. (2007). Applied Multivariate Statistical Analysis (6th edition). London: Pearson International.

  8. Kanavi, M.S.P., Koler, P., Somu, G. and Marappa, N. (2020). Genetic diversity study through K-means clustering in germplasm accessions of green gram (Vigna radiata L.) under drought condition. International Journal of Bio-resource and Stress Management. 11(2): 138-147. https://doi.org/10.23910/IJBSM/2020.11.2.2078.

  9. Katiyar, P.K., Dixit, G.P., Singh, B.B., Ali, H. and Dubey, M.K. (2009). Non-hierarchical euclidean cluster analysis for genetic divergence in mungbean cultivars. J. Food Legumes. 22: 34-36.

  10. MacQueen, J.B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium. Mathematical Statistics and Probability. University of California Press. 1: 281-297.

  11. Mahalingam, A., Manivanan, N., Kumar, K.B., Ramakrishnan, P. and Vadivel, K. (2020). Character association and principal component analysis for seed yield and its contributing characters in greengram [Vigna radiata (L.) Wilczek]. Electronic Journal of Plant Breeding. 11(1): 259-262.

  12. Majhi, P.K., Mogali, S.C. and Abhisheka, L. (2020). Genetic variability, heritability, genetic advance and correlation studies for seed yield and yield components in early segregating lines (F3) of greengram [Vigna radiata (L.) Wilczek]. International Journal of Chemical Studies. 8(4): 1283-1288. doi: 10.22271/chemi.2020.v8.i4k.9779.

  13. Manojkumar, D., Beena, N., Gowtham, K.S. and Rathod, A. (2018). Evaluation of advanced breeding lines of Indian mustard using principal component analysis. Journal of Oilseed Brassica. 9(1): 45-48. https://epubs.icar.org.in/index.php/JOB/article/view/158615.

  14. Mohan, S., Sheeba, A. and Kalaimagal, T. (2021). Genetic diversity and association studies in greengram [Vigna radiata (L.) Wilczek]. Legume Research. 44(7): 779-784. doi: 10.18805/LR-4176.

  15. Muchomba, M.K., Muindi, E.M. and Mulinge, J.M. (2023). Overview of green gram (Vigna radiata L.) crop, its economic importance, ecological production constraints in Kenya. J. Agric. Ecol. Res. Int. 24(2): 1-11. doi: 10.9734/jaeri/2023/v24i2520.

  16. Nainu, A.J., Murugan, S., Kumar, N.S. and Christopher, D.J. (2020). Principal component analysis for yield contributing traits in mung bean [Vigna radiata (L.) Wilczek] genotypes. Plant Archives. 20(2): 3585-3590. e-ISSN: 2581-6063 (online), ISSN: 0972-5210.

  17. Nayak, G., Lenka, D., Dash, M. and Tripathy, K. (2021). Character association and principal component analysis for yield and its attributing characters in green gram. Journal of Crop and Weed. 17(3): 169-175.

  18. Punia, S.S., Gautam, N.K., Baldev, R., Verma, P., Meenakshi, H., Jain, N.K., Koli, N.R., Mahavar, R. and Jat, V.S. (2014). Genetic variability and correlation studies in urdbean (Vigna mungo L.). Legume Research. 37(6): 580-584. doi: 10.5958/0976-0571.2014.00680.8.

  19. Rabbani, M.A., Iwabuchi, A., Murakami, Y., Suzuki, T. and Takayanagi, K. (1998). Phenotypic variation and the relationships among mustard (Brassica juncea L.) germplasm from Pakistan. Euphytica. 101: 357-366.

  20. Sadimantara, G.R., Muhidin and Nuraida, W. (2023). Evaluation of Yield Attributing Characters and Grain Yield in Some Promising Red Rice Lines (Oryza sativa L.) Grown in the Lowland Conditions. The 6th International Conference on Agriculture, Environment and Food Security. IOP Conf. Series: Earth and Environmental Science. IOP Publishing. 1241(2023): 012038. doi: 10.1088/1755-1315/1241/1/012038.

  21. Sandhiya, V. and Saravanan, S. (2018). Genetic variability and correlation studies in greengram [Vigna radiata (L.) Wilczek]. Electronic Journal of Plant Breeding. 9(3): 1094-1099. doi: 10.5958/0975-928X.2018.00136.9.

  22. Shanthi, P., Ramesh, P., Parameshwaran, M., Umadevi, M., Sakaravarthy, S.K. and Vivekananthan, T. (2024). Morphological and yield attribute of blackgram genotypes under different salinity stress conditions. Indian Journal of Agricultural Research. 58(3): 444-449. doi: 10.18805/IJARe.A-5697.

  23. Sheetal, R.P., Patel, K.K. and Hitiksha, K.P. (2014). Genetic variability, correlation and path analysis for seed yield and its components in green gram [Vigna radiata (L.) Wilczek]. The Bioscan. 9(4): 1847-1852.

  24. Singh, R., Ali, H. and Pathak, B. (2013). Non-hierarchical euclidean cluster analysis in mungbean. Trends Biosci. 3: 135-136.

  25. Thippani, S., Eshwari, K.B. and Bhave, M.H.V. (2017). Principal component analysis for yield components in greengram accessions (Vigna radiata L.). Int. J. Pure App. Biosci. 5(4): 246-253.

  26. Upadhyaya, H.D., Dwivedi, S.L., Gowda, C.L.L. and Singh, S. (2007). Identification of diverse germplasm lines for agronomic traits in a chickpea (Cicer arietinum L.) core collection for use in crop improvement. Field Crops Res. 100: 320-326.
Published In
Legume Research
