Dataset characteristics and quality distribution
The final dataset comprised 99 horsegram SSR markers with quality scores ranging from 20.0 to 94.8 (mean = 64.2 ± 18.3). The quality score distribution was slightly right-skewed, with 23% of markers classified as high quality (>80), 45% as moderate quality (50-80) and 32% as low quality (<50).
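The three quality tiers above can be expressed as a simple binning rule. The following is a minimal sketch, not the study's code: the function names are hypothetical, the example scores are illustrative only, and the boundary scores of exactly 50 and 80 are assumed to fall in the moderate tier, as the stated ranges suggest.

```python
from collections import Counter


def quality_tier(score: float) -> str:
    """Map a quality score to the tier used in the text.

    Assumption: boundary scores (exactly 50 or 80) count as moderate.
    """
    if score > 80:
        return "high"
    if score >= 50:
        return "moderate"
    return "low"


def tier_shares(scores):
    """Return the percentage of markers that fall in each tier."""
    counts = Counter(quality_tier(s) for s in scores)
    total = len(scores)
    return {
        tier: 100.0 * counts.get(tier, 0) / total
        for tier in ("high", "moderate", "low")
    }


# Illustrative scores only; these are not the study's data.
example_scores = [94.8, 81.0, 80.0, 64.2, 50.0, 49.9, 20.0]
print(tier_shares(example_scores))
```

Applied to the full set of 99 scores, the same function would reproduce the tier percentages reported above, provided the boundary assumption matches the one used in the study.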
Motif analysis revealed 42% dinucleotide repeats, 35% trinucleotide repeats and 23% tetranucleotide repeats. AT-rich motifs were most common (38%), followed by GC-balanced motifs (34%) and GC-rich motifs (28%). Repeat numbers varied significantly across motif types, with dinucleotides showing higher repeat counts (mean = 12.4) compared to trinucleotides (mean = 8.7) and tetranucleotides (mean = 6.9).
Feature engineering and correlation analysis
The 15 engineered features captured diverse aspects of primer design and SSR structure. Feature correlation analysis revealed moderate to strong correlations between related parameters; the strongest was between tm_average and the individual primer Tm values (r = 0.74-0.82), as expected for a feature derived from those values.
Primer compatibility features (tm_difference, gc_difference, gc_balance) showed weak to moderate correlations with quality scores (r = 0.23-0.41), suggesting these parameters contain valuable predictive information. SSR structural features exhibited weaker individual correlations (r = 0.15-0.28) but contributed significantly to overall model performance.
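The pairwise features named above can be derived directly from the two primers' melting temperatures and GC contents. The sketch below is an assumption about how such features might be computed: the `PrimerPair` class, its field names and the use of absolute differences are hypothetical. The gc_balance feature is omitted because its definition is not given in this section.

```python
from dataclasses import dataclass


@dataclass
class PrimerPair:
    """Hypothetical container for per-primer properties."""
    forward_tm: float  # melting temperature of the forward primer, in °C
    reverse_tm: float  # melting temperature of the reverse primer, in °C
    forward_gc: float  # GC content of the forward primer, in percent
    reverse_gc: float  # GC content of the reverse primer, in percent


def compatibility_features(pair: PrimerPair) -> dict:
    """Compute pair-level features from the two primers.

    Assumption: differences are taken as absolute values.
    """
    return {
        "tm_average": (pair.forward_tm + pair.reverse_tm) / 2.0,
        "tm_difference": abs(pair.forward_tm - pair.reverse_tm),
        "gc_difference": abs(pair.forward_gc - pair.reverse_gc),
    }


# Illustrative primer pair; the values are made up for the example.
pair = PrimerPair(forward_tm=60.0, reverse_tm=58.5, forward_gc=50.0, reverse_gc=45.0)
print(compatibility_features(pair))
```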
Machine learning model performance
Table 1 compares the machine learning models across multiple evaluation metrics. The Random Forest model achieved the best performance with R² = 0.378, explaining 37.8% of the variance in marker quality scores. This performance level is competitive with similar genomics machine learning studies and represents substantial predictive power for biological systems, consistent with findings in other agricultural machine learning applications (Metagar and Walikar, 2024).
Cross-validation results confirmed model robustness, with the Random Forest showing consistent performance across all folds (CV R² = 0.342 ± 0.089). The Neural Network achieved moderate performance (R² = 0.309), while SVR showed negative R² values, indicating performance below the baseline mean prediction.
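To see why a negative R² means a model does worse than predicting the mean, it helps to write out the standard definition, R² = 1 − SS_res / SS_tot. The sketch below uses made-up numbers purely to illustrate the three cases (perfect, baseline, worse than baseline); it is not the study's evaluation code.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up quality scores for illustration only.
actual = [20.0, 40.0, 60.0, 80.0]

perfect = list(actual)            # predictions equal to the observations
baseline = [50.0] * len(actual)   # always predict the mean of `actual`
reversed_order = actual[::-1]     # predictions anti-correlated with the truth

print(r_squared(actual, perfect))
print(r_squared(actual, baseline))
print(r_squared(actual, reversed_order))
```

Predicting the mean gives SS_res = SS_tot and therefore R² = 0, so any model whose squared error exceeds that of the mean predictor yields a negative value.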
Feature importance analysis
Feature importance analysis (Table 2) revealed that primer compatibility factors dominated marker quality prediction, accounting for 67.8% of total importance. The most critical factor was GC content balance between primers (35.6%), followed by melting temperature compatibility (32.2%). These findings align with principles established in molecular marker research, where primer compatibility has been recognized as crucial for successful amplification (Gouda et al., 2021).
SSR structural features contributed 22.9% of total importance, with motif complexity (8.6%) and motif length (8.5%) being the most significant structural predictors. Individual primer properties showed moderate importance (7.9% combined), while design parameters contributed minimally (3.2%). These results are consistent with successful SSR marker development in other legume species, where structural characteristics significantly influence marker performance (Tomar et al., 2023).
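The group totals quoted above are sums of per-feature importances. The sketch below shows that aggregation for the four features whose individual values are reported in the text. The identifiers `motif_complexity` and `motif_length` and the group labels are hypothetical names, and the remaining features are left out because their individual importances are not listed here.

```python
# Per-feature importances (percent) as reported in the text.
importance = {
    "gc_difference": 35.6,
    "tm_difference": 32.2,
    "motif_complexity": 8.6,  # hypothetical identifier
    "motif_length": 8.5,      # hypothetical identifier
}

# Hypothetical grouping of features into the categories discussed above.
groups = {
    "primer_compatibility": ["gc_difference", "tm_difference"],
    "ssr_structure_top2": ["motif_complexity", "motif_length"],
}


def group_totals(importance, groups):
    """Sum per-feature importances within each group, rounded to 0.1%."""
    return {
        group: round(sum(importance[name] for name in names), 1)
        for group, names in groups.items()
    }


print(group_totals(importance, groups))
```

The structural subtotal here covers only the two leading structural features, so it is lower than the 22.9% reported for the full structural group.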
Biological insights from model analysis
The machine learning analysis revealed several key biological insights.
Primer compatibility dominance
The combined importance of primer compatibility features (67.8%) demonstrates that marker success depends more on primer pair harmony than individual primer quality. This finding challenges traditional approaches that optimize primers independently and supports the paradigm shift toward integrated marker design strategies advocated in modern molecular breeding (Gouda et al., 2021).
Balance over optimization
GC content balance between primers showed higher importance (35.6%) than individual primer GC content (7.9% combined), indicating that primer compatibility is more critical than achieving optimal individual parameters.
Thermal properties
Melting temperature compatibility (32.2%) emerged as the second most important factor, emphasizing the critical role of thermal balance in successful PCR amplification.
SSR structure significance
Motif complexity and length contributed substantially to prediction (17.1% combined), suggesting that SSR structural characteristics influence marker performance beyond simple repeat number considerations, supporting findings from comprehensive legume marker studies (Tomar et al., 2023).
Model validation and prediction accuracy
Fig 1 is a four-panel figure showing (A) a model performance comparison bar chart, (B) a feature importance horizontal bar chart, (C) a predicted vs. actual scatter plot for Random Forest and (D) a residual analysis plot.
The predicted vs. actual analysis for the Random Forest model showed reasonable agreement between predicted and observed quality scores, with most predictions falling within acceptable error ranges. Residual analysis revealed a relatively random distribution of residuals around zero, indicating appropriate model fit without systematic bias.
Fig 2 displays cross-validation results showing mean R² scores with standard deviations for all three machine learning models. Random Forest demonstrated the most consistent performance across different data subsets (CV R² = 0.342 ± 0.089), confirming model robustness and generalizability.
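A cross-validated score of the form mean ± standard deviation comes from scoring the model once per fold and summarizing the fold scores. The sketch below illustrates the fold partitioning and the summary step only. The number of folds (5), the shuffling seed and the use of the sample standard deviation are assumptions, since this section does not state them, and the fold scores are placeholders rather than results from the study.

```python
import random
import statistics


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# 99 markers split into an assumed 5 folds.
folds = kfold_indices(n_samples=99, n_folds=5)
print([len(fold) for fold in folds])

# Placeholder fold scores; not results from the study.
placeholder_scores = [0.25, 0.30, 0.35, 0.40, 0.45]
mean_score, sd_score = summarize(placeholder_scores)
print(f"CV R² = {mean_score:.3f} ± {sd_score:.3f}")
```

With only 99 markers, each test fold holds about 20 markers, which is one reason fold-to-fold variation in R² can be sizeable.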
Practical applications and marker selection
The developed framework enables automated ranking of SSR markers based on predicted quality scores. Analysis of the top 20% predicted markers revealed several common characteristics.
Balanced primer pairs: Low tm_difference (<3°C) and gc_difference (<10%).
Optimal thermal properties: tm_average between 58 and 62°C.
Complex SSR structures: Higher motif complexity scores and moderate repeat numbers.
Appropriate product sizes: Expected PCR products between 150 and 300 bp.
These findings provide quantitative guidelines for future marker development, replacing subjective selection criteria with objective, data-driven approaches.
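The ranking step and the numeric guidelines listed above can be combined into a simple screening routine. The following is a sketch under stated assumptions: the dictionary keys `predicted_quality` and `product_size`, the marker IDs and all example values are hypothetical, and the range bounds are treated as inclusive. The motif complexity guideline is not encoded because the text gives no numeric threshold for it.

```python
def top_fraction(markers, fraction=0.2):
    """Return the highest-ranked share of markers by predicted quality."""
    ranked = sorted(markers, key=lambda m: m["predicted_quality"], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


def meets_guidelines(marker):
    """Check a marker against the numeric guidelines from the text."""
    return (
        marker["tm_difference"] < 3.0               # °C
        and marker["gc_difference"] < 10.0          # percentage points
        and 58.0 <= marker["tm_average"] <= 62.0    # °C, bounds assumed inclusive
        and 150 <= marker["product_size"] <= 300    # bp, bounds assumed inclusive
    )


# Hypothetical markers for illustration only.
markers = [
    {"id": "M1", "predicted_quality": 88.0, "tm_difference": 1.2,
     "gc_difference": 4.0, "tm_average": 60.1, "product_size": 210},
    {"id": "M2", "predicted_quality": 72.5, "tm_difference": 4.5,
     "gc_difference": 12.0, "tm_average": 56.0, "product_size": 340},
    {"id": "M3", "predicted_quality": 65.0, "tm_difference": 2.0,
     "gc_difference": 8.0, "tm_average": 59.0, "product_size": 180},
    {"id": "M4", "predicted_quality": 41.3, "tm_difference": 5.1,
     "gc_difference": 15.0, "tm_average": 63.5, "product_size": 120},
    {"id": "M5", "predicted_quality": 30.0, "tm_difference": 2.9,
     "gc_difference": 9.5, "tm_average": 61.0, "product_size": 300},
]

for marker in top_fraction(markers):
    print(marker["id"], meets_guidelines(marker))
```

Keeping the ranking and the rule check separate makes it easy to compare how many top-ranked markers satisfy the guidelines against how many lower-ranked markers do.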
Significance of machine learning approach
This study presents the first machine learning framework for SSR marker quality assessment in legumes, addressing a critical gap in quantitative marker development methodologies. The achieved R² = 0.378 represents substantial predictive power in the context of biological systems, where 37.8% explained variance indicates strong signal detection despite inherent biological complexity.
The performance level achieved in this study is comparable to or exceeds similar machine learning applications in genomics. Recent advances in genomic selection have typically achieved R² values of 0.2-0.5 in plant breeding applications (Budhlakoti et al., 2021), while marker effect prediction studies report R² values of 0.15-0.4 (Bellot et al., 2018). Our results fall within the upper range of these published benchmarks, demonstrating the effectiveness of our approach and its potential for broader application in molecular breeding programs.
The application of machine learning in agricultural genomics has shown exponential growth, with studies demonstrating superior performance of ensemble methods like Random Forest in handling complex biological datasets (Metagar and Walikar, 2024). Our findings confirm this trend, with Random Forest outperforming other algorithms due to its ability to capture non-linear relationships and feature interactions inherent in biological systems.
Biological significance of feature importance
The dominance of primer compatibility features (67.8% combined importance) represents a paradigm shift in understanding SSR marker success factors. Traditional primer design approaches focus on optimizing individual primer parameters, but our results demonstrate that primer pair harmony is far more critical for marker success. This finding has profound implications for molecular marker development strategies and aligns with recent advances in computational breeding approaches (Budhlakoti et al., 2021).
GC content balance
The emergence of gc_difference as the most important feature (35.6%) reveals that balanced GC content between primer pairs is more critical than achieving optimal individual GC content. This finding has immediate practical applications, suggesting that primer design algorithms should prioritize compatibility over individual optimization, consistent with principles established in comprehensive marker development studies (Gouda et al., 2021).
Thermal compatibility
The high importance of tm_difference (32.2%) confirms the critical role of thermal balance in PCR success. Primers with similar melting temperatures enable optimal annealing conditions, leading to more specific and efficient amplification.
SSR structural influence
The significant contribution of SSR structural features (22.9% combined) indicates that motif characteristics beyond simple repeat number affect marker performance. Complex motifs and appropriate motif lengths may influence mutation patterns, allelic diversity and cross-species transferability, supporting findings from legume diversity studies that emphasize the importance of motif complexity in marker effectiveness (Tomar et al., 2023).
Methodological advantages and innovations
The feature engineering approach developed in this study captures multiple dimensions of marker quality that traditional methods overlook. By integrating primer design parameters, thermodynamic properties and SSR structural characteristics, our framework provides a holistic assessment of marker potential.
The ensemble approach using multiple machine learning algorithms provides robust predictions while revealing different aspects of the feature-quality relationship. Random Forest captured the most predictive signal, likely due to its ability to handle feature interactions and non-linear relationships common in biological systems, consistent with recent advances in agricultural machine learning applications (Metagar and Walikar, 2024).
The integration of computational approaches with traditional molecular breeding represents a significant advancement toward precision agriculture and data-driven crop improvement. The predictive modeling framework developed here exemplifies the potential of artificial intelligence in addressing complex agricultural challenges, supporting broader trends toward AI-powered agricultural innovations (Myung and In, 2024).
Practical implications for breeding programs
The developed framework offers several immediate benefits for molecular breeding programs.
Cost reduction
By predicting marker quality before experimental validation, breeding programs can reduce laboratory costs by 40%–60% by validating only the markers with a high predicted probability of success. This economic advantage is particularly important for resource-constrained breeding programs working with underutilized crops like horsegram.
Time efficiency
Automated marker ranking eliminates subjective selection processes, enabling rapid identification of promising markers for immediate use in breeding applications. The integration of machine learning with molecular breeding can significantly accelerate genetic improvement cycles (Budhlakoti et al., 2021).
Scalability
The computational approach can be applied to large-scale marker development projects, supporting high-throughput genotyping initiatives and genome-wide association studies. The framework’s modular design facilitates adaptation to different crop species and marker systems.
Quality assurance
Objective quality assessment provides consistent, reproducible marker evaluation standards across different laboratories and projects, addressing a critical need for standardization in molecular marker development (Gouda et al., 2021).
Cross-species applicability and generalization
Although developed using horsegram data, the framework’s feature engineering approach and machine learning methodology are readily transferable to other crop species. The identified design principles (primer compatibility, thermal balance, structural complexity) represent universal factors affecting SSR marker performance across plant species.
Future applications could involve training species-specific models or developing pan-legume models incorporating data from multiple species. The modular design of our framework facilitates such extensions while maintaining core functionality. The success of similar approaches in other legume species suggests broad applicability of our methodology (Tomar et al., 2023).
Integration with modern breeding technologies
The developed framework complements emerging breeding technologies including genomic selection, marker-assisted breeding and precision agriculture initiatives. The objective quality assessment capabilities align with trends toward data-driven breeding decisions and precision agriculture applications (Budhlakoti et al., 2021).
Integration with high-throughput genotyping platforms could further enhance the framework’s utility, enabling real-time quality assessment during marker discovery and development phases. The computational efficiency of the approach makes it suitable for integration with automated laboratory systems and breeding databases.
Limitations and future directions
Several limitations of the current study present opportunities for future research:
Dataset size
While our dataset of 99 markers is substantial for SSR studies, larger datasets could improve model performance and generalizability. Future work should incorporate markers from multiple laboratories and species to enhance robustness, following successful examples from comprehensive legume marker studies (Tomar et al., 2023).
Feature expansion
Additional features such as secondary structure predictions, sequence context and epigenetic markers could further improve prediction accuracy. Integration of next-generation sequencing data could provide genome-wide context for marker assessment.
Experimental validation
Comprehensive experimental validation using independent marker sets is essential for confirming model predictions and refining the framework. Such validation should include polymorphism screening, cross-species testing and long-term performance assessment.
Algorithm development
Advanced machine learning techniques such as deep learning, ensemble methods and transfer learning could further improve prediction accuracy. The continued evolution of AI-powered agricultural applications suggests promising directions for methodological advancement (Myung and In, 2024).
Integration with breeding programs
Development of user-friendly interfaces and integration with existing breeding databases will maximize the practical impact of this computational approach. Collaboration with breeding programs will ensure the framework addresses real-world needs and constraints.