Dataset characteristics and quality distribution
The complete dataset comprised 2,770 pigeonpea SSR primer pairs with computational quality scores ranging from 14.9 to 38.3 (mean = 26.4±4.1). Distribution analysis showed that 97.0% of markers fell within the moderate quality range (20-35 points), while 1.5% achieved high quality scores (>35 points, representing the best markers with highest predicted success rates) and 1.5% showed lower quality scores (<20 points).
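The three quality bands above can be written as a simple rule. The sketch below is only an illustration of those thresholds; the function name `quality_band` and the example scores (other than the reported minimum and maximum) are assumptions introduced here.

```python
def quality_band(score: float) -> str:
    """Map a computational quality score to the bands described in the text."""
    if score > 35:
        return "high"      # >35 points
    if score < 20:
        return "low"       # <20 points
    return "moderate"      # 20-35 points, boundaries included

# 14.9 and 38.3 are the reported minimum and maximum; the other
# values are hypothetical boundary cases chosen for illustration.
example_scores = [14.9, 20.0, 26.4, 35.0, 38.3]
bands = {s: quality_band(s) for s in example_scores}
print(bands)
```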
Motif distribution across the dataset showed 1,452 dinucleotide repeats (52.4% of total), 865 trinucleotide repeats (31.2%), 355 tetranucleotide repeats (12.8%) and 98 pentanucleotide repeats (3.6%). Repeat numbers varied by motif type: dinucleotides showed 11.8±3.4 repeats with range from 6 to 23, trinucleotides showed 8.3±2.6 repeats with range from 4 to 14 and tetranucleotides showed 6.7±2.1 repeats with range from 4 to 12. These motif distributions are consistent with previous SSR characterization studies in pigeonpea, where dinucleotide repeats typically predominate (50-60%), followed by trinucleotides (30-35%) (Bohra et al., 2020; Kaur et al., 2020; Varshney et al., 2012).
Primer design parameters showed the following distributions: melting temperatures ranged from 51.1 to 74.4°C with a mean of 61.3±4.3°C, GC content ranged from 8.0 to 21.7% with a mean of 14.5±2.2% and expected PCR product sizes ranged from 61 to 169 base pairs with a mean of 95±23 bp. These distributions represent typical values for computationally designed SSR markers in plant genomes.
Machine learning model performance
Support Vector Regression (SVR) achieved the best performance across all metrics (Fig 2A), explaining 48.7% of quality variance. Paired t-tests confirmed that SVR significantly outperformed the Neural Network (t = 2.18, p = 0.042) and Random Forest (t = 2.54, p = 0.028) models.
Feature importance rankings
Categorical importance summary
• Primer compatibility features: 72.9%.
• SSR structural features: 15.9%.
• Individual primer properties: 11.2%.
Model validation
Five-fold cross-validation demonstrated consistent SVR performance (Fig 3). Cross-validation fold R² scores: 0.428, 0.445, 0.461, 0.438, 0.433 (mean 0.441±0.032). The narrow interquartile range confirms model robustness and generalization capability. Residual analysis confirmed appropriate model fit without systematic bias (Fig 2D), indicating that the model predictions are unbiased across the quality score range.
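The fold-level summary can be recomputed directly from the five R² values listed above. The sketch below uses only the Python standard library. The dispersion measure behind the reported ±0.032 is not stated in the text, so the sample standard deviation and the range are printed side by side for comparison rather than assumed to be that measure.

```python
from statistics import mean, stdev

# Fold-level R² scores as listed in the text
fold_r2 = [0.428, 0.445, 0.461, 0.438, 0.433]

cv_mean = mean(fold_r2)
cv_sd = stdev(fold_r2)                  # sample standard deviation (n - 1)
cv_range = max(fold_r2) - min(fold_r2)  # spread between best and worst fold

print(f"mean R2 = {cv_mean:.3f}")
print(f"sample SD = {cv_sd:.3f}")
print(f"range = {cv_range:.3f}")
```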
High-quality marker characteristics
Analysis of the top 20% (554 markers) versus the bottom 20% (554 markers) revealed distinct patterns (Table 3).
Quantitative design guidelines
1. Thermal compatibility: Tm difference <1.5°C (optimal: <1.0°C).
2. Optimal melting temperature: 58-62°C average range.
3. Primer length balance: Length difference <2 bp (optimal: ≤1 bp).
4. Product size: 80-110 bp range for optimal resolution.
5. SSR complexity: Scores >24.
6. Repeat numbers: Dinucleotides ≥11 repeats; trinucleotides ≥11 repeats.
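The six guidelines above can be applied as a screening rule for a candidate primer pair. The following is a minimal sketch, not the scoring procedure used in this study: the `PrimerPair` class and its field names are hypothetical, and the thresholds are copied from the list as written.

```python
from dataclasses import dataclass

@dataclass
class PrimerPair:
    # Hypothetical record layout, introduced for illustration only
    tm_forward: float      # melting temperature of forward primer (°C)
    tm_reverse: float      # melting temperature of reverse primer (°C)
    len_forward: int       # forward primer length (bp)
    len_reverse: int       # reverse primer length (bp)
    product_size: int      # expected PCR product size (bp)
    ssr_complexity: float  # SSR complexity score
    motif_length: int      # 2 = dinucleotide, 3 = trinucleotide, ...
    repeat_number: int     # number of motif repeats

# Minimum repeat numbers from guideline 6 (only di- and trinucleotides are listed)
MIN_REPEATS = {2: 11, 3: 11}

def guideline_failures(p: PrimerPair) -> list[str]:
    """Return the names of the guidelines that the primer pair does not meet."""
    failures = []
    if abs(p.tm_forward - p.tm_reverse) >= 1.5:
        failures.append("tm_difference")
    if not 58 <= (p.tm_forward + p.tm_reverse) / 2 <= 62:
        failures.append("tm_average")
    if abs(p.len_forward - p.len_reverse) >= 2:
        failures.append("length_difference")
    if not 80 <= p.product_size <= 110:
        failures.append("product_size")
    if p.ssr_complexity <= 24:
        failures.append("ssr_complexity")
    if p.motif_length in MIN_REPEATS and p.repeat_number < MIN_REPEATS[p.motif_length]:
        failures.append("repeat_number")
    return failures

# Hypothetical candidate that satisfies every guideline
candidate = PrimerPair(60.2, 59.6, 20, 21, 96, 25.0, 2, 12)
print(guideline_failures(candidate))
```

Returning the list of failed guidelines, rather than a single pass/fail flag, makes it easier to see which parameter to adjust when a candidate is rejected.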
This study establishes three fundamental contributions that directly emerge from our experimental findings.
First validated prediction framework
As demonstrated in Table 1, the first machine learning framework for automated SSR marker quality assessment in pigeonpea was developed, achieving a predictive accuracy of R² = 0.487 and MAE = 2.341 through Support Vector Regression. Five-fold cross-validation (CV R² = 0.441±0.032) confirmed robust generalization. These performance metrics are comparable to other machine learning applications in genomic prediction, where R² values typically range from 0.35 to 0.55 for complex biological traits (Crossa et al., 2017; Montesinos-López et al., 2024). Our results align with Wang et al. (2023), who reported similar predictive accuracy using deep learning approaches for genomic applications.
Paradigm shift in design principles
As shown in Table 2 and Fig 2B, feature importance analysis revealed that primer pair compatibility dominates marker success (72.9% combined importance), fundamentally shifting understanding from individual primer optimization to primer pair harmony. Consistent with our quantitative findings in Table 3, melting temperature difference (25.2%), average Tm (23.0%) and length difference (17.8%) emerged as critical predictors. This dominance of primer compatibility features over SSR structural characteristics (15.9% importance) suggests that PCR amplification efficiency is primarily governed by thermodynamic harmony between primer pairs rather than inherent microsatellite properties.
This finding has important practical implications for marker design: researchers should prioritize designing primer pairs with minimal Tm differences (<1.5°C) and balanced lengths rather than focusing solely on individual primer parameters. A similar emphasis on primer pair compatibility has been reported in other marker development studies (Gupta et al., 2003), supporting our machine learning-derived insights.
Practical cost reduction framework
Based on the model validation results shown in Fig 2C-D and the cross-validation stability demonstrated in Fig 3, the trained model enables prioritization of 554 high-quality markers (20% of dataset) for experimental validation, potentially reducing development costs by 40-60% ($110,800-$132,960 savings). This resource efficiency is particularly valuable for pigeonpea improvement programs in developing countries. This estimated cost reduction is consistent with computational pre-screening approaches reported in other crops, where prioritization strategies have achieved 30-50% savings in marker validation costs (Crossa et al., 2017). Similarly, Bohra et al. (2020) emphasized that genomic tools enabling selective marker validation can significantly reduce resource requirements for orphan crop breeding programs, supporting our findings.
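The dollar range quoted above can be reproduced with simple arithmetic if one assumes a validation cost of $50 to $60 per marker, applied to the markers that are not carried forward. That per-marker cost is an assumption made here for illustration and is not stated in the text. The function name `validation_savings` is likewise hypothetical.

```python
TOTAL_MARKERS = 2770   # complete dataset
PRIORITIZED = 554      # top 20% selected for experimental validation

def validation_savings(total: int, prioritized: int, cost_per_marker: int) -> int:
    """Cost avoided by not validating the markers outside the prioritized set."""
    return (total - prioritized) * cost_per_marker

# ASSUMPTION: per-marker validation cost in USD, chosen for illustration
for cost in (50, 60):
    saved = validation_savings(TOTAL_MARKERS, PRIORITIZED, cost)
    print(f"${cost} per marker -> ${saved:,} saved")
```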
Practical applications in breeding programs
The developed framework offers applications for pigeonpea breeding and genomics research.
Priority marker selection
Breeding programs can use the trained model to rank any set of designed markers, selecting top candidates for experimental validation. This capability is valuable for large-scale genotyping initiatives requiring cost-effective marker selection, QTL mapping studies needing high-quality markers across target regions and diversity assessment programs requiring polymorphic markers for germplasm characterization. The objective ranking system removes subjective bias from marker selection decisions. Similar computational prioritization approaches have been successfully employed in other crops; Saxena et al. (2012) demonstrated the value of systematic marker screening in pigeonpea SNP development, while Bohra et al. (2020) highlighted that objective quality metrics significantly improve marker selection efficiency compared to traditional subjective methods.
Marker design optimization
The quantitative design guidelines (Tm difference <1.5°C, length difference <2 bp, SSR complexity >24) can be integrated into primer design workflows, filtering low-quality candidates before synthesis. Existing tools like Primer3 can be configured with these parameters as hard constraints or penalty functions, improving the quality of initial marker designs.
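As one possible way to pass these thresholds to Primer3 as hard constraints, the sketch below writes a Boulder-IO input record using standard Primer3 tags. It is an approximation under stated assumptions: the per-primer Tm window (58-62°C) is stricter than the guideline on average Tm, Primer3 treats the Tm difference as a maximum rather than a strict inequality, and the length-difference and SSR-complexity criteria are not expressed here and would need a separate post-filter. The sequence identifier and template are placeholders, not real pigeonpea data.

```python
def primer3_record(seq_id: str, template: str) -> str:
    """Build a Primer3 Boulder-IO record applying the design guidelines."""
    settings = {
        "SEQUENCE_ID": seq_id,
        "SEQUENCE_TEMPLATE": template,
        "PRIMER_TASK": "generic",
        "PRIMER_PAIR_MAX_DIFF_TM": "1.5",      # Tm difference guideline
        "PRIMER_MIN_TM": "58.0",               # ASSUMPTION: applied per primer,
        "PRIMER_OPT_TM": "60.0",               # stricter than the average-Tm
        "PRIMER_MAX_TM": "62.0",               # guideline in the text
        "PRIMER_PRODUCT_SIZE_RANGE": "80-110", # product size guideline
    }
    lines = [f"{tag}={value}" for tag, value in settings.items()]
    lines.append("=")  # a lone "=" terminates a Boulder-IO record
    return "\n".join(lines) + "\n"

# Placeholder template (hypothetical sequence with a dinucleotide repeat)
placeholder = "ACGTTGCA" * 6 + "AG" * 12 + "TGCAACGT" * 6
record = primer3_record("example_ssr_001", placeholder)
print(record)
```

The resulting text can be supplied to `primer3_core` on standard input; candidates it returns can then be passed through the remaining guideline checks.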
Resource allocation
Programs can stratify markers into priority tiers based on predicted quality. Tier 1 (top 20%) markers should receive immediate synthesis and validation. Tier 2 (next 30%) markers can undergo secondary validation if Tier 1 markers prove insufficient for the research objectives. Tier 3 (bottom 50%) markers should be deprioritized unless their specific genomic location is required for the study. This tiered approach maximizes return on investment in marker development, which is particularly important for pigeonpea programs with limited funding. However, several factors may affect the practical application of these findings, including: (1) genetic diversity within the target germplasm, as markers performing well in one genetic background may show reduced polymorphism in others; (2) laboratory-specific PCR conditions that may influence amplification success; (3) DNA quality variations across different extraction protocols; and (4) population-specific allelic variations that could affect marker informativeness.
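The tiering scheme described above (top 20%, next 30%, bottom 50%) can be sketched as a ranking step over predicted quality scores. The function name `assign_tiers`, the marker identifiers and the example scores are hypothetical. Ties are broken by input order, which is a simplification.

```python
def assign_tiers(predicted: dict[str, float]) -> dict[str, int]:
    """Rank markers by predicted quality and assign tiers 1, 2 or 3."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    n = len(ranked)
    tier1_end = n * 20 // 100  # top 20% (integer arithmetic avoids float rounding)
    tier2_end = n * 50 // 100  # top 20% plus the next 30%
    tiers = {}
    for rank, marker in enumerate(ranked):
        if rank < tier1_end:
            tiers[marker] = 1
        elif rank < tier2_end:
            tiers[marker] = 2
        else:
            tiers[marker] = 3
    return tiers

# Hypothetical predicted scores for ten markers
scores = {f"marker_{i:02d}": s for i, s in enumerate(
    [31.2, 18.4, 27.9, 36.1, 24.5, 22.0, 29.3, 25.8, 33.7, 20.6])}
tiers = assign_tiers(scores)
print(sorted(tiers.items(), key=lambda kv: kv[1]))
```

With the full dataset of 2,770 markers, the same cut-offs give 554 markers in Tier 1, 831 in Tier 2 and 1,385 in Tier 3.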
Cross-species transfer
The framework can guide marker selection for cross-species amplification studies. Markers with moderate conservation scores (70-85% identity with common bean) and high predicted quality offer the best probability of successful cross-amplification, enabling comparative genomics studies and marker transfer to related Cajanus species.
Broader implications for orphan crop improvement
Beyond pigeonpea, this framework has implications for molecular breeding in underutilized crops.
Resource efficiency for orphan crops
Many orphan legumes including horsegram (Macrotyloma uniflorum), cowpea (Vigna unguiculata), mung bean (Vigna radiata) and grass pea (Lathyrus sativus) face similar resource constraints in marker development. The computational approach reduces barriers to marker development by eliminating the need to experimentally test all designed markers, making genomic resources more accessible to researchers working on these crops. The framework can be adapted to these species by training on available marker data or applying pigeonpea-derived parameters as initial estimates pending species-specific optimization. These results are in agreement with findings reported by Varshney et al. (2012) and Bohra et al. (2020), who emphasized the transferability of genomic resources across related legume species. Similarly, recent studies on orphan crop genomics have demonstrated that computational approaches developed in one species can be effectively adapted to related crops (Chowdhury et al., 2020; Crossa et al., 2024; Singh et al., 2024).
Standardization across programs
The objective quality scoring provides standardized marker evaluation criteria applicable across different laboratories and programs. This standardization facilitates data sharing between research groups, enables collaborative breeding initiatives that pool resources across institutions and supports meta-analyses combining results from multiple studies. Standardized quality metrics allow researchers to compare marker performance across different studies and make informed decisions about marker selection based on published data. As demonstrated in our results (Table 2), the quality scoring system based on 15 engineered features provides reproducible metrics, with primer compatibility parameters (tm_difference, tm_average, length_difference) accounting for 66% of the total predictive importance. The consistent cross-validation performance (CV R² = 0.441±0.032, Fig 3) further supports the reliability of these standardized metrics for cross-institutional applications.
Democratization of molecular breeding
By reducing experimental validation costs and providing objective quality assessment, the framework makes molecular breeding tools more accessible to smaller programs and developing country institutions that face budget limitations. This democratization is important for addressing food security challenges in regions dependent on orphan crops, where local breeding programs often lack the resources available to major crop improvement initiatives.
Limitations and future research directions
Several limitations of the current study present opportunities for future research.
Experimental validation requirement
While computational quality scores provide valuable prediction, experimental validation remains essential for confirming marker performance. Future work should synthesize and test the top 100 predicted markers across diverse pigeonpea genotypes representing different geographic origins and maturity groups. This validation study should compare predicted versus actual polymorphism rates and amplification success rates to refine quality scoring weights based on empirical validation data. Such validation would also enable calculation of positive and negative predictive values for the quality score thresholds.
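Once experimental outcomes are available, positive and negative predictive values for a given score threshold could be computed as sketched below. All values in the example are hypothetical, since no validation data exist yet; the threshold of 30 points and the function name `predictive_values` are likewise assumptions for illustration.

```python
def predictive_values(scores, amplified, threshold):
    """Return (PPV, NPV) for calling a marker 'good' when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, success in zip(scores, amplified):
        predicted_good = score >= threshold
        if predicted_good and success:
            tp += 1   # predicted good, amplified successfully
        elif predicted_good and not success:
            fp += 1   # predicted good, failed in the laboratory
        elif not predicted_good and not success:
            tn += 1   # predicted poor, failed in the laboratory
        else:
            fn += 1   # predicted poor, but amplified successfully
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# HYPOTHETICAL predicted scores and laboratory outcomes for six markers
predicted_scores = [36.0, 31.5, 28.0, 24.0, 19.5, 33.0]
lab_success = [True, True, False, True, False, False]

ppv, npv = predictive_values(predicted_scores, lab_success, threshold=30.0)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Repeating the calculation over a range of thresholds would show how the trade-off between the two values changes, which could inform the tier boundaries.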
Dataset expansion
The current dataset of 2,770 markers is substantial but limited to computational predictions. Future studies should incorporate experimentally validated markers from multiple laboratories working on pigeonpea, include markers from diverse Cajanus species (C. scarabaeoides, C. cajanifolius, C. lineatus) for broader applicability and expand the training dataset to 5,000-10,000 markers to improve model robustness and generalization capability. Larger datasets would also enable development of separate models for different marker applications (diversity studies, linkage mapping, marker-assisted selection), as demonstrated by Uma et al. (2016) for disease resistance screening in related legumes.
Feature enhancement
Additional features could improve prediction accuracy. Sequence context features including flanking sequence composition and complexity could capture effects of genomic environment on amplification. Secondary structure predictions including hairpin formation probability and self-complementarity scores could identify markers prone to primer-dimer formation. Genome position information such as distance to nearest gene and location in euchromatin versus heterochromatin regions could account for chromatin accessibility effects. Epigenetic markers including DNA methylation patterns, where available, could explain variation in amplification consistency. Evolutionary conservation measures from detailed comparative genomics across multiple legume species could improve prediction of cross-species transferability.
Advanced machine learning approaches
More sophisticated algorithms could improve performance. Deep learning approaches using multi-layer neural networks with larger hidden layers and more complex architectures might capture subtle feature interactions. Ensemble methods combining predictions from multiple algorithms through stacking or boosting could improve overall accuracy. Transfer learning approaches using pre-training on data from related legume species could leverage existing knowledge. Explainable AI methods including SHAP values and attention mechanisms could provide better feature interpretation and identify interactions between features that drive marker quality.
Species expansion and pan-legume models
Development of pan-legume models trained on markers from multiple species (pigeonpea, chickpea, lentil, common bean, soybean) could identify universal quality determinants that apply across legume species and species-specific factors that require customized prediction models. Such models would enable marker transfer predictions between species and guide cross-species marker application decisions, potentially accelerating marker development across the entire legume family. This pan-legume approach aligns with recent developments in legume genomics. Varshney et al. (2012) demonstrated significant marker transferability between pigeonpea and related Cajanus species, while Bohra et al. (2020) reported successful cross-species application of genomic tools across multiple legume crops. Furthermore, recent pan-genome studies have revealed conserved genomic features that enable predictive model transfer between related species (Crossa et al., 2024). Our finding that primer compatibility features (72.9% importance) dominate marker quality suggests these parameters may represent universal determinants applicable across legume species, supporting the feasibility of pan-legume prediction models.