bHLH gene family identification results
A total of 5,434 bHLH genes were identified across 30 angiosperm species, with family sizes ranging from 62 to 425 (mean = 181.13). The bHLH family is broadly conserved, though lineage-dependent expansion and contraction are evident. Most species (76.7%) had 100-249 bHLH genes, with notable deviations in
Glycine max (425),
Zea mays (352) and
Musa acuminata (323) and smaller families in
Amborella trichopoda (62) and
Nymphaea colorata (70). Fabaceae species showed internal variation, with Gmax and Psat exhibiting large expansions, while
Medicago sativa (105) and
Vigna radiata (112) had smaller families. The mean family size in legumes (180.08) was similar to that in non-legumes (181.94), with the high count in
Glycine max contributing substantially to the within-Fabaceae variation (Fig 1).
From an evolutionary perspective, the bHLH family size variation aligns with gene duplication and polyploidy events. The large family in Gmax is consistent with its paleopolyploidy background
(Schmutz et al., 2010), while the complexity of
Zea mays also supports family expansion
(Schnable et al., 2009). Smaller families in atri and Ncol suggest stronger conservation or fewer amplification events. This variation provides a framework for codon-usage analysis, highlighting the need to consider species-specific effects and avoid misattribution of family-size differences to codon-preference divergence.
Cross-species divergence in core indices and distributional signatures
Descriptive statistics of CDSs from 30 angiosperms revealed significant stratification in synonymous codon usage. The effective number of codons (ENC/Nc) was low in Poaceae (
e.g., Svir 42.323) and higher in basal lineages (
e.g., Ncol 53.648). This divergence reflects structured codon-bias patterns, driven by base composition and mutation (
Sueoka, 1988;
Sueoka, 1992;
Plotkin and Kudla, 2011).
GC3s exhibited a stronger contrast, with Fabaceae species (
e.g., Mru 0.3200) clustering in a low-GC3s range and Poaceae species (
e.g., Svir 0.7636) in a high-GC3s range (Fig 2). This contrast was further confirmed by third-position base fractions, where Poaceae showed higher C3s and G3s and Fabaceae higher A3s and T3s (Fig 2).
Fop and CAI also demonstrated consistent stratification, with Fabaceae species showing lower Fop (
e.g., Mru 0.3669) and CAI (
e.g., Msat 0.2043) and Poaceae higher values (
e.g., Svir 0.5034, Sita 0.5019). These trends suggest that codon usage is influenced by compositional biases and translation-related constraints (
Bulmer, 1991;
Akashi, 1994;
Plotkin and Kudla, 2011).
Poaceae exhibited higher gene-level heterogeneity, with larger standard deviations in GC3s and Nc, while Fabaceae showed more stable distributions (
Sueoka, 1992;
Plotkin and Kudla, 2011).
The contrast between Fabaceae and Poaceae, in terms of GC3s, Nc, Fop and CAI, provides a reproducible baseline for codon-bias divergence, supporting downstream multivariate analyses and enhancing the Fabaceae-centered framework (Fig 2).
ENC-GC3 analysis
Gene-level ENC-GC3s distributions from 30 species showed shifts below Wright’s expected curve (
Wright, 1990). DENC was mainly negative, indicating codon-bias strength exceeding mutational pressure (
Sueoka, 1988;
Hershberg and Petrov, 2008).
Poaceae species clustered in the high-GC3s region with low ENC (
e.g., Svir 42.32, Sita 42.53), while Fabaceae species, such as Mru (47.54) and Msat (48.08), showed moderate ENC under low-GC3s, reflecting less extreme bias and separating compositional effects from constraint effects (
Novembre, 2002).
DENC boxplots showed Poaceae with smaller deviations and Fabaceae with stronger negative shifts, implying gene sets with more compressed codon usage (
dos Reis et al., 2004;
Hershberg and Petrov, 2008). This suggests codon choice is influenced by selective constraints, expression levels, tRNA availability and gene functions (Fig 3).
The width of DENC distributions varied, with broader ranges in some species indicating highly biased gene subsets. This highlights the importance of integrating DENC distribution shapes to avoid misinterpreting small subsets of biased genes as genome-wide trends.
Overall, ENC-GC3s scatterplots and DENC boxplots reveal systematic codon-bias divergence, with structured differences between Fabaceae and non-Fabaceae lineages, providing a foundation for further optimal-codon and lineage-specific analyses (Fig 4).
PR2 analysis
PR2 analysis of 30 angiosperm species revealed significant third-position nucleotide asymmetry, with deviations from the PR2 equilibrium point (0.5, 0.5) consistent across species. The PR2 bias index (Euclidean distance to 0.5, 0.5) showed species medians above zero, indicating detectable A/T and G/C imbalances, supporting Sueoka’s parity framework (
Sueoka, 1995). High PR2 bias was observed in Cari, Mtru and Hvul, while species like Macu, Sbic and Bdis exhibited weaker deviations. This stratification suggests that third-position asymmetry is a lineage-specific compositional feature, offering a complementary axis for codon-preference differentiation independent of ENC/GC3s (Fig 5).
PR2 plots showed stable directional shifts in codon imbalances, with Fabaceae species (Adur, Cari, Ccaj) concentrated in the region of A3 enrichment and C3 depletion. In contrast, species like Atri and Ncol followed a G3/T3-enriched pattern. This indicates lineage-dependent directional biases in third-position codon usage, reflecting strand-specific mutation patterns and selective effects (
Sueoka, 1995).
Within Fabaceae, PR2 bias intensity varied, with Mtru and Cari showing stronger deviations than expected based on composition alone, suggesting codon preference coupling via nucleotide frequency shifts and relative enrichment of preferred codons. This is consistent with mutation-selection-drift theory, where weak selection can drive codon usage bias (
Bulmer, 1991;
Hershberg and Petrov, 2008). PR2 analysis thus provides a quantitative basis for interpreting lineage differences in codon preference strength and directionality, reinforcing that codon bias cannot be explained solely by GC3 or global composition but is shaped by lineage-specific asymmetry (Fig 6).
Neutrality plot analysis
The neutrality plot (Fig 7) quantifies the relationship between GC12 (GC content at the first and second codon positions) and GC3s (GC content at synonymous third positions) to assess compositional coupling. A slope near 1 indicates mutational bias, while a slope near 0 suggests stronger constraints on the first and second positions (
Sueoka, 1988;
Sueoka, 1992).
Across 30 species, variation in GC3s was broader than in GC12, with Poaceae species showing wider GC3s dispersion and a positive regression trend, indicating stronger third-position shifts propagating to the first and second positions. In contrast, Fabaceae species had narrower GC12 variation and weaker GC3s influence, supporting the ENC-GC3s pattern of high GC3s and low ENC in Poaceae and lower GC3s and higher ENC in Fabaceae (
Plotkin and Kudla, 2011).
This partition between GC3s and GC12 suggests a “strong compositional pressure not fully neutral,” with regression slopes deviating from unity. The neutrality plot thus highlights the roles of mutational bias, position-specific constraints and gene-set heterogeneity in codon-bias formation (Fig 7).
Correlation analysis comparison
At the whole-dataset level (n = 5,434 genes), GC3s was the primary axis of codon-bias variation, with the strongest correlation between Fop and GC3s (ρ = 0.754, FDR < 0.001), indicating that optimal codon usage tracks the third-position base composition gradient (
Hershberg and Petrov, 2008). CAI also correlated with GC3s (ρ = 0.416, FDR < 0.001) and negatively with Nc (ρ = -0.245, FDR < 0.001), suggesting coordinated codon usage (Table 1).
Poaceae species (
e.g., Osat, Sbic) showed strong negative Nc-GC3s correlations (ρ ≈-0.92), while Fabaceae species exhibited more variable Nc responses. Despite a positive Fop-GC3s correlation in Fabaceae, Nc responses differed, indicating that GC3s influences optimal-codon usage without directly affecting codon concentration (
Ikemura, 1981;
Chamary et al., 2006).
DeltaENC correlated with GC3s, with higher GC3s reducing compositional deviation in some species, while reinforcing codon-bias divergence in others, suggesting additional selective constraints (
Hershberg and Petrov, 2008). PR2 bias index showed lineage-specific patterns (
e.g., Atha, Cari, Lang).
These results highlight that GC3s is a key driver of codon-preference variation, but lineage-specific mechanisms in Nc and deltaENC require careful interpretation. The divergence between Poaceae and Fabaceae underscores the need to separate compositional influences from lineage-specific effects (
Plotkin and Kudla, 2011).
RSCU clustering analysis
Hierarchical clustering of the RSCU matrix from 30 angiosperms revealed stable codon usage patterns (Fig 8). Poaceae species (
e.g., Osat, Svir) clustered closely, while Fabaceae species (
e.g., Mru, Msat) showed more variation, suggesting codon usage is influenced by lineage history and mutational/selection constraints (
Bulmer, 1991;
Hershberg and Petrov, 2008).
The heatmap showed coordinated shifts in codon sets, especially between G/C and A/T-ending codons, supporting compositional-driven shifts rather than local drift (
Sharp and Li, 1986;
Hershberg and Petrov, 2008).
PCA of RSCU profiles (Fig 9) revealed that PC1 explained 86.5% of variance, separating Poaceae from eudicots/Fabaceae. Hvul showed deviation, indicating species-specific differences. The scree plot (Fig 10) confirmed that variation is primarily captured by PC1, with PC2 and higher components reflecting lineage-specific modulation (
Sharp and Li, 1987;
Hershberg and Petrov, 2008).
Correspondence analysis complements RSCU-based clustering and principal component analysis
Hierarchical clustering of the species-level RSCU matrix comprising 59 synonymous codons across 30 species revealed structured variation in codon usage (Fig 8). Species pairs such as Pper-Ccaj and Sbic-Bdis occurred in closely associated branches of the dendrogram, indicating similarity across multiple RSCU variables. Pvul and Gmax also clustered closely, whereas Atha and Alyr formed a separate association. These patterns are consistent with taxonomically structured variation in codon usage but do not, by themselves, establish phylogenetic relationships or identify the underlying evolutionary mechanisms
(Grantham et al., 1980; Hershberg and Petrov, 2008).
PCA of the RSCU profiles showed that PC1 and PC2 explained 86.5% and 5.3% of the total variance, respectively (Fig 9). Most Poaceae species occupied the positive side of PC1, whereas most Fabaceae and other eudicot species occurred on the negative side. Macu and Ncol occupied intermediate positions along PC1, while Hvul was separated from the main Poaceae grouping primarily along PC2. The scree plot further showed that variation in the RSCU profiles was dominated by the first principal component (Fig 10). This broad separation is consistent with the established influence of genome-wide compositional processes on codon usage, although the ordination itself does not distinguish among the underlying evolutionary forces
(Chen et al., 2004; Hershberg and Petrov, 2008).
Correspondence analysis of the species-level 64-codon count profiles provided a complementary representation of codon-count composition (
Greenacre, 2007). CA1 and CA2 accounted for 75.0% and 15.6% of the total inertia, respectively, together explaining 90.6% (Fig 11 and 12). Most Poaceae species formed a compact group on the positive side of CA1, whereas Fabaceae and other eudicot species were more broadly distributed on the negative side. Hvul was clearly separated from the main Poaceae cluster along CA2. The separation among Fabaceae species, including Msat, Lang and Mru, further indicated substantial within-family heterogeneity in their codon-count profiles.
Taken together, hierarchical clustering, PCA and CA revealed concordant patterns of codon usage that were broadly aligned with major taxonomic groups. The agreement among these complementary analyses supports the internal consistency of the observed compositional structure across alternative representations of codon usage. However, these analyses cannot, on their own, resolve the relative contributions of genome-wide mutational bias, translational selection, gene expression, or other evolutionary processes (
Ikemura, 1985;
Hershberg and Petrov, 2008).