This study used whole-genome sequencing (WGS) to resequence different individuals or tissues of sheep, a species with known genomic information, in order to discover differences between individuals. Genome-wide SNP and InDel molecular markers were developed in this way and then used for population genetics research.
Animals and genome sequencing
Two sheep populations, NKY and THT, were analyzed in this study. A total of 5 mL of jugular vein blood was collected from 390 individuals (185 NKY and 205 THT) in 2022 at Bayannaoer Academy of Agriculture and Animal Husbandry Sciences, Inner Mongolia, China. After the genomic DNA of each sample passed quality control, the DNA was randomly fragmented by sonication.
Bioinformatic analysis workflow
After the Illumina NovaSeq sequencing data (Raw Data) were generated, quality control was performed and low-quality data were filtered out to obtain high-quality data (Clean Data). Clean Data were aligned to the reference genome sequence using BWA software (Li and Durbin, 2009) to determine the genomic position of each read (i.e., BAM files). The BAM files were corrected following the GATK Best Practices workflow (McKenna et al., 2010), and SNP and small InDel markers were detected. Functional annotation of the SNPs and InDels was obtained using SnpEff software (Cingolani et al., 2012) together with the gene prediction information of the reference genome. Based on the resulting SNP and InDel molecular markers, genetic diversity, population structure, linkage disequilibrium and selective sweeps were further studied (see Table 1 for the bioinformatics analysis tools).
Linkage disequilibrium
In a population, linkage disequilibrium (LD) occurs when alleles at two different loci are inherited together significantly more often than expected under random association. The minimum genetic unit of a species can be obtained by linkage disequilibrium analysis.
Linkage between SNPs across all samples was analyzed using pairs of SNPs on the same chromosome. LD in natural populations is measured by the linkage disequilibrium coefficient (r²). The closer r² is to 1, the stronger the linkage. r² was fitted against the distance between SNPs in the genome. In general, the closer two SNPs are, the larger the r², and the farther apart they are, the smaller the r². The distance at which r² decays to half of its maximum value is generally used as the LD-decay value. A longer LD-decay distance indicates slower LD decay and a greater probability of linkage between SNPs in the species; a shorter LD-decay distance indicates faster LD decay and a smaller probability of linkage. It is generally believed that species with fast LD decay are relatively primitive. The LD decay of each subpopulation is shown in Fig 1.
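The r² and LD-decay definitions above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it assumes genotypes coded as 0/1/2 allele counts and uses the squared genotype correlation as an approximation of haplotype-based r². The function names `r_squared` and `ld_decay_distance` are hypothetical.

```python
import numpy as np

def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts).

    Assumption: genotype correlation is used as a stand-in for haplotype r^2.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    if g1.std() == 0 or g2.std() == 0:
        return float("nan")  # r^2 is undefined for a monomorphic SNP
    return float(np.corrcoef(g1, g2)[0, 1] ** 2)

def ld_decay_distance(distances_kb, mean_r2):
    """Smallest distance at which mean r^2 has dropped to half of its maximum.

    Inputs are assumed to be per-distance-bin averages of r^2.
    Returns None if r^2 never reaches half of its maximum.
    """
    d = np.asarray(distances_kb, dtype=float)
    r2 = np.asarray(mean_r2, dtype=float)
    order = np.argsort(d)              # sort bins by increasing distance
    d, r2 = d[order], r2[order]
    half = r2.max() / 2.0
    below = np.nonzero(r2 <= half)[0]  # indices of bins at or below the half value
    return float(d[below[0]]) if below.size else None
```

For example, `ld_decay_distance([1, 10, 50, 100, 200], [0.8, 0.5, 0.4, 0.2, 0.1])` reports the 50 kb bin, the first bin where mean r² is no more than half of the maximum.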
Genome-wide linkage disequilibrium (LD) analysis showed that the NKY and THT populations had nearly the same level of LD and rate of decay, with r² declining to a low, stable level at a distance of about 100 kb.
Population structure
Principal component analysis (PCA) is a purely mathematical method that reduces multiple correlated variables to a small number of important variables through linear transformation. PCA was performed on the SNP data using GCTA software to obtain the principal component clustering of 36 samples. PCA shows which samples are relatively close and which are relatively distant, which can assist evolutionary analysis. The results of the PCA, carried out on the genomic relatedness matrix between individuals, are illustrated in Fig 2.
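As an illustration of PCA on a genomic relatedness matrix, the following Python sketch builds the matrix from a 0/1/2 genotype matrix and takes its leading eigenvectors. It is a simplified stand-in for what GCTA computes, not a reproduction of the software: it assumes no missing genotypes, and the function names `grm` and `pca_from_grm` are hypothetical.

```python
import numpy as np

def grm(genotypes):
    """Genomic relatedness matrix from an (individuals x SNPs) matrix of 0/1/2 counts.

    Each SNP is centered by 2p and scaled by sqrt(2p(1-p)), where p is the
    allele frequency; monomorphic SNPs are dropped. Assumes no missing data.
    """
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)           # drop SNPs with no variation
    G, p = G[:, keep], p[keep]
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
    return Z @ Z.T / Z.shape[1]

def pca_from_grm(A, n_components=2):
    """Leading eigenvectors (principal components) and eigenvalues of a symmetric GRM."""
    vals, vecs = np.linalg.eigh(A)     # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order], vals[order]
```

Plotting the first two columns of the returned eigenvector matrix against each other gives the kind of sample clustering shown in a PCA figure.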
The population structure of the samples was then analyzed. The number of clusters (K) was assumed to range from 1 to 20. The optimal number of clusters was determined to be 14, the K value with the lowest cross-validation (CV) error. This suggests that the samples may derive from 14 ancestral populations. The population structure at K = 14 is shown in Fig 3.
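Selecting K by the lowest CV error can be sketched as follows. The text does not state which clustering software or log format was used here, so the sketch assumes log lines of the form `CV error (K=3): 0.512`. The function name `best_k` is hypothetical.

```python
import re

# Assumed log line format: "CV error (K=3): 0.512"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9]+(?:\.[0-9]+)?)")

def best_k(log_text):
    """Return (K with the lowest CV error, dict mapping each K to its CV error)."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors
```

In practice the logs of all runs (K = 1 to 20 in this study) would be concatenated and passed to the function as one string.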
Whole-genome resequencing can identify regions of the genome with higher FST values and lower π/θ values. The q chromosome results are shown in Fig 4.
FST is an important index of the degree of genetic differentiation between populations. If an allele in a population undergoes adaptive selection because of its high fitness in a specific environment, the increase in its frequency raises the level of population differentiation, which is reflected in a higher FST value in F statistics. π/θ represents the nucleotide polymorphism within a population; a selective sweep reduces polymorphism in the selected genomic region to significantly below the genome-wide average, which corresponds to a lower π/θ value.
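The logic of combining high FST with low within-population polymorphism can be sketched in Python. The estimators used in this study are not specified in the text, so the sketch makes three assumptions: a simple Wright-style FST from allele frequencies in two equally weighted populations, π as the average pairwise difference per base pair, and 95% and 5% quantile cut-offs. All function names are hypothetical.

```python
import numpy as np

def wright_fst(p1, p2):
    """Per-SNP FST = (HT - HS) / HT for two equally weighted populations.

    p1, p2: allele frequencies of the same allele in each population.
    Returns NaN where the SNP is monomorphic in the pooled sample.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    h_t = 2 * p_bar * (1 - p_bar)                      # total expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(h_t > 0, (h_t - h_s) / h_t, np.nan)

def nucleotide_diversity(p, n_chromosomes, window_bp):
    """Windowed pi: summed per-site pairwise diversity divided by window length."""
    p = np.asarray(p, dtype=float)
    n = n_chromosomes
    per_site = 2 * p * (1 - p) * n / (n - 1)           # small-sample correction
    return float(per_site.sum() / window_bp)

def sweep_candidates(fst, pi, fst_q=0.95, pi_q=0.05):
    """Boolean mask of windows with high FST and low pi (quantile cut-offs are assumptions)."""
    fst = np.asarray(fst, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return (fst >= np.nanquantile(fst, fst_q)) & (pi <= np.nanquantile(pi, pi_q))
```

Windows flagged by `sweep_candidates` are those where differentiation between the two populations is unusually high while diversity within the focal population is unusually low, which is the signature described above.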
An evolutionary tree represents the evolutionary relationships between species. Organisms are placed on a branched, tree-like diagram according to their genetic relationships, which concisely represents their evolutionary history. The phylogenetic tree is shown in Fig 5.
Before human-mediated specialization for wool and milk began some 4000-5000 years ago, sheep were initially raised for meat. One of the earliest morphological changes that likely accompanied domestication, and a trait now shared by many modern breeds, is the absence of horns; the underlying genomic region has recently been shown to carry strong evidence of accelerated change in response to artificial selection (Mohamadipoor Saadatabadi et al., 2021). Other genomic regions under selection in sheep contain genes that regulate body size, reproduction and color. The separation of animals into breeds, followed by the identification of superior rams and their disproportionate genetic contribution through artificial insemination, has increased the rate of genetic gain for production traits over the past few hundred years. Numerous genetic exchanges have taken place during the development of contemporary breeds, as evidenced by the high haplotype sharing and relatively short divergence times across breeds (Missohou et al., 2022). There are between 850 and 1409 different sheep breeds, and around 75% of contemporary breeds have maintained a population size of more than 300. However, several breeds that were originally selected for superior performance in a particular, sometimes remote, geographic location, such as the Shetland, Soay and Herdwick, are now regarded as rare. As generalist-type sheep have largely taken over intensive sheep farming, it has become difficult to maintain genetic diversity by preserving these historic breeds (Alberto et al., 2018).