Whole-genome sequencing and mapping of short reads
Sequencing was carried out using Illumina Hiseq 2500 and Novoseq 6000, which produced an average of 72.65±4.79 GBs of data ranging from 49.46 GB to 92.21 GB in each sample. Out of 424,750,051 raw paired-end reads produced for each sample, 391,021,066 reads (ranging from 267,321,089 to 512,695,893 reads) with a length of 151 bp, an average GC content of 44±0.25 per cent per sequence and Phred score of more than 30 were mapped to the reference genome (
Bos-indicus-1.0) successfully. Variants identified using the
Bos indicus reference genome will be more specific to indicine breeds
Devadasan et al., (2020). Around 38,01,32,572±17,371,788.63 reads were uniquely mapped with an average depth of 19.35-fold ranging from 12.59 to 29.51 and coverage of 97.28 per cent (Table 1). Coverage was affected by GC content, sequencing technology, read length, library preparation, structural variants and novel sequences
(Liao et al., 2013). An average of 15x depth coverage was sufficient to identify almost 75 per cent of heterozygous variants and the accuracy increases with the increase in depth
(Bentley et al., 2008).
Identification of SNPs and indels
The total length of the genome was 2,673,965,444 bp. The average depth achieved was 17.67-fold (ranging from 12.9 to 22.6). 29,366,340 variants per sample were identified across the genome, which included 25,944,935 SNPs and 3,421,405 indels. Multiallelic SNP sites were 1,042,672. Out of the total variants identified, 57 per cent (ranging from 52 to 62 per cent) were homozygous and 37 per cent (ranging from 37 to 47 per cent) were heterozygous. The mean ratio of transitions and transversions was 2.17±0.003 (2.14 to 2.19) in each sample (Table 2), which is in agreement with previous studies
(Choi et al., 2015, Das et al., 2015 and
Stafuzza et al., 2017), indicating accuracy of SNP identification. The variant rate of the genome was around 91bp (Table 3). A high variant rate was observed in chromosome 23 with one variant for every 74 bp, Y chromosome had the lowest variant rate with one variant per 1022 bp. The lower variant rate in Y chromosome may be due to its haploid state in males, which results in the reduction of sequencing depth and lower rate of variant identification. Also, selection process reduces the retention of mutation due to the effect of recessive allele in hemizygous condition
Stafuzza et al., (2017). The number of variants is proportionate to the length of chromosome (Fig 1), which is in agreement with the previous studies by
Kawahara-Miki et al., (2011) and
Stafuzza et al., (2017).
Among the total indels identified, 1,287,879 were insertions and 2,133,526 were deletions. The length of the indel varied from 1 to 49. Single nucleotide insertions and deletions (948,585 and 1,046,590 respectively) were more commonly present. These results are similar to the findings of
Eck et al., (2009), Liao et al., (2013) and
Iqbal et al., (2019). Number of indels reduced when the length increased (Fig 2).
Around 25,237,418 variants out of the total were common for all the five breeds of cattle. Whereas, 1225844, 568,634, 528,093, 838,145 and 968,206 variants were specific to Alambadi, Bargur, Kangayam, Pulikulam and Umblachery cattle respectively.
Functional annotation of SNPs and indels in genome
Among the total SNPs, 61 per cent (17071364) of SNPs were present in intergenic region and 28.6 per cent (799258) were present in the intron region. SNPs found within 5kb upstream and downstream of a gene were 952518 (3.41 per cent) and 904544 (3.24 per cent) respectively. Around 191294 (0.7 per cent) SNPs were present in 3 prime and 5 prime UTR regions. Splice site SNPs included 761 splice acceptors, 911 splice donor and 20068 splice region SNPs (Table 4). These results are in accordance with the previous studies carried out by
Kawahara-Miki et al., (2011), Liao et al., (2013), Das et al., (2015), Choi et al., (2015), Rosse et al., (2017) and
Iqbal et al., (2019) in Kuchinoshima-Ushi cattle, Gir cattle, Danish Holstein cattle, Hanwoo, yanbian cattle, Guzera cattle and Native cattle breeds of Pakistan.
Functional annotation of indels showed that around 59.18 per cent (2,202,674) of indels were observed in the intergenic region and 29.9 per cent (1,112,682) were in the intronic region. Also, two indels lead to loss of exon, two cause 5 prime UTR truncation, 9235 leads to frameshift variation and 3624 affects splice sites (Table 5).