Sequencing quality evaluation and baseline data analysis
Upon detection, the micro-libraries constructed in this study had satisfied the requirement of transcriptome sequencing. The Agilent High Sensitivity DNA Kit was adopted for Agilent 2100 Bioanalyzer detection and the peak graphs of the 35 samples suggested the presence of obvious target product main peaks within the length scope of 1-2 kb, some small fragment products could be observed at below 1kb, but they had accounted for a small proportion, suggesting that the original samples had good integrity, which were eligible upon judgment and had conformed to the requirement of library construction. The Q30 percentage of the 35 samples ranged from 89.70% to 94.10%, indicating high sequencing quality and library construction quality; therefore, the sequencing data were accurate and reliable, which could be used in subsequent analysis. In the sequencing results, the base C/G proportions of the 35 samples were close to 50% and the contents were basically corresponding and coinciding, suggesting stable and balanced base composition and high sequencing quality.
The comparison of clean reads with reference genome indicated that, the total mapped of the 35 samples were greater than 82.42%, suggesting extremely successful mRNA extraction and library construction, excluding the contamination by various exogenous impurities. Multiple mapped suggested that the sequences obtained after sequencing might be compared to the similar regions in genome and its proportion should be appropriately lower than 10%. In this study, such data of the 35 samples were below 3.5%, demonstrating excellent data quality, which could provide excellent foundation for subsequent analysis. Exon number distribution of transcripts, including mRNA and lncRNA. Based on the structural features of lncRNAs and the functional characteristics of noncoding proteins, we have established a series of rigorous screening conditions to screen lncRNAs. Wayn diagram of lncRNA for Coding ability prediction using software (CPC, txCdsPredict, CNCI) and database (pfam) is shown in Fig 1. In order to more intuitively show the number of genes in different FPKM intervals for each sample, we conducted statistics on the number of genes in three cases of FPKM (FPKM<=1, FPKM 1~10, FPKM>=10).
Expression difference analysis between groups
In this study, Bowtie2 was adopted to align clean reads to the gene reference sequence and counted the gene coverage. The differentially expressed transcripts among different samples could be obtained through differential analysis and the corresponding genes or target genes of these transcripts might exert their functions under different treatments or in different phenotypes. The DEGseq software was to analyze the gene expression levels of all samples, with union being used as the model (Fig 2). FPKM=1 was used as the threshold and that of >1 indicated up-regulation of gene expression, while that of <1 suggested gene down-regulation of gene expression. Moreover, the default screening thresholds were set as |log2ratio|≥1 and p-value <0.05. The overall distribution of the differentially expressed transcripts or genes could be intuitively observed from the scatter plot and the DEGseq was utilized to analyze the differential expression of genes among the samples. It could be seen from Fig 2, compared with natural estrus group, induced estrus group was associated with 23485 DEGs, including 1480 up-regulated and 1714 down-regulated ones. In order to better understand gene function, we will annotate the assembled novel mRNA and known mRNA. The annotated databases include NR, NT, GO, KOG, KEGG, SwissProt. We use Blast or Diamond software to annotate mRNA with NT, NR, KOG, KEGG and SwissProt and use Blast2GO and NR software annotation results for GO analysis.
Gene Oncology (referred to as GO for short) is the international standard classification system for gene function. GO can be divided into three parts, including Molecular Function, Biological Process and Cellular Component. It could be seen from Fig 3 that, the differentially expressed DEGs compared between induced estrus and natural estrus group were enriched to molecular functions (such as catalytic activity, transcription activity, signal transducer activity and molecular function regulator), biological processes (such as cellular process, biological regulation, metabolic process, response to stimulus and multicellular organismal process) and cellular components of cell part, organelle, membrane and macromolecular complex.
Furthermore, we had also employed KEGG and KOG Pathway enrichment analysis on the DEGs, as displayed in Fig 4-6. The results suggested that, the differentially expressed genes between induced estrus and natural estrus group were enriched to the cellular processes (such as transport and catabolism, cellular community-eukaryotes, cell growth and death, cell motility), environmental information processing(signal transduction, signaling molecules and interaction, membrane transport), genetic information processing (folding, sorting and degradation, translation, replication and repair, transcription), metabolism (global and overview maps, lipid metabolism, carbohydrate metabolism, amino acid metabolism, glycan biosynthesis and metabolism), organismal systems (immune system, endocrine system, digestive system and sensory system). In addition, we had also done KEGG Pathway enrichment analysis on the DEGs between all samples.
SNP and InDel analysis
Single Nucleotide Polymorphisms (SNPs) refer to changes in DNA sequence caused by changes in single nucleotides (A, T, C, or G), resulting in diverse genomes among species or individuals, including the base replacement and substitution
etc. After comparing with the reference genome, we used the software GATK to perform SNP detection on each sample and mapped them to specific genes. We detected approximately 195,601 to 2,505,134 SNPs and 13,016 to 191,097 InDel in the 35 samples. The most frequent SNPs were A>G, C>T, G>A and T>C. The most frequent length of InDel was one nucleotide.
Alternative splicing analysis
We use Asprofile software to detect alternative splicing events in different samples. We detected 11 types of Alternative exon ends (AE), Intron retention (IR), Multi-IR (MIR), Multi-exon SKIP (MSKIP), Skipped exon (SKIP), Alternative 5' first exon (TSS), Alternative 3' last exon (TTS), Approximate AE (XAE), Approximate IR (XIR), Approximate MSKIP (XMSKIP), Approximate SKIP (XSKIP). The top two AS events were TSS and TTS AS events, which accounted for more than 82% of AS events in each library (Fig 7).
lncRNA’s miRNA precursor prediction
We used Blast to align lncRNA to miRBase to find potential miRNA precursors. lncRNA with miRNA precursor alignment coverage greater than 90% was selected
(Griffiths-Jones et al., 2006). Our study found that the number of novel lncRNAs align to miRNA precursors is 211. Number of positions aligned to miRNA precursor (a lncRNA may have different positions to match miRNA precursor) is 2,942.
lncRNA target gene prediction and family analysis
The function of LncRNA is mainly achieved by acting on the target gene through cis or trans. The basic principle of Cis action target gene prediction believes that the function of lncRNA is related to the protein coding genes whose coordinates are close, so the mRNA adjacent to lncRNA is selected as its target gene. However, tran regulation does not depend on the positional relationship. We predict it by calculating the binding energy. If there is an overlap between lncRNA and the target gene, we would carefully classify the situation of overlap
(Bu et al., 2012), which help us understand the details of cis regulation and Overlap classification statistics is shown in Fig 8.
We use software INFERNAL to align lncRNA to Rfam database, so as to annotate the family of lncRNA
(Nawrocki et al., 2009, Burge et al., 2013). Rfam is a database containing information on various ncRNA families, including conserved regions of RNA secondary structure, mRNA cis-acting elements and other RNA elements. It divides ncRNAs into different families based on their common ancestors at the evolutionary. The predicted statistical graph of the lncRNA family is as shown in Fig 9.