Annotation-Free Estimates of Gene-Expression from mRNA-Seq

September, 2012
Report Number: 
Elizabeth Purdom

mRNA-Seq experiments provide an impressive array of information about the transcriptome of a sample. Yet in organisms that undergo alternative splicing, correctly estimating standard measures of gene expression can be a complex problem. The simple estimate based on the number of fragments aligning to a gene has the potential to be biased. Many methods now exist that estimate the expression of individual isoforms, which can then be combined to give accurate gene expression estimates. However, isoform estimates require either knowledge of the transcriptome or the ability to accurately predict it, and many mRNA-Seq experiments are run on organisms with no known genome, much less a transcriptome. In addition, these methods are computationally intensive and usually require access to the raw reads, making them difficult to use for researchers who want to analyze large numbers of samples.

We examine estimates based on summaries that are easy to obtain and analyze, specifically methods based on counting the number of sequenced fragments that overlap exons. We compare these methods to isoform-based gene estimates. We show that in simulated data our gene estimation methods based on exon counts give reasonable gene estimates in the presence of moderate alternative splicing. We also compare all of these methods on two mRNA-Seq datasets and observe little difference among them. In such cases, simple count-based methods can be sufficient, and they allow the experimenter to make use of statistical techniques that appropriately account for the biological variation between samples.