Estimation and correction for GC-content bias in high-throughput sequencing
GC-content bias describes the dependence between fragment count (read coverage) and GC content found in high-throughput sequencing assays, particularly those using the Illumina Genome Analyzer technology. This bias can dominate the signal of interest in analyses that measure fragment abundance within a genome, such as copy-number estimation. The bias is not consistent between samples, and current methods for removing it from a single sample assume nothing about the shape or scale of the bias curve. In this work we analyze regularities in the GC-bias patterns and find a compact description of this family of curves. It is the GC content of the full DNA fragment, not only of the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are under-represented in the sequencing results. Based on these findings, we propose a new method to calculate predicted coverage and correct for the bias. This parsimonious model produces single-base-pair predictions, which suffice to predict the GC effect on fragment coverage at all scales, on all chromosomes, and for both strands; this allows optimal GC-effect correction regardless of the downstream smoothing or binning. We demonstrate our model's potential for improving on current approaches to copy-number estimation. These GC-modeling considerations can also inform other high-throughput sequencing analyses, such as ChIP-seq and RNA-seq. Finally, our analysis provides empirical evidence strengthening the hypothesis that PCR is the most important cause of the GC bias.
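To make the idea of a single-base-pair prediction driven by fragment GC content concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single fixed fragment length, a single strand, and no mappability filtering, and it stratifies every genomic position by the GC count of the fragment-length window starting there. The function names (`fragment_gc_counts`, `estimate_gc_rates`, `predict_starts`) and the `starts` input array are hypothetical and introduced only for illustration.

```python
import numpy as np

def fragment_gc_counts(seq, frag_len):
    """GC count of the frag_len-long window starting at each position.

    Returns an integer array of length len(seq) - frag_len + 1.
    """
    is_gc = np.array([base in "GCgc" for base in seq], dtype=np.int64)
    # Prefix sums let us get every window count with one subtraction.
    csum = np.concatenate(([0], np.cumsum(is_gc)))
    return csum[frag_len:] - csum[:-frag_len]

def estimate_gc_rates(gc, starts, frag_len):
    """Mean number of fragment starts per position, for each GC count.

    gc     : window GC count at each position (from fragment_gc_counts)
    starts : observed number of fragments starting at each position
    Strata with no positions are returned as NaN.
    """
    n_pos = np.bincount(gc, minlength=frag_len + 1)
    n_frag = np.bincount(gc, weights=starts, minlength=frag_len + 1)
    rates = np.full(frag_len + 1, np.nan)
    seen = n_pos > 0
    rates[seen] = n_frag[seen] / n_pos[seen]
    return rates

def predict_starts(gc, rates):
    """Predicted fragment starts at each position: the rate of its GC stratum."""
    return rates[gc]

# Toy usage with made-up counts (not real data).
gc = fragment_gc_counts("GGAAGGAA", frag_len=2)
starts = np.array([0, 3, 0, 5, 0, 4, 0], dtype=float)
rates = estimate_gc_rates(gc, starts, frag_len=2)
predicted = predict_starts(gc, rates)
```

Because the prediction is made per base pair, it can be summed over any bin or smoothing window afterwards, which is the property the abstract relies on when it says the correction is independent of downstream binning. In this toy input the estimated rate peaks at the intermediate GC count, mirroring the unimodal shape described above.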