Monday, January 27, 2014 - 14:00
Getting an accurate genome sequence requires that you collect the data at least twice, argue Robasky, Lewis, and Church in their recent opinion piece in Nat. Rev. Genetics.

The DNA sequencing world kicked off 2014 with an audacious start. Andrew Pollack ran an article in the New York Times implying that 100,000 genomes will be the new norm in human genome sequencing projects. The article focused on a collaboration between Regeneron and Geisinger Health in which they plan to sequence the exomes (the ~2% of the genome that encodes proteins and some non-coding RNA) of 100,000 individuals. Several other projects were cited in the article as well.

Next, Illumina claimed they can achieve the $1000 genome at the annual JP Morgan investor conference, where they introduced their new sequencing instrument, the HiSeq X Ten. Ten is the magic number because you must buy ten instruments, at $1 million each, to have the opportunity for $1000 genomes. Illumina claims their $1000 cost includes sample prep and amortization costs. The folks at the AllSeq blog estimate that the total investment is really $72 million, since it will take 72,000 genomes, collected over four years, to reach the amortized cost of $1000 per genome.

Unfortunately, the above estimates are based on getting data from samples that are sequenced only once. Therein lies the rub. According to Robasky and team, sequencing genomes with high accuracy requires that they be sequenced, minimally, in duplicate. While some sequencing technologies claim they can produce data with errors as low as one in 10 million bases, a six-billion-base genome sequence will still contain hundreds of false-positive variants. Several aspects of the sequencing process contribute to this error, including purifying DNA, preparing DNA for sequencing, collecting sequence data, and comparing the resulting data to the reference sequence to identify variants (bases that differ between sample and reference).
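The numbers above can be checked with some back-of-the-envelope arithmetic. A minimal sketch in Python, using the figures quoted in the text (the "one in 10 million" error claim, a ~6 billion base diploid genome, and AllSeq's $72 million estimate); treating replicate errors as independent is an assumption that holds only for stochastic, not systematic, errors:

```python
# Figures taken from the discussion above; all are assumptions, not measurements.
genome_size = 6e9   # ~6 billion bases in a diploid human genome
error_rate = 1e-7   # claimed "one in 10 million bases" error rate

# Expected false-positive calls from a single sequencing pass:
single_pass_errors = genome_size * error_rate
print(single_pass_errors)  # 600.0 -- hundreds of false positives

# If errors are stochastic and independent between replicates, a false
# call must recur in both copies of a duplicate run to survive:
duplicate_errors = genome_size * error_rate ** 2
print(duplicate_errors)    # 6e-05 -- effectively zero

# AllSeq's HiSeq X Ten amortization estimate:
total_investment = 72e6    # $72 million over four years
cost_per_genome = 1000     # the advertised $1000 genome
genomes_needed = total_investment / cost_per_genome
print(genomes_needed)      # 72000.0 genomes
```

Note that the duplicate figure drops so dramatically only because the calculation assumes independence; systematic biases, which recur in every replicate, are exactly what this arithmetic cannot capture.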
The authors then explain that some errors occur through random statistical variation (stochastic), while others occur because of systematic biases in the different processes, and propose that collecting data in a replicated fashion is a cost-effective way to reduce errors. Indeed, a current standard of practice is to confirm variants observed by massively parallel next generation sequencing (NGS) by sequencing small regions containing the variant using capillary electrophoresis (Sanger). This is an expensive approach because it requires that individual regions be isolated and sequenced in more laborious ways. As NGS sequencing costs drop, however, labor-intensive confirmation methods become less attractive, and replicates become more feasible.

The paper describes four different kinds of assay replication: read depth (oversampling), technical, biological, and cross-platform, and discusses their strengths and weaknesses in terms of error reduction and cost. The authors also describe the kinds of errors that have been observed. However, relative to technical advancements, these observations are out of date, and published analyses of current error sources are lacking. Some issues continue to exist, others may have been solved, and new ones are likely, so labs establishing sequencing services, especially in clinical arenas, need strategies to identify and reduce errors in their data.

Finally, the authors make an additional important point: errors related to data processing, limitations of read (collected sequence) length, and the completeness of reference materials cannot be addressed by replicates alone. New technological solutions will be needed.

So, what does this mean?

Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics. DOI: 10.1038/nrg3655

Pollack, A. (2014, January 13). Aiming to push genomics forward in new study. The New York Times.