Big genetic data: should we throw it away?
With the promise of ever faster, smaller and higher-throughput DNA reading technology, biologists are caught in a data deluge as DNA sequencing machines pour out the genetic code from cancer cells, human subjects, bacteria, viruses and research organisms. Dealing with the data has become a serious challenge.
Ultimately, any solution will require immense data processing to extract meaning as well as changes in where and how computer power is brought to bear.
But medical practitioners need actionable results, and what's actionable is in fact very modest in size. It's more of a data sprinkle than a deluge.
How can we reconcile the antediluvian fears with a more forward-looking perspective? What does the future of genetic data hold, and how will we get there?
My modest suggestion: throw away the raw data.
Big Genetic Data
When hip web 2.0 companies post desperate ads to hire experts in Big Data, they usually mean software engineers and statisticians who can work with broad, shallow data that's location-based, transactional, and tracks user activity all in a diffuse cloud of distributed data centers.
In biotechnology, Big Data means something different. It refers mainly to images and numerical measurements from laboratory experiments. Most biotech and medical data fits perfectly in the Big Data and cloud computing paradigms.
Genetic data from DNA sequencing -- reading all of the A, C, G and T letters of the genetic code -- stands out as the great outlier that's breaking the model. Why? DNA sequencers are spewing out too much data even for the Big Data paradigm to handle.
Consider the output commonly stored today: raw data consists of image files parsed into signal measurements with quality scores. These are transformed into short sequences from the genetic code. But these short sequences have to be put together again, and they're filled with noise, low quality data that must be thrown out, and ambiguity. All of the processed data adds to the storage requirements.
DNA sequencers generate so much data per day that a fast internet connection is required merely to offload it. A typical corporate internet connection can handle the load, but the bandwidth cost per instrument adds up.
The National Institutes of Health nearly canceled the Sequence Read Archive, a service hosting public domain data for free, because the volume of data is exploding far faster than storage costs are dropping.
Which will give out? I believe that for clinical and medical sequencing, it will be the DNA sequencing industry that changes, not the Big Data model. The reasons are the same as for image data from surveillance cameras, and will lead to a common paradox of success, where the leaders in churning out data face competition from radical new entrants. The relevant data is actually very small.
A Tractable Problem
Consider a human genome of 3 billion characters of DNA. Counting maternal and paternal chromosomes, there are twice as many, or 6 billion. To be sure of reaching the correct answer, biologists tend to read those 6 billion many times over.
But people's genomes differ very little, by only about 0.1%. Most bioinformatics experts agree that ultimately it's only necessary to store genetic differences.
All of the biological variation in an entire human genome can already fit on a smartphone, in about the space of one moderately high resolution photo. It's technically feasible and even cost-effective to store that much data for the entire human race. Even adding the complexity of disease samples, it's already tractable.
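The arithmetic behind that claim can be sketched roughly as follows. Only the 3 billion character genome and the 0.1% difference rate come from the text above; the bytes per stored difference, the read coverage and the bytes per raw base are illustrative assumptions, so the result is an order-of-magnitude estimate and not a measurement.

```python
# Back-of-the-envelope storage estimate for one genome.
# Only the genome length and the 0.1% rate come from the article;
# the remaining constants are hypothetical values chosen for illustration.

HAPLOID_GENOME_BASES = 3_000_000_000  # "3 billion characters of DNA"
DIFFERENCE_RATE = 0.001               # "only about 0.1%"

BYTES_PER_VARIANT = 4   # assumption: packed position offset plus allele
COVERAGE = 30           # assumption: each base is read about 30 times
BYTES_PER_RAW_BASE = 2  # assumption: one byte for the base, one for quality


def variant_count(genome_bases: int = HAPLOID_GENOME_BASES,
                  rate: float = DIFFERENCE_RATE) -> int:
    """Expected number of positions that differ from a reference."""
    return round(genome_bases * rate)


def variant_bytes(bytes_per_variant: int = BYTES_PER_VARIANT) -> int:
    """Storage needed if only the differences are kept."""
    return variant_count() * bytes_per_variant


def raw_read_bytes(coverage: int = COVERAGE,
                   bytes_per_base: int = BYTES_PER_RAW_BASE) -> int:
    """Storage needed for the redundant raw reads under the assumptions."""
    return HAPLOID_GENOME_BASES * coverage * bytes_per_base


def reduction_factor() -> float:
    """How many times smaller the differences are than the raw reads."""
    return raw_read_bytes() / variant_bytes()


if __name__ == "__main__":
    print(f"differences: {variant_count():,}")
    print(f"differences only: {variant_bytes() / 1e6:.0f} MB")
    print(f"raw reads: {raw_read_bytes() / 1e9:.0f} GB")
    print(f"reduction: {reduction_factor():,.0f}x")
```

Under these assumptions the differences come to roughly 12 MB against about 180 GB of raw reads. Changing the assumed constants moves the numbers, but the gap stays several orders of magnitude wide.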
But first we need to throw away the redundant and low quality raw data -- or develop technologies that never generate it in the first place.
Throw Away the Raw Data
The world's surveillance cameras could potentially generate exabytes of data daily -- more than all of the world's DNA sequencers. For digital video, part of the solution is to aggressively process and discard data, distilling it down to what's recent or meaningful. In the case of genetic data, what's meaningful is manageably small.
As with digital video, a workable solution to the genetic data overflow is to process and reduce the data as close to the point of generation as possible, and then push only the key results out to the data center, whether locally or remotely to a cloud provider.
But wait, aren't scientists already doing it that way? Actually, no. There's a widespread belief among researchers that all intermediate data, beginning with raw sequence fragments, is irreplaceably valuable. What if a scientist ever wants to trace back to the source to verify a conclusion? Even the scientific publication process requires deposition of raw data in a public repository so that others can verify results.
That's the difference between the researcher market and the much larger applied markets for medical sequencing.
To cope with all of the public data submitted by researchers, some, like Ewan Birney of the European Bioinformatics Institute, have suggested lossy compression of DNA sequences.
Bioinformaticians, who make their livelihoods by digging through all of the intermediate data, reprocessing it, changing parameters, and staring at artifacts and anomalies, protest that the end data quality is not yet good enough to discard everything prior to the final output. In fact, they might argue that it will never be good enough, and there will always be benefits -- additional biologically relevant information -- to glean from the distribution and raw errors. In any case, they might say, we're years or decades away from being able to escape the flood.
Meanwhile, the clinical and diagnostic markets want yes/no answers: does my patient have a mutation that should influence what drug I prescribe or diagnosis I give?
In applied markets, the raw data will probably never be deposited in a public repository. It's confidential and raises patient privacy concerns. All that matters is the end result. And as long as the process to generate it has been validated, there's no need to store the intermediate data, only the answer at the end.
Imagine a simple DNA sequencer that doesn't output a deluge of data at all. Suppose the instrument includes embedded data processing, and for cost reasons, all intermediate data lives in memory only. Getting at the raw and intermediate data is not even an option. The instrument outputs only differences from a reference sequence as a single file in a standard format.
For novel genomes for disease surveillance applications, the DNA reader outputs only the assembled genome.
For diagnostic tests, it outputs only a report with at most a handful of metrics, a confidence value and a yes/no/maybe verdict.
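A diagnostic report of that kind could be as small as the following sketch. The field names, the `make_report` helper and the 0.95 and 0.05 thresholds are all hypothetical, chosen only to illustrate a confidence value being turned into a yes/no/maybe verdict; a validated clinical process would set its own cutoffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticReport:
    """The only thing the hypothetical instrument hands back."""
    variant_name: str   # which variant or signature was tested for
    probability: float  # confidence that the variant is present, 0.0 to 1.0
    verdict: str        # "yes", "no" or "maybe"


def make_report(variant_name: str, probability: float,
                yes_at: float = 0.95, no_at: float = 0.05) -> DiagnosticReport:
    """Turn a confidence value into a verdict using assumed thresholds."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= yes_at:
        verdict = "yes"
    elif probability <= no_at:
        verdict = "no"
    else:
        verdict = "maybe"
    return DiagnosticReport(variant_name, probability, verdict)


if __name__ == "__main__":
    # "variant-A" is a placeholder name, not a real marker.
    print(make_report("variant-A", 0.99))
    print(make_report("variant-A", 0.50))
```

The whole record is a few dozen bytes, which is the point: nothing upstream of it needs to leave the instrument.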
A typical file for a human genome or a new bacterial strain can already travel on a mobile phone even without employing compression algorithms. Store the end results only, and the public data repositories would solve their data archiving problems overnight -- even for the research markets.
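To make the idea of a difference-only output concrete, here is a toy sketch that compares a sample against a reference and keeps only the positions that differ. It assumes the two sequences are already aligned and of equal length, and it handles single-letter substitutions only; real instruments would also have to deal with insertions, deletions and read alignment. The `Variant` record and the tab-separated line layout are illustrative, not a standard format.

```python
from typing import Iterator, List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position in the reference
    ref: str       # letter in the reference
    alt: str       # letter observed in the sample


def call_substitutions(reference: str, sample: str) -> Iterator[Variant]:
    """Yield one record per position where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch assumes pre-aligned, equal-length input")
    for position, (ref, alt) in enumerate(zip(reference, sample), start=1):
        if ref != alt:
            yield Variant(position, ref, alt)


def to_lines(variants: List[Variant]) -> List[str]:
    """Render the differences as tab-separated text lines."""
    return [f"{v.position}\t{v.ref}\t{v.alt}" for v in variants]


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    sample = "ACGAACGTAT"
    for line in to_lines(list(call_substitutions(reference, sample))):
        print(line)
```

Everything that matches the reference is simply never written, so the output grows with the number of differences and not with the length of the genome.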
Science Will Advance Faster
Do scientific journals refuse to publish papers without having a permanent record of the potentially terabytes of raw data? Currently, yes. In the near future, probably not.
Would the scientific process grind to a halt without all that raw data? Absolutely not. To reproduce the end results, anyone could simply run a similar biological sample on the same make of instrument and compare the end results. Ultimately, validation of the end-to-end process lends much more credence to the findings anyway.
Imagine the joy of the medical researcher or biologist at having such a manageable and simple result, distilled down to its most useful and actionable form. Imagine the relief of having all of the oceans of data concentrated into simple statistical measures that anyone can understand, such as the probability that a patient has a genetic variant or signature that's linked to disease, progression or treatment.
The clinical and diagnostic markets -- everyone except the computational biologists engaged in the most advanced fundamental research -- will embrace the simplicity. In fact, they'd much prefer it, and would rather not pay for what they don't actually want or need anyway.
And that is the future we will have, sooner or later. Then, and only then, will the promise of the genetic revolution truly become a reality for human health. Genetic sequencing will become a routine part of medical care only after the flood, when the days of the data deluge have passed.