Bases to Bytes

Cheap sequencing technology is flooding the world with genomic data. Can we handle the deluge?

Apr 25, 2012

The cost of sequencing human genomes is plunging—in the most advanced genomics centers, it’s falling five times faster than the cost of computing. Increasingly, people are getting their DNA sequenced by companies and research labs in a search for clues about genetic variation and disease.

Source: U.S. National Human Genome Research Institute, Nature
The map is based on data from a user-generated database of publicly available statistics, representing 60 to 70 percent of all machines; it excludes biotech and pharmaceutical companies and some sequencing service providers. Source: omicsmaps.com
Source: Illumina, Facebook, IDC

But the industry must figure out how to cheaply store all the resulting data. Each of the 3.2 billion DNA base pairs in a human genome can be encoded by two bits, which works out to 800 megabytes for the entire genome. But considerable data about each base is usually collected, and each stretch of DNA is often sequenced many times to ensure accuracy, so it’s common to save around 100 gigabytes when sequencing a human genome with a machine made by industry leader Illumina. Keeping this much data about every person on the planet would require about as much digital storage as was available in the whole world in 2010.
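The arithmetic behind those figures can be checked with a short sketch. The Python below is illustrative only: the mapping of A, C, G and T to particular bit patterns, the function names, and the round figure of 7 billion people are assumptions made for the example, not details from Illumina or the sources cited here.

```python
# Two bits are enough to tell four bases apart.
# This particular assignment of bit patterns is arbitrary (an assumption).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(n_bases)
    )


# Figures from the article.
GENOME_BASES = 3_200_000_000      # base pairs in a human genome
RAW_BYTES_PER_GENOME = 100 * 10**9  # roughly 100 gigabytes saved per genome

# Assumption for illustration: about 7 billion people.
WORLD_POPULATION = 7_000_000_000

packed_megabytes = GENOME_BASES * 2 / 8 / 10**6
world_exabytes = RAW_BYTES_PER_GENOME * WORLD_POPULATION / 10**18

print(f"Packed genome: {packed_megabytes:.0f} megabytes")
print(f"Raw data for everyone: {world_exabytes:.0f} exabytes")
```

Under these assumptions the packed genome comes to 800 megabytes, while raw data for the whole population runs to hundreds of exabytes, a gap of more than a hundredfold per genome between what is strictly needed and what is typically kept.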

The trick, then, will be to save less. Harvard geneticist George Church says that eventually only the differences between a newly sequenced genome and a reference genome will need to be stored. That information could be encoded in as little as four megabytes. Then your genome might be just another e-mail attachment.
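The idea of keeping only the differences can be sketched in a few lines. This is a deliberately simplified illustration, not Church's method or any real variant format: it assumes the two sequences are the same length and differ only by single-base substitutions, whereas real genomes also differ by insertions, deletions and rearrangements. The function names and the toy sequences are hypothetical.

```python
def diff(reference: str, sample: str) -> list[tuple[int, str]]:
    """List (position, base) pairs where the sample departs from the reference.

    Simplifying assumption: equal lengths, substitutions only.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (i, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


def apply_diff(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus its stored differences."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


# Toy sequences, invented for illustration.
reference = "GATTACAGATTACA"
sample = "GATCACAGATTGCA"

variants = diff(reference, sample)
print(variants)  # two substitutions, at positions 3 and 11
assert apply_diff(reference, variants) == sample
```

Only the short list of variants needs to be stored for each person; the reference is kept once and shared, which is why the per-genome cost can shrink so sharply.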

Information graphics by Infographics.com