Georgia Tech wins $25 million to advance DNA-based archival data storage

The demand for archival data storage has been skyrocketing, and if a new research initiative reaches its goals, that need could be met by taking advantage of an efficient and robust information storage medium that has proven itself through the centuries: the biopolymer DNA.

The Intelligence Advanced Research Projects Activity's (IARPA) Molecular Information Storage (MIST) program has awarded a multi-phase contract worth up to $25 million to develop scalable DNA-based molecular storage techniques. The goal of the project, which will be led by the Georgia Tech Research Institute (GTRI), is to use DNA as the basis for deployable storage technologies that can eventually scale into the exabyte regime and beyond with reduced physical footprint, power and cost requirements relative to conventional storage technologies.

The technology already exists for writing information into DNA, the molecule that also encodes the genetic blueprint for living organisms, and for reading it back out. But significant advances will be needed to make it commercially practical and cost-competitive with established magnetic tape and optical disk memory. While current archival storage has a limited lifetime, information stored in DNA could last for hundreds of years.

Caption: GTRI researchers Brooke Beckert, Nicholas Guise, Alexa Harter and Adam Meier are shown outside the cleanroom of the Institute for Electronics and Nanotechnology at the Georgia Institute of Technology. Device fabrication for the DNA data storage project will be done in the facility behind them. Credit: Branden Camp, Georgia Tech

"The goal is to significantly reduce the size, weight and power required for archival data storage," said Alexa Harter, director of GTRI's Cybersecurity, Information Protection, and Hardware Evaluation Research (CIPHER) Laboratory. "What would take acres in a data farm today could be kept in a device the size of the tabletop. We want to significantly improve all kinds of metrics for long-term data storage."

The Scalable Molecular Archival Software and Hardware (SMASH) project resulted from a proposal prepared by GTRI, San Francisco-based Twist Bioscience and San Diego-based Roswell Biotechnologies.

In the project plans, Twist will engineer a DNA synthesis platform on silicon that "writes" the DNA strands that carry the data. Roswell will provide DNA sequencing, or "reading" technology. At Georgia Tech, the project will involve fabrication facilities at the Institute for Electronics and Nanotechnology and researchers in such specialties as chemistry and information theory, who will also draw from four of GTRI's eight laboratories.
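To make the "writing" and "reading" steps concrete, here is a minimal sketch of how digital data can be mapped onto the four DNA letters and back. The two-bits-per-nucleotide mapping and the function names `encode` and `decode` are illustrative assumptions, not the SMASH project's actual scheme; practical systems also add error correction and avoid sequences that are hard to synthesize or sequence.

```python
# Hypothetical mapping: two bits per nucleotide. This is an assumption for
# illustration only; it includes no error correction or sequence constraints.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of nucleotides (the "write" direction)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a string of nucleotides back into bytes (the "read" direction)."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(message)
print(strand)                     # CACACATGCAAC
print(decode(strand) == message)  # True
```

Each byte becomes four nucleotides under this mapping, so the three-byte message yields a 12-letter strand.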

"The reason people are looking at DNA for storage is that it has evolved over the ages as a very compact and reliable means of information storage," said Nicholas Guise, a GTRI senior research scientist. "It's so compact that a practical DNA archive could store an exabyte of data--equivalent to a million terabyte hard drives--in a volume about the size of a sugar cube. Scientists have been able to read DNA from animals that died centuries ago, so the data lasts essentially forever under the right conditions."

Technology for encoding and decoding DNA works at small scales today, but to be useful for commercial archival purposes, researchers will have to scale up the production of synthetic DNA, reliably connect it to established supercomputing systems and improve the speed of the data writing and reading process. The project goal would be to encode and decode terabytes of data in a day at costs and rates more than 100 times better than current technologies.
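A rough back-of-the-envelope calculation shows what "terabytes of data in a day" implies as a sustained rate. The sketch below assumes decimal terabytes and an idealized two bits per nucleotide with no redundancy; both are assumptions made for illustration, and real encodings would need more nucleotides per byte.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
BYTES_PER_TERABYTE = 10**12      # decimal terabyte (assumption)
BITS_PER_NUCLEOTIDE = 2          # assumption: ideal encoding, no redundancy


def bytes_per_second(terabytes_per_day: float) -> float:
    """Sustained data rate needed to move the given volume in one day."""
    return terabytes_per_day * BYTES_PER_TERABYTE / SECONDS_PER_DAY


def nucleotides_per_second(terabytes_per_day: float) -> float:
    """Nucleotides that must be written or read each second at that rate."""
    bits_per_second = bytes_per_second(terabytes_per_day) * 8
    return bits_per_second / BITS_PER_NUCLEOTIDE


print(f"{bytes_per_second(1) / 1e6:.1f} MB/s")                # 11.6 MB/s
print(f"{nucleotides_per_second(1) / 1e6:.1f} million nt/s")  # 46.3 million nt/s
```

Under these assumptions, a single terabyte per day already means synthesizing or sequencing tens of millions of nucleotides every second, which is why parallelism matters so much.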

DNA data storage won't initially replace server farms for information that must be accessed quickly and often. Because of the time required for reading and decoding, the technique would be useful for information that must be kept indefinitely, but accessed infrequently.

Part of the technical challenge is interfacing the DNA with standard CMOS electronic technologies. The researchers plan to build hybrid chips in which the DNA grows above layers containing the electronics. The overall project will leverage the efficiencies of current semiconductor technologies, said Brooke Beckert, a GTRI research engineer.

"We'll be working with commercial foundries, so when we get the processing right, it should be much easier to transition the technology over to them," she said. "Connecting to the existing technology infrastructure is a critical part of this project, but we'll have to custom-make most of the components in the first stage."

Among the challenges will be managing the tradeoffs between speed and error, said Guise. "The issue is how far down we can scale this without introducing too many errors," he said. "The basic synthesis is proven at a scale of hundreds of microns. We want to shrink that by a factor of 100, which leads us to worry about such issues as crosstalk between different DNA strands in adjacent locations on the chips."

Current technology uses modified inkjet printing to produce the DNA strands, but the SMASH project plans to grow the biopolymer more rapidly and in larger quantities using parallelized synthesis on the hybrid chips.

To achieve the major advances in reading cost and speed required, the program will rely on the molecular electronic DNA reader chips under development at Roswell. The data will be read from DNA strands using a molecular electronic sensor array chip, on which single molecules are drawn through nanoscale current meters that measure electrical signatures of each letter in the sequence. For biomedical applications, the sequencing industry has been focused on a goal of achieving a $1,000 human genome. The DNA reading goals of this program amount to delivering a $10 genome, and that will require a major technology disruption.

The researchers acknowledge the challenges ahead in bringing their devices to commercial scale.

"We don't see any killers ahead for this technology," said Adam Meier, a GTRI senior research scientist. "There is a lot of emerging technology and doing this commercially will require many orders of magnitude improvement. Magnetic tape for archival storage has been improving steadily for 60 years, and this investment from IARPA will power the advancements needed to make DNA storage competitive with that." 

NC State research overcomes key obstacles to scaling up DNA data storage

Researchers from North Carolina State University have developed new techniques for labeling and retrieving data files in DNA-based information storage systems, addressing two of the key obstacles to widespread adoption of DNA data storage technologies.

"DNA systems are attractive because of their potential information storage density; they could theoretically store a billion times the amount of data stored in a conventional electronic device of comparable size," says James Tuck, co-corresponding author of a paper on the work and an associate professor of electrical and computer engineering at NC State.

"But two of the big challenges here are, how do you identify the strands of DNA that contain the file you are looking for? And once you identify those strands, how do you remove them so that they can be read - and do so without destroying the strands?" {module In-article}

"Previous work had come up with a system that appends short, 20-monomer long sequences of DNA called primer-binding sequences to the ends of DNA strands that are storing information," says Albert Keung, co-corresponding author of the paper and an assistant professor of chemical and biomolecular engineering at NC State. "You could use a small DNA primer that matches the corresponding primer-binding sequence to identify the appropriate strands that comprise your desired file. However, there are only an estimated 30,000 of these binding sequences available, which is insufficient for practical use. We wanted to find a way to overcome this limitation."

To address these problems, the researchers developed two techniques that, taken together, they call DNA Enrichment and Nested Separation, or DENSe.

The researchers tackled the file identification challenge by using two nested primer-binding sequences. The system first identifies all of the strands containing the first primer-binding sequence. It then conducts a second "search" of that subset to single out the strands that contain the second primer-binding sequence.

"This increases the number of estimated file names from approximately 30,000 to approximately 900 million," Tuck says.

Once identified, the file still needs to be extracted. Existing techniques use polymerase chain reaction (PCR) to make lots (and lots) of copies of the relevant DNA strands, then sequence the entire sample. Because there are so many copies of the targeted DNA strands, their signal overwhelms the rest of the strands in the sample, making it possible to identify the targeted DNA sequence and read the file.

"That technique is not efficient, and it doesn't work if you are trying to retrieve data from a high-capacity database - there's just too much other DNA in the system," says Kyle Tomek, a Ph.D. student at NC State and co-lead author of the paper.

So the researchers took a different approach to data retrieval, attaching any of several small molecular tags to the primers used to identify targeted DNA strands. When a primer binds the targeted DNA, PCR makes a copy of the relevant strand, and that copy carries the molecular tag.

The researchers also utilized magnetic microbeads coated with molecules that bind specifically to a given tag. These functionalized microbeads "grab" the tags of targeted DNA strands. The microbeads can then be retrieved with a magnet, bringing the targeted DNA with them.
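The tag-and-bead steps can be pictured with a toy model: tagged copies are made only for strands matching the requested file, the beads pull out whatever carries the tag, and the untagged originals stay behind. Everything in this sketch is a hypothetical simplification, including the `Strand` class, the "file-A" and "tag-1" labels, and the function names; it models the bookkeeping, not the chemistry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strand:
    address: str               # stands in for the primer-binding sequence
    payload: str               # data-carrying nucleotides
    tag: Optional[str] = None  # molecular tag; original strands carry none


def copy_with_tag(pool, address, tag):
    """Model tagged primers: each matching strand gets one tagged copy."""
    return [Strand(s.address, s.payload, tag) for s in pool if s.address == address]


def bead_pulldown(mixture, tag):
    """Model the magnet step: split the mixture into captured and remaining."""
    captured = [s for s in mixture if s.tag == tag]
    remaining = [s for s in mixture if s.tag != tag]
    return captured, remaining


database = [
    Strand("file-A", "ACGT"),
    Strand("file-B", "TTGA"),
    Strand("file-A", "GGCA"),
]
mixture = database + copy_with_tag(database, "file-A", tag="tag-1")
captured, remaining = bead_pulldown(mixture, "tag-1")

print([s.payload for s in captured])  # ['ACGT', 'GGCA']
print(remaining == database)          # True
```

In this model the database is unchanged after retrieval, which mirrors the point that the original strands are preserved.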

"This system allows us to retrieve the DNA strands associated with a specific file without having to make many copies of each strand, while also preserving the original DNA strands in the database," Keung says.

"We've implemented the DENSe system experimentally using sample files, and have demonstrated that it can be used to store and retrieve text and image files," Keung adds.

"These techniques, when used in tandem, open the door to developing DNA-based data storage systems with modern capacities and file-access capabilities," Tomek says.

"Next steps include scaling this up and testing the DENSe approach with larger databases," Tuck says. "A big challenge there is cost."

The paper, "Driving the Scalability of DNA-Based Information Storage Systems," is published in the journal ACS Synthetic Biology. Co-lead author of the paper is Kevin Volkel, a Ph.D. student at NC State. The paper was co-authored by Alexander Simpson, a former graduate student at NC State; and Austin Hass and Elaine Indermaur, both undergraduates at NC State.