DNA storage in the genetic code
(appeared in May 2016)


Made-to-measure DNA may be the solution for long-term storage of mounting data loads, says S. Ananthanarayanan.

Data stored on magnetic tape or hard disks needs constant maintenance, so that the data is not corrupted and stays in a format that will still be readable in the future. An emerging alternative is to store data the way the genetic code is preserved in DNA, which can last for millions of years. It is reported that Twist Bioscience, a California-based start-up that specializes in building bits of DNA from scratch, has been engaged by the Microsoft Corporation to supply ten million DNA strands and help try out the new medium. Dr Emily M Leproust, CEO of Twist Bioscience, was in fact an author of two seminal papers: one in 2010, which reported building DNA strands, and the other in 2013, where the strands were used to store digital data.

Data storage has grown to become an area of great importance. Very expensive scientific programmes, like space exploration or CERN’s Large Hadron Collider, produce immense volumes of data. While the data gets processed over months, either in large, dedicated facilities or even through crowdsourcing, a good part of it could be needed again and has to be preserved. Another field that gives rise to huge volumes of data is the digitizing of civil documents, like land records, topographic data, and the layout of drainage or communication lines. As the original drawings and plans deteriorate and the volume of data grows, most of these have been converted to digital formats, as have many other records that need archiving.

But digital storage also deteriorates and, what is worse, the decoding technology, like the DVD format, also changes. Digital data hence needs to be renewed periodically, both to prevent deterioration and to bring the storage format up to date. Such conversion and verification can be very time-consuming and expensive, and with increasing loads of data, the investment in maintaining data can be comparable to the cost of acquiring it.

The microscopic DNA macromolecule, in contrast, is able to pack huge amounts of data into a very small volume, and the stability of the record is legendary. Apart from being the vehicle that faithfully transmits the mammoth genetic data of living things across generations, DNA is routinely found in fossils and organic remains from prehistoric times, in good enough condition for research. Finding a way to encode digital, computer-generated data on to the DNA structure would hence permit very compact and hardy storage.

Digital format

Computers consist of electronic devices which ultimately take only the states of ‘on’ or ‘off’, represented by the numbers ‘0’ and ‘1’. All data, be it text, images or sounds, hence has to be coded with the help of only these two numbers. This coding uses a form of counting that is based on only the numbers ‘0’ and ‘1’ and is called binary arithmetic, as opposed to the usual ‘decimal’ arithmetic, which is based on the number ten.

The decimal system has symbols for the numbers from ‘0’ to ‘9’, and when we reach the number ten, we write it as ‘10’, to say that it is one ‘ten’ and no units. In the same way, we write twenty as ‘20’, one more as ‘21’, and so on, till a hundred is ‘100’, to indicate ten tens and no more. In binary arithmetic, we do the same, with the number two taking the place of the number ten. Thus, we have the two symbols, ‘0’ and ‘1’, and when we count one more, we write ‘10’ to indicate ‘one times two and no more’. One more, the number three, becomes ‘11’, to mean ‘one times two and one more’. The number four is written as ‘100’, five as ‘101’, six as ‘110’, and so on.
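The counting rule described above can be tried out in a few lines of Python. This is only an illustrative sketch; the function name `to_binary` is made up for this example and does not come from any standard.

```python
def to_binary(n: int) -> str:
    """Write a non-negative whole number using only the symbols 0 and 1."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next digit, right to left
        n //= 2                    # move on to the next power of two
    return "".join(reversed(digits))


# Count from zero to six, as in the text above.
for number in range(7):
    print(number, to_binary(number))
```

Each pass through the loop peels off one binary digit, in the same way that dividing by ten peels off the units digit of a decimal number.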

When we wish to represent features of text, like the alphabet, digits, punctuation and other symbols, there is a convention known as the American Standard Code for Information Interchange (ASCII), where all text characters are represented by the 128 numbers from 0 to 127. Thus, the capital letters A, B, C … Z are coded as the numbers ‘65’ to ‘90’, the small letters a, b, c … z as the numbers ‘97’ to ‘122’, the full stop as ‘46’, the comma as ‘44’, etc. But computers cannot work with these decimal numbers directly, and the numbers are in turn converted to binary: ‘65’, for ‘A’, becomes ‘1000001’, ‘122’, for ‘z’, becomes ‘1111010’, etc.
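As a small illustration of this two-step coding, the Python sketch below looks up the ASCII number of a character and writes it as a seven-digit binary number. The helper name `char_to_binary` is invented for this example; `ord` and `format` are standard Python built-ins.

```python
def char_to_binary(ch: str) -> str:
    """Return the 7-digit binary form of an ASCII character."""
    code = ord(ch)              # the character's number in the ASCII table
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b")  # seven binary digits cover 0 to 127


for ch in ["A", "Z", "a", "z", ","]:
    print(ch, ord(ch), char_to_binary(ch))
```

Seven binary digits are enough because two multiplied by itself seven times gives 128, the number of ASCII characters.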

Even long texts, along with the additional information of font, size, colour, etc, get converted into binary representation as we type on the keyboard, and that is the way text is stored and processed in the computer. There are also standards like ‘jpg’ or ‘bmp’, which specify how the pixels that make up an image are represented by eight-digit binary numbers, and there are similar conventions for audio files. This article, for instance, consists of 9,523 characters (including spaces) and 3 image files, and was represented in MSWord format, along with data on font, margins, etc, by 141,000 eight-digit binary numbers.

DNA representation

DNA is a string of chemical units, each of which carries, and is distinguished by, one of four types of side markers. The units of the string are called nucleotides and they attach to each other like the carriages of a train. The four kinds of ‘side chains’ are denoted as C, G, A and T. These side chains also form bonds with the side chains of a parallel train of nucleotides, but with a rule: A pairs with T and C pairs with G. The parallel train thus forms with side chains that correspond to those of the first train. With bonds forming right through the length of the string, the DNA molecule has remarkable stability and resilience, till special enzymes cause it to separate, for reproduction, etc.
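The pairing rule can be written down as a simple lookup. The Python sketch below builds the matching ‘parallel train’ for a given string of letters, position by position. The names `PAIRS` and `partner_strand` are invented for this illustration, and the sketch ignores chemical details such as the direction in which each strand is read.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_strand(strand: str) -> str:
    """Return the letters that pair, position by position, with `strand`."""
    return "".join(PAIRS[letter] for letter in strand)


first = "GATTACA"
second = partner_strand(first)
print(first)
print(second)
# Applying the rule twice gives back the first strand, which is why
# either strand can serve as a backup copy of the other.
print(partner_strand(second) == first)
```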

It is the order in which C, G, A and T are attached that specifies the amino acids of the different proteins that the DNA of a species codes for. The scheme, in fact, is based on groups of three nucleotides. Each nucleotide can have any of four kinds of side chains, so a group of three can take 4 x 4 x 4 = 64 different combinations. With provision for redundancy to take care of errors, and for markers to indicate the start and the end of the code for a protein, these 64 possible combinations code for 20 amino acids, which are the constituents of all the millions of proteins.
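The count of 64 can be checked by listing every group of three letters. The short Python sketch below does this with the standard `itertools.product` function; the variable name `triplets` is just a label chosen for this example.

```python
from itertools import product

# Every ordered group of three letters drawn from C, G, A and T.
triplets = ["".join(group) for group in product("CGAT", repeat=3)]

print(len(triplets))   # 4 x 4 x 4 combinations
print(triplets[:8])    # the first few, for a look at the pattern
```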

The same idea can be extended to represent characters of text, digits, the distribution of pixels, etc, to form a coding that stores computer records. While the successive units on magnetic tape or on a hard disk can take only two forms, ‘on’ or ‘off’, or ‘up’ or ‘down’, to represent the numbers ‘0’ or ‘1’, the units in DNA can take four forms. In practice, it is found that strings of more than three units with the same side chain are not stable. The side chain ‘G’ is hence not used in the coding, but only as a filler to break any run of repeated letters, and we are left with only three forms of each unit for coding.

Creating the DNA

In principle, the digital, or binary, representation of all kinds of documents can be converted to trinary (counting based on the number three), and then coded on to a string of nucleotides in DNA. We would then have all our data preserved in billion-nucleotide-long DNA molecules, compact and secure, able to last centuries!
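A rough sketch of this idea in Python follows. It is not the scheme used in the published trial, only a toy version of the description given here, under stated assumptions: each eight-digit binary number (a byte) is written as six trinary digits, the digits 0, 1 and 2 are arbitrarily assigned to the letters A, C and T, and a ‘G’ is slipped in whenever a fourth repeat of the same letter would otherwise follow. All the names in the code are made up for the illustration.

```python
TRIT_TO_LETTER = "ACT"   # assumed assignment: 0 -> A, 1 -> C, 2 -> T
TRITS_PER_BYTE = 6       # 3**6 = 729, enough for the 256 values of a byte


def byte_to_trits(value: int) -> list[int]:
    """Write one byte (0 to 255) as six trinary digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        trits.append(value % 3)
        value //= 3
    return trits[::-1]


def encode(data: bytes, max_run: int = 3) -> str:
    """Turn bytes into letters A, C, T, using G to break long runs."""
    strand = []
    run_letter, run_length = "", 0
    for byte in data:
        for trit in byte_to_trits(byte):
            letter = TRIT_TO_LETTER[trit]
            if letter == run_letter and run_length == max_run:
                strand.append("G")       # filler: carries no data
                run_length = 0
            if letter == run_letter:
                run_length += 1
            else:
                run_letter, run_length = letter, 1
            strand.append(letter)
    return "".join(strand)


def decode(strand: str) -> bytes:
    """Drop the G fillers and rebuild the original bytes."""
    trits = [TRIT_TO_LETTER.index(x) for x in strand if x != "G"]
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


message = b"DNA"
coded = encode(message)
print(coded)
print(decode(coded) == message)
```

Using a fixed six letters per byte wastes some capacity, since six trinary digits could distinguish 729 values and only 256 are needed, but it keeps the decoding step simple.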

The trouble is that it is no simple task to create DNA in the way we like. Even in nature, present-day DNA has evolved from simpler forms, and it does not form unit by unit, but assembles when the strings of DNA separate for reproduction. Artificial synthesis is carried out by actually attaching successive nucleotides, one after another, using materials with nanometre-sized pores as a scaffold. The process, however, is limited by side reactions, and chains of more than 100 nucleotides were not possible. The firm Agilent Technologies Inc, where Dr Leproust was a researcher, has refined the process to create stretches of over 150 nucleotides. These find application in biomedical research and were used in the DNA data storage trial reported in 2013.

The trial used a large text sample of over 700 kilobytes (this article is about 141 KB) and transferred the data on to 153,335 DNA strings, each one 117 nucleotides long. Apart from a portion of the text, each string also carried information that identified where in the whole text the portion belonged, check digits to detect and possibly correct any errors, and a large overlap with neighbouring strings. DNA itself consists of a pair of complementary strings, which act as alternate copies. Along with the overlap, the coding thus provided ample redundancy and security. The 2013 paper reports error-free reproduction of the coded material when the mass of DNA strings was decoded using methods of genetic engineering.
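The role of the position information and the overlap can be seen in a toy Python sketch. The fragment length and spacing used below are small, made-up values chosen so that the output is easy to read; they are not the figures of the trial, and the function names are invented for this example. Check digits are left out.

```python
def split_with_overlap(sequence: str, length: int, step: int) -> list[tuple[int, str]]:
    """Cut `sequence` into numbered fragments of `length` letters,
    a new fragment starting every `step` letters (step <= length)."""
    fragments = []
    index, start = 0, 0
    while True:
        fragments.append((index, sequence[start:start + length]))
        if start + length >= len(sequence):
            break
        index += 1
        start += step
    return fragments


def reassemble(fragments: list[tuple[int, str]], step: int) -> str:
    """Rebuild the sequence from whichever numbered fragments survive."""
    letters = {}
    for index, fragment in fragments:
        for offset, letter in enumerate(fragment):
            letters.setdefault(index * step + offset, letter)
    if sorted(letters) != list(range(len(letters))):
        raise ValueError("some positions are not covered by any fragment")
    return "".join(letters[pos] for pos in sorted(letters))


original = "ACTACTTCAGATTACACATC"
pieces = split_with_overlap(original, length=8, step=2)

# Pretend two of the fragments were lost or damaged.
survivors = [piece for piece in pieces if piece[0] not in (1, 4)]
print(reassemble(survivors, step=2) == original)
```

With a fragment four times as long as the spacing, most letters appear in four different fragments, so the text can still be rebuilt after some fragments are lost.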

We can see that DNA coding is not as simple as writing to the disk by hitting the ‘save’ button, and retrieval is a task too. But the application is for data that needs to be saved for a long time, which would incur recurring costs if stored in the normal way. The costs of DNA coding and decoding should also come down. The result of the Microsoft Corporation trials may set the course.
