Combining machine learning with DNA storage approaches

DNA-based storage technologies have attracted considerable interest in recent years as a form of non-volatile memory. Nature is often described as the best scientist in the world: it creates structures and performs operations that are difficult to reproduce synthetically, so there is a lot to learn from natural systems. In its natural environment, DNA encodes and stores vast amounts of information, and as the need for data storage grows in a data-driven society, scientists and engineers are investigating bioelectronic options beyond the fully synthetic storage devices in use today.

It’s safe to say that DNA-based storage is nowhere near the level of the synthetic storage technologies used today, but a lot of research is under way to advance the field and improve device performance. One of the latest developments moves away from the traditional architecture of many DNA storage devices toward a system that uses machine learning algorithms to encode, decode, process and store images and image-based data.

Traditional DNA storage systems

DNA-based storage devices are increasingly seen as a viable alternative to the conventional magnetic, optical, and flash memory devices used in electronics today. To date, many DNA-based storage architectures store user information in synthesized DNA strands (oligos) and retrieve the data via high-throughput sequencing or nanopore sequencing.

Even though the design of DNA storage devices has advanced considerably, most devices encode information only in the nucleotide sequence of the molecule, which leads to issues that could make them less commercially practical. The key issues to date include the high cost of synthetic DNA, the lack of a simple rewrite mechanism, large read-write latencies, and errors caused by missing oligos.
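To make "encoding information in the nucleotide sequence" concrete, here is a minimal sketch that maps two bits to each base. The mapping table and the function names are illustrative assumptions, not the scheme used by any particular device; practical encoders also add constraints (for example on repeated bases) and error-correcting redundancy.

```python
# Hypothetical 2-bits-per-base mapping, chosen only for illustration.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}


def encode_bytes(data: bytes) -> str:
    """Turn raw bytes into a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bases(strand: str) -> bytes:
    """Invert encode_bytes: four bases become one byte."""
    bits = "".join(BITS_FOR_BASE[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode_bytes(b"Hi")
print(strand)                # CAGACGGC
print(decode_bases(strand))  # b'Hi'
```

Because every base carries payload bits in a scheme like this, changing the stored data means synthesizing new strands, which is one reason rewriting is costly in sequence-only designs.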

Image data usually needs to be compressed before it is stored, so a single mismatch can cause a very large error during decompression and ultimately an unrecognizable reproduction of the original data. Other issues include sequencing errors whose magnitude varies from platform to platform, as well as PCR reactions and data write-back operations that cause sequencing errors to accumulate gradually.
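The fragility of compressed data can be illustrated with a general-purpose compressor. The sketch below uses Python's standard zlib module as a stand-in (an assumption made for illustration; it is not the image codec used in DNA storage work) and flips a single byte in the compressed stream.

```python
import zlib

# Stand-in for image data: a repetitive byte pattern that compresses well.
original = bytes(range(256)) * 8

compressed = bytearray(zlib.compress(original))
# Corrupt exactly one byte in the middle of the compressed stream.
compressed[len(compressed) // 2] ^= 0xFF


def try_decompress(blob):
    """Return the decompressed bytes, or None if the stream is unusable."""
    try:
        return zlib.decompress(bytes(blob))
    except zlib.error:
        return None


recovered = try_decompress(compressed)
damaged = recovered != original
print("data lost or altered:", damaged)
```

One corrupted byte is enough to make the whole payload unrecoverable here, whereas the same error in uncompressed pixel data would affect only one pixel.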

In many data storage systems, accurate reconstruction can be guaranteed by considering the worst-case scenario: extensive read-write experiments are performed to determine device error rates, and enough redundancy is then added to cover those errors. However, the nature of many DNA-based storage devices makes it difficult to estimate the error rate, which results in large errors on some devices. To overcome these challenges, a shift in architecture has been proposed to a hybrid model in which information is stored in both the sequence and the backbone structure of DNA.

Creating hybrid devices that use machine learning

Researchers have now created a hybrid two-dimensional DNA-based storage device that stores information in both the sequence and the backbone structure of DNA, allowing it to perform joint data encoding, decoding and processing operations. The device, named “2DDNA”, was developed primarily to address rewrite-related issues and to avoid the worst-case error-correction approaches that are typically required to compensate for missing and erroneous oligos in the device.

2DDNA uses two different information dimensions to combine the desirable characteristics of synthetic and nick-based recorders. Images are stored in the sequence of synthetic DNA, while the metadata of the sequence-encoded images (ownership information, dates, clinical status descriptions) is superimposed on them and stored as nicks, which are breaks in one strand of the DNA backbone.

Sequences hold a large amount of information, and that volume has made rewrite operations inefficient in many DNA-based devices. The information stored in nicks, by contrast, is generally smaller in volume and is much better suited to efficient, permanent, and privacy-preserving erase and rewrite operations. Additionally, the information in the sequence and in the backbone can be read simultaneously.
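The two-dimensional idea can be sketched as a small data structure: the sequence holds the payload and is never modified, while metadata is represented by which of a set of candidate backbone positions carry a nick. The class name, the candidate positions, and the "nick present means bit 1" convention are all assumptions made for illustration; they are not the encoding described in the study.

```python
from dataclasses import dataclass, field


@dataclass
class HybridRecord:
    """Toy model of a strand with sequence content plus nick-encoded metadata."""
    sequence: str                 # payload, written once
    candidate_sites: tuple        # hypothetical backbone positions that may be nicked
    nicked: set = field(default_factory=set)

    def write_metadata(self, bits: str) -> None:
        """Nick the sites whose bit is '1' (assumed convention)."""
        if len(bits) != len(self.candidate_sites):
            raise ValueError("need exactly one bit per candidate site")
        self.nicked = {site for site, bit in zip(self.candidate_sites, bits) if bit == "1"}

    def read_metadata(self) -> str:
        return "".join("1" if site in self.nicked else "0" for site in self.candidate_sites)

    def erase_metadata(self) -> None:
        """Model sealing every nick; the sequence is left untouched."""
        self.nicked.clear()


record = HybridRecord(sequence="ACGT" * 10, candidate_sites=(5, 12, 20, 31))
record.write_metadata("1010")
first_read = record.read_metadata()
record.erase_metadata()
after_erase = record.read_metadata()
record.write_metadata("0110")
print(first_read, after_erase, record.read_metadata())  # 1010 0000 0110
```

In this model, erasing and rewriting the metadata touches only the small set of nick positions, which mirrors why the nick dimension is the cheaper one to update.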

The other challenge was to avoid relying on worst-case redundancy to correct errors introduced by sequencing and/or rewrite operations. To mitigate potential mismatch errors (in decoding parameters), the researchers used machine learning algorithms to detect whether the stored images suffered from fading or painting effects, which would suggest that there are problems with the data.

The approach relies on an image compression scheme that operates on the three color channels separately. Machine learning and computer vision methods then reconstruct the images and improve their quality to generate high-quality replicas of the originals. The device was tested experimentally by reconstructing a library of images, in which the reconstructed images showed undetectable or very small visual impairments. Rewriting was demonstrated by erasing and rewriting the copyright metadata encoded in the nicks.
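As a rough illustration of handling the three color channels separately, the sketch below splits RGB pixels into three planes and compresses each one on its own. Lossless zlib is used purely as a stand-in, and the function names are hypothetical; the compression scheme in the study is a different, purpose-built one.

```python
import zlib


def split_channels(pixels):
    """Separate a list of (r, g, b) tuples into three byte planes."""
    reds, greens, blues = zip(*pixels)
    return bytes(reds), bytes(greens), bytes(blues)


def compress_channels(pixels):
    """Compress each color plane independently (zlib as a stand-in codec)."""
    return [zlib.compress(plane) for plane in split_channels(pixels)]


def merge_channels(blobs):
    """Decompress the three planes and interleave them back into pixels."""
    planes = [zlib.decompress(blob) for blob in blobs]
    return list(zip(*planes))


pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (10, 20, 30)]
blobs = compress_channels(pixels)
restored = merge_channels(blobs)
print(restored == pixels)  # True
```

One consequence of keeping the channels apart is that damage to one compressed plane leaves the other two decodable, so an error tends to appear as a color shift rather than a total loss of the image.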

The study showed that DNA can be used as both a write-once memory and a rewritable memory, and that data can be erased permanently and in a privacy-preserving manner. The approach addresses some key challenges of DNA-based storage devices and removes the need for worst-case redundancy.

Although it will still take some time before DNA-based storage devices catch up with conventional synthetic devices, this hybrid, AI-based approach offers the possibility of retrieving high-quality images at high information density. The device provides a way to efficiently rewrite data using the metadata held in the DNA backbone and should, in the future, be suitable for applications that use either synthetic or natural DNA strands as the sequence content for data storage.


Milenkovic O. et al., Rewritable two-dimensional DNA-based data storage with machine learning reconstruction, Nature Communications, 13, (2022), 2984

Sherry J. Basler