Stowers investigator develops explainable AI for decoding genome biology

Opening the black box to uncover the rules of the genome's regulatory code

Researchers at the Stowers Institute for Medical Research, in collaboration with colleagues at Stanford University and the Technical University of Munich, have developed, in a technical tour de force, an advanced explainable artificial intelligence (AI) to decipher regulatory instructions encoded in DNA. In a report published online in an academic journal today, the team found that a neural network trained on high-resolution maps of protein-DNA interactions can uncover subtle DNA sequence patterns throughout the genome and provide a deeper understanding of how these sequences are organized to regulate genes.

CAPTION Researchers used DNA sequences from high-resolution experiments to train a neural network called BPNet, whose "black box" inner workings were then uncovered to reveal sequence patterns and organizing principles of the genome's regulatory code.  CREDIT Illustration courtesy of Mark Miller, Stowers Institute for Medical Research.

Neural networks are powerful AI models that can learn complex patterns from diverse types of data such as images, speech signals, or text to predict associated properties with impressively high accuracy. However, many see these models as uninterpretable since the learned predictive patterns are hard to extract from the model. This black-box nature has hindered the wide application of neural networks to biology, where the interpretation of predictive patterns is paramount.

One of the big unsolved problems in biology is the genome's second code--its regulatory code. DNA bases (commonly represented by letters A, C, G, and T) encode not only the instructions for how to build proteins but also when and where to make these proteins in an organism. The regulatory code is read by proteins called transcription factors that bind to short stretches of DNA called motifs. However, how particular combinations and arrangements of motifs specify regulatory activity is an extremely complex problem that has been hard to pin down.

Now, an interdisciplinary team of biologists and computational researchers led by Stowers Investigator Julia Zeitlinger, Ph.D., and Anshul Kundaje, Ph.D., from Stanford University, has designed a neural network--named BPNet for Base Pair Network--that can be interpreted to reveal the regulatory code by predicting transcription factor binding from DNA sequences with unprecedented accuracy. The key was to perform transcription factor-DNA binding experiments and computational modeling at the highest possible resolution, down to the level of individual DNA bases. This increased resolution allowed them to develop new interpretation tools to extract the key elemental sequence patterns, such as transcription factor binding motifs, and the combinatorial rules by which motifs function together as a regulatory code.

"This was extremely satisfying," says Zeitlinger, "as the results fit beautifully with existing experimental results, and also revealed novel insights that surprised us."

For example, the neural network models enabled the researchers to discover a striking rule that governs the binding of the well-studied transcription factor called Nanog. They found that Nanog binds cooperatively to DNA when multiples of its motif are present in a periodic fashion such that they appear on the same side of the spiraling DNA helix.

"There has been a long trail of experimental evidence that such motif periodicity sometimes exists in the regulatory code," Zeitlinger says. "However, the exact circumstances were elusive, and Nanog had not been a suspect. Discovering that Nanog has such a pattern, and seeing additional details of its interactions, was surprising because we did not specifically search for this pattern."

"This is the key advantage of using neural networks for this task," says Ziga Avsec, Ph.D., first author of the paper. Avsec and Kundaje created the first version of the model when Avsec visited Stanford during his doctoral studies in the lab of Julien Gagneur, Ph.D., at the Technical University in Munich, Germany.

"More traditional bioinformatics approaches model data using pre-defined rigid rules that are based on existing knowledge. However, biology is extremely rich and complicated," says Avsec. "By using neural networks, we can train much more flexible and nuanced models that learn complex patterns from scratch without previous knowledge, thereby allowing novel discoveries."

BPNet's network architecture is similar to that of neural networks used for facial recognition in images. For instance, the neural network first detects edges in the pixels, then learns how edges form facial elements like the eye, nose, or mouth, and finally detects how facial elements together form a face. Instead of learning from pixels, BPNet learns from the raw DNA sequence and learns to detect sequence motifs and eventually the higher-order rules by which the elements predict the base-resolution binding data.
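To make the idea of a network layer detecting a motif in raw DNA more concrete, here is a minimal sketch of how a single convolution-style filter can score every position of a one-hot-encoded sequence. It is not BPNet's code: the motif `TGCA`, the filter weights, and the function names `one_hot` and `scan` are made up for illustration, and a real network would learn many such filters from data rather than have them written by hand.

```python
# Illustrative sketch only: NOT BPNet's code. It shows how one
# convolution-style filter can score a motif along one-hot-encoded DNA.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan(seq, weights):
    """Slide a filter (one row of 4 weights per motif position) along seq.

    Returns one score per window start, like a 1D convolution without padding.
    """
    encoded = one_hot(seq)
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            row = encoded[start + offset]
            score += sum(w * x for w, x in zip(weights[offset], row))
        scores.append(score)
    return scores


# Hypothetical hand-written filter: +1 for the matching base, -1 otherwise.
MOTIF = "TGCA"  # made-up motif, chosen only for illustration
FILTER = [[1.0 if b == m else -1.0 for b in BASES] for m in MOTIF]

sequence = "AATGCATT"
scores = scan(sequence, FILTER)
# Index of the window that the filter scores highest.
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy example the highest-scoring window is the one that starts where the motif sits in the sequence. Deeper layers of a network of this kind would combine many such position-wise scores, which is roughly how higher-order arrangements of motifs can be picked up.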

Once the model is trained to be highly accurate, the learned patterns are extracted with interpretation tools. The output signal is traced back to the input sequences to reveal sequence motifs. The final step is to use the model as an oracle and systematically query it with specific DNA sequence designs, similar to what one would do to test hypotheses experimentally, to reveal the rules by which sequence motifs function in a combinatorial manner.
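The "oracle" step can be sketched as a loop that builds synthetic sequence designs and asks a model for a prediction on each one. The sketch below is an assumption-laden illustration, not the researchers' software: `toy_predict` is a made-up stand-in for a trained model, the motif `TGCA` is hypothetical, and the only outside fact used is that B-form DNA completes one helical turn roughly every 10.5 base pairs. The scores carry no biological meaning.

```python
# Illustrative sketch only. `toy_predict` is a made-up stand-in for a trained
# model; it is NOT BPNet and its scores carry no biological meaning.
import math

MOTIF = "TGCA"          # hypothetical motif, chosen only for illustration
HELICAL_TURN_BP = 10.5  # approximate base pairs per turn of B-form DNA


def toy_predict(seq):
    """Stand-in oracle: rewards motif pairs spaced near whole helical turns."""
    positions = [i for i in range(len(seq) - len(MOTIF) + 1)
                 if seq.startswith(MOTIF, i)]
    score = float(len(positions))
    for i, a in enumerate(positions):
        for b in positions[i + 1:]:
            score += 0.5 * math.cos(2 * math.pi * (b - a) / HELICAL_TURN_BP)
    return score


def design(background, first, spacing):
    """Place two motif copies in a background, `spacing` bp apart (start to start)."""
    seq = list(background)
    for start in (first, first + spacing):
        seq[start:start + len(MOTIF)] = MOTIF
    return "".join(seq)


def spacing_scan(predict, background, first, spacings):
    """Query the oracle once per design and collect (spacing, score) pairs."""
    return [(s, predict(design(background, first, s))) for s in spacings]


# A motif-free background keeps the two inserted copies the only matches.
results = spacing_scan(toy_predict, "A" * 60, first=5, spacings=range(5, 40))
best_spacing = max(results, key=lambda pair: pair[1])[0]
```

With a real trained model in place of `toy_predict`, the same loop would let one compare many motif arrangements computationally before choosing which ones to test at the bench.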

"The beauty is that the model can predict way more sequence designs that we could test experimentally," Zeitlinger says. "Furthermore, by predicting the outcome of experimental perturbations, we can identify the experiments that are most informative to validate the model." Indeed, with the help of CRISPR gene editing techniques, the researchers confirmed experimentally that the model's predictions were highly accurate.

Since the approach is flexible and applicable to a variety of different data types and cell types, it promises to lead to a rapidly growing understanding of the regulatory code and how genetic variation impacts gene regulation. Both the Zeitlinger Lab and the Kundaje Lab are already using BPNet to reliably identify binding motifs for other cell types, relate motifs to biophysical parameters, and learn other structural features in the genome such as those associated with DNA packaging. To enable other scientists to use BPNet and adapt it for their own needs, the researchers have made the entire software framework available with documentation and tutorials.

MIT's Boyden shows the symphony of cellular signals driving biology

A new imaging technology lets scientists record the flurry of messages passed within cells as they do . . . potentially everything.

Until now, most scientists could visualize only one or two of these intracellular signals at a time, says Howard Hughes Medical Institute Investigator Ed Boyden of the Massachusetts Institute of Technology. His team's new approach could make it possible to see as many signals as you want - in real time, at once, Boyden says - giving researchers a more detailed view of cells' internal discussions than ever before.

In tests with neurons, the researchers examined five signals involved in processes such as learning and memory, Boyden and his colleagues report on November 23, 2020, in the journal Cell. "You could apply this technology to all sorts of biological mysteries," he says. "Every cell works due to all the signals inside it." Because signaling contributes to all biological processes, a better means to study it could illuminate a host of diseases, from Alzheimer's to diabetes and cancer.

CAPTION To visualize cellular signals within a neuron, researchers scattered reporters in clusters (green) across the cell. They then identified the signal each cluster represented (multiple colors).  CREDIT C. Linghu, S. Johnson et al./Cell 2020

The team's new approach is a breakthrough, says Clifford Woolf, a neurobiologist at Harvard Medical School who was not involved with the work. He plans to use it to examine how pain-sensing neurons become more sensitive in injury or illness. With the new imaging technology, he says, "we can take apart what's happening in cells in a way that just has not been possible before."

Give a computer or human brain information, and it will crackle with electrical impulses as it prepares a response. Within cells, these impulses result in spurts of multiple molecular signals. Boyden describes this process as a group conversation. "Signals within a cell are like a set of people trying to decide what to do for the evening: they take into account many possibilities, and then decide what to collectively do," he says.

These cellular discussions are what prompt, for example, a neuron to encode a memory or a cell to turn cancerous. Despite their importance, scientists still don't have a strong grasp of how these signals work together to guide a cell's behavior.

To see cell signaling in action, scientists typically introduce genes encoding sensors connected to fluorescent proteins. These molecular reporters sense a signal and then glow a specific color under the microscope. Researchers can use a different color reporter for each signal to tell the signals apart. But finding sets of reporters with colors that a microscope can differentiate is challenging. And a typical cellular conversation can involve dozens of signals - or more.

Changyang Linghu and Shannon Johnson, scientists in Boyden's lab, got around this limitation by affixing reporters to small, self-assembling proteins that act like LEGO bricks. These small proteins "clicked together," forming clusters that were randomly scattered across the cell, like little islands. Each cluster, which appears under the microscope as a luminescent dot, reports only one type of cellular signal. "It's like having some islands with thermometers to report temperature and other islands with barometers measuring pressure," Johnson says.

In experiments with neurons, the team created clusters that each glowed upon detection of one of five different signals, including calcium ions and other important signaling molecules. After imaging the live cells, the researchers attached molecular labels to the glowing dots to identify the reporters located there. Using supercomputer analyses, the team turned the dots magenta, yellow, and other colors, depending on whether they represented calcium or another signal. This let them see which signals were switching on and off across a cell's interior.

By monitoring so many signals at once, the team was able to figure out how the signals related to one another. "Teasing apart such relationships could help scientists understand complex processes, like learning," Linghu says.

He likens a cell to an orchestra and its signals to a symphony. "It's difficult to fully appreciate a symphony by listening to just a single instrument," he says. Because the new technique lets scientists observe multiple signals at the same time, "we can understand the symphony of cellular activities."

Boyden's team estimates it may be possible to detect as many as 16 signals with their technology, but improvements in genetic engineering techniques could raise that number significantly. "Potentially, you could look at dozens, hundreds, or even more signals," he says. "The next challenge," Boyden says, "is getting sensors for all of those signals into a cell."

Hokkaido researchers show how high pressure is key for better optical fibers

Optical fiber data transmission can be significantly improved by producing the fibers, made of silica glass, under high pressure, researchers from Japan and the US report in the journal npj Computational Materials.

Using supercomputer simulations, researchers at Hokkaido University, Pennsylvania State University and their industry collaborators theoretically show that signal loss from silica glass fibers can be reduced by more than 50 percent, which could dramatically extend the distance data can be transmitted without the need for amplification.
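A rough piece of arithmetic shows why halving signal loss matters for distance. The sketch below assumes that fiber loss is expressed as attenuation in decibels per kilometer, that scattering dominates that attenuation, and that a link can tolerate a fixed loss budget before amplification. The baseline of 0.2 dB/km and the 20 dB budget are illustrative assumptions, not figures from the study, and `max_span_km` is a hypothetical helper name.

```python
# Illustrative arithmetic only; the numbers below are assumptions,
# not results reported by the researchers.

def max_span_km(loss_budget_db, attenuation_db_per_km):
    """Distance a signal can travel before its loss reaches the budget."""
    return loss_budget_db / attenuation_db_per_km


baseline_db_per_km = 0.2  # assumed attenuation of a conventional silica fiber
loss_budget_db = 20.0     # assumed tolerable loss before amplification
reduction = 0.5           # the 50 percent loss reduction described in the article

span_before = max_span_km(loss_budget_db, baseline_db_per_km)
span_after = max_span_km(loss_budget_db, baseline_db_per_km * (1 - reduction))

print(f"Unamplified span before: {span_before:.0f} km")
print(f"Unamplified span after:  {span_after:.0f} km")
```

Under these assumptions, cutting the attenuation in half doubles the unamplified span, whatever baseline and budget are chosen. Real links also lose signal to absorption, splices and connectors, so the actual gain would depend on how much of the total loss comes from scattering.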

"Improvements in silica glass, the most important material for optical communication, have stalled in recent years due to lack of understanding of the material on the atomic level," says Associate Professor Madoka Ono of Hokkaido University's Research Institute of Electronic Science (RIES). "Our findings can now help guide future physical experiments and production processes, though it will be technically challenging."

Optical fibers have revolutionized high-bandwidth, long-distance communication all over the world. The cables carrying all that information are mainly made of fine threads of silica glass, slightly thicker than a human hair. The material is strong, flexible, and very good at transmitting information, in the form of light, at a low cost. But the data signal peters out before reaching its final destination due to light being scattered. Amplifiers and other tools are used to contain and relay the information before it scatters, ensuring it is delivered successfully. Scientists are seeking to reduce light scatter, called Rayleigh scattering, to help accelerate data transmission and move closer towards quantum communication.

CAPTION The voids in silica glass (yellow), which are responsible for scattering of light and degradation of signals, become much smaller when the glass is quenched at higher pressures.  CREDIT Yongjian Yang, et al., npj Computational Materials, September 17, 2020

Ono and her collaborators used multiple computational methods to predict what happens to the atomic structure of silica glass under high temperature and high pressure. They found that large voids form between silica atoms when the glass is heated and then cooled, a process called quenching, under low pressure. But when this process occurs under 4 gigapascals (GPa), most of the large voids disappear and the glass takes on a much more uniform lattice structure.

Specifically, the models show that the glass undergoes a physical transformation in which smaller rings of atoms are eliminated, or "pruned," allowing larger rings to join more closely together. This reduces both the number of large voids and the average void size. Because the voids cause light scattering, shrinking them decreases signal loss by more than 50 percent.

The researchers suspect even greater improvements can be achieved using a slower cooling rate at higher pressure. The process could also be explored for other types of inorganic glass with similar structures. However, actually making glass fibers under such high pressures at an industrial scale is very difficult.

"Now that we know the ideal pressure, we hope this research will help spur the development of high-pressure manufacturing devices that can produce this ultra-transparent silica glass," Ono says.