Comparing the genomes of different species -- or different members of the same species -- is the basis of a great deal of modern biology. DNA sequences that are conserved across species are likely to be functionally important, while variations between members of the same species can indicate different susceptibilities to disease.

The basic algorithm for determining how much two sequences of symbols have in common -- the "edit distance" between them -- is now more than 40 years old. And for more than 40 years, computer science researchers have been trying to improve upon it, without much success.

At the ACM Symposium on Theory of Computing (STOC) next week, MIT researchers will report that, in all likelihood, that's because the algorithm is as good as it gets. If a widely held assumption about computational complexity is correct, then the problem of measuring the difference between two genomes -- or texts, or speech samples, or anything else that can be represented as a string of symbols -- can't be solved more efficiently.

In a sense, that's disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes. But it also means that computer scientists can stop agonizing about whether they can do better.

"This edit distance is something that I've been trying to get better algorithms for since I was a graduate student, in the mid-'90s," says Piotr Indyk, a professor of computer science and engineering at MIT and a co-author of the STOC paper. "I certainly spent lots of late nights on that -- without any progress whatsoever. So at least now there's a feeling of closure. The problem can be put to sleep."

Moreover, Indyk says, even though the paper hasn't officially been presented yet, it's already spawned two follow-up papers, which apply its approach to related problems. "There is a technical aspect of this paper, a certain gadget construction, that turns out to be very useful for other purposes as well," Indyk says.

Squaring off

Edit distance is the minimum number of edits -- deletions, insertions, and substitutions -- required to turn one string into another. The standard algorithm for determining edit distance, known as the Wagner-Fischer algorithm, assigns each symbol of one string to a column in a giant grid and each symbol of the other string to a row. Then, starting in the upper left-hand corner and flooding diagonally across the grid, it fills in each square with the number of edits required to turn the string ending with the corresponding column into the string ending with the corresponding row.
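The grid-filling procedure is compact enough to sketch directly. Here is a minimal Python rendering of the Wagner-Fischer dynamic program (the example strings are arbitrary illustrations):

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer: dist[i][j] is the number of edits turning a[:i] into b[:j]."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    # Turning a prefix into the empty string takes one deletion per symbol
    # (first column), and one insertion per symbol the other way (first row).
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1   # substitution is free on a match
            dist[i][j] = min(dist[i - 1][j] + 1,     # deletion
                             dist[i][j - 1] + 1,     # insertion
                             dist[i - 1][j - 1] + sub)
    return dist[m][n]

print(edit_distance("GATTACA", "GCATGCU"))   # -> 4
```

Because the two nested loops touch every one of the (m+1) x (n+1) squares, the running time scales with the product of the string lengths -- the quadratic behavior discussed below.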

Computer scientists measure algorithmic efficiency as computation time relative to the number of elements the algorithm manipulates. Since the Wagner-Fischer algorithm has to fill in every square of its grid, its running time is proportional to the product of the lengths of the two strings it's considering. Double the lengths of the strings, and the running time quadruples. In computer parlance, the algorithm runs in quadratic time.

That may not sound terribly efficient, but quadratic time is much better than exponential time, which means that running time is proportional to 2^N, where N is the number of elements the algorithm manipulates. If on some machine a quadratic-time algorithm took, say, a hundredth of a second to process 100 elements, an exponential-time algorithm would take about 100 quintillion years.

Theoretical computer science is particularly concerned with a class of problems known as NP-complete. Most researchers believe that NP-complete problems take exponential time to solve, but no one's been able to prove it. In their STOC paper, Indyk and his student Arturs Bačkurs demonstrate that if it's possible to solve the edit-distance problem in less-than-quadratic time, then it's possible to solve an NP-complete problem in less-than-exponential time. Most researchers in the computational-complexity community will take that as strong evidence that no subquadratic solution to the edit-distance problem exists.

Can't get no satisfaction

The core NP-complete problem is known as the "satisfiability problem": Given a host of logical constraints, is it possible to satisfy them all? For instance, say you're throwing a dinner party, and you're trying to decide whom to invite. You may face a number of constraints: Either Alice or Bob will have to stay home with the kids, so they can't both come; if you invite Cindy and Dave, you'll have to invite the rest of the book club, or they'll know they were excluded; Ellen will bring either her husband, Fred, or her lover, George, but not both; and so on. Is there an invitation list that meets all those constraints?
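For concreteness, here is a brute-force check of a simplified encoding of those dinner-party constraints (the encoding below is invented for illustration; the book-club condition, in particular, is collapsed to a single clause). Trying every guest list is exactly the exponential-time search that makes satisfiability hard: n guests mean 2^n lists to test.

```python
from itertools import product

guests = ["Alice", "Bob", "Cindy", "Dave", "Ellen", "Fred", "George"]

def satisfies(invited):
    # Alice or Bob must stay home with the kids: they can't both come.
    if invited["Alice"] and invited["Bob"]:
        return False
    # Stand-in for the book-club constraint: don't invite both Cindy and Dave.
    if invited["Cindy"] and invited["Dave"]:
        return False
    # If Ellen comes, she brings exactly one of Fred and George.
    if invited["Ellen"] and invited["Fred"] == invited["George"]:
        return False
    return True

# Exhaustive search over all 2^7 = 128 possible invitation lists.
solutions = [dict(zip(guests, bits))
             for bits in product([False, True], repeat=len(guests))
             if satisfies(dict(zip(guests, bits)))]
print(len(solutions), "of 128 invitation lists satisfy every constraint")
```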

In Indyk and Bačkurs' proof, they propose that, faced with a satisfiability problem, you split the variables into two groups of roughly equal size: Alice, Bob, and Cindy go into one, while Walt, Yvonne, and Zack go into the other. Then, for each group, you solve for all the pertinent constraints. This could be a massively complex calculation, but not nearly as complex as solving for the group as a whole. If, for instance, Alice has a restraining order out on Zack, it doesn't matter, because they fall in separate subgroups: it's a constraint that doesn't have to be met.
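The paper's actual proof runs through a careful gadget construction, but the split-and-combine idea can be sketched abstractly. In the toy Python sketch below (not the paper's construction), each half of the variables is enumerated separately, costing 2^(n/2) per half rather than 2^n overall; the expensive remaining step is combining the two lists of half-assignments, and that combination step is what the reduction re-expresses as an edit-distance computation.

```python
from itertools import product

# Toy CNF formula over four variables: positive i means x_i, negative means
# NOT x_i. The formula and the split are invented for illustration.
clauses = [[1, -2], [2, 3], [-3, 4], [-1, -4]]
left_vars, right_vars = [1, 2], [3, 4]

def assignments(vars_):
    for bits in product([False, True], repeat=len(vars_)):
        yield dict(zip(vars_, bits))

def clause_ok(assign, clause):
    return any(assign[abs(v)] == (v > 0) for v in clause)

# Enumerating each half costs 2^(n/2) per side instead of 2^n overall.
lefts = list(assignments(left_vars))
rights = list(assignments(right_vars))

# The expensive step is combining the halves: checking which left/right
# pairs jointly satisfy the clauses that straddle the split. A subquadratic
# procedure over these two lists would beat 2^n exhaustive search -- which
# is why a subquadratic edit-distance algorithm would have such dramatic
# consequences.
for left, right in product(lefts, rights):
    merged = {**left, **right}
    if all(clause_ok(merged, c) for c in clauses):
        print("satisfying assignment:", merged)
        break
```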

At this point, the problem of reconciling the solutions for the two subgroups -- factoring in constraints like Alice's restraining order -- becomes a version of the edit-distance problem. And if it were possible to solve the edit-distance problem in subquadratic time, it would be possible to solve the satisfiability problem in subexponential time.

Saurabh Sinha, Professor of Computer Science at the University of Illinois and member of the Gene Networks in Neural and Developmental Plasticity research theme at the Carl R. Woese Institute for Genomic Biology (photo courtesy of L. Brian Stauffer)

Computer scientists and molecular biologists use biological knowledge and bioinformatic algorithms to predict dynamics of gene regulation

Inside every cell that makes up a diminutive fruit fly is a vast, dynamic network of information -- the genome, whose 15,000 genes allow that cell to function. In a study recently published as a breakthrough article in Nucleic Acids Research (DOI: 10.1093/nar/gkv195), computer scientists and molecular biologists demonstrated the utility of a novel approach to deciphering how networks of genes are regulated.

University of Illinois computer scientist Saurabh Sinha and colleagues, including Scot Wolfe and Michael Brodsky at the University of Massachusetts Medical School, were led to the current study by their fascination with how interactions between DNA and a special category of cellular proteins, called 'transcription factors,' help control when and where genes are expressed. University of Illinois computer science graduate student Charles Blatti played a major role in the study, which was funded by the NIH, NSF, and a Cohen Graduate Fellowship awarded to Blatti.

Transcription factors grab onto regions of DNA near sequences that code for genes. Their participation in the molecular complexes responsible for gene expression can either promote or repress expression of individual genes. Considered as a group, these proteins are a little bit like an operating system of the cell -- they help orchestrate the genetic 'programs' that cells need to run.

Each type of transcription factor recognizes a unique DNA sequence, called a motif, that acts like a molecular password or barcode. By scanning through genome sequences to look for these motifs and the genes that are nearby, researchers like Sinha and his colleagues are working to gain an understanding of which transcription factors control which genes, and therefore which biological functions.
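At its simplest, such a scan is plain string matching. The sketch below counts exact occurrences of a made-up motif in a made-up stretch of sequence; real pipelines score positions against a position weight matrix rather than demanding an exact match.

```python
import re

# Toy sequence and motif, both invented for illustration.
genome = "TTGACGTCAAGGCCAATTGACGTCATTTGACGTCA"
motif = "TGACGTCA"

# A lookahead pattern catches overlapping occurrences too.
hits = [m.start() for m in re.finditer(f"(?={motif})", genome)]
print("motif occurrences at positions:", hits)   # -> [1, 17, 27]
```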

Looking at motifs alone, however, gives limited and potentially misleading information about where transcription factors are actually present and active. Because motifs are short, they may occur in many places in the genome simply by chance. Moreover, many biological factors, including the tissue type and developmental stage of a cell, change the way DNA is packaged in different regions of the genome, which helps determine whether a given transcription factor can physically access and bind to each of its many motifs.
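A back-of-the-envelope calculation shows why chance hits are unavoidable. Assuming, crudely, a genome of independent, uniformly random bases, a motif of length k is expected to appear about L/4^k times in a sequence of length L:

```python
# Expected chance occurrences of a length-k motif in a random genome.
L = 120_000_000   # on the order of a fruit-fly genome, for illustration
k = 8             # a typical short motif length
expected = L * (0.25 ** k)
print(f"~{expected:.0f} chance occurrences")   # ~1831
```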

'Motifs alone are of limited use in that the motif profile of the genome is static,' said Sinha, who is also a faculty member in the Gene Networks in Neural and Developmental Plasticity research theme at the Carl R. Woese Institute for Genomic Biology. 'How does the transcription factor's binding profile change from one cell to another, from one time point to another?...It's the accessibility that is changing.'

Researchers have relied on an expensive and labor-intensive method, chromatin immunoprecipitation with DNA sequencing (ChIP-seq), to determine where in the genome a transcription factor is binding. This method must be repeated for each transcription factor a researcher investigates. Sinha and his collaborators realized that they might be able to leverage their growing knowledge of how transcription factors interact with motifs to develop a shortcut.

They did this by first using just one laboratory experiment to determine how DNA was packaged throughout the genome under a given set of biological conditions. Then, they used this information as a filter, guessing that each transcription factor would bind only to those of its motifs that lay in accessible regions of DNA under those conditions. They found that with this combination of laboratory and computational work, they could predict the DNA-binding behavior of transcription factors observed in previous experiments.
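In outline, that filtering step amounts to intersecting motif hits with the accessible intervals measured in the lab. The sketch below uses invented coordinates purely to illustrate the idea:

```python
# Keep only motif hits that fall inside accessible regions of the genome.
accessible = [(0, 500), (1_200, 2_000)]        # (start, end) intervals, illustrative
motif_hits = [120, 480, 800, 1_450, 1_900]     # motif start positions, illustrative

def is_accessible(pos, intervals):
    return any(start <= pos < end for start, end in intervals)

predicted_binding = [p for p in motif_hits if is_accessible(p, accessible)]
print(predicted_binding)   # -> [120, 480, 1450, 1900]
```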

'In order to reconstruct a regulatory network in a new system, you don't necessarily have to do a whole lot of assays in the right cell types,' Sinha said. 'If you instead do an accessibility assay in those cell types and then overlay the motif information on top of it...these two together approximate the same information very well.'

In addition to making this type of work easier and more affordable for laboratories with limited resources, the publication demonstrated the value of the new approach for exploring the function of enhancers, regions of DNA where several transcription factors bind close together. Access to high-quality information about the activity of many different transcription factors makes it easier to infer relationships between enhancers, transcription factors, and biological functions.

Researchers from Uppsala University, together with colleagues at University College Dublin, have studied the dynamics of active swarms using computer simulations and experiments on unicellular algae. The team not only found that the motion of an active swarm in an external field is fully analogous to magnetic hysteresis, but also managed to quantify the controllability of the swarm and to identify signatures of collective behavior among the active agents.

Active motion of living organisms and artificial self-propelled particles has been an area of intense research at the interface of biology, chemistry, and physics. In nature, some active species move independently of each other, while others form spectacular flocks and swarms.

One of the observations that boosted the development of the theory of flocking is that active species may behave much like magnetic spins: they achieve coherent motion by aligning their direction with that of their nearest neighbours. It remained unknown, however, whether a swarm can also be controlled by an external field, the way a permanent magnet (such as a compass needle) can.

In a recent study published in the Journal of the Royal Society Interface, researchers from the Department of Mathematics at Uppsala University (Dr. Maksym Romensky), together with colleagues from the School of Physics at University College Dublin (Dr. V. Lobaskin and Dr. D. Scholz), studied the dynamics of active swarms in alternating fields.

The team measured hysteresis of the velocity of an active swarm in computer simulations and in experiments on the unicellular alga Euglena gracilis. The phototactic Euglena cells were driven by variable illumination and tracked using microscopy.
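One minimal way to reproduce this kind of experiment in silico is a Vicsek-style alignment model driven by an oscillating external field. The sketch below is generic and illustrative; its update rule and parameters are not those of the published study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Vicsek-style swarm in a periodic box, driven by a square-wave external
# field along the x axis. All parameters are illustrative.
N, box, radius, speed, noise = 200, 10.0, 1.0, 0.3, 0.3
field, period, steps = 0.4, 400, 1200

pos = rng.uniform(0.0, box, size=(N, 2))
theta = rng.uniform(-np.pi, np.pi, size=N)

for t in range(steps):
    h = field * np.sign(np.sin(2 * np.pi * t / period))  # field flips sign
    # Pairwise displacements with periodic boundary conditions.
    d = pos[:, None, :] - pos[None, :, :]
    d -= box * np.round(d / box)
    neigh = ((d ** 2).sum(axis=2) < radius ** 2).astype(float)
    # New heading: average of neighbours' headings, biased by the field,
    # plus angular noise.
    sx = neigh @ np.cos(theta) + h
    sy = neigh @ np.sin(theta)
    theta = np.arctan2(sy, sx) + noise * rng.uniform(-np.pi, np.pi, size=N)
    pos = (pos + speed * np.column_stack((np.cos(theta), np.sin(theta)))) % box
    if t % 100 == 0:
        print(f"t={t:4d}  field={h:+.2f}  mean vx={np.cos(theta).mean():+.3f}")

# At low driving frequency the swarm's velocity tracks (and lags) the field,
# tracing a hysteresis loop; at high frequency it cannot re-align in time
# and the mean velocity stays near zero.
```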

The collective response was only possible when the field varied slowly enough that the active species had time to align their motion; the group's motion was then more persistent than that of an individual.

By contrast, at high field frequencies the active group fails to develop alignment and tends to behave like a set of independent individuals, even in the presence of interactions. The authors suggest that the observed laws might describe a variety of dynamic phenomena, from the motion of synthetic active particles to animal flocks and herds, human crowds, and opinion dynamics, and could help to formulate general laws of collective behaviour.

Evolutionary theorist Stephen Jay Gould is famous for describing the evolution of humans and other conscious beings as a chance accident of history. If we could go back millions of years and "run the tape of life again," he mused, evolution would follow a different path.

A study by University of Pennsylvania biologists now provides evidence Gould was correct, at the molecular level: Evolution is both unpredictable and irreversible. Using simulations of an evolving protein, they show that the genetic mutations that are accepted by evolution are typically dependent on mutations that came before, and the mutations that are accepted become increasingly difficult to reverse as time goes on.

The research team consisted of postdoctoral researchers and co-lead authors Premal Shah and David M. McCandlish and professor Joshua B. Plotkin, all from Penn's Department of Biology in the School of Arts & Sciences. They reported their findings in this week's Early Edition of the Proceedings of the National Academy of Sciences.

The study focuses exclusively on the type of evolution known as purifying selection, which favors mutations that have no or only a small effect in a fixed environment. This is in contrast to adaptation, in which mutations are selected if they increase an organism's fitness in a new environment. Purifying selection is by far the more common type of selection.

"It's the simplest, most boring type of evolution you can imagine," Plotkin said. "Purifying selection is just asking an organism to do what it's doing and keep doing it well."

As an evolutionary model, the Penn team used the bacterial protein argT, for which the three-dimensional structure is known. Its small size means that the researchers could reliably predict how a given genetic mutation would affect the protein's stability.

Using a computational model, they simulated the protein evolving over the equivalent of roughly 10 million years by randomly introducing mutations, accepting them if they did not significantly affect the protein's stability and rejecting them if they did. They then examined pairs of mutations, asking whether the later mutation would have been accepted had the earlier mutation not been made.
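In outline, the simulation is an accept/reject loop over random mutations. The Python sketch below is an abstract stand-in: it replaces the team's structure-based stability calculations with invented per-site effects and pairwise interaction terms (epistasis), and it ends with the reversion test described below.

```python
import random

random.seed(1)

# Invented stand-in for structure-based stability: each mutated site has a
# baseline effect plus pairwise interaction terms with other mutated sites.
SITES = 60
base = {i: random.gauss(-0.2, 0.5) for i in range(SITES)}
inter = {(i, j): random.gauss(0.0, 0.3)
         for i in range(SITES) for j in range(i + 1, SITES)}

def stability(mutated):
    s = sum(base[i] for i in mutated)
    s += sum(inter[i, j] for i in mutated for j in mutated if i < j)
    return s

NEUTRAL = 0.3   # purifying selection: accept only nearly neutral changes
mutated, order = set(), []
for step in range(5000):
    site = random.randrange(SITES)
    trial = mutated ^ {site}             # mutate the site, or revert it
    if abs(stability(trial) - stability(mutated)) < NEUTRAL:
        mutated = trial
        if site not in order:
            order.append(site)

# Entrenchment: how costly is it now to revert the earliest accepted
# mutation that is still present?
if mutated:
    first = min(mutated, key=order.index)
    drop = stability(mutated - {first}) - stability(mutated)
    print(f"{len(mutated)} sites mutated; reverting the earliest one "
          f"changes stability by {drop:+.2f} (accept window is ±{NEUTRAL})")
```

Every accepted step was nearly neutral by construction, yet after enough substitutions the printed reversion cost can fall well outside the acceptance window -- the entrenchment effect described below.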

"The very same mutations that were accepted by evolution when they were proposed, had they been proposed at a much earlier in time, they would have been deleterious and would have been rejected," Plotkin said.

This result -- that later mutations were dependent on the earlier ones -- demonstrates a feature known as contingency. In other words, mutations that are accepted by evolution are contingent upon previous mutations that ameliorate their effects.

The researchers then asked a distinct, converse question: whether it is possible to revert an earlier mutation and still maintain the protein's stability. They found that the answer was no. Mutations became "entrenched" and increasingly difficult to revert as time went on without having a destabilizing effect on the protein.

"At each point in time, if you make a substitution, you wouldn't see a large change in stabilization," Shah said. "But, after a certain number of changes to the protein, if you go back and try to revert the earlier change, the protein structure begins to collapse."

The concepts of contingency and entrenchment were well known to be present in adaptive evolution, but it came as a surprise to the researchers to find them under purifying selection.

"We thought we would just try this with purifying selection and see what happened and were surprised to see how much contingency and entrenchment occurs," Plotkin said. "What this tells us is that, in a deep sense, evolution is unpredictable and in some sense irreversible because of interactions between mutations."

Such interactions, in which the effect of one mutation depends on the presence of another, are known as epistasis. The researchers' investigation found that, unexpectedly, purifying selection enriches for epistatic mutations as opposed to mutations that are simply additive. Plotkin explained that this is because purifying selection favors mutations that have a small effect. Either a mutation can have a small effect on its own, or it can have a small effect because another, earlier mutation ameliorated its impact. Thus mutations that are dependent upon earlier mutations will be favored.
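A two-mutation toy example makes that arithmetic concrete (all numbers invented for illustration):

```python
# Mutation B is strongly destabilizing on its own, but nearly neutral on a
# background that already carries mutation A.
effect = {"A": -0.1, "B": -1.2}
interaction = {("A", "B"): +1.1}   # A ameliorates B's effect

def total_effect(muts):
    s = sum(effect[m] for m in muts)
    s += sum(v for pair, v in interaction.items() if set(pair) <= muts)
    return s

print(total_effect({"B"}))        # -1.2: rejected by purifying selection
print(total_effect({"A", "B"}))   # -0.2: nearly neutral, so B is accepted
```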

"Our study shows, and this has been known for a long time, that most of the substitutions that occur are substitutions that have small effects," McCandlish said. "But what's interesting is that we find that the substitutions that have small effects change over time."

An implication of these findings is that predicting the course of evolution, as one might wish to do, say, to make an educated guess as to what flu strain might arise in a given year, is not easy.

"The way these substitutions occur, since they're highly contingent on what happened before, makes predictions of long-term evolution extremely difficult," Plotkin said.

The researchers hope to partner with other groups in the future to conduct laboratory experiments with microbes to confirm that real-world evolution supports their findings.

And while Gould's comment about replaying the tape of life was mainly a nod to the large amount of randomness inherent in evolution's path, this study suggests a more nuanced reason that the playback would appear different.

"There is intrinsically a huge amount of contingency in evolution," Plotkin said. "Whatever mutations happen to come first set the stage for what other later mutations are permissible. Indeed, history channels evolution down a certain path. Gould's famous tape of life would be very different if replayed, even more different than Gould might have imagined."

Students of Pharmaceutical Sciences and Biomedical Chemistry attending a lecture on computer-aided drug design at Johannes Gutenberg University Mainz (photo/©: Peter Pulkowski, JGU)

Students of Pharmaceutical Sciences at Johannes Gutenberg University Mainz (JGU) can now opt to take a more research-oriented approach to their subject by exploring the latest techniques used to create new active drug substances. In lectures and practical courses, they learn about a selection of computer-based methods that the pharmaceutical industry already routinely uses for computer-aided drug design. Computer-aided drug design (CADD) is one of the mainstays of modern drug development. JGU's Institute of Pharmaceutical Sciences and Biochemistry has therefore recently acquired a 3-D projector that visually demonstrates to students how drugs interact with their target structures in the body, i.e., proteins.

"Our objective is to introduce all students of Pharmaceutical Sciences and Biomedical Chemistry to the basics of computer-aided drug design because this method is now in standard use in the pharmaceutical industry. It is also an integral part of our research here at the institute," explained Junior Professor Ruth Brenk, who initiated the project. The Gutenberg Teaching Council (GTC) of Mainz University has provided nearly EUR 60,000 to support this research-oriented teaching project. The fact that the students are queuing up to take part in the elective module on CADD demonstrates that this was a worthwhile investment in modern learning and teaching.

The new 3-D projection setup, used in seminars and lectures, was put together by the institute itself from a projector, software, and a graphics card. Equipped with 3-D glasses, students can follow on a special metalized silver screen how and where proteins, with their three-dimensional structure, provide an access point for drugs, and how protein and ligand interact. "Through demonstrations and practical exercises and, of course, with the relevant teaching support, students can become sufficiently familiar with the new techniques in a relatively short period of time to then apply them independently to current research topics," Brenk said, summarizing the advantages. CADD lectures and practical courses have since become standard parts of the course of study at the Institute of Pharmaceutical Sciences and Biochemistry at Mainz University.
