Johns Hopkins builds cloud-based platform that opens genomics data to all

Harnessing the power of genomics to find risk factors for major diseases or to search for relatives relies on the ability to analyze huge numbers of genomes, a costly and time-consuming task. A team co-led by a Johns Hopkins University computer scientist has leveled the playing field by creating a cloud-based platform that grants genomics researchers easy access to one of the world’s largest genomics databases.

Known as AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space), the new platform gives any researcher with an Internet connection access to thousands of analysis tools, patient records, and more than 300,000 genomes. The work, a project of the National Human Genome Research Institute (NHGRI), appears today in Cell Genomics.

“AnVIL is inverting the model of genomics data sharing, offering unprecedented new opportunities for science by connecting researchers and datasets in new ways and promising to enable exciting new discoveries,” said project co-leader Michael Schatz, Bloomberg Distinguished Professor of Computer Science and Biology at Johns Hopkins.

Typically, genomic analysis starts with researchers downloading massive amounts of data from centralized warehouses to their own data centers, a process that is time-consuming, inefficient, and expensive, and that also makes collaborating with researchers at other institutions difficult.

“AnVIL will be transformative for institutions of all sizes, especially smaller institutions that don’t have the resources to build their own data centers. It is our hope that AnVIL levels the playing field so that everyone has equal access to make discoveries,” Schatz said.

Genetic risk factors for ailments such as cancer or cardiovascular disease are often very subtle, requiring researchers to analyze thousands of patients’ genomes to discover new associations. The raw data for a single human genome comprises about 40 GB, so downloading thousands of genomes can take several days to several weeks. A single genome requires about 10 DVDs’ worth of data, so transferring thousands means moving “tens of thousands of DVDs worth of data,” Schatz said.
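The arithmetic behind these figures can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the 40 GB genome size comes from the article, while the 4.7 GB single-layer DVD capacity, the cohort of 5,000 genomes, and the sustained 1 Gbit/s link are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

GENOME_SIZE_GB = 40      # from the article: raw data for one human genome
DVD_CAPACITY_GB = 4.7    # assumption: standard single-layer DVD


def dvds_needed(n_genomes: int) -> int:
    """Number of DVDs required to hold the raw data for n_genomes."""
    return math.ceil(n_genomes * GENOME_SIZE_GB / DVD_CAPACITY_GB)


def transfer_days(n_genomes: int, link_gbit_per_s: float) -> float:
    """Days needed to move n_genomes over a link with the given sustained rate."""
    total_gbit = n_genomes * GENOME_SIZE_GB * 8  # gigabytes -> gigabits
    seconds = total_gbit / link_gbit_per_s
    return seconds / 86_400                      # seconds per day


if __name__ == "__main__":
    # Hypothetical study: 5,000 genomes over a sustained 1 Gbit/s connection.
    print(dvds_needed(1))                     # 9, i.e. "about 10 DVDs"
    print(dvds_needed(5_000))                 # 42554
    print(round(transfer_days(5_000, 1.0), 1))  # 18.5
```

Under these assumptions, a 5,000-genome study corresponds to roughly 42,000 DVDs and about two and a half weeks of continuous transfer, in line with the "several days to several weeks" range quoted above.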

In addition, many studies require integrating data collected at multiple institutions, which means each institution must download its own copy while ensuring that patient-data security is maintained. This challenge is expected to become even greater in the future, as researchers embark on ever-larger studies requiring the analysis of hundreds of thousands to millions of genomes at once.

“Connecting to AnVIL remotely eliminates the need for these massive downloads and saves on the overhead,” Schatz said. “Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud. It also makes sharing datasets much easier so that data can be connected in new ways to find new associations, and it simplifies a lot of computing issues, like providing strong encryption and privacy for patient datasets.”

AnVIL also provides researchers with several major analysis tools, including Galaxy, developed in part at Johns Hopkins, along with other popular tools such as R/Bioconductor, Jupyter notebooks, WDLs, Gen3, and Dockstore to support both interactive analysis and large-scale batch computing. Collectively, these tools allow researchers to tackle even the largest studies without having to build out their own computing environments.

Researchers from all over the world currently use the platform to study a variety of genetic diseases, including autism spectrum disorders, cardiovascular disease, and epilepsy. Schatz’s team, part of the Telomere-to-Telomere Consortium, used it to reanalyze thousands of human genomes with the new reference genome to discover more than 1 million new variants.

Already, the AnVIL team has collected petabytes of data from several of the largest NHGRI projects, including hundreds of thousands of genomes from the Genotype-Tissue Expression (GTEx), Centers for Mendelian Genetics (CMG), and Centers for Common Disease Genomics (CCDG) projects, with plans to host many more projects soon.

Leiden astronomers calculate genesis of Oort cloud in chronological order

A team of Leiden astronomers has managed to calculate the first 100 million years of the history of the Oort cloud in its entirety. Until now, only parts of the history had been studied separately. The cloud, with roughly 100 billion comet-like objects, forms an enormous shell at the edge of our solar system. The astronomers will soon publish their comprehensive simulation and its consequences in the journal Astronomy & Astrophysics.

The Oort cloud was proposed in 1950 by the Dutch astronomer Jan Hendrik Oort to explain why new comets with elongated orbits continue to appear in our solar system. The cloud, which starts at more than 3,000 times the distance between the Earth and the Sun, should not be confused with the Kuiper belt, the ring of rock, grains, and ice in which the dwarf planet Pluto is located and which orbits relatively close to the Sun, at about 30 to 50 times the Earth-Sun distance.

Separate events connected

Exactly how the Oort cloud formed has remained a mystery until now. This is because its formation involves a series of events that a computer can hardly reproduce in its entirety. Some processes lasted only a few years and took place at relatively short distances, comparable to the distance between the Earth and the Sun. Other processes lasted billions of years and took place over light-years, comparable to distances between stars. Astronomer and simulation expert Simon Portegies Zwart (Leiden University in the Netherlands) explains: ‘If you want to calculate the whole sequence in a computer, you will irrevocably run aground. That's why, until now, only separate events were simulated.’

The Leiden researchers started from separate events, as in previous studies, but what is new is that they were able to connect the events with each other. For example, they used the end result of the first calculation as the starting point for the next calculation. In this way, they were able to map out the entire genesis of the Oort cloud.
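The idea of feeding the end result of one calculation into the next can be illustrated with a toy sketch in Python. This is not the Leiden team's code or method: the one-dimensional `State`, the `make_stage` and `run_chain` helpers, the leapfrog integrator, and all numbers are assumptions invented for illustration. The point is only that each stage can use its own time step and physics, as long as its output state becomes the next stage's input.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    """Toy one-dimensional state handed from one stage to the next."""
    time_yr: float
    position_au: float
    velocity_au_per_yr: float


Stage = Callable[[State], State]


def make_stage(duration_yr: float, dt_yr: float,
               accel: Callable[[float], float]) -> Stage:
    """Build a stage that integrates for duration_yr using time step dt_yr."""
    def run(state: State) -> State:
        t, x, v = state.time_yr, state.position_au, state.velocity_au_per_yr
        for _ in range(round(duration_yr / dt_yr)):
            # Kick-drift-kick leapfrog step.
            v += 0.5 * dt_yr * accel(x)
            x += dt_yr * v
            v += 0.5 * dt_yr * accel(x)
            t += dt_yr
        return State(t, x, v)
    return run


def run_chain(stages: List[Stage], initial: State) -> List[State]:
    """Run stages in order; each stage starts from the previous end state."""
    history = [initial]
    for stage in stages:
        history.append(stage(history[-1]))
    return history


if __name__ == "__main__":
    # Hypothetical two-stage chain: a short, finely resolved phase with a
    # constant push, followed by a long, coarsely resolved free drift.
    stages = [
        make_stage(duration_yr=10.0, dt_yr=0.01, accel=lambda x: 0.1),
        make_stage(duration_yr=1000.0, dt_yr=1.0, accel=lambda x: 0.0),
    ]
    for state in run_chain(stages, State(0.0, 1.0, 0.5)):
        print(state)
```

Splitting the run this way lets the short phase use a step of 0.01 years while the long phase uses a step of 1 year, instead of forcing the finest step on the whole calculation.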

Comets from inside and outside the solar system

The Oort cloud, the Leiden simulations confirm, is a remnant of the protoplanetary disk of gas and debris from which the solar system emerged some 4.6 billion years ago. The comet-like objects in the Oort cloud come from roughly two places in the Universe. The first population comes from close by, from the solar system itself: debris and asteroids that were thrown out by the giant planets. Some of the debris was not ejected, however, and is still in the asteroid belt between Mars and Jupiter. A second population of objects, the Leiden astronomers concluded, comes from other stars. When the Sun was just born, there were about a thousand other stars in the vicinity. The Oort cloud may have captured comets that originally belonged to those other stars.

In addition, the Leiden astronomers were able to put a number of earlier ideas to the test. They argue, for example, that the Oort cloud formed relatively late, that is, after the Sun had been ejected from the group of stars in which it was born. With their simulations, the astronomers also reject the hypothesis, put forward in 2005, that the Oort cloud was a consequence of the migration of the giant planets in the solar system. That hypothesis was meant to explain the excess of old craters on the Moon.

Complex but not unique

‘With our new calculations, we show that the Oort cloud arose from a kind of cosmic conspiracy,’ says Portegies Zwart, ‘in which nearby stars, planets, and the Milky Way all play their part. Each of the individual processes alone would not be able to explain the Oort cloud. You really need the interplay and the right choreography of all the processes together. And that, by the way, can be explained quite naturally from the Sun's birth environment. So although the Oort cloud is complicatedly formed, it is probably not unique.’

During the calculations, the researchers regularly wondered how such a complicated process could actually emerge. Portegies Zwart: ‘Despair often got the better of us. Only when the calculations were completed, did all the pieces of the puzzle suddenly fall into place and it all looked quite natural and self-evident. That is, I think, one of the most beautiful aspects of being a scientist. You suddenly realize how distorted our thinking concerning this problem was until it actually turned out to be rather natural.’

Columbia Engineering team builds first hacker-resistant cloud software system

As the first system to guarantee the security of virtual machines in the cloud, SeKVM could transform how cloud services are designed, developed, deployed, and trusted

Whenever you buy something on Amazon, your customer data is automatically updated and stored on thousands of virtual machines in the cloud. For businesses like Amazon, ensuring the safety and security of the data of its millions of customers is essential. This is true for large and small organizations alike. But up to now, there has been no way to guarantee that a software system is secure from bugs, hackers, and vulnerabilities.

Columbia Engineering researchers may have solved this security issue. They have developed SeKVM, the first system that guarantees--through mathematical proof--the security of virtual machines in the cloud. In a new paper to be presented on May 26, 2021, at the 42nd IEEE Symposium on Security & Privacy, the researchers hope to lay the foundation for future innovations in system software verification, leading to a new generation of cyber-resilient system software.

SeKVM is the first formally verified system for cloud computing. Formal verification is a critical step as it is the process of proving that software is mathematically correct, that the program's code works as it should, and there are no hidden security bugs to worry about.

Image: Microverification of cloud hypervisors. Credit: Jason Nieh and Ronghui Gu/Columbia Engineering

"This is the first time that a real-world multiprocessor software system is mathematically correct and secure," said Jason Nieh, professor of computer science and co-director of the Software Systems Laboratory. "This means that users' data are correctly managed by software running in the cloud and are safe from security bugs and hackers."

The construction of correct and secure system software has been one of the grand challenges of computing. Nieh has worked on different aspects of software systems since joining Columbia Engineering in 1999. When Ronghui Gu, the Tang Family Assistant Professor of Computer Science and an expert in formal verification, joined the computer science department in 2018, he and Nieh decided to collaborate on exploring formal verification of software systems.

Their research has garnered major interest: both researchers won an Amazon Research Award, multiple grants from the National Science Foundation, and a multi-million dollar Defense Advanced Research Projects Agency (DARPA) contract to further develop the SeKVM project. In addition, Nieh was awarded a Guggenheim Fellowship for this work.

Over the past dozen years, there has been a good deal of attention paid to formal verification, including work on verifying multiprocessor operating systems. "But all of that research has been conducted on small toy-like systems that nobody uses in real life," said Gu. "Verifying a multiprocessor commodity system, a system in wide use like Linux, has been thought to be more or less impossible."

The exponential growth of cloud computing has enabled companies and users to move their data and computation off-site into virtual machines running on hosts in the cloud. Cloud computing providers, like Amazon, deploy hypervisors to support these virtual machines.

A hypervisor is the key piece of software that makes cloud computing possible. The security of the virtual machine's data hinges on the correctness and trustworthiness of the hypervisor. Despite their importance, hypervisors are complicated -- they can include an entire Linux operating system. Just a single weak link in the code -- one that is virtually impossible to detect via traditional testing -- can make a system vulnerable to hackers. Even if a hypervisor is written 99% correctly, a hacker can still sneak into that particular 1% set-up and take control of the system.

Nieh and Gu's work is the first to verify a commodity system, specifically the widely used KVM hypervisor, which cloud providers such as Amazon use to run virtual machines. They proved that SeKVM, which is KVM with some small changes, is secure and guarantees that virtual machines are isolated from one another.

"We've shown that our system can protect and secure private data and computing uploaded to the cloud with mathematical guarantees," said Xupeng Li, Gu's Ph.D. student and co-lead author of the paper. "This has never been done before."

SeKVM was verified using MicroV, a new framework for verifying the security properties of large systems. It is based on the hypothesis that small changes to a system can make it significantly easier to verify, a new technique the researchers call microverification. This novel layering technique retrofits an existing system and extracts the components that enforce security into a small core that is verified and guarantees the entire system's security.

The changes needed to retrofit a large system are quite modest--the researchers demonstrated that if the small core of the larger system is intact, then the system is secure and no private data will be leaked. This is how they were able to verify a large system such as KVM, which was previously thought to be impossible.

"Think of a house--a crack in the drywall doesn't mean that the integrity of the house is at risk," Nieh explained. "It's still structurally sound and the key structural system is good."

Shih-Wei Li, Nieh's Ph.D. student and co-lead author of the study, added, "SeKVM will serve as a safeguard in various domains, from banking systems and Internet of Things devices to autonomous vehicles and cryptocurrencies."

As the first verified commodity hypervisor, SeKVM could change how cloud services are designed, developed, deployed, and trusted. In a world where cybersecurity is a growing concern, this resiliency is in high demand. Major cloud companies are already exploring how they can leverage SeKVM to meet this demand.