DePaul University computer scientist earns NSF CAREER grant to study reproducibility

Assistant Professor Tanu Malik's container method makes it easier to reproduce and compare scientific experiments

Reproducibility is the cornerstone of science. In order for scientists to make advancements, they must be able to validate and build on each other's work. Now that so much science relies on computations and data, many researchers are struggling to share their computational artifacts in ways that are usable for others, said Tanu Malik, an assistant professor in DePaul University's College of Computing and Digital Media.

"We have results that are generated through computational artifacts but are being presented on PDF papers. As a researcher, there are no easy means for verifying the results being presented," said Malik. "Emailing and sharing through websites are old methods. We need more efficient and usable methods to verify results from complex scientific experiments."

Caption: Tanu Malik, assistant professor in DePaul University's College of Computing and Digital Media, was awarded a Faculty Early Career Development (CAREER) grant, the National Science Foundation's most prestigious award in support of early-career faculty. Credit: DePaul University/Jeff Carrion

Now, the National Science Foundation has awarded Malik a Faculty Early Career Development (CAREER) grant to support her work to lay the foundation for establishing reproducibility of real-world computational and data science. Malik's project will also increase awareness of the need for computational reproducibility tools through a research and education plan involving scientists, students, and instructors. The five-year, $498,889 research grant is NSF's most prestigious award in support of early-career faculty.

Hitting on an idea

Malik knew she was onto something in 2013 as a research associate scientist at the University of Chicago while working with a group of geoscientists. Spread across seven universities, they were trying to collect and run their computations together, but it wasn't working. Malik and her colleagues created a product, called the Sciunit container (http://sciunit.run), that could align not just the data but also the programs and environments where the information had been created. The geoscientists had been trying to share data and computation for several years.

Malik's system gave them results in 30 minutes.

"They were able to run this tool, and it gathered everything from different machines and made it portable. It became a huge thing," Malik said. She had discovered that it wasn't enough just to share program code and data; researchers also need what's called the "compute environment" to ensure that the data is processed in the same way and yields essentially the same outputs. Malik likened it to downloading a new program on your personal computer, only to find that it won't run. "That's the kind of situation we're trying to avoid."

The solution, said Malik, is to make it all portable -- the data, the program, the operating system -- so that others can move ahead and reproduce research faster. At that time, NSF recognized the importance of the work with a $1.3 million grant, and Malik moved her research to DePaul in 2016.

"DePaul gave me the bandwidth to actually go deeper into this problem and really think from a computational aspect. I am looking at how containers should be designed to make them really robust for different kinds of computations," said Malik, who co-directs the Data Systems and Optimization Lab in DePaul's School of Computing.

Reproducibility as a spectrum

Malik's work will also make it easier for researchers to judge whether their own attempts at an experiment are reproducible or not. Her research aims to define the phases of reproducibility in computational research.

"You may want to do verification with different data sets, with different input parameters. So how do we make that verification fast? The underlying technology that we use in all of this is what is known as data provenance. It's capturing the provenance of the entire compute, or the history of how exactly it happened. And this time, this is what you have changed," Malik explained.

The term data provenance is derived from the art world, said Malik, and it refers to how data was created.
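
As a rough illustration of provenance capture (a minimal sketch, not the Sciunit implementation; every function name here is hypothetical), a run can be described by the command that was executed, content hashes of its input and output files, and a short summary of the environment. Comparing two such records then shows which inputs changed between runs.

```python
import hashlib
import platform
import sys


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def provenance_record(command, input_paths, output_paths):
    """Describe one run: what was executed, on which files, and where.

    Hypothetical record layout, chosen only for illustration.
    """
    return {
        "command": list(command),
        "inputs": {p: file_digest(p) for p in input_paths},
        "outputs": {p: file_digest(p) for p in output_paths},
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


def changed_inputs(old_record, new_record):
    """List input files whose content differs between two runs."""
    old, new = old_record["inputs"], new_record["inputs"]
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))
```

A real container tool records far more than this, including the programs and system libraries a run depends on. The sketch only shows why hashing inputs makes "what changed this time" a cheap question to answer.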

"Data always interests me," said Malik. "And the provenance of data seemed like a cool thing to study. You always look at your files -- and I think, 'how did I generate this file?' These are questions that come very naturally when I'm working, and I felt that provenance is important and wanted to explore it more."

Recognition and work ahead

The CAREER grant is awarded to scientists who have the potential to serve as academic role models in research and education, and who can lead advances in the mission of their department or organization.

At the heart of this exploration is Malik's work with students at DePaul. This spring, she created an advanced graduate course in the School of Computing about containers and reproducibility, and she said students were enjoying the work. The CAREER grant will allow Malik to engage more students with her work, especially in DePaul's data science program. She hopes to engage more women in the work, as the representation of women in computer science is still lagging, said Malik.

"The number of women who get funded in this area is abysmally low -- so I think it's a big deal," said Malik. "I just feel honored to have that opportunity. If I could share somehow that would be fantastic."

Malik added that coming to DePaul has helped give her the time and space to do the work she "always wanted to do."

"I have been doing this work for some time now, and the fact that this work is being recognized, that we did make an impact in a few lives by making it simpler, it feels good," said Malik. "NSF has recognized my work and is helping us to expand this further to make a greater impact. That's the ultimate fun, to make a dent in this hard problem."

Halperin lab creates computational tool to predict how gut microbiome changes over time

New insights into gut microbiome dynamics could lead to better diagnosis, treatment of disease

A new computational modeling method uses snapshots of which types of microbes are found in a person's gut to predict how the microbial community will change over time. The tool, developed by Liat Shenhav, Leah Briscoe and Mike Thompson from the Halperin lab, University of California Los Angeles, and colleagues at the Mizrahi lab at Ben-Gurion University, Israel, is presented in PLOS Computational Biology.

The types and relative amounts of microbes found in a person's gut can reflect and affect the state of their health. Knowing how this microbial community composition changes over time could provide key insights into health and disease. However, it is unclear to what degree the microbial community composition of a person's gut at a given moment determines its future composition.

Caption: New insights into gut microbiome dynamics could lead to better diagnosis, treatment of disease. Credit: sbtlneet/pixabay

To address this question, Shenhav and colleagues developed Microbial community Temporal Variability Linear Mixed Model (MTV-LMM), a new method for modeling temporal changes in the microbial composition of the gut. When tested against real-world data, the new tool makes more accurate predictions than do other models previously developed for the same purpose.

The researchers then used MTV-LMM to surface new insights into microbiome dynamics. For instance, they demonstrated that, in both infants and adults, gut microbiome community composition can indeed be accurately predicted based on earlier observations of the community. They also applied the model to data from 39 infants and revealed a key shift around the age of 9 months in how the gut microbiome changes over time.
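
To make "predicted based on earlier observations" concrete, here is a deliberately simple sketch. It is not MTV-LMM, which is a linear mixed model; it is a plain first-order autoregressive fit for a single taxon, with hypothetical function names and made-up numbers, shown only to illustrate predicting the next time point from the previous one.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + slope * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    if n < 2:
        raise ValueError("need at least three time points")
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    if var_prev == 0:
        # Flat history: the best linear guess is the mean itself.
        return mean_curr, 0.0
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    slope = cov / var_prev
    intercept = mean_curr - slope * mean_prev
    return intercept, slope


def predict_next(series):
    """Predict the next value from the last observed one."""
    intercept, slope = fit_ar1(series)
    return intercept + slope * series[-1]


# Made-up relative abundances of one taxon at four time points.
abundances = [0.6, 0.4, 0.3, 0.25]
print(predict_next(abundances))
```

How well such a fit predicts held-out time points is one informal way to ask how much a community's present composition determines its future.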

Looking forward, MTV-LMM could be applied to explore the temporal dynamics of the gut microbiome in the context of disease, which could lead to improved diagnosis and treatment. It could also be useful for understanding other types of temporal microbiome processes, such as those occurring during digestion.

"Our approach provides multiple methodological advancements, but this is still just the tip of the iceberg," Shenhav says. In the future, she and her colleagues will work to further improve the prediction accuracy of the model and explore additional applications. "Modeling the temporal behavior of the microbiome is a fundamental scientific question, with potential applications in medicine and beyond."

Model predicts bat species with the potential to spread deadly Nipah virus in India

Findings can help guide surveillance and prevent deadly outbreaks

Since its discovery in 1999, Nipah virus has been reported almost yearly in Southeast Asia, with Bangladesh and India being the hardest hit. In a new study, published today in PLOS Neglected Tropical Diseases, scientists used machine learning to identify bat species with the potential to host Nipah virus, with a focus on India - the site of a 2018 outbreak. Four new bat species were flagged as surveillance priorities.

Barbara Han, a disease ecologist at Cary Institute of Ecosystem Studies, is a co-lead author on the paper. She explains, "While there is a growing understanding that bats play a role in the transmission of Nipah virus in Southeast Asia, less is known about which species pose the most risk. Our goal was to help pinpoint additional species with a high likelihood of carrying Nipah, to target surveillance and protect public health."

Caption: Indian flying fox roosting near bananas. Credit: Rajib Islam

Raina K. Plowright, a disease ecologist at Montana State University, was also a co-lead author. She notes, "As this paper was going to press, another case of Nipah virus was confirmed in Kerala. The public health community has again been forced into reactive mode. Our study is a starting point for the research needed to contain Nipah at its source, so we are managing spillover risk, instead of human suffering."

Nipah virus is a highly lethal, emerging henipavirus that can be transmitted to people from the body fluids of infected bats. Eating fruit or drinking date palm sap that has been contaminated by bats has been flagged as a transmission pathway. Once infected, people can spread the virus directly to other people, sparking an outbreak. Domestic pigs are also bridging hosts that can infect people. There is no vaccine and the virus has a high mortality rate.

"Bat-borne viruses are found all over the world, yet surveillance and sampling efforts have been patchy," says Han. "There are likely many competent Nipah hosts that have not been identified. For this reason, there is a need to devise new methods that take all available data into account to guide sampling efforts in India and in other regions."

India is home to an estimated 113 bat species. Just 31 of these species have been sampled for Nipah virus, with 11 found to have antibodies that signal host potential. Plowright notes, "Given the role bats play in transmitting viruses infectious to people, investment in understanding these animals has been low. The last comprehensive and systematic taxonomic study on the bats in India was conducted more than a century ago."

Caption: Geographic ranges of bat species that are in the 90th percentile of similarity (based on generalized boosted regression) with other bat species that are positive for Nipah virus from Asia, Australia, and Oceania (based on PCR or serology). Credit: Plowright RK, Becker DJ, Crowley DE, Washburne AD, Huang T, Nameer PO, et al. (2019)

Machine learning, a form of artificial intelligence, was used to flag bat species with the potential to harbor Nipah. Han explains, "By looking at the traits of bat species known to carry Nipah globally, our model was able to make predictions about additional bat species residing in India with the potential to carry the virus and transmit it to people. These bats are currently not on the public health radar and are worthy of additional study."

First, the team compiled published data on bat species known to carry Nipah and other henipaviruses globally. Data included 48 traits of 523 bat species, including information on foraging methods, diet, migration behaviors, geographic ranges, and reproduction. They also looked at the environmental conditions in which reported spillovers occurred.

Then they applied a trait-based machine learning approach to a subset of species that occur in Asia, Australia, and Oceania. Their algorithm identified known Nipah-positive bat species with 83% accuracy. It also identified six bat species in the same regions whose traits could make them competent hosts and that should be prioritized for surveillance. Four of these species occur in India, two of which are found in Kerala.
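
The two steps described here, checking a model against species of known status and then flagging high-scoring species that are not yet confirmed hosts, can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the scores are taken as given from some trait-based model, the nearest-rank percentile rule and all function and species names are hypothetical, and the numbers are made up.

```python
def accuracy(predicted, actual):
    """Fraction of species whose predicted status matches the known one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def percentile_cutoff(values, percentile):
    """Nearest-rank percentile of a collection (integer percentile)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no scores given")
    rank = max(1, (percentile * len(ordered) + 99) // 100)  # ceiling
    return ordered[rank - 1]


def flag_candidates(scores, known_positive, percentile=90):
    """Species at or above the cutoff that are not already known hosts."""
    cutoff = percentile_cutoff(scores.values(), percentile)
    return sorted(
        name for name, score in scores.items()
        if score >= cutoff and name not in known_positive
    )


# Made-up scores for ten placeholder species.
scores = {f"species_{i}": i / 10 for i in range(10)}
print(flag_candidates(scores, known_positive={"species_9"}))
```

A list produced this way is only a set of surveillance priorities ranked by model score; it says nothing on its own about whether a flagged species actually carries the virus.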

Plowright explains, "We set out to make trait-based predictions of likely henipavirus reservoirs near Kerala. Our focus was narrow, but the model was successful in identifying Nipah hosts, demonstrating that this method could serve as a powerful tool in guiding surveillance for Nipah and other disease systems."

The authors note that their predictions must be combined with local knowledge on bat ecology - including distribution, abundance, and proximity to humans - to design sampling plans that can effectively identify bat hosts that pose a risk to humans. This work provides a list of species to guide early surveillance and should not be taken as a definitive list of reservoirs.

"Surveilling high-risk bat populations can provide early warning for veterinarians and public health authorities to take preventative measures needed to preempt an outbreak. Identifying which species harbor disease is an important first step in surveillance planning. We also need to prioritize research on which virus strains pose the greatest risk to people. Ultimately, the goal is to extinguish risk, not fight fires," Han concludes.