EMBL scientists present tool for large-scale analysis of genomic data with cloud supercomputing

Most bioinformatics software used for genomic analysis is experimental in nature and has a relatively high failure rate. In addition, cloud infrastructure itself, when running at scale, is prone to system crashes. These setbacks mean that big biomedical data analysis can take a long time and incur huge costs. To solve these problems, Sergei Yakneen, Jan Korbel, and colleagues at EMBL developed Butler, a system that identifies and fixes crashes efficiently.

Researchers performing analysis on the cloud need a number of technological skills, from configuring large clusters of machines and loading them with software, to handling networking, data security, and efficiently recovering from crashes. Butler helps researchers master these new domains by serving up appropriate tools that overcome all these challenges.

Saving time by checking the system's pulse

Butler differs from other bioinformatics workflow systems because it constantly collects health metrics from all system components, for example, the Central Processing Unit (CPU), memory, or disk space. Its self-healing modules use these health metrics to figure out when something has gone wrong and can take automated action to restart failed services or machines.

When this automated action does not work, a human operator is notified by email or Slack to solve the problem. Previously, a crew of trained people was necessary to check a similar system and detect failures. By automating this process, Butler dramatically reduces the time needed to execute large projects. "It is indeed very rewarding that these large-scale analyses can now take place in a few months instead of years," Korbel says.
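The underlying pattern can be sketched in a few lines of Python. The snippet below is a minimal illustration of such a monitor, restart, and escalate loop, not Butler's actual code: metric collection uses the psutil library, while the restart command and the operator notification are hypothetical placeholders.

import time
import subprocess
import psutil  # cross-platform library for host health metrics

# Illustrative thresholds; a real deployment would tune these per service.
THRESHOLDS = {"cpu_percent": 95.0, "memory_percent": 90.0, "disk_percent": 85.0}

def collect_metrics():
    """Gather the kinds of health metrics mentioned above: CPU, memory, disk."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

def restart_service(name):
    """Automated recovery action; returns True if the restart succeeded."""
    return subprocess.run(["systemctl", "restart", name]).returncode == 0

def notify_operator(message):
    """Escalate to a human, e.g. via email or a Slack webhook (stubbed here)."""
    print(f"ALERT: {message}")

def watch(service, poll_seconds=60):
    """Poll health metrics, self-heal on failure, escalate when healing fails."""
    while True:
        metrics = collect_metrics()
        unhealthy = [k for k, v in metrics.items() if v > THRESHOLDS[k]]
        if unhealthy and not restart_service(service):
            notify_operator(f"{service} unhealthy ({unhealthy}); automated restart failed")
        time.sleep(poll_seconds)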

Open-source

Good solutions are already available for individual challenges associated with supercomputing in the cloud. So instead of reinventing the wheel, the team improved existing technologies. "We built Butler by integrating a large number of established open source projects," says Sergei Yakneen, the paper's first author, currently Chief Operating Officer at SOPHiA GENETICS. "This dramatically improves the ease and cost-effectiveness with which the software can be maintained, and regularly brings new features into the Butler ecosystem without the need for major development efforts."

Besides system stability and maintainability, using the cloud for genomics research is also challenging with respect to data privacy and the way it is regulated in different countries. Bigger projects will need to make simultaneous use of several cloud environments in different institutes and countries in order to meet the diverse data handling requirements of various jurisdictions. Butler addresses this challenge by being able to run on a wide variety of cloud computing platforms, including most major commercial and academic clouds. This allows researchers access to the widest variety of datasets while meeting stringent data protection requirements.
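One common way to achieve such portability, illustrated below, is to hide each platform behind a shared interface so that the same workflow code can be pointed at whichever cloud is allowed to hold a given dataset. This Python sketch is a generic illustration of the design idea; the class and method names are hypothetical, not Butler's actual API.

from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Common interface so one workflow can target any supported cloud."""

    @abstractmethod
    def provision_cluster(self, cores: int, ram_gb: int) -> str:
        """Create a compute cluster and return its identifier."""

class OpenStackProvider(CloudProvider):
    def provision_cluster(self, cores, ram_gb):
        # Real code would call OpenStack APIs (e.g. an academic cloud); stubbed here.
        return f"openstack-cluster-{cores}c-{ram_gb}g"

class AWSProvider(CloudProvider):
    def provision_cluster(self, cores, ram_gb):
        # Real code would call AWS APIs; stubbed here.
        return f"aws-cluster-{cores}c-{ram_gb}g"

def run_analysis(provider: CloudProvider):
    # The analysis itself is cloud-agnostic; only the provider object changes.
    cluster = provider.provision_cluster(cores=1500, ram_gb=5500)
    print(f"Submitting workflows to {cluster}")

With a structure like this, keeping a jurisdiction's patient data within a cloud in that jurisdiction while running the same pipeline elsewhere becomes a matter of configuration rather than new code.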

Butler in use

Butler's ability to facilitate such complex analyses was demonstrated in the context of the Pan-Cancer Analysis of Whole Genomes (PCAWG) study. Butler processed a 725-terabyte cancer genome dataset in a time-efficient and uniform manner, using 1,500 CPU cores, 5.5 terabytes of RAM, and approximately one petabyte of storage. The European Bioinformatics Institute (EMBL-EBI) played a crucial role by providing access and support for its Embassy Cloud, which was used for testing Butler. The system has recently been used in other projects as well, for example in the European Open Science Cloud (EOSC) pilot project.

The Pan-Cancer project

The Pan-Cancer Analysis of Whole Genomes project is a collaboration involving more than 1300 scientists and clinicians from 37 countries. It involved an analysis of more than 2600 genomes of 38 different tumor types, creating a huge resource of primary cancer genomes. This was the starting point for 16 working groups to study multiple aspects of cancer development, causation, progression, and classification.

Real-time flu prediction may be possible using wearable heart rate, sleep tracking devices

The first study to evaluate de-identified resting heart rate and sleep data from wearable devices finds improved real-time prediction of influenza-like illness in five US states compared with current surveillance methods

The research, published in The Lancet Digital Health, demonstrates the potential of data from wearable devices to improve surveillance of infectious disease. Resting heart rate tends to rise during infectious episodes, and this change is captured by heart-rate-tracking wearables such as smartwatches and fitness trackers. Using de-identified data from 47,249 Fitbit users, the researchers retrospectively identified weeks with elevated resting heart rate and changes to routine sleep. Further prospective studies will be needed to differentiate infectious from non-infectious causes of these signals.

Influenza results in 650,000 deaths worldwide annually. Approximately 7% of working adults and 20% of children aged under five years get flu each year. Traditional surveillance reports lag by one to three weeks, which limits the ability to enact rapid outbreak-response measures, such as encouraging sick patients to stay at home and wash their hands, and deploying antivirals and vaccines.

Past studies using crowdsourced data, such as Google Flu Trends and Twitter, have had variable success on their own, as these methods tend to overestimate rates during epidemics. This is because it is difficult to separate the activity of individuals who actually have influenza from activity driven by heightened awareness and media coverage during flu season.

Study author Dr Jennifer Radin, Scripps Research Translational Institute, USA, says: "Responding more quickly to influenza outbreaks can prevent further spread and infection, and we were curious to see if sensor data could improve real-time surveillance at the state level. We demonstrate the potential for metrics from wearable devices to enhance flu surveillance and consequently improve public health responses. In the future as these devices improve, and with access to 24/7 real-time data, it may be possible to identify rates of influenza on a daily instead of a weekly basis." 

The researchers reviewed de-identified data from 200,000 users who wore a Fitbit device, which tracks activity, heart rate, and sleep, for at least 60 days during the study period from March 2016 to March 2018. Of these, 47,249 users from California, Texas, New York, Illinois, and Pennsylvania wore a Fitbit device consistently during the study period, yielding a total of 13,342,651 daily measurements for evaluation. The average user was 43 years old and 60% were female. All Fitbit users, including those whose data are included in this study, are notified in the Fitbit Privacy Policy that their de-identified data may be used for research.

Each user's average resting heart rate and sleep duration were calculated, along with deviations from those averages (measured in standard deviations) to identify when weekly measures fell outside an individual's typical range. A user's week was flagged as abnormal if their weekly average resting heart rate was above their overall average (by more than half or a full standard deviation) and their weekly average sleep was not below their overall average by more than half a standard deviation. Users were then grouped by state of residence, and the proportion of users above the threshold each week was calculated. These data were compared with weekly influenza-like illness rates reported by the US Centers for Disease Control and Prevention (CDC).
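In code, this thresholding is compact. The pandas sketch below assumes a hypothetical table with one row per user per week and illustrative column names (user, state, week, rhr for weekly mean resting heart rate, sleep for weekly mean sleep duration); it follows the published approach but is not the study's actual analysis code.

import pandas as pd

RHR_SD, SLEEP_SD = 0.5, 0.5  # thresholds explored in the study (0.5 or 1 SD)

def weekly_abnormal_proportion(df: pd.DataFrame) -> pd.DataFrame:
    # Per-user overall means and standard deviations define the "typical range".
    stats = df.groupby("user").agg(
        rhr_mean=("rhr", "mean"), rhr_sd=("rhr", "std"),
        sleep_mean=("sleep", "mean"), sleep_sd=("sleep", "std"),
    )
    df = df.join(stats, on="user")
    # Abnormal week: resting heart rate elevated AND sleep not reduced.
    elevated_rhr = df["rhr"] > df["rhr_mean"] + RHR_SD * df["rhr_sd"]
    sleep_not_reduced = df["sleep"] >= df["sleep_mean"] - SLEEP_SD * df["sleep_sd"]
    df["abnormal"] = elevated_rhr & sleep_not_reduced
    # Proportion of flagged users per state and week, comparable to CDC ILI rates.
    return df.groupby(["state", "week"])["abnormal"].mean().reset_index()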

Incorporating data from Fitbit trackers improved influenza predictions at the state level. In all five states there was an improvement in real-time surveillance, and the closest alignment with CDC data was found when an abnormal resting heart rate was defined as more than half a standard deviation above normal and abnormal sleep as no more than half a standard deviation below normal.

This is the first time heart rate and sleep data from wearable trackers have been used to predict flu, or any infectious disease, in real time. With greater volumes of data, it may be possible to apply the method to more geographically defined areas, such as the county or city level.

The authors identify several limitations of their study. A general lack of activity data meant they could not control for seasonal fitness changes or short-term changes in activity. Weekly resting heart rate averages may mix sick and healthy days, which could lower the weekly average and lead to an underestimation of illness. Other factors, including stress or other infections, can also raise resting heart rate. Lastly, previous studies have found sleep-measuring devices to have low accuracy, though the authors note that accuracy will continue to improve as the technology evolves.

The Fitbit users in the study were predominantly middle-aged adults, likely with higher incomes than the general population, and so potentially less likely to suffer comorbidities that could make them more susceptible to severe infections. They may also be more likely to get influenza vaccines, or to receive antivirals or other medicines if they do get sick, which can reduce disease severity, so the models may need to be modified for use in other populations.

In a linked Comment article, Dr. Cécile Viboud of the Fogarty International Center, National Institutes of Health, USA, says: "The study by Radin et al is a promising first step towards integrating wearable measurements in predictive models of infectious diseases. [...] we anticipate that a large amount of real-time data generated by Fitbit and other personal devices will prove highly useful for public health and augment traditional surveillance systems. The ever-expanding "big data" revolution offers unique opportunities to mine new data streams, identify epidemiologically-relevant patterns, and enrich infectious disease forecasts."

NASA, NOAA analyses reveal 2019 second warmest year on record

According to independent analyses by NASA and the National Oceanic and Atmospheric Administration (NOAA), Earth's global surface temperatures in 2019 were the second warmest since modern recordkeeping began in 1880.

Globally, 2019 temperatures were second only to those of 2016 and continued the planet's long-term warming trend: the past five years have been the warmest of the last 140 years.

This past year, global temperatures were 1.8 degrees Fahrenheit (0.98 degrees Celsius) warmer than the 1951 to 1980 mean, according to scientists at NASA's Goddard Institute for Space Studies (GISS) in New York.

"The decade that just ended is clearly the warmest decade on record," said GISS Director Gavin Schmidt. "Every decade since the 1960s clearly has been warmer than the one before." CAPTION This plot shows yearly temperature anomalies from 1880 to 2019, with respect to the 1951-1980 mean, as recorded by NASA, NOAA, the Berkeley Earth research group, the Met Office Hadley Centre (UK), and the Cowtan and Way analysis. Though there are minor variations from year to year, all five temperature records show peaks and valleys in sync with each other. All show rapid warming in the past few decades, and all show the past decade has been the warmest.  CREDIT NASA GISS/Gavin Schmidt{module INSIDE STORY}

Since the 1880s, the average global surface temperature has risen, and it is now more than 2 degrees Fahrenheit (a bit more than 1 degree Celsius) above that of the late 19th century. For reference, the last Ice Age was about 10 degrees Fahrenheit colder than pre-industrial temperatures.

Using climate models and statistical analysis of global temperature data, scientists have concluded that this increase mostly has been driven by increased emissions into the atmosphere of carbon dioxide and other greenhouse gases produced by human activities.

"We crossed over into more than 2 degrees Fahrenheit warming territory in 2015 and we are unlikely to go back. This shows that what's happening is persistent, not a fluke due to some weather phenomenon: we know that the long-term trends are being driven by the increasing levels of greenhouse gases in the atmosphere," Schmidt said.

Because weather station locations and measurement practices change over time, the interpretation of specific year-to-year global mean temperature differences has some uncertainties. Taking this into account, NASA estimates that 2019's global mean change is accurate to within 0.1 degrees Fahrenheit, with a 95% certainty level.

Weather dynamics often affect regional temperatures, so not every region on Earth experienced similar amounts of warming. NOAA found the 2019 annual mean temperature for the contiguous 48 United States was the 34th warmest on record, giving it a "warmer than average" classification. The Arctic region has warmed slightly more than three times faster than the rest of the world since 1970.

Rising temperatures in the atmosphere and ocean are contributing to the continued mass loss from Greenland and Antarctica and to increases in some extreme events, such as heat waves, wildfires, and intense precipitation.

NASA's temperature analyses incorporate surface temperature measurements from more than 20,000 weather stations, ship- and buoy-based observations of sea surface temperatures, and temperature measurements from Antarctic research stations.

These in situ measurements are analyzed using an algorithm that considers the varied spacing of temperature stations around the globe and urban heat island effects that could skew the conclusions. These calculations produce the global average temperature deviations from the baseline period of 1951 to 1980.
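The core anomaly calculation itself is simple to illustrate. The short Python example below computes deviations from a 1951-1980 baseline for a single synthetic station series; the real GISTEMP analysis additionally grids the stations, weights them for uneven spatial coverage, and corrects for urban heat island effects, none of which is shown here.

import numpy as np

years = np.arange(1880, 2020)

# Synthetic annual-mean series for demonstration only (not real data):
# a modest warming trend plus noise.
rng = np.random.default_rng(0)
temps_c = 14.0 + 0.007 * (years - 1880) + rng.normal(0.0, 0.1, years.size)

# Anomaly = deviation from the mean of the 1951-1980 baseline period.
baseline = temps_c[(years >= 1951) & (years <= 1980)].mean()
anomalies = temps_c - baseline

print(f"2019 anomaly: {anomalies[years == 2019][0]:+.2f} degrees Celsius")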

NOAA scientists used much of the same raw temperature data, but with a different interpolation into the Earth's polar and other data-poor regions. NOAA's analysis found 2019 global temperatures were 1.7 degrees Fahrenheit (0.95 degrees Celsius) above the 20th-century average.

NASA's full 2019 surface temperature data set and the complete methodology used for the temperature calculation and its uncertainties are available at:

https://data.giss.nasa.gov/gistemp

GISS is a laboratory within the Earth Sciences Division of NASA's Goddard Space Flight Center in Greenbelt, Maryland. The laboratory is affiliated with Columbia University's Earth Institute and School of Engineering and Applied Science in New York.

NASA uses the unique vantage point of space to better understand Earth as an interconnected system. The agency also uses airborne and ground-based measurements and develops new ways to observe and study Earth with long-term data records and computer analysis tools to better see how our planet is changing. NASA shares this knowledge with the global community and works with institutions in the United States and around the world that contribute to understanding and protecting our home planet.

The slides for the Jan. 15 news conference are available at:

https://www.ncdc.noaa.gov/sotc/briefings/20200115.pdf

NOAA's Global Report is available at:

https://www.ncdc.noaa.gov/sotc/global/201913