In the era of big data, there's plenty of software on the market that helps people to explore and visualize datasets in search of patterns and new discoveries. But how can users tell if the patterns they're seeing are real or if they simply appear in the data by random chance?
The answer is that they can't -- not unless they apply appropriate statistical tests to make sure their findings are valid, a feature that currently available commercial data exploration tools do not provide. But with a new $3.1 million grant from the Defense Advanced Research Projects Agency, Brown University computer scientists are aiming to develop a software package that brings new statistical rigor to interactive data exploration.
"The goal is to build a user-friendly system that can easily explore data and produce useful visualizations, but also continuously controls for the statistical validity of the results," said Eli Upfal, professor of computer science at Brown and the project's principal investigator.
The grant brings together a team of Brown professors, postdocs and students to tackle different aspects of the project. Tim Kraska, an assistant professor, and Carsten Binnig, adjunct associate professor, are machine learning and database experts who will work mainly on the data management side of the project. Computer graphics pioneer Andy van Dam will work on the user interface and visualizations. Upfal, an expert in computational theory, will work mainly on the statistical side of the project.
Statisticians and scientists routinely use a suite of tests to measure whether or not a result is statistically significant. But the statistical issues in the big data world go well beyond basic significance tests. Modern data exploration tools make it easy to poke and prod a dataset in myriad ways with a few mouse clicks. That can create an issue known to statisticians as the "multiple comparisons problem," and it's one of the things Upfal and his colleagues hope to address.
The problem is essentially this: The more questions you ask of a dataset, the more likely you are to stumble upon something that looks like a genuine correlation, but is actually just a random fluctuation in the data. Without proper statistical correction, those chance patterns can be mistaken for real findings, leading to false discoveries.
"To some extent it's our fault here in computer science that we have made analysis of data so easy," Upfal said. "If I give you a huge database and let you simply push a button to ask question after question, you'll eventually reach something that's there purely by chance."
There are statistical techniques for dealing with the problem, but none of them are easily implemented in a real-time data exploration setting. So Upfal and his colleagues will need to develop an appropriate technique on their own.
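To give a sense of the scale of the problem, here is a minimal Python sketch. It is an illustration only, not the technique the Brown team plans to develop. It assumes every question is an independent test run at the conventional 5 percent significance level, and it uses the Bonferroni correction, one classical remedy, as a stand-in for "existing techniques." The function names and the example p-values are hypothetical.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag the p-values that survive a Bonferroni correction.

    Each p-value is compared against alpha divided by the total number
    of tests, which keeps the family-wise error rate at or below alpha.
    """
    num_tests = len(p_values)
    if num_tests == 0:
        return []
    threshold = alpha / num_tests
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # How quickly the risk of a chance "discovery" grows with more questions.
    for num_tests in (1, 10, 100):
        rate = familywise_error_rate(0.05, num_tests)
        print(f"{num_tests:>3} questions: {rate:.3f}")

    # Hypothetical p-values from four questions asked of one dataset.
    p_values = [0.001, 0.02, 0.04, 0.30]
    print(bonferroni_significant(p_values))
```

Under these assumptions, the chance of at least one spurious "discovery" is 5 percent for a single question, about 40 percent after 10 questions and more than 99 percent after 100. In the four-question example, three results fall below the usual 0.05 cutoff, but only the first survives the correction. The sketch also hints at why such methods fit poorly with interactive exploration: Bonferroni needs the total number of questions up front, and a user clicking through a dataset does not know that number in advance.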
"We'll be building some new theory about how to evaluate a sequence of data queries," Upfal said. "And we'll need it to be computationally efficient. Our system is interactive, so we need everything to compute right away."
Better data science
The researchers envision a system that continually monitors the questions people ask in the process of exploring data and warns them when they're on shaky statistical ground. By doing so, the system will help users -- especially those without statistics training -- to avoid making false discoveries.
And as the use of data exploration expands into new domains, statistical safeguards like these become ever more important. Companies like Netflix and Google have for years combed huge datasets looking for correlations that help them suggest movies or target advertising. In those settings, false correlations lead to a few bad recommendations or mis-targeted ads.
"That's not such a big deal," Upfal said. "But when we're applying these techniques to medicine, for example, we need to be a bit more careful."
The project will be part of Brown's recently launched Data Science Initiative, which is broadly aimed at developing these kinds of novel approaches to dealing with data.
"Ultimately we want to promote data science and see it be successful," Upfal said. "We hope this project will be a step toward that."