
The promise of ‘big data’


Symposium embraces goals, challenges of collecting, processing massive amounts of information on complex issues

What use is a library without a librarian, or an encyclopedia without an index? Scale that prospect up to the realms of Web analytics, astronomy, high-speed finance, or even basketball statistics, and the problem becomes clear.

As research scientist Fernando Pérez put it, “Regardless of the amount of data we have … we still only have two eyeballs and one brain.”

Pérez, of the University of California, Berkeley, spoke at a symposium last Friday titled “Weathering the Data Storm: The Promise and Challenges of Data Science,” hosted by the Institute for Applied Computational Science (IACS) at the Harvard School of Engineering and Applied Sciences (SEAS). The annual symposium marks the culmination of two weeks of events at IACS called ComputeFest.

Leaders from academia and a range of industries spoke about the power of computational science and engineering to solve real-world problems.

For example, the Manhattan power grid contains 21,000 miles of underground electrical cable, some of it 130 years old. Given the human and financial costs of major outages, proactive maintenance becomes more important as the system ages. Statistician Cynthia Rudin of the Massachusetts Institute of Technology described how she collected and analyzed data on these cables, manholes, inspections, and “trouble tickets” to generate a robust model that is currently the best predictor of power failures in New York City.
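Rudin’s full models draw on decades of utility inspection and trouble-ticket records, but the underlying idea — estimate each asset’s probability of failure and rank the maintenance queue accordingly — can be sketched in a few lines. The features, data, and coefficients below are invented purely for illustration and are not taken from her work:

```python
# Illustrative sketch only: rank grid assets by predicted failure risk.
# The features and data below are invented stand-ins for real inspection
# and trouble-ticket records.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features: cable age (years), past trouble tickets, recent inspection
age = rng.uniform(1, 130, n)
tickets = rng.poisson(2, n)
inspected = rng.integers(0, 2, n)
X = np.column_stack([age, tickets, inspected])

# Synthetic labels: older, ticket-heavy, uninspected assets fail more often
logit = 0.03 * age + 0.5 * tickets - 1.0 * inspected - 5.0
y = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression().fit(X, y)

# Rank assets so the most failure-prone sit at the top of the maintenance queue
risk = model.predict_proba(X)[:, 1]
priority_order = np.argsort(-risk)
print("Top 5 highest-risk assets:", priority_order[:5])
```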

Another symposium presenter described a more desperate problem.

Humanitarians at UNICEF periodically send text messages to 245,000 Ugandans to solicit information about the state of their nation. When one survey asked, “Have you heard of any children being sacrificed in your community?” the responses were chilling: some “yes,” some “no,” and a flood of cries for help.

UNICEF’s Ureport system of weekly surveys gathers essential data on vulnerable populations in order to guide its outreach and direct limited resources to the people who need them most. The incoming text messages sometimes report famines, floods, Ebola outbreaks, evictions, and dried-up water sources — often begging for assistance.

“There just aren’t enough humans to read all of these messages and try to determine: Is this something that requires immediate action?” said Bonnie Ray, director for cognitive algorithms at IBM’s T.J. Watson Research Lab. Ray’s team worked with UNICEF to improve the process of sorting and prioritizing the messages. The new system parses spelling errors, uses common word associations to understand synonyms, and incorporates conditional probability techniques to make intelligent assessments that work to put the most urgent messages quickly in front of the people who need to see them.
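The article does not detail the internals of IBM’s pipeline, but the “conditional probability techniques” Ray mentions can be illustrated with a toy Naive Bayes text classifier that estimates how likely a message is to be urgent from the words it contains. The training messages and labels below are made up:

```python
# Rough illustration of conditional-probability message triage.
# The training messages and labels are invented for demonstration;
# they stand in for labeled U-report texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_messages = [
    "our well has dried up please send help",
    "flooding has destroyed homes in the village",
    "children are attending school regularly",
    "the market was busy this week",
    "we have had no food for days",
    "roads are in good condition",
]
urgent = [1, 1, 0, 0, 1, 0]  # 1 = needs immediate attention

# Bag-of-words counts feed a multinomial Naive Bayes model, which estimates
# P(urgent | words) from per-word conditional probabilities.
triage = make_pipeline(CountVectorizer(), MultinomialNB())
triage.fit(train_messages, urgent)

incoming = ["the river flooded our homes please send help"]
print(triage.predict_proba(incoming))  # [P(routine), P(urgent)]
```

A production system would add the spelling normalization, synonym handling, and scale Ray describes; this sketch shows only the probabilistic core.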

The information filtered by this system does not constitute “big data” on the scale of Facebook or Google, Ray noted, but “it’s too much for a human to do, and it is having a real impact on the lives of Ugandans.”

As computing power allows nonprofits, businesses, and researchers to gather ever-larger troves of information, new challenges arise, such as those involving privacy and security. (As Google research scientist Diane Lambert noted, “If you’ve ever put a query into Google, then you’ve been in an experiment.”)

Meanwhile, the demand for reliable software that can make sense of ever-larger and more complex data sets continues to grow, as does the need for well-educated analysts who can deftly weave together computer science, statistics, and other disciplines. This new breed of data scientist not only can guide important decisions, but also can provide new tools for scientific inquiry or recognize hidden patterns in human behavior, demographics, and epidemiology.

“The underlying methods can be familiar techniques such as logistic regression,” or Bayesian statistics, “techniques that have been part of the standard statistics and machine learning curriculum for a long time,” said Rachel Schutt, senior vice president of data science at the media conglomerate News Corp. Yet the vast scale of the data, the need for real-time analysis and implementation, and the way business decisions rapidly feed back into the data stream are all recent developments and require new types of experts, disrupting the traditional notion of a “quantitative analyst” or “statistician.”
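As one concrete (and entirely hypothetical) instance of the “familiar techniques” Schutt names, a conjugate Beta-Binomial update — textbook Bayesian statistics — can re-estimate something like a click-through rate each time a new batch of data streams in, the kind of real-time feedback loop she describes:

```python
# Minimal example (not from the talk): a conjugate Beta-Binomial update.
# The posterior can be refreshed cheaply as each new batch of click data arrives.
from scipy.stats import beta

alpha, b = 1.0, 1.0  # uniform Beta(1, 1) prior on the click-through rate

# Hypothetical hourly batches of (clicks, impressions)
batches = [(12, 500), (30, 1400), (9, 300)]
for clicks, impressions in batches:
    alpha += clicks
    b += impressions - clicks

posterior = beta(alpha, b)
print(f"Posterior mean CTR: {posterior.mean():.4f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```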

“It’s a challenge for a lot of people working in these fields to welcome data science,” Schutt added, “but it also comes with a lot of promise.”

“It’s exciting to be present at the birth of a new discipline, not quite yet defined,” said SEAS Dean Cherry A. Murray, who established IACS in 2010 in response to several catalysts. “We are experiencing the convergence of ubiquitous computing power and cloud services at the same time that the connectivity of the Internet and the microelectronics revolution are enabling us to collect, store, interact with, and learn from massive streams of raw data.”

At Harvard, rigorous scholarship in machine learning, advanced computational techniques, algorithms, and visualization is converging with studies in statistics, social science, and the humanities. “With knowledge from across these areas, graduates have the opportunity to inform decision-making in science, business, or government settings, greatly enhancing our understanding of nature and of society,” said Murray.

Speakers at the symposium presented some tools that live-Tweeters in the crowd called “mind-blowing.”

Pérez wowed the audience with IPython, a comprehensive tool for streamlining the entire analysis process, from data exploration to publication. Jeffrey Heer, associate professor of computer science at the University of Washington, provided a tour of Data Wrangler, a clever cleanup tool for messy data sets. And Ryan Adams, assistant professor of computer science at Harvard SEAS, extolled the virtues of Bayesian optimization.
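Bayesian optimization treats an expensive experiment — tuning a machine-learning model, say — as a black-box function, fits a probabilistic surrogate to the evaluations made so far, and uses that surrogate to decide where to look next. The loop below is only a bare-bones sketch of the idea with a toy objective, not the production tools Adams’ group has built:

```python
# Bare-bones sketch of Bayesian optimization: fit a Gaussian-process surrogate
# to the points evaluated so far, then evaluate the candidate with the highest
# expected improvement. The objective is a toy stand-in for an expensive experiment.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2  # toy "expensive" function to minimize

candidates = np.linspace(-3, 3, 200).reshape(-1, 1)
X = np.array([[-2.0], [0.0], [2.0]])  # initial evaluations
y = objective(X).ravel()

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True, alpha=1e-6).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.min()
    # Expected improvement over the best value found so far
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("Best x found:", X[np.argmin(y)], "with value:", y.min())
```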

Adams raised a concern that seemed to resonate with the audience: As computational tools become more sophisticated, the field of data science risks alienating non-experts. Investigative journalists, for instance, have much to gain from accessible research tools.

Likewise, several speakers noted, it is important for practitioners of computational science and engineering to be able to accurately and engagingly communicate the results of an investigation to those outside their fields.

“There’s an element we can learn from journalists — hearing how they tell stories and investigate and ask questions, and how they find what’s actually interesting to other people,” explained Schutt. “It’s important in communicating about data [to know] exactly what’s objective and what’s subjective … and [to make] sure you’re transparent about the data-collection process and your modeling process.”

“It does require some education,” agreed Heer, “and doing that hand-in-hand with basic quantitative skills as well is incredibly important.”

At SEAS, graduate students can pursue a one-year master of science or two-year master of engineering in computational science and engineering (CSE). Doctoral candidates in the Graduate School of Arts and Sciences (GSAS) can also take a secondary field in CSE. Undergraduates can take courses such as “Data Science,” “Visualization,” “Data Structures and Algorithms,” “Introduction to Scientific Computing,” or “Statistics and Inference in Biology” as part of their liberal arts coursework. And graduates with deep and broad skills — beyond just number crunching — are in high demand.

In a changing economy, universities have a responsibility to foster these types of abilities in all students, said Murray. But there is another reason academia, not just business, must influence the evolution of data science.

“It is important to think deeply about and measure how ubiquitous computing and data are affecting society and our everyday lives, and how players in society interact to create social norms, disrupt old systems of social interaction and business models, and affect and interact with legal systems,” she explained. “This is why ‘data science’ cannot become its own narrow discipline, but will need to be intrinsically transdisciplinary, and why it is important for Harvard, in particular, to be focusing on the field.”

The session drew close to 500 attendees from Harvard, other Boston-area universities, and industry partners, as well as sponsorship from Liberty Mutual Insurance Co. and VMware.

“The annual IACS symposium has become a cornerstone event for SEAS and Harvard,” said Hanspeter Pfister, An Wang Professor of Computer Science and director of IACS. “The impressive audience turnout and their active participation in the engaging panel discussions are compelling indications that there is a real interest in data science at Harvard and beyond.”

For two weeks, ComputeFest 2014 featured workshops on computing tools and techniques, talks by entrepreneurs in computational science, and a competition in which teams of students designed intelligent software that could win a game against another computer. Sophomores Rebecca Chen (computer science) and Colin Lu (mathematics) won this computational challenge.

Pfister also announced the recipients of a new student fellowship. Enabled by an anonymous gift, the first fellowships will support Dylan Nelson, a Ph.D. candidate in astronomy, and Ekin Dogus Çubuk, a Ph.D. candidate in applied physics, who are both pursuing secondary fields in computational science and engineering through SEAS.