As big data becomes a common analytical tool in fields from the sciences to the humanities, Harvard’s computer infrastructure experts are turning their attention to an increasingly pressing question: How do you manage it all?
In recent years, Harvard invested in the Odyssey computing cluster, whose 60,000 CPUs provide the sheer computing horsepower needed to crunch big data.
But as large data sets multiply, the question of where to put the information and how to seamlessly retrieve it for analysis has become increasingly important. In August, the National Science Foundation announced a grant of nearly $4 million over the next five years to develop the North East Storage Exchange (NESE), a collaboration among five area universities, including Harvard, to provide not just space for massive data sets, but also the high-speed infrastructure that allows them to be quickly retrieved for analysis.
“People are downloading now 50 to 80 terabyte data sets from NCBI [the National Center for Biotechnology Information] and the National Library of Medicine over an evening. This is the new normal. People [are] pulling genomic data sets wider and deeper than they’ve ever been,” said James Cuff, Harvard’s assistant dean and distinguished engineer for research computing. “What used to be — in lab, in vivo, or in vitro practice — ‘cutting edge’ … are now standard old processes. PCR [polymerase chain reaction] was cutting edge at one point. Now it’s just a thing you do.”
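To put those figures in perspective, a rough calculation shows the sustained network speed such a download implies. This is only a sketch: it assumes decimal terabytes (1 TB = 10^12 bytes) and treats "an evening" as a 12-hour window, neither of which comes from Cuff's remarks, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of the sustained transfer rate needed to
# download a large data set overnight.
# Assumptions (not from the article): decimal terabytes and a 12-hour window.
BYTES_PER_TB = 10**12
EVENING_HOURS = 12  # hypothetical length of "an evening"


def required_gbps(terabytes: float, hours: float = EVENING_HOURS) -> float:
    """Return the sustained rate, in gigabits per second, needed to move
    `terabytes` of data within `hours`."""
    bits = terabytes * BYTES_PER_TB * 8
    seconds = hours * 3600
    return bits / seconds / 1e9


if __name__ == "__main__":
    for size_tb in (50, 80):
        rate = required_gbps(size_tb)
        print(f"{size_tb} TB in {EVENING_HOURS} h needs about {rate:.1f} Gbit/s")
```

Under these assumptions, the 50 to 80 terabyte range works out to roughly 9 to 15 gigabits per second, sustained for the whole window.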
The five institutions involved are Harvard, the Massachusetts Institute of Technology, Northeastern University, Boston University, and the University of Massachusetts. They are taking on the project as an expansion of their existing high-performance computing collaboration. In 2012, the five institutions opened the Massachusetts Green High Performance Computing Center (MGHPCC). Located in Holyoke on a rehabilitated industrial site, MGHPCC provides state-of-the-art computing services and is home to part of Harvard’s Odyssey cluster. The site was also designed to be energy-efficient and is largely run on hydropower and solar energy.
MGHPCC President Richard McCullough, Harvard’s vice provost for research and professor of materials science and engineering, said the capacity the project will provide is badly needed, but the project is seen as more than a one-off effort. Lessons learned will help inform similar efforts elsewhere.
“You just need more and more of these kinds of resources to be at the forefront of data science,” McCullough said. “This grant will keep us at the forefront, and may allow us to take a quantum leap forward. This is a really important win for us.”
Cuff expects data retrieval from the North East Storage Exchange to be about 10 times faster than from equivalent storage on private cloud-based servers, and McCullough said it will be cheaper too, at just a fifth of the cost of commercial vendors.
Cuff, NESE’s principal investigator, said that officials hope to have more than 50 petabytes of storage capacity available at MGHPCC within the next five years, with the ability to expand it further. John Goodhue, MGHPCC’s executive director and a co-principal investigator of NESE, said he expects the speed of the connection to collaborating institutions to double or triple over the next few years.
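For a sense of scale, the planned capacity can be set against the 50 to 80 terabyte data sets Cuff describes. The sketch below assumes decimal units (1 PB = 1,000 TB) and treats 50 petabytes as fully usable space, ignoring redundancy and file system overhead; these are simplifications for illustration, not details of the NESE design.

```python
# Rough count of how many large data sets would fit in the planned storage.
# Assumptions (not from the article): decimal units and no overhead for
# redundancy, backups, or the file system.
TB_PER_PB = 1000
CAPACITY_PB = 50  # the "more than 50 petabytes" target, taken as exactly 50


def datasets_that_fit(dataset_tb: int, capacity_pb: int = CAPACITY_PB) -> int:
    """Return how many whole data sets of `dataset_tb` terabytes fit
    in `capacity_pb` petabytes."""
    return (capacity_pb * TB_PER_PB) // dataset_tb


if __name__ == "__main__":
    for size_tb in (50, 80):
        count = datasets_that_fit(size_tb)
        print(f"{count} data sets of {size_tb} TB fit in {CAPACITY_PB} PB")
```

On those terms, 50 petabytes holds between 625 and 1,000 data sets of that size.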
“What we’re building is an extendable architecture,” Cuff said.
Though Cuff said NESE could be thought of as the collaborating institutions’ private cloud, he doesn’t expect NESE to compete with commercial cloud storage providers. Rather, he said, researchers have a range of data storage options, which should be matched to their purpose. NESE, for example, could potentially back up its data to the cloud.
“This isn’t a competitor to the cloud. It’s a complementary cloud storage system,” Cuff said.
Cuff compared the NESE collaboration to the early days of the internet, when the communications needs of groups of institutions prompted them to create computer networks that grew increasingly interconnected. Now, the problem facing institutions around the country is how to manage the tidal wave of data being generated by researchers and the larger wave likely to break over them in the years to come.
The collaboration depends on contributions from each institution, Cuff said, adding that the five-year effort is also an experiment in managing their needs in order to build the research computing infrastructure of the future.
Despite all the effort, Goodhue and Cuff said, the ultimate goal is to make the system invisible to its users.
“There’s cost savings at every level, savings in the amount of time a researcher has to spend worrying about whether the data is OK and backed up properly,” Goodhue said. “Having something so easy to work with that you don’t even have to think about it is a goal too.”