Public health embraces messy world of big data
The promises of big data in health care are seemingly endless, but so are the challenges — poor data quality, byzantine medical codes, and the complexity of human genetics, to name just a few.
A recent event, hosted by the Program in Quantitative Genomics at Harvard T.H. Chan School of Public Health, tackled some of these challenges. The conference brought together experts from around the world who talked candidly about the obstacles they encounter when working with huge amounts of data from multiple sources.
“Biomedical big data is transforming the study of human biology and disease. For epidemiologists, it’s a big playground. But it’s also a bit of a nightmare,” said Gil McVean, director of the Big Data Institute at the University of Oxford.
With the rise of genetic sequencing technologies, electronic medical records, and digital communications, researchers are able to collect more information on patients than ever before. To store and organize the data, they have started building what are known as biobanks — giant, freezing cold repositories that can safely house biological samples for long-term analysis and genetic sequencing.
The amounts of raw data that biobanks can generate are staggering. Consider the UK Biobank: the program recruited 500,000 adults who provided blood, urine, and other samples for storage. Each participant’s electronic medical records are linked to the biobank, allowing researchers to track participants’ encounters with the health system. In addition, subsets of participants respond regularly to questionnaires on lifestyle habits, occupational history, mental health, cognitive function, and other health-related issues. Some wear ECG monitors and accelerometers to capture data on heart health and physical activity. On top of that, more than 30,000 participants agreed to have detailed MRI scans taken of vital organs to provide a visual record of changes over time, which could lead to new insights for diagnosing and treating various diseases.
“This endeavor requires industrial scale processes,” said Catherine Sudlow, chief scientist of the UK Biobank and one of the event’s keynote speakers. She noted that the UK Biobank is open access, meaning that researchers from around the world can use the data for free. “It’s messy, real-world data, and that’s why we’re interested in working with all the people in this room,” she told attendees.
The two-day conference featured more than a dozen speakers and highlighted the work of junior researchers with a Stellar Abstracts Award ceremony. Several presenters noted the creative ways they’re making use of the UK Biobank, including Tianxi Cai, the John Rock Professor of Population and Translational Data Sciences at Harvard Chan School. Cai works with large data sets from the U.S. Department of Veterans Affairs and said that she can use the UK Biobank to help validate her research on genetic markers for a wide range of diseases, including cardiovascular conditions, aneurysm, and skin conditions.
Cai, who joked during her presentation that she spent the first 10 years of her career cleaning data, knows how far the field has come and how much further there is to go. “We want to do better,” she said.