For scientists who study rare diseases, hospitals’ vast data banks hold tantalizing potential. Access to anonymized electronic medical records allows researchers to track the progress of a larger group of patients than would be possible in a traditional cohort study, at a much lower cost. If the records can be linked to blood samples that patients have consented to have used for DNA testing, symptoms can be compared with genetic markers for a richer analysis. But making sense of such a complex sea of billing codes and physicians’ notes requires considerable biostatistical skill.

Tianxi Cai, SD ’99, professor of biostatistics at Harvard School of Public Health, is working to develop a framework to help researchers use large datasets to better understand the genetic basis of complex diseases. She spoke about the promise and challenge of working with electronic medical records for the annual Myrto Lefkopoulou Distinguished Lectureship, held September 18, 2014. Even determining which keywords in a patient record indicate that he or she has a particular disease can be quite difficult, Cai said. “There are 10 different ways to say coronary heart disease.”

Cai and her colleagues looked for rheumatoid arthritis patients among thousands of records from Boston’s Brigham and Women’s Hospital, a painstaking process that took two years, she said. The researchers were able to improve their accuracy by developing an algorithm that scans physicians’ notes for symptom keywords, in addition to using coded data such as lab reports and prescriptions.
