AI revolution in medicine

It may advance personalized treatment, fill gaps in access to care, and cut red tape, but risks abound

Third in a series that taps the expertise of the Harvard community to examine the promise and potential pitfalls of the coming age of artificial intelligence and machine learning.

The news is bad: “I’m sorry, but you have cancer.”

Those unwelcome words sink in for a few minutes, and then your doctor begins describing recent advances in artificial intelligence, advances that let her compare your case to the cases of every other patient who’s ever had the same kind of cancer. She says she’s found the most effective treatment, one best suited for the specific genetic subtype of the disease in someone with your genetic background — truly personalized medicine.

And the prognosis is good.

It has taken time — some say far too long — but medicine stands on the brink of an AI revolution. In a recent article in the New England Journal of Medicine, Isaac Kohane, head of Harvard Medical School’s Department of Biomedical Informatics, and his co-authors say that AI will indeed make it possible to bring all medical knowledge to bear in service of any case. Properly designed AI also has the potential to make our health care system more efficient and less expensive, ease the paperwork burden that has more and more doctors considering new careers, fill the gaping holes in access to quality care in the world’s poorest places, and, among many other things, serve as an unblinking watchdog on the lookout for the medical errors that kill an estimated 200,000 people and cost $1.9 billion annually.

“I’m convinced that the implementation of AI in medicine will be one of the things that change the way care is delivered going forward,” said David Bates, chief of internal medicine at Harvard-affiliated Brigham and Women’s Hospital, professor of medicine at Harvard Medical School and of health policy and management at the Harvard T.H. Chan School of Public Health. “It’s clear that clinicians don’t make as good decisions as they could. If they had support to make better decisions, they could do a better job.”

Years after AI permeated other aspects of society, powering everything from creepily sticky online ads to financial trading systems to kids’ social media apps to our increasingly autonomous cars, the proliferation of studies showing the technology’s algorithms matching the skill of human doctors at a number of tasks signals its imminent arrival.

“I think it’s an unstoppable train in a specific area of medicine — showing true expert-level performance — and that’s in image recognition,” said Kohane, who is also the Marion V. Nelson Professor of Biomedical Informatics. “Once again medicine is slow to the mark. I’m no longer irritated but bemused that my kids, in their social sphere, are using more advanced AI than I use in my practice.”

But even those who see AI’s potential value recognize its potential risks. Poorly designed systems can misdiagnose. Software trained on data sets that reflect cultural biases will incorporate those blind spots. AI designed to both heal and make a buck might increase — rather than cut — costs, and programs that learn as they go can produce a raft of unintended consequences once they start interacting with unpredictable humans.

“I think the potential of AI and the challenges of AI are equally big,” said Ashish Jha, former director of the Harvard Global Health Institute and now dean of Brown University’s School of Public Health. “There are some very large problems in health care and medicine, both in the U.S. and globally, where AI can be extremely helpful. But the costs of doing it wrong are every bit as important as its potential benefits. … The question is: Will we be better off?”

Many believe we will, but caution that implementation has to be done thoughtfully, with recognition not just of AI’s strengths but also of its weaknesses, and with the benefit of viewpoints from experts in fields outside medicine and computer science: ethics and philosophy, sociology, psychology, behavioral economics, and, one day, the budding field of machine behavior, which seeks to understand the complex and evolving interaction between humans and machines that learn as they go.

“You’re not expecting this AI doctor that’s going to cure all ills but rather AI that provides support so better decisions can be made.”

— Finale Doshi-Velez, John L. Loeb Associate Professor of Engineering and Applied Sciences at the Harvard John A. Paulson School of Engineering and Applied Sciences

“The challenge with machine behavior is that you’re not deploying an algorithm in a vacuum. You’re deploying it into an environment where people will respond to it, will adapt to it. If I design a scoring system to rank hospitals, hospitals will change,” said David Parkes, George F. Colony Professor of Computer Science, co-director of the Harvard Data Science Initiative, and one of the co-authors of a recent article in the journal Nature calling for the establishment of machine behavior as a new field. “Just as it would be challenging to understand how a new employee will do in a new work environment, it’s challenging to understand how machines will do in any kind of environment, because people will adapt to them, will change their behavior.”

Though excitement has been building about the latest wave of AI, the technology has been used in medicine in some form for decades, Parkes said. As early as the 1970s, “expert systems” were developed that encoded knowledge in a variety of fields in order to make recommendations on appropriate actions in particular circumstances. Among them was Mycin, developed by Stanford University researchers to help doctors better diagnose and treat bacterial infections. Though Mycin was as good as human experts at this narrow chore, rule-based systems proved brittle, hard to maintain, and too costly, Parkes said.

The excitement over AI these days isn’t because the concept is new. It’s owing to rapid progress in a branch called machine learning, which takes advantage of recent advances in computer processing power and in big data that have made compiling and handling massive data sets routine. Machine learning algorithms — sets of instructions for how a program operates — have become sophisticated enough that they can learn as they go, improving performance without human intervention.

“The superpower of these AI systems is that they can look at all of these large amounts of data and hopefully surface the right information or the right predictions at the right time,” said Finale Doshi-Velez, John L. Loeb Associate Professor of Engineering and Applied Sciences at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS). “Clinicians regularly miss various bits of information that might be relevant in the patient’s history. So that’s an example of a relatively low-hanging fruit that could potentially be very useful.”

Before being used, however, the algorithm has to be trained using a known data set. In medical imaging, a field where experts say AI holds the most promise soonest, the process begins with a review of thousands of images — of potential lung cancer, for example — that have been viewed and coded by experts. Using that feedback, the algorithm analyzes an image, checks the answer, and moves on, developing its own expertise.
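To make that training loop concrete, here is a minimal, hypothetical sketch in Python using PyTorch. Nothing in it comes from the systems described in this article: the tiny network, the random tensors standing in for expert-coded scans, and the two-class labels are all assumptions used purely to illustrate the predict-check-adjust cycle.

```python
# A minimal, illustrative sketch of the supervised-training loop described above.
# The data here is random noise standing in for expert-labeled scans; in practice
# the tensors would come from thousands of images coded by radiologists.
import torch
import torch.nn as nn

class TinyScanClassifier(nn.Module):
    """A deliberately small CNN: one convolutional block, then a linear head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(8 * 4 * 4, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# Stand-in "data set": 64 single-channel 64x64 images with expert labels
# (0 = no finding, 1 = suspicious), invented for illustration only.
images = torch.randn(64, 1, 64, 64)
labels = torch.randint(0, 2, (64,))

model = TinyScanClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# The loop mirrors the process in the text: predict, check against the expert's
# answer, adjust, and move on.
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```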

In recent years, a growing number of studies have shown machine-learning algorithms equaling and, in some cases, surpassing human experts. In 2016, for example, researchers at Beth Israel Deaconess Medical Center reported that an AI-powered diagnostic program correctly identified cancer in pathology slides 92 percent of the time, just shy of trained pathologists’ 96 percent. Combining the two methods led to 99.5 percent accuracy.

More recently, in December 2018, researchers at Massachusetts General Hospital (MGH) and Harvard’s SEAS reported a system that was as accurate as trained radiologists at diagnosing intracranial hemorrhages, which lead to strokes. And in May 2019, researchers at Google and several academic medical centers reported an AI designed to detect lung cancer that was 94 percent accurate, beating six radiologists with fewer false positives and fewer false negatives.

“The challenge with machine behavior is that you’re not deploying an algorithm in a vacuum. You’re deploying it into an environment where people will respond to it, will adapt to it.”

— David Parkes, George F. Colony Professor of Computer Science and co-director of the Harvard Data Science Initiative

One recent area where AI’s promise has remained largely unrealized is the global response to COVID-19, according to Kohane and Bates. Bates, who delivered a talk in August at the Riyadh Global Digital Health Summit titled “Use of AI in Weathering the COVID Storm,” said that though there were successes, much of the response has relied on traditional epidemiological and medical tools.

One striking exception, he said, was the early detection of unusual pneumonia cases around a market in Wuhan, China, in late December by an AI system developed by Canada-based BlueDot. The cluster, which would turn out to be caused by SARS-CoV-2, was flagged more than a week before the World Health Organization issued a public notice of the new virus.

“We did some things with artificial intelligence in this pandemic, but there is much more that we could do,” Bates told the online audience.

In comments in July at the online conference FutureMed, Kohane was more succinct: “It was a very, very unimpressive performance. … We in health care were shooting for the moon, but we actually had not gotten out of our own backyard.”

The two agree that the biggest impediment to greater use of AI in formulating COVID response has been a lack of reliable, real-time data. Data collection and sharing have been slowed by older infrastructure — some U.S. reports are still faxed to public health centers, Bates said — by lags in data collection, and by privacy concerns that short-circuit data sharing.

“COVID has shown us that we have a data-access problem at the national and international level that prevents us from addressing burning problems in national health emergencies,” Kohane said. 

A key success, Kohane said, may yet turn out to be the use of machine learning in vaccine development. We won’t likely know for some months which candidates proved most successful, but Kohane pointed out that the technology was used to screen large databases and select which viral proteins offered the greatest chance of success if blocked by a vaccine.

“It will play a much more important role going forward,” Bates said, expressing confidence that the current hurdles would be overcome. “It will be a key enabler of better management in the next pandemic.”

Corporations agree about that future promise and in recent years have been scrambling to join in. In February 2019, IBM Watson Health began a 10-year, $50 million partnership with Brigham and Women’s Hospital and Vanderbilt University Medical Center whose aim is to use AI on electronic health records and claims data to improve patient safety, precision medicine, and health equity. And in March 2019, Amazon awarded a $2 million AI research grant to Beth Israel in an effort to improve hospital efficiency, including patient care and clinical workflows.

A properly developed and deployed AI, experts say, will be akin to the cavalry riding in to help beleaguered physicians struggling with unrelenting workloads, high administrative burdens, and a tsunami of new clinical data.

Robert Truog, head of the HMS Center for Bioethics, the Frances Glessner Lee Professor of Legal Medicine, and a pediatric anesthesiologist at Boston Children’s Hospital, said the defining characteristic of his last decade in practice has been a rapid increase in information. While more data about patients and their conditions might be viewed as a good thing, it’s only good if it can be usefully managed.

“Psychologists say that humans can handle four independent variables and when we get to five, we’re lost. So AI is coming at the perfect time. It has the potential to rescue us from data overload.”

— Robert Truog, head of the Harvard Medical School Center for Bioethics and the Frances Glessner Lee Professor of Legal Medicine

“Over the last 10 years of my career the volume of data has absolutely gone exponential,” Truog said. “I would have one image on a patient per day: their morning X-ray. Now, if you get an MRI, it generates literally hundreds of images, using different kinds of filters, different techniques, all of which convey slightly different variations of information. It’s just impossible to even look at all of the images.

“Psychologists say that humans can handle four independent variables and when we get to five, we’re lost,” he said. “So AI is coming at the perfect time. It has the potential to rescue us from data overload.”

Given the technology’s facility with medical imaging analysis, Truog, Kohane, and others say AI’s most immediate impact will be in radiology and pathology, fields where those skills are paramount. And, though some see a future with fewer radiologists and pathologists, others disagree. The best way to think about the technology’s future in medicine, they say, is not as a replacement for physicians, but rather as a force-multiplier and a technological backstop that not only eases the burden on personnel at all levels, but makes them better.

“You’re not expecting this AI doctor that’s going to cure all ills but rather AI that provides support so better decisions can be made,” Doshi-Velez said. “Health is a very holistic space, and I don’t see AIs being anywhere near able to manage a patient’s health. It’s too complicated. There are too many factors, and there are too many factors that aren’t really recorded.”

In a September 2019 issue of the Annals of Surgery, Ozanan Meireles, director of MGH’s Surgical Artificial Intelligence and Innovation Laboratory, and general surgery resident Daniel Hashimoto offered a view of what such a backstop might look like. They described a system that they’re training to assist surgeons during stomach surgery by having it view thousands of videos of the procedure. Their goal is to produce a system that one day could virtually peer over a surgeon’s shoulder and offer advice in real time.

At the Harvard Chan School, meanwhile, a group of faculty members, including James Robins, Miguel Hernan, Sonia Hernandez-Diaz, and Andrew Beam, are harnessing machine learning to identify new interventions that can improve health outcomes.

Their work, in the field of “causal inference,” seeks to identify different sources of the statistical associations that are routinely found in the observational studies common in public health. Those studies are good at identifying factors that are linked to each other but less able to identify cause and effect. Hernandez-Diaz, a professor of epidemiology and co-director of the Chan School’s pharmacoepidemiology program, said causal inference can help interpret associations and recommend interventions.

For example, elevated enzyme levels in the blood can predict a heart attack, but lowering them will neither prevent nor treat the attack. A better understanding of causal relationships — and devising algorithms to sift through reams of data to find them — will let researchers obtain valid evidence that could lead to new treatments for a host of conditions.
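A small simulation can illustrate the distinction. In this hedged sketch — with invented numbers, not real clinical data — a hidden condition raises both an enzyme level and the risk of a heart attack, so the enzyme predicts the outcome even though forcing it down changes nothing.

```python
# An illustrative simulation of the enzyme example: a hidden cause (heart disease)
# drives both the blood marker and the heart attack, so the marker predicts the
# outcome even though intervening on the marker changes nothing. All numbers
# are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

disease = rng.binomial(1, 0.1, n)                  # hidden confounder
enzyme = 1.0 * disease + rng.normal(0, 0.5, n)     # marker raised by the disease
attack = rng.binomial(1, 0.02 + 0.3 * disease, n)  # outcome driven by the disease

# Observational association: the marker "predicts" heart attacks.
print("corr(enzyme, attack):", round(float(np.corrcoef(enzyme, attack)[0, 1]), 2))

# Simulated intervention: lower everyone's enzyme level. Because the outcome
# depends only on the underlying disease, the attack rate is unchanged.
attack_after_intervention = rng.binomial(1, 0.02 + 0.3 * disease, n)
print("attack rate, observed:  ", round(float(attack.mean()), 3))
print("attack rate, intervened:", round(float(attack_after_intervention.mean()), 3))
```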

“We will make mistakes, but the momentum won’t go back the other way,” Hernandez-Diaz said of AI’s increasing presence in medicine. “We will learn from them.”

Finding new interventions is one thing; designing them so health professionals can use them is another. Doshi-Velez’s work centers on “interpretable AI” and optimizing how doctors and patients can put it to work to improve health.

AI’s strong suit is what Doshi-Velez describes as “large, shallow data” while doctors’ expertise is the deep sense they may have of the actual patient. Together, the two make a potentially powerful combination, but one whose promise will go unrealized if the physician ignores AI’s input because it is rendered in hard-to-use or unintelligible form.

“I’m very excited about this team aspect and really thinking about the things that AI and machine-learning tools can provide an ultimate decision-maker — we’ve focused on doctors so far, but it could also be the patient — to empower them to make better decisions,” Doshi-Velez said.

“Getting diversity in the training of these algorithms is going to be incredibly important, otherwise we will be in some sense pouring concrete over whatever current distortions exist.”

— Isaac Kohane, head of Harvard Medical School’s Department of Biomedical Informatics

While many point to AI’s potential to make the health care system work better, some say its potential to fill gaps in medical resources is also considerable. In regions far from major urban medical centers, local physicians could get help diagnosing and treating unfamiliar conditions from an AI-driven consultant that lets them offer patients a specialist’s insight as they decide whether a particular procedure — or additional expertise — is needed.

Outside the developed world, that capability has the potential to be transformative, according to Jha. AI-powered applications could vastly improve care in places where doctors are absent and informal medical systems have risen to fill the need. Recent studies in India and China serve as powerful examples: in India’s Bihar state, 86 percent of cases resulted in unneeded or harmful medicine being prescribed, and even in urban Delhi the figure was 54 percent.

“If you are sick, is it better to go to the doctor or not? In 2019, in large parts of the world, it’s a wash. It’s unclear. And that is scary,” Jha said. “So it’s a low bar. People ask, ‘Will AI be helpful?’ I say we’d really have to screw up AI for it not to be helpful. Net-net, the opportunity for improvement over the status quo is massive.”

Though the promise is great, the road ahead isn’t necessarily smooth. Even AI’s most ardent supporters acknowledge that the likely bumps and potholes, both seen and unseen, should be taken seriously.

One challenge is ensuring that AI is trained on high-quality data; if the data is biased or otherwise flawed, the algorithm’s performance will reflect it. A second challenge is ensuring that the prejudices rife in society don’t make their way into the algorithms, introduced by programmers unaware of biases they may unconsciously hold.

That potential was a central point in a 2016 Wisconsin legal case, when an AI-driven, risk-assessment system for criminal recidivism was used in sentencing a man to six years in prison. The judge remarked that the “risk-assessment tools that have been utilized suggest that you’re extremely high risk to reoffend.”

The defendant challenged the sentence, arguing that the AI’s proprietary software — which he couldn’t examine — may have violated his right to be sentenced based on accurate information. The sentence was upheld by the state supreme court, but that case, and the spread of similar systems to assess pretrial risk, has generated national debate over the potential for injustices due to our increasing reliance on systems that have power over freedom or, in the health care arena, life and death, and that may be unfairly tilted or outright wrong.

“We have to recognize that getting diversity in the training of these algorithms is going to be incredibly important, otherwise we will be in some sense pouring concrete over whatever current distortions exist in practice, such as those due to socioeconomic status, ethnicity, and so on,” Kohane said.

Also highlighted by the case is the “black box” problem. Since the algorithms are designed to learn and improve their performance over time, sometimes even their designers can’t be sure how they arrive at a recommendation or diagnosis, a feature that leaves some uncomfortable.

“If you see a frontline community health worker in India disagree with a tool developed by a big company in Silicon Valley, Silicon Valley is going to win. And that’s potentially a dangerous thing.”

— Ashish Jha, former director of the Harvard Global Health Institute and now dean of Brown University’s School of Public Health

“If you start applying it, and it’s wrong, and we have no ability to see that it’s wrong and to fix it, you can cause more harm than good,” Jha said. “The more confident we get in technology, the more important it is to understand when humans can override these things. I think the Boeing 737 Max example is a classic example. The system said the plane is going up, and the pilots saw it was going down but couldn’t override it.”

Jha said a similar scenario could play out in the developing world should, for example, a community health worker see something that makes him or her disagree with a recommendation made by a big-name company’s AI-driven app. In such a situation, being able to understand how the app’s decision was made and how to override it is essential.

“If you see a frontline community health worker in India disagree with a tool developed by a big company in Silicon Valley, Silicon Valley is going to win,” Jha said. “And that’s potentially a dangerous thing.”

Researchers at SEAS and MGH’s Radiology Laboratory of Medical Imaging and Computation are at work on the two problems. The AI-based diagnostic system to detect intracranial hemorrhages unveiled in December 2018 was designed to be trained on hundreds, rather than thousands, of CT scans. The more manageable number makes it easier to ensure the data is of high quality, according to Hyunkwang Lee, a SEAS doctoral student who worked on the project with colleagues including Sehyo Yune, a former postdoctoral research fellow at MGH Radiology and co-first author of a paper on the work, and Synho Do, senior author, HMS assistant professor of radiology, and director of the lab.

“We ensured the data set is of high quality, enabling the AI system to achieve a performance similar to that of radiologists,” Lee said.

Second, Lee and colleagues figured out a way to provide a window into an AI’s decision-making, cracking open the black box. The system was designed to show a set of reference images most similar to the CT scan it analyzed, allowing a human doctor to review and check the reasoning.
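The retrieval step can be sketched simply. The following illustrative Python fragment — not the MGH group’s actual code, and using random placeholder embeddings — ranks expert-labeled reference scans by cosine similarity to a new scan’s embedding, so a clinician could inspect the prior cases the model’s output most resembles.

```python
# A hedged sketch of the "show similar reference images" idea: given an embedding
# of the new scan, return the k most similar expert-labeled scans. The embeddings
# here are random placeholders; a real system would derive them from its trained
# network and an archive of coded scans.
import numpy as np

rng = np.random.default_rng(1)
reference_embeddings = rng.normal(size=(500, 128))  # 500 labeled reference scans
reference_labels = rng.integers(0, 2, 500)          # e.g. 0 = no bleed, 1 = bleed

def most_similar_references(scan_embedding, k=5):
    """Rank reference scans by cosine similarity to the new scan's embedding."""
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    query = scan_embedding / np.linalg.norm(scan_embedding)
    scores = refs @ query
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i]), int(reference_labels[i])) for i in top]

new_scan = rng.normal(size=128)
for idx, score, label in most_similar_references(new_scan):
    print(f"reference #{idx}: similarity {score:.2f}, expert label {label}")
```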

Jonathan Zittrain, Harvard’s George Bemis Professor of Law and director of the Berkman Klein Center for Internet and Society, said that, done wrong, AI in health care could be analogous to the cancer-causing asbestos that was used for decades in buildings across the U.S., with widespread harmful effects not immediately apparent. Zittrain pointed out that image analysis software, while potentially useful in medicine, is also easily fooled. By changing a few pixels of an image of a cat — still clearly a cat to human eyes — MIT students prompted Google image software to identify it, with 100 percent certainty, as guacamole. Further, a well-known study by researchers at MIT and Stanford showed that three commercial facial-recognition programs had both gender and skin-type biases.

Ezekiel Emanuel, a professor of medical ethics and health policy at the University of Pennsylvania’s Perelman School of Medicine and author of a recent Viewpoint article in the Journal of the American Medical Association, argued that those anticipating an AI-driven health care transformation are likely to be disappointed. Though he acknowledged that AI will likely be a useful tool, he said it won’t address the biggest problem: human behavior. Though they know better, people fail to exercise and eat right, and continue to smoke and drink too much. Behavior issues also apply to those working within the health care system, where mistakes are routine.

“We need fundamental behavior change on the part of these people. That’s why everyone is frustrated: Behavior change is hard,” Emanuel said.

Susan Murphy, professor of statistics and of computer science, agrees and is trying to do something about it. She’s focusing her efforts on AI-driven mobile apps with the aim of reinforcing healthy behaviors for people who are recovering from addiction or dealing with weight issues, diabetes, smoking, or high blood pressure, conditions for which the personal challenge persists day by day, hour by hour.

The sensors included in ordinary smartphones, augmented by data from personal fitness devices such as the ubiquitous Fitbit, have the potential to give a well-designed algorithm ample information to take on the role of a health care angel on your shoulder.

The tricky part, Murphy said, is to truly personalize the reminders. A big part of that is understanding how and when to nudge — not during a meeting, for example, not when you’re driving a car, and not when you’re already exercising — so that the prompts best support healthy behaviors.

“How can we provide support for you in a way that doesn’t bother you so much that you’re not open to help in the future?” Murphy said. “What our algorithms do is they watch how responsive you are to a suggestion. If there’s a reduction in responsivity, they back off and come back later.”

The apps can use sensors on your smartphone to figure out what’s going on around you. An app may know you’re in a meeting from your calendar, or talking more informally from ambient noise its microphone detects. It can tell from the phone’s GPS how far you are from a gym or an AA meeting or whether you are driving and so should be left alone.
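As a rough illustration of that behavior — not Murphy’s actual algorithm — the sketch below tracks a running estimate of how often recent nudges get a response, stays quiet when the phone’s sensors suggest a bad moment, and backs off once responsiveness falls below a threshold. The class name, thresholds, and context flags are all invented for this example.

```python
# An illustrative nudge scheduler: hold off in bad contexts, and back off when the
# user stops responding to recent suggestions. Values are arbitrary placeholders.
from dataclasses import dataclass

@dataclass
class NudgeScheduler:
    responsiveness: float = 0.5   # running estimate of how often nudges help
    threshold: float = 0.2        # below this, stop nudging and come back later
    learning_rate: float = 0.3

    def should_nudge(self, in_meeting: bool, driving: bool) -> bool:
        # Context first: never interrupt a meeting or a drive.
        if in_meeting or driving:
            return False
        # Back off if the user has been ignoring recent suggestions.
        return self.responsiveness >= self.threshold

    def record_outcome(self, responded: bool) -> None:
        # Exponential moving average of recent responses.
        self.responsiveness += self.learning_rate * (float(responded) - self.responsiveness)

scheduler = NudgeScheduler()
for responded in [False, False, False, True]:
    if scheduler.should_nudge(in_meeting=False, driving=False):
        scheduler.record_outcome(responded)
    print("responsiveness:", round(scheduler.responsiveness, 2))
```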

Trickier still, Murphy said, is how to handle moments when the AI knows more about you than you do. Heart rate sensors and a phone’s microphone might tell an AI that you’re stressed out when your goal is to live more calmly. You, however, are focused on an argument you’re having, not its physiological effects and your long-term goals. Does the app send a nudge, given that it’s equally possible that you would take a calming breath or angrily toss your phone across the room?

Working out such details is difficult but key, Murphy said, to designing algorithms that are truly helpful, that know you well yet are only as intrusive as is welcome, and that, in the end, help you achieve your goals.

For AI to achieve its promise in health care, algorithms and their designers have to understand the potential pitfalls. To avoid them, Kohane said it’s critical that AIs are tested under real-world circumstances before wide release.

Similarly, Jha said it’s important that such systems aren’t just released and forgotten. They should be reevaluated periodically to ensure they’re functioning as expected, which would allow for faulty AIs to be fixed or halted altogether.

Several experts said that drawing from other disciplines — in particular ethics and philosophy — may also help.

Programs like Embedded EthiCS at SEAS and the Harvard Philosophy Department, which provides ethics training to the University’s computer science students, seek to provide those who will write tomorrow’s algorithms with an ethical and philosophical foundation that will help them recognize bias — in society and themselves — and teach them how to avoid it in their work.

Disciplines dealing with human behavior — sociology, psychology, behavioral economics — not to mention experts on policy, government regulation, and computer security, may also offer important insights.

“The place we’re likely to fall down is the way in which recommendations are delivered,” Bates said. “If they’re not delivered in a robust way, providers will ignore them. It’s very important to work with human factors specialists and systems engineers about the way that suggestions are made to patients.”

Bringing these fields together to better understand how AIs work once they’re “in the wild” is the mission of what Parkes sees as a new discipline of machine behavior. Computer scientists and health care experts should seek lessons from sociologists, psychologists, and cognitive behaviorists in answering questions about whether an AI-driven system is working as planned, he said.

“How useful was it that the AI system proposed that this medical expert should talk to this other medical expert?” Parkes said. “Was that intervention followed? Was it a productive conversation? Would they have talked anyway? Is there any way to tell?”

Next: A Harvard project asks people to envision how technology will change their lives going forward.