The ease with which ChatGPT can produce coherent content and convincing answers has raised fears that it will enable cheating on university campuses and replace workers in fields ranging from journalism to medicine.
A group of pediatric specialists, however, aren’t sweating just yet after their first pass at testing ChatGPT on the knowledge required to do their jobs.
Research conducted earlier this year pitted the 3.5 version of ChatGPT — a type of artificial intelligence called a “large language model” — against the neonatal-perinatal board exam required for practicing pediatricians specializing in the period just before and after birth. The AI got 46 percent correct.
The study, published in July in the journal JAMA Pediatrics, tested the large language model against a board practice test. It scored highest on questions of basic recall and clinical reasoning and lowest on more complex multi-logic questions. Across subject areas, it performed worst in gastroenterology, at 37.5 percent, and best in ethics, at 78.5 percent.
The study’s senior author, Andrew Beam, assistant professor of biomedical informatics at Harvard Medical School and of epidemiology at the Harvard T.H. Chan School of Public Health, said he knew that ChatGPT had successfully passed some general professional examinations, including the U.S. Medical Licensing Exam, required for medical students to become doctors. But he wondered how it would fare against more specialized board exams, taken by physicians who’ve devoted additional years of study and clinical work to master more narrowly focused specialties. Luckily, he didn’t have far to look.
Beam’s wife, Kristyn, an instructor in pediatrics at Harvard Medical School and a neonatologist at Beth Israel Deaconess Medical Center, agreed to participate by grading the AI’s answers. Joining her were HMS colleague Dara Brodsky, author of an influential neonatal textbook, and Brodsky’s co-author Camilia Martin, chief of newborn medicine at Weill Cornell Medicine and New York Presbyterian-Komansky Children’s Hospital.
The speed of development of these latest large language models has impressed Andrew Beam, who advocated pitting AI against the U.S. Medical Licensing Exam at a technology conference back in 2017 but found his own models couldn’t do better than 40 percent. Then things started moving quickly.
“There was this moment last year when, all of a sudden, five or six different models were all getting scores of 80 percent or higher,” he said. “The pace in this field is just crazy. The original ChatGPT isn’t even a year old — even I tend to forget that. But we’re very, very early in this and people are still trying to figure things out.”