Plugging medical symptoms into Google is so common that clinicians have nicknamed the search engine “Doctor Google.” But a newcomer is quickly taking its place: “Doctor Chatbot.” People with medical questions are drawn to generative artificial intelligence because chatbots can answer conversationally worded questions with simplified summaries of complex technical information. Users who direct medical questions to, say, OpenAI’s ChatGPT or Google’s Gemini may also trust the AI tool’s chatty responses more than a list of search results.
But that trust might not always be wise. Concerns remain as to whether these models can consistently provide safe and accurate answers. New study findings, set to be presented at the Association for Computing Machinery’s Web Conference in Singapore this May, underscore that point: OpenAI’s general-purpose GPT-3.5 and another AI program called MedAlpaca, which is trained on medical texts, are both more likely to produce incorrect responses to health care queries in Mandarin Chinese, Hindi and Spanish than in English.
In a world where less than 20 percent of the population speaks English, these new findings show the need for closer human oversight of AI-generated responses in multiple languages—especially in the medical realm, where misunderstanding a single word can be deadly. About 14 percent of Earth’s people speak Mandarin, and Spanish and Hindi are used by about 8 percent each, making these the three most commonly spoken languages after English.
“Most patients in the world do not speak English, and so developing models which can serve them should be an important priority,” says ophthalmologist Arun Thirunavukarasu, a digital health specialist at John Radcliffe Hospital and the University of Oxford, who was not involved in the study. More work is needed before these models’ performance in non-English languages matches what they promise the English-speaking world, he adds.
In the new preprint study, researchers at the Georgia Institute of Technology asked the two chatbots more than 2,000 questions similar to those typically asked by the public about diseases, medical procedures, medications and other general health topics.* The queries in the experiment, chosen from three English-language medical datasets, were then translated into Mandarin Chinese, Hindi and Spanish.
For each language, the team checked whether the chatbots answered questions correctly, comprehensively and appropriately—qualities that would be expected of a human expert’s answer. The study authors used an AI tool (GPT-3.5) to compare generated responses against the answers provided in the three medical datasets. Finally, human assessors double-checked a portion of those evaluations to confirm the AI judge was accurate. Thirunavukarasu, though, says he wonders about the extent to which artificial intelligence and human evaluators agree; people can, after all, disagree in their assessments of comprehensiveness and other subjective qualities. Additional human study of the generated answers would help clarify conclusions about chatbots’ medical usefulness, he adds.
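For readers curious about what this kind of automated grading looks like in practice, the general approach can be illustrated with a minimal sketch, assuming an OpenAI-style chat API in Python; the prompt wording, verdict labels and data fields below are hypothetical stand-ins for illustration, not the study authors' actual code or protocol.

```python
# Minimal sketch of "LLM as judge" scoring, assuming the openai Python client (v1)
# and an OPENAI_API_KEY in the environment. Prompt text, labels and field names
# are illustrative assumptions, not the study's actual setup.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Compare the chatbot's answer with the reference answer from the medical dataset.\n"
    "Reply with one word: ACCEPTABLE if the answer is correct, comprehensive and "
    "appropriate, otherwise UNACCEPTABLE.\n\n"
    "Question: {question}\nReference answer: {reference}\nChatbot answer: {candidate}"
)

def judge_answer(question: str, reference: str, candidate: str) -> str:
    """Ask GPT-3.5 to grade one generated answer against its reference answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

def unacceptable_rate(examples: list[dict]) -> float:
    """Fraction of answers the judge labels UNACCEPTABLE for one language or model."""
    verdicts = [judge_answer(e["question"], e["reference"], e["answer"]) for e in examples]
    return sum(v.startswith("UNACCEPTABLE") for v in verdicts) / len(verdicts)
```

In a setup like this, a sample of the automated verdicts would then be re-read by human reviewers, as the study describes, to check that the AI judge's labels can be trusted.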
The authors found that, by GPT-3.5’s own evaluation, the model produced more unacceptable replies in Chinese (23 percent of answers) and Spanish (20 percent) than in English (10 percent). Its performance was poorest in Hindi, where about 45 percent of its answers were contradictory, not comprehensive or otherwise inappropriate. Answer quality was much worse for MedAlpaca: more than 67 percent of the answers it generated to questions in Chinese, Hindi and Spanish were deemed irrelevant or contradictory. Because people might use chatbots to verify information about medications and medical procedures, the team also tested the programs’ ability to distinguish correct statements from erroneous ones; the chatbots performed better when the claims were in English or Spanish than when they were in Chinese or Hindi.