Are AI doctors good at holding medical conversations with patients?

Explore how the CRAFT-MD framework evaluates how accurately large language models diagnose patients in realistic, conversational medical interactions.

Discover how the CRAFT-MD framework evaluates AI models to enhance diagnostic accuracy in realistic medical conversations. Learn about its impact on healthcare. (CREDIT: CC BY-SA 4.0)

Collecting a patient’s medical history has long been the cornerstone of diagnosis, guiding physicians in their clinical decisions. Yet rising patient volumes, limited access to care, and shorter consultation times have strained this process.

The COVID-19 pandemic accelerated telemedicine adoption, further complicating traditional patient-doctor interactions. These challenges underscore the need for innovative solutions to preserve the quality of history-taking and diagnostic accuracy.

Recent advancements in generative artificial intelligence (AI), particularly large language models (LLMs), offer a promising avenue to address these challenges. LLMs excel in processing complex conversations, making them strong candidates for assisting with patient history collection and initial diagnostic support. However, their readiness for real-world clinical use remains a subject of debate.

Evaluations of LLMs in medical contexts often focus on standardized test-like formats, such as multiple-choice questions. While these tests assess knowledge, they fail to evaluate the dynamic, conversational nature of real-world medical interactions. This gap highlights the necessity of testing frameworks tailored to simulate realistic patient encounters.

To address this shortfall, researchers introduced the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD). Unlike traditional evaluations, CRAFT-MD tests LLMs through simulated doctor-patient conversations.

CRAFT-MD: a framework for evaluating the conversational abilities of clinical LLMs in medical contexts. (CREDIT: Nature Medicine)

This innovative framework employs a multi-agent system: an AI patient simulates natural patient responses, an AI grader evaluates the LLM’s diagnostic accuracy, and medical experts validate the outcomes. By integrating these components, CRAFT-MD provides a scalable, ethical, and realistic evaluation method.
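For readers curious how such a multi-agent loop might be wired together, the sketch below shows one minimal version in Python. It assumes a generic `ask(system_prompt, transcript)` function standing in for calls to an LLM service; the prompts, function names, and grading rule are illustrative assumptions, not the published CRAFT-MD code.

```python
from typing import Callable

# A generic stand-in for an LLM client: takes a system prompt and the
# running transcript, returns the model's next message.
AskFn = Callable[[str, list[str]], str]

def run_simulated_encounter(ask: AskFn, case_vignette: str,
                            true_diagnosis: str, max_turns: int = 10) -> dict:
    """Run one simulated doctor-patient conversation, then grade it."""
    # AI patient: answers only from the case details it was given.
    patient_sys = ("You are a patient. Answer the doctor's questions using "
                   f"only these case details:\n{case_vignette}")
    # Doctor LLM under test: gathers history one question at a time.
    doctor_sys = ("You are a clinician taking a history. Ask one focused "
                  "question per turn; when confident, reply 'Diagnosis: ...'.")
    transcript: list[str] = []

    for _ in range(max_turns):
        doctor_msg = ask(doctor_sys, transcript)
        transcript.append(f"Doctor: {doctor_msg}")
        if doctor_msg.lower().startswith("diagnosis:"):
            break
        patient_msg = ask(patient_sys, transcript)
        transcript.append(f"Patient: {patient_msg}")

    # AI grader: compares the model's final answer to the ground truth.
    grader_sys = ("You are a grader. Reply 'correct' if the doctor's final "
                  f"diagnosis in the transcript matches: {true_diagnosis}, "
                  "otherwise reply 'incorrect'.")
    verdict = ask(grader_sys, transcript)
    return {"transcript": transcript, "verdict": verdict}
```

In the real framework, medical experts additionally validate the AI grader’s judgments; that human-in-the-loop step is omitted here for brevity.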

In a recent study published in the journal Nature Medicine, CRAFT-MD was used to assess the diagnostic capabilities of prominent LLMs, including GPT-4, GPT-3.5, and others, across 12 medical specialties. The findings revealed that while LLMs performed well on structured test questions, their accuracy diminished significantly during conversational assessments.

For example, these models struggled to ask pertinent follow-up questions, synthesize scattered patient information, and adapt to the nuanced dynamics of medical interviews. The challenges were even more pronounced in multimodal models, such as GPT-4V, which integrate textual and visual data.

Dr. Pranav Rajpurkar, a senior researcher on the project, emphasized the paradox: “While these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit. The dynamic nature of medical conversations poses unique challenges that go far beyond answering multiple-choice questions.” This limitation highlights the need for more sophisticated AI models and evaluation tools.

The CRAFT-MD framework exemplifies how realistic simulations can advance the field. By mimicking actual clinical interactions, the framework evaluates an LLM’s ability to collect patient history, ask relevant questions, and render accurate diagnoses. The process is efficient: AI evaluators can process 10,000 conversations in under three days, whereas human evaluators would need more than 1,000 hours for similar assessments.
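As a rough illustration of that kind of scalability, a hypothetical batch driver over the sketch above could score many simulated encounters and report aggregate diagnostic accuracy. The case dictionary keys and the "correct"/"incorrect" grading convention are assumptions carried over from that sketch, not the study's actual pipeline.

```python
def evaluate_cases(ask: AskFn, cases: list[dict]) -> float:
    """Return the fraction of simulated encounters graded 'correct'."""
    if not cases:
        return 0.0
    correct = 0
    for case in cases:
        # Each case is assumed to provide a vignette and a ground-truth diagnosis.
        result = run_simulated_encounter(ask, case["vignette"], case["diagnosis"])
        if result["verdict"].strip().lower().startswith("correct"):
            correct += 1
    return correct / len(cases)
```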

Beyond diagnosing deficiencies, the CRAFT-MD framework provides actionable recommendations for optimizing LLM performance. These include designing models capable of engaging in open-ended, conversational exchanges; integrating textual and non-textual data, such as medical images; and creating AI systems that interpret non-verbal cues like tone and body language. Incorporating these elements can help bridge the gap between theoretical knowledge and practical application.

Shreya Johri, a doctoral student and co-author of the study, pointed out the limitations of current testing methods: “This approach assumes that all relevant information is presented clearly and concisely. In the real world, this process is far messier.” The shift toward more realistic testing methods is essential to ensure LLMs are equipped to handle the complexities of clinical settings.

CRAFT-MD also reduces ethical risks by preventing unverified AI models from interacting directly with real patients. By simulating interactions, the framework protects patient safety while accelerating the development of reliable AI tools. The system’s scalability allows researchers to keep pace with rapid advancements in AI technology, ensuring continuous improvements in model performance.

Effect of replacing case vignettes with simulated doctor–patient conversations in four-choice multiple-choice questions (MCQs) and free-response questions (FRQs). (CREDIT: Nature Medicine)

Dr. Roxana Daneshjou, a co-senior author of the study, highlighted the broader implications: “CRAFT-MD creates a framework that more closely mirrors real-world interactions, helping to move the field forward in testing AI model performance in healthcare.” As the field evolves, frameworks like CRAFT-MD will likely become essential in evaluating and deploying clinical AI tools effectively and ethically.

The introduction of LLMs into healthcare carries the promise of revolutionizing patient care, but also the responsibility of ensuring these tools meet rigorous standards. One of the most pressing challenges is enabling LLMs to handle the unstructured, complex nature of real-world conversations.

Patients rarely present symptoms in neat, concise packages. Instead, they provide scattered details, intertwining relevant and irrelevant information. Effective diagnosis requires the ability to parse this information, ask clarifying questions, and synthesize a coherent narrative.

Current models, while powerful, often lack this flexibility. Their performance shines in structured settings but falters in open-ended conversations. This underscores the importance of evaluation frameworks like CRAFT-MD, which push models to operate in more challenging, realistic scenarios.

For instance, a clinical LLM must discern whether a patient’s mention of fatigue is related to a new medication, an underlying condition, or simply lifestyle factors. These nuances are difficult to capture without robust conversational capabilities.

Trends in vignette and conversational formats across skin disease datasets. (CREDIT: Nature Medicine)

Another significant challenge lies in integrating multimodal data. Real-world diagnosis often requires synthesizing textual information with other forms of data, such as lab results, imaging studies, and even non-verbal cues from patients.

Multimodal models like GPT-4V aim to bridge this gap, but current iterations still struggle with the complexity of combining diverse data streams into accurate clinical insights.

The potential benefits of overcoming these challenges are immense. AI tools could alleviate clinician workloads by automating routine tasks, such as taking medical histories or triaging patients. This would allow healthcare professionals to focus on more complex aspects of patient care, improving efficiency and outcomes.

Additionally, by standardizing certain aspects of diagnosis, AI could reduce variability in care and help identify patterns that might be missed by individual clinicians.

However, these advancements must be accompanied by robust safeguards. Patient safety is paramount, and any deployment of clinical AI tools must prioritize minimizing risks. CRAFT-MD’s multi-agent system, which includes human oversight, provides a model for achieving this balance. By combining the efficiency of AI with the expertise of medical professionals, the framework ensures that tools are both effective and ethical.

The road ahead involves not only refining AI models but also rethinking how they are integrated into healthcare systems. Collaboration between AI developers, clinicians, and regulators will be crucial.

Developers must prioritize creating models that align with clinical realities, while clinicians need to provide feedback to ensure these tools address real-world needs.

Regulators, in turn, must establish clear guidelines for evaluating and approving clinical AI tools, balancing innovation with safety.

As AI continues to evolve, frameworks like CRAFT-MD will play a critical role in shaping its future. By setting high standards for evaluation, they ensure that advancements in AI translate into tangible benefits for patients and providers alike.

The ultimate goal is not just to create smarter tools but to build trust in their ability to enhance healthcare.

Note: Materials provided above by The Brighter Side of News. Content may be edited for style and length.

Joshua Shavit
Science & Technology Writer | AI and Robotics Reporter

Joshua Shavit is a Los Angeles-based science and technology writer with a passion for exploring the breakthroughs shaping the future. As a contributor to The Brighter Side of News, he focuses on positive and transformative advancements in AI, technology, physics, engineering, robotics and space science. Joshua is currently working towards a Bachelor of Science in Business Administration at the University of California, Berkeley. He combines his academic background with a talent for storytelling, making complex scientific discoveries engaging and accessible. His work highlights the innovators behind the ideas, bringing readers closer to the people driving progress.