Mitigation of limitations in large language models for healthcare. #AIHealthcare

The MIMIC-CDM dataset was created from the MIMIC-IV database and contains electronic health records of 2,400 patients, each presenting with one of four target pathologies. The dataset includes laboratory and microbiology test results, diagnoses, procedures, treatments, and clinical notes. Patients were selected based on their primary diagnosis, and data covering physical examinations, laboratory tests, radiology reports, and procedures were included. The dataset was cleaned and anonymized for analysis.
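As a rough illustration of the cohort-selection step described above, the sketch below keeps only admissions whose primary diagnosis matches one of the target pathologies and attaches their laboratory results. The table and column names (diagnoses_icd, labevents, seq_num, icd_code, hadm_id) follow the public MIMIC-IV schema, and the ICD-10 prefixes are placeholders; this is an assumption-laden sketch, not the authors' actual pipeline.

```python
# Minimal pandas sketch of patient filtering by primary diagnosis.
# Assumes MIMIC-IV-style tables; the ICD-10 prefixes below are illustrative
# placeholders, not the paper's inclusion criteria.
import pandas as pd

TARGET_ICD10_PREFIXES = ("K35", "K81", "K57", "K85")  # assumed codes for four
# abdominal pathologies (appendicitis, cholecystitis, diverticulitis, pancreatitis)

diagnoses = pd.read_csv("diagnoses_icd.csv.gz")   # one row per coded diagnosis
labevents = pd.read_csv("labevents.csv.gz")       # laboratory results

# seq_num == 1 marks the primary diagnosis of an admission.
primary = diagnoses[diagnoses["seq_num"] == 1]
is_target = primary["icd_code"].astype(str).str.startswith(TARGET_ICD10_PREFIXES)
target_hadm_ids = set(primary.loc[is_target, "hadm_id"])

# Keep only laboratory results belonging to the selected admissions.
cohort_labs = labevents[labevents["hadm_id"].isin(target_hadm_ids)]
print(f"Selected {len(target_hadm_ids)} admissions")
```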

A reader study was conducted with clinicians to compare their diagnostic accuracy with that of large language models (LLMs) on the MIMIC-CDM-FI dataset. The LLMs were instructed to gather information and then provide a diagnosis and treatment plan. Models were evaluated on their ability to request appropriate treatments, follow instructions, and provide accurate diagnoses.

Various LLMs were tested, with OASST performing best overall, although even it was not suitable for clinical use. Llama 2 Chat had the lowest diagnostic accuracy and struggled to follow instructions, while WizardLM was inconsistent in ordering diagnostic exams. Clinical Camel achieved the highest diagnostic accuracy but could not participate in the clinical decision-making task.

Statistical tests were conducted to compare the performance of models and clinicians, with per-class accuracy used as the primary metric. Results showed that the LLMs had lower diagnostic accuracy than the clinicians, highlighting the need for further development and evaluation before clinical deployment. The study emphasizes the importance of using open-source models for medical AI to ensure patient privacy, transparency, and reliability.
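For concreteness, here is a minimal sketch of the per-class accuracy metric mentioned above: accuracy is computed separately for each target pathology and then averaged, so that no single class dominates the score. The label names are illustrative, and the function is not taken from the paper's released code.

```python
# Sketch of mean per-class accuracy (accuracy computed within each class,
# then averaged across classes). Labels below are illustrative only.
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Average the per-class hit rates rather than pooling all cases,
    # so each pathology contributes equally to the final score.
    return sum(correct[c] / total[c] for c in total) / len(total)

y_true = ["appendicitis", "cholecystitis", "pancreatitis", "diverticulitis"]
y_pred = ["appendicitis", "cholecystitis", "cholecystitis", "diverticulitis"]
print(per_class_accuracy(y_true, y_pred))  # 0.75
```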

Source link: https://www.nature.com/articles/s41591-024-03097-1
