Gynecology World Conference 2026

Julia Babinska, Speaker at Urogynecology Conference Singapore

Julia Babinska

  • Designation: Medical University of Warsaw
  • Country: Poland
  • Title: Simulating Multimodal AI: Evaluating Large Language Models on Synthetic Fetal Growth Disorder Vignettes

Abstract

Objectives: Recent reviews emphasize multimodal AI for managing fetal growth disorders. As frontline clinicians often lack access to these specific algorithms, this study evaluates whether standard Large Language Models (LLMs) can accurately synthesize complex maternal, biometric, and Doppler data, thereby simulating a multimodal AI approach, to correctly stage and manage fetal growth disorders in synthetic clinical scenarios.

Methods: Using predictive parameters highlighted in the recent AI literature, we constructed 100 complex synthetic clinical vignettes representing diverse fetal growth scenarios (fetal growth restriction [FGR] and macrosomia). The vignettes deliberately included discordant data: varying maternal demographics, asymmetrical ultrasound biometry, and conflicting Doppler indices (e.g., an abnormal cerebroplacental ratio [CPR] with a normal umbilical artery pulsatility index [UA PI]). Six leading LLMs (including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) were prompted to diagnose the disorder and recommend delivery timing per ISUOG guidelines. Accuracy was benchmarked against consensus recommendations from three Maternal-Fetal Medicine experts.
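As an illustration of the kind of Doppler discordance described above (abnormal CPR with normal UA PI), the following is a minimal Python sketch, not the authors' vignette generator. It relies on the standard definition of the cerebroplacental ratio as the middle cerebral artery (MCA) PI divided by the UA PI. The `Vignette` class, its field names, and the cut-off fields `cpr_p5` and `ua_pi_p95` are hypothetical; in practice the cut-offs would come from gestational-age-specific reference charts, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Vignette:
    """Hypothetical container for the Doppler part of one synthetic vignette."""
    ga_weeks: float    # gestational age in weeks
    ua_pi: float       # umbilical artery pulsatility index
    mca_pi: float      # middle cerebral artery pulsatility index
    ua_pi_p95: float   # assumed 95th-percentile UA PI cut-off for this gestational age
    cpr_p5: float      # assumed 5th-percentile CPR cut-off for this gestational age


def cerebroplacental_ratio(v: Vignette) -> float:
    """CPR is the MCA PI divided by the UA PI."""
    return v.mca_pi / v.ua_pi


def is_doppler_discordant(v: Vignette) -> bool:
    """True when CPR is abnormal (below cut-off) while UA PI is still normal."""
    cpr_abnormal = cerebroplacental_ratio(v) < v.cpr_p5
    ua_normal = v.ua_pi <= v.ua_pi_p95
    return cpr_abnormal and ua_normal


# Illustrative numbers only; these are not clinical reference values.
discordant_case = Vignette(ga_weeks=34, ua_pi=1.0, mca_pi=1.0, ua_pi_p95=1.2, cpr_p5=1.1)
concordant_case = Vignette(ga_weeks=34, ua_pi=0.9, mca_pi=1.8, ua_pi_p95=1.2, cpr_p5=1.1)
```

A flag of this kind could be stored with each vignette so that model accuracy can later be reported separately for discordant and non-discordant cases.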

Results: Gemini 1.5 Pro and GPT-4o achieved the highest baseline diagnostic accuracy for FGR staging (88% and 86%, respectively). However, when the vignettes contained discordant sonographic variables, average LLM accuracy fell to 54%. Smaller models frequently hallucinated clinical interventions, prematurely recommending delivery in 22% of early-onset FGR simulations. Furthermore, no model spontaneously integrated complex maternal biomarkers (e.g., sFlt-1/PlGF ratios) into its diagnostic reasoning without explicit, stepwise prompting.
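A comparison of baseline and discordant-case accuracy against an expert reference could be computed along the following lines. This is a hedged Python sketch rather than the study's actual scoring procedure: the abstract does not say how the three experts' consensus was formed, so a simple majority vote is assumed here, and the dictionary keys `expert_labels` and `discordant`, as well as the stage labels, are hypothetical.

```python
from collections import Counter


def consensus(labels):
    """Return the majority label, or None when no label has a strict majority.

    Majority voting is an assumption; the source only says 'consensus'.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def stratified_accuracy(cases, predictions):
    """Accuracy of model predictions vs. expert consensus, split by discordance."""
    tallies = {}  # subgroup name -> (correct, total)
    for case, predicted in zip(cases, predictions):
        reference = consensus(case["expert_labels"])
        if reference is None:
            continue  # skip vignettes without an expert majority
        group = "discordant" if case["discordant"] else "concordant"
        correct, total = tallies.get(group, (0, 0))
        tallies[group] = (correct + (predicted == reference), total + 1)
    return {group: correct / total for group, (correct, total) in tallies.items()}


# Toy data for illustration only; not results from the study.
cases = [
    {"expert_labels": ["stage I", "stage I", "stage II"], "discordant": False},
    {"expert_labels": ["stage II", "stage II", "stage II"], "discordant": True},
    {"expert_labels": ["stage I", "stage II", "stage III"], "discordant": True},
    {"expert_labels": ["stage I", "stage I", "stage I"], "discordant": True},
]
predictions = ["stage I", "stage I", "stage I", "stage I"]
result = stratified_accuracy(cases, predictions)
```

Reporting the two subgroups separately, rather than a single pooled figure, makes a drop in performance on discordant vignettes visible.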

Conclusions: While LLMs demonstrate baseline proficiency in standard FGR scenarios, they do not reliably process the complex, discordant data that true multimodal ML algorithms are built to handle. Testing with synthetic vignettes shows that "out-of-the-box" LLMs cannot yet replace purpose-built predictive models for fetal growth disorders. Safe deployment requires robust, domain-specific grounding and fine-tuning.