Objectives: While Large Language Models (LLMs) have demonstrated proficiency in general obstetrics, their reasoning capabilities in highly specialized fields like Perinatology remain underexplored. This study aimed to evaluate and compare the accuracy of six state-of-the-art LLMs on the Polish Specialty Certificate Examination (SCE) in Perinatology, assessing their reliability for advanced clinical decision-making and sensitivity to task complexity.
Methods: We analyzed 240 single-best-answer questions from the two most recent SCE sessions in Perinatology. Questions were translated into English by a certified medical translator and then submitted, each in a fresh session without prior conversation memory, to six LLMs: ChatGPT 5.1, Gemini Flash 2.5, Gemini Pro 3.0, Claude Sonnet 4.5, Microsoft Copilot, and Llama 4 Maverick. Performance was evaluated based on overall accuracy. Additionally, questions were categorized as "Easy" or "Difficult" using a median split of the human-derived Difficulty Index (DI) to assess model robustness.
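The abstract does not spell out how the median split was computed; the sketch below illustrates one plausible implementation, assuming the DI of an item is the proportion of human examinees who answered it correctly (so a higher DI means an easier question) and that items at or above the median are labeled "Easy". The function name and the tie-handling rule are illustrative choices, not taken from the study.

```python
import numpy as np

def median_split_difficulty(di_values):
    """Split items into "Easy"/"Difficult" at the median Difficulty Index.

    Assumes the Difficulty Index (DI) of an item is the fraction of human
    examinees who answered it correctly, so higher DI = easier item.
    """
    di = np.asarray(di_values, dtype=float)
    cutoff = np.median(di)
    # Items at or above the median DI are labeled "Easy"; the rest "Difficult".
    # (Tie handling at the cutoff is an assumption of this sketch.)
    return np.where(di >= cutoff, "Easy", "Difficult")

# Hypothetical DI values for five items (fractions of correct responses):
labels = median_split_difficulty([0.82, 0.45, 0.67, 0.91, 0.30])
print(labels)  # ['Easy' 'Difficult' 'Easy' 'Easy' 'Difficult']
```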
Results: All evaluated models surpassed the standard 60% passing threshold. Gemini Pro 3.0 achieved the highest overall accuracy (89.5%), demonstrating statistically significant superiority over all other models (p<0.001) and maintaining stable performance across the "Easy" and "Difficult" question subsets (91% vs. 88%, p=0.210). In contrast, models such as Gemini Flash 2.5, Microsoft Copilot, and Llama 4 Maverick exhibited significant performance degradation on complex questions. For example, Llama 4 Maverick's accuracy dropped from 79% on easy questions to 54% on difficult questions (p<0.01).
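The abstract does not name the test behind the within-model Easy-versus-Difficult comparisons; a minimal sketch using Fisher's exact test on a 2x2 contingency table is one plausible approach and is shown below. The subset sizes (120 items each, from a median split of the 240 questions) and the correct-answer counts back-calculated from the reported Llama 4 Maverick accuracies are assumptions for illustration.

```python
from scipy.stats import fisher_exact

def compare_subsets(correct_easy, n_easy, correct_diff, n_diff):
    """Test whether a model's accuracy differs between Easy and Difficult
    items, using Fisher's exact test on a 2x2 contingency table."""
    table = [
        [correct_easy, n_easy - correct_easy],  # Easy: correct / incorrect
        [correct_diff, n_diff - correct_diff],  # Difficult: correct / incorrect
    ]
    _, p_value = fisher_exact(table)
    return p_value

# Illustrative counts approximating the reported Llama 4 Maverick rates
# (79% of 120 easy vs. 54% of 120 difficult items; counts are assumed):
p = compare_subsets(95, 120, 65, 120)
print(f"p = {p:.4f}")
```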
Conclusions: LLMs possess sufficient baseline knowledge to pass the highly specialized Perinatology SCE. However, only top-tier models maintained reliable reasoning when confronted with complex maternal-fetal scenarios. The significant performance drop of the remaining models on difficult questions highlights the danger of deploying unvalidated AI in subspecialty care.