Despite the impressive performance of LLMs on medical tasks, we still lack a clear understanding of what medical knowledge they actually contain, how it is structured, and how to access it in a deterministic and reproducible way. These models are trained on vast, heterogeneous corpora that include biomedical literature, clinical guidelines, and possibly patient-facing content, but the boundaries and provenance of their embedded knowledge remain opaque, and they handle newly emerging or fuzzy concepts poorly. As a result, when an LLM suggests a diagnosis or explains a treatment pathway, it is difficult to trace the origin of that information or to assess its reliability. This uncertainty complicates clinical validation, regulatory approval, and integration into decision support systems, and it erodes trust, especially in contexts where LLMs consistently underperform human judgement.
Moreover, the probabilistic nature of LLM outputs means that identical prompts can yield different responses, making it hard to guarantee consistency or auditability. Without mechanisms to interrogate and extract specific medical knowledge deterministically (e.g., structured querying, provenance tracking, or integration with external knowledge graphs), LLMs risk remaining powerful but unpredictable tools. Addressing this gap is essential for building trustworthy, explainable, and clinically useful AI systems in healthcare.
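For open-weight models at least, one way to remove sampling noise is greedy decoding. The sketch below is a minimal illustration, assuming a Hugging Face causal LM; the model name and prompt are placeholders, not choices this project has committed to. Repeated calls then return identical text on the same software and hardware stack; note that decoder determinism only makes the extraction reproducible, it does not make the underlying knowledge correct or traceable.

```python
# Minimal sketch: deterministic extraction from an open-weight model via
# greedy decoding. Model name and prompt are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; a biomedical model would be used in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def query_deterministic(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding: no sampling, so repeated calls return identical text
    (on the same software/hardware stack)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # disable sampling entirely
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Two identical prompts now yield byte-identical completions.
answer = query_deterministic("First-line treatment for type 2 diabetes is")
assert answer == query_deterministic("First-line treatment for type 2 diabetes is")
```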
Through systematic experimentation, this project aims to build an evidence base characterising what types of fuzzy medical information LLMs can reliably reproduce, under what conditions, and with what limitations. We will design controlled experiments that probe model responses across clinically relevant scenarios, comparing outputs against established guidelines, expert consensus, and curated knowledge bases.
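As a concrete illustration of what one such controlled probe could look like, the sketch below repeats a single prompt and scores both self-consistency and agreement with a reference answer. The `query_model` callable and the probe content are hypothetical; real experiments would additionally vary prompt phrasing, decoding parameters, and model versions.

```python
# Minimal sketch of one probing experiment. `query_model` is any callable
# that sends a prompt to an LLM and returns text; probe items and reference
# answers are illustrative, not real benchmark content.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Probe:
    prompt: str
    reference: str  # answer drawn from a guideline or curated knowledge base

def run_probe(query_model, probe: Probe, n_trials: int = 10) -> dict:
    """Repeat the same prompt and measure self-consistency and accuracy."""
    answers = [query_model(probe.prompt).strip().lower() for _ in range(n_trials)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "prompt": probe.prompt,
        "consistency": modal_count / n_trials,  # fraction agreeing with the modal answer
        "accuracy": answers.count(probe.reference.lower()) / n_trials,
        "distinct_answers": len(counts),
    }
```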
This work will be grounded in a benchmark and an accompanying methodology for making this embedded knowledge more accessible and reproducible: designing and documenting structured prompting techniques, integrating external ontologies and knowledge graphs, creating new semantic knowledge bases for fuzzy, evolving knowledge, and building tools for provenance tracking and response auditing. By combining empirical analysis with semantic technologies, we aim to transform LLMs from black-box generators into transparent, trustworthy components of medical systems.
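One possible shape for such provenance and auditing tooling is sketched below: each model response is checked against an external knowledge graph and logged to an append-only audit trail. This is a sketch under stated assumptions, not a committed design; the `medical_kg.ttl` file and the example.org predicate are hypothetical stand-ins for a real resource such as SNOMED CT or UMLS.

```python
# Minimal sketch: a provenance-tracked audit record combined with a
# knowledge-graph check. Assumes rdflib, a local RDF file `medical_kg.ttl`,
# and a hypothetical example.org predicate.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from rdflib import Graph, URIRef

kg = Graph().parse("medical_kg.ttl", format="turtle")
TREATS = URIRef("http://example.org/treats")  # placeholder predicate

@dataclass
class AuditRecord:
    timestamp: str
    model_id: str
    prompt: str
    answer: str
    kg_supported: bool  # was the extracted claim found in the knowledge graph?

def audit_claim(model_id: str, prompt: str, answer: str,
                drug_uri: str, condition_uri: str) -> AuditRecord:
    """Check a (drug, treats, condition) claim against the KG and log the result."""
    supported = (URIRef(drug_uri), TREATS, URIRef(condition_uri)) in kg
    record = AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_id=model_id,
        prompt=prompt,
        answer=answer,
        kg_supported=supported,
    )
    with open("audit_log.jsonl", "a") as f:  # append-only audit trail
        f.write(json.dumps(asdict(record)) + "\n")
    return record
```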

