Randomised controlled trials (RCTs) form the basis of evidence-based medicine, but their validity depends on how well recruited participants reflect the underlying patient population. Systematic under-representation of women, minority ethnic groups, and socio-economically deprived individuals has been reported [1, 2]. Our recent proof-of-concept study in a large NHS cardiac centre used natural language processing (NLP) over routinely collected electronic health records (EHRs) to compare coronary artery disease (CAD) RCT participants with the real-world hospital CAD population; it suggested that RCT participants had fewer comorbidities and included fewer women, patients from minority ethnic groups, and socio-economically deprived patients [3]. These findings point to structural recruitment biases that limit the generalisability of trial results.
This project will extend that work beyond a single condition and centre by developing computable eligibility pipelines that reconstruct “eligible-but-not-recruited” populations from EHRs and quantify representativeness at scale, ultimately informing equitable trial design and reporting. Most studies of trial representativeness rely on manual checks or registry summaries. This project instead turns trial eligibility criteria into code that runs on routinely collected EHRs, with assistance from large language models (LLMs) for analysing criteria and clinical notes at scale. It will deliver clear, interpretable metrics of representativeness (sex, ethnicity, age, deprivation, multimorbidity), addressing healthcare disparities with state-of-the-art NLP technologies.
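To make the idea of computable eligibility concrete, the sketch below shows one possible shape of such a pipeline: an eligibility criterion encoded as a predicate over a simplified patient record, and a simple representation ratio (subgroup share in the recruited cohort divided by its share in the eligible population). The schema, criteria, and thresholds are hypothetical and purely illustrative, not the project's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    age: int
    sex: str          # "F" or "M" (simplified for illustration)
    egfr: float       # renal function, mL/min/1.73 m^2
    in_trial: bool    # whether the patient was recruited into the RCT

def eligible(p: Patient) -> bool:
    # Hypothetical computable criterion for an imaginary CAD trial:
    # adults aged 18-80 with eGFR >= 30 (values are illustrative only).
    return 18 <= p.age <= 80 and p.egfr >= 30

def representation_ratio(cohort, population, predicate) -> float:
    """Subgroup share in the trial cohort divided by its share in the
    eligible reference population: 1 means proportionate representation,
    values below 1 indicate under-representation."""
    share_cohort = sum(predicate(p) for p in cohort) / len(cohort)
    share_pop = sum(predicate(p) for p in population) / len(population)
    return share_cohort / share_pop

# Toy EHR extract: six eligible patients, three of whom were recruited.
ehr = [
    Patient(55, "F", 70, False),
    Patient(62, "F", 55, False),
    Patient(48, "M", 80, True),
    Patient(71, "M", 45, True),
    Patient(66, "F", 60, True),
    Patient(59, "M", 90, False),
]
eligible_pts = [p for p in ehr if eligible(p)]
cohort = [p for p in eligible_pts if p.in_trial]

# Women are 1/3 of the cohort but 1/2 of the eligible population.
print(round(representation_ratio(cohort, eligible_pts,
                                 lambda p: p.sex == "F"), 2))  # 0.67
```

In the full project, the `eligible` predicate would be generated per trial (with LLM assistance for parsing free-text criteria and notes) and the ratio computed per subgroup of interest (sex, ethnicity, age band, deprivation quintile, multimorbidity count).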
The project aims to (1) measure how representative RCT cohorts are compared with the real underlying population that meets each trial’s criteria; (2) identify drivers of under-representation and selection bias at the patient, clinician, service, and protocol levels; (3) improve equity by designing and testing simple, actionable changes to eligibility criteria, screening, and outreach; and (4) deliver a generalisable, transparent tool that RCT teams can adopt.

