2024-25 Project (Sharples & Keogh & Cowling)

Development and evaluation of clinical decision rules for early diagnosis of cancers, with application to high dimensional data from breath testing.



Professor Linda Sharples at LSHTM
Email: Linda.Sharples@lshtm.ac.uk


Professor Ruth Keogh at LSHTM
Email: Ruth.Keogh@lshtm.ac.uk


Dr Thomas Cowling at LSHTM
Email: Thomas.Cowling@lshtm.ac.uk


Project Summary

We are seeking an outstanding candidate for a PhD project at the intersection of statistics, health data science and clinical medicine. Using data on patients with suspected cancers of the liver and digestive system, the successful candidate will investigate methods for developing and evaluating clinical decision rules for early diagnosis. Key aims are to investigate methods for the development and evaluation of (i) clinical prediction rules separately for each cancer, or for all cancers combined (i.e. a binary outcome), and (ii) rules that predict the most likely cancer from a set of possible cancers.

The ideal candidate will have an MSc or equivalent experience in statistics, health data science or a related field, although a four-year scholarship that includes MSc training in the first year is also an option. We particularly welcome applications from candidates from underrepresented groups. Note that we will add further information on fairness in selection and LSHTM research culture using the wording recommended by our EDI committee, unless an overall statement is planned.

Project Key Words

Statistics, machine learning, clinical prediction, cancer.

MRC LID Themes

  • Global Health = No
  • Health Data Science = Yes
  • Infectious Disease = No
  • Translational and Implementation Research = Yes


MRC Core Skills

  • Quantitative skills = Yes
  • Interdisciplinary skills = Yes
  • Whole organism physiology = No

Skills we expect a student to develop/acquire whilst pursuing this project

  • Expertise in the design and analysis of studies to develop and evaluate clinical decision rules for cancer, based on high dimensional data.
  • Expertise in both statistical and machine learning methods for investigating prediction of single and multiple outcomes.
  • Experience of working in a multi-disciplinary environment (clinical, statistical, data science) to produce relevant, usable and generalisable methodology in this context.
  • Experience of communicating complex technical analyses to clinicians, patient groups and other non-statistical colleagues.


Which route/s is this project available for?

  • 1+4 = Yes
  • +4 = Yes

Possible Master’s programme options identified by supervisory team for 1+4 applicants:

  • LSHTM – MSc Medical Statistics

Full-time/Part-time Study

Is this project available for full-time study? Yes
Is this project available for part-time study? No


Particular prior educational requirements for a student undertaking this project

  • LSHTM’s standard institutional eligibility criteria for doctoral study.
  • An MSc in Medical Statistics, Statistics or Health Data Science, or a related MSc with a substantial quantitative component (e.g. Applied Mathematics, Engineering), would be ideal. For UK students, a first class or upper second class undergraduate degree in a subject with a substantial quantitative component (e.g. Maths, Statistics) is required. For non-UK students, comparable academic training and/or experience is required.

Other useful information

  • Potential CASE conversion? = No


Scientific description of this research project

Outcomes for people with gastrointestinal cancers such as bowel, oesophageal and pancreatic cancers are limited by diagnosis at a late stage. Each year 16,000 bowel cancer, 7,800 oesophageal cancer and 9,000 pancreatic cancer patients die in the UK (CRUK figures). If these cancers are detected early, when surgical treatment is possible, many patients will survive beyond five years. As the disease progresses, the chance of surgical treatment decreases and survival outcomes are poor. Therefore, early diagnosis is crucial to successful treatment.

In their early stages, these cancers are difficult to detect because symptoms are not specific to cancer. If cancer is suspected, a GP will refer the patient to hospital for an invasive endoscopy and, if appropriate, a biopsy. Together these two assessments comprise the reference test, which identifies whether a cancer is present and, if so, the nature of the tumour (site, type, grade). The prevalence of cancer confirmed by invasive tests is around 4-7%. Thus, the large majority of people referred for the reference test (roughly 93-96%) do not have cancer, yet they experience anxiety and undergo unpleasant tests unnecessarily.
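
To illustrate why prevalence matters when evaluating a diagnostic rule in this setting, the short Python sketch below applies Bayes' theorem to obtain positive and negative predictive values at the 4% and 7% prevalences quoted above. The sensitivity and specificity values, and the function name predictive_values, are hypothetical choices made purely for illustration; they are not results from the breath-test studies.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a binary test using Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(cancer | test positive)
    npv = true_neg / (true_neg + false_neg)   # P(no cancer | test negative)
    return ppv, npv


# Assumed accuracy values for illustration only (NOT study results).
ASSUMED_SENSITIVITY = 0.90
ASSUMED_SPECIFICITY = 0.80

# Prevalence range quoted in the text for patients referred for the reference test.
for prevalence in (0.04, 0.07):
    ppv, npv = predictive_values(prevalence, ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY)
    print(f"prevalence={prevalence:.0%}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Under these assumed values, the positive predictive value stays low because cancer is rare among referred patients, while the negative predictive value is high. This is one reason a non-invasive test may be most useful for ruling out cancer before invasive testing.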

Professor Hanna at Imperial College has developed a non-invasive breath test for early diagnosis of cancers. Using machine learning classifiers, he and his group have shown that volatile organic compounds that are exhaled in the breath are strongly associated with cancers in the digestive system.  Separate predictors for bowel, oesophageal and pancreatic cancers were sensitive and specific for these cancers in early research studies. Ongoing programmes will refine and validate prediction models for diagnosis against the reference test using statistical methods.   

Project objectives 
This project aims to investigate the utility of statistical and machine learning methods for developing and evaluating clinical prediction models in this context. Two main areas will be investigated. 
1. Methods for development and evaluation of clinical prediction rules separately for each cancer, or for all cancers combined (i.e. binary outcome).
2. Methods for development and evaluation of a test to predict the most likely cancer from a set of possible cancers.

The student will review the literature on development and validation of clinical prediction rules using both statistical and machine learning approaches. Thereafter, the ability to identify the most likely cancer will be investigated using multi-outcome statistical and multi-classifier machine learning methods. Aspects such as overall study size, size of each cancer sample, conditional dependence between cancers, number of predictors and study design (case-control, cohort, enrichment) will be investigated. Data from the LSHTM-Imperial collaboration will be used to illustrate existing methods.
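
As a rough illustration of how the two objectives relate, the Python sketch below shows one possible multi-outcome formulation: a multinomial (softmax) model with "no cancer" as the reference category. It assumes that a fitted model has already produced a linear predictor (log-odds relative to no cancer) for each cancer, and that outcomes are mutually exclusive, which ignores the conditional dependence between cancers mentioned above. The class labels, function names and numerical values are all hypothetical, and this is not the method used by the supervisory team or by Professor Hanna's group.

```python
import math

# Hypothetical outcome labels; "none" (no cancer) is the reference category.
CLASSES = ("none", "bowel", "oesophageal", "pancreatic")


def multinomial_probabilities(linear_predictors):
    """Map per-cancer log-odds (relative to 'none') to class probabilities."""
    scores = {"none": 0.0, **linear_predictors}
    max_score = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {label: math.exp(score - max_score) for label, score in scores.items()}
    total = sum(exp_scores.values())
    return {label: exp_scores[label] / total for label in CLASSES}


def most_likely_outcome(probabilities):
    """Objective 2: the single most probable outcome (may be 'none')."""
    return max(probabilities, key=probabilities.get)


def any_cancer_probability(probabilities):
    """Objective 1 (all cancers combined): probability of any cancer."""
    return 1.0 - probabilities["none"]


# Made-up linear predictors for one patient (illustration only).
example_patient = {"bowel": -1.0, "oesophageal": -2.5, "pancreatic": -3.0}
probs = multinomial_probabilities(example_patient)
print(probs)
print(most_likely_outcome(probs), round(any_cancer_probability(probs), 3))
```

A single multinomial model of this kind yields both the combined binary prediction and the most likely cancer. Fitting a separate binary model for each cancer, by contrast, gives probabilities that need not sum to one across outcomes, and this is one of the trade-offs the project could examine.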

Data availability 
The successful student will be added as a named researcher to the Data Sharing Agreement with Professor Hanna’s group at Imperial College. Appropriate ethical approval will be sought and must be in place before the data are accessed.

Potential risks 
It is possible that data collection will not be complete for all planned cancer studies. Ongoing recruitment is on track to complete by summer 2024. If data collection is delayed, the student can use data from previous studies, or use simulation, to explore different methods, pending completion of data collection and cleaning.

Further reading

(Relevant preprints and/or open access articles)

Additional information from the supervisory team

  • The supervisory team has provided a recording for prospective applicants who are interested in their project. This recording should be watched before any discussions begin with the supervisory team.
    Sharples-Keogh-Cowling Recording