Article Text

Predicting need for hospital admission in patients with traumatic brain injury or skull fractures identified on CT imaging: a machine learning approach
  1. Carl Marincowitz1,
  2. Lewis Paton2,
  3. Fiona Lecky1,
  4. Paul Tiffin3
  1. 1 Centre for Urgent and Emergency Care Research (CURE), Health Services Research, School of Health and Related Research, The University of Sheffield, Sheffield, UK
  2. 2 Department of Health Sciences, University of York, Alcuin College, York, UK
  3. 3 Hull York Medical School, Department of Health Sciences, University of York, York, UK
  1. Correspondence to Dr Carl Marincowitz, School of Health and Related Research (ScHARR), The University of Sheffield, Sheffield, UK; c.marincowitz{at}sheffield.ac.uk

Abstract

Background Patients with mild traumatic brain injury on CT scan are routinely admitted for inpatient observation. Only a small proportion of patients require clinical intervention. We recently developed a decision rule using traditional statistical techniques that found neurologically intact patients with isolated simple skull fractures or single bleeds <5 mm with no preinjury antiplatelet or anticoagulant use may be safely discharged from the emergency department. The decision rule achieved a sensitivity of 99.5% (95% CI 98.1% to 99.9%) and specificity of 7.4% (95% CI 6.0% to 9.1%) to clinical deterioration. We aimed to transparently report a machine learning approach to assess if predictive accuracy could be improved.

Methods We used data from the same retrospective cohort of 1699 initial Glasgow Coma Scale (GCS) 13–15 patients with injuries identified by CT who presented to three English Major Trauma Centres between 2010 and 2017 as in our original study. We assessed the ability of machine learning to predict the same composite outcome measure of deterioration (indicating need for hospital admission). Predictive models were built using gradient boosted decision trees, an ensemble of decision trees designed to optimise predictive performance.

Results The final algorithm achieved a mean positive predictive value of 29%, mean negative predictive value of 94%, mean area under the curve (C-statistic) of 0.75, mean sensitivity of 99% and mean specificity of 7%. As with logistic regression, GCS, severity of injury and number of brain injuries were found to be important predictors of deterioration.

Conclusion We found no clear advantages over the traditional prediction methods, although the models were, effectively, developed using a smaller data set, due to the need to divide it into training, calibration and validation sets. Future research should focus on developing models that provide clear advantages over existing classical techniques in predicting outcomes in this population.

  • trauma
  • research
  • head
  • imaging
  • CT/MRI



Key messages

What is already known on this subject

  • We have previously empirically derived a clinical decision rule to select low risk patients with injuries on CT imaging following head trauma for discharge from the emergency department using traditional statistical methods, based on logistic regression. The decision rule is highly sensitive but lacks specificity and implementation would allow only a small proportion of patients to be discharged. Machine learning may theoretically improve the accuracy of prediction, allowing more patients to be safely discharged.

What this study adds

  • Using data from the same cohort as our previous study, we used a machine learning approach to predict which patients in the sample were likely to deteriorate. We found no clear improvement in prediction over a model previously developed using a classical statistical approach.

Introduction

There are 1.4 million annual attendances to emergency departments (ED) in England and Wales following head trauma.1 Of these, 95% of patients attend with an initial Glasgow Coma Scale (GCS) score in the range 13–15 and are defined as having a minor head injury.2 Around 7% of these patients have brain injuries and skull fractures identified by CT imaging.3 In the UK, patients with injuries identified by CT are routinely admitted for observation, although only a small proportion clinically deteriorate.4 Internationally, some advocate routine admission of patients with injuries on CT to higher-dependency areas due to the risk of deterioration, while others advocate use of the Brain Injury Guideline (BIG) criteria to select patients for discharge from the ED.5 6

Accurate risk prediction of clinically important deterioration in GCS 13–15 patients with traumatic injuries identified by CT imaging could allow the discharge of low risk patients from the ED. Patients with expanding intracranial haemorrhage can rapidly and catastrophically deteriorate. This risk must be weighed against the potential advantage of any reduction in hospital admissions. Thus, the use of predictive models to select patients for discharge may be controversial in some clinical settings. The consequences of discharging a patient who deteriorates (a ‘false negative’) are much greater than admitting a patient who remains stable (a ‘false positive’). Therefore, accurate prediction of patients who will not deteriorate is more useful than accurately predicting every patient’s risk of deterioration. We recently developed a risk prediction model and decision rule for discharge from the ED for this traumatic brain injury (TBI) population using traditional statistical approaches.7 8 Our derived decision rule outperformed existing guidelines, achieving a high sensitivity to a composite outcome of deterioration encompassing need for hospital admission, but lacked specificity.

Logistic regression, using maximum likelihood estimation, optimises predictive accuracy across the range of possible probabilities of deterioration. Advocates of machine learning have highlighted that more flexible modelling techniques may better capture non-linear relationships and interactions between the variables in the data. The use of ‘ensemble learning’, which combines the results of multiple models to make final predictions, is a way of addressing the ‘bias-variance trade-off’. That is, the potential bias from multiple models can be averaged out, or otherwise combined, to achieve more consistent predictive accuracy. Thus, theoretically, machine learning-based prediction could achieve higher levels of accuracy compared with traditional statistical modelling approaches.

However, at least for structured data (ie, that already in numeric format) this has not been firmly established. A recent systematic review reported that, on average, machine learning-based models tended to outperform predictive models that use logistic regression techniques, but only for studies deemed at high risk of bias.9 Moreover, others have raised concerns that machine learning derived models are prone to ‘overfitting’. That is, they accurately replicate the relationships in the data used to train them but may fail to generalise to new, unseen data sets. There have also been concerns over a lack of transparency and consistency in reporting the results from observational studies using machine learning.10 This raises issues with the validity and generalisability of the results reported from machine learning studies that purport to form the basis of current or future clinical decision support tools.

We, therefore, aimed to use machine learning to develop a predictive model which can accurately identify patients with TBI and skull fractures on CT imaging at very low risk of deterioration who could be safely discharged. We used the same data set as in Marincowitz et al 7 8 so that we were able to understand the predictive potential of machine learning, compared with the tool developed using traditional statistical approaches. Our objective was to build a machine learning model and report our results in a way that was transparent and reproducible and that accurately quantified the uncertainty around predictive precision. By doing so, we aimed to address previous criticisms and establish whether the potential advantages of such an approach, employing the latest machine learning methods for structured data, outweighed any limitations inherent to the approach in this context.10

Materials and methods

Study design

Data were analysed from an existing retrospective cohort study using case note review of patients with TBI presenting to the ED between 2010 and 2017 at three Major Trauma Centres in England: Hull University Teaching Hospitals National Health Service (NHS) Trust, Salford Royal NHS Foundation Trust, and Addenbrooke’s Hospital (Cambridge University Hospitals NHS Foundation Trust). Both a detailed study protocol7 and a cohort study using traditional statistical techniques8 have previously been published. We previously used multivariable logistic regression with bootstrap internal validation to derive a predictive model which included: initial GCS, preinjury antiplatelet or anticoagulant use, first neurological examination, number of injuries on CT imaging, severity of brain injury, severity of extracranial injuries and initial haemoglobin value. Our previously derived model is presented in online supplemental material 1.


Inclusion criteria

Patients aged ≥16 with a presenting GCS score of 13–15 who attended the ED following acute head trauma and had injuries reported on CT brain scan. The latter was defined as: skull fractures, extradural haemorrhage, subdural haemorrhage with an acute component, traumatic intra-cerebral haemorrhage, contusions, traumatic subarachnoid haemorrhage and traumatic intra-ventricular haemorrhage.

Exclusion criteria

Patients were excluded where a non-traumatic cause of intracranial haemorrhage was indicated, where a pre-existing CT abnormality prevented determining whether an acute injury had occurred, or where the patient had been transferred from another hospital.

Primary outcome

A composite measure of deterioration aimed at encompassing need for hospital admission was used. This included any of the following within 30 days of ED attendance: death attributable to TBI, neurosurgery, seizure, a drop in GCS of >1, Intensive Care Unit (ICU) admission for TBI, intubation or hospital readmission for TBI. Where the reason for death, ICU admission or readmission was unknown, it was attributed to TBI deterioration.

Data collection

ED CT brain scan requests and reports were screened at each centre to identify patients with TBIs or skull fractures. Patients with identified injuries were matched to their full electronic and written case records to determine if they met the inclusion criteria. Where they did so, data on patient deterioration outcomes and candidate predictors were extracted by trained research staff using a standardised electronic proforma.

Data preprocessing

For each run of model building and testing, the data were split equally into three subsets. These formed a ‘training set’ on which to develop the predictive algorithm, a ‘calibration set’ on which to build the model for probability recalibration (see below), and a ‘test set’ which was ‘held back’ to validate the final algorithm. Stratified random sampling was used to ensure an equal distribution of the primary composite outcome of deterioration between sets.
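For illustration, the stratified three-way split could be sketched as follows (a minimal Python sketch; the study's analysis was carried out in R, and the function name shown here is hypothetical):

    from sklearn.model_selection import train_test_split

    def three_way_split(X, y, seed):
        # Hold back one third as the test set, stratified on the deterioration outcome.
        X_rest, X_test, y_rest, y_test = train_test_split(
            X, y, test_size=1/3, stratify=y, random_state=seed)
        # Split the remainder 50:50 into training and calibration sets, again stratified.
        X_train, X_cal, y_train, y_cal = train_test_split(
            X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
        return (X_train, y_train), (X_cal, y_cal), (X_test, y_test)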

Predictive model building

Our predictive models were built using gradient boosted decision trees via the CatBoost package11 in R.12 Gradient boosted decision tree models consist of an ensemble of decision trees, aiming to optimise model performance. The method was selected as it is known to work well even with small and medium-sized data sets (ie, several hundred to several thousand observations). This approach combines several methodological approaches to prediction: the use of decision trees; ‘ensembling’, where numerous slightly differing models are created and their results averaged or voted on; and ‘boosting’, where the algorithm successively focuses on the observations whose outcome is most difficult to predict. By combining all three approaches, gradient boosted trees tend to outperform algorithms which use only one or two of these methods; for example, the majority of winning solutions in the ‘Kaggle’ prediction competitions feature ensembles of boosted trees.13 CatBoost extends this approach in the way it treats categorical (and, in this case, ordered) predictor variables. The software recodes such categorical variables to numeric values, depending on their observed relationship with the outcome of interest. This can potentially increase the amount of information available to predict the outcome of interest. The CatBoost algorithm has two main ‘hyperparameters’ that can be changed in order to improve predictive accuracy and generalisability: the number of decision trees to grow, and the number of variables to select at each node of the trees. The process of choosing hyperparameter settings is known as ‘tuning’.
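As a rough illustration of these two hyperparameters, a boosted-tree classifier might be configured along the following lines (a Python sketch using the CatBoost Python API; the study itself used the CatBoost package in R, and the specific values shown are assumptions, not the tuned values from this study):

    from catboost import CatBoostClassifier

    model = CatBoostClassifier(
        iterations=200,    # number of decision trees to grow
        rsm=0.75,          # fraction of predictors sampled when choosing each split (assumed value)
        loss_function="Logloss",
        eval_metric="AUC",
        verbose=False,
    )
    # cat_features tells CatBoost which columns are categorical, so that its
    # target-based recoding of categorical predictors is applied to them.
    # model.fit(X_train, y_train, cat_features=categorical_column_indices)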

Model building proceeded as follows (also see figure 1). When predicting relatively uncommon outcomes, it is important to stop the algorithm focusing on achieving high accuracy simply by predicting the most prevalent outcome (in this case, a lack of deterioration). For this reason, we used the ‘Synthetic Minority Over-sampling Technique’ (SMOTE) to create synthetic observations with the relatively uncommon outcome of deterioration in the training data set.14 These synthetic patients are based on the actual data on patients with recorded deterioration, and are created using a ‘k-nearest neighbours’ approach to ensure the training data set has an apparent 50:50 split of participants between the two outcome types (deterioration/no deterioration). We used the default value of k=5. Thus, this preprocessing step helps the algorithm train to predict the less common outcome (in this case, deterioration). The CatBoost model is then fitted to the training data set, learning how to link the predictor variables to the outcomes. This step involves a ‘tuning’ phase in which the model hyperparameters (eg, number of decision trees) are altered in order to optimise predictive performance. A grid search over possible values of the hyperparameters was performed to find the combination that maximised the area under the receiver operating characteristic curve (AUC, equivalent to the ‘C-statistic’) on the training data set. This was done on a sample of the training data. The final model is then applied to the previously unseen test data set to predict the class (ie, deterioration or not) and probability of deterioration for each individual in the test data set.
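In code, these steps might look roughly as follows (a Python sketch continuing from the split above; the study used R, SMOTE is taken here from the imbalanced-learn package and assumes numeric predictors, the candidate hyperparameter values are purely illustrative, and GridSearchCV tunes by cross-validation rather than the single tuning split described in the text):

    from catboost import CatBoostClassifier
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import GridSearchCV

    # Rebalance the training set by creating synthetic 'deterioration' cases (k=5 neighbours).
    X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_train, y_train)

    # Tune the two main hyperparameters by grid search, maximising AUC.
    grid = GridSearchCV(
        CatBoostClassifier(loss_function="Logloss", verbose=False),
        param_grid={"iterations": [100, 200, 400], "rsm": [0.5, 0.75, 1.0]},
        scoring="roc_auc",
    )
    grid.fit(X_bal, y_bal)

    # Apply the tuned model to the previously unseen test set.
    predicted_probability = grid.best_estimator_.predict_proba(X_test)[:, 1]
    predicted_class = (predicted_probability >= 0.5).astype(int)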

Figure 1

Flow chart of the machine learning model building and validation process. PPV, positive predictive value.

The predicted probabilities from a decision tree classification tend to cluster around a central point in order to maximise the accuracy metric used to optimise the algorithm. This means that the accuracy at predicting one class vs the other is maximised. However, the resulting predicted probabilities tend not to reflect the true underlying probabilities. This matters if, for example, one wishes to change the predicted probability threshold separating cases from non-cases in order to, say, minimise false negative cases. Predicted probabilities from such machine learning models can be mapped onto those more likely to reflect the true underlying probabilities through a process known as recalibration. This involves building a second model using a separate portion of the data, not previously used for building the original machine learning model. This second model seeks to predict the true underlying probabilities, as represented approximately by the observed frequencies of the outcome type, from those predicted by the first-phase machine learning model. In this case, we used an isotonic regression model on this separate subset of data (the calibration set) to link the predicted probabilities to the approximate probability of observing the actual outcome (deterioration).15 Thus, running both the machine learning model and the recalibration model in series provided the final predicted probabilities, which were used to classify the patients in the final, unseen validation data set in terms of risk of deterioration (high vs low risk). Metrics of model performance (eg, AUC, negative predictive value (NPV), etc) were then calculated.
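As an illustration, the recalibration step might be sketched as follows (Python; scikit-learn's IsotonicRegression standing in for the isotonic model described above, with `model` a fitted classifier from the training step):

    from sklearn.isotonic import IsotonicRegression

    # Fit the recalibration model on the separate calibration set: map the raw
    # predicted probabilities onto the observed outcome frequencies.
    raw_calibration = model.predict_proba(X_cal)[:, 1]
    recalibrator = IsotonicRegression(out_of_bounds="clip")
    recalibrator.fit(raw_calibration, y_cal)

    # Running both models in series gives the final predicted probabilities for
    # the unseen test (validation) set.
    raw_test = model.predict_proba(X_test)[:, 1]
    calibrated_test = recalibrator.predict(raw_test)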

Due to the stochastic nature of this algorithm development (eg, data set splitting, imputation, etc) we repeated the entire process 2500 times, and the performance metrics were stored for each run. The exception was the estimation of the optimal model hyperparameters in the ‘model tuning’ phase, which was performed only once, on a single training sample that was itself split into a tuning training set and a tuning validation set. Performing tuning only once eased computational requirements and was possible because of the stability of the results generated from the tuning phase. From the second iteration onwards, the model hyperparameters were thus set at the values decided in the tuning phase for the first iteration. The overall performance of the models was evaluated by calculating the mean accuracy metrics over the 2500 iterations. A measure of the spread of the results was calculated using the values at the 2.5th and 97.5th percentiles, giving a central 95% interpercentile range.
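For example, the per-iteration metrics could be summarised as follows (Python sketch; `all_metrics` is assumed to be a list holding one dictionary of accuracy metrics per iteration):

    import numpy as np

    def summarise(all_metrics, key):
        values = np.array([m[key] for m in all_metrics])
        return {
            "mean": values.mean(),
            "2.5th percentile": np.percentile(values, 2.5),
            "97.5th percentile": np.percentile(values, 97.5),
        }

    # e.g. summarise(all_metrics, "npv") gives the mean NPV and its central
    # 95% interpercentile range over the 2500 runs.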

As the aim of the model was to decide which patients were relatively safe to discharge from the ED, we selected an overall predicted probability threshold that led to a relatively high NPV, although at the expense of positive predictive value (PPV). That is, we wanted a predictive system that was relatively good at deciding which patients were safe to go home, even if a significant proportion were flagged as requiring further observation that might be unnecessary. The cost of false positives, in terms of patient care and potential consequences, was lower than that for false negatives. Thus, our aim in recalibration was to achieve a diagnostic prediction system that performed at least as well as the BIG criteria8 (ie, an NPV of at least 96.5% and a minimum PPV of 28%). Our use of a separate recalibration model for the initial predicted probabilities allowed us to move the threshold for the predicted probabilities in this way. This had a negligible impact on overall model performance (the AUC values for the model), although recalibration reclassified the error types produced. This meant that the final model output could be adjusted to minimise the risk of false negatives (patients predicted to be at low risk, but who actually did deteriorate) while maintaining acceptably useful levels of specificity (ie, the ability to identify ‘true negatives’).
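A simple way to pick such a threshold on the recalibrated probabilities is sketched below (Python; the target NPV is taken from the BIG-criteria benchmark quoted above, and the linear scan over candidate thresholds is an assumption, since the exact search used in the study is not described):

    import numpy as np

    def choose_threshold(calibrated_probability, deteriorated, target_npv=0.965):
        """Return the largest probability threshold whose NPV still meets the target."""
        best = None
        for threshold in np.linspace(0.01, 0.50, 50):
            low_risk = calibrated_probability < threshold  # patients flagged as safe to discharge
            if low_risk.sum() == 0:
                continue
            npv = (deteriorated[low_risk] == 0).mean()
            if npv >= target_npv:
                best = threshold
        return best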

Missing data

Missing values for predictor variables were imputed using a single imputation via the Amelia II package for R, which uses an Expectation-Maximisation Bootstrap-based algorithm.16 This process was stochastic, and each iteration of model building included a new round of imputation. Thus, the missing data imputation was incorporated into the loop of data set splitting and model building. This was important in order to account for the uncertainty of this process when evaluating the overall performance of the models.
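Amelia II is an R package; as a rough Python stand-in only, a single stochastic imputation per iteration could be sketched with scikit-learn's IterativeImputer (note this is a different algorithm, chained equations rather than expectation-maximisation with bootstrapping):

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def impute_once(X, seed):
        # sample_posterior=True makes each imputation stochastic, so re-running this
        # inside the model-building loop propagates imputation uncertainty.
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        return imputer.fit_transform(X)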

Patient and public involvement

The Hull and East Yorkshire NHS Trust Trans-Humber Consumer Research Panel and Hull branch of the Headway charity were consulted in the initial stages of developing the research questions addressed in this study.

Results

Study population

Figure 2 summarises the screening of ED CT requests and the inclusion of patients on matching to case records at each centre, and table 1 summarises the population characteristics and candidate variables. The cohort was mostly male, with around half of patients >60 years of age and one quarter with either preinjury anticoagulant or antiplatelet use. A total of 470 patients (27.7%; 95% CI 25.5% to 29.9%) clinically deteriorated as defined by the primary outcome. A total of 223 patients (13.1%; 95% CI 11.6% to 14.8%) underwent neurosurgery, were admitted to ICU or were intubated. A total of 72 patients had deaths attributable to TBI. A total of 471 patients had data missing for at least one candidate variable on case note review.

Table 1

Variables and population characteristics

Figure 2

Population selection.

Model parameters

The final model hyperparameters were determined in the tuning phase: 200 trees were created at each run, with nine predictor variables selected at random from the data set to create a split at each node (branch split).

Model performance

For each of our 2500 model runs, our test data set consisted of a random sample of 576 patients. Of these, the median number of patients the model predicted could be discharged across our 2500 models was 26 (‘true negatives’), and the median number of deteriorations among those ‘discharged’ (ie, ‘false negatives’) was one. As can be seen from the results in table 2, the mean NPV indicates that, on average, 94% of those who were recommended for discharge by the model did not deteriorate. Across all 2500 runs of our model, the value of the 2.5th percentile for NPV was 81%, and the 97.5th percentile was 100%. The mean PPV of our models was 29% (2.5th–97.5th interpercentile range 28%–31%). The mean sensitivity was 0.99 (0.96–1.00) and the mean specificity 0.07 (0.01–0.17). The mean AUC (equivalent to a C-statistic), an overall metric of the potential utility of the model, was 0.75, with an interpercentile range of 0.71–0.78.

Table 2

Predictive ability of the machine learning based models in the test (validation) data sets according to mean accuracy metrics

The CatBoost process did not produce interpretable models as such. However, the output for each run of the model produced ‘importance’ metrics for the predictors. This metric gives a normalised score to each variable which describes how much the prediction changes if the value of the predictor changes. Ranking the predictors by the mean importance scores therefore gives some indication of which variables the model finds most useful in predicting deterioration status. In table 3, we provide the mean importance scores for the predictors, averaged over 100 runs. As can be seen in table 3, we observed that severity of the injury is deemed most important, followed by GCS, number of injuries, the particular hospital the patient was admitted to, and the presence of subdural haemorrhage.
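For illustration, the importance scores could be extracted and averaged along the following lines (Python CatBoost API; `models` is assumed to hold the fitted model from each run and `feature_names` the predictor names):

    import numpy as np

    importances = np.array([m.get_feature_importance() for m in models])
    mean_importance = importances.mean(axis=0)
    # Rank predictors by their mean importance score across runs.
    ranked = sorted(zip(feature_names, mean_importance), key=lambda pair: pair[1], reverse=True)
    for name, score in ranked:
        print(f"{name}: {score:.1f}")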

Table 3

Ranked mean ‘importance’ of the features (predictors) in the model, averaged over 100 runs

Discussion

This is the first study to report the performance of a machine learning approach to predicting the need for hospital admission in this TBI population. Our final algorithm, over 2500 runs, achieved a mean PPV of 0.29, mean NPV of 0.94, mean AUC (C-statistic) of 0.75, mean sensitivity of 0.99 and mean specificity of 0.07. These performance metrics are broadly the same as those recently reported for a classical approach to predictive modelling on the same data set using logistic regression and the BIG criteria, although we report a slightly lower mean NPV (94%) than both the BIG criteria (96.5%) and the logistic regression model (97.7%).8

The modelling process suggested that the most important variables for predicting deterioration were injury severity, GCS and the number of injuries. While a direct comparison with the previous logistic model developed on these data is not possible, due to some differences in data management and sampling (ie, in the present study the data set was divided into three portions), the largest ORs in the logistic model also related to injury severity, GCS and number of injuries.8 Other predictors in the logistic regression model included extracranial Injury Severity Score (ISS) value, anticoagulant and antiplatelet use, an abnormal neurological examination and haemoglobin value. The presence of specific types of injury appeared more important in the machine learning models, which may be because the modelling was able to account for interactions between injuries when they co-occurred.

Strengths and limitations

Our model appears similar to the previous logistic regression model in terms of both performance metrics and the variables apparently most important in predicting deterioration. The sole advantage of using the machine learning approach in this context appeared to be that the model was developed on considerably fewer data (approximately one-third of those available to the previous model employing a classical statistical approach). Our original study had a sample size powered to derive a predictive model from our candidate variables using multivariable logistic regression. This, however, represents a relatively small sample size for developing machine learning models. Moreover, the effective sample size in the present study was smaller still because of the requirement to recalibrate the probabilities from the models being developed. Despite this, the machine learning approach achieved broadly similar performance metrics.

Theoretically, given greater data availability, the machine learning model may have outperformed the classical approach. It may be possible to achieve larger effective sample sizes via alternative methodological approaches. We split our data into three equal sets (‘training’, ‘calibration’ and ‘test’). However, this may not be the optimal division of the original data, and this could have been assessed using sensitivity analyses. It could also be worth considering, in future studies, training sets stratified on key predictor variables. It may also have been possible to reduce ‘data spend’ by using a cross-validation approach to model calibration, rather than having a third, separate portion. This would have required the data to be split only into training and test data sets. Training and calibration of the model would then take place on different ‘folds’ (ie, further subsets) of these data, rather than using a separate ‘calibration’ data set as we did in this study. In this study, we used a separate data set and isotonic regression as this approach was easily implemented in the workflow. In addition, the relative sparsity of one of the two outcome categories (deterioration) may have meant that the recalibration model would have benefited from a larger number of such outcomes being present in the data set portion it was built on. However, we recognise that alternative methodologies, such as recalibration using cross-fold validation, may have worked at least as well, or perhaps better, in the context of a relatively small data set.
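The cross-validated calibration alternative mentioned above could, for example, look like the following (Python; scikit-learn's CalibratedClassifierCV shown purely as an illustration of the idea, not as the approach used in this study):

    from catboost import CatBoostClassifier
    from sklearn.calibration import CalibratedClassifierCV

    base_model = CatBoostClassifier(iterations=200, verbose=False)
    # Training and isotonic calibration share cross-validation folds of the same
    # training data, so no separate calibration set needs to be held back.
    calibrated_model = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
    # calibrated_model.fit(X_train, y_train)
    # calibrated_model.predict_proba(X_test)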

Machine learning models often overfit to the data on which they are trained. This leads to poorer performance in external validation data and, hence, impaired generalisability. The ‘CatBoost’ algorithm used here includes an ‘overfitting detector’ which can stop the model training process if overfitting is observed during training.17 Our use of previously unseen (‘hold out’) validation data samples would also have helped to ensure realistic estimation of the performance of our derived models. However, it should be highlighted that even though such validation data sets had not been used to train the models, they were still derived from the same study population. Also, ideally, tuning would have been carried out on an independent, fourth portion of the data, rather than a random subsample of a single training data set (ie, a sixth of the total data set). The limited size of our sample precluded this. While cross-fold validation is also commonly used to initially tune the hyperparameters used by a machine learning approach, this is also a resampling technique and would not have avoided this issue. This issue could also have contributed to some degree of overfitting and, again, adversely affected the generalisability of the model to completely independent data sets. Thus, it would be important, as part of future validation work, to assess the performance of this machine learning model in a totally independent sample, drawn from a completely separate population of patients.

Our use of a ‘k-nearest neighbours’ algorithm (SMOTE) to generate synthetic data to rebalance the outcome variable will have reduced some of the risk of CatBoost focussing on predicting the most common, ‘negative’ cases at the expense of positive cases, where deterioration occurred. However, we only used the default value (k=5) for the ‘nearest neighbours’ method to generate synthetic observations for this step of our methods. Different values for k are unlikely to have substantially impacted on our findings. However, a sensitivity analysis over plausible values of k could have been performed to assess the stability of this assumption.

The models derived with our machine learning approach would require the availability of reasonable amounts of computational power to be applied clinically and this represents a potential barrier for implementation into clinical practice. A simplified version of the model, which used fewer important predictor variables, as identified via the ‘importance’ metric, could be used. Such a reduced model may be easier to implement in a clinical setting although may not perform as well. Thus, there would be a trade-off between model complexity and potential model performance and utility.

Ideally, all predictor variables would be completely independent of each other. However, this was not the case in this data set, as with the data used in the original study. Nevertheless, given the way that variables are randomly sampled in the machine learning model building process, the less powerful predictor of a pair of relatively dependent variables would tend to be discarded when the results were recombined. In the present study, this might also have been reflected in the ‘importance’ values for the different predictors included. Excessive dependency between predictor variables would also have caused convergence issues with our models, and such problems were not observed. Nevertheless, future studies in this area would ideally consider collecting (or combining) variables to ensure the relative independence of the predictors from each other.

This study used a representative data set drawn from a population of patients presenting to the ED with traumatic brain injury. However, the training data were drawn from only three hospitals and therefore generalisability cannot be assumed. Nevertheless, our use of iterative model building provided a better estimate of the uncertainty of our results, and thus of the potential generalisability, than would normally be reported in machine learning based predictive studies. Also, by using a recalibration model within the process we were able to change the decision threshold to increase the NPV, while maintaining a relatively low, but potentially acceptable, PPV. Moreover, we used the latest algorithms to make the most of categorical data, as well as employing methods to adjust for relatively uncommon (unbalanced) outcomes and missing data. In common with other machine learning methods, interpreting the predictive models is much more challenging than with classical approaches, although importance metrics aid somewhat in this regard.

Implications

On the basis of these findings, there would not be a strong case for moving to a more complex modelling approach compared with logistic regression or rule-based algorithms at this time. However, it may be that, as more data become available, the advantages of machine learning approaches outweigh their limitations. Also, as more data are routinely electronically captured, it might be that machine learning systems are able to capitalise on a wider range of predictor variables. Certainly, to date, the situations where machine learning seems to provide an advantage over conventional statistical approaches are those where there are large quantities of unstructured data to learn from. Such clinical scenarios include classification tasks related to medical imaging18 or the natural language processing of free-text health records.19 Such research should be reported transparently and according to consistent reporting standards, such as those that build on the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines for prognostic studies.20

Our machine learning models would select patients for discharge with around a 1 in 26 chance of subsequently deteriorating. Whether this would be perceived as a clinically acceptable risk would depend on both clinicians’ and patients’ risk appetites and the circumstances to which a patient was being discharged. This is likely to be seen as too high a risk if a patient is being discharged somewhere they will not be monitored by family or cannot easily return to hospital if their condition changes. Moreover, current National Institute for Health and Care Excellence guidelines advise that, following head trauma, a patient should only be discharged from the ED if they can be observed at home by a responsible adult for at least 24 hours.

Future research should focus on comparing the model performance of this machine learning-based algorithm to the earlier logistic regression-based predictive model in an external validation data set. Moreover, it is important to assess the actual, real-world impact of any predictive decision-making tool on actual patient care and clinical outcomes. The acceptable risk of deterioration to both patients and clinicians when discharging a patient from the ED is subjective and will vary depending on the individuals’ risk appetite. Further research is needed to quantify acceptable risk of deterioration in this TBI population and how different risk prediction models could be used to support shared decision making in this context.

Conclusion

The predictive performance of our machine learning approach was similar to that of our logistic regression-based model. The risk of deterioration in a patient recommended for discharge, though relatively small, may, nevertheless, be still too high to be used clinically. Further research should be focused on developing models that provide clear advantages over existing, classical techniques for predicting outcomes in both this and external patient data sets. In addition, as in the present study, care should be taken to communicate the uncertainty over the results in order to convey a realistic appraisal of how such models are likely to perform across settings. Such rigour is essential if machine learning is to find its correct place within healthcare technology.

Data availability statement

Clinical data were collected and anonymised by members of the direct care team without consent but with NHS research ethics approval. The NHS research ethics approval limits access to individual-level patient data to members of the research team. We are able to provide summary data on reasonable request.

Ethics statements

Patient consent for publication

Ethics approval

NHS Research Ethics Committee approval was granted by West of Scotland REC 4 (reference: 17/WS/0204). As a retrospective case review conducted by members of the direct care team, consent was not required.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Handling editor Katie Walker

  • Contributors The idea for the study was conceived by CM and PT with help from FL and LP. The analysis was completed by PT and LP with clinical specialist advice regarding the interpretation of results from CM and FL. All authors read and approved the final manuscript.

  • Funding FL is supported by the European Union Framework 7 Collaborative European Neurotrauma Effectiveness Research in Traumatic Brain Injury ((EC grant 602150)) and NHS Trusts 'Trauma Audit and Research Network - www.tarn.ac.uk' (Grant Number Not Applicable/NA). CM is a National Institute for Health Research (NIHR) Clinical Lecturer in Emergency Medicine (Grant Number Not Applicable/NA). PT is funded in his research by an NIHR Career Development Fellowship (CDF-2015-08-11). This publication presents independent research funded by the National Institute for Health Research, University of Sheffield and University of York.

  • Disclaimer The views expressed are those of the author(s) and not necessarily those of the University of Sheffield, University of York, the NHS, the NIHR or the Department of Health and Social Care.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
