Table 1  Key considerations when appraising an EM AI/ML paper

Internal validity

Study question
  1. Are the study question and aim clearly stated?

  2. Are a comparison with the current EM practice baseline and the rationale for improvement explicitly stated?

  3. Is the use of ML justifiable as opposed to traditional statistical methods?

  4. Is the strategy for model development, validation and evaluation clearly described in a clinical EM context?

Baseline data
  1. Is sufficient good-quality data available? Has a sample size assessment been made?

  2. Is data representative of the population and setting in which the model is being deployed?

  3. Has data preprocessing been clearly described and handled in an EM-applicable manner? Is this reproducible?

  4. Have redundant (noisy or collinear) variables been identified and addressed?

  5. Has any baseline class imbalance been addressed, and is the method appropriate to the intended EM task?

  6. Is the ground truth valid, and does it reflect the EM ‘gold standard’?

Data split/Resampling
  1. Has the approach to data resampling been clearly described? (Is a data flow diagram provided?)

  2. Have internal and external validation test sets been clearly identified, and are they as independent as possible?

  3. Is the data split clear, rationalised and stratified if appropriate?

  4. Has data leakage been avoided?

Algorithm selection
  1. Has the rationale for selecting candidate algorithms been explained in relation to the study aims? Has overfitting been explicitly addressed?

  2. Are candidate algorithms appropriate to EM and representative of a range of complexities?

  3. Has a variable selection procedure been described? Are chosen variables plausible in the EM context?

  4. Have the metrics for model evaluation been defined and rationalised?

Model validation
  1. Are the model evaluation metrics appropriate to the task and clinical question? If not, has model performance been reported with a clinically applicable (EM) metric?

  2. Are estimates of model variance reported (from cross-validation/bootstrapping)?

  3. Has tuning of model hyperparameters been undertaken, and have the settings been reported?

  4. Have model calibration and discrimination been reported, and have these been tuned?

External validity

Reproducibility
  1. Has a statement regarding code (and data) availability been included?

  2. Are all steps in the model development described in sufficient detail to allow independent replication of the model pipeline?

  3. Has interpretability been explicitly and reasonably addressed (eg, by model visualisations)?

Generalisability
  1. Has external validation on a geographically (and temporally) independent test set been undertaken?

  2. Are performance metrics consistent with the estimates of spread obtained during internal validation?

  3. Have the reasons for good or poor performance on external validation been objectively explored?

  4. Is the final model suitable for deployment in the ED? Does it have high computational or resource requirements?

  5. Has an assessment been made of the potential for exacerbation of bias by the deployed model?

  • AI, artificial intelligence; EM, emergency medicine; ML, machine learning.
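Several of the checkpoints in the table (stratified data splitting, avoidance of data leakage, cross-validated estimates of model variance, and reporting of both discrimination and calibration) can be illustrated concretely. The minimal sketch below uses scikit-learn on synthetic, hypothetical triage-style data; the feature set, outcome prevalence and choice of logistic regression are illustrative assumptions only, not recommendations for any particular EM task.

```python
# Minimal sketch: a leakage-free modelling pipeline with stratified splitting,
# cross-validated variance estimates, and discrimination/calibration metrics.
# The data, features and model choice are hypothetical, for illustration only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical triage-style data: 1000 attendances, 8 numeric features,
# imbalanced binary outcome (~10% positive).
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.10).astype(int)

# A stratified split keeps the outcome prevalence similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Preprocessing lives inside the pipeline, so imputation and scaling are fitted
# on the training folds only -- this is what avoids data leakage.
model = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Cross-validation on the training set gives an estimate of model variance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# Fit once on the full training set, then report discrimination (AUC) and
# calibration (Brier score) on the held-out internal test set.
model.fit(X_train, y_train)
p_test = model.predict_proba(X_test)[:, 1]
print(f"Test AUC:   {roc_auc_score(y_test, p_test):.3f}")
print(f"Test Brier: {brier_score_loss(y_test, p_test):.3f}")
```

The design point to look for when appraising a paper is that all preprocessing is re-estimated within each training fold and never sees the test data; external (geographically or temporally independent) validation would then use the fitted pipeline unchanged.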