Abstract
Objective Patients, families and community members would like emergency department wait time visibility, which would improve patient journeys through emergency medicine. The study objective was to derive, and internally and externally validate, machine learning models to predict emergency patient wait times that are applicable to a wide variety of emergency departments.
Methods Twelve emergency departments provided 3 years of retrospective administrative data from Australia (2017–2019). Descriptive and exploratory analyses were undertaken on the datasets. Statistical and machine learning models were developed to predict wait times at each site and were internally and externally validated. Model performance was tested on COVID-19 period data (January to June 2020).
Results There were 1 930 609 patient episodes analysed and median site wait times varied from 24 to 54 min. Individual site model prediction median absolute errors varied from ±22.6 min (95% CI 22.4 to 22.9) to ±44.0 min (95% CI 43.4 to 44.4). Global model prediction median absolute errors varied from ±33.9 min (95% CI 33.4 to 34.0) to ±43.8 min (95% CI 43.7 to 43.9). Random forest and linear regression models performed best, while rolling average models underestimated wait times. Important variables were triage category, last-k patient average wait time and arrival time. Wait time prediction models are not transferable across hospitals. Models performed well during the COVID-19 lockdown period.
Conclusions Electronic emergency demographic and flow information can be used to approximate emergency patient wait times. A general model is less accurate if applied without site-specific factors.
- emergency care systems
- efficiency
- emergency departments
- emergency department management
- emergency department operations
- emergency department utilisation
Key messages
What is already known on this subject
Patients and families want to know approximate emergency wait times, which will improve their ability to manage their logistical, physical and emotional needs while waiting.
A few small studies from a limited number of jurisdictions report model methods, important predictor variables and the accuracy of derived models for predicting wait times.
What this study adds
Our study demonstrates that predicting wait times from simple, readily available data is complex and provides estimates that are not as accurate as patients would like; however, rough estimates may still be better than no information.
We present the most influential variables regarding wait times and advise against using rolling average models, preferring random forest or linear regression techniques.
Emergency medicine machine learning models may be less generalisable to other sites than we hope for when we read manuscripts or buy commercial off-the-shelf models or algorithms. Models developed for one site lose accuracy at another site and global models built for whole systems may need customisation to each individual site. This may apply to data science clinical decision instruments as well as operational machine learning models.
Introduction
Deciding where to seek care for acute medical problems is complex and nuanced. Many decisions are made with limited information and at times little transparency from health services. Most people hope to be seen by a definitive provider immediately on arrival but usually have to wait for treatment. Emergency department (ED) proximity and wait times are the major influences on patient choice of facility.1–4 Wait time visibility assists with meeting the physical, logistic and psychological needs of patients.5 There is increasing consumer advocacy for transparency of information about health service resources, and there is also health service interest in displaying wait times. Many examples exist in the USA and Canada, and emerging interest has been seen in Australia and other jurisdictions.6
Information technology capabilities and applied data science techniques are becoming increasingly available to acute care services. Many EDs collect a large volume of electronic point-of-care patient data, relating to demographics, flow and clinical care. In a community where data from multiple EDs are available, knowledge of queue lengths could facilitate optimal patient load-balancing across acute care facilities. This has the potential to reduce the harms of long waits.
Previously published information is available regarding how to predict wait times in emergency medicine. Manuscripts report a variety of predictor variables, model techniques and accuracies, from either a single centre or a small number of sites. Sun et al 7 used quantile regression techniques in a single ED and found that, using triage categories, the number of unseen patients and the number of new patients treated by physicians in the last hour, they could predict wait times to an accuracy of ±12 min. Ang et al 8 compared statistical with machine learning models using data from four EDs in the USA in response to inaccurate commercial wait time predictors, finding a regularised regression model (Q-Lasso) to be the best model, with less underestimation of wait times, using only time-of-day and day-of-week data. Arha9 used a tree-based regression model with simple predictor variables available at triage. Senderovich et al 10 found that adding congestion variables increased the accuracy of patient flow predictors.
There is limited knowledge regarding the performance of wait time predictors across a variety of jurisdictions, patient catchments and healthcare resources. It is not known whether one model can be applied across multiple EDs for system-wide implementation, or how predictive models perform during unexpected events with variations in demand (eg, COVID-19).
Objectives
The primary objective of the study was to develop and internally validate predictive algorithms for patient wait times (triage-to-provider). Secondary objectives included determining the relative importance of each predictor variable and model method, whether models are transferable across different EDs (external validation) and the performance of the models during special events (COVID-19).
Methods
Study design and setting
This is an observational study using retrospective administrative data to develop, compare and validate prediction models for patient wait times at EDs. Data from 2017 to 2019 from 12 EDs were used for the main study, followed by data from January to June 2020 from three EDs to test performance during COVID-19 conditions.
Mandatory point-of-care emergency patient demographic, flow and clinical data are collected for every patient by clerks and clinicians in Australia. In Victoria, these defined data populate the governmental Victorian Emergency Minimum Dataset (VEMD).11 Data available at time of triage were used as predictor variables.
There are 24 million residents in Australia and emergency medicine manages eight million patient episodes annually. The majority (93%) of Australian residents attend government-funded, public EDs with no patient copayments. There are no restrictions on individual choice of ED. Regional to tertiary departments were invited to participate if they were part of an academic health science centre or were engaged via research networks. Ten Melbourne and two Queensland EDs participated, comprising one private, one paediatric, four major, two large metropolitan and four medium metropolitan hospitals. Hospital #7 (H7) displayed predicted patient wait times (prior in-house models) online, in their waiting room and to Ambulance Victoria during the study.
Data sources and measurements
Electronic medical record software applies time-date stamps to clinician activities (eg, triage). Clerical staff collect demographic data from patients at initial registration. Clinical staff record data while attending to a patient. The VEMD datasets from each hospital were the primary source of data for this study. VEMD data are routinely checked for completeness, accuracy and administrative errors by an emergency physician at each site prior to submission to the Victorian Government.
Three years of retrospective, deidentified VEMD data were obtained from 12 hospitals in Australia (mainly Melbourne). Hospital names were replaced with alphanumeric codes prior to analyses. All episodes of care were eligible for inclusion in the study. Data were collected in early 2020 and arranged into training (2017 and 2018) and testing datasets (2019), maintaining the temporal order based on patient arrival times. The training dataset was used for exploratory analysis and learning prediction models. The testing dataset was used to internally and externally validate the prediction models. Further retrospective data were collected from three EDs, for the period of 1 January 2020 to 30 June 2020 coinciding with major variations in emergency attendances secondary to COVID-19 concerns (April, May 2020). These data were used to evaluate the stability of model performance during unexpected circumstances.
Variables
Variables used in this study are presented in online supplemental appendix 1A. Variables collected after triage/registration were excluded from the models, except those required for calculation of the dependent variables. We used 19 predictor variables (13 VEMD and 6 derived) in total.
Outcomes
The primary outcomes of this study were triage-to-provider wait times for all patients, predicted at triage. Secondary outcomes included the accuracy of each predictive model (internal validation); determining whether a global model or individual models performed better; identifying the best technique to generate these models; the relative contribution of each variable to the models; assessment of how each model performs at different sites (cross-site, external validation) and evaluation of how the models perform during COVID-19 conditions or unusual circumstances. Researchers were not blinded to outcomes. The outcome choices and definitions were informed by a large, multisite, qualitative study of community members, consumers, paramedics and health administrators.5 These participants recommended a prediction accuracy of ±30 min (unpublished data).
Analysis
Study size
We used time-based sampling similar to previous studies of wait time prediction that have used time periods ranging from 1 month to 1 year. Time-based sampling uses sampling over a set period, so that there is temporal separation of data used to train the model and those used for validation.7–9 We obtained 3 years of data from each hospital to account for seasonal variations in patient visits. Multiple hospitals were enrolled to allow cross-site validation evaluations. The accepted convention of using a minimum of 30–50 data points per variable was applied.
Data cleaning, outliers and missing data
Patient data rows were checked for missing values related to the primary outcomes and episodes were removed from analysis if the primary outcome variables were missing. We therefore removed patients who left without being seen by a provider (n=133 204 (6.85%)). Other missing values were replaced with ‘unknown’ or ‘other’ categories using VEMD descriptors. Three hospitals did not collect ambulance data. The total number of unique patient episodes where at least one predictor variable had the value ‘unknown/other’ was 1 733 247, covering a total of eight predictor variables (online supplemental appendix 1B).
Negative values for triage-to-provider time (n=236 (0.01%)) were removed from the analysis. We also removed patient data where the triage-to-provider time exceeded the 360 min maximum and the predefined statistical outlier threshold (defined as 1.5 times the IQR (Q3−Q1) above Q3); these were mainly generated by administrative data entry errors (n=13 612 (0.7%)).
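As a rough illustration of these cleaning rules (the column name is a placeholder, not a VEMD field, and the combined 360 min/IQR rule is interpreted here as removing waits that exceed either cap), a minimal sketch might look like:

```python
import pandas as pd

def remove_wait_outliers(df: pd.DataFrame, col: str = "wait_minutes") -> pd.DataFrame:
    """Drop episodes with negative waits or waits beyond the study's caps.

    Sketch only: assumes 'wait_minutes' holds triage-to-provider time in minutes.
    """
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr_cap = q3 + 1.5 * (q3 - q1)  # Q3 + 1.5 * IQR
    keep = (df[col] >= 0) & (df[col] <= 360) & (df[col] <= iqr_cap)
    return df.loc[keep]
```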
Standardising and encoding data
Hospitals providing non-VEMD formatted data (Queensland) had their data converted to VEMD format. One-hot encoding12 was applied to all categorical variables prior to prediction model development, as all categorical variables were nominal with the exception of triage category. We assessed that it was preferable to lose the order information for triage by applying one-hot encoding rather than to treat triage category as a continuous variable, because the distances between levels of triage category are non-linear.
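As an illustrative sketch (column names and values are placeholders, not actual VEMD fields), one-hot encoding of the nominal predictors, including triage category, can be performed with pandas:

```python
import pandas as pd

# Hypothetical example rows; the real predictors come from the VEMD extract.
df = pd.DataFrame({
    "triage_category": [1, 3, 2, 5, 4],
    "arrival_mode": ["ambulance", "walk-in", "walk-in", "other", "ambulance"],
})

# Triage category is one-hot encoded rather than kept as a number because
# the spacing between triage levels is non-linear.
encoded = pd.get_dummies(
    df, columns=["triage_category", "arrival_mode"], prefix=["triage", "mode"]
)
print(encoded.head())
```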
Model building and recalibration
A Python-based machine learning library (scikit-learn) and analysis tool (statsmodels) were used for model development. Guided by the wait-time prediction literature,8–10 13 we used three statistical and machine learning techniques (linear regression, random forests and elastic net regression) and a rolling average approach (the mean wait time of the previous k=4 observations). We included all predictor variables in model construction and undertook a post hoc variable importance analysis. We rebuilt the models with the most important variables only and compared the performance of the simplified models with the initial models. To foster future replications, we provide code snippets for model construction in an online repository: https://doi.org/10.5281/zenodo.459978.
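A minimal scikit-learn sketch of the three learned model types is shown below; hyperparameters are illustrative defaults rather than the values used in the study, whose actual code is available in the cited repository.

```python
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import RandomForestRegressor

# Illustrative model set; hyperparameters are library defaults, not study settings.
models = {
    "linear_regression": LinearRegression(),
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "random_forest": RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0),
}

def fit_all(models, X_train, y_train):
    """Fit each candidate model on the one-hot-encoded training predictors."""
    for model in models.values():
        model.fit(X_train, y_train)
    return models
```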
The ‘last-k’ variable is the mean triage-to-provider wait time for the last ‘k’ patients seen by a provider. To determine the appropriate value of ‘k’ for this study, we performed a sensitivity analysis by observing the performance of prediction models constructed using different k values (ie, 3–10). We found that the performance differences were statistically indistinguishable across different values of k and selected k=4, which produced the models with the numerically best performance.
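The last-k feature can be computed as a lagged rolling mean, as sketched below (column names are assumptions, not VEMD fields); the same quantity also serves as the rolling average baseline prediction.

```python
import pandas as pd

def add_last_k_wait(df: pd.DataFrame, k: int = 4) -> pd.DataFrame:
    """Mean wait of the previous k patients seen by a provider.

    Assumes a single hospital's records with columns 'seen_time' (time seen
    by a provider) and 'wait_minutes' (observed triage-to-provider time).
    """
    df = df.sort_values("seen_time").copy()
    # shift(1) ensures only already-completed waits feed the feature,
    # so no information about the current patient leaks into the predictor.
    df[f"last_{k}_avg_wait"] = df["wait_minutes"].shift(1).rolling(k).mean()
    return df
```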
Validation
For site-specific accuracy testing (internal validation), we used a time-wise holdout validation approach, where data were split according to defined sampling periods and a time-defined dataset was put aside for use as a test set.14 Patient records were sorted for each hospital by their arrival time. Data from 2017 to 2018 were used to construct site-specific prediction models for all individual hospitals, while 2019 data were used to evaluate prediction models within each hospital. For cross-site evaluation of site-specific models, we used a time-wise cross-site external validation approach.15 Site-specific models were validated against 2019 data from other hospitals (eg, train with Hospital A data from 2017 to 2018, then test with Hospital B data from 2019), resulting in 132 pairwise combinations.
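The sketch below illustrates the time-wise split and the cross-site evaluation under assumed column names ('arrival_time' as a datetime column, plus a feature list and target); it is not the study's actual code.

```python
def time_wise_split(df, time_col="arrival_time"):
    """Train on 2017-2018 arrivals; hold out 2019 as the temporally later test set."""
    train = df[df[time_col].dt.year.isin([2017, 2018])]
    test = df[df[time_col].dt.year == 2019]
    return train, test

def cross_site_mae(model_site_a, test_site_b, features, target="wait_minutes"):
    """Score hospital A's fitted model on hospital B's 2019 test data."""
    preds = model_site_a.predict(test_site_b[features])
    return (test_site_b[target] - preds).abs().median()
```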
We also undertook geographical and temporal cross-state external validation. Global prediction models were constructed using 2017 and 2018 data from hospitals in Victoria and evaluated using 2019 data from Queensland. We also tested the global model performance against combined 2019 Victorian all-site data.
We also assessed two boosting ensemble algorithms (light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost)) and a hyperparameter-optimised random forest algorithm in our model validation. These models yielded improvements that were not statistically significant and came at considerable computational cost, so we excluded them from the main analysis.
For validation during unexpected events, we compared model accuracies between the first 6 months of 2019 (surrogate for normal conditions) and the first 6 months of 2020 (surrogate for unexpected events, for example, COVID-19), using data and models from three EDs.
To assess model performance, we calculated the absolute errors (AE) between the actual time and the predicted time for all models and hospitals. We then calculated the median of these distributions of absolute errors (MAE) to identify the best model for each hospital and across all 12 hospitals.
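Note that MAE here denotes the median, not the mean, of the absolute errors. A minimal sketch of the two accuracy summaries reported in this paper (assuming numpy arrays or lists of actual and predicted waits in minutes):

```python
import numpy as np

def median_absolute_error(y_true, y_pred):
    """Median of the absolute errors (the 'MAE' reported in this study)."""
    return float(np.median(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def proportion_within_30_min(y_true, y_pred):
    """Share of predictions within +/-30 min of the actual wait time."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= 30))
```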
Statistical methods
The Scott-Knott effect size difference test was used to rank the performance, based on MAE, of the prediction models for internal validation.16 This is a multiple comparison approach that produces statistically distinct and non-negligible (effect size) groups of distributions. We used the implementation provided by the sk_esd function of the ScottKnottESD R package V.2.0.3. Mann-Whitney U tests were used to identify whether the performance (MAE) difference between two models was statistically significant; Cliff’s delta (|d|) tests17 were then used to measure the effect size. The interpretation of Cliff’s delta values is as follows: |d|<0.147 negligible, |d|<0.33 small, |d|<0.474 medium, otherwise large.18 We used the cliff.delta function of the effsize R package V.0.7.8 for calculating Cliff’s delta. For all statistical tests, we used a statistical significance level of α=0.05 and sought non-negligible effect sizes.
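The study used the R packages named above; a rough Python analogue of the pairwise comparison (Mann-Whitney U test with Cliff's delta derived from the U statistic) is sketched below for illustration only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_error_distributions(errors_a, errors_b, alpha=0.05):
    """Two-sided Mann-Whitney U test with a Cliff's delta effect size."""
    a, b = np.asarray(errors_a), np.asarray(errors_b)
    u_stat, p_value = mannwhitneyu(a, b, alternative="two-sided")
    # Cliff's delta can be recovered from the U statistic: d = 2U/(n1*n2) - 1.
    delta = 2.0 * u_stat / (len(a) * len(b)) - 1.0
    magnitude = ("negligible" if abs(delta) < 0.147 else
                 "small" if abs(delta) < 0.33 else
                 "medium" if abs(delta) < 0.474 else "large")
    return {"p_value": float(p_value), "delta": float(delta),
            "magnitude": magnitude, "significant": p_value < alpha}
```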
Patient and public involvement
The primary outcome of this study was determined by a qualitative study involving patients, the public and other stakeholders.5 Consumers and community stakeholders contributed to the design and write up of the study.
Results
Characteristics of study subjects
Twelve EDs contributed data. Two sites were unable to obtain ethics approval (regional referral, medium metropolitan). Flow through the study is presented in figure 1. Department and patient demographics are presented in table 1. The total number of patient episodes included in the study was 1 930 609 with 1 388 509 in the training and 542 100 in the testing datasets. Overall admission rates were 29% and 23% of patients arrived by ambulance.
Wait time proportional distributions were similar across sites, although specific wait times at each site varied, with a median site range of 24–54 min for triage-to-provider time (figure 2). The distribution of outcomes is right skewed, but we did not apply any transformation to force positive predictions, since negative predictions are expected to be rare and can be replaced with zero in deployment.7
Main results
Internal validation of site-specific models (same hospital, using later time period)
The performance rankings produced by the Scott-Knott effect size difference test showed that Random Forests and Linear Regression performed the best (first rank) for all studied hospitals (n=12), followed by Elastic Net (n=11) and Rolling Average (n=5). Random Forests, Linear Regression, and Elastic Net outperformed Rolling Average at seven hospitals. The MAE of Random Forests varied from 22.6 min (95% CI 22.4 to 22.9) for H7 to 44.0 min (95% CI 43.4 to 44.4) for H2. The distributions of the AE of internal validation for ED wait-time (triage-to-provider) prediction are shown (figure 3). The prediction models predicted a wait time to within ±30 min of the actual wait time between 40% and 63% of the time, depending on the site.
The performance differences between Random Forests and Linear Regression were negligible for all hospitals according to Cliff’s delta effect size. Rolling average models consistently underestimated patient wait times. More details regarding the actual errors are shown in online supplemental appendix 1C.
Variable importance analysis
Triage Category (median importance score=65%, IQR (54%–74%)), Arrival Time (Hour) (median importance score=15%, IQR (12%–25%)), and the average wait time of the last k-patients (median importance score=15%, IQR (7%–21%)) were the three most important Random Forest predictive variables across all sites. The distributions of importance scores are shown in figure 4.
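A sketch of how relative importance scores can be read from a fitted random forest (assuming scikit-learn and a list of feature names matching the training matrix; not the study's code):

```python
import pandas as pd

def relative_importance(fitted_forest, feature_names):
    """Feature importances from a fitted RandomForestRegressor, scaled to 100%."""
    scores = pd.Series(fitted_forest.feature_importances_, index=feature_names)
    return (100.0 * scores / scores.sum()).sort_values(ascending=False)
```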
Simplified models
Simplified models were built with the top-ranked variables, which accounted for 95% of the relative variable importance. They demonstrated similar accuracy to full models with all variables. Performance differences between Random Forest simplified models and full models were not statistically significant and had negligible effect sizes for all hospitals. Distributions of the AE of these simplified models are shown in online supplemental appendix 1D.
Site-specific model performances at other single sites
Single-site models performed better at the hospital for which they were developed than at other sites. Of the 132 pairwise Random Forest combinations, 119 had statistically significant performance differences with negligible to medium effect sizes. Ninety-seven (~82%) yielded higher errors by 0.02–39.6 min and 22 (~18%) yielded lower errors by 0.01–12.4 min compared with their site of origin. This suggests that site-specific patient wait time prediction models are not transferable to different hospitals. The distributions of AE for triage-to-provider time prediction are shown in online supplemental appendix 1E.
Global model performance
Multihospital global models constructed from Victorian 2017–2018 data performed similarly when tested with 2019 data from hospitals in Victoria and Queensland. The global models did not perform as well as site-specific models. The MAE of these global models varied from 32.2 min (95% CI 32.1 to 32.3) for Linear Regression to 42.6 min (95% CI 42.5 to 42.7) for Elastic Net when tested with Victorian hospital data, and from 36.1 min (95% CI 35.6 to 36.6) for Linear Regression to 42.4 min (95% CI 42.1 to 42.7) for Elastic Net when tested with patients from hospitals in Queensland. The distributions of the AE of these global models are shown in figure 5.
Impact of COVID-19 on model accuracy
Three hospitals provided 2020 data (n=74 398) covering part of the reduced-attendance COVID-19 period in Victoria. We observed that patient wait time models built using past data from 2017 and 2018 still performed with reasonable accuracy, with MAE differences ranging from 0.03 to 6.0 min. Although these performance differences were statistically significant (except for Random Forests at H10 and Elastic Net at H12), the effect sizes were negligible for all of these hospitals. Three models yielded higher errors by 1.7–5.0 min and four yielded lower errors by 0.5–6.0 min. Distributions of the AE of models during COVID-19 are shown in online supplemental appendix 1F.
Limitations
Limitations of the study include the use of only administrative demographic and ED flow data, and only Australian data. There are no direct measures of resource availability or processes within the ED (eg, nursed cubicles, streaming within the ED), hospital capacity (eg, available beds) or community resourcing (eg, ambulances to transport patients to nursing homes). No measures of patient comorbidity or diagnosis were used in the models. Inclusion of this information may improve prediction accuracy in future models. Excluding ‘did not wait’ patients from the study, or including triage category 1 patients, may have over- or underestimated true wait times. Additionally, we do not have information about how these models would perform during a disaster with a rapid surge in attendances. The models present general estimates of patient wait times for those arriving at the ED; they do not generate individualised wait times for each patient.
Models can generate nonsensical outputs; for example, linear regression can produce negative predictions, which do not make practical sense. In practice, negative predictions can be replaced with 0 (see the sketch below). We observed the best prediction results at H7, which is the only hospital where wait-time predictions, from an in-house developed model, were shared on the hospital website and on site during the study. This may have affected the behaviour of lower acuity patients, who could choose to visit H7 at times when less waiting was expected, resulting in a more homogeneous, and easier to model, distribution than at the other hospitals in our sample. Machine learning models may or may not introduce or amplify bias in healthcare. There are currently no reliable ways of testing for bias in machine learning when applied to healthcare datasets (personal communication, Dr Aldeida Aleti, Monash University), so we are unable to determine whether these outputs are biased for or against any particular group of patients.
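As a minimal illustration of replacing negative predictions before display (function and argument names are placeholders):

```python
import numpy as np

def clip_negative_predictions(y_pred):
    """Replace negative wait-time predictions with zero before display."""
    return np.maximum(np.asarray(y_pred, dtype=float), 0.0)
```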
Discussion
In summary, using emergency patient demographic and flow data from the 12 studied hospitals, it is possible to build triage-to-provider wait time prediction models with an accuracy of ±22.6 to ±44.0 min. Predictions were within ±30 min of the actual wait time between 40% and 63% of the time, varying by site. The best performing models used Random Forests and linear regression methods for triage-to-provider prediction. The average wait time of the last k patients, triage category and patient arrival time were the most important predictor variables. Accuracy is reduced when a model developed for one site is used at another site, or when a global model (developed from multiple sites) is used. When special events such as the COVID-19 reduction in attendances occur, prediction accuracy is maintained.
The errors obtained for triage-to-provider times (23–44 min) are larger than those reported in a Singapore study.7 Sun et al built different models for each acuity level, and this prediction at a finer level of granularity may explain the performance difference. They omitted acuity category 1 and reported accuracies of 11.9 min for acuity category 2 and 15.7 min for acuity category 3. We chose a single prediction based on consumer feedback that triage categories are not understood by patients and families, but this may not suit all populations, especially as data and health literacy increase.5 Ang et al modelled low-acuity patients only and reported performance in terms of mean squared error, arguing that median-based measures tend to underestimate because of the right-skewed distribution of wait times.8 We observed this with Rolling Average models, but not with the others. Ang et al reported 9.4 min for non-absolute median error, which we outperform with a range of 1.1–8.9 min depending on the site, even after inclusion of all triage categories in the analyses.8
To our knowledge, this is one of the few emergency medicine prediction model studies to date to undertake broad internal (between sites, global model) and external (temporal, geographical, COVID-19) validation. Importantly, we found that a model derived from one site's data can be used at new sites, but at a cost in accuracy, prompting caution before promoting a ‘one size fits all’ model. Similarly, global models trained using aggregate data from all sites may reduce the time and cost spent developing individual models but are less accurate at individual sites. We should be wary of seemingly accurate models that have only been tested on data similar to those on which they were trained. External validation of data science models should be undertaken prior to implementation in new jurisdictions, particularly where clinical care decisions might be assisted by machine learning algorithms or models.
This is, to our knowledge, the first report describing how such models perform during unusual events. We demonstrated that the COVID-19 lockdowns did not have a negative impact on model accuracy. In Australia, this period was one of both significantly reduced emergency attendances and reduced physician productivity due to the increased complexity of managing patients and departments.19 These data do not cover periods of surge.
Qualitative work has shown that patients want access to wait times and would use times to address a large variety of needs.5 This study has shown that it is possible to predict approximate wait times; however, it has also demonstrated that the range of predictions is less accurate than desired by stakeholders. Some information about wait times may still be useful to patients, even if the prediction range is broad. Overestimated and underestimated predictions may be perceived differently by patients and families. Overestimated predictions may be perceived positively if patients wait a shorter time than predicted, but could deter patients from seeking care. Random Forests tend to overestimate more than linear regression. Rolling average models underestimated wait times the most (figure 5) and could either make emergency flow seem better than it is or generate dissatisfaction from patients when waits exceed predictions.20–24
In summary, using limited data available at the point of care, wait times can be predicted to within ±23–44 min. Models should be built individually for each hospital and are likely to perform similarly during COVID-19-like conditions.
Data availability statement
No data are available.
Ethics statements
Patient consent for publication
Ethics approval
The study received Monash Health ethics committee approval (RES-19-0000-763A).
Acknowledgments
Lisa Kuhn, Anne Spence, Cathie Piggot: governance assistance. John Papatheohari, David Rankin: project sponsors and advisors. Mrs Katarina Tomka: project facilitator, Monash University.
References
Supplementary materials
Supplementary Data
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Footnotes
Handling editor Shammi L Ramlakhan
Twitter @gabyblech, @EpidemicAmy
Collaborators Rachel Rosler: network sponsor, Melanie Stephenson: literature review, Kim Hansen: risk advisor, Ms Ella Martini: consumer, Dr Hamish Rodda: emergency informatics advisor, project sponsor, Dr Judy Lowthian: district nursing researcher.
Contributors Principal investigator: KW. Funding: KJ, KW, MB-M. Study design and protocol: KW, BT, CT, JJ, WW. Study protocol revisions: all authors. Ethics/governance: KW, AL. Site chief investigators: HA, GB, BP, KW, AS. Data collection: AL, HA, BP, KW, AS. Data analysis: JJ, CT, BT. Manuscript: KW, JJ, CT, BT. Manuscript revisions: all authors. Manuscript guaranteed by KW and BT.
Funding The Australian government, Medical Research Future Fund, via Monash Partners, funded this study. Researchers contributed in-kind donations of time. The Cabrini Institute and Monash University provided research infrastructure support.
Competing interests Some authors and collaborators are emergency physicians or directors, and others work in community health (prehospital and district nursing). One collaborator is a consumer. The Australian government, Medical Research Future Fund, via Monash Partners, funded this study. Researchers contributed in-kind donations of time. The Cabrini Institute and Monash University provided research infrastructure support.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.