Reliability and validity of the Manchester Triage System in a general emergency department patient population in the Netherlands: results of a simulation study
I van der Wulp, M E van Baar, A J P Schrijvers
Julius Center for Health Sciences and Primary Care, UMC Utrecht, Utrecht, The Netherlands
Correspondence to: Ms I van der Wulp, Julius Center for Health Sciences and Primary Care, UMC Utrecht, P O Box 85500, 3508 GA Utrecht, The Netherlands; i.vanderwulp{at}


Objective: To assess the reliability and validity of the Manchester Triage System (MTS) in a general emergency department patient population.

Methods: A prospective evaluation study was conducted in two general hospitals in the Netherlands. Emergency department nurses from both hospitals triaged 50 patient vignettes into one of five triage categories in the MTS. Triage ratings were compared with the ratings of two Dutch MTS experts to measure inter-rater reliability. Nineteen days after triaging the patient vignettes, triage nurses were asked to rate the same vignettes again to measure test-retest reliability. Reliability in relation to the work experience of emergency department nurses was also studied. Validity was assessed by calculating percentages for overtriage, undertriage, sensitivity and specificity.

Results: Inter-rater reliability was “substantial” (weighted kappa 0.62 (95% CI 0.60 to 0.65)) and test-retest reliability was high (intraclass correlation coefficient 0.75 (95% CI 0.72 to 0.77)). No significant association was found between the experience of emergency department nurses and the reliability score (kappa). Undertriage occurred more frequently than overtriage (25.3% vs 7.6%), especially in elderly patients. Sensitivity for urgent patients in the MTS was 53.2% and specificity was 95.1%. The patient vignettes representing children aged <16 years revealed a higher sensitivity (83.3%).

Conclusions: Inter-rater reliability is “moderate” to “substantial” and test-retest reliability is high. The reliability of the MTS is not influenced by nurses’ work experience. Undertriage mainly occurs in the MTS categories orange and yellow. The MTS is more sensitive for children who need immediate or urgent care than for other patients in the emergency department.

Triage systems are now often used in emergency departments (EDs). The need for triage systems originates in an increased demand for systematic working and increased workload.1 The Manchester Triage System (MTS) has been implemented in several European countries.2 In the Netherlands, most EDs (87%) have implemented the MTS in recent years.3 The MTS, developed in 1994 by the Manchester Triage Group in the UK, consists of 52 flowcharts. Each flowchart represents a specific complaint, and the flowcharts can be subdivided into five categories: illness, lesion, children, abnormal behaviour and major incidents. A flowchart describes discriminators by which it is possible to assess the acuity of the patient’s problem. Five triage categories are distinguished: red (immediately), orange (can wait 10 min), yellow (can wait 60 min), green (can wait 2 h) and blue (can wait 4 h).1

Because only a limited number of studies are available on the reliability and validity of the MTS, it is not known whether patients are correctly triaged when this system is used. From a patient safety point of view, this situation is not acceptable. Goodacre et al4 assessed inter-rater reliability among four emergency physicians who rated 50 randomly selected patient cases in three audit rounds. Inter-rater agreement was found to be “moderate”. However, this study measured inter-rater agreement among physicians, whereas triage is usually performed by nurses. A literature search identified no studies that examined the relationship between the work experience of ED nurses and the inter-rater reliability of the MTS.

The validity of triage systems is usually measured in terms of sensitivity, specificity and percentages of overtriage and undertriage. We found that the validity of the MTS has been studied four times. Cooke et al5 studied the validity in a population of patients who needed admission to critical care areas. They concluded that the MTS was a sensitive tool. Speake et al6 measured sensitivity and specificity of the MTS when identifying patients with high risk cardiac chest pain. Seven research workers reassessed patients with chest pain who visited the ED in a 4-week period. Sensitivity was 86.8% and specificity was 72.4%. Dann et al7 studied whether the MTS can detect the critically ill. Sensitivity for allocating patients with mild pain into a non-urgent triage category was 14.9% while specificity was 97.7%. Roukema et al2 studied validity by assessing the correlation between triage category and resource utilisation and hospitalisation in children. In addition, they compared the triage category with a predefined gold standard. All of these studies have focused on specific patient subgroups. No studies were found that assessed the validity of the MTS in a general ED patient population.

The objective of our study was to assess prospectively the reliability and validity of the MTS in a general ED patient population in the Netherlands. Our research questions were:

  • What is the inter-rater and test-retest reliability of the MTS?

  • What is the association (if any) between nurses’ work experience and inter-rater reliability?

  • How much overtriage and undertriage occurs when the MTS is used?

  • What is the sensitivity and specificity of the MTS?


Methods

Study design

The present study is a prospective evaluation of the MTS. The protocol was reviewed by the medical ethical committee of the University Medical Center Utrecht and all the participating hospitals.

Study setting and population

The study was conducted in two general hospitals located in the province of Utrecht. These hospitals implemented the MTS in 2005. The first hospital has an annual number of 36 000 visitors to its ED with 35 nurses performing triage. The second hospital has an annual number of 23 000 ED visitors with 20 nurses performing triage. All triage nurses had been instructed on how to use the MTS before triaging patients in their ED.

Data collection

Reliability and validity were prospectively measured using 50 ED patient vignettes. These vignettes were generated from a random sample of ED patients from a general hospital not participating in the study (Sint Elisabeth Hospital, Tilburg). The patient vignettes contained the following information: sex, date of birth, time of arrival at the ED, complaint, transport, referrer and vital signs (table 1).

Table 1 Example of a patient vignette

The triage nurses of both hospitals were asked to assign a triage category to every patient vignette using the MTS. They were allowed to use the same sources as they do when performing triage in their own ED (ie, a computer program). Nineteen days later they were asked to complete the same patient vignettes again to assess test-retest reliability. These scores were compared with the triage category assigned by two Dutch MTS experts (gold standard). Each expert independently rated the vignettes. The ratings were then compared and discussed in case of discrepancy. The vignettes were rated by using the Dutch version of the book by Mackway-Jones.1 The experts are members of the Dutch Trauma Nursing Foundation (STNN) and work as ED nurses in two hospitals in the Netherlands. They translated and introduced the MTS into the Netherlands as well as teaching nurses all over the country how to triage using the system. Both experts are certified MTS instructors.

Information was also collected on nurses’ age, sex and work experience as an ED nurse.

Data analysis

Reliability is expressed in terms of inter-rater agreement and test-retest reliability. Inter-rater agreement was calculated using both weighted and unweighted kappa (range −1 to 1). Previous studies on the reliability of triage systems calculated unweighted kappa as a measure of inter-rater agreement.8 9 However, inter-rater agreement for ordinal scales (such as the MTS) is preferably measured by a weighted kappa statistic, which gives partial credit to ratings that differ by only a small number of categories. There are different methods for calculating weighted kappa; we chose the most commonly used quadratically weighted kappa.10 The classification model designed by Landis and Koch11 was used for the interpretation of kappa (table 2). These analyses were performed using Agree for Windows, a computer program which specialises in calculating kappa statistics in numerous situations.

Table 2 Interpretation of kappa
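
To illustrate the difference between unweighted and quadratically weighted kappa, the following is a minimal sketch for a single nurse compared with the expert rating. It is not the procedure implemented in Agree for Windows, and how that program pools multiple raters is not described here. The function names `cohen_kappa` and `landis_koch` and the example ratings are hypothetical, and the interpretation bands are the commonly cited Landis and Koch labels, assumed to correspond to table 2.

```python
from collections import Counter

# MTS triage categories, ordered from most to least urgent.
CATEGORIES = ["red", "orange", "yellow", "green", "blue"]


def cohen_kappa(rater_a, rater_b, categories=CATEGORIES, weighted=True):
    """Cohen's kappa for two raters; quadratic weights when weighted=True."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint counts and each rater's marginal counts.
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    def disagreement(i, j):
        if weighted:
            return ((i - j) / (k - 1)) ** 2  # quadratic penalty by distance
        return 0.0 if i == j else 1.0        # unweighted: every miss counts fully

    obs = sum(disagreement(i, j) * observed[(i, j)] / n
              for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * (marg_a[i] / n) * (marg_b[j] / n)
              for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - obs / exp


def landis_koch(kappa):
    """Landis and Koch label for a kappa value (assumed to match table 2)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings for six vignettes (not study data).
expert = ["orange", "orange", "yellow", "yellow", "green", "green"]
nurse = ["orange", "yellow", "yellow", "green", "green", "green"]

for weighted in (False, True):
    value = cohen_kappa(expert, nurse, weighted=weighted)
    print("weighted" if weighted else "unweighted", round(value, 2), landis_koch(value))
```

In this made-up example both disagreements are only one category apart, so the quadratic weights penalise them lightly: unweighted kappa is 0.5 and quadratically weighted kappa is 0.75.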

Test-retest reliability was calculated by the intraclass correlation coefficient (range 0–1). The association between nurses’ work experience and inter-rater agreement score was calculated by Spearman’s rho correlation coefficient. Validity was assessed by calculating sensitivity, specificity and the percentages of overtriage and undertriage. Overtriage was defined as the percentage of patient vignettes that received a more urgent triage category rating from the triage nurse than from the expert. Undertriage was defined conversely as the percentage of patient vignettes that received a less urgent triage category rating from the triage nurse than from the expert. These analyses were performed using SPSS for Windows Version 14.
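
The validity measures can be sketched in a few lines. The example below is a hedged illustration rather than the SPSS analysis used in the study: it treats the expert rating as the gold standard and red and orange as the urgent categories, in line with the sensitivity and specificity reported in this article. The function name `validity_metrics` and the example ratings are hypothetical.

```python
# Lower number = more urgent MTS category.
URGENCY = {"red": 0, "orange": 1, "yellow": 2, "green": 3, "blue": 4}
URGENT = {"red", "orange"}  # categories counted as urgent for sensitivity


def validity_metrics(nurse, expert):
    """Percent overtriage/undertriage, sensitivity and specificity vs the expert."""
    if len(nurse) != len(expert) or not nurse:
        raise ValueError("ratings must be non-empty and of equal length")
    pairs = list(zip(nurse, expert))
    n = len(pairs)

    # Overtriage: nurse more urgent than expert. Undertriage: nurse less urgent.
    over = sum(URGENCY[a] < URGENCY[e] for a, e in pairs)
    under = sum(URGENCY[a] > URGENCY[e] for a, e in pairs)

    tp = sum(a in URGENT and e in URGENT for a, e in pairs)
    fn = sum(a not in URGENT and e in URGENT for a, e in pairs)
    tn = sum(a not in URGENT and e not in URGENT for a, e in pairs)
    fp = sum(a in URGENT and e not in URGENT for a, e in pairs)

    return {
        "overtriage_pct": 100 * over / n,
        "undertriage_pct": 100 * under / n,
        # None when the gold standard contains no urgent (or no non-urgent) cases.
        "sensitivity_pct": 100 * tp / (tp + fn) if tp + fn else None,
        "specificity_pct": 100 * tn / (tn + fp) if tn + fp else None,
    }


# Hypothetical ratings for eight vignettes (not study data).
expert = ["red", "orange", "orange", "yellow", "yellow", "green", "green", "blue"]
nurse = ["red", "orange", "yellow", "yellow", "green", "green", "yellow", "blue"]
print(validity_metrics(nurse, expert))
```

Note that overtriage and undertriage are computed over all five categories, whereas sensitivity and specificity depend only on whether a rating falls on the urgent side of the orange/yellow boundary.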



Results

Of the 35 nurses from hospital 1, 28 returned the vignettes in part 1 of the study and 17 in part 2, giving an overall response rate of 64.3% for this hospital. No significant differences were found in the characteristics of the nurses between the responding and non-responding groups. All 20 nurses from hospital 2 returned the vignettes in part 1 of the study and 17 in part 2, giving an overall response rate of 92.5%. Table 3 shows the characteristics of the respondents from both hospitals.

Table 3 Characteristics of participating nurses
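
The reported overall response rates are consistent with pooling the returns from both parts of the study over the number of nurses invited twice. This reading is an inference from the figures, not a stated formula, and the helper `response_rate` below is a hypothetical name used only for this check.

```python
def response_rate(returned_part1, returned_part2, nurses):
    """Pooled response rate (%) across two rounds sent to the same nurses."""
    if nurses <= 0:
        raise ValueError("number of nurses must be positive")
    return 100 * (returned_part1 + returned_part2) / (2 * nurses)


# Hospital 1: 28 and 17 returns from 35 nurses.
print(round(response_rate(28, 17, 35), 1))
# Hospital 2: 20 and 17 returns from 20 nurses.
print(round(response_rate(20, 17, 20), 1))
```

The two calls give 64.3 and 92.5, matching the percentages reported for hospital 1 and hospital 2.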


Reliability

Unweighted kappa was 0.48 (95% CI 0.45 to 0.50), which represents “moderate” agreement. Quadratically weighted kappa was 0.62 (95% CI 0.60 to 0.65), which represents “substantial” agreement. The intraclass correlation coefficient for nurses who participated in both parts of the study was 0.75 (95% CI 0.72 to 0.77), which indicates that nurses are consistent in triaging patients.

Nursing experience

Analyses of the ED nurses’ work experience by outcome (kappa statistic) showed no significant association. There was a small negative non-significant correlation for ED nurses’ experience (Spearman’s rho  =  −0.22, p = 0.141). Similar results were found for general nursing experience (Spearman’s rho  =  −0.12, p = 0.419).
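
For readers unfamiliar with Spearman’s rho, the sketch below shows one way to compute it: rank both variables (giving tied values their average rank) and take the Pearson correlation of the ranks. It is an illustration only, not the SPSS procedure used in the study, and it does not compute a p value. The function names and the example values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: years of ED experience and each nurse's kappa.
years = [1, 2, 3, 4, 5]
kappas = [0.66, 0.60, 0.64, 0.55, 0.58]
print(round(spearman_rho(years, kappas), 2))
```

A negative value, as in this made-up example, means that nurses with more experience tended to have lower agreement scores; whether such a value is statistically significant depends on the number of nurses.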


Validity

Almost one-third (32.9%) of the patient vignettes were rated differently from the experts’ ratings: 7.6% showed overtriage and 25.3% undertriage (table 4). Most of these vignettes were triaged one level above or below the triage rating of the expert. Undertriage mainly occurred when the experts triaged a patient vignette into the orange category while nurses triaged it into the yellow category (32% of all undertriage), and when the experts triaged a vignette into the yellow category while nurses triaged it into the green category (57% of all undertriage). Only 2.1% of patient vignettes were triaged two or more levels above or below the triage rating of the expert. The sensitivity for MTS categories red and orange was moderate (53.2%) and the specificity for MTS categories yellow, green and blue was high (95.1%, table 4).

Table 4 Percentage of agreement between ratings of nurses and experts

Eleven of the 50 patient vignettes concerned children under the age of 16 years. A closer look at these vignettes reveals 8.2% overtriage and 24.7% undertriage. Undertriage in children mainly occurs in the yellow category where nurses triage the patient into the green category. The sensitivity for MTS categories red and orange in children was 83.3% while the specificity was 93.7%.

Six patient vignettes concerned patients over 65 years of age. Undertriage occurred more frequently than overtriage (43.3% and 5.0%, respectively). Sensitivity and specificity were comparable with the general ED patient population (52.5% and 97.5%, respectively).


Discussion

In this study we have assessed the reliability and validity of the MTS in a general ED population. Inter-rater reliability was shown to be “moderate” to “substantial” and nurses were consistent in their triage decisions. No significant association was found between the number of years of ED work experience and inter-rater reliability. Undertriage occurred frequently, especially in the MTS categories orange and yellow. Patient vignettes concerning children aged <16 years showed a higher sensitivity. A focus on elderly patients >65 years of age revealed considerably more undertriage than in the general ED patient population (43.3% vs 25.3%).

Our study has some limitations and our results should be interpreted with care. Brenner et al12 concluded that the number of categories in the scale influences the kappa. Unweighted kappa decreases when the number of categories increases, while (quadratically) weighted kappa increases. Also, the marginal distribution in the contingency table has an influence on the kappa statistic.13

We assessed the test-retest reliability by using a time period of 19 days. It is possible that, after this time, nurses could still remember some of their former ratings. Nevertheless, the period between the first and second part of the study was based on the study by Streiner and Norman14 who stated that a retest interval of 2–14 days is usual. We did not find any association between work experience and inter-rater reliability. This might have been due to the small number of nurses who participated in the study. A larger population with a broader range of experience should be included in future studies.

One limitation of validity studies for triage systems is the lack of a gold standard. Two solutions for this problem are, however, available—using an ED database or MTS experts. We chose to cooperate with two MTS experts. Their ratings were based on expert consensus. It is possible that some vignettes would be rated slightly differently if different experts performed the rating. However, owing to their considerable theoretical and practical experience in using the MTS, we do not expect substantial deviations from current ratings. This would therefore only have a marginal influence on our results.

We refrained from using the patient’s condition as a gold standard as we did not have access to the medical files of the patients and, in addition, we did not have information on the real clinical decisions and treatments given to the patients from the vignettes. We therefore decided to use expert opinion as a gold standard. We hope to be able to include the patient’s clinical condition as a gold standard in future research.

We used percentage overtriage and undertriage as an indicator of validity. This has been used in previous triage research.2 15 The MTS evaluates the maximum waiting time for a patient to see a doctor based on their complaint. When the assigned waiting time is systematically too long or too short, the system is not working correctly. Although the clinical condition of an individual patient is preferred for determining overtriage and undertriage, we think the use of experts is the second best solution. Percentages for overtriage and undertriage may have been influenced by the limited number of patient vignettes (n = 50) and by the fact that paper-based vignettes lead to more undertriage than real triage.16

A comparison of our results with former studies shows some interesting differences. Goodacre et al4 studied inter-rater reliability and found similar results (0.31<κ<0.63) to our study. However, it is not clear whether they calculated weighted or unweighted kappa. Our results concerning the high prevalence of undertriage are not in line with the findings of Roukema et al2 who reported 40% overtriage and 15% undertriage. The sensitivity and specificity also differed, with sensitivity and specificity for detecting MTS categories red and orange of 63% and 78%, respectively, compared with 53% and 95% in our study. These differences could possibly be explained by the fact that Roukema et al2 studied the MTS in a children’s hospital and used a different gold standard (ED database).

The number of studies that can be compared with our study is limited because other studies focused on specific diagnostic groups of patients while we studied a general ED population.

In conclusion, inter-rater reliability of the MTS is moderate to substantial and test-retest reliability is high. No significant association was found between the amount of nurses’ work experience and inter-rater reliability. Undertriage in the MTS categories orange and yellow is a problem, especially in elderly patients. The MTS shows moderate sensitivity for recognising patients who need immediate or urgent care. In children the MTS is more sensitive.


Acknowledgements

The authors thank all the nurses and staff at the Meander Medical Centre Amersfoort and Mesos Medical Centre Utrecht, the Netherlands; MTS experts Ronald de Caluwé and Piet Machielse; and Sint Elisabeth Hospital, Tilburg, The Netherlands.



  • Funding: None.

  • Competing interests: None.
