A multicentre observational study to evaluate a new tool to assess emergency physicians' non-technical skills
  1. Lynsey Flowerdew (1),
  2. Arran Gaunt (2),
  3. Jessica Spedding (3),
  4. Ajay Bhargava (2),
  5. Ruth Brown (4),
  6. Charles Vincent (5),
  7. Maria Woloshynowych (5)
  (1) Centre for Patient Safety and Service Quality, Imperial College London, London, UK
  (2) Conquest Emergency Department, East Sussex Healthcare NHS Trust, Hastings, UK
  (3) The Royal London Hospital Emergency Department, Barts and the London NHS Trust, London, UK
  (4) St Mary's Emergency Department, Imperial College Healthcare NHS Trust, London, UK
  (5) Centre for Patient Safety and Service Quality, Department of Surgery and Cancer, Imperial College, London, UK
  Correspondence to Dr Lynsey Flowerdew, Imperial College London, 5th Floor, Wright Fleming Building, St Mary's Campus, Norfolk Place, London W2 1PG, UK; l.flowerdew{at}


Objective To evaluate a new tool to assess emergency physicians' non-technical skills.

Methods This was a multicentre observational study using data collected at four emergency departments in England. A proportion of observations used paired observers to obtain data for inter-rater reliability. Data were also collected for test-retest reliability, observability of skills, mean ratings and dispersion of ratings for each skill, as well as a comparison of skill level between hospitals. Qualitative data described the range of non-technical skills exhibited by trainees and identified sources of rater error.

Results 96 assessments of 43 senior trainees were completed. At a scale level, intra-class correlation coefficients for the three observer pairs were 0.575, 0.532 and 0.419; using mean scores, they were 0.824, 0.702 and 0.519. Spearman's ρ for test-retest reliability was 0.70 using mean scores. All skills were observed more than 60% of the time. The skill Maintenance of Standards received the lowest mean rating (4.8 on a nine-point scale) and the highest mean was calculated for Team Building (6.0). Two skills, Supervision & Feedback and Situational Awareness-Gathering Information, had significantly different distributions of ratings across the four hospitals (p<0.04 and 0.007, respectively), and this appeared to be related to the leadership roles of trainees.

Conclusion This study shows the performance of the assessment tool is acceptable and provides valuable information to structure the assessment and training of non-technical skills, especially in relation to leadership. The framework of skills may be used to identify areas for development in individual trainees, as well as guide other patient safety interventions.

  • Emergency medicine
  • assessment
  • leadership
  • non-technical skills
  • safety
  • training

Non-technical skills are ‘the cognitive, social and personal resource skills that complement technical skills, and contribute to safe and efficient task performance’.1 Examples include communication and decision making. There is increasing empirical evidence linking non-technical skills to patient safety specifically in the emergency department (ED).2 Poor performance of non-technical skills has been shown to be a significant contributor to medical error. Conversely, effective non-technical skills can avert error or mitigate the effects of error. Examples include the appropriate use of assertiveness to speak out when patient safety is at risk,3 and the use of closed-loop communication to ensure an important message has been heard correctly.4 In particular, good leadership has been shown to be a vital skill for emergency medicine (EM) physicians.5 Traditionally, the development of UK trainees has been somewhat lacking in this area, with many registrars getting only limited feedback and training in non-technical skills related to leadership and teamwork.

Formative assessment is vital to promote reflection, monitor progress and guide future learning by prioritising training needs.1,6 A tool has been developed to formatively assess non-technical skills for EM trainees in the workplace, focusing particularly on the leadership role.7 This uses direct observation of performance to provide structured and specific feedback to trainees on key non-technical skills, such as managing workload and situational awareness. The assessment tool consists of a number of non-technical skills, grouped into categories, and each skill is linked to explicit behavioural markers which illustrate good and poor practice, for example, ‘encourages team members’ input in decision making’. This is combined with a rating scale to enable assessors to rate skills based on observable behaviours. Such tools are known as ‘behavioural marker systems’ and have been used to structure observation and assessment of non-technical skills in other areas of healthcare, such as anaesthetics and surgery.8,9 Development of the assessment tool used data triangulated from a number of sources including a review of the literature, examination of relevant curricula, interviews with ED staff and a series of observations.

The ‘utility index’ is a framework to evaluate assessment tool design and has five components: (1) reliability; (2) validity; (3) educational impact; (4) cost efficiency and (5) acceptability.10 Preliminary evaluation of the tool in a previous study involved assessing content validity and acceptability using a survey of experts.7 The current study was designed to focus on assessing the tool's reliability and understand sources of rating error in a real-life setting. It is particularly challenging to achieve adequate levels of reliability using workplace-based assessments. In a large study designed to look at implementing workplace-based assessments across the UK, the authors commented that the mini-CEX (Mini Clinical Evaluation Exercise) was subject to ‘significant assessor error’.11 Poor inter-rater reliability has also been documented for the CEX using single encounters (intra-class coefficients between 0.23 and 0.5).12 Furthermore, test-retest reliability (reproducibility) based on a single encounter for the mini-CEX was calculated to be only 0.23.13 The primary aims of this multicentre study were to evaluate inter-rater reliability and test-retest reliability of the assessment tool. Secondary aims were to determine the observability of each of the 12 skills in the tool, to describe the range of non-technical skills displayed by ED registrars and to determine whether observed non-technical skills vary between the departments included in the study.



This was a multicentre observational study using direct observation in the workplace. The aim was to collect 100 observational episodes over a 3-month period, each lasting approximately 1 h, using trained observers. Data were collected from four NHS Emergency Departments comprising two teaching hospitals and two district general hospitals in London and the south of England.


Participants were EM doctors including both training-grade registrars and non-training-grade equivalent doctors. Participants were approached on the days that the observer was available to undertake the observations.

Four observers collected data. All observers were EM trainees from the South East Thames region who had received instruction in use of the tool. The first two workplace-based observations for each observer were considered training exercises, and novice observers undertook the assessments paired with the lead investigator. After each observation, both observers completed the assessment form and any discrepancies were discussed to aid calibration.


Ethical approval was granted by North West London REC1, and site-specific approval was sought from each of the four study sites. The study is registered in the UK Clinical Research Network Portfolio.

Participants were emailed a copy of the Participant Information Sheet, and written consent was sought prior to commencing the first observation. Verbal consent was sought prior to each subsequent observation, and all data were anonymised.

The assessment involved following the registrar for a 45–70-min period and recording detailed field notes on observed behaviours. At the end of the observation period, the investigator used the field notes to populate the assessment form (figure 1) and give a rating for each skill based on documented behaviours. Participants were offered informal individual feedback by observers. Approximately one-third of all assessments used a pair of observers to obtain inter-rater reliability data. When more than one observer was used, each observer was blinded to the other observer's ratings. After each paired assessment, ratings were compared and disagreements in ratings were discussed (ratings were not altered). For single-observer assessments, rating sheets were reviewed by a second member of the research team and then discussed with the observer. When observers agreed that a rating error had been made, this was recorded in hand-written notes.

Figure 1

Tool for the assessment of emergency physicians' non-technical skills.

Data analysis

Inter-rater reliability

Inter-rater reliability was evaluated by calculating intra-class correlation coefficients (ICCs). As with other reliability coefficients, there is no standard acceptable level of reliability for the ICC, but acceptable values between 0.6 and 0.7 have been quoted.14–16 A cut-off of ≥0.7 was chosen for this study. ICCs were calculated for each skill and for the overall scale, and this was repeated for each pair of observers. ICCs were also calculated for the mean scores across the whole of the instrument, reflecting how other workplace-based assessments have been evaluated in the UK.11 Records of observer discussions were reviewed to summarise sources of error (eg, a behaviour that appeared to be incorrectly categorised) or bias (eg, high or low scores that did not appear to fit with the evidence documented).
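As a minimal sketch of how such a coefficient can be computed, the Python functions below implement the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), described by Shrout and Fleiss. The study does not state which ICC form was used, so this choice is an assumption, as are the function names and the example ratings, which are hypothetical and not study data.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: one row per assessment, one column per observer.
    This ICC form is an assumption; the article does not specify one.
    """
    n = len(ratings)        # number of assessments
    k = len(ratings[0])     # number of observers
    all_scores = [x for row in ratings for x in row]
    grand = mean(all_scores)
    row_means = [mean(row) for row in ratings]
    col_means = [mean([row[j] for row in ratings]) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-assessment mean square
    msc = ss_cols / (k - 1)                 # between-observer mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def mean_score_table(assessments):
    """Collapse per-skill ratings to one mean score per observer per assessment.

    assessments: list of assessments, each a list of observers,
    each observer a list of skill ratings.
    """
    return [[mean(observer) for observer in a] for a in assessments]


# Hypothetical paired ratings for one skill (two observers, five assessments).
paired = [[5, 6], [4, 4], [7, 6], [6, 5], [3, 4]]
print(round(icc_2_1(paired), 3))
```

Passing the output of `mean_score_table` to `icc_2_1` mirrors the mean-score analysis described above; the result can then be compared with the chosen cut-off of 0.7.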

Test-retest reliability

Test-retest reliability measures the consistency of an assessment tool when administered on different occasions to the same sample (ie, stability of scores over time or cases). This was calculated for both individual skills and mean scores using Spearman's ρ. Data were taken from the first two assessments of participants who were observed more than once. Normally, a correlation of more than 0.7 is considered acceptable.17
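The calculation can be sketched as follows: Spearman's ρ is the Pearson correlation of the ranks of the first and second scores, with tied values given their average rank. The function names and the example scores are hypothetical, and the handling of ties is an assumption, since the article does not describe it.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman_rho(first, second):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(first), average_ranks(second)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical mean scores for six trainees on their first and second assessment.
first_visit = [5.2, 6.1, 4.8, 5.9, 6.4, 5.0]
second_visit = [5.5, 5.8, 4.9, 6.2, 6.0, 5.1]
print(round(spearman_rho(first_visit, second_visit), 2))
```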


Percentages (ie, the number of assessments where the skill was observed at least once divided by the total number of assessments) were used to determine how often each of the 12 skills was observed. There is no agreed level at which a skill should be observed to warrant inclusion in a tool. However, it is logical to expect a skill to be observed more often than not (ie, >50%).18
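A short sketch of this percentage is given below. It assumes each assessment is stored as a mapping from skill name to rating, with `None` recorded when the skill was not observed; that data layout and the function name are illustrative assumptions.

```python
def observability(assessments, skill):
    """Percentage of assessments in which the skill was observed at least once.

    assessments: list of dicts mapping skill name to a rating,
    or to None when the skill was not observed (assumed layout).
    """
    observed = sum(1 for a in assessments if a.get(skill) is not None)
    return 100.0 * observed / len(assessments)


# Hypothetical records for four assessments.
records = [
    {"Anticipation": 5, "Team Building": 6},
    {"Anticipation": None, "Team Building": 7},
    {"Anticipation": 4, "Team Building": 6},
    {"Anticipation": 6, "Team Building": 5},
]
pct = observability(records, "Anticipation")
print(pct, pct > 50)  # compare against the >50% expectation
```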

Description of the range of skills

Each of the individual skills was analysed to determine the average rating (mean) and dispersion of ratings (2 SDs from the mean).
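For illustration, the summary for one skill might be computed as below. The sketch assumes the sample standard deviation (n−1 denominator) and skips assessments where the skill was not observed; neither detail is stated in the article, and the function name and ratings are hypothetical.

```python
from statistics import mean, stdev


def rating_summary(ratings):
    """Return (mean, mean - 2 SD, mean + 2 SD) for one skill.

    Unobserved assessments (None) are skipped. The sample SD is an assumption.
    """
    observed = [r for r in ratings if r is not None]
    m = mean(observed)
    sd = stdev(observed)
    return m, m - 2 * sd, m + 2 * sd


# Hypothetical nine-point-scale ratings for a single skill.
print(rating_summary([4, 5, None, 6, 5, 3, 7]))
```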

Comparison of hospitals

The Kruskal–Wallis test was used to compare hospitals to determine whether any significant difference in skill level existed. Post hoc procedures involved using Mann–Whitney tests to determine which pairs of hospitals differed significantly for a particular skill. A Bonferroni correction was applied to reduce the risk of Type I errors, with the significance threshold set at p<0.0167.
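The arithmetic behind these steps can be sketched as follows. The threshold of 0.0167 equals 0.05 divided by three, which suggests three post hoc comparisons per skill (for example, one hospital against each of the other three) rather than all six possible pairs; that reading is an assumption. The function below computes the tie-corrected Kruskal–Wallis H statistic only; converting H to a p value requires the chi-squared distribution with (number of groups − 1) degrees of freedom, which is left to a statistics package. All names and data are hypothetical.

```python
from itertools import combinations


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of rating groups.

    Undefined (division by zero) if every rating is identical.
    """
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)
    rank_of = {}      # average rank for each distinct rating
    tie_term = 0      # sum of (t^3 - t) over groups of tied ratings
    i = 0
    while i < n_total:
        j = i
        while j + 1 < n_total and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    return h / (1 - tie_term / (n_total ** 3 - n_total))


# Four hospitals give six possible pairs; 0.05 / 3 matches the stated threshold.
print(len(list(combinations(range(4), 2))), round(bonferroni_alpha(0.05, 3), 4))

# Hypothetical ratings for one skill at four hospitals.
hospitals = [[3, 4, 4, 5], [5, 6, 5, 6], [7, 6, 7, 8], [5, 5, 6, 4]]
print(round(kruskal_wallis_h(hospitals), 2))
```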


A total of 96 assessments were completed and 43 registrars were observed.

Inter-rater reliability

At a scale level, none of the three pairs of observers achieved the desired level of 0.7 for inter-rater reliability (0.575, 0.532 and 0.419). However, ICCs calculated using mean scores for each assessment were more acceptable (0.824, 0.702 and 0.519). The confidence intervals (CIs) for ICCs of individual skills were very wide because of the small sample size, making meaningful interpretation difficult (see online appendix). However, comparing values across the observer pairs, it is possible to identify skills that had particularly low ICCs and those that were generally high. For example, Authority and Assertiveness had low ICCs for all three pairs (−0.111 to 0.455), as did Situational Awareness (SA)-Informing the Team (0–0.474). In contrast, Workload Management had generally high ICCs (0.595–0.778), as did Team Building (0.641–0.714).

A review of assessment sheets and discussion notes identified possible sources of variation and solutions to rater error. This is summarised in table 1 and further examples are given in the online appendix.

Table 1

Sources of rater error and potential solutions

Test-retest reliability

Eighteen participants were observed more than once. Spearman's ρ was 0.26 when calculated using all 12 individual skills and 0.70 when calculated using mean scores.


The observability of each skill is shown in table 2. While all skills achieved the acceptable level of 50%, four were observed less frequently than the others. These were Supervision & Feedback (77%), Authority and Assertiveness (68%), Outcome Review (76%) and Anticipation (70%).

Table 2

Observability of each of the 12 skills

Description of the range of skills

Figure 2 illustrates the mean and dispersion of ratings for each skill. A series of bar charts showing the frequency of ratings for each skill is available in the online appendix. Only one skill, Maintenance of Standards, had a mean <5 (mean 4.8). Other skills with relatively low means included Supervision & Feedback, Authority and Assertiveness, Decision Making (DM)-Selecting & Communicating and Anticipation. The highest mean was calculated for Team Building (mean 6.0). Skills with a larger range of ratings were Maintenance of Standards, Workload Management and SA-Gathering Information, while skills with a relatively narrow distribution of ratings included Authority and Assertiveness, DM-Outcome Review and SA-Informing the Team. Examples of observed good and poor behaviours are shown in table 3 (a complete list is available from the authors).

Figure 2

Mean ratings and dispersion of ratings.

Table 3

Examples of poor and good behaviour observed during the study

Comparison of hospitals

Two of the 12 skills, Supervision & Feedback (Skill 3) and SA-Gathering Information (Skill 10), had significantly different distributions of ratings across the four hospitals (p<0.037 and 0.007, respectively) as illustrated in figure 3. Hospital 1 participants performed significantly worse at the skill Supervision & Feedback compared with the other three hospitals. Discussion with observers revealed that Hospital 1 had the highest level of ‘shop-floor’ consultant presence, so registrars may have relinquished teaching and supervision to the consultant, thereby missing valuable opportunities for teaching more junior doctors. Hospital 3 performed significantly better at the skill SA-Gathering Information compared with the other three hospitals. Observers noted that registrars at this hospital took on a formal leadership role under the supervision of consultants. There was a noticeable increase in situational awareness behaviours related to organisational issues, such as waiting times, staffing levels and monitoring staff communications (eg, overhearing nurses speaking about a sick patient), as they appeared to take on more responsibility for the coordination of the department. Although the difference was not statistically significant, observers noted that registrars in Hospital 2 performed particularly poorly in this skill. There was no identifiable lead doctor, and the nurse in charge took on the main leadership role.

Figure 3

Results of the Kruskal–Wallis test. This figure is produced in colour in the online journal. (Please visit the website to view the colour figure).


This multicentre study evaluates a new tool to assess EM trainees' non-technical skills. It also gives a unique insight into the non-technical skills required by ED registrars working in departments in the UK. Furthermore, this framework of skills provides a basic syllabus to help focus training and patient safety interventions related to non-technical skills. For example, the tool could be used during teaching sessions to structure discussion around real-life or hypothetical clinical scenarios. Alternatively, the framework could be used to help provide a more detailed investigation of clinical incidents to identify teamwork actions that may have contributed to the event, or to consider where improved non-technical skills may have averted error. Skills assessment may also highlight areas in an ED where improved systems could enhance patient safety, for example, by introducing a formal handover proforma, or ensuring a shift leader is designated. In particular, the list of skills and behaviours provides inexperienced registrars with a guide as to what may be expected of them in a leadership role.

While inter-rater reliability did not achieve the desired level of 0.7 at the scale level, when trainees' mean scores were used, ICC values were more acceptable and ranged between 0.519 and 0.824. This suggests that observers may reliably agree on overall performance of non-technical skills. Similarly, test-retest reliability showed good correlation when using mean scores (0.70). Average scores have been used by researchers to evaluate other assessment tools (mini-CEX, Direct Observation of Procedural Skills (DOPS), Multi-Source Feedback (MSF) and Anaesthetists' Non-Technical Skills (ANTS)) as this eliminates error due to misclassification.11,13,16 The reliability results of this study compare favourably with other published research using single encounters.11–13,17 For example, ICCs for the CEX were between 0.23 and 0.5,12 and test-retest reliability for the mini-CEX was only 0.23.13 Problems with reliability are acknowledged in the document ‘Developing and maintaining an assessment system—a PMETB guide to good practice’,19 in which the authors accept that workplace-based assessments may not achieve adequate reliability compared with other assessments. This document also warns against placing too much emphasis on numerical scores, and highlights the importance of narrative information. For these reasons, documenting specific examples of observed behaviour and discussing these in the debrief session is considered vital for the assessment of ED trainees' non-technical skills.

Behavioural marker systems developed in other areas of medicine, such as anaesthetics and surgery, have focused on measuring inter-rater reliability of an isolated event, as used in this study.8,15,20 This is necessary for high-stakes assessments that occur on a single occasion, which is the case for licensing UK pilots. However, evaluating workplace-based assessments using Generalisability Theory21 overcomes the issues of poor reliability based on single encounters. Using this method, each trainee is assessed on multiple occasions using a number of different assessors.11,17 Generalisability Theory allows estimation of the number of assessments needed to achieve adequate levels of reliability. For example, at least eight different assessors observing at least two encounters each are recommended for the mini-CEX.11 While studies using Generalisability Theory have clear advantages, the process requires training a large number of assessors and is both time consuming and expensive. Overall, the results of this multicentre study compare favourably with other workplace-based assessments and suggest further research using Generalisability Theory may be warranted.
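The idea of estimating how many assessments are needed can be illustrated with the Spearman–Brown prophecy formula, which is a much simpler stand-in for a full Generalisability Theory analysis and is not the method used in the studies cited. It assumes that repeated assessments are interchangeable and does not separate assessor variance from case variance, as Generalisability Theory would. The function names are hypothetical, and the single-encounter value of 0.23 is taken from the mini-CEX figure quoted earlier purely as an example input.

```python
import math


def projected_reliability(single_r, n_assessments):
    """Spearman-Brown: reliability of the mean of n interchangeable assessments."""
    return n_assessments * single_r / (1 + (n_assessments - 1) * single_r)


def assessments_needed(single_r, target_r):
    """Smallest number of assessments whose mean reaches the target reliability."""
    n = target_r * (1 - single_r) / (single_r * (1 - target_r))
    return math.ceil(n)


# Example inputs: single-encounter reliability 0.23, target reliability 0.7.
n = assessments_needed(0.23, 0.7)
print(n, round(projected_reliability(0.23, n), 3))
```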

A number of sources of error were identified to explain variation in ratings between observers. Some errors are common to all behavioural marker systems (eg, concealed cognitive processes), or are weaknesses of the research process. However, other errors could be reduced by improved assessor training, covering both the tool and non-technical skills in general. Problems arise from the fact that non-technical skills are not clearly defined in the EM literature, and empirical evidence to determine exactly what constitutes good and poor behaviour is sparse. This highlights the importance of a good debrief in which the reasoning behind certain behaviours can be discussed. Furthermore, if good or poor behaviour is noted but incorrectly categorised (a significant source of rater error), the trainee will still gain educational benefit from the assessment process as long as an adequate debrief takes place.

The frequency of observation for each of the 12 skills was adequate. However, possible reasons for lower frequency of observation include: (1) the skill involves highly cognitive processes (eg, Anticipation); (2) the skill is mainly required when taking on a formal leadership role (eg, Supervision & Feedback) and (3) the skill is only required under certain circumstances (eg, Authority & Assertiveness). Targeting observations to a specific skill or multiple encounter assessments as previously recommended would help to improve the rate of observation of skills.

The results describing average ratings for each skill may have application for training non-technical skills, especially when combined with qualitative analysis of field notes. For example, Maintenance of Standards had the lowest mean rating, and review of field notes highlighted recurrent issues related to hand hygiene, safety of sharps and documentation. Observers frequently commented on missed opportunities for teaching (Supervision & Feedback) and field notes suggested that registrars often failed to adequately explain reasoning for decisions, or to discuss risks within the team (DM Selecting & Communicating Options). Training courses aimed at improving non-technical skills need to explicitly address these issues. Results showing dispersion are helpful to demonstrate which skills are likely to have higher discriminatory value for separating poorly performing trainees from those performing well. If only limited time is available to perform an assessment, focusing on skills with greater dispersion, such as Maintenance of Standards and Workload Management, may be most useful.

Comparing hospitals revealed that consultant ‘shop-floor’ presence affects how trainees exhibit skills such as Supervision & Feedback and Situational Awareness. This gives some insight into the potential effects of shifting leadership roles from the ED registrar to the consultant, and is an issue that requires further research.


The use of registrar observers may be criticised as, to date, workplace-based assessments for trainees are almost exclusively undertaken by consultants. However, the ability to judge non-technical skills is not limited to consultants and there may be some benefits of trainees rating colleagues of a similar level. For example, one weakness of workplace-based assessment is the Hawthorne effect and this is likely to be lower when the trainee is observed by someone of a similar grade. The numbers used in the study are small compared with some other studies used to evaluate tools for use in the workplace.11,17,22 As a result, CIs for individual skills were wide, and it is difficult to draw any firm conclusions. However, reliability results were encouraging and the next step would be to undertake a larger evaluation study using Generalisability Theory.


This study shows that the performance of the assessment tool is acceptable and provides valuable information to structure the assessment and training of non-technical skills, especially in relation to leadership. The framework of skills may be used to identify areas for development in individual trainees as well as guide other patient safety interventions.


The Clinical Safety Research Unit is affiliated with the Centre for Patient Safety and Service Quality at Imperial College Healthcare NHS Trust, which is funded by the National Institute for Health Research. The research described here was supported by the National Institute for Health Research and sponsored by Imperial College, London.


  • Contributors LF, RB, CV and MW designed the study. LF, AG, JS and AB collected data and contributed to interpretation of data. LF drafted the article, and all authors reviewed the manuscript and approved the final version for publication.

  • Funding This study was funded by the National Institute for Health Research. They had no role in the conduct of the study or the decision to submit the paper for publication. The Clinical Safety Research Unit is affiliated with the Centre for Patient Safety and Service Quality at Imperial College Healthcare NHS Trust, which is funded by the National Institute for Health Research.

  • Competing interests None.

  • Ethics approval Ethics approval was provided by North West London REC1.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement Further details of behaviours observed during the course of the study are available from the corresponding author at l.flowerdew{at}
