Objectives
-
Describe central tendency and variability
-
Summarising datasets containing two variables
In covering these objectives we will deal with the following terms:
-
Mean, median and mode
-
Percentiles
-
Interquartile range
-
Standard deviation
-
Standard error of the mean
In the first article of this series, we discussed graphical and tabular summaries of single datasets. This is a useful end point in its own right, but in clinical practice we often also wish to compare datasets. Simply identifying by eye the differences between two graphs or data columns lacks precision. The central tendency and variability are therefore often calculated as well, so that more accurate comparisons can be made.
Central tendency and variability
It is usually possible to add to the tabular or graphical summary additional information showing where most of the values are and how widely they are spread. The former is known as the central tendency and the latter as the variability of the distribution. Generally these summary statistics should not be given to more than one extra decimal place over the raw data.
Key point
Central tendency and variability are common methods of summarising ordinal and quantitative data
CENTRAL TENDENCY
There are a variety of methods for describing where most of the data lie. The choice depends upon the type of data being analysed (table 1).
Applicability of measure of central tendency
Mean
This commonly used term refers to the sum of all the values divided by the number of data points. To demonstrate this consider the following example. Dr Egbert Everard received much praise for his study on paediatric admissions on one day to the A&E Department of Deathstar General (article 1). Suitably encouraged, he reviews the waiting time for the 48 paediatric cases involved in the study (table 2).
Waiting time for paediatric A&E admissions in one day to Deathstar General
Considering cases 1 to 12, the mean waiting time is:
(46 + 45 + 44 + 26 + 52 + 14 + 19 + 18 + 14 + 34 + 18 + 22)/12
Which is:
352/12 = 29.3 minutes
Key point
Mean = Σx/n. Where Σx = the sum of all the measurements and n = the number of measurements
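The calculation above can be reproduced in a few lines of code. The following is a minimal sketch using the 12 waiting times quoted in the text; the function and variable names are our own and are chosen only for illustration.

```python
# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]


def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


total = sum(waiting_times)       # Σx, the sum of all the measurements
n = len(waiting_times)           # n, the number of measurements
mean_wait = mean(waiting_times)

print(f"Σx = {total}, n = {n}, mean = {mean_wait:.1f} minutes")
```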
Consequently the mean takes into account the exact value of every data point in the distribution. However, the mean value itself may not be possible in practice. An example of this is the often quoted mean of “2.4 children” per family. Another major advantage of using the mean is that combining values from several small groups enables a larger group mean to be estimated. This means Egbert could determine the mean of all 48 cases by repeating this procedure for cases 13 to 24, 25 to 36 and 37 to 48. Because each of the four groups contains the same number of cases, the four separate means could then be added together and divided by 4 to give the overall mean waiting time. Had the groups been of unequal size, each mean would need to be weighted by the number of cases in its group.
The problem in using the mean is that it is markedly affected by the distribution of the data and by outliers. The latter are the extreme values of the data distribution but are not true measures of the central tendency. The mean is therefore ideally reserved for normally distributed data with no outliers.
Key points
-
The mean reflects all the data points
-
Small group means can be combined to estimate an overall mean
-
The mean is commonly used for quantitative data
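Combining small group means works by simple averaging only when the groups are the same size, as with Egbert's four groups of 12. The sketch below illustrates this with the 12 waiting times from the text, split in two different ways. The splits themselves are hypothetical and are made purely to show why weighting by group size matters.

```python
# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]


def combined_mean(group_means, group_sizes):
    """Combine group means into an overall mean, weighting each by its group size."""
    weighted_sum = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_sum / sum(group_sizes)


def summarise(groups):
    """Return the list of group means and the list of group sizes."""
    means = [sum(g) / len(g) for g in groups]
    sizes = [len(g) for g in groups]
    return means, sizes


overall = sum(waiting_times) / len(waiting_times)

# Hypothetical split into two groups of unequal size (4 and 8 cases).
means, sizes = summarise([waiting_times[:4], waiting_times[4:]])
simple_unequal = sum(means) / len(means)        # ignores group size
weighted_unequal = combined_mean(means, sizes)  # respects group size

# Hypothetical split into two groups of equal size (6 and 6 cases).
eq_means, eq_sizes = summarise([waiting_times[:6], waiting_times[6:]])
simple_equal = sum(eq_means) / len(eq_means)

print(f"overall mean            = {overall:.2f}")
print(f"unequal groups, simple  = {simple_unequal:.2f}")
print(f"unequal groups, weighted = {weighted_unequal:.2f}")
print(f"equal groups, simple    = {simple_equal:.2f}")
```

With unequal groups the simple average of the means drifts away from the overall mean, whereas the weighted version recovers it.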
Median
The mean cannot be used for ordinal data because it relies on the assumption that the gaps between the categories are the same. It is possible however to rank the data and determine the midpoint. When there is an even number of cases the mean of the two middle values is taken as the median. In Egbert's study, the median waiting time for cases 1 to 12 would therefore be calculated by:
Listing the data points in rank order:
14 14 18 18 19 22 26 34 44 45 46 52
Determining the midpoint that lies half way between the 6th and 7th data point:
(22 + 26)/2 = 24 minutes
Key point
Median = value of the (n+1)/2th observation, where n is the number of observations and the observations are ranked in ascending order
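The ranking and midpoint steps can be sketched as follows, again using cases 1 to 12 from the text. The function name is our own; Python's standard library offers an equivalent in `statistics.median`.

```python
# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]


def median(values):
    """Return the middle value of the ranked data.

    With an even number of values, the mean of the two middle values is used.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(sorted(waiting_times))
print(f"median = {median(waiting_times)} minutes")
```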
In contrast with the mean, the median is less affected by outliers, being responsive only to the number of scores above and below it rather than their actual value. It is therefore better at describing the central tendency in non-normally distributed data and when there are outliers. However, in a bimodal distribution, neither the median nor the mean will provide a good summary of the central tendency.
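To see how differently the two measures respond to an outlier, the sketch below replaces the longest wait in cases 1 to 12 (52 minutes) with a value of 520 minutes. This altered value is hypothetical, introduced purely for illustration, and is not part of Egbert's data.

```python
import statistics

# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]

# Hypothetical outlier: the 52 minute wait is replaced by 520 minutes.
with_outlier = [520 if t == 52 else t for t in waiting_times]

mean_before = statistics.mean(waiting_times)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(waiting_times)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f} minutes")
print(f"median: {median_before} -> {median_after} minutes")
```

The mean more than doubles, while the median does not move because the number of values above and below it is unchanged.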
A common example of the use of medians is seen in the longevity of grafts, prostheses and patients. It is used because the loss to follow up, and the unknown end point of the trial, prevent a mean being used to determine the average survival duration.
Key points
-
Medians only reflect the middle data points
-
You cannot combine medians to get an overall value
-
Median is commonly used for ordinal data, and for quantitative data when the data are skewed or contain outliers
Mode
This is the most commonly occurring value in a dataset. The mode is also used in nominal data to describe the most prevalent characteristic, and it is particularly useful when there is a bimodal distribution. In Egbert's study, there would be two mode values for cases 1 to 12, namely 14 and 18 minutes, each of which occurs twice.
The mode can be demonstrated graphically. Another advantage is that it is relatively immune to outliers. However, the peaks can be obscured by random fluctuations of the data when the sample size is small. Furthermore, unlike the mean and median, it can change markedly with small alterations in the sample values.
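A short sketch shows both properties: that a dataset can have more than one mode, and that the mode can shift with a small change in the sample. The first list is cases 1 to 12 from the text; the altered list, in which one wait of 18 minutes becomes 19, is a hypothetical change made only for illustration.

```python
from collections import Counter

# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]


def modes(values):
    """Return every value that shares the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical alteration: one of the 18 minute waits is recorded as 19.
altered = list(waiting_times)
altered[7] = 19

print(f"modes of the original data: {modes(waiting_times)}")
print(f"modes after one small change: {modes(altered)}")
```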
Key points
-
Mode only reflects the most common data points
-
Modes cannot be combined to get an overall value
-
Mode is commonly used for nominal data
Key points
-
The mean, median and mode will all be the same in normally distributed data but differ when the distribution is skewed (fig 1)
-
When dealing with skewed data, the mean is positioned towards the long tail, the mode towards the short one and the median is between the two. The median value divides the curve into two equally sized areas (fig 1(B) and (C))
-
The mode is a good descriptive statistic but of little other use
-
The mean and median provide only one value for a given dataset. In contrast there can be more than one mode
Location of mean, median and mode with different distributions. (A) Normal distribution. (B) Right (positive) skewed distribution. (C) Left (negative) skewed distribution.
VARIABILITY
This is a measure of the distribution of the data. It can be described visually by plotting the frequency distribution curve. However, it is best to quantify this spread using methods dependent upon the type of data you are dealing with (table 3). Though there are measures of spread of nominal data, they are rarely used other than to provide a list of the categories present.1 We will therefore concentrate on methods used to describe the variability of ordinal and quantitative data.
Methods of quantifying distribution
Range
This is the interval between the maximum and minimum value in the dataset. It can be profoundly affected by the presence of outliers and does not give any information on where the majority of the data lie. As a consequence other measures of variability tend to be used.
Percentiles
It is possible to divide up a distribution into 100 parts such that each is equal to 1 per cent of the area under the curve. These parts are known as percentiles. This process is used to identify data points that separate the whole distribution into areas representing particular percentages. For example, as described previously, the median divides the distribution into two equal areas. It therefore represents the 50th centile. Similarly the 10th centile would be the data point dividing the distribution into two areas one representing 10% of the area and the other 90% (fig 2).
Frequency distribution showing the 10th centile.
To determine the value of a particular percentile we use the same process required to work out the median. To demonstrate this consider how Egbert would determine the 10th centile in his waiting time data.
Key point
Percentiles are values that identify a particular percentage of the whole distribution
Firstly, he needs to sort them into rank order (table 4). He then locates the position of the 10th percentile by:
Waiting time for paediatric A&E admissions in one day to Deathstar General—sorted by rank order
10/100 × (n +1) where n is the number of data points in the distribution.
The position of the 10th percentile is therefore 0.1 × 49 = 4.9
This means the 10th centile lies nine tenths of the way from the 4th (that is, 10 minutes) to the 5th (that is, 11 minutes) data point. Consequently the 10th percentile is 10.9 minutes. In other words, 10 per cent of all the waiting times in Egbert's study have a value of 10.9 minutes or less. An equally correct way of expressing this result would be to say that 90% of all the waiting times in Egbert's study have a value of 10.9 minutes or greater (fig 2).
Key points
-
Value of the Xth percentile = value of the X(n+1)/100th observation, where n is the number of observations and the observations are ranked in ascending order
-
Percentiles can only be calculated for quantitative and numeric ordinal data
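The position rule and the interpolation between neighbouring data points can be sketched as below. Table 4 is not reproduced here, so the function is demonstrated on cases 1 to 12 instead; only the position arithmetic and the two neighbouring values (10 and 11 minutes) for the 10th centile are taken from the worked example. The clamping of positions that fall outside the data is our own assumption, as the text does not cover that case.

```python
# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]


def percentile(values, p):
    """Return the p-th percentile using the X(n + 1)/100 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100       # 1-based position in the ranked data
    # Assumption: positions outside the data are clamped to the extremes.
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower_rank = int(position)         # rank of the data point just below
    fraction = position - lower_rank   # how far to move towards the next point
    lower = ordered[lower_rank - 1]
    upper = ordered[lower_rank]
    return lower + fraction * (upper - lower)


# Worked example from the text: n = 48, 4th value = 10, 5th value = 11.
position_10th = 10 * (48 + 1) / 100
value_10th = 10 + (position_10th - 4) * (11 - 10)
print(f"position = {position_10th:.1f}, 10th centile = {value_10th:.1f} minutes")

# The same rule applied to cases 1 to 12.
for p in (25, 50, 75):
    print(f"{p}th centile of cases 1 to 12 = {percentile(waiting_times, p)} minutes")
```

Statistical packages use several slightly different interpolation rules, so results from other software may differ a little from those given by this rule.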
The two most commonly used percentiles are the 25th and 75th ones. As they are marking out the lower and upper quarters of the distribution they are known as the lower and upper quartile values. The data points between them represent the interquartile range.
Interquartile (IQ) range
As the lower and upper ends of the distribution are eliminated, the interquartile range shows the location of most of the data and is less affected by outliers. It represents a good method of summarising the variability when dealing with ordinal data or distributions that are not “normal”.
The IQ range can also be used to provide a good graphical representation of the data. As the lower and upper quartile values represent the 25th and 75th percentile, and the median the 50th, these can be used to draw a “box and whisker” plot (fig 3). The box represents the central 50% of the data and the line in the box marks the median. The whiskers extend to the maximum and minimum values. In the case of outliers the whiskers are restricted to a length one and a half times the interquartile range, with other outliers being shown separately (fig 3). Box and whisker plots allow the reader to quickly judge the centre, spread and symmetry of the distribution. In so doing they enable two distributions to be compared (see later). Nevertheless, from a mathematical point of view, the IQ range does not provide the versatility required for detailed comparisons of normally distributed quantitative data. For this we need the variance and standard deviation.
A box and whisker plot.
Key points
-
The IQ range is a good method of describing data but not the most efficient way of describing data variability when comparing groups
-
The IQ range is commonly used to describe the variability of ordinal data and quantitative data that are skewed or have outliers.
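The quartiles, the IQ range and the whisker limits of a box and whisker plot can be sketched as follows for cases 1 to 12. The sketch uses `statistics.quantiles` with `method="exclusive"`, which applies the same (n + 1) positioning as the percentile rule described earlier. Measuring the one and a half times limit from the edges of the box is a common convention and is assumed here; the text does not state where the limit is measured from.

```python
import statistics

# Waiting times (minutes) for cases 1 to 12, as listed in the text.
waiting_times = [46, 45, 44, 26, 52, 14, 19, 18, 14, 34, 18, 22]

# Lower quartile, median and upper quartile (25th, 50th and 75th centiles).
q1, q2, q3 = statistics.quantiles(waiting_times, n=4, method="exclusive")
iqr = q3 - q1

# Assumed convention: whiskers reach at most 1.5 x IQR beyond the box edges.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [t for t in waiting_times if t < lower_limit or t > upper_limit]

print(f"lower quartile = {q1}, median = {q2}, upper quartile = {q3}")
print(f"IQ range = {q1} to {q3} minutes (width {iqr})")
print(f"values plotted separately as outliers: {outliers}")
```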
Variance
An obvious way of looking at variability is to measure the difference of each value from the sample mean, that is, the calculated mean of the study group. However, simply adding up these differences would always give zero, because the positive and negative differences from the mean cancel each other out. By squaring all the differences before adding them together, a positive result is always achieved. The sum of these squares about the mean is usually known as the “sum of the squares”. For example, in a study the following differences were found between the mean and each value:
−3, −2, −2, −1, 0, 0, 0, 1, 1, 1, 2, 3
Therefore the sum of the squares is:
9 + 4 + 4 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 4 + 9 = 34
The size of the sum of the squares will vary according to the number of observations as well as the spread of the data. To get an average, the sum of the squares is divided by the number of observations. The result is called the “variance”.
Key point
Variance = Σ(Mean − x)²/n. Where Σ(Mean − x)² = the sum of the squares of all the differences from the mean
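The steps from differences to variance can be checked with a short sketch using the 12 differences quoted in the example. It divides by n, as in the definition given here; note that many statistical packages divide by n − 1 by default when estimating the variance from a sample, so their results will be slightly larger.

```python
# Differences between each value and the mean, as given in the example.
differences = [-3, -2, -2, -1, 0, 0, 0, 1, 1, 1, 2, 3]

# Adding the raw differences always gives zero.
sum_of_differences = sum(differences)

# Squaring first removes the signs, giving the "sum of the squares".
squares = [d ** 2 for d in differences]
sum_of_squares = sum(squares)

# Variance as defined in the text: sum of the squares divided by n.
variance = sum_of_squares / len(differences)

print(f"sum of differences = {sum_of_differences}")
print(f"squares = {squares}")
print(f"sum of the squares = {sum_of_squares}")
print(f"variance = {variance:.2f}")
```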
Standard deviation
Variance has limited use in descriptive statistics because it does not have the same units as the original observations. To overcome this the square root of the variance can be calculated. This is known as the standard deviation (SD) and it has the same units as the variables measured originally. If the SD is small then the data points are close to one another, whereas a large SD indicates there is a lot of variation between individual values. In fact the mean for positive values is no longer an adequate measure of the central tendency when it is smaller than the SD.
Key points
-
The standard deviation is a measure of the average distance individual values are from the mean
-
The bigger the variability the bigger the standard deviation
-
The standard deviation is used to measure the variability of quantitative data that are not markedly skewed and do not have outliers
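Continuing with the 12 differences from the variance example, the standard deviation is simply the square root of the variance. The sketch below also cross-checks the result against `statistics.pstdev`, which, like the definition used in this article, divides by n.

```python
import math
import statistics

# Differences between each value and the mean, from the variance example.
differences = [-3, -2, -2, -1, 0, 0, 0, 1, 1, 1, 2, 3]

variance = sum(d ** 2 for d in differences) / len(differences)
sd = math.sqrt(variance)

# The differences have a mean of zero, so their population SD is the same value.
sd_check = statistics.pstdev(differences)

print(f"variance = {variance:.2f}")
print(f"SD = {sd:.2f} (same units as the original measurements)")
```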
When the data distribution is normal, or nearly so, it is possible to calculate that 68.3% of the data lie within +/− 1 SD of the mean; 95.4% within +/− 2 SD; and 99.7% within +/− 3 SD (fig 4).
A normal distribution curve divided up into different multiples of the standard deviation (SD).
The “normal range” corresponds to the interval that includes 95% of the data. This is equal to 1.96 SD on either side of the mean but is often approximated to the mean +/− 2 SD. If this criterion is used to diagnose health and disease, it follows that 5% of all results from normal individuals will be classified as being pathological (fig 4).
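The percentages quoted for 1, 2 and 3 SD, and the figure of 1.96 SD for the central 95%, can be verified from the standard normal distribution. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `coverage` is our own.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1.
standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/- {k} SD: {coverage(k) * 100:.1f}%")

# Number of SDs either side of the mean that encloses the central 95%.
z_95 = standard_normal.inv_cdf(0.975)
print(f"central 95% lies within +/- {z_95:.2f} SD")
```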
STANDARD ERROR OF THE MEAN
It is useful at this stage to consider the standard error of the mean (SEM). Usually in clinical practice we do not know what the true overall mean is. All we have are the results from our study and the standard deviation. Fortunately it can be shown mathematically that, provided our study group size is over 10 and randomly selected, the overall mean lies within a normal distribution whose centre is our study group's mean (fig 5). The standard deviation of this distribution is called the SEM. This can be calculated by dividing the sample's standard deviation by the square root of the number of observations in the sample.
Distribution of the mean of samples of various sizes.
Key point
SEM = SD/√n. Where SD = the standard deviation of the sample and n = the number of observations in the sample
As the distribution is normal, the same principles apply as found with the SD above. Consequently there is a 95% chance that the overall mean lies within +/− 1.96 SEM of the sample mean.
To demonstrate these points, consider what Watenpaugh and Gaffney are saying when they write, “One hour after infusion, the capillary reabsorption of fluids was −236 +/− SEM 102 ml/h”.2
This means we can be 95% sure that the overall mean reabsorption rate lies somewhere between:
−236 +/− (1.96 × 102) ml/h = −435.9 ml/h to −36.1 ml/h
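The interval above can be reproduced with a small helper. This is a sketch only: the function name is our own, and the multiplier of 1.96 assumes, as in the text, that the distribution of the mean is normal.

```python
def interval_95(sample_mean, sem, z=1.96):
    """Return the range within which the overall mean lies with 95% confidence."""
    margin = z * sem
    return sample_mean - margin, sample_mean + margin


# Values quoted from Watenpaugh and Gaffney: mean -236 ml/h, SEM 102 ml/h.
lower, upper = interval_95(-236, 102)
print(f"95% interval: {lower:.1f} ml/h to {upper:.1f} ml/h")
```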
Key points
-
SEM is used when describing the overall mean, not the study group's mean
-
SEM is a measure of the variability of the overall mean as estimated from the study group. It is not a measure of the distribution of the data.
The size of the SEM tends to decrease as the study group size gets larger or the variation within the sample becomes less marked. Consequently samples greater than 30 with little variation will have a small SEM. As a result, the overall mean can be reliably pinpointed to lie within a narrow range.
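The effect of sample size on the SEM follows directly from dividing the SD by the square root of the number of observations. In the sketch below the SD of 12 minutes and the sample sizes are hypothetical values chosen only to illustrate the trend.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: the sample SD divided by the square root of n."""
    return sd / math.sqrt(n)


# Hypothetical SD of 12 minutes, held constant while the sample size grows.
hypothetical_sd = 12
for n in (10, 30, 100, 1000):
    sem = standard_error(hypothetical_sd, n)
    print(f"n = {n:4d}: SEM = {sem:.2f} minutes")
```

Because of the square root, quadrupling the sample size only halves the SEM.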
So far in articles 1 and 2, we have concentrated on summarising single variable datasets. In practice this is often a preliminary step to combining and comparing different collections of data.
Summarising data from two variables
Again the method used for this is dependent upon the type of data being dealt with (table 5). In subsequent articles we will discuss how each of these various types of comparison is carried out. For now, we need to consider the important points that have to be borne in mind when presenting the summary of two variables.
Methods used to compare different types of data
CROSS TABULATION
When comparing nominal datasets, a frequency table can be drawn showing the number and proportion (percentage) of cases with a combination of variables. If there are large numbers, and marked variations, it is often better to present the data as a percentage of either the row or column totals. This makes interpretation by the reader easier. However, when using percentages, you need to indicate the total number from which they were derived. If one or more of the datasets is ordinal or quantitative then the groups must be arranged in the correct order (fig 6).
Cross tabulation using an example of quantitative data (waiting times) and ordinal data (Triage category) from Egbert's study on paediatric A&E attendances.
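A cross tabulation with row percentages can be sketched in a few lines. The records below are entirely hypothetical and are not Egbert's data; the category names and waiting time bands are likewise our own choices for illustration. Note that the rows and columns are listed in their natural order and that each row reports the total from which its percentages were derived.

```python
from collections import Counter

# Hypothetical records of (triage category, waiting time band); not Egbert's data.
records = [
    ("red", "0-15 min"), ("red", "0-15 min"),
    ("yellow", "0-15 min"), ("yellow", "16-30 min"), ("yellow", "16-30 min"),
    ("green", "0-15 min"), ("green", "16-30 min"),
    ("green", "over 30 min"), ("green", "over 30 min"), ("green", "over 30 min"),
]

# Ordinal and quantitative groups must be kept in their correct order.
row_order = ["red", "yellow", "green"]
column_order = ["0-15 min", "16-30 min", "over 30 min"]

# Count how many cases fall into each combination of the two variables.
counts = Counter(records)


def row_percentages(row):
    """Return the row total and each cell as a percentage of that total."""
    row_total = sum(counts[(row, col)] for col in column_order)
    cells = [100 * counts[(row, col)] / row_total for col in column_order]
    return row_total, cells


print("category " + "  ".join(f"{col:>11}" for col in column_order))
for row in row_order:
    total, cells = row_percentages(row)
    line = "  ".join(f"{p:10.1f}%" for p in cells)
    print(f"{row:<8} {line}  (n = {total})")
```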
BAR CHART
You can use bar charts when comparing nominal and ordinal datasets. In the latter case the groups must be arranged in the most appropriate order (fig 7). This comparison has to be done visually and is therefore prone to error.
Frequency of male paediatric admissions in each triage category.
GRAPHICAL REPRESENTATION
The dot plot provides a useful way of allowing ungrouped, discrete data to be visually compared. As with the bar chart, in the dot plot the values of the variable are listed on the vertical axis and the categories on the horizontal.
Figure 8 shows the values of an objective measure of muscle bulk in the hand according to a simple clinical classification of muscle wasting for 61 elderly subjects. With the dot plot each individual value is shown (fig 8 (A)). It is therefore easy to see that there is a large amount of scatter and that the distributions are slightly skewed. As described previously the box and whisker plots can be used to summarise the data and demonstrate the difference between the groups (fig 8 (B)). This shows those with most wasting tend to have smaller cross sectional areas (CSA). Further investigation reveals that when the CSA are log transformed the distribution is more symmetrical. We can therefore now use means and SD to summarise the transformed data (fig 8C). These are shown against a logarithmic scale in their original units as it makes them easier to interpret.
Three methods of showing the central tendency and variance using the same data. (A) Scatter plot. (B) Box and whisker plot. (C) Scatter plot using a logarithmic scale.
Graphical representation of central tendency and variance allows data to be summarised in a readily accessible format (figs 8 (B) and 9 (A)). This is commonly done by marking the central tendency and providing a measure of the variability. However, if it is important to represent each reading, a dot plot can be used. This enables the corresponding points to be joined when considering a series of readings from the same subject (fig 9(B)).
Two methods of presenting the same data. (A) provides a measure of the central tendency and variance using a box and whisker plot. However, if the change in each subject is needed, individual values should be plotted and joined (B).
SCATTER PLOT, CORRELATION AND REGRESSION
A scatter plot enables the relation between quantitative-continuous variables to be demonstrated (fig 10). By convention the variable that is “controlling” the other is known as the independent variable and its value is recorded on the horizontal axis. These independent variables are those that are varied by the experimenter, either directly (for example, treatment given) or indirectly through the selection of subjects (for example, age). The dependent variables are the outcome of this experimental manipulation and they are plotted on the vertical axis.
Grip strength and cross sectional area (CSA) of index finger-thumb web space.
Summary
When confronted with a vast amount of information, first consider the type of data present. Then summarise appropriately (see article 1), noting the central tendency and variability. Once this is complete it is possible to combine the information from two or more variables taking into account what types of data they are.
Quiz
1. If data are skewed to the right, which is greater, the median or the mean?
2. Describe two main differences between the mean and the median.
3. Consider Egbert's waiting time data listed in table 4.
(a) What are the mean, median and mode?
(b) What is the 75th centile?
(c) What is the interquartile range?
4. Figure 11 shows a box and whisker plot from Egbert's study.
(a) What do the boxes represent?
(b) What does the line in each box represent?
(c) How would you describe the box and whisker plot of triage category (green) and booking time?
5. One for you to do on your own. Mullner et al carried out an experiment to look at myocardial function following successful resuscitation after a cardiac arrest.3 Part of their results is shown in table 6:
(a) Identify the different types of data
(b) Identify the central tendency and variability for each variable
(c) Present a tabular summary of the patients' ages
(d) What graphical summary would you use to compare the frequency of VF with site of the myocardial infarction?
(e) What graphical summary would you use to show the comparison between the site of the myocardial infarction and the duration of the arrest?
(f) What method would you use to show the comparison between the systolic blood pressure and duration of arrest?
Box and whisker plot of booking time by triage category.
Answers
1. The mean
2. The mean reflects all the data points whereas the median uses only the middle ones. Secondly, unlike medians, small group means can be combined to give an overall mean.
3. (a) Mean, median and mode:
Mean = 1100/48 = 22.9 minutes
Median = value of the 49/2 = 24.5th data point, that is, half way between the 24th and 25th data points = 19 minutes
Mode = 19 minutes
(b) The 75th centile is the value of the (75 × 49)/100 = 36.75th data point. This means it lies 0.75 of the way from the 36th to the 37th data point. However, using table 4, both data points are 27 minutes. The 75th centile is therefore 27 minutes.
(c) The interquartile range is the range between the 25th and the 75th centiles. Using the process described in 3 (b), the 25th centile is 15 minutes. The interquartile range is therefore 15–27 minutes.
4. (a) The middle 50% of the data
(b) The median
(c) Negatively skewed distribution (long tail on the side with the lowest numbers)
Further reading
Altman D, Bland J. Quartiles, quintiles, centiles and other quantiles. BMJ 1994;309:996.
Bowlers D. Measure of average. In: Statistics from scratch. Chichester: John Wiley, 1996:83–113.
Coogan D. Statistics in clinical practice. London: BMJ publication, 1995.
Gaddis G, Gaddis M. Introduction to biostatistics: part 2, descriptive statistics. Ann Emerg Med 1990;19:309–15.
Glaser A. Descriptive statistics. In: High-yield biostatistics. Philadelphia: Williams and Wilkins 1995:1–18.
Acknowledgments
The authors would like to thank Sally Hollis and Alan Gibbs who helped with an earlier version of this article and the invaluable suggestions from Jim Wardrope, Paul Downs and Iram Butt.
Footnotes
-
Funding: none.
-
Conflicts of interest: none.