Objectives
- Dealing with unpaired (independent) parametric data
- Discuss common mistakes using the t test
In covering these objectives the following terms will be introduced:
- Unpaired z test
- Unpaired t test
- One and two tailed tests
In the previous article we discussed the comparison of paired (dependent) data.1 These result when there is a relation between the groups, for example investigating the before and after effects of a drug on the same group of patients. The key measurement here is the difference between each pair. If this comes from a population that is normally distributed the mean difference can be calculated along with the standard error of the mean (SEM). The 95% confidence intervals for the true mean difference can then be derived along with the p value for the null hypothesis (table 1).
Comparison of two groups using z and t-tests
When there is no relation between the groups, the data are called “unpaired” or “independent”. A common example of this is the controlled trial where the effect of an intervention on one group is compared with a control group without the intervention. Here the selection of the experimental group does not tell you which people will be in the control group. They are therefore independent of one another.
It is useful to note at this stage that when you compare groups you are taking into account two variations. One is due to the difference between subjects within the same group and is called the intra-group variation. The other results from the difference between the groups and so is known as the inter-group variation. With paired data the difference in subjects is removed because each subject acts as its own control. Consequently you are simply measuring the inter-group variation. In contrast, when using unpaired data, both these variations have an effect (table 1).
1 State the null hypothesis and the alternative hypothesis of the study
2 Select the level of significance
3 Establish the critical values
4 Select the groups and calculate their means and standard errors of the mean (SEM)*
5 Choose and calculate the test statistic
6 Compare the calculated test statistic with the critical values
7 Express the chances of obtaining results at least this extreme if the null hypothesis is true
Unpaired z test
We have shown previously that a systematic approach is used to determine if the null hypothesis is valid (box 1).1
To see how this works when dealing with unpaired data consider the following example. Dr Egbert Everard has been working for just over a year in the emergency department of Deathstar General. During this time the department has dealt with 100 patients who have ingested a new rave drug “Hothead”. He suspects these people may have an abnormal sodium concentration on presentation—how can he investigate this?
1 STATE THE NULL HYPOTHESIS AND ALTERNATIVE HYPOTHESIS OF THE STUDY
Having considered the problem, Egbert writes down the null hypothesis as:
“There is no significant difference between the sodium concentration in the patients ingesting “Hothead” and those of a similar age attending the emergency department”. In other words they are part of the same population.
This can be summarised to:
mean “Hothead” [Na+] = mean control [Na+]
The alternative hypothesis is the logical opposite of this, that is:
“There is a significant difference between the sodium concentration in the patients ingesting “Hothead” and those of a similar age attending the emergency department”.
This can be summarised to:
mean “Hothead” [Na+] ≠ mean control [Na+]
2 SELECT THE LEVEL OF SIGNIFICANCE
If the null hypothesis is correct, and the two groups are part of the same population, then their mean sodium concentrations should be the same. Therefore the difference between them would be zero. However, there is bound to be some small variation simply due to chance. Therefore how big a difference are we going to allow before we reject the idea that the two groups are all part of the same population?
In answering this question we rely on an interesting mathematical fact that the difference of the means represents a population that has a normal distribution. In other words, if you repeated the experiment many times, and plotted all the mean differences, the scatter diagram would take the form of a normal distribution. If the null hypothesis was correct this distribution would have a mean of zero. The standard deviation of this distribution is known as the standard error of the differences between the means (SE Diff). This is equivalent to the standard error of the mean (SEM) that has been discussed in previous articles (fig 1A).1–3
Normal distribution of the difference between the means of the two groups. (A) Shows the hypothetical population mean of this difference (central vertical line). The horizontal axis is divided up into standard errors of the difference of the mean (SE Diff). (B) Displays the same data as a standard normal distribution and replaces the SE Diff with z scores. The areas of acceptance (0.95 of the total area under the curve) and rejection (0.05 of the total area under the curve) of the null hypothesis are demonstrated. The area of rejection is split into two tails of the distribution curve.
When comparing large groups of independent data we can determine the size of different areas under the distribution curve by using the z statistic:
$$z = \frac{\mu_1 - \mu_2}{\text{SE Diff}}$$
Where:
μ1 = mean of group 1
μ2 = mean of group 2
You will note that this equation is slightly different from the one used to compare paired data.1 The numerator has changed to reflect the fact that we are interested in the difference between the means of the two groups rather than the mean difference between paired readings. Furthermore, the standard error of the difference between the means (SE Diff) has replaced the SEM to take account of the errors in estimating the means in each group:
$$\text{SE Diff} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where:
s1 = estimation of the population's standard deviation derived from group 1 with n1 subjects
s2 = estimation of the population's standard deviation derived from group 2 with n2 subjects
As discussed in article 4, s is used because in clinical practice we usually do not know the value of a population's standard deviation (σ). However, provided the sample size is large enough (that is, greater than or equal to 100) the z statistic can still be derived using s as an estimation of the population's standard deviation.3
By convention, the outer 0.025 probabilities (that is, the tips of the two tails, each representing 2.5% of the area under the curve) are considered to be sufficiently far from the population mean as to represent values that cannot be simply attributed to chance variation (fig 1B). Consequently, if the sample mean is found to lie in either of these two tails then the null hypothesis is rejected. Conversely, if the sample mean lies within these two extremes then the null hypothesis will be accepted (fig 1B).
Following convention, Egbert picks a significance level of 0.05 for his study. He now needs to determine the sodium concentration that demarcates these two tails.
3 ESTABLISH THE CRITICAL VALUES
Using the z table Egbert finds that the critical value (zCRIT) demarcating the middle 95% of the distribution curve is z = +/− 1.96 (fig 1B). In other words a z value of +/− 1.96 separates the middle 95% area of acceptance of the null hypothesis from two, 2.5% areas of rejection.
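For readers without a z table to hand, the same critical value can be obtained in software. The following is a minimal sketch using Python's standard library; the function name `z_critical` is purely illustrative and not part of the original article.

```python
from statistics import NormalDist

def z_critical(alpha: float = 0.05) -> float:
    """Two tailed critical value: alpha is split equally between the two tails."""
    # The upper critical value leaves alpha/2 of the area in the right tail.
    return NormalDist().inv_cdf(1 - alpha / 2)

z_crit = z_critical(0.05)
print(f"z_crit = +/-{z_crit:.2f}")  # z_crit = +/-1.96
```

The lower critical value is simply the negative of this number because the standard normal distribution is symmetrical about zero.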
With the null and alternative hypotheses defined, and the critical values established (zCRIT), the patients for the study can now be selected. The z statistic derived from the sample (zCALC) can then be determined.
4 SELECT THE GROUPS AND CALCULATE THEIR MEANS
Egbert gathered the presenting sodium concentrations from a sample of 100 patients who have taken “Hothead”. The mean (μ1) and estimation of the population's standard deviation (s1) were calculated and found to be 131 mmol/l and 10 mmol/l respectively.
Egbert had arranged with Ivor Whitecoat, senior laboratory technician at Deathstar General, to measure the sodium concentration in a control group of patients who had not taken “Hothead”. Ivor tells him that the mean sodium concentration in 100 patients of a similar age presenting to the emergency department (μ2) is 134 mmol/l with an s of 7 mmol/l (s2).
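5 CHOOSE AND CALCULATE THE TEST STATISTIC
As explained before, the z statistic is equal to:
$$z = \frac{\mu_1 - \mu_2}{\text{SE Diff}}$$
where:
$$\text{SE Diff} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
An interesting feature can be noted from the equation above. Suppose Ivor Whitecoat used several thousand patients to work out a mean. In this case $s_2^2/n_2$ would become so small as to be negligible. Consequently the SE Diff would simply be equal to the estimated standard error of the mean (ESEM) of Egbert's group.
The difference between the two means in this study is:
$$\mu_1 - \mu_2 = 131 - 134 = -3 \text{ mmol/l}$$
Therefore:
$$\text{SE Diff} = \sqrt{\frac{10^2}{100} + \frac{7^2}{100}} = \sqrt{1.49} \approx 1.22$$
$$z = \frac{-3}{1.22} \approx -2.5$$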
When comparing a sample with a large group you only need to know the ESEM of the sample.
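The calculation can be checked with a few lines of code. This is a minimal sketch, assuming the SE Diff is the square root of the sum of each group's s²/n as described in the text; the function names `se_diff` and `z_statistic` are illustrative only.

```python
import math

def se_diff(s1: float, n1: int, s2: float, n2: int) -> float:
    """Standard error of the difference between two independent means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def z_statistic(mean1: float, mean2: float, s1: float, n1: int, s2: float, n2: int) -> float:
    """Difference between the group means expressed in units of SE Diff."""
    return (mean1 - mean2) / se_diff(s1, n1, s2, n2)

# Egbert's data: "Hothead" group first, control group second
se = se_diff(10, 100, 7, 100)
z_calc = z_statistic(131, 134, 10, 100, 7, 100)
print(round(se, 2), round(z_calc, 2))  # 1.22 -2.46

# With a very large control group, SE Diff approaches the ESEM of the sample (10/sqrt(100) = 1)
print(round(se_diff(10, 100, 7, 5000), 2))  # 1.0
```

Carried to two decimal places the statistic is −2.46, which the article rounds to −2.5.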
6 COMPARE THE CALCULATED TEST STATISTIC WITH THE CRITICAL VALUES
The calculated value of −2.5 lies beyond the critical value of −1.96. It therefore falls in the area of rejection of the null hypothesis.
7 EXPRESS THE CHANCES THAT THE NULL HYPOTHESIS IS IN KEEPING WITH THE DATA
The p value is the probability of getting a difference equal to or greater than that found in the experiment (that is, −3), if the null hypothesis was correct.4 As the z value can be negative or positive, there are two ways of getting a value with a magnitude of 2.5. Consequently the p value is represented by the area demarcated by −2.5 to the tip of the left tail plus the area demarcated by +2.5 to the tip of the right tail (fig 2).
Standard normal distribution curve demonstrating two ways of getting a difference with a magnitude of 2.5. 0.0062 of the area under the curve lies between −2.5 to the tip of the left tail. A similar area lies between +2.5 to the tip of the right tail.
From the tables of z statistics, it can be seen that the size of the tail from +2.5 to the right tip is 0.5−0.4938 = 0.0062. The equivalent value in the other half of the distribution curve is the same. The p value is therefore doubled to give a total value of 0.0124. Consequently there is a 1.2% chance that a difference with a magnitude of 3, or larger, could have resulted if the null hypothesis was true.
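The table lookup can be reproduced numerically. The sketch below uses the standard normal distribution in Python's standard library; `two_tailed_p` is an illustrative name, not something defined in the article.

```python
from statistics import NormalDist

def two_tailed_p(z: float) -> float:
    """Area in both tails beyond a z value of the given magnitude."""
    one_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * one_tail

p = two_tailed_p(-2.5)
print(round(p / 2, 4), round(p, 4))  # 0.0062 0.0124
```

Because the magnitude of z is used, the function gives the same answer whichever group is labelled as group 1.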
WHAT IS THE ESTIMATED RANGE FOR THE TRUE DIFFERENCE BETWEEN THE MEANS?
In view of the importance of confidence intervals, Egbert also wants to determine the 95% CI for the difference. From the previous explanation, Egbert knows that 95% of all possible values of the difference between the means will lie within a range 1.96 SE Diff below his experimental difference to 1.96 SE Diff above it:
$$95\%\ \text{CI} = (\mu_1 - \mu_2) \pm (1.96 \times \text{SE Diff})$$
As described above, the difference between the means is 131 − 134 = −3 mmol/l. With SE Diff = √(10²/100 + 7²/100) ≈ 1.22 mmol/l, the 95% confidence intervals of the difference between the two groups are:
$$-3 - (1.96 \times 1.22) \ \text{to} \ -3 + (1.96 \times 1.22) = -5.4 \ \text{to} \ -0.6 \text{ mmol/l}$$
As this range does not include zero, Egbert concludes that the data are not compatible with the null hypothesis being correct. The range is on the side of there being a lower sodium concentration in the patients taking “Hothead” but the differences are small. Therefore, rather than simply presenting a p value, it would be better if Egbert uses the 95% confidence intervals in discussing the clinical relevance of these data.
In summary, Egbert concludes that he can confidently reject the null hypothesis. Difference between the means of the two groups = −3 mmol/l (95% confidence intervals −5.4 to −0.6 mmol/l); p = 0.012.
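The confidence interval can be sketched in code as follows. This is an illustration under the large sample assumptions described above; the function name `ci_difference` and its `level` parameter are hypothetical conveniences.

```python
import math
from statistics import NormalDist

def ci_difference(mean1, s1, n1, mean2, s2, n2, level=0.95):
    """Confidence interval for the difference between two independent means (large samples)."""
    diff = mean1 - mean2
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    # z value enclosing the central `level` of the standard normal curve
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se

low, high = ci_difference(131, 10, 100, 134, 7, 100)
print(f"{low:.1f} to {high:.1f} mmol/l")  # -5.4 to -0.6 mmol/l
```

Both limits are negative, which is the numerical counterpart of the observation that the range does not include zero.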
Unpaired t test
When the sample size is less than 100, the effect of intra-group variation becomes greater. In these cases the normal distribution has to be replaced by the t distribution. Unlike the normal distribution, the shape of the t distribution is dependent upon the size of the group (fig 3). It is always symmetrical but with small sample sizes the curve is flatter and has longer “tails”. However, as the sample size increases, the curve approaches the normal distribution.
t distribution with 10 degrees of freedom and a standard normal distribution. As the degrees of freedom increase the t distribution becomes more like a normal distribution.
Irrespective of which t distribution is chosen, the same principle applies regarding set areas under the curve representing particular probabilities. Consequently the p value for the null hypothesis is derived from the test statistic. However, tables of the t distribution are used rather than the z distribution ones. Likewise, the 95% confidence intervals for the true difference between the means are the experimental mean difference, ± the appropriate t value, multiplied by the SE Diff.
To see how this works, consider the following example. Egbert by chance has a night off and discusses his findings over a romantic meal with Endora Lonely, an emergency physician at St Heartsinc. She is surprised by the topic of conversation but does wonder if the same applies to patients having taken an analogue of “Hothead” called “Brainboil”. She tackles this using the previously described systematic approach but this time considers using an unpaired t test as her study and control groups have only 16 patients in each.
1 STATE THE NULL HYPOTHESIS AND THE ALTERNATIVE HYPOTHESIS OF THE STUDY
These remain unchanged from those used in the larger study. Consequently:
“There is no significant difference between the sodium concentration in the patients ingesting “Brainboil” and those of a similar age attending the emergency department”. In other words they are part of the same population.
This can be summarised to:
mean “Brainboil” [Na+] = mean control [Na+]
The alternative hypothesis is the logical opposite of this, that is:
“There is a significant difference between the sodium concentration in the patients ingesting “Brainboil” and those of a similar age attending the emergency department”.
This can be summarised to:
mean “Brainboil” [Na+] ≠ mean control [Na+]
2 SELECT THE LEVEL OF SIGNIFICANCE
Following convention, Endora picks a significance level of 0.05 for her study.
3 ESTABLISH THE CRITICAL VALUES
This is carried out in the same way as comparing larger groups but this time the t statistic is used because the group size is less than 100. The t table enables you to calculate the probability of lying within the middle 95% of the distribution where the null hypothesis is valid. The tables also take into account the changes in t with variation in sample size.
For mathematical reasons, the degrees of freedom are used rather than the absolute number in each group. In this type of comparison they are equal to the sum of the number in each group minus 1 (that is, (n1 − 1) + (n2 − 1)).
Consequently the degrees of freedom is [16 − 1] + [16 − 1] = 30.
Using the t distribution tables, Endora finds that the t value for a probability of 0.05 with 30 degrees of freedom is 2.042. This is known as the tcrit value. It means that for this study, a t value of ± 2.042 divides the middle 95% of the population accepting the null hypothesis from the two, 2.5% tails of rejection (fig 4).
t distribution curve with 30 degrees of freedom demonstrating two tails. These are demarcated by t(crit) values of −2.042 and 2.042. The area between these values represents those values that are in keeping with the null hypothesis and amount to 0.95 of the area under the curve.
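The tcrit value can also be approximated without a printed table. The sketch below is one possible approach, not the method used in the article: it integrates the t density numerically with Simpson's rule and then searches for the critical value by bisection. All function names are illustrative.

```python
import math

def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution (log-gamma is used to avoid overflow)."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """P(T <= t) by Simpson's rule on [0, |t|], using symmetry about zero."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def t_critical(df: int, alpha: float = 0.05) -> float:
    """Two tailed critical value: the t that leaves alpha/2 in each tail."""
    lo, hi = 0.0, 100.0
    target = 1 - alpha / 2
    for _ in range(60):  # bisection on the monotonic CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = (16 - 1) + (16 - 1)
t_crit = t_critical(df)
print(df, round(t_crit, 3))  # 30 2.042
```

In practice a statistics package would normally be used for this lookup; the hand-written integration is shown only to keep the example free of external dependencies.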
The t statistic from the study (tcalc) can now be determined.
4 SELECT THE GROUPS AND CALCULATE THEIR MEANS
After carrying out her study with “Brainboil” she finds the mean presenting sodium concentrations to be 130 mmol/l (s=10 mmol/l). Dr Rooba Tube tells her the mean sodium concentration in a control group of patients at St Heartsinc is 135 mmol/l (s=10 mmol/l).
5 CHOOSE AND CALCULATE THE TEST STATISTIC
Unpaired t test
This test requires the data to have two properties:
- The two groups need to come from populations with normal distributions.
It is sometimes possible to determine if the distribution is skewed simply by checking the raw data or looking at a distribution curve. A more sophisticated way is to use a computer to develop a “normal plot” of the data. This programme manipulates the data so that, if they are normally distributed, they form a straight line. The closer the data comply with this line, the closer they are to being normally distributed. Altman gives a good description for those who are interested in getting further information about this type of manipulation.5 It is also possible to assess the same thing mathematically using various probability tests. However, these are of little additional benefit when samples are smaller than 30 and the normal probability plot gives a reasonably straight line.
- The groups should be from populations with the same standard deviation.
It is not possible to know the standard deviations of the populations but an estimation comes from using the s value calculated from each group. The closer these values are, the more likely it is that the populations have similar standard deviations. As with distributions, it is possible to formally assess the difference in group variance by carrying out an F test (see appendix).
Endora checks her datasets for both groups and is satisfied that these pre-conditions are met. She therefore proceeds with the analysis.
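To illustrate the idea behind a normal plot, the sketch below pairs each sorted observation with the z score expected for its rank and then measures how closely the pairs follow a straight line. The sodium values are invented purely for illustration and are not Endora's data; the plotting position (i − 0.5)/n is one common convention and is an assumption here, as are the function names.

```python
import math
from statistics import NormalDist, mean

def normal_plot_points(values):
    """Pair each sorted observation with its expected normal score."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(scores, ordered))

def straightness(points):
    """Pearson correlation of the plotted points (values near 1 indicate a straight line)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sodium concentrations (mmol/l), made up for this example only
hypothetical_sodium = [118, 122, 125, 127, 128, 129, 130, 130,
                       131, 132, 133, 135, 136, 138, 141, 145]
points = normal_plot_points(hypothetical_sodium)
r = straightness(points)
print(f"correlation with a straight line: {r:.3f}")
```

Plotting the pairs gives the normal plot itself; the correlation is only a rough numerical summary of how straight that plot is and does not replace inspecting the data.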
Key points 2 Requirements for unpaired t tests
- The groups need to have normal distributions
- The groups need to have similar standard deviations
Calculate the t statistic
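The t statistic for comparing two unpaired groups is:
$$t = \frac{\mu_1 - \mu_2}{\text{SE Diff}}$$
where:
μ1 = the mean for group 1
μ2 = the mean for group 2
When using the t test for comparing unpaired means the SE Diff is derived from the following formula:
$$\text{SE Diff} = \sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$
Where:
n1 and n2 are the number of subjects in the two groups.
s² represents the pooled variance of the two groups:
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
From the data Endora calculates:
$$s^2 = \frac{(15 \times 10^2) + (15 \times 10^2)}{30} = 100$$
Therefore:
$$\text{SE Diff} = \sqrt{100 \times \left( \frac{1}{16} + \frac{1}{16} \right)} = \sqrt{12.5} \approx 3.5$$
Therefore:
$$t = \frac{130 - 135}{3.5} \approx -1.43$$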
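The pooled calculation can be sketched in code. This assumes the usual pooled variance formula, in which each group's variance is weighted by its degrees of freedom; the function names are illustrative. Note that without intermediate rounding the t statistic comes out at about −1.41, whereas the article's value of −1.43 corresponds to rounding the SE Diff to 3.5 before dividing.

```python
import math

def pooled_variance(s1: float, n1: int, s2: float, n2: int) -> float:
    """Weighted average of the two group variances (weights are degrees of freedom)."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

def unpaired_t(mean1, s1, n1, mean2, s2, n2):
    """Return the t statistic and SE Diff for two independent groups."""
    s_sq = pooled_variance(s1, n1, s2, n2)
    se = math.sqrt(s_sq * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, se

# Endora's data: "Brainboil" group first, control group second
t_calc, se = unpaired_t(130, 10, 16, 135, 10, 16)
print(round(se, 2), round(t_calc, 2))  # 3.54 -1.41
```

Either value lies well inside the tcrit limits of ± 2.042, so the conclusion does not depend on the rounding.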
6 COMPARE THE CALCULATED TEST STATISTIC WITH THE CRITICAL VALUES
The calculated value of −1.43 lies within the area of acceptance of the null hypothesis (that is, between the tcrit values of ± 2.042).
7 EXPRESS THE CHANCES THAT THE NULL HYPOTHESIS IS IN KEEPING WITH THE DATA
As the t value can be negative or positive, there are two ways of getting a difference with a magnitude of 1.43. Consequently the p value is represented by the area demarcated by −1.43 to the tip of the left tail plus the area demarcated by +1.43 to the tip of the right tail.
To determine this p value, Endora needs to consult the t distribution table using the appropriate degrees of freedom (table 2). In this case the degrees of freedom equals two less than the total number of subjects in both groups (that is, 30). The table indicates that the size of these two tails is greater than 0.1. Consequently there is over a 10% chance that a difference of 5 mmol/l, or larger could have resulted if the null hypothesis was true.
Extract of the table of the t-statistic values. The first column lists the degrees of freedom (df). The headings of the other columns give probabilities for t to lie within the two tails of the distribution.
The null hypothesis is therefore accepted and Endora can conclude that the sodium concentrations in patients taking “Brainboil” are not significantly different from the general population.
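Where a table gives only a range for the p value, a numerical estimate can be obtained by integrating the t density directly. The sketch below is one way of doing this with Simpson's rule and is offered as an illustration rather than the article's method; the function names are hypothetical.

```python
import math

def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Area in both tails beyond |t|, found by integrating the density from 0 to |t|."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # area between 0 and |t|
    return 1 - 2 * central_half

p = two_tailed_p(-1.43, 30)
print(f"two tailed p = {p:.2f}")
```

The result is greater than 0.1, in agreement with the range read from the t distribution table.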
What is the estimated range for the true difference between the means?
Similar to Egbert, Endora would also like to know the 95% confidence intervals for the difference. The SE Diff is used in the usual way to determine this:
$$95\%\ \text{CI} = (\mu_1 - \mu_2) \pm (t_0 \times \text{SE Diff})$$
Where t0 is the t statistic appropriate for the required CI for the true difference between the means.
To find the t value representing the middle 95% of the distribution curve, Endora needs to look down the column representing the outer 5% (0.05 probability) of the curve because this identifies the extreme point of the middle section. The t value in the column representing p = 0.05 at 30 degrees of freedom is 2.042. This is the number of SE Diff above and below the mean that cover the middle 95% of the distribution curve.
Therefore, with a difference between the means of 130 − 135 = −5 mmol/l and an SE Diff of approximately 3.5 mmol/l:
$$95\%\ \text{CI} = -5 \pm (2.042 \times 3.5) = -12.1 \ \text{to} \ 2.1 \text{ mmol/l}$$
As this includes zero, the null hypothesis is valid at the 5% level. The confidence intervals are wide indicating that the study lacks precision, possibly due to the small sample size.1
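The interval can be sketched in code using the tabulated t value of 2.042 for 30 degrees of freedom. The function name and its `t_crit` parameter are illustrative. Without intermediate rounding the limits are about −12.2 to 2.2 mmol/l, marginally wider than the figures obtained if the SE Diff is first rounded to 3.5.

```python
import math

def t_confidence_interval(mean1, s1, n1, mean2, s2, n2, t_crit):
    """CI for the difference between two independent means using the pooled variance."""
    pooled = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se

# t_crit = 2.042 is the tabulated value for p = 0.05 at 30 degrees of freedom
low, high = t_confidence_interval(130, 10, 16, 135, 10, 16, t_crit=2.042)
print(f"{low:.1f} to {high:.1f} mmol/l")  # -12.2 to 2.2 mmol/l
```

The lower limit is negative and the upper limit positive, so the interval includes zero.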
Common mistakes made in using the z and t test
ONE AND TWO TAILED TEST
When using the null hypothesis in comparing two groups, you are determining what is the probability that they are from the same population. p Values of less than 0.05 are usually taken as the point where the null hypothesis can be rejected. However, this result does not tell you how they are different. Indeed when you set up your initial study you commonly do not know which group will produce the biggest result.
To demonstrate this consider a trial comparing the analgesic effect of a new non-steroidal anti-inflammatory drug (NSAID) with ibuprofen. Before carrying out this trial you may hope that the new drug is better than the old one but you cannot be sure; it could be worse. Your study, and statistical analysis, should therefore be able to detect three possibilities:
- The new NSAID is no different from ibuprofen (that is, supporting the null hypothesis)
- The new NSAID is better than ibuprofen (that is, rejects the null hypothesis)
- The new NSAID is worse than ibuprofen (that is, rejects the null hypothesis)
The latter two displacements are referred to as the “tails” or “sides”. Consequently the tests measuring the probability of the null hypothesis being valid in one or both situations are called “one” and “two” tailed (sided) tests respectively.
When using a “one tailed test” there is only one area of rejection on the random sampling distribution of the means. Consequently the area of rejection is concentrated on one of the tails. This reduces the critical value for t and so makes the p value smaller and more impressive for any given difference between means.
For example, with 30 degrees of freedom the tcrit for a one tailed test at the 0.05 level is −1.697, compared with ± 2.042 for the two tailed test. A t statistic lying between these values, for example −1.8, would fall in the area of rejection of the null hypothesis using a one tailed test (p < 0.05) but in the area of acceptance using a two tailed test (p > 0.05) (fig 5). Endora's t statistic of −1.43 lies in the area of acceptance with either test, although its one tailed p value (about 0.08) is half its two tailed p value (about 0.16).
t distribution curve with 30 degrees of freedom demonstrating one and two areas of rejection at the 0.05 level. The two tailed test has t(crit) = ± 2.042 and the one tailed test has t(crit) = −1.697.
You may therefore be tempted to use a one tailed test in an analysis. Beware though; it is rare that the direction of displacement can be predicted before the study. Furthermore, in comparison to the standard treatment a worse result by the new treatment is also clinically important. Consequently two tailed tests should be routinely used when comparing two groups. If a decision is made to carry out a one tailed test then the rationale must be clearly described. An example of such a case is in public health work when you want to determine if a product does not fall below a particular standard (for example, water purity). Here you are not concerned with how pure the water is, just as long as it is better than a preset level.
The decision to use a one tailed test must depend upon the nature of the hypothesis being tested and should therefore be decided upon before the data are collected.
As a rule of thumb, when comparing two groups always use a two tailed test for the null hypothesis.
The concept of one tailed and two tailed tests does not apply when more than two groups are being compared.
USING A COMPUTER WITHOUT CONSIDERING THE DATA AND THE QUESTION BEING ASKED
Computer software can greatly help you in working out the long calculations described above. This leads to the temptation of doing the analysis on your own and not seeking statistical help when necessary. If you do so, make sure that the appropriate test is carried out so that the correct calculations and distribution tables are used. Applying the wrong test will still produce an answer, but it will be meaningless!
ASSUMPTIONS OF NORMALITY AND SIMILAR VARIANCE
The t test is able to deal with all but major deviations from normality or uniform variance between the groups. The main problems occur when dealing with small datasets with a skewed distribution. In these cases the t statistic does not comply with the t distribution curve. A possible solution in these cases is to see if transforming the data can make them have a normal distribution and uniform variance.6 Data that still fail to achieve these prerequisites cannot be analysed using the t test. Instead a non-parametric method will have to be used.
MULTIPLE COMPARISONS
This article has concentrated on comparing two groups. In medical research, however, we may be faced with having to compare three or more groups. If this problem is tackled using two group comparisons for every possible pair of combinations then we run the risk of finding some “significant” differences simply by chance. This probability gets bigger as the number of pairs increases. Therefore for multiple comparisons, t tests should not be used. Instead a different test, known as an analysis of variance, is required.
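The growth of this risk can be illustrated with a short calculation. The sketch below makes the simplifying assumption that the pairwise comparisons are independent, which is not strictly true when pairs share a group, so the figures should be read as rough approximations. The function name is illustrative.

```python
from math import comb

def chance_of_false_positive(groups: int, alpha: float = 0.05):
    """Number of pairwise comparisons and the approximate chance that at least
    one is 'significant' by chance alone (assumes independent comparisons)."""
    pairs = comb(groups, 2)
    return pairs, 1 - (1 - alpha) ** pairs

for g in (3, 4, 5):
    pairs, risk = chance_of_false_positive(g)
    print(g, pairs, round(risk, 2))
# 3 3 0.14
# 4 6 0.26
# 5 10 0.4
```

Even with only five groups, the chance of at least one spurious “significant” result under these assumptions is around 40%.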
Summary
When carrying out medical studies we often have to compare two unpaired (independent) groups. This is best carried out using a systematic approach in which the null hypothesis, the levels of significance and the critical levels are decided upon before the experiment is started. Once the data have been collected the appropriate statistical test is chosen and calculated. The likelihood of the null hypothesis being valid can then be determined along with the confidence intervals for the difference between the means of the two groups.
When comparing large samples (100 or greater), the z statistic can be used. However, with smaller groups the assumptions made in its calculation are no longer valid. In these circumstances the t statistic should be calculated, provided the data are normally distributed and the two groups have similar standard deviations.
Quiz
- Why are the confidence intervals for unpaired comparisons usually greater than those for paired comparisons?
- What are the requirements of the data if an unpaired t test is to be carried out?
- Allison et al carried out an experiment to compare the capillary leakage in trauma victims resuscitated with either hydroxyethyl starch (n = 24) or gelatine (n = 21).7 State the null hypothesis and alternative hypothesis of the study. Assuming a level of significance of 0.05, what would be the critical values (tcrit) of the study if an unpaired t test was carried out?
- The following question is adapted from the study by Boyd et al comparing two activated charcoal preparations.8 They found the mean (s) amount of charcoal drunk was 26.5 (13.3) g for Carbomix and 19.5 (13.7) g for Actidose-Aqua. The sample size was 47 for the Carbomix group and 50 for the Actidose-Aqua group. Assuming the prerequisites for carrying out an unpaired, two tailed t test are valid, calculate the t statistic.
- One for you to try on your own. Steele et al carried out a study comparing two types of rewarming of hypothermic patients.9 Part of the data are adapted and shown in table 3.
Comparison of age and admission temperatures in two groups of patients selected for Steele et al's study9
Carry out an unpaired, two tailed t test comparing these datasets in the two groups. Is there a significant difference between the groups for the two variables?
Answers
- When using paired data the confidence interval is smaller because you have removed the variability between the subjects and are solely comparing the inter-group difference.
- The groups need to have normal distributions and similar standard deviations.
- The null hypothesis is:
“There is no significant difference in the capillary leakage of trauma victims resuscitated with hydroxyethyl starch or gelatine”.
The alternative hypothesis is:
“There is a significant difference in the capillary leakage of trauma victims resuscitated with hydroxyethyl starch or gelatine”.
To determine the critical value the degrees of freedom need to be calculated first. In this study this is equal to [24 − 1] + [21 − 1] = 43.
Using the t distribution tables, the tcrit value for a probability of 0.05 with 43 degrees of freedom is ±2.017 if a two tailed test is being used.
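- The t statistic for comparing two unpaired groups is:
$$t = \frac{\mu_1 - \mu_2}{\text{SE Diff}}$$
where:
μ1 = the mean for Carbomix
μ2 = the mean for Actidose-Aqua
and:
$$\text{SE Diff} = \sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}, \qquad s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
From the data of Boyd et al:
$$s^2 = \frac{(46 \times 13.3^2) + (49 \times 13.7^2)}{95} \approx 182.5$$
Therefore:
$$\text{SE Diff} = \sqrt{182.5 \times \left( \frac{1}{47} + \frac{1}{50} \right)} \approx 2.74$$
Therefore:
$$t = \frac{26.5 - 19.5}{2.74} \approx 2.55$$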
Appendix
The F test is the ratio of the larger variance/smaller variance. The resulting F value is then used to determine the probability that the two variances are from the same population (that is, that the null hypothesis is correct). This is done by reading the F tables using the F value and the appropriate degrees of freedom. There are two of these: (n1 − 1) for the group with the larger variance and (n2 − 1) for the group with the smaller variance, where n1 and n2 are the numbers in the respective groups. If the null hypothesis is not consistent with the data (that is, the “F” statistic is significant), the t test should not be used.
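The F ratio itself is simple to compute. The sketch below uses the charcoal data from the quiz as an example; F tables are conventionally entered with separate numerator and denominator degrees of freedom, so the function (whose name is illustrative) returns both. Looking up the probability is left to tables or a statistics package.

```python
def f_ratio(s1: float, n1: int, s2: float, n2: int):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    var1, var2 = s1 ** 2, s2 ** 2
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1

# Carbomix: s = 13.3, n = 47; Actidose-Aqua: s = 13.7, n = 50
f_value, df_num, df_den = f_ratio(13.3, 47, 13.7, 50)
print(round(f_value, 3), df_num, df_den)  # 1.061 49 46
```

An F value close to 1, as here, indicates that the two sample variances are very similar.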
Further reading
- Altman D. Theoretical distributions. In: Practical statistics for medical research. London: Chapman and Hall, 1991:48–73.
- Bland M. An introduction to medical statistics. Oxford: Oxford University Press, 1987.
- Gaddis G, Gaddis M. Introduction to biostatistics: Part 4, statistical inference techniques in hypothesis testing. Ann Emerg Med 1990;19:820–5.
- Gardner M, Altman D. Calculating confidence intervals for means and their differences. In: Statistics with confidence. London: BMJ Publications, 1989:20–7.
- Glaser A. Hypothesis testing. In: High yield biostatistics. Baltimore: Williams and Wilkins, 1995:31–46.
- Koosis D. Difference between means. In: Statistics: a self teaching guide. 4th ed. New York: Wiley, 1997:127–52.
- Swinscow T. The t test. In: Statistics at square one. London: BMJ Publications, 1983:33–42.
Acknowledgments
The authors would like to thank Jim Wardrope and Iram Butt for their invaluable suggestions.