
Saturday, 15 October 2016

P-values bad: confidence intervals good.

"... the primary product of a research inquiry is one or more measures of effect size, not p values." Jacob Cohen, 1990

P-values are Statistical Hypothesis Inference Testing

When my academic mentor and department head Professor Rory O'Connor asked me if I had ever done any modelling, the only surprise was that nobody had suggested it before. I muttered something about not having the right sort of chin and being a little short for the catwalk. He listened to me with his usual fortitude and then sent me to 'Introduction to Modelling', a series of lectures which is part of the MSc in Epidemiology and Biostatistics at my university.

A common theme in the statistical lectures that I have attended over the last few years has been the widespread misuse, poor reporting and misinterpretation of statistics. The p-value is a particularly relevant and high-profile example of this, and it is responsible for false conclusions in at least 30% of research output, more if the study is underpowered.

The aim of this blog is to explain what the p-value does, why it is unhelpful for judging the outcomes of research and why researchers should report effect sizes and confidence intervals as well. Effect sizes and confidence intervals are easily calculated, and clinicians should look for them when appraising the outcome of research studies. The results of research papers should include the data necessary for calculating them, so I have included the necessary formulas for those occasions where the authors have been forgetful. 

I finish with some examples of papers in which the effect sizes and their confidence intervals contradict the conclusions of the authors.


What does the p-value tell us?

The p-value simply tells us what the chances are of getting a result at least as large as the one presented in the results if there is no effect of the intervention - if the null hypothesis is true.
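One way to make this definition concrete is a permutation test: if the intervention does nothing, the group labels are arbitrary, so we can shuffle them many times and count how often chance alone produces a difference at least as large as the one observed. The sketch below is only an illustration; the function name and the scores are made up and do not come from any study discussed here.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels carry no information,
        # so any split of the pooled scores is as plausible as the real one.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical outcome scores, invented for illustration only.
intervention = [14, 15, 13, 16, 12, 15, 14, 13]
control = [16, 17, 15, 18, 16, 14, 17, 15]

p = permutation_p_value(intervention, control)
print(f"p = {p:.3f}")
```

Notice that the result is a statement about chance under the null hypothesis only; nothing in the calculation measures how big the difference is.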

It says nothing about the size of the effect of the intervention or whether the findings can be generalised from the sample to the wider population.
It is difficult to see how the p-value has become such a powerful influence on the conclusions drawn from research, but it is important that clinicians recognise that interpretation of research findings should include examination of effect sizes and their confidence intervals. Researchers should always include these within their results and draw their conclusions from them. Generally, think of the p-value as 
"something that should be outlawed" ~ Professor Mark Gilthorpe

Effect sizes and Confidence Intervals


Effect size

An effect size is exactly what it says: a quantification of the change in scores of outcome measures or difference between groups. There are different types of effect size e.g. the risk ratio, odds ratio or correlation coefficient (r). They are standardised and can compare changes in the scores of different outcomes. 

The difference between two experimental groups is commonly expressed as an effect size known as Cohen's d. Cohen's d is based on the groups' mean scores and standard deviations, which are always calculated when analysing results in any case, so it is a simple matter to proceed to calculating the effect size.

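And it's a simple calculation:

d = (M1 - M2) / SDpooled

where M1 and M2 are the mean scores of the two groups and SDpooled is their pooled standard deviation.
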
The pooled standard deviation is:
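Several variants of the pooled standard deviation are in use. A common one, given here as an assumption rather than as the only correct form, weights each group's variance by its degrees of freedom, where SD_1, SD_2 are the group standard deviations and n_1, n_2 the group sizes:

```latex
SD_{pooled} = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}
```

When the two groups are the same size this reduces to the simpler form $\sqrt{(SD_1^2 + SD_2^2)/2}$.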



Because we know the characteristics of standard deviations (assuming the scores are roughly normally distributed) we know that an effect size of 0.6 indicates that 73% of the control group is now below the average person in the intervention group, up from 50% of course.
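As a minimal sketch of these calculations, the snippet below computes Cohen's d from summary statistics and converts it into the percentage of the control group falling below the average person in the intervention group. It assumes normally distributed scores and the degrees-of-freedom-weighted pooled standard deviation; the function names and the example figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation, weighting each variance by its degrees of freedom."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference between group means in units of the pooled standard deviation."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def share_of_controls_below(d):
    """Proportion of the control group scoring below the intervention group's mean,
    assuming both groups are normally distributed with equal spread."""
    return NormalDist().cdf(d)


# Hypothetical summary statistics: intervention group first, control group second.
d = cohens_d(mean1=56.0, sd1=10.0, n1=20, mean2=50.0, sd2=10.0, n2=20)
print(f"d = {d:.2f}")                                 # d = 0.60
print(f"{share_of_controls_below(d) * 100:.0f}%")     # 73%
```

The same function gives roughly 75% for an effect size of 0.66.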

Cohen suggested a rule of thumb for effect sizes: below 0.2, the effect is trivial or non-existent; around 0.2, the effect is small; around 0.5, the effect is moderate; and from 0.8 upwards, the effect size is large. Note that the effect size can exceed 1.



What is a confidence interval and how do we interpret it?

An experiment or research study, and its associated observations, is carried out on one sample drawn from an entire population. A confidence interval gives a range of values which we are 95% confident contains the true population score of interest (e.g. the mean or effect size).


Why 95% confident? Well, imagine we took 100 samples and conducted the experiment and observations, and calculated means, standard deviations, effect sizes and 95% confidence intervals, on each of them. We would expect five of the confidence intervals not to contain the true population mean/effect size. We would expect each of the other 95 confidence intervals to contain, somewhere in the range of values, the true population score. It is most likely close to the calculated mean/effect size and less likely at the extreme boundaries of the confidence interval. But if the confidence interval crosses the value that indicates no effect, then we must report that the intervention showed no effect. Here is a great interactive visualisation of confidence intervals which demonstrates this.
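The repeated-sampling idea can be checked with a small simulation. The sketch below is illustrative only: it draws many samples from a made-up normal population with a known mean, builds a 95% confidence interval for the mean from each one (using the 1.96 normal approximation), and reports what fraction of the intervals contain the true value. The population parameters, sample size and function names are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev


def ci_contains_true_mean(rng, true_mean, true_sd, n):
    """Draw one sample and check whether its 95% CI for the mean covers the truth."""
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    m = mean(sample)
    se = stdev(sample) / sqrt(n)  # standard error of the sample mean
    return m - 1.96 * se <= true_mean <= m + 1.96 * se


def coverage(n_studies, true_mean=50.0, true_sd=10.0, n=50, seed=1):
    """Fraction of simulated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = sum(ci_contains_true_mean(rng, true_mean, true_sd, n)
               for _ in range(n_studies))
    return hits / n_studies


print(coverage(100))    # 100 imagined studies: close to, but rarely exactly, 0.95
print(coverage(2000))   # more studies: the fraction settles near 0.95
```

With only 100 studies the count of misses will bounce around five; the long-run proportion is what the "95%" refers to.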

It's a little more tricky to calculate confidence intervals, but if we've progressed this far it is only lazy and rather self-defeating not to proceed. It's pretty straightforward when you have pulled the relevant figures from the papers.

First you have to calculate the standard error (SE):
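Several formulas for the standard error of an effect size are in circulation. One widely used large-sample approximation for Cohen's d, given here as an assumption, needs only the effect size d and the two group sizes n_1 and n_2:

```latex
SE = \sqrt{\frac{n_1 + n_2}{n_1 \, n_2} + \frac{d^2}{2\,(n_1 + n_2)}}
```

The first term dominates for modest effect sizes, which is why small groups give large standard errors almost regardless of d.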




Then it's putting it all together:

95% CI = ES ± 1.96 × SE

where ES is the effect size calculated earlier, CI is confidence interval and SE is the standard error also calculated earlier. The figure of 1.96 is the number of standard errors either side of the estimate that spans 95% of a normal distribution.

An effect size should therefore be reported like this: "the effect size was 0.48 (95% CI: -0.12, 1.08)". 
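The two steps can be sketched in a few lines. This assumes the common large-sample approximation for the standard error of Cohen's d, SE = sqrt((n1 + n2) / (n1 × n2) + d² / (2 × (n1 + n2))); the group sizes below are hypothetical, chosen only so that the output resembles the reporting example above.

```python
from math import sqrt


def se_cohens_d(d, n1, n2):
    """Approximate standard error of Cohen's d (large-sample formula, an assumption)."""
    return sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))


def ci95(es, se):
    """95% confidence interval: effect size plus or minus 1.96 standard errors."""
    half_width = 1.96 * se
    return es - half_width, es + half_width


# Hypothetical study: effect size 0.48 with 22 participants in each group.
es = 0.48
low, high = ci95(es, se_cohens_d(es, n1=22, n2=22))
print(f"the effect size was {es:.2f} (95% CI: {low:.2f}, {high:.2f})")
# the effect size was 0.48 (95% CI: -0.12, 1.08)
```

Because the interval includes zero, this hypothetical result would be read as showing no effect, however promising 0.48 looks on its own.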

An example of how effect sizes and confidence intervals change our interpretation of research findings

I will illustrate how calculating and examining effect sizes and confidence intervals can change the interpretation of research findings that have relied only on p-values, using RCTs included in our recently-published systematic review of motor skill interventions for children with developmental coordination disorder.

Although we set out to include only high quality randomised controlled trials (RCTs), by including only those which scored 7/11 or more on the PEDro scale, we found a number of problems with the studies, including that not all of them had calculated effect sizes, and none of them had calculated confidence intervals.

In order to more effectively evaluate the benefits (or not) of the interventions being investigated by each RCT we calculated the effect sizes and confidence intervals from the data within each paper. This caused us to interpret the findings of some RCTs quite differently to the authors.

Does aquatic therapy improve motor skills in children with developmental coordination disorder?

Hillier et al investigated whether aquatic therapy was beneficial for the motor skills of children with developmental coordination disorder, and stated in the abstract:


"Analysis of covariance indicated that at posttest, mean scores on the Movement Assessment Battery were higher for children who received aquatic therapy compared to those on the wait-list (p = 0.057)."
Their conclusion states that "Aquatic therapy was a feasible intervention for children with developmental coordination disorder and may be effective in improving their gross motor skills (my emphasis)". (page 22, but behind a pay wall).

Leaving aside that the ANCOVA was statistically non-significant, and so differences in mean scores are meaningless, the abstract implies that the intervention - aquatic therapy - had clinically meaningful benefits for children with developmental coordination disorder. Our effect size of 0.66 indicated a moderate effect - that 75% of the control group would be below the average person in the aquatic group. However, the lower bound of our 95% confidence interval reached below an effect size of 0.2 (to -0.5, in fact), suggesting that there was no effect at all, and certainly disagreeing with their conclusion.



Does table tennis improve motor skills in children with developmental coordination disorder?

Tsai investigated whether a table-tennis training programme resulted in benefits on motor skills of children with developmental coordination disorder and reported in the abstract a "significant improvement of cognitive and motor functions for the children with DCD". In their conclusion they stated that the children's motor outcomes were "significantly enhanced". 


Our findings contrasted sharply with this conclusion. The ANOVA did show that there were statistically-significant differences between groups across time (p = 0.001) but the authors did not perform any post hoc testing to evaluate where these differences were, listing only the differences in change scores between groups. These were, for the intervention group, a change from 17.69 (SD 4.26) improving to 13.38 (SD 2.75) and, for the comparison group, from 18.64 (SD 4.80) improving to 17.57 (SD 4.06). The effect size was 0.95 (a large effect size), but the lower boundary of the 95% confidence interval was 0.15. This suggests that in the wider population the effect of table tennis on the motor skills of children with developmental coordination disorder may be trivial.


Comparative effectiveness of Pilates and yoga group exercise interventions for chronic mechanical neck pain: quasi-randomised parallel controlled study

Finally, an example from a recent Physiotherapy Journal paper. This paper concludes that reductions in disability (the primary outcome) were significant following Pilates and yoga group exercise interventions. It is great that they also calculated effect sizes - it is far more meaningful than looking at changes in the raw scores, for which they have calculated confidence intervals, but they did not calculate confidence intervals for the effect sizes. 

It is wrong to calculate means, standard deviations and confidence intervals for changes in raw scores of their outcome measure (the Neck Disability Index), as this measure produces ordinal outcome scores on which arithmetic should not be performed. But I have gone with it, and used their figures to calculate the confidence intervals associated with the effect size. I will also overlook other methodological problems.


Power calculations indicated that 90 participants were needed but only 56 completed the study: this small sample leads to an expectation of a large standard error (SE), and therefore wide confidence intervals. Using the formulas above we find that the reported effect size of 1 (large) does indeed have a wide upper and lower boundary so that the effect size is 1.0 (95% CI: -1.67, 3.67). This suggests that Pilates has no effect on chronic mechanical neck pain as measured by the Neck Disability Index.

I could have calculated the effect size and confidence intervals using the ordinal outcome scores, if I had access to the full outcome data. For non-parametric data, the effect size is Cliff's d, in which one compares each of the scores in one group to each of the scores in the other group, and keeps count of the number of comparisons in which the score in the first group is greater than the score in the other group, and the number in which the score in the first group is lower. This should produce two numbers which, for two groups of 10 with no tied scores, total 100 (i.e. 10 x 10). Cliff's d is then the number of comparisons in which group 1 is greater than group 2 minus the number in which group 1 is lower than group 2, divided by the product of the numbers in each group:

Cliff's d = (number of pairs with group 1 > group 2 - number of pairs with group 1 < group 2) / (n1 x n2)

Cliff's d is bounded from -1 to 1, where d = 0 means that there is no effect.
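The pairwise counting is easy to sketch in code. The function below follows the description above directly; its name and the example scores are hypothetical and are not data from the Pilates and yoga study.

```python
def cliffs_delta(group1, group2):
    """Cliff's d: (pairs where group1 > group2 minus pairs where group1 < group2)
    divided by the total number of pairs. Tied pairs count towards neither side."""
    greater = sum(1 for x in group1 for y in group2 if x > y)
    lower = sum(1 for x in group1 for y in group2 if x < y)
    return (greater - lower) / (len(group1) * len(group2))


# Hypothetical ordinal scores for two small groups (lower = less disability).
pilates = [4, 6, 5, 7, 3]
control = [8, 9, 6, 10, 7]

print(cliffs_delta(pilates, control))   # -0.84
```

Here 1 of the 25 comparisons favours the first group being higher, 22 favour it being lower and 2 are ties, giving (1 - 22) / 25 = -0.84.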

Summing up

There are plenty of other things that can impact on the integrity, interpretation and generalisation of research findings but the p-value has taken on a mythical and invincible might.

Do not look at the p-value to determine whether the study shows a significant effect of the intervention. Check the effect size - this gives a good indication of the clinical effect - and look for confidence intervals to evaluate whether this clinical effect could be generalised to the wider population.