
Saturday 15 October 2016

P-values bad: confidence intervals good.

"... the primary product of a research inquiry is one or more measures of effect size, not p values." Jacob Cohen, 1990

P-values are Statistical Hypothesis Inference Testing

When my academic mentor and department head Professor Rory O'Connor asked me if I had ever done any modelling, the only surprise was that nobody had suggested it before. I muttered something about not having the right sort of chin and being a little short for the catwalk. He listened to me with his usual fortitude and then sent me to 'Introduction to Modelling', a series of lectures which form part of the MSc in Epidemiology and Biostatistics at my university.

A common theme in the statistical lectures I have attended over the last few years has been the widespread misuse, poor reporting and misinterpretation of statistics. The p-value is a particularly relevant and high-profile example of this, and it has been estimated to be responsible for false conclusions in at least 30% of research output, and more when the study is underpowered.

The aim of this blog is to explain what the p-value does, why it is unhelpful for judging the outcomes of research and why researchers should report effect sizes and confidence intervals as well. Effect sizes and confidence intervals are easily calculated, and clinicians should look for them when appraising the outcome of research studies. The results of research papers should include the data necessary for calculating them, so I have also included the formulas for those occasions where the authors have been forgetful.

I finish with some examples of papers in which the effect sizes and their confidence intervals contradict the conclusions of the authors.


What does the p-value tell us?

The p-value simply tells us the probability of getting a result at least as large as the one presented in the results if the intervention has no effect - that is, if the null hypothesis is true.
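To make this concrete, here is a minimal sketch of the idea behind a p-value (a permutation test in Python, with invented scores - none of these numbers come from any study discussed here): we repeatedly shuffle the group labels, which forces the null hypothesis of no intervention effect to be true, and count how often a difference at least as large as the observed one turns up.

```python
import random

# Hypothetical outcome scores for two small groups (purely illustrative numbers)
intervention = [24, 27, 22, 30, 26, 29]
control = [21, 23, 20, 25, 22, 24]

observed_diff = sum(intervention) / len(intervention) - sum(control) / len(control)

pooled = intervention + control
n_int = len(intervention)
n_permutations = 10_000
count_as_extreme = 0

random.seed(1)
for _ in range(n_permutations):
    random.shuffle(pooled)  # relabelling scores at random is what "no intervention effect" looks like
    diff = (sum(pooled[:n_int]) / n_int
            - sum(pooled[n_int:]) / (len(pooled) - n_int))
    if abs(diff) >= abs(observed_diff):  # a difference at least as large as the one we observed
        count_as_extreme += 1

p_value = count_as_extreme / n_permutations  # the chance of such a difference if the null is true
print(f"Observed difference: {observed_diff:.2f}, permutation p-value: {p_value:.3f}")
```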

It says nothing about the size of the effect of the intervention or whether the findings can be generalised from the sample to the wider population.
It is difficult to see how the p-value has become such a powerful influence on the conclusions drawn from research, but it is important that clinicians recognise that interpretation of research findings should include examination of effect sizes and their confidence intervals. Researchers should always include these within their results and draw their conclusions from them. Generally, think of the p-value as
"something that should be outlawed" ~ Professor Mark Gilthorpe

Effect sizes and Confidence Intervals


Effect size

An effect size is exactly what it says: a quantification of the change in scores of outcome measures, or of the difference between groups. There are different types of effect size, e.g. the risk ratio, odds ratio or correlation coefficient (r). Standardised effect sizes allow changes in the scores of different outcome measures to be compared.

The difference between two experimental groups is commonly expressed using an effect size known as Cohen's d. Cohen's d is based on the groups' mean scores and standard deviations, which are calculated when analysing the results in any case, so it is a simple matter to proceed to the effect size.

And it's a simple calculation:

\[ d = \frac{\bar{X}_1 - \bar{X}_2}{SD_{pooled}} \]

where \(\bar{X}_1\) and \(\bar{X}_2\) are the two group means. The pooled standard deviation is:

\[ SD_{pooled} = \sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}} \]

Because the effect size is expressed in standard deviation units, and we know the shape of the normal distribution, an effect size of 0.6 indicates that 73% of the control group is now below the average person in the intervention group, up from 50% of course.

Cohen suggested a rule of thumb for interpreting effect sizes: below 0.2, the effect is trivial or non-existent; around 0.2, it is small; around 0.5, moderate; and at 0.8 or above, large. Note that the effect size can exceed 1.
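As a minimal sketch (in Python; the function and the numbers are mine for illustration, not taken from any of the papers discussed below), this is all the calculation amounts to:

```python
from math import sqrt

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Cohen's d: the difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt(((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2))
    return (mean_1 - mean_2) / pooled_sd

# Illustrative figures only: an intervention group versus a control group
d = cohens_d(mean_1=26.3, sd_1=4.1, n_1=20, mean_2=23.8, sd_2=4.4, n_2=20)
print(f"Cohen's d = {d:.2f}")  # about 0.59 with these made-up numbers: a moderate effect
```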



What is a confidence interval and how do we interpret it?

An experiment or research study, and its associated observations, is based on a single sample drawn from an entire population. A confidence interval gives a range of values which we are 95% confident contains the true population value of interest (e.g. the mean or effect size).


Why 95% confident? Well, imagine we took 100 samples and conducted the experiment and observations, and calculated means, standard deviations, effect sizes and 95% confidence intervals on each of them. We would expect five of the confidence intervals not to contain the true population mean/effect size. We would expect each of the other 95 confidence intervals to contain, somewhere in their range of values, the true population value - most likely close to the calculated mean/effect size and less likely at the extreme boundaries of the interval. But if the confidence interval crosses the value that indicates no effect, then we cannot claim that the intervention showed an effect. Here is a great interactive visualisation of confidence intervals which demonstrates this.
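For readers who prefer to see this rather than take it on trust, here is a small simulation sketch (in Python, with an invented population whose true mean we set ourselves - it is not data from any study): we draw 100 samples, build a 95% confidence interval for the mean from each, and count how many intervals miss the true mean. On average, about five will.

```python
import random
from math import sqrt

random.seed(2)
true_mean, true_sd, n, n_experiments = 50.0, 10.0, 30, 100
misses = 0

for _ in range(n_experiments):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    mean = sum(sample) / n
    sd = sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    se = sd / sqrt(n)  # standard error of the sample mean
    lower, upper = mean - 1.96 * se, mean + 1.96 * se
    if not (lower <= true_mean <= upper):  # did this interval miss the true population mean?
        misses += 1

print(f"{misses} of {n_experiments} intervals missed the true mean")  # expect roughly 5
```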

It's a little more tricky to calculate confidence intervals, but if we've progressed this far it is only lazy and rather self-defeating not to proceed. It's pretty straightforward when you have pulled the relevant figures from the papers.

First you have to calculate the standard error (SE). A commonly used approximation for the standard error of Cohen's d is:

\[ SE = \sqrt{\frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}} \]

where \(n_1\) and \(n_2\) are the group sizes and \(d\) is the effect size.

Then it's a matter of putting it all together:

\[ 95\% \ CI = ES \pm 1.96 \times SE \]

where ES is the effect size calculated earlier, CI is the confidence interval and SE is the standard error also calculated earlier. The figure of 1.96 is the number of standard errors either side of the estimate that contains 95% of the sampling distribution.

An effect size should therefore be reported like this: "the effect size was 0.48 (95% CI: -0.12, 1.08)". 
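Putting the formulas together, here is a minimal sketch (in Python; the standard error is the common large-sample approximation given above, and the sample sizes of 22 per group are illustrative - they happen to reproduce the interval in the example sentence):

```python
from math import sqrt

def effect_size_ci(d, n_1, n_2):
    """95% confidence interval for Cohen's d, using a common large-sample approximation to its SE."""
    se = sqrt((n_1 + n_2) / (n_1 * n_2) + d ** 2 / (2 * (n_1 + n_2)))
    return d - 1.96 * se, d + 1.96 * se

lower, upper = effect_size_ci(d=0.48, n_1=22, n_2=22)
print(f"effect size 0.48 (95% CI: {lower:.2f}, {upper:.2f})")  # prints -0.12 and 1.08
```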

An example of how effect sizes and confidence intervals change our interpretation of research findings

I will illustrate how calculating and examining effect sizes and confidence intervals can change the interpretation of research findings that have relied only on p-values, using RCTs included in our recently-published systematic review of motor skill interventions for children with developmental coordination disorder.

Although we set out to include only high quality randomised controlled trials (RCTs), by including only those which scored 7/11 or more on the PEDro scale, we found a number of problems with the studies, including that not all of them had calculated effect sizes and none of them had calculated confidence intervals.

In order to more effectively evaluate the benefits (or not) of the interventions being investigated by each RCT, we calculated the effect sizes and confidence intervals from the data within each paper. This caused us to interpret the findings of some RCTs quite differently to the authors.

Does aquatic therapy improve motor skills in children with developmental coordination disorder?

Hillier et al investigated whether aquatic therapy was beneficial for the motor skills of children with developmental coordination disorder, and state in the abstract:


"Analysis of covariance indicated that at posttest, mean scores on the Movement Assessment Battery were higher for children who received aquatic therapy compared to those on the wait-list (p = 0.057)."
Their conclusion states that "Aquatic therapy was a feasible intervention for children with developmental coordination disorder and may be effective in improving their gross motor skills" (my emphasis; page 22, but behind a paywall).

Leaving aside that the analysis of covariance was statistically non-significant, and so differences in mean scores are meaningless, the abstract implies that the intervention - aquatic therapy - had clinically meaningful benefits for children with developmental coordination disorder. Our effect size of 0.66 indicated a moderate effect - that 75% of the control group would be below the average person in the aquatic group. However, the lower bound of our 95% confidence interval reached below an effect size of 0.2 (to -0.5, in fact), suggesting that there may be no effect at all, and certainly disagreeing with their conclusion.



Does table tennis training improve motor skills in children with developmental coordination disorder?

Tsai investigated whether a table-tennis training programme benefited the motor skills of children with developmental coordination disorder, and reported in the abstract a "significant improvement of cognitive and motor functions for the children with DCD". In the conclusion they stated that the children's motor outcomes were "significantly enhanced".


Our findings contrasted sharply with this conclusion. The ANOVA did show statistically-significant differences between groups across time (p = 0.001), but they did not perform any post hoc testing to evaluate where these differences were, listing only each group's scores before and after the intervention. The intervention group improved from 17.69 (SD 4.26) to 13.38 (SD 2.75), and the control group from 18.64 (SD 4.80) to 17.57 (SD 4.06). The effect size was 0.95 (a large effect), but the lower boundary of the 95% confidence interval was 0.15. This suggests that in the wider population the effect of table tennis on the motor skills of children with developmental coordination disorder could be trivial.


Comparative effectiveness of Pilates and yoga group exercise interventions for chronic mechanical neck pain: quasi-randomised parallel controlled study

Finally, an example from a recent Physiotherapy Journal paper. This paper concludes that reductions in disability (the primary outcome) were significant following Pilates and yoga group exercise interventions. It is great that they also calculated effect sizes - far more meaningful than looking at changes in the raw scores, for which they did calculate confidence intervals - but they did not calculate confidence intervals for the effect sizes.

It is wrong to calculate means, standard deviations and confidence intervals for changes in the raw scores of their outcome measure (the Neck Disability Index), as it produces ordinal scores on which arithmetic should not be performed. But I have gone along with it, and used their figures to calculate the confidence interval associated with the effect size. I will also overlook other methodological problems.


Power calculations indicated 90 participants, but only 56 completed the study: this small sample leads us to expect a large standard error (SE), and therefore wide confidence intervals. Using the formulas above, we find that the reported effect size of 1 (large) does indeed have wide boundaries: the effect size is 1.0 (95% CI: -1.67, 3.67). This suggests that we cannot conclude that Pilates has an effect on chronic mechanical neck pain as measured by the Neck Disability Index.

I could have calculated the effect size and confidence intervals using the ordinal outcome scores, if I had access to the full outcome data. For non-parametric data, the effect size is Cliff's d, in which one compares each of the scores in one group to each of the scores in the other group, keeping count of the number of pairs in which the score in the first group is greater than the score in the other group, and the number of pairs in which it is lower. For two groups of 10 this means 100 pairwise comparisons (i.e. 10 x 10). Cliff's d is then the number of pairs in which group 1 scores higher than group 2, minus the number of pairs in which group 1 scores lower than group 2, divided by the product of the group sizes:

\[ d = \frac{\#(x_1 > x_2) - \#(x_1 < x_2)}{n_1 n_2} \]

Cliff's d is bounded from -1 to 1, where d = 0 means that there is no effect.
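Here is a minimal sketch of that counting procedure (in Python; the function and the two sets of scores are invented for illustration, not data from the paper):

```python
def cliffs_d(group_1, group_2):
    """Cliff's d: (pairs where group 1 scores higher than group 2, minus pairs where it
    scores lower) divided by the total number of pairs."""
    greater = sum(1 for x in group_1 for y in group_2 if x > y)
    lower = sum(1 for x in group_1 for y in group_2 if x < y)
    return (greater - lower) / (len(group_1) * len(group_2))

# Invented ordinal scores for two groups of 10 (10 x 10 = 100 pairwise comparisons)
group_1 = [3, 4, 4, 5, 5, 6, 6, 7, 7, 8]
group_2 = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]
print(f"Cliff's d = {cliffs_d(group_1, group_2):.2f}")
```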

Summing up

There are plenty of other things that can impact on the integrity, interpretation and generalisation of research findings, but the p-value has taken on a mythical and invincible might.

Do not look at the p-value to determine whether the study shows a significant effect of the intervention. Check the effect size - this gives a good indication of the size of the clinical effect - and look for confidence intervals to evaluate whether this clinical effect can be generalised to the wider population.

5 comments:

  1. Great blog Nick, well done.
    Any chance you could take a swing at the false conditional in a follow up? That is, the p value tells us about the probability given the null hypothesis is true (no difference) but this doesn't tell us about the probability given the null hypothesis is false (there is a difference), or the more common situation where we can't assume if there is or isn't a difference between the groups a priori because we don't know.
    (eg https://www.ma.utexas.edu/users/mks/statmistakes/misinterppvalues.html)

    1. Hi Rod, thank you very much, I appreciate your kind words.

      Interesting question. I have a feeling that this long response is not what you might mean. I plan to explore conditional probabilities (Bayesian statistics) in future but I need to be much more certain about my grasp of the concepts for inference testing. I am very much a novice here, and am currently focusing on General Linear Modelling.

      As far as null hypothesis testing goes, defining the p-value as the probability that there is no difference between groups is asking for trouble; there will always be a between-groups difference and there will always be a within-groups difference (across time), whether there is an experimental effect or not. The questions are: is this difference large enough for clinical and functional benefits, and could the result be due to chance? The p-value tells us the probability of seeing a result this large through sampling variation alone if there is no real effect and, depending on what value we are willing to accept, it stops us making ourselves look foolish (by making a Type-1 error). For this reason, some propose adjusting the alpha - the threshold p-value, defined a priori, below which we accept that the results are unlikely to be due to chance - less conservatively, depending on what effect size we expect to see.

      For example, we are putting together a motor skills programme to improve the motor skills of children with developmental coordination disorder. This movement disorder has a profound effect on children's motor, social, emotional and academic development and it is very common - perhaps six children per classroom in the UK. The motor skills programme is based on the results of high quality RCTs which showed huge effects in children with developmental coordination disorder - huge. Before we conduct an RCT (to evaluate the programme's efficacy when delivered in schools) we will conduct a feasibility study and evaluate changes in children's motor skills. If these results show another huge effect size, it would be absurd to design the RCT with the emphasis on not making a Type-1 error, i.e. by setting the alpha at 0.05 or, worse, 0.01. We have good reason to believe that we will see an effect of the programme, so our biggest concern should be making a Type-2 error - not finding a result when in fact there really is one. For this reason, we should set alpha at 0.1 and beta - the a priori risk of making a Type-2 error - at 0.1 (i.e. power the study at 90%).

      By the time we get to this stage, I might be more familiar with Bayesian statistics and find that they could be more appropriate.

      Thanks again for your kind comments. If you ever want to contribute to the blog please let me know and I will be pleased to publish - it's already proving a challenge to find the time.

      Best wishes
      Nick

    2. Thanks for the reply Nick.
      I think Bayes is one good way to go on this, but more centrally to the whole null hypothesis thing is this false conditional probability that grew out of an approach in maths that goes something like "assume 1 + 1 is not = 2, now if we assume that is true, then the following would be true:..." At some stage you then end up with something that cannot be true (like, say, 3=4) so you can then say, "therefore since we know that 3 does not = 4 our initial assumption was not true, and so 1+1 must =2". In maths this is typically fine as we're only ever talking about True/False outcomes. There's a nice example that the Pythagoreans used to prove that the square root of two must be irrational, and the guy who proved it got thrown in the ocean (to his death) for his trouble.
      When we're trying to find out if a motor development program helps kids more than usual care, we end up with a couple of scores and "a p value that assumes there was no difference between the two". Problem is, this is not necessarily the correct assumption, in my experience it rarely is, so the probability generated is not valid, and would've been different if we started with the assumption that say, usual care was better, or we didn't know which one was better, or we thought usual care was 10% worse, etc. Unfortunately we can never get the infinite number of trials with the infinite number of subjects in each group past an ethics committee. For me this is the big one that pulls the rug out from under NHST.
      Back to your original thought and marrying Bayesian approach, this might be a reasonable common ground if you could start with a probability. For your programme maybe you think that there's a 5% chance it's better than usual care, or it's a 50:50 bet.
      Then after your trial you can generate the post-test chance that it is better, given a p value of 0.05, and this would come out to be about 11% and 71% respectively.
      It's much better explained here:
      http://www.nature.com/news/scientific-method-statistical-errors-1.14700
      but hopefully underlines the point that pre-test chances are critical, and therefore starting out with "assume there's no difference" really influences how you interpret the p-value.
      Thanks again for a great blog and discussing this stuff.

  2. Really good read and food for thought!

    P.s. I think there's a slight mistake in your Cliff's d formula figure: it shows the number of scores in group 1 that are greater than scores in group 2 minus that same term again, instead of the number of scores in group 1 that are lower than scores in group 2, as in your text.

    1. Thank you, I appreciate your kind words very much.

      And you are right, thanks for pointing it out. I have amended the figure, but it's very careless of me and I am annoyed with myself.

      Best wishes
      Nick
