Review: The Paper Formerly Known as ‘Voodoo Correlations in Social Neuroscience’

In this review I revisit the ‘Voodoo Correlations in Social Neuroscience’ paper, which has since been renamed ‘Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition‘ and which is linked here in draft format.

There are several reasons to revisit the paper. First of all, there have been some amendments, although the central argument remains. Secondly, a paper can be revisited several times, and the reader may come away with slightly different interpretations each time, for instance when a subtle nuance has been picked up. Thirdly, there have been a few developments, including several responses to this paper and another paper looking at statistical methods in imaging studies. So it makes sense to revisit the paper at this time. The text here will be integrated into the original review so that arguments from the ‘blogosphere’ can be added to the analysis. The review here is highly structured, with an analysis for each point, which will make the above integration slightly easier, with each section acting as a reference point.

2.A. Reliability estimates.

In this section Vul and colleagues examine some assumptions that are central to the remainder of the argument. They focus on reliability. There are several types of reliability; Vul et al choose to focus on test-retest reliability for both psychometric measurements and for BOLD (blood oxygenation level dependent) measurements. They select a study on the five factors of personality in which the authors concluded that reliability ranges from 0.73 to 0.78. They also select a study from 2005 in which the upper limit of reliability in BOLD responses was 0.7.

Discussion: The focus on psychometric measurement in general is both a strength and a weakness of the Vul paper. The obvious criticism of Vul’s approach is that there are simply too many psychological constructs to make such generalisations useful or valid. What it does do, however, is focus the argument, which is just what is needed to discuss the central issues. Another point to note is that the authors have chosen to ignore other types of reliability which are just as important, including inter-rater reliability. If the scales or tests are administered by researchers rather than being self-administered, then we might see a lot of variability in the results depending on factors such as the training of the raters themselves. Another useful consideration is the distinction between state and trait: within a short period of time, changes in a person’s state may obscure the underlying trait as measured by the rating instrument. Additionally, test-retest reliability should not always be close to 1, as we might expect these values to change within the individual over time. This means that there is always a trade-off between reliability and validity in psychological rating instruments that examine constructs we would expect to change with time. With regard to the BOLD responses, there is very little evidence to support the given reliability score of 0.7.
In the study that is referenced, multiple trials are undertaken, which may lead to fatigue and variability in the BOLD response (although this is an assumption, it is not unreasonable to suppose there might be some degree of habituation in blood flow to a given cerebral region with repetitive task activity and increased activity in that region). A test-retest reliability value for BOLD responses is much harder to justify than one for psychological constructs, as there are simply too many variables interacting with blood flow from one minute to the next. While cerebral blood flow is very tightly regulated, there is also variability in blood pressure, heart rate and heart rate variability, all of which may influence cerebral blood flow and which are in turn rapidly modified by a variety of physiological and psychological factors. Additionally, the physiological state may activate several brain regions, which will in turn influence regional cerebral blood flow. Despite all of this, the above discussion actually supports Vul’s central argument, as we might not expect particularly high test-retest reliability values.
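For readers unfamiliar with the measure under discussion, test-retest reliability is typically estimated as the Pearson correlation between scores from two administrations of the same instrument. A minimal sketch (the scores here are invented purely for illustration):

```python
import numpy as np

# Hypothetical scores for six participants on two administrations
# of the same rating instrument (values invented for illustration).
session1 = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0])
session2 = np.array([13.0, 14.0, 10.0, 19.0, 15.0, 10.0])

# Test-retest reliability as the Pearson correlation between sessions.
reliability = np.corrcoef(session1, session2)[0, 1]
print(round(reliability, 2))  # 0.96
```

A value near 1 indicates stable measurement; as noted above, for constructs expected to change over time a somewhat lower value may be appropriate.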

B. The puzzle

The previous section leads on to the central premise of this section, namely that by combining the above reliability values an upper limit can be established on the expected correlation between measurements of the psychological construct and the BOLD response.
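The logic can be made concrete with the classical attenuation formula from psychometrics: an observed correlation between two measures cannot exceed the geometric mean of their reliabilities. A minimal sketch using round figures close to the reliabilities cited in the paper:

```python
import math

def max_observable_correlation(rel_x: float, rel_y: float) -> float:
    """Classical attenuation ceiling: the highest correlation two
    measures can exhibit, given their test-retest reliabilities."""
    return math.sqrt(rel_x * rel_y)

# Approximate reliabilities discussed above: ~0.8 for personality
# measures and ~0.7 for BOLD responses.
ceiling = max_observable_correlation(0.8, 0.7)
print(round(ceiling, 2))  # 0.75
```

Reported correlations above this ceiling are therefore what the authors call puzzling.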

Discussion: This seems like a very reasonable argument.

C. Meta analysis methods. In this section Vul and colleagues look at the literature and also undertake a survey of people working in this field.

1. Literature review. The authors want to answer the question ‘How common are correlations higher than the expected upper limits and how are these calculated?’. They use the keyword fMRI together with a list of “social terms”. From this initial search they select papers that report across-subject correlations between task performance and BOLD activity. They identified 55 articles with 274 correlations. These are then presented in the form of a histogram in figure one in the paper.

Discussion: This seems to be one of the weaker sections in the paper. The authors have effectively included several types of study/experimental design within the same paper. If they are doing this in order to answer a question then it seems reasonable. However, in the process the methodology for the literature review, which is quite important for the purposes of reproducibility, has been briefly and incompletely described. They mention only a few terms which are combined to identify relevant papers and give us no indication of the other terms that were used. The more usual convention in a literature review is to provide a complete list of search terms together with the relevant search years and the database used. The criteria for filtering the studies are then usually described. Even with a sophisticated search strategy there would be significant limitations, and the authors might produce more effective results were they to limit their search to a very specific domain within the fMRI field. However, in terms of the higher goal of generating debate about research methodology, such an approach would have a more limited appeal. Thus there appears to be a trade-off between the impact of the paper and the validity of the results.

2. Elements of fMRI analysis

Vul and colleagues give a clear account of the analysis that takes place in current fMRI studies. As there is almost continuous activity throughout the brain, they explain how contrasts are chosen to examine the activity within voxels. Thus voxel activity is compared between different tasks, or else examined within a specific anatomical region. The authors argue that many studies do not make their method of analysis clear.

Discussion: The argument for clarity in the methodology is entirely sensible and consistent with good scientific principles, such that the results may be reproduced by other groups. Indeed, the absence of a clearly defined statistical analysis within the methods section raises questions about why this was omitted.

3. Survey method

The authors then undertook a survey of various investigators in the field. They selected the authors of the papers that they had identified from the literature review. They constructed a questionnaire which focused on the research methodology. This was sent to the researchers, and a subsequent set of questions was composed and sent to the researchers after their first response. They obtained 53 responses, which compares favourably with the 55 articles they had identified. Two participants didn’t respond.

Discussion: Here there are further methodological problems. Although most of the questions have binary responses, there were a few questions which are open-ended. Furthermore, there were a number of questions which were used on an ad hoc basis after the initial responses were received. Essentially the researchers are conducting a qualitative study in addition to the above. No pilot of this questionnaire is described within the study. As this is a qualitative study, there should be a qualitative approach to the analysis. This may well have taken place, but the authors make no mention of it and instead provide us with quantitative results in the form of pie charts. At first glance it might seem as though the surveyed investigators would use one method or another, and this would be well handled by the questions with dichotomous responses. However, the data analysis is often quite complex, and perhaps a qualitative approach would capture some of this complexity. By choosing dichotomous responses and presenting these in the analysis to the exclusion of the responses to the other questions, we perhaps don’t get the full picture. There is a further difficulty here. As the methodology for the literature review is not clear, any bias in the selection of papers is carried forwards into this part of the study, as we don’t quite know the characteristics of the sample population. Additionally, it is not clear whether one or several of the authors on each paper were contacted. If more than one author was contacted, then the response rate would be quite low. If, on the other hand, it was solely the lead author on each paper, then this is an extraordinarily high response rate, and we are not told how the authors obtained it.
In effect this means that a highly selected group of authors was chosen for the study, with the selection criteria remaining less than clear, and with the selected authors being relatively constrained in one part of the questionnaire and then able to give more flexible responses in the other part, which is not reported. One final point to note here is that the lead author may not have been the one to have carried out the statistical analysis. While it might not always be the case, some of the analysis might have been undertaken by other members of the group. There were no questions asking whether the participant completing the questionnaire had personally undertaken this part of the analysis. The lead author may be the person creating the study design, scanning the participants and doing the write-up. A simple question would have clarified this.

D. Survey results

The authors describe the quantitative aspects of the results in the form of a rather ingenious set of pie charts, each of which corresponds to a different part of the analysis.

Discussion: The results are not tabulated, and we do not see whether there are any significant differences between the different types of analysis undertaken. Similarly, the qualitative results are not reported on at all. From the pie charts we can see, however, that there is a roughly even distribution amongst most of the methods used in the analysis. The authors also include a diagram showing the results of a simulation they have run, which represents voxel activity against performance on the task, with both being represented by noise, modelled as a Gaussian distribution with a mean of zero. At first glance this seems rather convincing. However, on closer inspection this appears to be an additional type of experiment included in the paper. Looked at from this perspective, it is at once obvious that there should again be a clearly delineated methodology section for this experiment. It also becomes apparent that there is not enough information to interpret the corresponding graphs and the authors’ conclusions. For instance, the authors state that Gaussian noise is used. This involves generating random data points that fit a Gaussian distribution with a mean of zero. However, random numbers cannot be generated out of ‘thin air’; there is usually a systematic method for doing so, which invariably involves the use of a random seed. We do not know which method the researchers have used for this. Furthermore, we are not given a number of important variables, such as time. For instance, over how many virtual seconds was each simulation run, and how was time represented within the simulations? There are, however, more critical points. How have the authors selected the threshold value? What they say here is that a statistical threshold is used for this purpose; they are presumably correlating this with the “behavioural measure”.
Why is this latter measure modelled as Gaussian noise when this is not the case in the real-world experimental paradigm? By not properly defining the methodology for the simulation, the authors may have provided us with nothing more than a theoretical exercise in statistics, in which case the simulations were entirely unnecessary.
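To make the seed issue and the simulated effect concrete, here is a minimal sketch (not the authors’ actual code; all dimensions and thresholds are illustrative) of a pure-noise simulation with an explicit seed, the very detail noted above as missing:

```python
import numpy as np

rng = np.random.default_rng(42)  # explicit seed, so the run is reproducible

n_subjects, n_voxels = 16, 10_000
behaviour = rng.standard_normal(n_subjects)           # pure noise
voxels = rng.standard_normal((n_voxels, n_subjects))  # pure noise

# Correlate every voxel's 'activity' with the behavioural measure.
bz = (behaviour - behaviour.mean()) / behaviour.std()
vz = (voxels - voxels.mean(axis=1, keepdims=True)) / voxels.std(axis=1, keepdims=True)
r = vz @ bz / n_subjects

# Non-independent step: keep only voxels passing a statistical threshold,
# then report the average correlation within that selected set.
selected = np.abs(r)[np.abs(r) > 0.6]
print(f"{selected.size} voxels pass; mean |r| among them = {selected.mean():.2f}")
```

Even though no true relationship exists anywhere in the data, the selected subset necessarily shows a high average correlation, because the same data were used both to select the voxels and to report the statistic.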

E. The non-independence error

In this section the authors argue that if suitable thresholds are found and corresponding voxels selected during a task then there is in effect a selection bias for the chosen voxels.

Discussion: Again it would have been interesting to have seen the full results of the survey in order to examine the ‘other’ methods the participants had used to analyse the data.

F. Results and discussion

1. Are the correlation values reported in this literature meaningful?

The authors argue that if there is a non-independence error then the correlation will be overestimated or indeed spurious.

Discussion: This is a sweeping statement. The authors have already identified a wide range of different approaches to analysing the data within the studies as can be seen from the pie charts. Additionally we do not have access to the qualitative information which would help us to decide if the investigators are using other methods of analysis.

2. Is the problem being discussed here anything different than the well known problem of multiple comparisons raising the probability of false alarms?

The authors raise the objection that since there are large numbers of voxels being used in the analysis there is an opportunity for false positives on the basis of numbers alone. They criticise one of the standard methods for controlling for multiple comparisons which uses a p-value based on a cited study. However the authors argue that this value has been misinterpreted and the studies that used the relevant p-value may have produced spurious results. They suggest that larger cluster sizes should be considered for significance calculations.
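The scale of the multiple-comparisons problem is easy to illustrate with a back-of-envelope calculation (the voxel count here is illustrative, and the independence assumption is a simplification, since real voxels are spatially correlated):

```python
n_voxels = 50_000
alpha = 0.001  # a commonly used uncorrected per-voxel threshold

# Expected number of purely spurious 'active' voxels under the null,
# assuming independent voxels.
expected_false_positives = n_voxels * alpha
print(expected_false_positives)  # 50.0

# A Bonferroni correction controls the family-wise error rate instead,
# at the cost of a far stricter per-voxel threshold.
bonferroni_alpha = 0.05 / n_voxels
print(bonferroni_alpha)  # 1e-06
```

The tension between these two thresholds is one reason cluster-size criteria, as the authors suggest, are attractive.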

Discussion: The assertion here is more serious and requires further exploration.

3. What may be inferred from the scattergrams often exhibited in connection with non-independence analyses?

The authors raise a similar objection to the scattergrams that are used in publications to illustrate correlations. They argue that selecting only those voxels firing above the threshold would suggest a correlation where none exists.

Discussion: The argument here is essentially the same as that raised in an earlier section regarding spurious or exaggerated correlations due to what they refer to as a non-independence error.

4. How can the same method sometimes produce no correlations?

The authors suggest that within experiments different types of statistical analysis will be undertaken according to the researchers’ expectations. Thus, within the same study, they argue that artifacts will be produced by the non-independence error for some of the tests, while at other points in the analysis this will be absent.

Discussion: This is a testable hypothesis. Since the authors have the data, they should be able to undertake an analysis of a subsample of the selected papers where the data is available in order to test this hypothesis.

5. But is there really any viable alternative to doing these non-independence analyses?

The authors recommend two methods for analysing the data. The first is what they refer to as a split-half analysis. This means half of the voxel activity is analysed without prior knowledge of the activity on the task. They argue this would remove bias. The second method involves identifying correlations using half of the data. This is used to generate hypotheses which are then tested on the remaining half of the data.
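The second suggestion amounts to cross-validation: voxels are selected on one half of the data and the correlation is then estimated on the untouched half. A minimal sketch on null data (dimensions and thresholds invented for illustration), showing that the held-out estimate is not inflated:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 16, 5_000
behaviour = rng.standard_normal(n_subjects)
voxels = rng.standard_normal((n_voxels, n_subjects))  # null data: no real effect

def corr_with(b, v):
    """Pearson correlation of each voxel row in v with behaviour b."""
    bz = (b - b.mean()) / b.std()
    vz = (v - v.mean(axis=-1, keepdims=True)) / v.std(axis=-1, keepdims=True)
    return vz @ bz / b.size

half = n_subjects // 2
# Step 1: select voxels using only the first half of the subjects.
r_train = corr_with(behaviour[:half], voxels[:, :half])
chosen = np.abs(r_train) > 0.6

# Step 2: estimate the correlation on the held-out half only.
r_test = corr_with(behaviour[half:], voxels[chosen][:, half:])
print(f"selected {chosen.sum()} voxels; held-out mean r = {r_test.mean():.2f}")
```

Because the held-out subjects played no part in the selection, their mean correlation hovers around zero on null data, in contrast to the inflated within-sample estimate.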

Discussion: These suggestions seem on the surface entirely reasonable although there have been criticisms of these suggestions.

6. Even if correlations were overestimated due to non-independence analyses, can’t we at least be sure the correlations are statistically significant (and thus that there exists a real nonzero correlation)?

The authors again raise the previous points to suggest that the correlations may not be valid.

Discussion: The objections here would need to be raised against the earlier points.

7. Well in those cases where the correlation really is significant (i.e. non-zero), isn’t that what matters anyway? Does the actual correlation value really matter so much?

The authors argue that the value of the correlation is important. Correlations with a small p-value may not provide much useful data or may not be present in another sample.

Discussion: This is an entirely reasonable argument that is taken as read in other research areas.

G. Concluding remarks

The authors reiterate the points from earlier in the article. They also make the suggestion that datasets should be reanalysed using the authors’ recommendations.

Discussion: Perhaps datasets should be openly available as in the Alzheimer’s Disease Neuroimaging Initiative so that independent groups may analyse the data also.

H. References

I. Table one: non-independence (red), no response as of August 1, 2008 (orange), independent (green)

J. Appendix survey questionnaire

K. Appendix two. Most papers use cluster size, not just a high threshold, to capture correlations. Does the correlation-inflation problem still exist in this case?

In conclusion, the authors have produced a controversial paper, particularly because of its history and the surrounding exposure it has received. However, the paper is clearly presented, which benefits the reader. Nevertheless, the authors describe several different types of study within this paper, including a literature review, a survey which requires a qualitative analysis (i.e. an analysis of text in this case) and a computer simulation. In all three areas the methodology is incompletely described. An important debate has begun and appears to have maintained momentum. Even this redraft has made it to Newsweek and a number of other high-profile publications.




The comments made here represent the opinions of the author and do not represent the profession or any body/organisation. The comments made here are not meant as a source of medical advice and those seeking medical advice are advised to consult with their own doctor. The author is not responsible for the contents of any external sites that are linked to in this blog.


  1. I think the real problem in fMRI research is not nonindependence (a huge problem in its own right), but people doing whole-brain analyses without proper correction for multiple comparisons. Even established methods are deeply flawed and produce spurious results; Vul et al is just the tip of the iceberg. See by the way Tal Yarkoni’s commentary on their paper in the same issue. And by the way, where were the editors of Nature and Science in all this? Did they really think all this was real? Neural correlates of social rejection??? Shame on them.

