‘Voodoo Correlations in Social Neuroscience’



The featured paper is ‘Voodoo Correlations in Social Neuroscience’ by Edward Vul and colleagues freely available here. I wanted to have a closer look at this paper, given the recent interest it has generated.

On reading through just the abstract, I was quite surprised by how provocative this paper is. Here is a sentence from the abstract:

‘The implausibly high correlations are all the more puzzling because social-neuroscience method sections rarely contain sufficient detail to ascertain how these correlations were obtained’

Abstracts for neuroscience articles I have come across are usually described in an objective manner and yet the words ‘puzzling’ and ‘implausibly’ in the description above lend the abstract a subjective quality. The stakes rise even further on reading through the first few paragraphs of the introduction when a number of prominent journals and the ‘lavish attention from the popular press’ are mentioned. In the introduction the authors mention the mind-brain divide and the empirical methods that have been used in neuroimaging to try and bridge this gap. Having reached this stage, the authors go one step further focusing on the work of individual scientists.

There is then an interesting and clearly argued point about the high correlations reported between psychometric measurements and imaging data. The authors question correlations as high as 0.96, arguing that the observed value should depend not only on the correlation between traits and imaging data but also on the reliability of the psychometric measures. They then quote some evidence for reliability in the MMPI before concluding ‘in general, therefore, a range of .7 – .8 would seem to be a somewhat optimistic estimate for the smaller and more ad hoc scales’. Given the abundance of psychometric scales measuring a wide range of different properties, I don’t think it’s possible to generalise to all psychometric scales; instead these figures should be used as a starting point with acknowledgement of their limitations. The authors argue that reliability data on BOLD responses are limited before going on to cite some studies and giving a starting point of 0.7. They then calculate an upper limit on the correlations of 0.74 based on their estimates above. However, given the argument above, the figures should perhaps be based on analysis of specific areas of investigation and on the reliability of the psychometric scales used in that domain, as there will most likely be a great deal of heterogeneity.
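As I understand it, the upper limit comes from the standard attenuation relationship, in which an observed correlation cannot exceed the square root of the product of the two reliabilities. Here is a minimal sketch of that arithmetic using the estimates of 0.8 and 0.7 discussed above; the function name and the default "true" correlation of 1.0 are my own choices for illustration and are not taken from the paper.

```python
import math

def max_observable_correlation(reliability_x, reliability_y, true_correlation=1.0):
    """Ceiling on an observed correlation given the reliability of each measure.

    Uses the classical attenuation formula:
    observed = true * sqrt(reliability_x * reliability_y).
    """
    return true_correlation * math.sqrt(reliability_x * reliability_y)

# Optimistic psychometric reliability (0.8) and BOLD reliability (0.7)
ceiling = max_observable_correlation(0.8, 0.7)
print(f"{ceiling:.3f}")  # 0.748
```

The result is close to the figure of 0.74 quoted above. Substituting the reliabilities of a specific scale would give a different ceiling, which is why I think domain-specific estimates matter.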

They then argue that the ‘exact methods were simply not made clear in the typically brief and sometimes opaque method sections’. The evidence and methodology the authors used to arrive at that conclusion are unclear.

The authors then discuss how they have conducted a study of the neuroimaging researchers themselves – a curious meta-study of sorts! The authors identified ‘social neuroscience’ articles reporting correlations between evoked BOLD activity and ‘behavioral measures of individual differences in personality, emotionality, social behavior, and related domains (generally excluding psychopathological symptoms, however)’. The case for excluding psychopathological symptoms is not made, even though the psychometric measures used for such symptoms are those used in large pharmaceutical trials as well as clinical practice, where highly reliable measures are needed.

There are further problems I have with the methodology. For instance there are a number of unusual omissions which make it difficult to reproduce these results. The databases which were searched were not identified. The search terms used were described thus:

‘e.g., ”jealousy”, ”altruism”, ”personality”, ”grief”, etc’

In other papers I have seen, the combination terms are described in full. In this particular case there are so many possible terms that it is difficult to see how the authors could not have been selective. Consider that over 50 types of emotions have been described, and that is an underestimate in just one psychological domain. Additionally for the term fMRI a search of the PubMed database on 19.1.9 revealed 225490 articles and 31475 reviews, although the returned articles were not exclusive to ‘social neuroscience’. The authors later comment that ‘it should be emphasized that we do not suppose this literature review to be exhaustive’. Thus the skewed distribution observed in the graph of frequency of papers versus absolute correlation may result from selection bias, although given the lack of detail in the methodology section it is difficult to better characterise this bias.

In keeping with the authors’ clarity of writing in the introduction, they then go on to explain fMRI analysis. This is a very nice description of the process, identifying general principles of the analysis while at the same time identifying idiosyncratic approaches that negate principles of generalisation. In this section the authors equip the reader with valuable resources for analysing imaging studies.

The authors then describe the survey of authors whose papers they had identified from their selection process. They wanted to understand which methods the researchers had used to identify voxels of interest – anatomical, functional or both. They then identify a ‘regression across subjects’ as being used by 54% of researchers. If I understand this correctly, this means that voxels were selected whose activity correlated above a threshold with a certain behaviour/phenomenological experience across subjects, and the correlation was then reported for those same voxels. The authors argue that this method can produce signals out of noise as a result of selecting those voxels whose activity tracks the behaviour/phenomenological experience of interest. The point is further driven home with an analogy to weather readings and stocks! The authors refer to this phenomenon as the ‘non-independence error’ although it could equally be called selection bias.
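To convince myself of how signals can emerge out of noise, here is a small sketch of the general idea. It is not the authors' simulation: the number of subjects, number of voxels, threshold and seed are all values I have assumed for illustration, and both the "voxel activity" and the "behavioural measure" are pure Gaussian noise.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlations_of_selected_voxels(n_subjects=16, n_voxels=5000,
                                    threshold=0.6, seed=0):
    """Select noise voxels by their correlation with a noise 'behaviour'
    score, then return the correlations of the selected voxels only."""
    rng = random.Random(seed)  # assumed seed, for reproducibility
    behaviour = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    selected = []
    for _ in range(n_voxels):
        voxel = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        r = pearson(voxel, behaviour)
        if r > threshold:  # selection uses the same data as the report
            selected.append(r)
    return selected

selected = correlations_of_selected_voxels()
mean_r = sum(selected) / len(selected)
print(f"{len(selected)} voxels selected, mean correlation {mean_r:.2f}")
```

Every voxel here is unrelated to the behavioural score by construction, yet the mean correlation among the selected voxels must exceed the threshold, because the threshold was the criterion for selecting them.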

The authors then make some suggestions including a ‘split-half analysis’, whereby half of the data is used to generate predictions which are then tested on the remaining half of the data, as well as what I refer to as ‘blinding’, whereby the voxel analysis proceeds independently of the behavioural/phenomenological data and only afterwards are the two correlated.

The authors round up by remarking that

‘we are led to conclude that a disturbingly large and quite prominent, segment of social neuroscience research is using seriously defective research methods and producing a profusion of numbers that should not be believed’.

Are there any properties of this paper which the authors themselves have criticised in the other studies?

In my opinion, there were a few interesting points in this paper which I interpreted.

1. The methodology is opaque – in particular the method of identifying relevant papers. The authors have criticised a number of the imaging studies similarly.

2. In my opinion there is possibly a selection bias in this paper – a small number of all possible papers are selected but due to the opacity of the methodology section we are unable to ascertain the nature of a possible selection bias. The authors criticise other researchers for identifying voxel activity based on correlation with the behaviour/phenomenological experience in question i.e. selection bias.

3. If there is a selection bias then the authors would have selected those papers which support their argument – thus generating a result similar to the ‘non-independence error’. Furthermore they have produced ‘visually appealing’ graphs for their data which ‘provide reassurance’ to the ‘viewer that s/he is looking at a result that is solid’.

4. The authors draw generic conclusions which in my opinion are outside of the limits of their analysis and are denoted in italics above. Given the large number of fMRI studies, in my opinion, such conclusions are the equivalent of deriving signals from noise.

5. In my opinion, the authors have used a number of emotive terms in their paper and also invoked the names of individual researchers as well as reputable journals. Such an approach is atypical of the papers I have reviewed previously and may be expected to ‘lavish attention from the’ research community in question.

Is there an explanation for the points above?

As I’m listening to some Jungian podcasts at the moment, I thought I would have a bit of fun in applying some Jungian analysis here. If the authors of this paper have themselves committed some of the acts they have criticised, what does this mean? Well, one obvious suggestion is that this represents projection. However I thought there was one term in particular that was very apt – the shadow. Indeed the particular term in this case would be the ‘cultural shadow’ as it applies to a group of people who in my opinion have applied these properties to another group. This interpretation may of course be incorrect but the interested reader is recommended to listen to the following podcast (#4) and draw their own conclusions. Alternatively it could be argued that the qualities required for projecting a shadow are a necessary result of having a similar domain of knowledge to the person/people who are the object of the projection.

Is Science Independent of Social Properties?

One interpretation of the events surrounding this paper is that reputation has been implicitly incorporated into the discussion. This would occur at the level of individuals, journals and ‘social neuroscience’. Thus we may see on another level of analysis that other related areas may be of relevance, areas which would be of potential interest to the anthropologist or sociologist – these including territorial disputes, tribal identity, social perception and social cognition.

Miscellaneous Issues

One of the more intriguing issues is about the subjects in the study. In particular, why the subjects consented to inclusion in a study which ultimately results in a publication which comments on the ‘seriously defective research methods’ in their field of social neuroscience. Nevertheless inclusion in a study does not imply that subjects will anticipate the results, although the methodology and null hypothesis will necessarily make the possible outcomes more predictable.


On the one hand there are a number of clear arguments in the paper which are useful in considering the research methodology further. However I argue that the points are presented atypically, that specific references have been made unfairly (in my opinion) – for instance to the field of ‘social neuroscience’ – and that these have resulted in a number of responses from the cited authors as well as generating attention from elsewhere. Given the complexity of the issues generated by this paper, the repercussions are unpredictable and it will be some time before the consequences become clear.

Steps to Treatment = 4 (Given that neuroimaging studies may lead to elucidation of important relationships that may hold clinical importance, this paper beginning a debate in a particular area of neuroimaging methodology may lead to improved analysis in related studies thereby facilitating further clarification of clinical relationships which may in turn lead to more effective diagnostic techniques or treatment interventions).

If you know of any links I’ve missed, could you please leave a comment or send an e-mail (details below).

Summary of Responses (In Progress)

In organising the responses in the media, I will firstly label the different sections of the original paper for referencing purposes.

Firstly the article is by Edward Vul, Christine Harris, Piotr Winkielman and Harold Pashler

The sections of the article are outlined thus

1. Abstract

2. A Puzzle:

Remarkably High Correlations in Social Neuroscience: Here the authors give some examples. They also state the nature of the difficulty they have, which is the high correlations reported in the studies. The authors state that the expected correlations should reflect not only the phenomenon being examined but also the measurement tools that are being used.

2.A. Reliability estimates.

In this section Vul and colleagues look at some assumptions that are central to the remainder of the argument. They focus on reliability. There are several types of reliability. Vul et al choose to focus on the test-retest reliability for both psychometric measurements and for BOLD (blood oxygenation level dependent) measurements. They select a study on the five factors of personality where the authors concluded that the reliability ranges from 0.73 to 0.78. They also select a study from 2005 in which the upper limit of reliability in BOLD responses was 0.7.

Discussion: The focus on psychometric measurement in general is both a strength and a weakness of the Vul paper. The obvious criticism of Vul’s approach is that there are just too many psychological constructs to make such generalisations useful or valid. What it does do, however, is focus the argument, which is just what is needed to discuss the central issues. Another point to note is that the authors have chosen to ignore other types of reliability which are just as important, and these include inter-rater reliability. If the scales or tests are administered by researchers rather than being self-administered then we might see a lot of variability in the results depending on factors such as the training of the raters themselves. Another useful consideration is the distinction between state and trait. Within a short period of time, changes in a person’s state may obscure the underlying trait as measured by the rating instrument. Additionally test-retest reliability should not always be close to 1, as we might expect these values to change within the individual over time. This means that there is always a trade-off between reliability and validity in psychological rating instruments that examine constructs that we would expect to change with time. With regards to the BOLD responses there is very little evidence to support the given reliability score of 0.7.
In the study that is referenced, multiple trials are undertaken which may lead to fatigue and variability in BOLD response (although this is an assumption, it is not unreasonable to suppose there might be some degree of habituation in blood flow to a given cerebral region with repetitive task activity and increased activity in that region). Trying to find a test-retest reliability value for BOLD responses is much more difficult to justify than with the psychological constructs, as there are just too many variables to consider which interact with the blood flow from one minute to the next. While there is very tight regulation of cerebral blood flow there is also variability in blood pressure, heart rate variability and heart rate, which may all influence cerebral blood flow, and these in turn are rapidly modified by a variety of physiological and psychological factors that can independently influence cerebral blood flow. Additionally the physiological state may activate several brain regions which in turn will influence regional cerebral blood flow also. Despite all of this, the above discussion actually supports Vul’s central argument as we might not expect particularly high test-retest reliability values.

B. The puzzle

The previous section leads on to the central premise of this section namely that by combining the above reliability values an upper limit can be established to the expected correlation between measurements of the psychological construct and the BOLD response.

Discussion: This seems like a very reasonable argument.

C. Meta analysis methods. In this section Vul and colleagues look at the literature and also undertake a survey of people working in this field.

1. Literature review. The authors want to answer the questions ‘How common are correlations higher than the expected upper limits and how are these calculated?’. They use the keyword fMRI together with a list of “social terms”. From this initial search they select papers that report across-subject correlations between task performance and BOLD activity. They identified 55 articles with 274 correlations. These are then presented in the form of a histogram in figure one in the paper.

Discussion: This seems to be one of the weaker sections in the paper. The authors have effectively included several types of study/experimental design within the same paper. If they are doing this in order to answer a question then it seems reasonable. However in the process, the methodology for the literature review, which is quite important for the purposes of reproducibility, has been briefly and incompletely described. They mention only a few terms which are combined to identify relevant papers and give us no indication of the other terms that were used. The more usual convention in a literature review is to provide a complete list of search terms together with the relevant search years and the database which is used. Then the criteria for filtering the studies are usually described. Even with a sophisticated search strategy there would be significant limitations and the authors might produce more effective results were they to limit their search to a very specific domain within the fMRI field. However in terms of the higher goal of generating debate about research methodology such an approach would have a more limited appeal. Thus there appears to be a trade-off between the impact of the paper and the validity of the results.

2. Elements of fMRI analysis

Vul and colleagues give a clear account of the analysis that takes place in current fMRI studies. As there is almost continuous activity throughout the brain they explain how contrasts are chosen to examine the activity within voxels. Thus activity within the voxels is compared between different tasks, or else voxels within a specific anatomical domain are examined. The authors argue that many studies do not make their method of analysis clear.

Discussion: The argument for clarity in the methodology is entirely sensible and consistent with good scientific principles, such that the results may be reproduced by other groups. Indeed an absence of a clearly defined statistical analysis within the methods section raises questions about why this was omitted.

3. Survey method

The authors then undertook a survey of various investigators in the field. They selected the authors of the papers that they had identified from the literature review. They constructed a questionnaire which focused on the research methodology. This was sent to the researchers, and a subsequent set of questions was composed and sent to the researchers after their first response. They obtained 53 responses, which compares favourably with the 55 articles they had identified. Two participants didn’t respond.

Discussion: Here there are further methodological problems. Although most of the questions have binary responses there were a few questions which are open-ended. Furthermore there were a number of questions which were used on an ad hoc basis after the initial responses were received. Essentially the researchers are conducting a qualitative study in addition to the above. There has been no pilot of this questionnaire described within the study. As this is a qualitative study there should be a qualitative approach to the analysis. This may well have taken place but the authors make no mention of it and instead provide us with quantitative results in the form of pie charts. At first glance it might seem as though the surveyed investigators might use one method or another and this would be well handled by the questions with dichotomous responses. However the data analysis is often quite complex and perhaps a qualitative approach would capture some of this complexity. By choosing dichotomous responses and presenting these in the analysis to the exclusion of the responses to the other questions we perhaps don’t get the full picture. There is a further difficulty here. As the methodology for the literature review is not clear, any bias in selection of papers is carried forwards into this part of the study, as we don’t quite know the characteristics of the sample population. Additionally it is not clear if one or several of the authors on each paper were contacted. If more than one author was contacted then the response rate would be quite low. If on the other hand it was solely the lead author on the paper then this is an extraordinarily high response rate and we are not told the method that the authors used to obtain such a high response rate.
In effect this means that for the study a highly selected group of authors were chosen, with the selection criteria remaining less than clear and with the selected authors being relatively constrained in one part of the questionnaire and then able to give more flexible responses in the other part of the questionnaire, which is not reported. One final point to note here is that the lead author may not have been the one to have carried out the statistical analysis. While it might not always be the case, some of the analysis might have been undertaken by other members of the group. When the participants were completing the questionnaire there were no questions asking whether that person had undertaken this part of the analysis. The lead author may be the person creating the study design, scanning the patients and doing the write-up. A simple question would have clarified this.

D. Survey results

The authors describe the quantitative aspects of the results in the form of a rather ingenious set of pie charts, each of which corresponds to a different part of the analysis.

Discussion: The results are not tabulated and we do not see if there are any significant differences between the different types of analysis undertaken. Similarly the qualitative results are not reported on at all. On the pie charts we can see however that there is a roughly even distribution amongst most of the methods used in the analysis. The authors also include a diagram showing the results of a simulation they have run which represents the voxel activity against performance on the task, with both being represented by noise – this in turn being modelled as a Gaussian distribution with a mean of zero. At first glance this seems rather convincing. However on closer inspection this appears to be an additional type of experiment included in this paper. When we look at it from this perspective it is at once obvious that there should again be a clearly delineated methodology section for this experiment. On seeing this it also becomes possible to see that there is not enough information to interpret the corresponding graphs and the authors’ conclusions. For instance, the authors state that Gaussian noise is used. This involves generating random data points that fit a Gaussian distribution with a mean of zero. However random numbers cannot be generated out of ‘thin air’ and there is usually a systematic method for doing so which invariably involves the use of a random seed. We do not know which method the researchers have used for this. Furthermore we do not have a number of important variables such as time. For instance, over how many virtual seconds was each simulation run and how was time represented within the simulations? However there are more critical points. How have the authors selected the threshold value? What they say here is that a statistical threshold is used; for this purpose they are presumably correlating this with the “behavioural measure”.
Why is this latter measure modelled as Gaussian noise when this is not the case in the real-world experimental paradigm? By not properly defining the methodology of the simulation the authors may have provided us with nothing more than a theoretical exercise in statistics, in which case the simulations were entirely unnecessary.
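To illustrate why the seed and generator matter for reproducing a simulation, here is a minimal sketch using Python's standard library. The seeds and sample size are arbitrary values I have chosen; nothing here reflects how the authors actually generated their noise, which is the detail I could not find.

```python
import random

def gaussian_noise(n_points, seed):
    """Draw n_points from a Gaussian with mean 0 and standard deviation 1,
    using a generator initialised with an explicit seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_points)]

first_run = gaussian_noise(5, seed=42)
repeat_run = gaussian_noise(5, seed=42)   # same seed: identical "noise"
other_run = gaussian_noise(5, seed=43)    # different seed: different "noise"

print(first_run == repeat_run)  # True
print(first_run == other_run)   # False
```

With the seed and generator reported, another group could regenerate exactly the same data points; without them, only the general shape of the result can be checked.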

E. The non-independence error

In this section the authors argue that if suitable thresholds are found and corresponding voxels selected during a task then there is in effect a selection bias for the chosen voxels.

Discussion: Again it would have been interesting to have seen the full results of the survey in order to examine the ‘other’ methods the participants had used to analyse the data.

F. Results and discussion

1. Are the correlation values reported in this literature meaningful?

The authors argue that if there is a non-independence error then the correlation will be overestimated or indeed spurious.

Discussion: This is a sweeping statement. The authors have already identified a wide range of different approaches to analysing the data within the studies as can be seen from the pie charts. Additionally we do not have access to the qualitative information which would help us to decide if the investigators are using other methods of analysis.

2. Is the problem being discussed here anything different than the well known problem of multiple comparisons raising the probability of false alarms?

The authors raise the objection that since there are large numbers of voxels being used in the analysis there is an opportunity for false positives on the basis of numbers alone. They criticise one of the standard methods for controlling for multiple comparisons which uses a p-value based on a cited study. However the authors argue that this value has been misinterpreted and the studies that used the relevant p-value may have produced spurious results. They suggest that larger cluster sizes should be considered for significance calculations.

Discussion: The assertion here is more serious and requires further exploration.

3. What may be inferred from the scattergrams often exhibited in connection with non-independence analyses?

The authors raise a similar objection to the scattergrams that are used in publications to illustrate correlations. They argue that selection of those voxels firing above the threshold if chosen would suggest a correlation where none exists.

Discussion: The argument here is essentially the same as that raised in an earlier section regarding spurious or exaggerated correlations due to what they refer to as a non-independence error.

4. How can the same method sometimes produce no correlations?

The authors suggest that within experiments different types of statistical analysis will be undertaken according to the researchers’ expectations. Thus within the same study they argue that artifacts will be produced by the non-independence error for some of the tests and at other points in the analysis this will be absent.

Discussion: This is a testable hypothesis. Since the authors have the data they should be able to undertake an analysis of a subsample of the selected papers where the data is available in order to test this hypothesis.

5. But is there really any viable alternative to doing these non-independence analyses?

The authors recommend two methods for analysing the data. In the first, the voxel activity is analysed without prior knowledge of the activity on the task. They argue this would remove bias. The second is what they refer to as a split-half analysis. This involves identifying correlations using half of the data. This is used to generate hypotheses which are then tested on the remaining half of the data.

Discussion: These suggestions seem on the surface entirely reasonable although there have been criticisms of these suggestions.
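As a rough illustration of the split-half idea, the sketch below selects voxels using one half of the data and measures the correlation in the other half. This is my own toy version under assumed parameters (two independent runs of pure Gaussian noise per voxel, 16 subjects, a threshold of 0.6), not the procedure as specified by the authors.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def split_half_demo(n_subjects=16, n_voxels=5000, threshold=0.6, seed=1):
    """Select voxels on run 1, then test the same voxels on independent run 2."""
    rng = random.Random(seed)  # assumed seed, for reproducibility
    behaviour = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    selection_rs, test_rs = [], []
    for _ in range(n_voxels):
        run1 = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        run2 = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        r_selection = pearson(run1, behaviour)
        if r_selection > threshold:
            selection_rs.append(r_selection)             # biased estimate
            test_rs.append(pearson(run2, behaviour))     # independent estimate
    return selection_rs, test_rs

selection_rs, test_rs = split_half_demo()
mean_selection = sum(selection_rs) / len(selection_rs)
mean_test = sum(test_rs) / len(test_rs)
print(f"selection half: {mean_selection:.2f}, held-out half: {mean_test:.2f}")
```

Because the held-out half plays no part in choosing the voxels, its correlations are not inflated by the selection step, so with pure noise they should scatter around zero while the selection-half correlations all sit above the threshold.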

6. Even if correlations were overestimated due to non-independence analyses can’t we at least be sure the correlations are statistically significant (and thus that there exists a real nonzero correlation)?

The authors again raise the previous points to suggest that the correlations may not be valid.

Discussion: The objections here would need to be raised against the earlier points.

7. Well in those cases where the correlation really is significant (i.e. non-zero), isn’t that what matters anyway? Does the actual correlation value really matter so much?

The authors argue that the value of the correlation is important. Relations with a small p-value may not provide much useful data or may not be present in another sample.

Discussion: This is an entirely reasonable argument that is taken as read in other research areas.

G. Concluding remarks

The authors reiterate the points from earlier in the article. They also make the suggestion that datasets should be reanalysed using the author’s recommendations.

Discussion: Perhaps datasets should be openly available as in the Alzheimer’s Disease Neuroimaging Initiative so that independent groups may analyse the data also.

H. References

I. Table one: non-independence (red), no response as of August 1, 2008 (orange), independent (green)

J. Appendix survey questionnaire

K. Appendix two. Most papers use cluster size not just a high threshold to capture correlations. Does inflation of correlation problems still exist in this case?

Reviews of Other Papers Relevant to the Discussion

‘Response to Voodoo Correlations in Social Neuroscience by Vul et al – summary information for the press’ by Jabbi et al

The article reviewed here is ‘Response to Voodoo Correlations in Social Neuroscience by Vul et al – summary information for the press’ by Jabbi et al and freely available here. The original Vul paper offers a useful focal point for discussion of the principles of fMRI image analysis and this article continues the discussion. This review will be integrated with the original analysis of the Vul paper. The authors have organised their response to the paper into 8 sections. In the first point the authors argue that there are corrections for multiple comparisons which control for the large number of voxels that have been sampled. They also argue that producing p-values and effect sizes for correlations complies with the American Psychological Association guidelines for statistical analysis. In their second point, the authors essentially argue that the upper limit value of Vul et al can be exceeded and cite a reliability of 0.98 in an fMRI study of language lateralisation to support their argument. The authors go on to comment on the simulation that Vul et al have run. I have commented on this in an extension of the original analysis (pointed to by the link above) and was quite sceptical about the simulations that have been run as we have few details of them. Here the authors add that the simulations should have contained a family-wise error correction. If I understand this correctly this is essentially the same as point 2. As the number of pairwise comparisons increases so too does the number of false positives and therefore there need to be corrections for this. The Bonferroni correction is an example and essentially allows the significance of multiple comparisons on the same data set to be calculated (to the best of my understanding).
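To make the family-wise error point concrete for myself, here is a small sketch of how the chance of at least one false positive grows with the number of comparisons, and how a Bonferroni-adjusted threshold reins it in. The figures of 0.05 and 100 tests are arbitrary, and the calculation assumes the tests are independent, which neighbouring voxels in real imaging data are not.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

alpha, n_tests = 0.05, 100
uncorrected = familywise_error_rate(alpha, n_tests)
per_test = bonferroni_threshold(alpha, n_tests)
corrected = familywise_error_rate(per_test, n_tests)

print(f"uncorrected: {uncorrected:.3f}")  # about 0.994
print(f"per-test threshold: {per_test}")
print(f"corrected: {corrected:.3f}")      # just under 0.05
```

With 100 uncorrected tests a false positive is almost guaranteed, whereas testing each comparison at the adjusted threshold keeps the overall rate below the nominal 0.05.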

In the fourth point, the authors argue that Vul et al have ignored papers that report non-significant correlations. However I would argue that such correlations would exist even if there is a bias in the analysis. In the fifth point the authors argue that it is not only the numbers that matter but also the biological validity of the correlations. Nevertheless the statistical correlation is central to the argument as without it even biologically valid associations are nothing more than speculation. Point 6 is a very strong argument – namely that correlations have been replicated using different statistical methodologies for the analysis. Here we see the importance of constraining the argument to a specific area of the social neurosciences as without such a constraint an extremely large number of counterexamples can be produced. Also with such generalisations much of the meaning can be lost. In the seventh point, the authors comment on the questionnaire being invalid. They quite rightly point out that questions on the secondary analysis should have been included. However a much more significant point here in my opinion (and mentioned in the extended analysis linked to above) is that the questionnaires have not been validated, some of the responses have not been reported on and the methodology for qualitative analysis is not given. This is in my opinion a more significant weakness in the original paper which traverses both quantitative and qualitative analysis without a clearly reported methodology. The final point is one that I have difficulty with. The authors criticise Vul et al’s suggestion of having a split-half analysis. The authors use the term ‘commonly’ to suggest that Vul et al’s suggestion ignores statistical techniques commonly applied for these reasons. However Vul et al are proposing something slightly different.

This article contains some interesting responses to Vul et al’s paper. This paper also highlights the complexity of the process of reading a paper as well as criticising such a paper on the path to a deeper understanding of the subject area.


‘Correlations in Social Neuroscience Aren’t Voodoo’ by Lieberman and colleagues

The featured article is ‘Correlations in Social Neuroscience Aren’t Voodoo’ by Lieberman and colleagues and freely available here. This continues indirectly the analysis of the original paper (here). The Lieberman paper is a considered response which in my opinion mirrors some of the properties of the Vul paper. The authors firstly comment that Voodoo has ‘connotations of fraudulence’ on the basis of a book on science with Voodoo in the title which examined many issues including fraud in science. They also comment on the tone of the article.

Much of the article’s prepublication impact was due to its aggressive tone, which is nearly unprecedented in the scientific literature and made it easy for the article to spread virally in the news

This is quite an interesting statement and many have commented on the tone of the publication. The paper has provoked much debate not just in the public domain but also within the scientific community. On the one hand there is the question of the reputation of a large domain within neuroscience coming under public attack. On the other hand this has provoked a number of prominent figures to publicly debate statistical analysis within the fMRI field and keep members of the public interested in this issue along the way. Many people within the field have commented also and some have even suggested that the statistics are not always well understood. As with many psychometric properties, it would not be surprising if the understanding of statistical methods within the field was normally distributed – although the ‘mean’ would be expected to be quite high relative to the general population as the field would most likely select for those with an interest in maths either directly or indirectly through an interest in the neurosciences.

Lieberman and colleagues look at the survey methods. They have gone to the effort of contacting authors of the papers on the non-independent list. However as with the Vul et al paper, there is no methodology section here. They are presumably using a qualitative approach to their analysis of the survey data. Not all authors were included and we do not know if there was a selection bias. We also do not know the questions that were used. What is quite curious is that the authors report that they use a single-step inferential process and there is no explanation of why multiple choice questions probing two-step inferential processes were simply left out by the subjects if they did not agree with the choices.

Lieberman and colleagues address the issue of whether there is a two-step inferential process. Essentially they are examining Vul et al’s assertion that in the statistical analysis of the fMRI data the conclusions are drawn two steps away from the original data. However Lieberman and colleagues challenge this by saying that when the data is plotted, this is nothing more than a simple description of the first inferential step. The first inferential step in turn represents the inferences that are drawn from the original data – the correlations between the voxel activity and the observed behaviours or phenomena.

Next Lieberman and colleagues conduct a simulation of the data and conclude that ‘76% of the simulated studies reported no correlation of r .80 by chance anywhere’ thus challenging Vul et al’s findings in their ‘simulation’. As with Vul et al’s paper there is no explicit description of the methodology within the paper. Instead the authors describe the simulation in the text attached to a diagram showing graphs of the data from the simulations. Again we do not know how the random numbers were generated and also the reason for running simulations which are essentially demonstrating established statistical properties.
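Since neither paper spells out its simulation, the following is only my own sketch of what a simulation of this kind might look like, and it should not be read as either group's method. The number of subjects, the number of tests per study, the use of independent Gaussian noise and the reading of the threshold as a positive r of at least .80 are all assumptions on my part.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_with_chance_r(n_subjects, n_tests, threshold=0.80,
                           n_studies=100, seed=1):
    """Fraction of pure-noise studies showing r >= threshold anywhere.

    Each simulated study correlates one random behavioural score with
    n_tests independent random 'voxels' (hypothetical set-up).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        behaviour = [rng.gauss(0, 1) for _ in range(n_subjects)]
        for _ in range(n_tests):
            voxel = [rng.gauss(0, 1) for _ in range(n_subjects)]
            if pearson(voxel, behaviour) >= threshold:
                hits += 1
                break  # one chance finding is enough for this study
    return hits / n_studies


few_tests = fraction_with_chance_r(n_subjects=10, n_tests=5)
many_tests = fraction_with_chance_r(n_subjects=10, n_tests=500)
```

The answer depends heavily on how many effectively independent tests are assumed per study and on the sample size, which is exactly why the missing methodology matters when comparing the two groups' figures.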

In the next stage, the authors return to Vul et al’s non-independent papers and extract the data. Essentially as I had argued previously there may have been a selection bias in the Vul paper. Lieberman and colleagues do indeed find a selection bias and by taking this into consideration the authors show that the inflation figure, i.e. the overestimate of the effect size, would be 0.12. I didn’t particularly understand the next part. The authors argue that there should be a smaller correlation size for the whole-brain analysis. They use a p-value threshold of 0.25 for the ROI and 0.001 for the whole-brain analysis. It can be seen that if a larger number of comparisons are being made then there should be a stricter (i.e. lower) p-value threshold for choosing values to avoid false positives. What is not clear to me though is why these particular values are chosen – the assumptions are not clear.

The authors then go on to explain how the statistical methods used can produce an artefact. They argue that the t-test is more likely to show up correlations in data where they exist if the data is less variable. They make a correction for this ‘restricted range’. No doubt they could also control for other such properties of the t-test or indeed of other correlation measures used in studies. The authors then address the upper limits of the correlations and answer this with a number of points. Essentially they cite studies that show high reliabilities of fMRI and psychometric data and show an upper limit of 0.92 in studies from the field, challenging Vul et al’s paper. Given the title of the original paper which takes a swipe at such a large field it is not surprising that such results can be readily identified.
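The upper-limit argument on both sides rests, as far as I can tell, on the classical attenuation relationship in which the largest observable correlation is the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below uses my own function names. The inputs of 0.8 and 0.7 are the estimates from Vul et al discussed earlier in this write-up, and I have not tried to reproduce the specific reliabilities behind the 0.92 figure.

```python
import math


def attenuation_ceiling(reliability_x, reliability_y, true_r=1.0):
    """Largest correlation observable given the reliabilities of both measures."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def required_reliability_product(observed_r, true_r=1.0):
    """Product of reliabilities needed for observed_r to be attainable."""
    return (observed_r / true_r) ** 2


# Psychometric reliability 0.8 and BOLD reliability 0.7 give a ceiling
# of about 0.75, in line with the roughly 0.74 quoted by Vul et al.
vul_ceiling = attenuation_ceiling(0.8, 0.7)

# A ceiling of 0.92 needs the two reliabilities to multiply to about 0.85.
needed = required_reliability_product(0.92)
```

Seen this way the disagreement is less about the formula than about which reliability values are realistic for a given scale and a given imaging paradigm.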

An interesting response to a provocative article, I wonder if this paper will have some historical value in due course.

‘Correlations and Multiple Comparisons in Functional Imaging – A Statistical Perspective’ by Martin Lindquist and Andrew Gelman

The article reviewed here is ‘Correlations and Multiple Comparisons in Functional Imaging – A Statistical Perspective’ by Martin Lindquist and Andrew Gelman and freely available here.

In the introduction the authors state that their training is primarily outside of the neurosciences and later in the paper describe themselves as ‘applied statisticians’. The authors also state that they will be commenting on the Vul paper from a statistical perspective as implied in the title of the paper. They begin with a summary of some of the points made in the Vul paper and some of the responses in the field before addressing the points one by one.

Firstly they note that Vul et al criticised the two-step inferential process before drawing attention to Lieberman et al’s response in which they conclude that such analyses are not usually performed within the field. They also comment on the first inferential step and second descriptive step in the process noted by Lieberman et al. They note that provided there is control for multiple comparisons the second step will not alter the fact that there is an underlying correlation although this correlation will be inflated.

The authors then look at the issue of reporting the results in the literature and note that if the reader is not helped to understand the meaning of the statistics then they can misinterpret the correlations:

For these reasons, the practice of simply reporting the magnitude of the reported correlations is somewhat suspect

Here is what I feel is a really important point. Part of the ‘rules’ of science are that the theory evolves through a critical process – the theories that remain after critical analysis and challenge by replication or other types of studies are by default successful – a survival of the fittest. However if the methodology is obscured then in effect, a paper is to some extent shielded from the critical testing ground of the scientific community. This in turn might be expected to slow the rate at which the ‘winning’ theories are successfully identified. Thus replication becomes more difficult if the methodology cannot be followed exactly and the findings are presumably more likely to be accepted by the community which could prolong the acceptance of false theories or beliefs.

The authors then go on to comment on significance testing

Indeed, it is well known that with a large enough sample size even very small effects will be statistically significant, and statisticians often warn about mistaking statistical significance in a large sample for practical importance
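A quick calculation illustrates the point in the quote. This is my own sketch rather than anything from the paper: it uses the standard t-statistic for a correlation and, to avoid extra libraries, a normal approximation to the p-value, which is only reasonable when the sample is fairly large.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for testing a Pearson correlation r from n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def approx_two_sided_p(r, n):
    """Two-sided p-value using a normal approximation to the t-distribution.

    Rough for small n; intended only to show the trend with sample size.
    """
    t = correlation_t_statistic(r, n)
    return math.erfc(abs(t) / math.sqrt(2))


# The same tiny correlation (explaining 0.25% of the variance) ...
p_small_sample = approx_two_sided_p(0.05, 50)      # nowhere near significant
p_large_sample = approx_two_sided_p(0.05, 10000)   # highly 'significant'
```

The effect is identical in both cases; only the sample size has changed, which is why significance alone says little about practical importance.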

The authors go on to comment on the limitations of effect sizes in both small and large populations and also state the importance of describing underlying assumptions in the models being employed. Again I would argue that this latter point continues the theme of ‘transparency’ which should facilitate the testing process of the scientific community and hasten the process of ‘selection’ of the best theories. If you help more of the community to understand the steps in the process leading to the conclusions, the community should be more likely to identify any flaws in the arguments.

Then the authors come out with this interesting comment

There are many factors that affect blood flow in the brain and we probably wouldn’t expect the average scans of two different groups of people to be exactly the same

The implication is that subtracting the activity of voxels in one group performing a task from those in the corresponding voxels in the control group might need to be modified as there would be many ‘significant’ correlations across the brain even after the relevant corrections have been undertaken. They expand upon this and describe how the focus of analysis should be on characterising persistent differences.

Finally the authors propose their own ‘multilevel’ model for use in fMRI analysis. The authors argue that the voxel correlations should be corrected for using a measure of the distribution of correlations in the entire voxel population or within a region of interest. This rests on the assumption that the activity in all of the regions represents an identical phenomenon and that looking at the distribution of correlations is helping in interpreting individual voxel firing. Ultimately these types of debates might well be settled by experimental evidence. If intraoperative techniques or combination imaging methods can be used perhaps we might be able to use additional sources of information to make sense of the voxel firing patterns seen in fMRI.
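To give a feel for what correcting individual voxel correlations using the distribution across voxels could mean, here is a deliberately simplified partial-pooling sketch. It is not Lindquist and Gelman's model: the function name is my own, the two variances are supplied by hand rather than estimated from data, and treating raw correlations as if they had a common sampling variance is a further simplification.

```python
def partial_pool(estimates, sampling_var, between_var):
    """Shrink each estimate towards the mean of all estimates.

    sampling_var: assumed noise variance of a single estimate
    between_var:  assumed true variance across voxels
    The noisier the individual estimates are relative to the true
    spread, the more strongly each one is pulled towards the mean.
    """
    grand_mean = sum(estimates) / len(estimates)
    weight = between_var / (between_var + sampling_var)
    return [grand_mean + weight * (e - grand_mean) for e in estimates]


# Hypothetical voxel correlations, one of them strikingly high.
raw = [0.9, 0.1, 0.2]
pooled = partial_pool(raw, sampling_var=0.09, between_var=0.01)
```

In this toy example the striking 0.9 is pulled most of the way back towards the average of the three values, which is the general behaviour one would expect from a multilevel approach when individual estimates are noisy.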

‘Big Correlations in Little Studies’ by Yarkoni

The article reviewed here is ‘Big Correlations in Little Studies’ by Yarkoni and freely available here. This is another in the long list of responses to Vul et al’s paper (which was originally reviewed here). In the introduction Yarkoni summarises the responses to Vul et al’s assertion that the correlations in studies were spuriously high and adds a third. Thus the three responses are that

1. The correlations were indeed spuriously high and in this regard Yarkoni suggests that empirically this is to a greater extent than asserted by Vul et al

2. The correlations are valid

3. The correlations are too high but less so than stated by Vul et al.

Yarkoni then looks at other factors which influence the correlations and focuses on the power of the study – a topic which he has written about previously. Yarkoni illustrates the role of power calculations with some worked examples. He argues that when a study has low power, a real effect will only reach a given level of significance if the effect size observed in the sample is much larger than the true effect. The large effects that are detected might therefore represent smaller effects that have been magnified as a result of sampling variance. Furthermore a larger variance is evident with a smaller sample size – as the sample size increases the confidence intervals shrink around the effect size. He also argues that the power of between-subject effects is much smaller than that of within-subject effects. Yarkoni makes an interesting point here –

an investigator who believes in big r’s has to explain why it is that most within-subject contrasts identify relatively distributed patterns of activation, whereas correlation analyses do not

Perhaps the comments of Lindquist and Gelman (paper reviewed here) are relevant here. They suggest in their analysis that in comparing groups there is no reason to suppose that the difference between mean activations in one area for the same task should be zero. In other words brain activation patterns are governed by so many factors that you could argue that there will always be differences between groups. Thus if you’re looking at effect sizes of significance under such circumstances you could find them in every region of interest. Indeed they also argue that the same could hold for within-subject designs as factors influencing blood flow will vary across time also.

If as he asserts the power of between-subjects comparisons is lower than that of within-subject contrasts, then the implication of the above is that behavioural correlates should be represented by distributed patterns of activation as detected in the higher powered within-subject studies. Further he writes

If one believes that brain-behaviour correlations of that strength exist in the population, the absence of any large-sample studies reporting such correlations is inexplicable….In fact, what tends to happen is exactly the opposite: As sample size grows, effects shrink
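The shrinking of effects with sample size can be illustrated with a small simulation. This is my own sketch and not Yarkoni's worked example: the true correlation of 0.3, the sample sizes and the Fisher z approximation used for the significance cut-off are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def critical_r(n, z_crit=1.96):
    """Approximate two-sided 5% cut-off for r via the Fisher z-transform."""
    return math.tanh(z_crit / math.sqrt(n - 3))


def simulate_significant_r(true_r, n, n_studies=500, seed=2):
    """Return (power, mean r among the significant positive results)."""
    rng = random.Random(seed)
    threshold = critical_r(n)
    noise_scale = math.sqrt(1 - true_r ** 2)
    significant = []
    for _ in range(n_studies):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a correlation of true_r with x in the population
        ys = [true_r * x + noise_scale * rng.gauss(0, 1) for x in xs]
        r = pearson(xs, ys)
        if r > threshold:  # only positive 'hits' are counted here
            significant.append(r)
    power = len(significant) / n_studies
    mean_r = sum(significant) / len(significant) if significant else float("nan")
    return power, mean_r


power_small, mean_small = simulate_significant_r(0.3, n=20)
power_large, mean_large = simulate_significant_r(0.3, n=200)
```

With 20 subjects only a minority of the simulated studies reach significance, and those that do necessarily report a correlation well above the true 0.3 because nothing smaller could pass the cut-off. With 200 subjects nearly every study is significant and the average reported value sits close to the true one.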

Yarkoni then argues that studies should have large sample sizes, the r values should be discounted for the sake of consistency and the focus should be on within-subject designs which are higher powered.

This is a well-argued paper that shows the immense complexity of the subject that Vul et al have chosen to comment on in their original paper as the points made here are different still to those in the previous responses to Vul et al that have been reviewed here.

Addendum 12.12.09

See this blog post by Yarkoni for an update

Discussion of Puzzlingly High Correlations in fMRI studies of Emotion, Personality and Social Cognition by Lazar

The featured article is a response by Lazar to Vul et al’s paper on statistics used in fMRI (and reviewed here) and Lazar’s article is freely available here. Vul’s article presents a useful focus for examining the statistical techniques used in fMRI and a systematic analysis of the responses to this paper should offer insights into the diversity of approaches considered in such analysis. Lazar is based in the Department of Statistics at the University of Georgia and starts off the article with this neat observation

Statistical tools were originally devised for a rather specific set of circumstances – mostly small or moderately sized data sets, collected under controlled experimental conditions

I was reassured to find that Lazar describes what Vul et al call the ‘non-independence error’ as a ‘selection bias’, which I had also considered in my original review above. Lazar discusses the issue of selection bias in other fields further and I was intrigued to read of Benjamini’s experience in discussing a possible selection bias in the field of genetics.

Lazar has many interesting suggestions and reiterates an important point made elsewhere

Increased transparency in reporting the details of an analysis will also help

Lazar tells us of ongoing work with colleagues to develop new statistical approaches to analysing imaging data and suggests that researchers will move away from the correlation analysis as more appropriate forms of analysis take their place. The article also warns us against the basic mistake of analysing the same data set twice and notes the increasing complexity of data available for analysis.

While relatively brief, I thought this was a thought-provoking article which suggests changes not just in neuroimaging but also in the field of statistics.




I thought as this write-up is ongoing that I would also put down some of my personal reflections on this paper and the interest it has generated. This section is necessarily speculative and experiential. I’ll also put a date at the end of each section to indicate when I have started (and finished) writing part of this section (i.e. dates will also act as partitions within this section and delineate 1 day’s writing). My first thought is that the presentation of this paper has questioned the research of particular scientists in a quite public way. In many journals however, articles which provoke a lot of interest also generate many letters. The authors are able to view these letters and respond to them such that the letters and the authors’ responses may be published in the next issue. As these responses can sometimes be viewed publicly, the end result is similar to what has occurred here i.e. that the scientist is being publicly challenged. However one of the criticisms that has been leveled at the ‘blogosphere’ is that this process has been bypassed and the authors haven’t been able to respond through the usual channels. The reason for this, as I understand it, is to do with speed of response. The blogosphere is very fluid as individual authors are able to publish quickly. On the other hand, journals have planned issues many months in advance and there are many complex logistics involved in producing quality journals both online and in paper format as well as distributing large numbers. There is also a convention which I will refer to as a culture, part of the culture of science, removed to some extent from the process of doing the science but another process which the scientist must be able to successfully negotiate. The blogosphere is a collection of effectively small, independent publishers who may or may not have developed a collective culture. So what seems to have happened is that an author has put down a challenge to a large number of authors.
This has then been released before print and responded to on the blogosphere. Perhaps the challenged authors might have expected the paper to be sent to them firstly, so that they might respond and then for the paper and these responses to have been published together. However even so, there might have been one issue between the paper and the responses putting us effectively in the same position we are in now, where the paper gains exposure and the respondents look for an opportunity to publish their responses. In this process, I think what has happened is that the dialogue that occurs in the journals has been displaced to the blogosphere where this dialogue is occurring at a frenetic pace. So is this a bad thing? I don’t think so. I don’t think it’s even new. In the case of a press release, the findings of a scientific study are taken up by the media and the dialogue occurs very rapidly, particularly if there is dissemination through the radio and television, not just by the scientific community but also members of the public. While the blogosphere is accessible to the public, it does also have a certain community with an interest in a particular area.

A special point about the article which has perhaps caused it to generate so much interest is that it is a study about the process of doing science and that it has examined the work of individual scientists. Perhaps it is the nature of the study more than anything else that makes the location of the dialogue so important. In the process of defending their papers, the authors are also trying to work out where it is that this dialogue is taking place which in turn may produce anxiety as the ‘blogosphere’ interconnections are fluid. The locations of the studies will be interesting to see, as if there are a larger number of independent locations then the ‘interest’ the paper generates may be more ‘widespread’ (crudely speaking). On the other hand if several groups were being mentioned more than once in the red list, that might also produce more anxiety. The inclusion of the paper in a Wikipedia article on social neuroscience suggests there has already been generalisation and perhaps this might even need challenging. I will also look at the New Scientist and Scientific American articles as these are disseminated widely to the public and it will be interesting to see the interpretations (31.1.9).

The more I think about it, the more I see a connection between what is happening in this online debate about the paper and a term ‘distributed conversations’ that I came up with for an introduction to what I refer to as ‘Internet Models’.

The term ‘distributed conversation’ refers to conversations which are separated in space and time. So for instance in a typical conversation, two people might be in the same room talking to each other. In this case it’s very simple to follow the dialogue. However online, there is an ‘audit trail’ of conversations and we can see how one comment left on a blog can be responded to in a separate article on another blog. Thus not only is the continuing dialogue separated in location and time but even the participants may be substituted. While this occurs in the offline world, it is the ‘audit trail’ that makes this experience quite different. We are better able to see some of the responses of people not directly involved in the research whereas previously such conversation would have been lost in the ‘mists of time’ (1.2.9). I’ve been thinking of a few more questions about what the recent events mean for dialogue in science. If there is a demand for more rapid discussion of science articles, does this mean that the platform offered in the blogosphere has always been needed? Do scientists want a public debate on their work? In the past has this been available only in correspondence in journals, discussion at conferences, in groups within individual labs or else in occasional press releases in the broader media channels such as newspapers, radio and television? Do science journals and scientists want to reach their audience through these media channels? For instance if there are people who have no interest in science are they more likely to misinterpret the messages? Or is the accumulation of correct or incorrect information even by those without an interest in the subject part of the process of science passing into culture? Does this mean that communication through these channels must contain a message for both the general audience and the specialist audience? Will journals increasingly adapt to create their own forums for rapid dialogue?
An example is the Rapid Responses section of the BMJ. However, what role does the blogosphere play with the advent of such forums in journals? Does the blogosphere increase the extent of the debate and if so, can we say that this pushes science forwards? (3.2.9). A number of the authors have responded with their own PDF documents which have been stored on their own websites. In this sense, part of the debate has occurred in a third area – the ‘author sphere’ – the author’s territory where they can determine the amount of space they allocate to a response. Thus the author responses have been many pages long. This length of response might present difficulties in a paper journal where space could be limited. Additionally such lengthy responses can be lost in the comments sections of blogs or responses sections of online journals. In the ‘author sphere’, the author can take some control of the discussion around their work, for instance by pointing people discussing the paper to their response – in some senses a process akin to marketing their work. However the ‘author sphere’ doesn’t have to be managed by the authors in this way. Thus if the authors are allowed to place their responses in influential blogs as guests, the marketing and technical details of maintaining the website can be managed by those with resources in this area. Additionally, online journals do not need space restrictions and could extend the ‘author sphere’ further with both blogs and online journals, or online presence of paper journals providing this forum for authors (6.2.9).

One of the themes that has emerged is that some of the authors fear that this attack on the field will draw away support for future studies of this nature both in terms of funding and also in terms of publication in journals where editors are looking at the debate. However, this provides the field with an opportunity to respond (which a number of authors have done very promptly) and to educate various groups about the techniques that are being used so as to reassure people in this regard. As a simple example, a video could be created explaining the statistical analysis that occurs during studies, outlining the criticisms levelled by Vul et al and also delineating the responses to these criticisms. Guidelines for authors and editors could be included – questions that need to be answered. Such material could be argued to benefit not just the researchers, the editors and the funding agencies but also science journalists, health professionals and interested members of the public. Such a video could be posted on YouTube and referenced. Surely dissemination of research methodology is just as important as dissemination of results and dissemination involves clarity and is ultimately for the benefit of the audience. If this holds then the creation of such resources would meet an important need. One unusual feature of the events surrounding this paper is that it was released publicly after acceptance but prior to publication. It would be interesting to know if the journal editors knew of this beforehand and if this approach will be used in the future (14.2.9).

Unsurprisingly there has been a surge of interest around the time at which Vul’s paper was released on the internet and now things have quietened down a bit. As I understand it, the paper will be renamed for publication and is now being referred to as ‘the paper formerly known as Voodoo correlations in social neuroscience’. The Nature editorial referred to in this article was quite interesting and although not explicitly referencing this paper, the issues being discussed were extremely relevant (29.3.9). So I’ve come back to this after a few months. I don’t think too much was written in the blogosphere until the article finally reached print with a number of responses by authors and an introduction by the editor. What’s quite interesting is that another article has recently been published (again see the links above) on neuroimaging methodology – on the phenomenon of double dipping in analysis – and this has now been given the name Voodoo II by MindHacks. The term Voodoo, which initially could be viewed as derogatory, has been dropped from the final press version of the article but has now been used as a ‘handle’ to explore these issues. In effect it has helped to create a ‘cultural map’ for some people to orientate themselves when discussing fMRI methodology. The final print paper, the responses and the Voodoo II paper have now started to generate a lot of interest on the blogosphere (see links above). There seems to be a momentum that is maintained by key papers in the published literature with discussion occurring in the blogosphere (4.5.9). It’s been a while now since I’ve updated this article and reviewed the scene. I note that there has been a little activity on the scene in an interesting blog (see here). I’ve added a diagram at the top of the page in the meantime and have also added some of the related reviews of articles I have done which add to the debate.
Kuhn’s ‘The Structure of Scientific Revolutions’ is also relevant to the debate but I would have to have a further think about this before writing about it (10.7.11).


If you have any comments, you can leave them below or alternatively e-mail justinmarley17@yahoo.co.uk


The comments made here represent the opinions of the author and do not represent the profession or any body/organisation. The comments made here are not meant as a source of medical advice and those seeking medical advice are advised to consult with their own doctor. The author is not responsible for the contents of any external sites that are linked to in this blog.

