Review: Test-Retest Reliability in fMRI

The paper reviewed here is a preprint edition of ‘fMRI Reliability: How reliable are the results from Functional Magnetic Resonance Imaging’ by Craig Bennett and Michael Miller and is freely available here. MindHacks drew attention to this article and it is also referenced on Craig Bennett’s blog. The article, excluding references, is just under 10,000 words. In the introduction, Bennett and Miller refer to the Vul et al. paper, which critically appraised the methodology in contemporary fMRI studies (for further details see here). They then argue for the importance of reliability in fMRI studies with a number of supporting points.

I found the section on ‘What factors influence fMRI reliability’ slightly tricky to get to grips with. Although the authors explain the concepts very clearly throughout the paper, I hit a stumbling block with the contrast-to-noise ratio (CNR) and the signal-to-noise ratio (SNR). The SNR is defined by dividing the mean signal intensity in a volume region by the standard deviation of the signal strengths of the voxels in that region. My difficulty here was more of a philosophical one: the ratio itself seems to contain some implicit assumptions. How do we know, for instance, that the variation in signal intensity across voxels is a marker of noise? Suppose, for example, that the fMRI scanner performed perfectly, capturing the data without error. Would we expect to see a wide variation in voxel activity in a region, or very little variation? I don’t know on what basis we could argue which of these is more likely. If we don’t know the answer to that then, for me, it is difficult to know what the noise and the signal will ‘look’ like.
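To make the worry concrete, here is a minimal Python sketch of the ratio as described above (mean intensity of a region divided by the standard deviation across its voxels). The function name and both sets of voxel values are hypothetical, chosen only for illustration, and the use of the population standard deviation is an assumption rather than something taken from the paper.

```python
from statistics import mean, pstdev


def spatial_snr(voxel_intensities):
    """Mean intensity of a region divided by the standard deviation across its voxels."""
    # Population SD is an assumption; a sample SD would be an equally defensible choice.
    spread = pstdev(voxel_intensities)
    if spread == 0:
        raise ValueError("all voxels are identical, so the ratio is undefined")
    return mean(voxel_intensities) / spread


# Hypothetical region 1: uniform tissue measured with a little scanner noise.
uniform_but_noisy = [100, 101, 99, 100, 100, 102, 98, 100]

# Hypothetical region 2: an error-free measurement of voxels that genuinely differ.
varied_but_noiseless = [60, 80, 100, 120, 140]

print(spatial_snr(uniform_but_noisy))
print(spatial_snr(varied_but_noiseless))
```

Both regions have the same mean, yet the second one scores far lower even though, by construction, it contains no measurement error at all. The ratio on its own cannot tell genuine variation between voxels apart from noise.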

Similarly with the contrast-to-noise ratio. If we were taking a photograph of a scene and displaying it to convey useful information, e.g. a portrait, then clearly contrast is going to be very useful. If there is a difference between two pixels then ‘maximising’ this difference suggests a transformation of the data. Surely this would depend on the amount of information stored within the image, which is effectively the resolution of the image. If the image is stored digitally then the detail within it would be a function of both the number of pixels in each layer and the number of values each pixel can take. However, we are talking about volumes here and would therefore replace pixels with voxels. The more voxels that are possible within the image capture, all other things being equal, the higher the resolution. So I could only see the CNR making sense if it represents the ‘range’ of signals possible within a voxel, which would be a function of the equipment. For me, this means that there is a difference between the SNR and the CNR, because the former is dependent on the brain activity whereas the latter might be a function of the equipment (although I might be wrong about this).

The difficulty in relying on the brain’s activity for the signal-to-noise ratio can be illustrated as follows. If we take a snapshot of the brain’s activity at a moment in time or across time, why should the reality not be a messy blur of activity that doesn’t make much sense to us? After all, our brains didn’t evolve to inspect fMRI scans. The bottom line of the argument is this: why shouldn’t the brain shift its activity very subtly when responding during tasks? Sure, it could do some very dramatic shifting which would be easy to pick up in the scan, but on what basis are we arguing that there should be a dramatic and not a subtle shift? A subtle shift might represent a small change in the activity of individual neurons across a wide network, much as might be expected in the artificial neural networks that have been used to simulate brain functions and perform useful tasks. There, subtle shifts in the weightings of individual synapses are sufficient to store complex information with little change in the overall ‘energy’ of the network.

Moving on to the remainder of the paper, which builds on the earlier discussion of CNR and SNR, the authors cite research showing that doubling the field strength of the MRI scanner from 1.5T to 3T results in a 1.3-fold increase in the CNR. The SNR also increases with a doubling of the field strength, although I didn’t fully understand the rationale as I’d got stuck earlier, as described above. What I found interesting was the large number of factors that influence image quality. The authors cite a noted example of a faulty light bulb which produced patterns in the resulting images (presumably, though, these would have looked different from brain activity). This was one clear example of extraneous noise that would interfere with the brain signal.

They also describe the subtle alterations in methodology that are required between scanners of different strengths to optimise signal-to-noise ratios. In the section on ‘SNR considerations of analysis methods’, the authors raise some very interesting points about the transformations that are applied to the data. For instance, I was intrigued by the smoothing operations that are used to reduce error. Smoothing of the data sounds to me like averaging, and this would effectively ‘remove’ outliers. However, who is to say that the outlying data is not important – perhaps one area drives the others? My argument here is not for nihilism but rather for an explicit elucidation of the underlying assumptions inherent in each stage of the data analysis. If any one of these steps is adopted uncritically into the wider neuroimaging culture then there is the risk that it will become a ‘ritual’ rather than a useful and relevant analytic procedure. For each hour of effort put into analysing data, surely two hours should be put into the reasons why the data is analysed the way it is (OK, entirely arbitrary, but just for the sake of the argument). After all, it is a simple enough matter to automate the data analysis procedure, but it is the choice of analytic procedures which is the key.
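The point about smoothing and outliers can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a one-dimensional moving average (a ‘box’ filter) over a hypothetical row of voxels, whereas smoothing in fMRI pipelines is typically a three-dimensional Gaussian kernel, and the paper’s exact procedure is not reproduced here.

```python
def box_smooth(signal, width=3):
    """Replace each value with the mean of itself and its neighbours.

    Near the edges the window is truncated to the values that exist.
    """
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical row of voxels in which a single voxel is strongly active.
activity = [0, 0, 0, 10, 0, 0, 0]
smoothed = box_smooth(activity)

print(activity)
print(smoothed)
```

After smoothing, the single active voxel drops to a third of its original value and its two neighbours acquire activity they never had. If that one voxel was the area ‘driving’ the others, the smoothed data no longer says so.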

The analysis of the equipment is relatively easy (although, as discussed above, it is still extremely complex). But if analysis of the equipment is the easy part, then assessing the subjects really takes the biscuit. Start asking whether one subject varies from another and you enter the tangled web of philosophy, psychology, neuroanatomy, neurophysiology, neuropharmacology and any other discipline which borders on this question. Take genetics, for instance. Do some people experience synaesthesia as a consequence of genetics? If so, does that mean that there are subtle differences in the wiring, and consequently in aspects of brain functioning and therefore in activation under specific tasks? If so, then how many genetic factors are relevant? And if this is the case for synaesthesia then why not for every other brain function? How many brain functions are there? How many combinations of brain functions are there? Can a person have several gene-brain function changes relative to another person? If so, how many combinations are possible? How do you determine which combinations are present? What sample size would you need for each combination? This argument rests on an assumption of brain modularity, which isn’t necessarily the case. But even here this is just one tiny part of the argument. What role does the environment play for each person, through the effects of environmental insult, learning, nutrition, exercise, motivation and so on?
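A rough back-of-the-envelope sketch shows how quickly the ‘how many combinations’ question gets out of hand. It rests on two loud assumptions that are mine and not the paper’s: each gene-brain function factor is either present or absent, and the factors combine independently. The function names and the figure of 20 participants per group are hypothetical.

```python
def combinations_of_variants(n_factors):
    """Number of distinct present/absent patterns across n independent binary factors."""
    return 2 ** n_factors


def participants_needed(n_factors, per_group):
    """Total sample size if every combination needed its own group of participants."""
    return combinations_of_variants(n_factors) * per_group


for n in (1, 5, 10, 20):
    print(n, combinations_of_variants(n), participants_needed(n, per_group=20))
```

Even with only ten such factors there are over a thousand combinations, so a study that wanted a modest group for each one would already need tens of thousands of participants.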

The section on fMRI reliability focuses on some philosophical questions which are inescapable. If you retest a person on the same task, then how do you determine whether the imaging method is reliable? This is a problem common to psychometric tests. If there is perfect test-retest reliability in the same person, then something’s wrong with the test, because people can change with time. This basic problem underlies the neuroimaging paradigm also. The authors describe a number of statistical methods for assessing test-retest reliability in voxels. What I found most interesting here was the concept of analysing whole-brain activity, which the authors point out has happened infrequently. This seems a very thorough approach to the question, but at the same time I would expect there to be little reliability in this approach if there is intrasubject variability in the same task over time as well as a modular approach to brain function.

The authors use three methods of assessing test-retest reliability:

1. Cluster overlap. This depends on voxels meeting a threshold in both test and retest, which is a drawback of the method.

2. Voxel counting. Looking at voxels which are active in both test and retest.

3. Intra-class correlation. Matching not just the activity but the magnitude of that activity in voxels in the test/retest conditions.
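The overlap idea behind the first two methods can be sketched as follows. This is a minimal Python illustration and not the paper’s own procedure: it assumes that ‘active’ voxels have already been chosen by some threshold, represents them as sets of hypothetical (x, y, z) coordinates, and uses the Dice and Jaccard coefficients, which are two standard ways of scoring the overlap between two sets.

```python
def shared_voxels(test, retest):
    """Voxels that are active in both the test and the retest session."""
    return test & retest


def dice_overlap(test, retest):
    """Dice coefficient: twice the shared count over the sum of both counts."""
    total = len(test) + len(retest)
    if total == 0:
        return 0.0  # no active voxels in either session; treated as no overlap in this sketch
    return 2 * len(shared_voxels(test, retest)) / total


def jaccard_overlap(test, retest):
    """Jaccard coefficient: shared count over the count active in either session."""
    union = test | retest
    if not union:
        return 0.0  # same convention as above
    return len(shared_voxels(test, retest)) / len(union)


# Hypothetical suprathreshold voxel coordinates from two sessions.
test_session = {(1, 2, 3), (1, 2, 4), (2, 2, 3), (5, 5, 5)}
retest_session = {(1, 2, 3), (1, 2, 4), (9, 9, 9)}

print(len(shared_voxels(test_session, retest_session)))
print(dice_overlap(test_session, retest_session))
print(jaccard_overlap(test_session, retest_session))
```

Because the sets are built only from voxels that passed the threshold, a voxel that falls just short in one session counts as a complete miss, which is the drawback noted in the list above.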

The authors set out their search strategy, although I wasn’t able to find the search years used. They use the PubMed database with a simple search strategy, followed by a hand-search of the retrieved papers. The results are given in full in tables 1, 2 and 3. Table 1 shows that the range of ICC values across the tasks in the included papers was 0.58 on average, which I thought was quite large. The cluster overlap varied from 0.21 to 0.865, again showing quite a large range.
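For readers who have not met the ICC before, here is a minimal Python sketch of one common variant, the one-way random-effects ICC, often written ICC(1,1). Which ICC variant each of the reviewed studies used is not something I can confirm from the tables, so treat the choice of formula as an assumption; the function name and the activation values are hypothetical.

```python
def icc_oneway(scores):
    """One-way random-effects ICC for a list of subjects, each with k session values.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB is the between-subject
    mean square and MSW is the within-subject mean square.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 subjects, each with the same number (>= 2) of sessions")

    subject_means = [sum(row) / k for row in scores]
    grand_mean = sum(subject_means) / n

    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, subject_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variance in the data, so the ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical activation magnitudes: one row per subject, columns are test and retest.
activation = [[1.0, 1.2], [2.0, 1.8], [3.0, 3.1], [4.0, 3.7]]
print(icc_oneway(activation))
```

A value of 1 means each subject’s retest exactly reproduces their test. Values fall as within-subject differences grow relative to the differences between subjects, so the same scanner and task could yield a lower ICC simply because the subjects themselves changed between sessions.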

What I found interesting were not the answers to the bigger questions above but rather the secondary findings. For instance, the authors note that there is greater test-retest reliability in tasks assessing motor function than in those assessing cognitive function. Perhaps this reflects the relative complexity of the cognitive tasks and/or the increased space of viable solutions. There was no consensus on what values should be used for test-retest reliability. There were some potentially interesting findings in clinical groups, but I wasn’t sure how to interpret these in view of the wide variation that was found across all of the above studies. A between-centre variability of 8% in terms of the BOLD signal seemed quite good relative to the inter-subject variability.

The authors conclude by suggesting a number of improvements that could be used in research, including increasing task length and scanner strength, using genetic algorithms in task construction, applying quality assurance methods and running a mock-up of the task beforehand to ensure the subject has understood it. They then look at where neuroimaging might go next and suggest the importance of the analysis of large data sets and of ‘provenance’.

I thought this was a very interesting paper, and the authors have a talent for explaining complex subject matter in a way that can be understood by the interested reader who might have little background in the subject. I can’t help feeling, however, that fMRI research really focuses on the very core of what the brain is doing and that the tool is giving yet another window onto the complexities of the brain. What I mean by this is that the solution may lie not in the images themselves but rather in the understanding of the images by triangulation with other areas of knowledge. Indeed, it seems to me inescapable that fMRI data cannot be interpreted in isolation from psychology, neuroanatomy, neuropharmacology, neurophysiology, psychiatry, computational neurobiology, cellular neurobiology and philosophy. Perhaps a consensus needs to be arrived at not just within the neuroimaging community but with members of these other communities. Such a consensus would need to invoke optimal methods for deliberation within groups (see Infotopia by Cass Sunstein for further information on this subject) to ensure an unbiased consensus which maximises the use of specialised knowledge within the group setting, and which hopefully would enable optimal research strategies to be devised and blind alleys to be avoided.



The comments made here represent the opinions of the author and do not represent the profession or any body/organisation. The comments made here are not meant as a source of medical advice and those seeking medical advice are advised to consult with their own doctor. The author is not responsible for the contents of any external sites that are linked to in this blog.



