# Doing Science Using Open Data – Part 5: Looking at Populations

In this fifth part of the series on using open data for science I’ve taken a slight diversion to look at populations and the issue of sampling. This was prompted by a look at the UK Mid-2011 Census data shown in the graph below.

The issue arose while comparing two segments of the population. What was interesting was that the census data describes the population itself rather than a sample from it. When we work with samples from a population, we make a number of assumptions about that population. For instance, it may be assumed that the population is normally distributed, in which case standard measures such as the mean and standard deviation would describe it. However, looking at the graph above, we can see that the population does not resemble a normal distribution. Instead it is right-skewed (the higher counts are on the left, with the tail extending to the right) and shows additional variation, characterised by a number of peaks and troughs and a spike in the mid-sixties. In comparing the two segments of the population examined in the earlier part of the series (16-44 and 45-65), it became clear that each segment has distinct characteristics.

In statistical terms this would be a multimodal population, since there are numerous peaks in the data. Is the data discrete or continuous? Of course our ages are continuous. For the purposes of the census, however, ages are presented as discrete values, which is necessary in order to organise the data. The graph above might therefore be better presented as a bar chart, although the line graph still gives a useful overview of the data. The female and male data follow similar patterns over the ages 16-65, although there is a larger variation towards the earlier part of the age range. If we consider the two groups in turn we get the following:

16-44 – this segment of the population has a bimodal distribution with peaks at around the mid-twenties (the summed data would be more useful for visual inspection)

45-65 – this segment of the population has a right-skewed distribution with a superimposed peak towards the mid-sixties.
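Descriptions such as "right-skewed" and "a superimposed peak" can also be checked numerically rather than by eye. The following is a minimal sketch in Python, assuming the data is available as one count per whole-year age. The counts are hypothetical values made up for illustration, not census figures, and the function names `weighted_skewness` and `local_peaks` are my own.

```python
def weighted_skewness(ages, counts):
    """Skewness of a population given as a count per whole-year age."""
    total = sum(counts)
    mean = sum(a * c for a, c in zip(ages, counts)) / total
    variance = sum(c * (a - mean) ** 2 for a, c in zip(ages, counts)) / total
    third = sum(c * (a - mean) ** 3 for a, c in zip(ages, counts)) / total
    return third / variance ** 1.5


def local_peaks(ages, counts):
    """Ages whose count is strictly higher than both neighbouring ages."""
    return [ages[i] for i in range(1, len(counts) - 1)
            if counts[i - 1] < counts[i] > counts[i + 1]]


# Hypothetical counts for illustration only, NOT census figures:
# a declining profile with one superimposed peak.
ages = list(range(45, 53))
counts = [100, 90, 80, 70, 60, 50, 65, 40]

skew = weighted_skewness(ages, counts)
peaks = local_peaks(ages, counts)
print(f"skewness = {skew:.2f}, local peaks at ages {peaks}")
```

A positive skewness corresponds to a right-skewed distribution. With real census counts the year-to-year variation would probably produce many small local peaks, so some smoothing or a minimum peak height would be needed before reading much into the result.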

The benefit of dwelling on this is that whenever we sample from a population we should bear in mind the original population. Studies usually have strict inclusion criteria, which effectively transform the population and limit generalisations to similar populations. Even before the inclusion criteria are applied, the population may not be representative of the national population. If we look at the population within one geographical location, it may differ from the national population even though it is drawn from it. For instance, its age profile may be skewed relative to the national one. In this case we would have to think about three populations: the national population, the local population and the population included in the study. We may be able to transform the study findings so that they apply to these different populations, although the difficulty is that stratifying the study data leaves fewer participants in each stratum, which may result in a loss of significance of the study findings.
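One common way of carrying study findings over to a different population is direct standardisation: the result is calculated within each stratum and the strata are then re-weighted by their share of the target population. The sketch below assumes two age bands and entirely hypothetical stratum means and population shares. It illustrates the idea and is not a method taken from any particular study.

```python
def standardised_mean(stratum_means, population_shares):
    """Weight each stratum's mean by that stratum's share of a target population."""
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(stratum_means[band] * population_shares[band]
               for band in stratum_means)


# Hypothetical study result (mean of some measured outcome) per age band.
stratum_means = {"16-44": 10.0, "45-65": 20.0}

# Hypothetical age-band shares for the three populations discussed above.
study_shares = {"16-44": 0.25, "45-65": 0.75}
local_shares = {"16-44": 0.50, "45-65": 0.50}
national_shares = {"16-44": 0.60, "45-65": 0.40}

study_mean = standardised_mean(stratum_means, study_shares)
local_mean = standardised_mean(stratum_means, local_shares)
national_mean = standardised_mean(stratum_means, national_shares)
print(study_mean, local_mean, national_mean)
```

The same stratum results give a different overall figure for each population, purely because the age mix differs. The cost is that each stratum mean is estimated from only part of the study sample, which is where the loss of significance comes from.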

What is also interesting is that a sample population can be compared against the national population in a number of ways. Although it is usual to match on the variable of interest, the sample can also differ from the national population in characteristics such as age and years of education. These are usually adjusted for in comparisons, but it would be useful to have a compound metric that describes how far the sample population is from the national one. The figures needed to calculate it could in turn be provided in the national census data. Such a metric would be a simple aid to generalising study findings to populations.
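As a rough idea of what such a metric might look like, the sketch below uses the total variation distance between the sample and national distributions for each characteristic and averages the results. This is only one candidate and is my own assumption rather than an established measure for this purpose. The band labels and counts are hypothetical.

```python
def total_variation(sample_counts, national_counts):
    """Total variation distance between two distributions given as counts per band.

    0 means the proportions are identical, 1 means there is no overlap at all.
    """
    sample_total = sum(sample_counts.values())
    national_total = sum(national_counts.values())
    bands = set(sample_counts) | set(national_counts)
    return 0.5 * sum(
        abs(sample_counts.get(b, 0) / sample_total
            - national_counts.get(b, 0) / national_total)
        for b in bands
    )


def compound_distance(sample, national):
    """Unweighted average of the per-characteristic distances (illustrative choice)."""
    distances = [total_variation(sample[name], national[name]) for name in sample]
    return sum(distances) / len(distances)


# Hypothetical counts per band, for illustration only.
sample = {
    "age": {"65-74": 20, "75-84": 60, "85+": 20},
    "education": {"under 12 years": 70, "12 years or more": 30},
}
national = {
    "age": {"16-44": 50, "45-64": 30, "65-74": 10, "75-84": 7, "85+": 3},
    "education": {"under 12 years": 40, "12 years or more": 60},
}

age_distance = total_variation(sample["age"], national["age"])
overall = compound_distance(sample, national)
print(f"age distance = {age_distance:.2f}, compound distance = {overall:.2f}")
```

An unweighted average treats every characteristic as equally important, which may not suit a given study. A weighted version would be a straightforward extension.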

To illustrate the above discussion, suppose that a study reports some interesting findings about a sample of older adult women with an average age of 80, a range of 65-90 and a normal distribution of ages. A cursory examination of the above census data would reveal that this is not representative of the national profile. We would therefore have to think about how the study findings can be generalised, and why the study sample characteristics differ from the national profile even though it is still a sample from the national population. This is relevant in understanding the epidemiology of dementia, for example.

Appendix

Doing Science Using Open Data – Part 1

Doing Science Using Open Data – Part 2

Doing Science Using Open Data – Part 3

Doing Science Using Open Data – Part 4