# Doing Science Using Open Data – Part 2

This is the second in a series looking at Open Science. In the first article (see Appendix) we looked at how to obtain open data on mortality figures. Graphing the data produced the figure below.

Graph of UK Mortality Data over 8 Consecutive Weeks for 7 Age Groups (Weeks ending 10th August 2012 to the 28th September 2012)

I generated two hypotheses based on a superficial examination of the graph.

H1: In the UK, deaths in the age group 45-64 years of age are several times higher than deaths in the age group 15-44 years of age.

H2: The increase in deaths described in H1 results from a larger population in the age group 45-64 than in the age group 15-44 years of age.

However, it's now time to tighten up hypothesis H1. What we can say is that, over the eight weeks covered by the graph, mortality in the 45-64 year age group is several times higher than in the 15-44 year age group. In this first phase we have used deduction to infer a disproportionate increase in mortality between the two groups in question. If we move away from this and generalise to the whole year, however, we are going beyond the data. This is a process of induction. We need induction in order to make predictions and to have facts on which we can base practical decisions. Without it, we would be limited to making a very specific comment about a cross-section in time. The data can instead be used as a framework for looking ahead, although this will have significant limitations. When actual data differ significantly from these predictions, that in turn tells us something useful which can be explored further.

In order to tighten up the hypothesis H1 we need to have a look at the raw data that was used in the above graph. The data applies to eight consecutive weeks ending 10th August 2012 through to 28th September 2012.

Age 15-44: Mortality figures for eight consecutive weeks: 249, 268, 283, 243, 243, 259, 319 and 246

Age 45-64: Mortality figures for eight consecutive weeks: 1,124, 1,108, 1,144, 952, 1,164, 1,106, 1,173 and 1,094

We can now summarise this data using some simple statistics: the mean, median and mode, which describe the central tendency of the data. The videos below explain these concepts in a bit more detail.

Video Explaining the Calculation of the Mean

Video Explaining the Mode and Mean

To calculate the mean of the data we use the formula

Mean = ∑ x/n

where x is each value in the dataset and n is the number of values in the dataset. Turning first to the 45-64 group

Mean = (1,124+1,108+1,144+952+1,164+1,106+1,173+1,094)/8 = 8,865/8 = 1,108.125 ≈ 1,108 (rounded to the nearest integer)
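The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are my own, and the values are typed in by hand from the weekly figures listed earlier.

```python
# Weekly deaths, ages 45-64, for the eight weeks ending
# 10 August to 28 September 2012 (copied from the article).
deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

# Mean = (sum of x) / n
total = sum(deaths_45_64)
n = len(deaths_45_64)
mean_45_64 = total / n

print(total, n, mean_45_64, round(mean_45_64))
```

Running it prints the total, the number of weeks, the exact mean and the mean rounded to the nearest integer.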

The mode is the most frequently occurring value. When we look at the dataset for the age group 45-64 we see that no value occurs more than once. In this case it makes sense to group the values into ranges of 100 and look for the range that contains the most values. The ranges appear as follows

900-1000: 1

1000-1100: 1

1100-1200: 6

Thus the modal range here will be 1100-1200. This is also consistent with the mean that has just been calculated.
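The grouping into ranges can be sketched in Python as follows. The helper name `binned_counts` is hypothetical, and I am assuming that each range includes its lower bound but not its upper bound (so a value of exactly 1,000 would fall in 1000-1100), which the text does not specify.

```python
from collections import Counter

deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

def binned_counts(values, width):
    """Count values per range [lower, lower + width).

    Assumption: ranges are closed at the lower bound and open at the upper.
    """
    return Counter((v // width) * width for v in values)

counts = binned_counts(deaths_45_64, 100)
for lower in sorted(counts):
    print(f"{lower}-{lower + 100}: {counts[lower]}")

# The modal range is the range holding the most values.
modal_lower, modal_count = counts.most_common(1)[0]
print(f"modal range: {modal_lower}-{modal_lower + 100}")
```

Changing the `width` argument shows how sensitive the modal range is to the choice of range size.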

Finally we turn to the median. In a dataset this is considered the middle value and if there are an even number of values then this is the average of the two middle values. Arranging the data from smallest to largest value we have

952, 1,094, 1,106, 1,108, 1,124, 1,144, 1,164, 1,173

The two middle values (values 4 and 5) are 1,108 and 1,124. The average of these two values is

(1,108+1,124)/2 = 2,232/2 = 1,116
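As a quick check, the same median can be obtained by sorting the values and averaging the two middle ones. The sketch below does this by hand and compares it with `statistics.median` from the Python standard library; the variable names are my own.

```python
from statistics import median

deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

ordered = sorted(deaths_45_64)
n = len(ordered)

# With an even number of values, take the average of the two middle values
# (positions 4 and 5 when counting from 1).
lower_middle = ordered[n // 2 - 1]
upper_middle = ordered[n // 2]
median_by_hand = (lower_middle + upper_middle) / 2

print(ordered)
print(lower_middle, upper_middle, median_by_hand, median(deaths_45_64))
```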

We can see a close match between the mean (1,108), the median (1,116) and the modal range (1100-1200). Eyeballing the graph suggests that there is no marked skewing of the data, which is consistent with these measures of central tendency. Therefore the mean can be considered representative of the data.

Turning now to the 15-44 age group the mean of the data is

Mean = (249 + 268 + 283 + 243 + 243 + 259 + 319 + 246)/8 = 2,110/8 = 263.75 ≈ 264 (rounded to the nearest integer)

The mode here is simply 243 as this number occurs twice. However if we want to look at the modal ranges again we can group them as follows

240-250: 4

250-260: 1

260-270: 1

270-280: 0

280-290: 1

290-300: 0

300-310: 0

310-320: 1

The modal range here is 240-250 which is consistent with the mode calculated above. Finally the median is the average of the two middle values. Ordering the values we get

243, 243, 246, 249, 259, 268, 283, 319

The average of the two middle values is

(249+259)/2 = 508/2 = 254
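For the 15-44 group, all three measures can be computed together with Python's standard `statistics` module. This is a sketch rather than part of the original analysis, and the variable names are my own.

```python
from statistics import mean, median, mode

# Weekly deaths, ages 15-44, for the same eight weeks (copied from the article).
deaths_15_44 = [249, 268, 283, 243, 243, 259, 319, 246]

mean_15_44 = mean(deaths_15_44)      # sum of the values divided by 8
median_15_44 = median(deaths_15_44)  # average of the two middle sorted values
mode_15_44 = mode(deaths_15_44)      # most frequently occurring value

print(mean_15_44, median_15_44, mode_15_44)
```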

Thus the mean, median and mode are 264, 254 and 243 (or 240-250 as a modal range) respectively. There is a discrepancy between the mean and the other two values. If we remove the outlier in the data (319) and repeat the calculation we get

mean = (249 + 268 + 283 + 243 + 243 + 259 + 246)/7 = 1,791/7 ≈ 255.86 ≈ 256 (rounded to the nearest integer)

If we repeat the other tests of central tendency on this new dataset we get

mode = 243

median = 249
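The effect of dropping the outlier can be reproduced with the sketch below. Treating the single largest value (319) as the outlier follows the text; the code applies no formal outlier test, and the variable names are my own.

```python
from statistics import mean, median, mode

deaths_15_44 = [249, 268, 283, 243, 243, 259, 319, 246]

# Assumption: the outlier is simply the largest weekly value (319).
outlier = max(deaths_15_44)
trimmed = [v for v in deaths_15_44 if v != outlier]

trimmed_mean = mean(trimmed)      # 1,791 / 7
trimmed_median = median(trimmed)  # middle (4th) of the 7 sorted values
trimmed_mode = mode(trimmed)      # 243 still occurs twice

print(outlier, round(trimmed_mean, 2), trimmed_median, trimmed_mode)
```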

There is still a discrepancy between the values, resulting from the preponderance of values in the range 240-250. I will return to the original values and accept the limitations (essentially there were 2 weeks when there were far fewer deaths and one week in which there were many more deaths in this age group than in other weeks). Now if we compare the means for the two age groups

Age group 15-44 mean value = 264

Age group 45-64 mean value = 1108

The ratio of these two values is

1108/264 ≈ 4.20 ≈ 4 (rounded to the nearest integer)
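The ratio can be checked in Python both from the rounded means used above and from the unrounded means. Because both groups cover the same eight weeks, the ratio of the unrounded means equals the ratio of the two totals (8,865/2,110). The variable names in this sketch are my own.

```python
deaths_15_44 = [249, 268, 283, 243, 243, 259, 319, 246]
deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

mean_15_44 = sum(deaths_15_44) / len(deaths_15_44)
mean_45_64 = sum(deaths_45_64) / len(deaths_45_64)

# Ratio using the means rounded to the nearest integer, as in the text.
ratio_from_rounded = round(mean_45_64) / round(mean_15_44)

# Ratio using the unrounded means.
ratio_unrounded = mean_45_64 / mean_15_44

print(round(ratio_from_rounded, 2), round(ratio_unrounded, 2), round(ratio_unrounded))
```

Either way the ratio comes out at roughly 4.2, so rounding the means first does not change the conclusion.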

We can now revise hypothesis H1

H1: In the UK, deaths in the age group 45-64 years of age are approximately 4 times higher than deaths in the age group 15-44 years of age.

Next we will need to see if these differences are significant. For this we will need to look at the population data.

(to be continued)

Appendix

Doing Science Using Open Data – Part 1