Doing Science Using Open Data – Part 2

This is the second in a series looking at Open Science. In the first article (see Appendix) we looked at how to obtain open data on mortality figures. Graphing the data produced the figure below.

Graph of UK Mortality Data over 8 Consecutive Weeks for 7 Age Groups (Weeks ending 10th August 2012 to the 28th September 2012)

I generated two hypotheses based on a superficial examination of the graph.

H1: In the UK, deaths in the age group 45-64 years of age are several times higher than deaths in the age group 15-44 years of age.

H2: The increase in deaths described in H1 results from a larger population in the age group 45-64 than in the age group 15-44 years of age.

However, it's now time to tighten up hypothesis H1. What we can say is that, over these eight weeks, mortality in the 45-64 year age group was several times higher than in the 15-44 year age group. In this first phase we have used deduction to infer a disproportionate increase in mortality between the two groups in question. Nevertheless, if we move away from this and generalise to the whole year, we are going beyond the data. This is a process of induction. We need induction to make predictions and to have facts on which we can base practical decisions. If we did not use induction, we would be limited to making a very specific comment about a cross-section in time. Instead, this data can be used as a framework for looking ahead, although this will have significant limitations. When actual data differ significantly from these predictions, this in turn will tell us something useful which can be further explored.

In order to tighten up the hypothesis H1 we need to have a look at the raw data that was used in the above graph. The data applies to eight consecutive weeks ending 10th August 2012 through to 28th September 2012.

Age 15-44: Mortality figures for eight consecutive weeks: 249, 268, 283, 243, 243, 259, 319 and 246

Age 45-64: Mortality figures for eight consecutive weeks: 1,124, 1,108, 1,144, 952, 1,164, 1,106, 1,173 and 1,094

We can now summarise this data using some simple statistics: the mean, median and mode, which describe the central tendency of the data. The videos below explain these concepts in a bit more detail.

Video Explaining the Calculation of the Mean

Video Explaining the Mode and Mean

To calculate the mean of the data we use the formula

Mean = ∑ x/n

where x is each value in the dataset and n is the number of values in the dataset. Turning first to the 45-64 group

Mean = (1,124+1,108+1,144+952+1,164+1,106+1,173+1,094)/8 = 8,865/8 = 1,108.125 ≈ 1,108 (rounded down to the nearest integer)
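As a quick check of the arithmetic, the same mean can be computed in a few lines of Python. This is only a minimal sketch and not part of the original analysis; the variable names are my own, and the values are the eight weekly figures for the 45-64 group listed above.

```python
from statistics import mean

# Weekly deaths, age group 45-64, weeks ending 10 Aug to 28 Sep 2012
deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

total = sum(deaths_45_64)        # sum of all values (the numerator)
mean_45_64 = mean(deaths_45_64)  # total divided by the number of values

print(total, mean_45_64, round(mean_45_64))
```

Run as written, this should print 8865, 1108.125 and 1108, matching the hand calculation.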

The mode is the most frequently occurring value. When we look at the dataset for the age group 45-64 we see that no value occurs more than once. In this case it makes sense to group the values into ranges of 100. The number of values falling in each range is as follows

900-1000: 1

1000-1100: 1

1100-1200: 6

Thus the modal range here will be 1100-1200. This is also consistent with the mean that has just been calculated.
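The grouping into ranges can also be sketched in Python. The text does not say which range a value sitting exactly on a boundary (for example 1,100) belongs to, so I am assuming here that each range includes its lower bound and excludes its upper bound; the names `bin_width` and `bin_counts` are illustrative.

```python
from collections import Counter

deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]
bin_width = 100  # width of each range, as chosen in the text

# Map each value to the lower bound of its range, e.g. 952 -> 900
bin_counts = Counter((value // bin_width) * bin_width for value in deaths_45_64)

for lower in sorted(bin_counts):
    print(f"{lower}-{lower + bin_width}: {bin_counts[lower]}")

# The modal range is the range holding the most values
modal_lower, modal_count = bin_counts.most_common(1)[0]
print(f"Modal range: {modal_lower}-{modal_lower + bin_width}")
```

With these eight values no figure falls on a boundary, so the boundary assumption does not affect the counts.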

Finally we turn to the median. In a dataset this is the middle value when the values are arranged in order, and if there is an even number of values it is the average of the two middle values. Arranging the data from smallest to largest value we have

952, 1,094, 1,106, 1,108, 1,124, 1,144, 1,164, 1,173

The two middle values (values 4 and 5) are 1,108 and 1,124. The average of these two values is

(1,108+1,124)/2 = 2,232/2 = 1,116
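The median steps can be mirrored in code. The sketch below sorts the values, picks out the two middle ones by position and averages them, then compares the result with the `median` function from Python's standard `statistics` module. The variable names are my own.

```python
from statistics import median

deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

ordered = sorted(deaths_45_64)
n = len(ordered)

# With an even number of values, take the two middle ones
# (the 4th and 5th of eight, i.e. zero-based positions 3 and 4)
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
median_by_hand = sum(middle_pair) / 2

print(ordered)
print(middle_pair, median_by_hand, median(deaths_45_64))
```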

We can see a close match between the mean, median and mode. Eyeballing the graph suggests that there is no marked skewing of the data, which is consistent with these measures of central tendency. Therefore the mean can be considered representative of the data.

Turning now to the 15-44 age group the mean of the data is

Mean = (249 + 268 + 283 + 243 + 243 + 259 + 319 + 246)/8 = 2,110/8 = 263.75 ≈ 264 (rounding up to the nearest integer)

The mode here is simply 243 as this number occurs twice. However if we want to look at the modal ranges again we can group them as follows

240-250: 4

250-260: 1

260-270: 1

270-280: 0

280-290: 1

290-300: 0

300-310: 0

310-320: 1

The modal range here is 240-250 which is consistent with the mode calculated above. Finally the median is the average of the two middle values. Ordering the values we get

243, 243, 246, 249, 259, 268, 283, 319

The average of the two middle values is

(249+259)/2 = 508/2 = 254
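For the 15-44 group, all three measures can be reproduced at once with Python's standard `statistics` module. This is a minimal sketch with my own variable names; note that `mode` returns a single value, which is adequate here because 243 is the only repeated figure.

```python
from statistics import mean, median, mode

# Weekly deaths, age group 15-44, same eight weeks
deaths_15_44 = [249, 268, 283, 243, 243, 259, 319, 246]

mean_15_44 = mean(deaths_15_44)      # 2,110 / 8
median_15_44 = median(deaths_15_44)  # average of the two middle values
mode_15_44 = mode(deaths_15_44)      # most frequently occurring value

print(mean_15_44, median_15_44, mode_15_44)
```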

Thus the mean, median and mode are 264, 254 and 243 (or 240-250 as a modal range) respectively. There is a discrepancy between the mean and the other two values. If we remove the outlier in the data (319) and repeat the calculation we get

mean = (249 + 268 + 283 + 243 + 243 + 259 + 246)/7 = 1,791/7 = 255.86 ≈ 256 (rounded up to the nearest integer)

If we repeat the other tests of central tendency on this new dataset we get

mode = 243

median = 249
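The effect of dropping the outlier can be checked with a short sketch. Here I simply treat the largest value as the outlier, which is an assumption that matches the figure picked out by eye above (319) and is not a formal outlier test.

```python
from statistics import mean, median, mode

deaths_15_44 = [249, 268, 283, 243, 243, 259, 319, 246]

# Assumption: the outlier is the single largest weekly figure (319)
outlier = max(deaths_15_44)
trimmed = [value for value in deaths_15_44 if value != outlier]

trimmed_mean = mean(trimmed)      # 1,791 / 7
trimmed_median = median(trimmed)  # middle (4th) of the seven ordered values
trimmed_mode = mode(trimmed)      # 243 still occurs twice

print(round(trimmed_mean, 2), trimmed_median, trimmed_mode)
```

Removing one value leaves an odd number of figures, so the median is now a single middle value and no averaging is needed.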

There is still a discrepancy between the values, resulting from the preponderance of values in the range 240-250. I will return to the original values and accept the limitations (essentially there were 2 weeks when there were far fewer deaths and one week in which there were many more deaths in this age group than in other weeks). Now if we compare the means for the two age groups

Age group 15-44 mean value = 264

Age group 45-64 mean value = 1108

The ratio of these two values is

1108/264 ≈ 4.20 ≈ 4 (rounded down to the nearest integer)
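To see whether rounding the means first makes any difference to the ratio, both versions can be computed side by side. This is a sketch with illustrative variable names, using the weekly figures listed earlier.

```python
from statistics import mean

deaths_15_44 = [249, 268, 283, 243, 243, 259, 319, 246]
deaths_45_64 = [1124, 1108, 1144, 952, 1164, 1106, 1173, 1094]

# Ratio of the unrounded means (1108.125 / 263.75)
ratio_unrounded = mean(deaths_45_64) / mean(deaths_15_44)

# Ratio of the means after rounding each to the nearest integer (1108 / 264)
ratio_rounded_means = round(mean(deaths_45_64)) / round(mean(deaths_15_44))

print(round(ratio_unrounded, 2), round(ratio_rounded_means, 2))
```

Either way the ratio comes out at roughly 4.2, so rounding the means before dividing does not change the figure of about 4 used in the revised hypothesis.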

We can now revise hypothesis H1

H1: In the UK, deaths in the age group 45-64 years of age are 4 times higher than deaths in the age group 15-44 years of age.

Next we will need to see if these differences are significant. For this we will need to look at the population data.

(to be continued)

Appendix

Doing Science Using Open Data – Part 1

Responses: If you have any comments, you can e-mail justinmarley17@yahoo.co.uk. Disclaimer: The comments made here represent the opinions of the author and do not represent the profession or any body/organisation. The comments made here are not meant as a source of medical advice and those seeking medical advice are advised to consult with their own doctor. The author is not responsible for the contents of any external sites that are linked to in this blog.
