Doing Science Using Open Data – Part 3: Census Data and Open Software

In the second part of this series, I examined UK mortality data to generate hypothesis H1:

H1: In the UK, deaths in the age group 45-64 years of age are 4 times higher than deaths in the age group 15-44 years of age.

In order to test this hypothesis further we need to learn a bit more about the population as a whole. The finding in the previous post in this series was based on only eight weeks’ worth of data. There are various reasons why this may be a transient finding. There may be seasonal variation in the figures, or this cohort may differ considerably from the age-equivalent cohort in one year’s time.

Before investigating this further I will return to the issue of how to analyse the data. In the first part of this series I referenced Microsoft Excel. I’ve found this to be very useful, but some readers may not have access to it. There is an open source alternative: Open Office Calc. Apache ‘Open Office’ is described as ‘The Free and Open Productivity Suite’. To get started with the Open Office alternative to Excel, follow these instructions:

1. Go to the Apache Open Office Download Page

2. Download the Apache Open Office package (versions are available for several operating systems)

3. Install the package

4. Start up Apache Open Office Calc

If you’re familiar with other spreadsheets then it shouldn’t be too difficult to get started. There is a drop-down menu for help. At the time of writing I’m using Apache Open Office 3.4.1 and will use Calc for the remainder of this post.

Returning to hypothesis 1 above, we need to find out a bit more about the general population. Fortunately there is detailed Census data available. We’re going to use the Mid 2011 Census results. To do this:

1. Go to the Office for National Statistics Census 2011 page

2. Go to the page for population estimates for England and Wales for 2011

3. Download the Excel file for the ‘Annual Mid-Year Population Estimates for England and Wales, Mid 2011’

4. Open this with Apache Open Office Calc
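For readers who prefer a script to a spreadsheet, the same data can also be read programmatically. The sketch below is a minimal example in Python, assuming the relevant sheet has first been saved from Calc as a CSV file with one row per single year of age and columns named Age, Males and Females. Those column names and the function name are assumptions made for illustration; the actual ONS workbook may be laid out differently.

```python
import csv
import io


def read_population_csv(text):
    """Parse CSV text into a list of (age, males, females) tuples.

    Assumes hypothetical column headers 'Age', 'Males' and 'Females';
    adjust these to match the sheet actually exported from Calc.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        age = record["Age"].strip()
        if not age.isdigit():
            # Skip rows that are not a single year of age (e.g. totals).
            continue
        # Figures may be written with thousands separators, e.g. '680,979'.
        males = int(record["Males"].replace(",", ""))
        females = int(record["Females"].replace(",", ""))
        rows.append((int(age), males, females))
    return rows
```

Passing the contents of the exported file to `read_population_csv` would return one tuple per year of age, ready for summing or plotting.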

The results are just for England and Wales. The Scotland 2011 Census results are due out in December 2012 and will be published in 5 year age groups. The Northern Ireland 2011 Census results are available here. Looking at the data for England and Wales, there is a cut-off at age 89 and further data above this age is due to be published. Plotting the male and female population figures against age for all ages (using an X-Y Scatter chart) gives the following result.

[Figure: X-Y scatter of male and female population by single year of age, England and Wales, Mid 2011]

A cursory examination of the graph reveals that there are more males than females for every age under 25. Once we reach the mid-forties this is reversed. Indeed there is an increasing excess of women over men from the mid-seventies onwards. This may be consistent with numerous studies showing increased life expectancy for women, although we would need more information to draw conclusions in this regard. We can also see that the population for each group peaks in the mid-forties. This is relevant to hypothesis H1. It suggests a second hypothesis, H2: the increase in mortality in moving from age group 16-44 to 45-65 may be accounted for by a larger population in the latter group.
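The observations read off the graph can also be checked numerically. The following Python sketch assumes the census figures are held as a list of (age, males, females) tuples, a hypothetical layout chosen for illustration. It reports the ages at which males outnumber females and the age with the largest combined population.

```python
def ages_with_male_excess(rows):
    """Return the ages at which the male count exceeds the female count."""
    return [age for age, males, females in rows if males > females]


def peak_age(rows):
    """Return the age with the largest combined (male + female) population."""
    age, males, females = max(rows, key=lambda row: row[1] + row[2])
    return age
```

Applied to the full England and Wales table, these two functions would show whether the male excess holds for every age under 25 and where the combined population peaks.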

We can test this hypothesis for the England and Wales population directly. Returning to the census data and summing the male and female figures, we get the following results for each year of age from 16 through to 44:

680,979
706,234
711,491
741,667
765,895
757,901
757,295
771,297
756,449
768,415
774,921
759,889
768,860
770,810
778,986
782,510
751,251
700,825
690,775
702,024
716,419
729,013
761,347
794,300
820,805
800,550
821,037
819,650
832,297

For ages 45-65 we get the following results:

832,727
838,064
831,041
813,798
797,077
770,066
739,859
723,861
708,371
682,824
659,795
637,073
641,145
634,399
618,132
623,508
638,118
655,668
694,644
754,834
583,734

The total estimated population in England and Wales in Mid-2011 for the age group 16-44 is

21,993,892

and for the age group 45-65 is

14,878,738

Hypothesis 2 states that the increase in mortality in moving from the first to the second age group might be accounted for by a larger population in the second group. However, the data above for England and Wales shows that the overall population falls in moving from the first to the second group. Indeed, the second group is only 68% of the size of the first. Nevertheless, the data is incomplete, as the mortality data applies to the UK and the census figures apply only to England and Wales. When the other census data becomes available it will be possible to revisit hypothesis 2 and test it more convincingly.

Using the above data, what implications are there for hypothesis 1? Suppose the findings from other parts of the UK are consistent with the England and Wales census data. This would imply that on moving from the age group 16-44 to 45-65 the mortality per 100,000 would increase 4 x 1/0.68 = 5.9 fold (2 sf).
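The arithmetic in this section can be reproduced with a short script. The sketch below is written in Python, which is an assumption about tooling since the post itself uses Calc. It recomputes both totals from the single-year figures listed above, takes their ratio, and then scales the four-fold difference in deaths from H1 by that ratio. The variable and function names are illustrative only.

```python
# Single-year population figures (males + females) for England and Wales,
# Mid 2011, copied from the two lists above.
AGES_16_TO_44 = [
    680979, 706234, 711491, 741667, 765895, 757901, 757295, 771297,
    756449, 768415, 774921, 759889, 768860, 770810, 778986, 782510,
    751251, 700825, 690775, 702024, 716419, 729013, 761347, 794300,
    820805, 800550, 821037, 819650, 832297,
]
AGES_45_TO_65 = [
    832727, 838064, 831041, 813798, 797077, 770066, 739859, 723861,
    708371, 682824, 659795, 637073, 641145, 634399, 618132, 623508,
    638118, 655668, 694644, 754834, 583734,
]

DEATH_RATIO_H1 = 4  # H1: deaths in the older group are 4 times higher


def rate_ratio(death_ratio, older_population, younger_population):
    """Convert a ratio of death counts into a ratio of deaths per head.

    Dividing by the relative population size adjusts for the older
    group being a different size from the younger group.
    """
    return death_ratio / (older_population / younger_population)


younger_total = sum(AGES_16_TO_44)
older_total = sum(AGES_45_TO_65)
size_ratio = older_total / younger_total
implied = rate_ratio(DEATH_RATIO_H1, older_total, younger_total)

print(f"Ages 16-44: {younger_total:,}")
print(f"Ages 45-65: {older_total:,}")
print(f"Older group as a share of the younger group: {size_ratio:.0%}")
print(f"Implied increase in mortality rate: {implied:.1f} fold")
```

Because each total is recomputed from the listed figures, the script also acts as a check on the hand-summed totals. The same calculation can be done in Calc by applying the SUM function to each column.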

However we can start to perform other statistical tests on the data.

Appendix

Doing Science Using Open Data – Part 1

