Doing Science Using Open Data – Part 6: Modelling Populations

In this 6th part of the series on using open data for science I’ve taken a slight diversion to look at populations and the issue of sampling. This was prompted by a look at the UK mid-2011 Census data shown in the graph below.

Figure 1: Summation of male and female figures for each age from mid-2011 Census. Red bars represent the age group 45-65 and the blue bars represent the age group 16-44

What we’re going to do is look at the UK population and build a mathematical model for the populations we’ve looked at in the previous posts. Just to recap, there are a number of statistical methods for comparing two populations, and the choice between them depends on the characteristics of those populations. A normally distributed population can be defined by its mean and standard deviation. As discussed in previous posts, the populations in this post, which come from the mid-2011 census data, are not normally distributed. The first segment, aged 16-44, is a somewhat homogeneous group, whilst the group aged 45-65 has a right-skewed distribution; that is, the numbers for each year of age get progressively smaller.

In the third part in this series I included some of the data from the mid-2011 Census which I will reproduce here to support the subsequent discussion. Summing the male and female figures we get the following results for ages 16 through to 44


For ages 45-65 we get the following results


The total estimated population in England and Wales in Mid-2011 for the age group 16-44 is


and for the age group 45-65 is


Let us turn first to the population aged 45-65. This population begins with 832,727 people aged 45 and decreases to 583,734 at age 65. Recall that the x-axis represents age and the y-axis is the number of people at each age. The population can be approximately described by a straight line with a negative slope.

If we’re going to model this we need to understand the relationship between x and y. As x increases y decreases, and the simplest relationship of that kind is y = -x. Looking at the graph above this doesn’t seem intuitive, because none of the y values are negative. The line y = -x passes through (0,0) and becomes negative as x increases. The reason this doesn’t happen in the above graph is that the line is translated in a positive direction along the y-axis by a constant c. So in other words (I will take out the negative sign at this stage as it will be dealt with by the coefficient a introduced below)

y =  x + c

In addition to this, rather than a straight line with a unit gradient (i.e. for every unit increase along the x-axis there is a unit increase along the y-axis), the line has a gradient which we have yet to determine. For the sake of convenience I will call this gradient a, which gives

y =  a x + c
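The roles of the two coefficients can be illustrated with a small Python sketch. The function name `line` and the coefficient values below are purely illustrative assumptions, not fitted to the census data; the point is only that a negative gradient a makes y fall as age rises, while a positive c lifts the whole line so that the y values stay positive over the age range.

```python
def line(x, a, c):
    """Evaluate the straight line y = a*x + c at x."""
    return a * x + c


ages = [45, 50, 55, 60, 65]

# Illustrative (a, c) pairs only: they are NOT fitted to the census figures.
examples = [
    (-1, 0),    # y = -x: negative for every positive age
    (-1, 100),  # same gradient, translated up the y-axis by c = 100
    (-2, 200),  # steeper gradient with a larger translation
]

for a, c in examples:
    values = [line(x, a, c) for x in ages]
    print(f"a = {a}, c = {c}: {values}")
```

With c = 0 every value is negative, whereas the translated lines stay positive across ages 45 to 65, which is the behaviour seen in the graph.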

There is a simple introduction to lines and slopes below.

Our job now is to find the values of the two constants a and c. This is going to be an approximation. Turning first to people aged 45

y = a x + c

832,727 = 45 a + c

and for the age 65

583,734 = 65 a + c

We have two equations to solve and two sets of values with which to do this. Since

832,727 = 45 a + c

45 a = 832,727 - c

a = (832,727 - c)/45

Now from the original equations we know that

583,734 = 65 a + c

and therefore substituting

a = (832,727 - c)/45

we get

583,734 = (65/45)(832,727 - c) + c

Multiplying out we get

583,734 = (1 - 65/45)c + 1,202,827.89

-619,093.89 = -(20/45)c

c = 1,392,961.25

Substituting back into the original equation

583,734 = 65 a + 1,392,961.25

Rearranging we get

(583,734 - 1,392,961.25)/65 = a

a = -12,449.65

Substituting the values for a and c into the original equations above, the reader will see that these values solve the equations. Rounding to the nearest whole number we arrive at the following equation

y = -12,450 x + 1,392,961

This equation approximately describes the UK mid-2011 Census data for the age group 45-65 where y is the total population for each age and x is the age in years within the given range.
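The substitution steps above amount to finding the straight line through two points. A minimal Python sketch of that calculation follows, assuming the endpoints are (45, 832,727) and (65, 583,734), i.e. the figures quoted in the text for ages 45 and 65. The names `line_through` and `predict` are illustrative and do not come from the original analysis.

```python
def line_through(x1, y1, x2, y2):
    """Return (a, c) for the line y = a*x + c passing through two points."""
    if x1 == x2:
        raise ValueError("the two x values must differ")
    a = (y2 - y1) / (x2 - x1)  # gradient: change in y per unit change in x
    c = y1 - a * x1            # intercept: where the line crosses x = 0
    return a, c


# Endpoints quoted in the text (assumption: x is the age in years).
a, c = line_through(45, 832_727, 65, 583_734)
print(f"a = {a:.2f}, c = {c:.2f}")


def predict(age):
    """Approximate number of people of a given age under the linear model."""
    return a * age + c


# The line reproduces the two endpoints by construction; values for ages
# in between are only a linear approximation to the census figures.
for age in (45, 55, 65):
    print(age, round(predict(age)))
```

Because only the two endpoints are used, the line is exact at ages 45 and 65 and is an approximation everywhere else. A least-squares fit over all 21 ages would be a natural alternative, but that is a different technique from the two-point solution worked through here.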


Doing Science Using Open Data – Part 1

Doing Science Using Open Data – Part 2

Doing Science Using Open Data – Part 3

Doing Science Using Open Data – Part 4

Doing Science Using Open Data – Part 5

