# Doing Science Using Open Data – Part 6: Modelling Populations

In this sixth part of the series on using open data for science I've taken a slight diversion to look at populations and the issue of sampling. This was prompted by a look at the UK mid-2011 Census data shown in the graph below.

Figure 1: Summation of male and female figures for each age from mid-2011 Census. Red bars represent the age group 45-65 and the blue bars represent the age group 16-44

What we're going to do is look at the UK population and build a mathematical model for the populations we've looked at in the previous posts. Just to recap, there are a number of statistical methods for comparing two populations, and the choice between them depends on the characteristics of the populations. A normally distributed population can be defined by its mean and standard deviation. As discussed in previous posts, the populations in this post, taken from the mid-2011 Census figures, are not normally distributed. The first segment, aged 16-44, is a somewhat homogeneous group, whilst the group aged 45-65 has a right-skewed distribution: the numbers for each year of age tend to get progressively smaller.

In the third part of this series I included some of the data from the mid-2011 Census, which I reproduce here to support the subsequent discussion. Summing the male and female figures we get the following results for ages 16 through to 44:

 680,979 706,234 711,491 741,667 765,895 757,901 757,295 771,297 756,449 768,415 774,921 759,889 768,860 770,810 778,986 782,510 751,251 700,825 690,775 702,024 716,419 729,013 761,347 794,300 820,805 800,550 821,037 819,650 832,297

For ages 45-65 we get the following results

 832,727 838,064 831,041 813,798 797,077 770,066 739,859 723,861 708,371 682,824 659,795 637,073 641,145 634,399 618,132 623,508 638,118 655,668 694,644 754,834 583,734

The total estimated population in England and Wales in mid-2011 for the age group 16-44 is

 21,993,892

and for the age group 45-65 is

 14,878,738
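As a quick cross-check, the per-age figures quoted above can be summed in a few lines of Python. This is only a minimal sketch: the list and function names are my own, and the numbers are simply copied from the two lists in this post.

```python
# Per-age totals (male + female) copied from the figures quoted in this post.
ages_16_44 = [
    680979, 706234, 711491, 741667, 765895, 757901, 757295, 771297,
    756449, 768415, 774921, 759889, 768860, 770810, 778986, 782510,
    751251, 700825, 690775, 702024, 716419, 729013, 761347, 794300,
    820805, 800550, 821037, 819650, 832297,
]
ages_45_65 = [
    832727, 838064, 831041, 813798, 797077, 770066, 739859, 723861,
    708371, 682824, 659795, 637073, 641145, 634399, 618132, 623508,
    638118, 655668, 694644, 754834, 583734,
]


def group_total(counts):
    """Return the total population across a list of per-age counts."""
    return sum(counts)


total_16_44 = group_total(ages_16_44)
total_45_65 = group_total(ages_45_65)

# The number of entries should match the number of ages in each range.
print(len(ages_16_44), "ages, total", f"{total_16_44:,}")
print(len(ages_45_65), "ages, total", f"{total_45_65:,}")
```

Checking the length of each list against the age range (29 ages for 16-44, 21 ages for 45-65) is a cheap way to catch a value that has slipped into the wrong group.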

We turn first to the population aged 45-65. This population begins with 832,727 people aged 45 and falls to 583,734 at age 65. Recall that the x-axis represents age and the y-axis is the number of people at each age. The population can be approximately described by a straight line with a negative slope. If we're going to model this we need to understand the relationship between x and y. As x increases y decreases, and the simplest relationship of this kind is y = -x. Looking at the graph above this doesn't seem intuitive: none of the y values are negative, whereas the line y = -x passes through (0,0) and becomes negative as x increases. The reason this doesn't happen in the graph is that the line is translated in a positive direction along the y-axis by a constant c. So in other words (I will leave out the negative sign at this stage, as it will be dealt with by the coefficient a introduced below)

y =  x + c

In addition to this, rather than a straight line with a unit gradient (i.e. for every unit increase along the x-axis there is a unit increase along the y-axis), the line has a gradient which we have yet to determine. Calling this gradient a, the equation becomes

y =  a x + c
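To make the roles of the two constants concrete, here is a small Python sketch. The function name `line` and the values chosen for a and c are arbitrary illustrations of my own, not figures from the census data.

```python
def line(x, a, c):
    """Value of the straight line y = a*x + c at the point x."""
    return a * x + c


ages = [0, 45, 65]

# y = -x: starts at the origin and is negative for every positive age.
plain = [line(x, -1, 0) for x in ages]

# Adding c = 100 translates the same line up the y-axis,
# so the values stay positive over this range.
shifted = [line(x, -1, 100) for x in ages]

# Changing a from -1 to -2 makes the line fall twice as fast.
steeper = [line(x, -2, 200) for x in ages]

print(plain)
print(shifted)
print(steeper)
```

The sign and size of a control how quickly the line falls, while c only moves the whole line up or down, which is why the census values can all be positive even though the slope is negative.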


Our job now is to find out what the two constants a and c are. This is going to be an approximation, since we use only the two end points of the age range. Turning first to people aged 45

y = a x + c

832,727 = 45 a + c

and for the age 65

583,734 = 65 a + c

We have two equations to solve and two unknowns. Rearranging the first equation

45 a = 832,727 - c

a = (832,727 - c)/45

Now from the original equations we know that

583,734 = 65 a + c

and therefore substituting

a = (832,727 - c)/45

we get

583,734 = (65/45)(832,727 - c) + c

Multiplying out, and noting that 65/45 = 13/9, we get

583,734 = (1 - 13/9) c + 1,202,827.89

-619,093.89 = -(4/9) c

c = 1,392,961.25

Substituting back into the original equation

583,734 = 65 a + 1,392,961.25

Rearranging we get

(583,734 - 1,392,961.25)/65 = a

a = -12,449.65

Substituting the values for a and c into the original equations above, the reader will see that these values solve the equations. Rounding both values to the nearest whole number we arrive at the following equation

y = -12,450 x + 1,392,961

This equation approximately describes the UK mid-2011 Census data for the age group 45-65 where y is the total population for each age and x is the age in years within the given range.
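The same pair of simultaneous equations can be solved in code, which also makes it easy to see how closely the line follows the figures in between. The sketch below assumes the two end points are taken as (45, 832,727) and (65, 583,734); the function and variable names are my own.

```python
def line_through(p1, p2):
    """Return gradient a and intercept c of the line y = a*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    a = (y2 - y1) / (x2 - x1)  # change in y per unit change in x
    c = y1 - a * x1            # rearranged from y1 = a*x1 + c
    return a, c


# End points assumed for the 45-65 group: (age, people at that age).
a, c = line_through((45, 832727), (65, 583734))
print(f"a = {a:.2f}, c = {c:.2f}")

# Per-age totals for ages 45 to 65, copied from the figures in this post.
counts_45_65 = [
    832727, 838064, 831041, 813798, 797077, 770066, 739859, 723861,
    708371, 682824, 659795, 637073, 641145, 634399, 618132, 623508,
    638118, 655668, 694644, 754834, 583734,
]

# Difference between the census figure and the line at each age.
gaps = {age: actual - (a * age + c)
        for age, actual in zip(range(45, 66), counts_45_65)}

# Age at which the line is furthest from the census figure.
worst_age = max(gaps, key=lambda age: abs(gaps[age]))
print(f"largest gap at age {worst_age}: {gaps[worst_age]:,.0f}")
```

A line fixed by two points passes exactly through those points and ignores every age in between, so the gaps give a rough sense of how much of an approximation this is.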

Appendix

Doing Science Using Open Data – Part 1

Doing Science Using Open Data – Part 2

Doing Science Using Open Data – Part 3

Doing Science Using Open Data – Part 4

Doing Science Using Open Data – Part 5
