# Doing Science Using Open Data – Part 4: Is the UK Population Normally Distributed According to Age? (No)

In the first three parts of this series, I looked at freely available UK mortality data in conjunction with data from the UK census. I generated two hypotheses:

H1: In the UK, deaths in the age group 45-64 years of age are 4 times higher than deaths in the age group 15-44 years of age.

H2: The increase in deaths described in H1 results from a larger population in the age group 45-64 than in the age group 15-44 years of age.

H1 was generated from UK data and wasn’t tested any further. H2, however, was partially tested (the dataset was incomplete) and appeared to be incorrect. Intuitively it seems quite obvious that H2 is false, but a more convincing result would be obtained from statistical testing.

For ages 16-44 we got the following data (one population figure per year of age):

680,979
706,234
711,491
741,667
765,895
757,901
757,295
771,297
756,449
768,415
774,921
759,889
768,860
770,810
778,986
782,510
751,251
700,825
690,775
702,024
716,419
729,013
761,347
794,300
820,805
800,550
821,037
819,650
832,297

For ages 45-65 we got the following data (again, one population figure per year of age):

832,727
838,064
831,041
813,798
797,077
770,066
739,859
723,861
708,371
682,824
659,795
637,073
641,145
634,399
618,132
623,508
638,118
655,668
694,644
754,834
583,734
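Since H2 is a claim about the relative size of the two groups, the figures listed above can simply be added up. The following is a minimal Python sketch of that arithmetic, not a statistical test. The variable and function names are my own, and note that the lists cover ages 16-44 and 45-65, which differ by a year at either end from the 15-44 and 45-64 bands used in the hypotheses.

```python
# Population per single year of age, copied from the two lists above.
pop_16_44 = [
    680_979, 706_234, 711_491, 741_667, 765_895, 757_901, 757_295, 771_297,
    756_449, 768_415, 774_921, 759_889, 768_860, 770_810, 778_986, 782_510,
    751_251, 700_825, 690_775, 702_024, 716_419, 729_013, 761_347, 794_300,
    820_805, 800_550, 821_037, 819_650, 832_297,
]
pop_45_65 = [
    832_727, 838_064, 831_041, 813_798, 797_077, 770_066, 739_859, 723_861,
    708_371, 682_824, 659_795, 637_073, 641_145, 634_399, 618_132, 623_508,
    638_118, 655_668, 694_644, 754_834, 583_734,
]


def summarise(label, counts):
    """Print the number of ages, the total and the mean per age; return the total."""
    total = sum(counts)
    mean = total / len(counts)
    print(f"{label}: {len(counts)} ages, total {total:,}, mean per age {mean:,.0f}")
    return total


total_younger = summarise("16-44", pop_16_44)
total_older = summarise("45-65", pop_45_65)

# H2 would need this ratio to be well above 1 (around 4 to explain H1 outright).
print(f"older / younger population ratio: {total_older / total_younger:.2f}")
```

On these figures the older group is the smaller of the two, both in total and per year of age, which is the opposite of what H2 requires.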

In comparing these populations, it is useful to get an understanding of what they look like. I’ve graphed the two populations below. The red bars show the population in the age group 45-64 with increasing years (i.e. 45, 46, etc.). The blue bars show the age group 16-44, again with increasing years for successive bars.

In comparing the two populations we usually make assumptions about the populations. The most commonly discussed population distribution in statistics is the normal distribution.

A selection of Normal Distribution Probability Density Functions (PDFs), Author InductiveLoad, Public Domain

Clearly, when we are looking at increasing age, these two populations are not normally distributed. If they were, there would be a central peak with tapering on either side. Eyeballing the data reveals a fairly even spread across the age group 16-44, whilst the 45-65 age group is weighted slightly towards its younger (left-hand) end. However, if we look at the original population data from the census we get the following

This graph was discussed briefly in the previous post. What is clear is that this is not a normal distribution. What is even more interesting is that this is not even a sample: it is the population as recorded by the Census. In other words, the UK population is not normally distributed according to age. The Census doesn’t yet contain the data for the over-90 age group, but there is a clear trend in the over-80 group even without this data.

What is also clear from the above is that the population is weighted towards the left of the graph, with a tail stretching out to the right (which, strictly speaking, statisticians call a right or positive skew). This is hardly surprising, as lifespan is finite and age is a risk factor for mortality. What is also interesting about this graph, though, is that it’s not the nice idealised skewed graph we might expect. Rather than beginning at a peak or trough, the graph begins at an intermediate level before passing through troughs and peaks, followed by a steady decline from around age 45. This decline is broken up by a sharp increase in the mid-sixties. The graph might almost be described as the superposition of several distinct graphs.
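The lopsidedness described above can be put into a single number, the skewness coefficient, rather than judged by eye. The full census series isn’t reproduced in this post, so the sketch below uses the 45-65 figures listed earlier and assumes they run in order from age 45 to age 65. The function name `weighted_skewness` is my own; it computes the standard third-moment (population) skewness, treating each population figure as the weight for its age. A positive value means the tail points towards older ages, and a value near zero means a roughly symmetric shape.

```python
# Population per single year of age for 45-65, copied from the list earlier in the post.
# Assumption: the values are in increasing age order, starting at 45.
pop_45_65 = [
    832_727, 838_064, 831_041, 813_798, 797_077, 770_066, 739_859, 723_861,
    708_371, 682_824, 659_795, 637_073, 641_145, 634_399, 618_132, 623_508,
    638_118, 655_668, 694_644, 754_834, 583_734,
]
ages_45_65 = list(range(45, 66))


def weighted_skewness(ages, counts):
    """Population skewness of age, with each age weighted by its head count."""
    total = sum(counts)
    mean = sum(a * c for a, c in zip(ages, counts)) / total
    # Second and third central moments of the age distribution.
    variance = sum(c * (a - mean) ** 2 for a, c in zip(ages, counts)) / total
    third = sum(c * (a - mean) ** 3 for a, c in zip(ages, counts)) / total
    return third / variance ** 1.5


print(f"skewness of age within 45-65: {weighted_skewness(ages_45_65, pop_45_65):.3f}")
```

A skewness coefficient on its own doesn’t establish or rule out normality, but a normal distribution would have a skewness of zero together with a single central peak, so it is a useful companion to the graph.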

Returning to our original question of comparing the two populations, we can see that they do not come from a normally distributed population. The first group (16-44) comes from a part of the graph which varies between 300,000 and 400,000 people per age (in years). The second group (45-65) comes from the part of the graph which begins to show the downward trend in population per age (in years). The spike in the mid-sixties complicates matters slightly. Nevertheless, we can see that they are not similar populations when we consider population as a function of age.

Appendix

Doing Science Using Open Data – Part 1

Doing Science Using Open Data – Part 2

Doing Science Using Open Data – Part 3