New opportunities for geographical census analysis using individual level data

S Openshaw and I Turton

Centre for Computational Geography, School of Geography,

University of Leeds, Leeds LS2 9JT

Summary

The availability of samples of individual census data from the 1991 census significantly broadens the usefulness of census data. The paper describes how to easily access and use the SAR data via specialist software. It illustrates some of the new types of census analysis that are now possible by applying more traditional social survey approaches to spatial micro census data.

What is so new about the 1991 Census?

The 1991 census offers a number of innovations that are geographically relevant. Some of these are discussed in Cole (1994) and mainly relate to the much easier access to data for small areas and the provision of a number of important new variables. The availability for purchase of census enumeration district boundaries in digital form is also a most significant advance. However, perhaps equally exciting is the provision of a Sample of Anonymised Records (SAR). This provides micro-level details for a random sample of 2% of individuals and 1% of households who completed a 1991 census form; see Dale and Marsh (1993). The availability of micro census data for Britain and Northern Ireland represents a major innovation in census outputs. It opens up the census data resource to a much broader group of research applications outside the traditional area of quantitative spatial analysis and GIS. Indeed, this is important because, viewed purely as a source of detailed spatial information, the SAR represents a step back from the detailed small area and postcoded world of the 1990s to the pre-1951 census era, when census outputs were only available for large administrative areas. The excuse then was lack of demand and probably also fundamental difficulties in data management, whilst with the SARs the coarseness of the geographic coding is designed to be a confidentiality preservation device, albeit of unknown efficiency (Openshaw, 1994). The purpose of this paper is to stimulate interest in, and raise awareness of, the geographical research potential of the SARs within geography, and also to demonstrate that the SAR is the easiest of all census datasets to access and analyse via specialist software.

Geographical aspects of the SARs

A prerequisite in seeking geographical uses of the SAR data is an interest in individual persons or households either for the entire UK (including Northern Ireland) or at a macrogeographic scale. The SAR geography is restricted to areal units composed of large districts (with a minimum population size of 100,000); see Table 1. Nevertheless, this yields a multi-level survey dataset for Britain considerably larger and more representative than any other widely available social survey. The individual data file consists of 1,116,181 records, and the household data file of some 215,761 household records, although they have varying levels of geographic resolution. Dale and Marsh (1993) and Middleton (1995) provide a further discussion of the geographical codes used by the SAR samples. Particularly serious is the lack of fine geographic resolution. The household data, which contain details of household structure, are not available for geographic codes below the region scale. This reflects confidentiality concerns but clearly precludes some important geographical uses of the data; for example, the investigation of aggregation effects on the severity of the ecological fallacy problem in the small area census statistics (Openshaw, 1984). None of the geography codes in Table 1 are small area codes and this is to be regretted, although it is understandable. Nevertheless, it does not damage one of the principal arguments in the needs case for the provision of basic SAR data. Marsh et al (1988) claimed that individual level data were needed to help negate some of the ecological fallacy problems inherent in the analysis of aggregate small area statistics. At least it is now possible to obtain reasonably good sample estimates of relationships at the individual level from the SAR samples, which can subsequently be compared with Small Area Statistics (SAS) and Local Base Statistics (LBS) results. This is no small achievement and marks an advance in census data provision even if it is still far less than what might be considered desirable.

Table 1 SAR geography


Census geography          Number    Individual data    Household data
Country                       4          yes                yes
Region (1)                   12          yes                yes
County                       64          yes                no
District                    468          yes                no
Ward/ED/postcode             no          no                 no
TV region (2)                12          yes                no

Notes:


1: North, Yorks and Humberside, East Midlands, East Anglia, Inner London, Outer London, Rest of S.East, South West, West Midlands, North West, Wales, Scotland, NI

2: a recoded derived variable


However, if the SAR is not ideal from a spatial analytical geographical perspective, it is important not to undervalue what actually exists. It provides geographers with an opportunity to design their own census tabulations and to control, to some extent, the geographical coding considered most relevant. If the table you want is not provided in the SAS or LBS then you can probably create it yourself from the SAR data. In theory this should greatly improve understanding of many social and geographic phenomena, add value to the census as a data resource relevant to many areas of human geography and social science outside the world of GIS, and allow better linkage with other survey data sources. The SAR data are the only census outputs that can be treated both as a survey and as a spatial dataset. Viewed from this perspective there can be little doubt that the SAR significantly broadens the range of research interests that can be served by the census data resource (Openshaw, 1995).

One of the benefits of only having macro geographic coding is that sample sizes of only 1% and 2% are quite adequate for disaggregate analysis at these scales. If census ward or enumeration district levels of geographic coding had been used then sample sizes of 10 to 20% (or more) might well have been needed to yield useful amounts of data for reliable analysis at this finer geographic scale. Table 2 shows the total numbers of records available at each level of geography. This should allow reasonable levels of table disaggregation. Furthermore, the simple random nature of the SAR data samples makes the subsequent task of statistical analysis much easier than if a more complex sampling strategy had been employed.

Table 2 Size of SAR samples by geographic scale


geography           person data              household data                                        
                min            max            min            max                                    
country        57579          955882         11020          184740         
region         41560          215305          8052           41043         
county          2421           83690            -              -              
district        2327           19212            -              -              

Easy access to the SAR data

Another general distinguishing feature of 1991 census analysis is the provision of specialist software to ease considerably the traditionally hard tasks of accessing the census data. Software such as SASPAC can either be run on a local computer or accessed via a national service at the Manchester Computing Centre (MCC). This and related software have revolutionised access to all the traditional census outputs: LBS, SAS, Special Migration Statistics (SMS) and Special Workplace Statistics (SWS). The SARs too can be accessed via specially written software systems. An ESRC project at the School of Geography in Leeds has created an easy to use system for access, tabulation, and some exploratory analysis of the SAR data (Turton and Openshaw, 1995). The system, which carries the slightly misleading acronym USAR (Unix SAR), provides a portable Unix software and data package that can either be accessed at MCC or installed on local Unix hosts. It needs 68 MB of disk space to store the SAR data files. It is available free of charge to encourage usage of the SAR and it can be obtained by anonymous file transfer (ftp); see the note at the end for details. A user manual is also available in electronic and paper form (Turton and Openshaw, 1994). The Centre for Computational Geography at Leeds will support the USAR software at least until 1997.

The USAR system offers the following advantages:

1. it is easy to use with a very short learning curve; most non-expert computer users (e.g. undergraduates) can generate useful results very soon after starting,

2. it is fast and efficient,

3. it provides a small number of advanced exploratory analysis tools not found in statistical packages, specially developed to help the analysis of the SARs,

4. it offers an easy way of creating files for input into other software systems,

5. it has a built-in library of standard recodes allowing the automatic recreation of many of the SAS and LBS tables as well as 47 other derived variables; and

6. it provides SAR data for Northern Ireland as well as the rest of the UK.

More recently, USAR has been ported to the PC and a version for MS-DOS exists. This can be used on a single PC or on a fileserver network. It also requires 68 MB of free disk space, and at least 4 MB of PC memory. On a 33 MHz 486 PC, run times are about 10 times longer than on a UNIX workstation. Nevertheless, typical table times of 5 minutes are acceptable for slow PC hardware given a database of 68 MB. The MS-DOS version of USAR is identical in all respects to the UNIX version and is also available free of charge.

Some geographical illustrations

The principal attraction of the SAR data is the flexibility it gives geographers in defining their own census outputs. You are no longer restricted to using what the OPCS/GRO(S) provide. This removes most of the need for special tabulations. An example is LBS Table L37, which reports age and sex only for lone parents aged 16-24. Suppose a tabulation with a larger number of age groups is required. With USAR this task took less than 5 minutes with the PC version and 30 seconds with the UNIX version; see Table 3. The SAR analysis can then be repeated for individual cities; for example, Leeds, Edinburgh and Belfast. A few minutes later the results shown in Table 4 are produced.

Table 3 SAR and LBS Age-Sex of Lone Parents Comparison

          SAR (factored up)                     LBS
Age        Male      Female        Age        Male      Female
11            0           0        11         not available
12            0           0        12
13            0           0        13
14            0           0        14
15            0           0        15
16          100        2100        16           96         308
17          500        6000        17           88        1588
18          600       12900        18          120        5259
19          700       17400        19          170       10779
20         1600       25200        20          248       17248
21         1000       30500        21          306       22240
22         1000       35200        22          430       27341
23         1000       45500        23          545       31032
24         1600       42000        24          632       34046
25-34     24100      459100        25-34      not available
35-44     46100      333800        35-44
45+       35100      119900        45+
Total    113400     1129600        Total      not available

Crown Copyright

Table 4 Lone Parents Comparison between Cities


              Leeds         Edinburgh      Belfast       
          Male  Female   Male   Female   Male   Female  
AGE                                                  
16         0        0      0        0      0       0 
17         0        0      0        0      0       0 
18         1        2      0        3      0       2 
19         1        4      1        3      1       2 
20         0        9      0        3      0       2 
21         0       12      0        5      0       3 
22         0       12      0        9      0       6 
23         1       12      0        7      0       2 
24         0       10      0        6      1       7 
25         0       12      0       10      1       6 
26         0       12      1        6      0      13 
27         1       10      0        3      0       8 
28         0        9      1        6      0       6 
29         2       11      0        5      0      11 
30         0       11      0       12      0       6 
Total      6      126      3       78      3      74 

Crown Copyright
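Tabulations of the kind shown in Tables 3 and 4 amount to selecting the records of interest, cross-tabulating age by sex, and multiplying the sample counts by the inverse of the sampling fraction. The following is a minimal sketch in Python of that logic only; it is not the USAR interface, and the record layout, field names and example values are hypothetical.

```python
from collections import Counter

# Hypothetical records: the field names and values are illustrative
# assumptions, not the variable names used in the SAR or in USAR.
records = [
    {"area": "Leeds", "age": 19, "sex": "F", "lone_parent": True},
    {"area": "Leeds", "age": 19, "sex": "F", "lone_parent": True},
    {"area": "Leeds", "age": 23, "sex": "M", "lone_parent": True},
    {"area": "Leeds", "age": 30, "sex": "F", "lone_parent": False},
    {"area": "Belfast", "age": 24, "sex": "F", "lone_parent": True},
]


def tabulate(records, area=None, scale_factor=1):
    """Cross-tabulate lone parents by (age, sex), optionally for one area.

    Each sample record contributes scale_factor to its cell, so a value
    of 50 would factor up a 2% sample and 100 a 1% sample.
    """
    table = Counter()
    for rec in records:
        if not rec["lone_parent"]:
            continue  # keep lone parents only
        if area is not None and rec["area"] != area:
            continue  # optional geographical filter
        table[(rec["age"], rec["sex"])] += scale_factor
    return table


leeds_raw = tabulate(records, area="Leeds")       # raw sample counts
all_scaled = tabulate(records, scale_factor=50)   # factored-up counts

for (age, sex), count in sorted(all_scaled.items()):
    print(age, sex, count)
```

Keeping the scale factor as a parameter means the same function yields either raw sample counts, as in Table 4, or factored-up estimates, as in Table 3.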

However, these results illustrate another feature of census data that the SARs can be used to investigate. Most of the other SAR/LBS comparisons users can perform with USAR will show only small differences in the numbers once they are factored up. Occasionally, however, as happens here, the differences can be much larger than expected. For example, the number of 16-24 year olds in the SAR data is estimated to be 7,519,800 compared with 6,860,516 in the LBS. The difference is too large to be due to sampling error and is investigated further in Figures 1 and 2. These show, as bar charts, the differences between the results for individual age/sex groups in LBS Table 38 and USAR. The under-enumeration of 16-24 year olds is very pronounced. For example, the differences in the 18, 19, 20, 21, and 22 year groups are respectively 2.4%, 4.1%, 3.2%, 5.5%, and 8.9%. These differences are attributed to the effects of census non-response being greatest amongst the young, who were probably trying to hide to avoid the "poll tax". The even larger under-enumeration of 80+ year olds is due to a problem with census collection in aged persons' homes. In the LBS, OPCS imputation will have been used in an attempt to correct for these effects, although clearly for some age groups the correction was too small. By comparison, the SAR sample was taken from fully processed records. This suggests that the SAR results may well be more accurate.
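The claim that the gap between the two totals for 16-24 year olds is too large to be sampling error can be checked with a rough calculation. The sketch below is illustrative only: it assumes that the SAR total was factored up from the 2% individual sample and that sampling can be approximated as independent selection of each person with that probability (a binomial approximation to simple random sampling). Under those assumptions the standard error of a factored-up count N is approximately sqrt(N(1 - f)/f), where f is the sampling fraction.

```python
import math


def factored_up_standard_error(estimate, sampling_fraction):
    """Approximate standard error of a count factored up from a sample.

    Binomial approximation: each person is assumed to be sampled
    independently with probability sampling_fraction.
    """
    return math.sqrt(estimate * (1.0 - sampling_fraction) / sampling_fraction)


sar_estimate = 7_519_800   # factored-up SAR count quoted in the text
lbs_count = 6_860_516      # LBS count quoted in the text
sampling_fraction = 0.02   # assumption: the 2% individual sample

difference = sar_estimate - lbs_count
percent_of_lbs = 100.0 * difference / lbs_count
std_error = factored_up_standard_error(sar_estimate, sampling_fraction)
ratio = difference / std_error

print(f"difference: {difference}")
print(f"difference as % of LBS count: {percent_of_lbs:.1f}")
print(f"approximate standard error: {std_error:.0f}")
print(f"difference in standard errors: {ratio:.1f}")
```

On these assumptions the difference of 659,284 is more than thirty times the approximate standard error, so sampling variation alone is a very unlikely explanation. Assuming a 1% sample instead would enlarge the standard error by a factor of about 1.4 and would not change that conclusion.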

Figure 1 - Percentage Male Under-enumeration

Figure 2 - Percentage Female Under-enumeration

These results provide a good illustration of the value of the SAR. Laudable attempts to correct for under-enumeration seem to have presented a distorted numeric picture of a politically highly sensitive social group; indeed, Table 3 and Figure 3 show evidence for the extensive underestimation of teenage lone parents. This should also include under 16 year olds but these are not recorded in the 1991 census. Why this should be is a mystery. Was it a legally induced omission or was it accidental? Further analysis is shown in Table 5, which gives the breakdown of lone parents with dependent children by region, divided into economically active and inactive. It illustrates the use of a SAR derived variable. Despite our expectations to the contrary, nearly half of lone parents are economically active, though the economic prosperity of the region does appear to affect this, with more lone parents working in the south than the north.

Figure 3 - Percentage Under-enumeration of Lone Parents

Table 5 Economic Activity of Lone Parents with Dependent Children by Region


Region          % in            Index            
                employment       (Country =100)  
Rest of S.East  53.64           123.75           
South West      53.45           120.88           
East Anglia     53.11           120.27           
East Midlands   50.39           108.85           
Outer London    49.48           108.37           
West Midlands   46.09            99.27            
Scotland        46.29            95.81            
North West      45.50            92.06            
Yorks and Humb  45.09            91.85            
Inner London    44.89            87.07            
Wales           42.54            84.45            
North           41.29            77.56            
Total           47.67           100.00           

Crown copyright

The next step is to use the SAR to look in more detail at the characteristics of lone parents via USAR's exploratory search facility. The LBS is much less useful in this respect because it provides very limited information, all of which is affected by under-enumeration as well as being scattered over several different SAS and LBS tables. The search facility in USAR lists all the SAR variables by their association with lone parents. This is a simple but useful exploratory data analysis tool. The user can use the whole dataset or specify a series of geographical filters which define the subset of interest. USAR then presents a ranked list of variable-category combinations. The user is then able to scroll down this list and select more combinations to further narrow the area of search. Table 6 summarises the most frequently associated variables for young lone parents with dependent children. Note that none of the geographical variables (region, county, etc.) appear. This would imply either a lack of any major regional concentrations or that the standard geography codes used in the SAR are not particularly appropriate for analysis and may need to be recoded. This could be done subjectively by defining what might be considered a more appropriate macro-geographical representation; for example, approximations to North-South, big city, rural, industrial towns, etc. Alternatively, USAR provides table design tools that can be used to recode geography to provide a more consistent description of lone parents. These associations can then be used as the basis for a fuzzy search.

Table 6 Search results for Young Lone Parents


                                                 
Characteristic                                       Percentage
Female                                               96
White                                                93
Single                                               82
No car                                               73
Economically inactive                                71
No employed persons in household                     66
1 child in household                                 66
Local authority rent                                 53
2 children in household                              25
SIC Division 6 (distribution, hotels and repairs)    24
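A ranked list of the kind shown in Table 6 can be thought of as the share of each variable-category combination within the selected subset of records, sorted from most to least frequent. The following Python sketch illustrates that idea only; it is an assumption about the general approach rather than a description of how USAR's search facility is implemented, and the records and field names are hypothetical.

```python
from collections import Counter


def rank_characteristics(records, target, exclude=()):
    """Rank (variable, category) pairs by their share of the target subset.

    target is a function that returns True for records of interest; it
    can also encode a geographical filter. Variables named in exclude
    are left out of the ranking.
    """
    subset = [rec for rec in records if target(rec)]
    counts = Counter()
    for rec in subset:
        for variable, category in rec.items():
            if variable not in exclude:
                counts[(variable, category)] += 1
    total = len(subset)
    # most_common() sorts by frequency, highest first
    return [
        (variable, category, 100.0 * n / total)
        for (variable, category), n in counts.most_common()
    ]


# Hypothetical records, for illustration only.
records = [
    {"lone_parent": True, "sex": "female", "cars": 0, "tenure": "LA rent"},
    {"lone_parent": True, "sex": "female", "cars": 0, "tenure": "owner"},
    {"lone_parent": True, "sex": "female", "cars": 1, "tenure": "LA rent"},
    {"lone_parent": True, "sex": "female", "cars": 0, "tenure": "owner"},
    {"lone_parent": False, "sex": "male", "cars": 2, "tenure": "owner"},
]

ranked = rank_characteristics(
    records,
    target=lambda rec: rec["lone_parent"],
    exclude=("lone_parent",),
)
for variable, category, percent in ranked:
    print(f"{variable}={category}: {percent:.0f}%")
```

A geographical variable would appear near the top of such a list only if the subset were concentrated in a few of its categories, which is one way of reading the absence of region and county from Table 6.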

Fuzzy search is an attempt to move away from having to be explicit when specifying queries for a database. For example, based on Table 6, if a user wants to count the number of lone parents with the most frequent characteristics then this merely involves a single logical selection (viz. select if female and white and single and no car, etc.). However, it is not reasonable to expect that many lone parents will match this profile in its entirety and therefore some will be missed. Fuzzy searching permits a greater degree of flexibility. You can ask for any K from M logical conditions to be satisfied without being any more specific. Table 7 is created by instructing USAR to match 6 out of the 9 most frequent variables previously identified, and the report shows the matches found. The largest coherent group is white, single parents with no car or job living in Local Authority rented accommodation with 1 dependent child. Table 8 shows the same fuzzy search conducted on a region by region basis. This shows that in most regions the largest coherent group of lone parents is the same as for the country as a whole. However, in East Anglia, the South West and Wales lone parents are less likely to be single, and in the South West and Wales they are more likely to have 2 dependent children.
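The K from M idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the USAR implementation: a record is taken to match if it satisfies at least K of the M conditions (the reading suggested by the rows of Table 7, which meet six or seven of the nine conditions), matching records are grouped by their pattern of satisfied (+) and unsatisfied (-) conditions, and the records, field names and conditions are hypothetical.

```python
from collections import Counter


def fuzzy_search(records, conditions, k):
    """Count records satisfying at least k of the conditions, by pattern.

    conditions is a list of (label, test) pairs. The result is keyed by
    a string with one '+' or '-' per condition, in the order given.
    """
    patterns = Counter()
    for rec in records:
        hits = [bool(test(rec)) for _, test in conditions]
        if sum(hits) >= k:
            patterns["".join("+" if hit else "-" for hit in hits)] += 1
    return patterns


# Hypothetical conditions and records, for illustration only.
conditions = [
    ("no car", lambda r: r["cars"] == 0),
    ("single", lambda r: r["marital"] == "single"),
    ("inactive", lambda r: r["econ"] == "inactive"),
    ("LA rent", lambda r: r["tenure"] == "LA rent"),
]

records = [
    {"cars": 0, "marital": "single", "econ": "inactive", "tenure": "LA rent"},
    {"cars": 0, "marital": "single", "econ": "inactive", "tenure": "owner"},
    {"cars": 0, "marital": "single", "econ": "inactive", "tenure": "owner"},
    {"cars": 1, "marital": "married", "econ": "active", "tenure": "owner"},
    {"cars": 0, "marital": "married", "econ": "inactive", "tenure": "LA rent"},
    {"cars": 1, "marital": "single", "econ": "active", "tenure": "LA rent"},
]

matches = fuzzy_search(records, conditions, k=3)
print(" ".join(label for label, _ in conditions))
for pattern, count in matches.most_common():
    print(pattern, count)
```

Sorting the patterns by count puts the largest coherent group first, and running the same search on the records of one region at a time would give a region by region comparison.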

Table 7 Fuzzy search on 6 out of 9 variables


Cars   Marital  Ethnic  Economic   Tenure    Dependent      Employed   Industry   Count
       status   group   position             children       people     division
                                                            in HH
0      Single   White   Inactive   LA rent   1      2       0          6
+      +        +       +          +         +      -       +          -          281
+      +        +       +          -         +      -       +          -          153
+      +        +       +          +         -      +       +          -          145
+      +        +       +          +         +      -       +          +          94
+      +        +       +          -         +      -       +          +          71
+      +        +       +          -         -      +       +          +          55
+      -        +       +          +         -      +       -          +          55

Table 8 Fuzzy search for each region


Region            Cars   Marital  Ethnic  Economic   Tenure    Dependent      Employed   Industry   Count
                         status   group   position             children       people     division
                                                                              in HH
                  0      Single   White   Inactive   LA rent   1      2       0          6
North             +      +        +       +          +         +      -       +          -          61
Yorkshire + Humb  +      +        +       +          +         +      -       +          -          56
E. Midlands       +      +        +       +          +         +      -       +          -          42
E. Anglia         +      -        +       +          +         +      -       +          -          8
I. London         +      +        +       +          +         +      -       +          -          35
O. London         +      +        +       +          +         +      -       +          -          42
R. SE             +      +        +       +          +         +      -       +          -          48
SW                +      -        +       +          +         -      +       +          -          29
W. Midlands       +      +        +       +          +         +      -       +          -          59
NW                +      +        +       +          +         +      -       +          -          83
Wales             +      -        +       +          +         -      +       +          -          30
Scotland          +      +        +       +          +         +      -       +          -          87

Conclusions

The SAR provides geographers with a major new census resource. The USAR system provides an easy way in. This paper has provided illustrations of the SARs being used to answer geographical research questions. There are many other potential applications; in particular in multi-level modelling and the use of artificial intelligence based machine discovery systems such as data mining. The principal constraints now are those of the investigators' imagination. However, it may well be a while before the geographical benefits of a fundamentally new data resource such as the SARs are fully identified and these opportunities recognised in the less quantitative parts of human geography.

Note

You can obtain USAR software for UNIX and MS-DOS by anonymous ftp from gam.leeds.ac.uk. However, the USAR data files for UNIX and MS-DOS are only available by ftp from MCC with proof of SAR registration. Non-academic users can purchase the SAR data for £2K from MCC; the USAR software itself is free.

Acknowledgements

The research reported here was supported by ESRC grant number H507255100.

References

Cole, K., (1994) 'Data Modification, data suppression, small populations and other features of the 1991 Small Area Statistics'. Area 26 1 69-78

Dale, A., Marsh, C., (1993) The 1991 Census User's Guide, HMSO, London

Marsh C., Arber, S., Wrigley, N., Rhind, D., Bulmer M., (1988) 'The view of academic social scientists on the 1991 UK Census of Population: a report of the ESRC Census Working Group', Environment and Planning A 20, 851-889

Middleton, E., (1995) in S. Openshaw (ed) The Census Users Handbook Longman, London p337-362

Openshaw, S., (1984), 'Ecological fallacies and the analysis of areal census data', Environment and Planning A 16, 17-31

Openshaw, S., (1994), 'Social costs and benefits of the census', Proceedings of XVth International Conference of the Data Protection and Privacy Commissioners p89-97

Openshaw, S., (1995), 'The future of the census', in S Openshaw (ed) The Census Users Handbook Longman, London p389-411

Turton, I., Openshaw, S., (1994), 'A step-by-step guide to accessing the 1991 SAR via USAR', Working Paper 94/6, School of Geography, Leeds University

Turton, I., Openshaw, S., (1995) 'Putting the 1991 Census Sample of Anonymised Records on your Unix Workstation', Environment and Planning A (forthcoming).