S Openshaw and I Turton
Centre for Computational Geography, School of Geography,
University of Leeds, Leeds LS2 9JT
The availability of samples of individual census data from the 1991 census significantly broadens the usefulness of census data. The paper describes how to easily access and use the SAR data via specialist software. It illustrates some of the new types of census analysis that are now possible by applying more traditional social survey approaches to spatial micro census data.
The 1991 census offers a number of innovations that are relevant to geography. Some of these are discussed in Cole (1994) and mainly relate to the much easier access to data for small areas and the provision of a number of important new variables. The purchase of census enumeration district boundaries in a digital form is also a most significant advance. However, perhaps equally exciting is the provision of a Sample of Anonymised Records (SAR). This provides micro-level details for a random sample of 2% of individuals and 1% of households who completed a 1991 census form; see Dale and Marsh (1993). The availability of micro census data for Britain and Northern Ireland represents a major innovation in census outputs. It opens up the census data resource to a much broader group of research applications outside the traditional area of quantitative spatial analysis and GIS. This is important because, viewed purely as a source of detailed spatial information, the SAR represents a step back from the detailed small area and postcoded world of the 1990s to the pre-1951 census era, when census outputs were only available for large administrative areas. The excuse then was lack of demand and probably also fundamental difficulties in data management, whereas with the SARs the coarseness of the geographic coding is designed to be a confidentiality preservation device, albeit of unknown efficiency (Openshaw, 1994). The purpose of this paper is to stimulate interest in, and raise awareness of, the geographical research potential of the SARs within geography, and also to demonstrate that they are the easiest of all census datasets to access and analyse via specialist software.
A prerequisite in seeking geographical uses of the SAR data is an interest in individual persons or households either for the entire UK (including Northern Ireland) or at a macrogeographic scale. The SAR geography is restricted to areal units composed of large districts (with a minimum population size of 100,000); see Table 1. Nevertheless, this yields a multi-level survey dataset for Britain considerably larger and more representative than any other widely available social survey. The individual data files consist of 1,116,181 records and the household data of some 215,761 household records, although they have varying levels of geographic resolution. Dale and Marsh (1993) and Middleton (1995) provide a further discussion of the geographical codes used by the SAR samples. Particularly serious is the lack of fine geographic resolution. The household data, which contain details of household structure, are not available for geographic codes below the region scale. This reflects confidentiality concerns but clearly precludes some important geographical uses of the data; for example, the investigation of aggregation effects on the severity of the ecological fallacy problem in the small area census statistics (Openshaw, 1984). None of the geography codes in Table 1 is a small area code and this is to be regretted, although it is understandable. Nevertheless, it does not damage one of the principal arguments in the needs case for the provision of basic SAR data. Marsh et al (1988) claimed that individual level data were needed to help negate some of the ecological fallacy problems inherent in the analysis of aggregate small area statistics. At least it is now possible to obtain reasonably good sample estimates of relationships at the individual level from the SAR samples, which can subsequently be compared with Small Area Statistics (SAS) and Local Base Statistics (LBS) results.
This is no small achievement and marks an advance in census data provision even if it is still far less than what might be considered desirable.
Table 1 SAR geography
Census geography     Number   Individual data   Household data
Country                   4   yes               yes
Region (1)               12   yes               yes
County                   64   yes               no
District                468   yes               no
Ward/ED/postcode         no   no                no
TV Region (2)            12   yes               no

(1) North, Yorks and Humberside, East Midlands, East Anglia, Inner London, Outer London, Rest of S.East, South West, West Midlands, North West, Wales, Scotland, NI
(2) A recoded derived variable
However, if the SAR is not ideal from a spatial analytical geographical perspective, it is important not to undervalue what actually exists. It provides the geographer with an opportunity to design their own census tabulations and to control to some extent the geographical coding considered most relevant. If the table you want is not provided in the SAS or LBS then you can probably create it yourself from the SAR data. In theory this should greatly improve understanding of many social and geographic phenomena, add value to the census as a data resource relevant to many areas of human geography and social science outside the world of GIS, and allow better linkage with other survey data sources. The SAR data are the only census outputs that can be treated both as a survey and as a spatial dataset. Viewed from this perspective there can be little doubt that the SAR significantly broadens the range of research interests that can be served by the census data resource (Openshaw, 1995).
One of the benefits of only having macro geographic coding is that sample sizes of only 1% and 2% are quite adequate for disaggregate analysis at these scales. If census ward or enumeration district levels of geographic coding had been used then sample sizes of 10 to 20% (or more) might well have been needed to yield useful amounts of data for reliable analysis at this finer geographic scale. Table 2 shows the total numbers of records available at each level of geography. This should allow reasonable levels of table disaggregation. Furthermore, the simple random nature of the SAR data samples makes the subsequent task of statistical analysis much easier than if a more complex sampling strategy had been employed.
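As a rough illustration of why small sampling fractions can suffice at these scales, the sketch below computes the textbook standard error of an estimated proportion under simple random sampling, ignoring the finite population correction. The function name and the example proportion of 0.1 are assumptions made for illustration; the record counts are the smallest district and largest country person samples from Table 2.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from a simple random sample of n records."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Person-record counts taken from Table 2; p = 0.1 is an arbitrary example value.
for label, n in [("smallest district", 2327), ("largest country", 955882)]:
    se = proportion_standard_error(0.1, n)
    print(f"{label}: n={n}, standard error={se:.4f}")
```

The standard error falls with the square root of the number of records, which is why finer geographic coding would demand much larger sampling fractions to reach comparable precision.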
Table 2 Size of SAR samples by geographic scale
             Person data           Household data
Geography    Min       Max         Min       Max
Country      57579     955882      11020     184740
Region       41560     215305      8052      41043
County       2421      83690       -         -
District     2327      19212       -         -
Another general distinguishing feature of 1991 census analysis is the provision of specialist software to ease considerably the traditionally hard tasks of accessing the census data. Software such as SASPAC can either be run on a local computer or accessed via a national service at the Manchester Computing Centre (MCC). This and related software have revolutionised access to all the traditional census outputs: LBS, SAS, Special Migration Statistics (SMS) and Special Workplace Statistics (SWS). The SARs too can be accessed via specially written software systems. An ESRC project at the School of Geography in Leeds has created an easy to use system for access, tabulation, and some exploratory analysis of the SAR data (Turton and Openshaw, 1995). The USAR (Unix SAR) system, a slightly misleading acronym, provides a portable Unix software and data package that can either be accessed at MCC or installed on local Unix hosts. It needs 68 MB of disk space to store the SAR data files. It is available free of charge to encourage usage of the SAR and it can be obtained by anonymous file transfer (ftp); see the note at the end for details. A user manual is also available in electronic and paper form (Turton and Openshaw, 1994). The Centre for Computational Geography at Leeds will support the USAR software at least until 1997.
The USAR system offers the following advantages:
1. it is easy to use with a very short learning curve; most non-expert computer users (e.g. undergraduates) can generate useful results very soon after starting,
2. it is fast and efficient,
3. it provides a small number of advanced exploratory analysis tools not found in statistical packages, specially developed to help the analysis of the SARs,
4. it offers an easy way of creating files for input into other software systems,
5. it has a built-in library of standard recodes allowing the automatic recreation of many of the SAS and LBS tables as well as 47 other derived variables; and
6. it provides SAR data for Northern Ireland as well as the rest of the UK.
More recently, USAR has been transferred onto a PC and a version for MS-DOS exists. This can be used on a single PC or on a fileserver network. It also requires 68 MB of free disk space, and at least 4 MB of PC memory. On a 33 MHz 486 PC, run times are about 10 times longer than on a UNIX workstation. Nevertheless, typical table times of 5 minutes are acceptable for slow PC hardware given a database of 68 MB. The MS-DOS version of USAR is identical in all respects to the UNIX version and is also available free of charge.
The principal attraction of the SAR data is the flexibility it gives the geographer in defining their own census outputs. You are no longer restricted to using what the OPCS/GRO(S) provide. This removes most needs for special tabulation. An example would be LBS Table L37, which reports age and sex only for lone parents aged 16-24. Suppose a tabulation with a larger number of age groups is required. With USAR this task took less than 5 minutes to perform with the PC version and 30 seconds with the UNIX version; see Table 3. The same SAR analysis can then be repeated for individual cities; for example, Leeds, Edinburgh and Belfast. A few minutes later the results shown in Table 4 are produced.
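The kind of user-defined tabulation described above can be sketched in a few lines. This is an illustration under stated assumptions, not USAR's implementation: the record layout, the field names (`age`, `sex`, `lone_parent`), the age bands and the function names are all hypothetical, and the factoring-up step simply divides each cell count by the sampling fraction supplied by the caller.

```python
from collections import Counter

# Hypothetical age bands chosen by the analyst: (label, lowest age, highest age).
AGE_BANDS = [("16-24", 16, 24), ("25-34", 25, 34), ("35-44", 35, 44), ("45+", 45, 200)]


def age_band(age):
    """Return the label of the band containing age, or None if no band applies."""
    for label, low, high in AGE_BANDS:
        if low <= age <= high:
            return label
    return None


def lone_parent_table(records, sampling_fraction):
    """Count lone parents by (age band, sex) and factor the counts up to population estimates."""
    if not 0.0 < sampling_fraction <= 1.0:
        raise ValueError("sampling_fraction must be in (0, 1]")
    counts = Counter(
        (age_band(r["age"]), r["sex"])
        for r in records
        if r["lone_parent"] and age_band(r["age"]) is not None
    )
    factor = 1.0 / sampling_fraction
    return {cell: round(n * factor) for cell, n in counts.items()}


# Made-up records for illustration only; the field names are not SAR variable names.
sample = [
    {"age": 19, "sex": "F", "lone_parent": True},
    {"age": 22, "sex": "F", "lone_parent": True},
    {"age": 30, "sex": "M", "lone_parent": True},
    {"age": 41, "sex": "F", "lone_parent": False},
]
table = lone_parent_table(sample, sampling_fraction=0.02)
for cell in sorted(table):
    print(cell, table[cell])
```

Changing `AGE_BANDS` is all that is needed to produce a different age breakdown, which is the flexibility that fixed published tables cannot offer.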
Table 3 SAR and LBS Age-Sex of Lone Parents Comparison
         SAR (factored up)       LBS
Age      Male      Female        Male      Female
11       0         0             not available
12       0         0             not available
13       0         0             not available
14       0         0             not available
15       0         0             not available
16       100       2100          96        308
17       500       6000          88        1588
18       600       12900         120       5259
19       700       17400         170       10779
20       1600      25200         248       17248
21       1000      30500         306       22240
22       1000      35200         430       27341
23       1000      45500         545       31032
24       1600      42000         632       34046
25-34    24100     459100        not available
35-44    46100     333800        not available
45+      35100     119900        not available
Total    113400    1129600       not available
Table 4 Lone Parents Comparison between Cities
         Leeds            Edinburgh        Belfast
Age      Male   Female    Male   Female    Male   Female
16       0      0         0      0         0      0
17       0      0         0      0         0      0
18       1      2         0      3         0      2
19       1      4         1      3         1      2
20       0      9         0      3         0      2
21       0      12        0      5         0      3
22       0      12        0      9         0      6
23       1      12        0      7         0      2
24       0      10        0      6         1      7
25       0      12        0      10        1      6
26       0      12        1      6         0      13
27       1      10        0      3         0      8
28       0      9         1      6         0      6
29       2      11        0      5         0      11
30       0      11        0      12        0      6
Total    6      126       3      78        3      74
However, these results illustrate another feature of census data that the SARs can be used to investigate. Most of the other SAR/LBS comparisons users can perform with USAR will show only small differences in the numbers once they are factored up. Occasionally, as happens here, the differences can be much larger than expected. For example, the number of 16-24 year olds is estimated from the SAR data to be 7,519,800, compared with 6,860,516 in the LBS. The difference is too large to be due to sampling error and is investigated further in Figures 1 and 2. These show, as bar charts, the differences between the results for individual age/sex groups in LBS Table 38 and USAR. The under-enumeration of 16-24 year olds is very pronounced. For example, the differences in the 18, 19, 20, 21, and 22 year groups are respectively 2.4%, 4.1%, 3.2%, 5.5%, and 8.9%. These differences are attributed to the effects of census non-response being greatest amongst the young, who were probably trying to hide to avoid the "poll tax". The even larger under-enumeration of 80+ year olds is due to a problem with census collection in aged persons' homes. In the LBS, OPCS imputation will have been used in an attempt to correct for these effects, although clearly for some age groups the correction was too small. By comparison, the SAR sample was taken from fully processed records. This suggests that the SAR results may well be more accurate.
Figure 1 - Percentage Male Under-enumeration
Figure 2 - Percentage Female Under-enumeration
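The size of the overall discrepancy for 16-24 year olds can be checked with simple arithmetic. The sketch below expresses the gap between the factored-up SAR estimate and the LBS count as a percentage of the LBS count. The choice of the LBS count as the base is an assumption, since the base used for the percentages in Figures 1 and 2 is not stated, and the function name is hypothetical.

```python
def percent_difference(sar_estimate: float, lbs_count: float) -> float:
    """Difference between the SAR estimate and the LBS count, as a percentage of the LBS count."""
    if lbs_count <= 0:
        raise ValueError("lbs_count must be positive")
    return 100.0 * (sar_estimate - lbs_count) / lbs_count


# Totals for 16-24 year olds quoted in the text above.
gap = percent_difference(7_519_800, 6_860_516)
print(f"SAR estimate exceeds the LBS count by {gap:.1f}%")
```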
These results provide a good illustration of the value of the SAR. Laudable attempts to correct for under-enumeration seem to have presented a distorted numeric picture of a politically highly sensitive social group; indeed, Table 3 and Figure 3 show evidence for the extensive underestimation of teenage lone parents. This should also include under 16 year olds but these are not recorded in the 1991 census. Why this should be is a mystery. Was it a legally induced omission or was it accidental? Further analysis is shown in Table 5, which gives the breakdown of lone parents with dependent children by region, divided into economically active and inactive. It illustrates the use of a SAR derived variable. Despite our expectations to the contrary, nearly half of lone parents are economically active, though the economic prosperity of the region does appear to affect this, with more lone parents working in the south than the north.
Figure 3 - Percentage Under Enumeration of Lone Parents
Table 5 Economic Activity of Lone Parents with Dependent Children by Region
Region            % in employment   Index (Country = 100)
Rest of S.East    53.64             123.75
South West        53.45             120.88
East Anglia       53.11             120.27
East Midlands     50.39             108.85
Outer London      49.48             108.37
West Midlands     46.09             99.27
Scotland          46.29             95.81
North West        45.50             92.06
Yorks and Humb    45.09             91.85
Inner London      44.89             87.07
Wales             42.54             84.45
North             41.29             77.56
Total             47.67             100.00
The next step is to use the SAR to look in more detail at the characteristics of lone parents via USAR's exploratory search facility. The LBS is much less useful in this respect because it provides very limited information, all of which is affected by under-enumeration and is scattered over several different SAS and LBS tables. The Search facility in USAR lists all the SAR variables by their association with lone parents. This is a useful and simple exploratory data tool. The user can use the whole dataset or specify a series of geographical filters which define the subset of interest. USAR then presents a ranked list of variable category combinations. The user is then able to scroll down this list and select more combinations to narrow the search further. Table 6 summarises the most frequently associated variables for young lone parents with dependent children. Note that none of the geographical variables (region, county etc.) appear. This would imply either a lack of any major regional concentrations or that the standard geography codes used in the SAR are not particularly appropriate for analysis and may need to be recoded. This could be done subjectively by defining what might be considered a more appropriate macro geographical representation; for example, approximations to North-South, big city, rural, industrial towns, etc. Alternatively, USAR provides table design tools that can be used to recode geography to provide a more consistent description of lone parents. These associations can then be used as the basis for a fuzzy search.
Table 6 Search results for Young Lone Parents
Characteristic                                         Percentage
Female                                                 96
White                                                  93
Single                                                 82
No Car                                                 73
Economically Inactive                                  71
No Employed Persons in Household                       66
1 Child in Household                                   66
Local Authority Rent                                   53
2 Children in Household                                25
SIC Division 6 (Distribution, hotels and repairs)      24
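A ranked list of the kind shown in Table 6 can be approximated with a simple frequency count. The sketch below is not USAR's Search algorithm, whose internals are not described here; it merely ranks every (variable, category) pair by the percentage of records in the group of interest that carry it. The function name, the record layout and the field names are hypothetical.

```python
from collections import Counter


def rank_associations(records, is_target, exclude=()):
    """Rank (variable, category) pairs by the percentage of target records that have them."""
    target = [r for r in records if is_target(r)]
    if not target:
        return []
    counts = Counter(
        (var, value)
        for r in target
        for var, value in r.items()
        if var not in exclude
    )
    ranked = [(var, value, 100.0 * n / len(target)) for (var, value), n in counts.items()]
    # Highest percentage first; ties broken by variable name, then category.
    ranked.sort(key=lambda row: (-row[2], row[0], str(row[1])))
    return ranked


# Made-up records for illustration only; the field names are not SAR variable names.
records = [
    {"lone_parent": True, "sex": "F", "tenure": "LA rent", "cars": 0},
    {"lone_parent": True, "sex": "F", "tenure": "LA rent", "cars": 1},
    {"lone_parent": True, "sex": "F", "tenure": "owner", "cars": 0},
    {"lone_parent": False, "sex": "M", "tenure": "owner", "cars": 2},
]
ranked = rank_associations(records, lambda r: r["lone_parent"], exclude=("lone_parent",))
for var, value, pct in ranked:
    print(f"{var}={value}: {pct:.0f}%")
```

A geographical filter of the kind mentioned above would simply be an extra condition inside `is_target`.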
Fuzzy search is an attempt to move away from having to be explicit when specifying queries for a database. For example, based on Table 6, if a user wants to count the number of lone parents with the most frequent characteristics then this merely involves a single logical selection (viz. select if female and white and single and no car etc.). However, it is not reasonable to expect that many lone parents will match this profile in its entirety and therefore some will be missed. Fuzzy searching permits a greater degree of flexibility. You can ask for any K from M logical conditions to be satisfied without being any more specific. Table 7 is created by instructing USAR to match 6 out of the 9 most frequent variables previously identified, and the report shows the matches found. The largest coherent group is white, single parents with no car or job living in Local Authority rented accommodation with 1 dependent child. Table 8 shows the same fuzzy search conducted on a region by region basis. This shows that in most regions the largest coherent group of lone parents is the same as for the country as a whole. However, in East Anglia, the South West and Wales lone parents are less likely to be single, and in the South West and Wales they are more likely to have 2 dependent children.
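The K-from-M idea can be sketched as follows. This is a minimal illustration rather than USAR's own code: it assumes that a record qualifies when at least K of the M conditions hold, and it groups qualifying records by their pattern of satisfied conditions, in the spirit of the +/- rows of Table 7. The function name, the conditions and the records are hypothetical.

```python
from collections import Counter


def fuzzy_search(records, conditions, k):
    """Count records satisfying at least k of the conditions, grouped by match pattern."""
    if not 0 <= k <= len(conditions):
        raise ValueError("k must be between 0 and the number of conditions")
    patterns = Counter()
    for record in records:
        pattern = tuple(bool(test(record)) for _, test in conditions)
        if sum(pattern) >= k:
            patterns[pattern] += 1
    # Most frequent pattern first.
    return patterns.most_common()


# Hypothetical conditions and made-up records, for illustration only.
conditions = [
    ("no car", lambda r: r["cars"] == 0),
    ("single", lambda r: r["marital"] == "single"),
    ("LA rent", lambda r: r["tenure"] == "LA rent"),
]
records = [
    {"cars": 0, "marital": "single", "tenure": "LA rent"},
    {"cars": 0, "marital": "single", "tenure": "owner"},
    {"cars": 0, "marital": "single", "tenure": "owner"},
    {"cars": 1, "marital": "married", "tenure": "LA rent"},
]
result = fuzzy_search(records, conditions, k=2)
for pattern, count in result:
    print(" ".join("+" if hit else "-" for hit in pattern), count)
```

Running the same function separately on the records of each region would give a region by region report of the same kind.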
Table 7 Fuzzy search on 6 out of 9 variables
Cars   Marital   Ethnic   Economic   Tenure    Dependent   Dependent   Employed   Indust   Count
       Status    group    Position             Children    Children    People     Div
                                                                       in HH
0      Single    White    Inactive   LA Rent   1           2           0          6
+      +         +        +          +         +           -           +          -        281
+      +         +        +          -         +           -           +          -        153
+      +         +        +          +         -           +           +          -        145
+      +         +        +          +         +           -           +          +        94
+      +         +        +          -         +           -           +          +        71
+      +         +        +          -         -           +           +          +        55
+      -         +        +          +         -           +           -          +        55
Table 8 Fuzzy search for each region
Region             Cars   Marital   Ethnic   Economic   Tenure    Dependent   Dependent   Employed   Indust   Count
                          Status    group    Position             Children    Children    People     Div
                                                                                          in HH
                   0      Single    White    Inactive   LA Rent   1           2           0          6
North              +      +         +        +          +         +           -           +          -        61
Yorkshire + Humb   +      +         +        +          +         +           -           +          -        56
E. Midlands        +      +         +        +          +         +           -           +          -        42
E. Anglia          +      -         +        +          +         +           -           +          -        8
I. London          +      +         +        +          +         +           -           +          -        35
O. London          +      +         +        +          +         +           -           +          -        42
R. SE              +      +         +        +          +         +           -           +          -        48
SW                 +      -         +        +          +         -           +           +          -        29
W. Midlands        +      +         +        +          +         +           -           +          -        59
NW                 +      +         +        +          +         +           -           +          -        83
Wales              +      -         +        +          +         -           +           +          -        30
Scotland           +      +         +        +          +         +           -           +          -        87
The SAR provides geographers with a major new census resource. The USAR system provides an easy way in. This paper has provided illustrations of the SARs being used to answer geographical research questions. There are many other potential applications; in particular in multi-level modelling and the use of artificial intelligence based machine discovery systems such as data mining. The principal constraints now are those of the investigator's imagination. However, it may well be a while before the full geographical benefits of such a fundamentally new data resource as the SARs are identified and these opportunities recognised in the less quantitative parts of human geography.
You can obtain USAR software for UNIX and MS-DOS by anonymous ftp from gam.leeds.ac.uk. However, the USAR data files for UNIX and MS-DOS are only available by ftp from MCC with proof of SAR registration. Non-academic users can purchase the SAR data for £2K from MCC; USAR itself is free.
The research reported here was supported by ESRC grant number H507255100
Cole, K., (1994) 'Data Modification, data suppression, small populations and other features of the 1991 Small Area Statistics'. Area 26 1 69-78
Dale, A., Marsh, C., (1993) The 1991 Census User's Guide, HMSO, London
Marsh C., Arber, S., Wrigley, N., Rhind, D., Bulmer M., (1988) 'The view of academic social scientists on the 1991 UK Census of Population: a report of the ESRC Census Working Group', Environment and Planning A 20, 851-889
Middleton, E., (1995) in S. Openshaw (ed) The Census Users Handbook Longman, London p337-362
Openshaw, S., (1984), 'Ecological fallacies and the analysis of areal census data', Environment and Planning A 16, 17-31
Openshaw, S., (1994), 'Social costs and benefits of the census', Proceedings of XVth International Conference of the Data Protection and Privacy Commissioners p89-97
Openshaw, S., (1995), 'The future of the census', in S Openshaw (ed) The Census Users Handbook Longman, London p389-411
Turton, I., Openshaw, S., (1994), 'A step-by-step guide to accessing the 1991 SAR via USAR', Working Paper 94/6, School of Geography, Leeds University
Turton, I., Openshaw, S., (1995) 'Putting the 1991 Census Sample of Anonymised Records on your Unix Workstation', Environment and Planning A (forthcoming).