Centre for Computational Geography
University of Leeds
Leeds, LS2 9JT
A major UK census milestone was reached in 1993 with the release of Britain's first official sample of anonymised census microdata, the Samples of Anonymised Records (SARs). This amounts to a 3% sample of the 1991 census, released in anonymous form, without names or addresses and coded only to large geographical areas. The release followed almost a decade of discussion and debate about the content of, the need for and the confidentiality of possible microdata from the census.
Turton and Openshaw (1995) agree with Marsh (1993), who summarised the advantages of the SAR as follows: "In the final analysis, however, the value of the SAR lies in the fact that, in contrast to tabulations produced by the Census Offices, we do not have to plan in advance all the useful ways in which the data can be presented. ...Almost every census user will have experienced the frustration of finding that the table they required was slightly different from the one they in the end had to use." (p297). This is a slight exaggeration. Nevertheless, it is the historically small number of essentially non-spatial census users who will probably benefit most from the SARs; indeed, this group may well be instrumental in opening up the census data resource to a broader social science constituency [Openshaw 1994].
The SAR is a big dataset, though not massive by modern standards. The original SAR data files amount to 126MB (79MB for the 2% sample of individuals and 47MB for the 1% sample of households). If the data are loaded into SPSS the system files amount to 93MB; loaded into SAS, the system files rise to 176MB.
The original intent in 1991 was to load the SAR data files on to a large mainframe at Manchester Computing Centre (MCC) and to provide a table output service for users, based on the same relational database package (Model 204) that the UK census agencies used in processing the 1991 census. Indeed, the Census Microdata Unit at the University of Manchester was established in 1992 to provide three sorts of dissemination service relating to the SARs:
Turton and Openshaw (1995) describe a fourth diffusion and dissemination path that was developed by an ESRC research project (1992-94). The idea was to develop portable data access software, designed for use with the SAR, that could be transferred by FTP over JANET to any Unix workstation with sufficient disc space. The aim was to provide an alternative standard means of accessing and using the SAR that was likely to become increasingly relevant during the 1990s. Indeed, the continued fall in the cost of workstation hardware emphasised the importance of storing the SAR data in a form suitable for both a Unix workstation and a PC environment. This provided academics (and others who buy the data) with a very different access path, one based on the distribution of the SAR data with associated specialist software designed to allow both expert and non-expert users to obtain maximum benefit from the SAR datasets easily and quickly. It was thought likely that both experienced and enthusiastic SAR users would increasingly want the SAR data available on their local Unix and PC systems, and that they would find specially developed SAR-relevant software of considerable practical assistance as a complement to the widely available general purpose statistical packages.
In the dissemination of the 1991 census data it was seen as important that "the data are: easily accessed with a minimum of alien computer control language, analysis is fast, the data is largely self-documenting, and a wide range of different social scientist skill levels can be catered for. Access has to be easy, straight forward and intuitively obvious, but also table creation is not sufficient by itself, so new tools are needed to help users cope with the SARs." [Turton and Openshaw 1995].
The 1991 SARs were accessed through a variety of software packages, the majority of them text based, reflecting the slower network speeds common at the time. Commercial packages such as SPSS, SIR and SAS that were already familiar to academic users were provided at MIDAS. MIDAS also provided a commercial package, Quanvert, which proved unpopular with users because it was far from obvious how to use it and it did not treat all variables equally; for example, geographic areas and occupations had to be row variables, not column variables. USAR was provided as free software to anyone who had legal access to the SARs. As a new package it met some resistance from established users of survey data, but it proved very popular with novice users as it was menu driven and would run on a variety of machines, from PCs to Unix mainframes.
The effect of these proposals on the software required to access the SARs would be to increase the size and complexity of the datasets to be handled. This is of little real consequence, however: since 1992 machine speeds have increased by at least an order of magnitude, and a further doubling or tripling can be expected by the time the 2001 SARs are available.
Users of the 2001 SARs are likely to have many of the same requirements as users of the 1991 SARs. Brown and Dale (1998) report the results of a survey of SAR users and non-SAR users. The majority of users were making use of the SARs as a research tool, with only one third using them for teaching.
When questioned on ease of use by Brown and Dale (1998), only 21% of respondents said that existing systems were easy to use, though in some cases the difficulty related to the documentation or to the Unix system used at Manchester. There were also complaints about the time taken to produce multiple tables.
Users reported a wide spread of systems used to access the SARs, ranging from a large central service (MIDAS) to stand-alone PCs. This reflects the rapid advances in computing power that have occurred since the 1991 SARs were first thought of. Many users expressed a preference for the data to be delivered either on CD-ROM or via the World Wide Web.
More than one package is essential to the viability of the SARs in order to provide a robust and flexible solution.
There are a variety of general statistical packages available that are capable of handling the whole 1991 SAR or some subset of it. However, many of these packages have problems associated with them. Some can only use a subset of the data, which is fine if that is all you need but becomes a problem when you suddenly need to compare your results with the national figures. Other packages restrict how some of the variables can be used; for instance, the district variable in the 1991 individual SAR has too many categories for some packages to handle.
In contrast, USAR, which was developed specially for the SARs, has none of these problems. Users can use as much or as little of the dataset as they wish. No variable is restricted and all variables can be arbitrarily grouped. USAR also adds extra features that allow users to explore the data more fully (Turton and Openshaw, 1994; Openshaw and Turton, 1996).
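The idea of arbitrarily regrouping a variable before tabulating it can be illustrated with a minimal sketch. This is not USAR code: the class name `SarCrossTab`, the age bands and the tenure labels are all hypothetical and do not correspond to real SAR codings. Java is used only because it is the language proposed later in this paper.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;

// Hypothetical sketch, not USAR code; the variables are not real SAR codings.
final class SarCrossTab {

    /** Illustrative regrouping of single years of age into three bands. */
    static String ageBand(int age) {
        if (age < 16) return "0-15";
        if (age < 65) return "16-64";
        return "65+";
    }

    /**
     * Counts records for each (grouped row value, column value) pair.
     * rows[i] and cols[i] are assumed to belong to the same record.
     */
    static Map<String, Map<String, Integer>> crossTab(
            int[] rows, String[] cols, IntFunction<String> grouping) {
        if (rows.length != cols.length) {
            throw new IllegalArgumentException("rows and cols must have equal length");
        }
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (int i = 0; i < rows.length; i++) {
            String rowKey = grouping.apply(rows[i]);
            // Create the row on first use, then add one to the matching cell.
            table.computeIfAbsent(rowKey, k -> new TreeMap<>())
                 .merge(cols[i], 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        int[] ages = {4, 23, 37, 70, 15, 68};
        String[] tenure = {"rented", "owned", "owned", "owned", "rented", "rented"};
        System.out.println(crossTab(ages, tenure, SarCrossTab::ageBand));
    }
}
```

Because the grouping is passed in as a function, a user could swap in any other banding without changing the tabulation code, which is the kind of flexibility described above.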
So while many packages allow users access to the SARs, only specialised software allows the full richness of the data to be utilised.
Specialist software allows more users more access to more of the SARs.
Following on from this proposal, the remainder of this paper will consider only the features that a specially designed package must provide to be of use to the user community.
At present the 1991 SARs can be accessed by a variety of commercial and free academically created packages. While cost is seldom an issue for a centrally provided academic service, it is to be expected that in 2001 more academic sites will wish to make use of the data locally, and the concerns of the public sector as to the costs of the data as well as the cost of software must be considered [Brown and Dale 1998]. It would seem that the cheapest method of providing access to the SARs is via academically produced software, provided that any concerns over quality, reliability, performance and usefulness can be met.
At least one method of accessing the SARs must be free or very cheap, so as not to deter users.
It is of no use to anyone if the software package produced is so flexible and feature loaded that it is impossible for a novice user to produce a table in less than 5 minutes from first meeting the package. However a package must be capable of being extended so that an expert user can carry out the more complex tasks that they commonly turn to SPSS or SIR for at present. This must include the ability to produce new variables and if necessary the ability to add new functions easily.
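One way such extensibility might look is sketched below. It is an assumption for illustration only, not a description of any existing SAR package: the class `DerivedVariables`, its methods and the variable names `age` and `workingAge` are hypothetical. Expert users register a rule for each new variable, and the rule is computed from variables already present in a record.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch of user-defined (derived) variables; not from any real package.
final class DerivedVariables {

    // Rules are kept in definition order so later rules can use earlier ones.
    private final Map<String, ToIntFunction<Map<String, Integer>>> rules =
            new LinkedHashMap<>();

    /** Registers a new variable computed from the values already in a record. */
    void define(String name, ToIntFunction<Map<String, Integer>> rule) {
        if (rules.containsKey(name)) {
            throw new IllegalArgumentException("Variable already defined: " + name);
        }
        rules.put(name, rule);
    }

    /** Returns a copy of the record with every derived variable added. */
    Map<String, Integer> extend(Map<String, Integer> record) {
        Map<String, Integer> out = new LinkedHashMap<>(record);
        for (Map.Entry<String, ToIntFunction<Map<String, Integer>>> e : rules.entrySet()) {
            out.put(e.getKey(), e.getValue().applyAsInt(out));
        }
        return out;
    }

    public static void main(String[] args) {
        DerivedVariables vars = new DerivedVariables();
        // Illustrative rule: 1 if aged 16 to 64, otherwise 0.
        vars.define("workingAge", r -> (r.get("age") >= 16 && r.get("age") < 65) ? 1 : 0);

        Map<String, Integer> person = new LinkedHashMap<>();
        person.put("age", 42);
        System.out.println(vars.extend(person));
    }
}
```

A novice never needs to see this layer; it only matters to the expert who would otherwise turn to SPSS or SIR to compute a new variable.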
Using Java's Remote Method Invocation (RMI), adding links to other existing statistical packages is relatively easy. The industry-standard CORBA can also be used to combine different packages. The standard system should provide everything that 90% of users require.
An interface must be provided that allows an average undergraduate to produce a moderately complex table with less than five minutes instruction.
The system's functionality should be easily extensible to meet new and unforeseen needs, as well as the needs of minority users.
Java also has the advantage that experienced programmers can customise the program as they require, and users who require extra features but lack programming knowledge will have no difficulty finding a Java programmer at a reasonable cost within their own institution.
Any package developed must be extremely portable between machines. There is no longer any need to tie users to a central system or an approved operating system.
Again the modular object-orientated format of Java would make it very simple to provide the hooks that users could use to add their own output formats or even links to packages that were Java compliant.
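A minimal sketch of such a hook follows, under stated assumptions: the names `OutputFormats` and `TableWriter` are hypothetical, and the two built-in formats do no quoting or escaping of field values. A table is treated simply as a list of rows of strings, and each output format is an object registered under a name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of pluggable output formats; names are illustrative only.
final class OutputFormats {

    private final Map<String, TableWriter> writers = new TreeMap<>();

    OutputFormats() {
        // Two simple built-in formats; fields are written as-is, without quoting.
        register("csv", (rows, out) -> writeDelimited(rows, out, ","));
        register("tsv", (rows, out) -> writeDelimited(rows, out, "\t"));
    }

    /** The hook: users add their own format by supplying a TableWriter. */
    void register(String name, TableWriter writer) {
        writers.put(name, writer);
    }

    /** Renders the table in the named format and returns it as text. */
    String render(String name, List<String[]> rows) {
        TableWriter writer = writers.get(name);
        if (writer == null) {
            throw new IllegalArgumentException("Unknown output format: " + name);
        }
        StringBuilder text = new StringBuilder();
        try {
            writer.write(rows, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    private static void writeDelimited(List<String[]> rows, Appendable out, String sep)
            throws IOException {
        for (String[] row : rows) {
            out.append(String.join(sep, row)).append('\n');
        }
    }

    public static void main(String[] args) {
        List<String[]> table = Arrays.asList(
                new String[] {"ageBand", "count"},
                new String[] {"0-15", "2"});
        System.out.print(new OutputFormats().render("csv", table));
    }
}

/** One output format: writes every row of the table to the given destination. */
interface TableWriter {
    void write(List<String[]> rows, Appendable out) throws IOException;
}
```

A user who needs a format the package does not supply would call `register` with their own `TableWriter`, leaving the rest of the program untouched.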
Users must be able to use the output of the SARs in a form readable by any program of their choice.
For the 1991 SARs, documentation was provided as hard copy and dealt only with a few specific packages. In 2001 it makes more sense for the documents relating to the SAR variables and coding to be provided in HTML, both from a central site and in a form that can be bundled with other programs. This will allow users to check a definition while the program is running, wherever they are, as the documents could either be available locally or be reached at the central site over the internet.
Documentation should be made available in machine readable form at no cost and with no restrictions on copying to developers and users to make access to the SARs as easy as possible.
There are more survey datasets available to the academic, public and commercial sectors than the SARs. It is therefore important that suitable software is provided, either in the package or as an easy-to-use add-on, to import a variety of datasets. In the academic sector these include the General Household Survey, the Labour Force Survey and the New Earnings Survey. Local government departments may be interested in using the package to access council tax registers or other internal datasets, whereas commercial sector companies and academic departments may be interested in carrying out similar research on the SARs and on lifestyle databases.
The package developed should be able to read other survey datasets easily.
In conclusion, it is likely that in 2001 the SARs will be larger and more complex than in 1991. However, many of the packages used to analyse the 1991 SARs lacked features that researchers needed, were being worked at their extreme limits, or ran almost too slowly to be useful.
In 1991 an academically produced package provided the cheapest, easiest and most flexible means of analysing the SARs. There is no reason why a similar package, building on the developments in computing power and programming languages, cannot be produced for the 2001 SARs. If possible, work should start on this development before the 2001 census, rather than waiting for the SAR to be released as happened with the 1991 census.
The package developed must be easy to use, cheap, powerful, portable and flexible in both its inputs and outputs. If all these aims can be achieved then the 2001 SARs will reach a much larger audience than the 1991 SARs managed.
The translation was initiated by Ian Turton on 5/11/1998