SAR Interfaces 2001

Ian Turton
Centre for Computational Geography
University of Leeds
Leeds, LS2 9JT
ian@geog.leeds.ac.uk

Background

A major UK census milestone was reached in 1993 with the release of Britain's first official sample of anonymised census microdata. This amounts to a 3% sample of the 1991 census released in anonymous form, without names or addresses and coded only to large geographical areas. The release followed almost a decade of discussion and debate about the content of, need for and confidentiality of possible microdata from the census.

Turton and Openshaw (1995) agree with Marsh (1993), who summarised the advantages of the SAR as follows: "In the final analysis, however, the value of the SAR lies in the fact that, in contrast to tabulations produced by the Census Offices, we do not have to plan in advance all the useful ways in which the data can be presented. ... Almost every census user will have experienced the frustration of finding that the table they required was slightly different from the one they in the end had to use." (p297). This is a slight exaggeration. Nevertheless, it is the historically relatively small number of essentially non-spatial census users who will probably benefit most from the SARs; indeed, this group may well be instrumental in opening up the census data resource to a broader social science constituency [Openshaw 1994].

Lessons from 1991

The SAR is a big dataset, though not massive by modern standards. The original SAR data files amount to 126 MB (79 MB for the 2% sample of individuals and 47 MB for the 1% sample of households). Loaded into SPSS, the system files amount to 93 MB; loaded into SAS, they grow to 176 MB.

The original intent in 1991 was to load the SAR data files on to a large mainframe at Manchester Computing Centre (MCC) and to provide users with a table output service built on the same relational database package (Model 204) that the UK census agencies used to process the 1991 census. Indeed, the Census Microdata Unit at the University of Manchester was established in 1992 to provide three sorts of dissemination service relating to the SARs:

  1. an online service, available over a network using database software such as SIR and Quanvert and statistical packages such as SPSS and SAS;
  2. distribution of the raw SAR data; and
  3. a customised tabulation service.

Turton and Openshaw (1995) describe a fourth dissemination path, developed by an ESRC research project (1992-94). The idea was to develop portable data access software designed for use with the SAR, which could be transferred by FTP over JANET to any Unix workstation with sufficient disc space. The aim was to provide an alternative standard means of accessing and using the SAR that was likely to become increasingly relevant during the 1990s. Indeed, the continued fall in the cost of workstation hardware emphasised the importance of storing the SAR data in a form suitable for both Unix workstation and PC environments. This gave academics (and others who buy the data) a very different access path, based on distributing the SAR data together with specialist software designed to let both expert and non-expert users obtain maximum benefit from the SAR datasets easily and quickly. It was thought likely that experienced and enthusiastic SAR users would increasingly want the SAR data available on their local Unix and PC systems, and that they would find specially developed SAR software of considerable practical assistance as a complement to the widely available general purpose statistical packages.

In the dissemination of the 1991 census data it was seen as important that "the data are: easily accessed with a minimum of alien computer control language, analysis is fast, the data is largely self-documenting, and a wide range of different social scientist skill levels can be catered for. Access has to be easy, straight forward and intuitively obvious, but also table creation is not sufficient by itself, so new tools are needed to help users cope with the SARs." [Turton and Openshaw 1995].

In 1991 the SARs were accessed through a variety of software packages, the majority of them text based, reflecting the slower network speeds common at the time. Commercial packages already familiar to academic users, such as SPSS, SIR and SAS, were provided at MIDAS. MIDAS also provided another commercial package, Quanvert, which proved unpopular with users because it was far from obvious how to use it and it did not treat all variables equally; for example, geographic areas and occupations had to be row rather than column variables. USAR was provided as free software to anyone who had legal access to the SARs. As a new package it suffered from slow take-up among established users of survey data, but it proved very popular with novice users as it was menu driven and would run on a variety of machines from PCs to Unix mainframes.

Background to the 2001 SARs

Potential SARs in 2001

Dale and Elliot (1998) discuss a series of possible SAR formats for the 2001 Census. Their key proposal is to increase the level of geographic detail by lowering the threshold size in the individual sample to 90,000 people (from 120,000 in 1991). They also suggest that increasing the sample size from 2% to 3% would have little effect on confidentiality, and report experiments on the potential for a third SAR with a more detailed geography, of the order of wards, but with many variables recoded.

These proposals would increase the size and complexity of the datasets that SAR access software has to handle. This is of no real consequence, however: machine speeds have increased by at least an order of magnitude since 1992, and a further doubling or tripling can be expected by the time the 2001 SARs are available.

User requirements

Users of the 2001 SARs are likely to have many of the same requirements as users of the 1991 SARs. Brown and Dale (1998) report the results of a survey of SAR users and non-SAR users. The majority of users were making use of the SARs as a research tool, with only one third using them for teaching.

When questioned on ease of use by Brown and Dale (1998), only 21% of respondents said that existing systems were easy to use, though in some cases the difficulty related to the documentation or to the Unix system used at Manchester. There were also complaints about the time taken to produce multiple tables.

Users reported a wide range of systems for accessing the SARs, from a large central service (MIDAS) to stand-alone PCs. This reflects the rapid advances in computing power that have occurred since the 1991 SARs were first thought of. Many users expressed a preference for the data to be delivered either on CD-ROM or via the World Wide Web.

Proposals

One package or many?

While it would be possible to produce the SARs in a format that could be accessed by only one package, this would be very unwise. A single package will never satisfy the needs of all users: it would be too complex for novices and too limiting for experienced users. There is also the question of how to check the accuracy of the package. If there is only one answer, how do we know that it is the correct one?

Proposal 1

More than one package is essential to the viability of the SARs in order to provide a robust and flexible solution.

Specialist or general software?

There are a variety of general statistical packages available that are capable of handling the whole 1991 SAR or some subset of it. However, many of these packages have problems. Some can only use a subset of the data, which is fine if that is all you need but a problem when you suddenly need to compare your results with the national figures. Other packages restrict how some of the variables can be used; for instance, the district variable in the 1991 individual SAR has too many categories for some packages to handle.

In contrast, USAR, which was developed specially for the SARs, has none of these problems. Users can use as much or as little of the dataset as they wish. No variable is restricted and all variables can be arbitrarily grouped. USAR also adds extra features that allow users to explore the data more fully (Turton and Openshaw, 1994; Openshaw and Turton, 1996).

So while many packages give users access to the SARs, only specialised software allows the full richness of the data to be exploited.
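
The arbitrary grouping of variables mentioned above can be illustrated with a short Java sketch. This is not USAR's implementation: the class name `CrossTabSketch`, the representation of a record as an array of integer codes, and the age bands are all assumptions made for illustration. The point is only that a grouping is a function applied at tabulation time, so no variable needs to be recoded in the data itself.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;

// Hypothetical sketch: records are int arrays of category codes.
// This is not USAR's actual data model.
class CrossTabSketch {

    // Count records by (row group, column group). Each grouping function
    // maps a raw code to a user-chosen label, so any variable can be
    // regrouped without changing the stored data.
    static Map<String, Map<String, Integer>> tabulate(
            int[][] records,
            int rowVar, IntFunction<String> rowGroup,
            int colVar, IntFunction<String> colGroup) {
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (int[] rec : records) {
            String r = rowGroup.apply(rec[rowVar]);
            String c = colGroup.apply(rec[colVar]);
            table.computeIfAbsent(r, k -> new TreeMap<>())
                 .merge(c, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        // Made-up records: {age in years, sex code (1 = male, 2 = female)}.
        int[][] records = { {23, 1}, {67, 2}, {45, 2}, {15, 1}, {70, 1} };
        Map<String, Map<String, Integer>> table = tabulate(
                records,
                0, age -> age < 16 ? "0-15" : age < 65 ? "16-64" : "65+",
                1, sex -> sex == 1 ? "male" : "female");
        System.out.println(table);
    }
}
```

Changing the age bands means passing a different function to `tabulate`; the records themselves are untouched.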

Proposal 2

Specialist software allows more users more access to more of the SARs.

Following on from this proposal, the remainder of this paper considers only the features that a specially designed package must provide to be of use to the user community.

How much should this cost?

At present the 1991 SARs can be accessed by a variety of commercial and free academically created packages. While cost is seldom an issue for a centrally provided academic service, it is to be expected that in 2001 more academic sites will wish to make use of the data locally, and the concerns of the public sector as to the costs of the data as well as the cost of software must be considered [Brown and Dale 1998]. It would seem that the cheapest method of providing access to the SARs is via academically produced software, provided that any concerns over quality, reliability, performance and usefulness can be met.

Proposal 3

At least one method of accessing the SARs must be free or very cheap, so as not to deter users.

Ease of use and flexibility

It is of no use to anyone if the software package produced is so flexible and feature loaded that it is impossible for a novice user to produce a table in less than 5 minutes from first meeting the package. However a package must be capable of being extended so that an expert user can carry out the more complex tasks that they commonly turn to SPSS or SIR for at present. This must include the ability to produce new variables and if necessary the ability to add new functions easily.

Using Java's remote method invocation (RMI), adding links to other existing statistical packages is relatively easy. The industry standard CORBA can also be used to combine different packages. The standard system should provide everything that 90% of users require.

Proposal 4

An interface must be provided that allows an average undergraduate to produce a moderately complex table with less than five minutes' instruction.

Proposal 5

The system's functionality should be easily extensible to meet new and unforeseen needs, as well as the needs of minority users.
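
One way to picture the extension point that Proposal 5 asks for is a small plug-in interface for derived variables. The sketch below is hypothetical throughout: `ExtensionSketch`, `DerivedVariable`, `Registry` and the record layout are illustrative names and assumptions, not part of any existing SAR package.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an extension point for new variables.
class ExtensionSketch {

    // A derived variable computes a new code from the codes of one record.
    interface DerivedVariable {
        int compute(int[] record);
    }

    // Expert users register new variables by name; the core package
    // does not need to change when a variable is added.
    static class Registry {
        private final Map<String, DerivedVariable> derived = new LinkedHashMap<>();

        void register(String name, DerivedVariable variable) {
            if (derived.containsKey(name)) {
                throw new IllegalArgumentException("already defined: " + name);
            }
            derived.put(name, variable);
        }

        int value(String name, int[] record) {
            DerivedVariable variable = derived.get(name);
            if (variable == null) {
                throw new IllegalArgumentException("unknown variable: " + name);
            }
            return variable.compute(record);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // Made-up layout: record[0] is age in years.
        registry.register("aged65plus", rec -> rec[0] >= 65 ? 1 : 0);
        System.out.println(registry.value("aged65plus", new int[] {70, 3}));
    }
}
```

Because `DerivedVariable` has a single method, a novice can supply a one-line expression while an expert can supply a full class, which is the balance between ease of use and flexibility argued for in this section.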

Local or central provision?

As the power of desktop PCs increases, more and more users will wish to store and access the SARs from their desktop. However, there will always be users who either prefer to let a central site take care of their computing or have very compute intensive tasks that need a more powerful machine. These users will want a package that can run on as many machines as possible, which will also be a requirement in many departments. For example, the CCG already has 3 types of Unix and 3 versions of MS Windows to deal with in a single building. Users will not want to learn a different package for each machine, and their support teams will not want to support 6 different systems. It will therefore probably make sense to develop a specialist program in Java for portability. This also allows a single copy of the program to be developed and placed on a CD-ROM for distribution without regard to the target machine. The package could be produced as a stand-alone application or as an applet that a user could load from a local disk (or a remote site, if a suitable security model can be devised) with their existing web browser.

Java also has the advantage that experienced programmers can customise the program as they require, and users who require extra features but lack programming knowledge will have no difficulty finding a Java programmer at a reasonable cost within their own institution.

Proposal 6

Any package developed must be extremely portable between machines. There is no longer any need to tie users to a central system or an approved operating system.

Outputs

Users now require a bewildering variety of outputs, from simple tables to graphs and maps. The package developed should not attempt to provide all, or even many, of these possible outputs. Instead it should export a small set of common interchange formats, e.g. rich text format for tables, comma separated values for graphs, Arc/Info ungenerate format for maps, etc.

Again the modular object-orientated format of Java would make it very simple to provide the hooks that users could use to add their own output formats or even links to packages that were Java compliant.
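
The hooks for user-supplied output formats might look something like the following sketch. The names `OutputSketch`, `TableFormat` and `CsvFormat` are hypothetical, and only the comma separated values case is shown; other formats would be further implementations of the same interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of pluggable output formats.
class OutputSketch {

    // One implementation per interchange format.
    interface TableFormat {
        String render(String[] header, String[][] rows);
    }

    // Comma separated values. Every field is quoted, and embedded quotes
    // are doubled, so commas and quotes inside labels survive the export.
    static class CsvFormat implements TableFormat {
        public String render(String[] header, String[][] rows) {
            StringBuilder out = new StringBuilder();
            appendLine(out, header);
            for (String[] row : rows) {
                appendLine(out, row);
            }
            return out.toString();
        }

        private static void appendLine(StringBuilder out, String[] fields) {
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) out.append(',');
                out.append('"').append(fields[i].replace("\"", "\"\"")).append('"');
            }
            out.append('\n');
        }
    }

    // Users add their own formats by putting another entry in this map.
    static final Map<String, TableFormat> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put("csv", new CsvFormat());
    }

    public static void main(String[] args) {
        String[] header = {"age group", "male", "female"};
        String[][] rows = { {"16-64", "1", "1"}, {"65+", "1", "1"} };
        System.out.print(FORMATS.get("csv").render(header, rows));
    }
}
```

The core package only ever calls `render`, so it never needs to know which formats exist.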

Proposal 7

Users must be able to use the output of the SARs in a form readable by any program of their choice.

Documentation

For the 1991 SARs, documentation was provided as hard copy and dealt only with a few specific packages. In 2001 it makes more sense for the documents relating to the SAR variables and coding to be provided in HTML, both from a central site and in a form that can be bundled with other programs. This will allow a user to check a definition while the program is running, wherever they are, as the documents could either be available locally or be reached at the central site over the internet.

Proposal 8

Documentation should be made available in machine readable form at no cost and with no restrictions on copying to developers and users to make access to the SARs as easy as possible.

Beyond the SARs

There are more survey datasets available to the academic, public and commercial sectors than the SARs. It is therefore important that suitable software is provided, either in the package or as an easy to use add-on, to import a variety of datasets. In the academic sector these include the General Household Survey, Labour Force Survey and New Earnings Survey. Local government departments may be interested in using it to access council tax registers or other internal datasets, whereas commercial sector companies and academic departments may be interested in carrying out similar research on the SARs and lifestyle databases.

Proposal 9

The package developed should be able to read other survey datasets easily.
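
As a rough illustration of what reading other datasets could involve, the sketch below describes a dataset by a layout of fields and parses one line into integer codes. It assumes, purely for illustration, that the survey is supplied as fixed-width text; the class `ImportSketch`, the `Field` type and the example layout are hypothetical.

```java
// Hypothetical sketch: a survey file is described by a layout, so a new
// dataset needs a new layout rather than new code.
class ImportSketch {

    // Position of one variable within a fixed-width text line.
    static class Field {
        final String name;
        final int start;  // zero-based column where the field begins
        final int width;  // number of characters in the field

        Field(String name, int start, int width) {
            this.name = name;
            this.start = start;
            this.width = width;
        }
    }

    // Turn one line of a survey file into integer codes, one per field.
    static int[] parse(String line, Field[] layout) {
        int[] record = new int[layout.length];
        for (int i = 0; i < layout.length; i++) {
            Field f = layout[i];
            String raw = line.substring(f.start, f.start + f.width).trim();
            record[i] = Integer.parseInt(raw);
        }
        return record;
    }

    public static void main(String[] args) {
        // Made-up layout: a three-character age, then a one-character sex code.
        Field[] layout = { new Field("age", 0, 3), new Field("sex", 3, 1) };
        int[] record = parse(" 672", layout);
        System.out.println(record[0] + " " + record[1]);
    }
}
```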

Conclusions

In conclusion, it is likely that in 2001 the SARs will be larger and more complex than in 1991. However, many of the packages used to analyse the 1991 SARs lacked features that researchers needed, were working at the extreme limits of their capabilities, or were almost too slow to be useful.

In 1991 an academically produced package provided the cheapest, easiest and most flexible means of analysing the SARs. There is no reason why a similar package, building on developments in computing power and programming languages, cannot be produced for the 2001 SARs. If possible, work on this development should start before the 2001 census, rather than waiting for the SAR to be released as happened with the 1991 census.

The package developed must be easy to use, cheap, powerful, portable and flexible in both its inputs and outputs. If all these aims can be achieved then the 2001 SARs will reach a much larger audience than the 1991 SARs managed.

References

Brown, M. and A. Dale (1998).

A survey of SAR users, their requirements for the 2001 SARs and their views on dissemination and support.
Working Paper No. 6, The Cathy Marsh Centre for Census and Survey Research, Faculty of Economics, University of Manchester, Manchester, M13 9PL.

Dale, A. and M. Elliot (1998).

A report on the disclosure risk of proposals for SARs from the 2001 Census.
Working Paper No. 5, The Cathy Marsh Centre for Census and Survey Research, Faculty of Economics, University of Manchester, Manchester, M13 9PL.

Marsh, C. (1993).

The sample of anonymised records.
In A. Dale and C. Marsh (Eds.), The 1991 Census User's Handbook, pp. 295-311. HMSO, London.

Openshaw, S. (1994).

Census Users' Manual.
Longmans, London.

Openshaw, S. and I. Turton (1996).

New opportunities for geographical census analysis using individual level data.
Area 28.8, 167-176.

Turton, I. and S. Openshaw (1994).

A Step-by-Step Guide to Accessing the 1991 SAR via USAR.
School of Geography, University of Leeds: Working Paper - 94/6.

Turton, I. and S. Openshaw (1995).

Putting the 1991 Census Sample of Anonymised Records on your Unix workstation.
Environment and Planning A 27, 391-411.

The translation was initiated by Ian Turton on 5/11/1998

