1. What is classification?

Classification is a data reduction tool. Census data for 150,000 EDs and 80 variables are reduced to a single cluster code (1 to M) which is attached to each ED: an 80-fold reduction! The aim is to group together those EDs that have a "similar" profile in terms of the 80 variables used here, so that they share the same cluster code. Clearly the quality of this grouping or classification depends on both the method and the number of groups used.


2. What is a cluster in LGAS?

A cluster is a collection of EDs or postcodes that are judged to be similar in terms of the variables being used and the method being applied. Similar does not mean identical, and indeed clusters composed of identical EDs or postcodes would be very rare. Other names for cluster are group or type, or, in the present census data context, residential area type. How 'similar' is 'similar' is relative and depends on the number of clusters being used and the data analysed.


3. What methods of classification can be found in LGAS?

Four methods of classification have been used to create the census classifications available in the Leeds GAS:

  1. the CCP programs of Openshaw (1984) and Openshaw and Wymer (1995),
  2. a neurospatial classifier described in Openshaw (1993) and Openshaw et al. (1995),
  3. a crisp k-means classifier, and
  4. a fuzzy k-means classifier.

Most classifications have been produced using CCP but there are some neurocomputing-based and k-means alternatives for comparison.

The Census Classification Programs (CCP)

There are many methods of classifying multivariate data sets. The one most commonly used within the geodemographics industry is an iterative relocation algorithm based on an error sum of squares measure. The basic idea is to iteratively improve a random or other starting classification. Each case is examined to see whether a move to any of the other clusters would improve the within-cluster sum of squares criterion, and whichever move results in the greatest improvement is made. An iteration is complete when all the cases have been processed. A stable classification is obtained when no move occurs during a complete iteration.
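
In outline, the relocation step can be sketched in a few lines of Python. This is a minimal illustration, not the actual CCP code; it recomputes centres once per pass and moves each case to its nearest centre, which for fixed centres is the move that most improves the criterion:

  # A minimal sketch of the iterative relocation idea (illustrative
  # only; not the actual CCP implementation).
  import numpy as np

  def iterative_relocation(X, k, seed=0):
      rng = np.random.default_rng(seed)
      labels = rng.integers(0, k, size=len(X))   # random starting classification
      while True:
          # recompute cluster centres (re-seed any empty cluster at random)
          centres = np.array([X[labels == c].mean(axis=0) if (labels == c).any()
                              else X[rng.integers(len(X))] for c in range(k)])
          moved = False
          for i in range(len(X)):                # examine each case in turn
              d = ((X[i] - centres) ** 2).sum(axis=1)
              best = int(d.argmin())
              if best != labels[i]:
                  labels[i] = best               # make the improving move
                  moved = True
          if not moved:                          # stable: no move in a full pass
              return labels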

The Neuroclassification Procedure

Openshaw (1994) argues that the use of an unsupervised neural net based on Kohonen's self-organising map (Kohonen, 1984) provides the basis for a much more sophisticated approach to spatial classification, one that reduces the number of assumptions that have to be made and neatly incorporates many of the sources of data uncertainty; see Openshaw (1994a). A basic algorithm is as follows:

  1. Define the geometry of the self-organising map to be used and its dimensions. Here a grid with 8 rows and 8 columns is used.
  2. Initialise a vector of M weights (one for each variable) for each of the 8 by 8 neuronal processing units.
  3. Define the parameters that control the training process: block neighbourhood size, training rate, and number of training iterations.
  4. Select a census ED at random but with a probability proportional to its population size.
  5. Randomise the vector of M variable values to incorporate data uncertainty computed for each variable separately (optional).
  6. Identify the neuron which is "closest" to the input data.
  7. Update the winning neuron weights and those of all other neurons in its block neighbourhood or vicinity.
  8. Slightly reduce the training rate and the block neighbourhood size.
  9. Repeat steps 4 to 8 a very large number of times.

If Step 4 is replaced by a sequential selection process and Step 5 is ignored, then the algorithm is essentially the same as a k-means classifier, with a few differences due to the neighbourhood training, which might well be regarded as a form of simulated annealing and may well provide better results by avoiding some local optima. However, from a geographical perspective Step 4 is extremely important because it provides a means of explicitly incorporating spatial data uncertainty into the classification process.
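
The following Python fragment sketches the shape of this training loop (Steps 1 to 9). The 8 by 8 grid follows the text; the training rate, neighbourhood schedule and iteration count are illustrative assumptions, and the optional Step 5 perturbation is omitted:

  # A compact sketch of the self-organising map training loop
  # (steps 1-9 above); rates and schedules are illustrative.
  import numpy as np

  def train_som(X, pop, n_iter=1_000_000, rows=8, cols=8, seed=0):
      rng = np.random.default_rng(seed)
      W = rng.random((rows, cols, X.shape[1]))            # step 2: initial weights
      r = np.arange(rows)[:, None]                        # neuron grid coordinates
      c = np.arange(cols)[None, :]
      p = pop / pop.sum()                                 # population-size weights
      for t in range(n_iter):                             # step 9: many iterations
          alpha = 0.5 * (1.0 - t / n_iter)                # step 8: decaying rate
          radius = max(1, round(3 * (1.0 - t / n_iter)))  # shrinking block size
          x = X[rng.choice(len(X), p=p)]                  # step 4: ED drawn by population
          d = ((W - x) ** 2).sum(axis=2)
          wr, wc = np.unravel_index(d.argmin(), d.shape)  # step 6: winning neuron
          block = (np.abs(r - wr) <= radius) & (np.abs(c - wc) <= radius)
          W[block] += alpha * (x - W[block])              # step 7: update the block
      return W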

The method also provides a very natural means of handling cluster fuzziness without having to impose an arbitrary metric, since the distance between the best and the next-best neurons can be readily measured. The simplicity of the self-organising map approach readily lends itself to ad hoc modification designed to improve the quality of the geographic representation offered by the classification. There are various ways of meeting this objective. The simplest is to select an ED, as in the standard algorithm described previously, but then to use a distance-weighted average value for the k nearest EDs. This neighbourhood in geographic ED space is gradually reduced as the block neighbourhood in the self-organising map's topological space is also reduced, slowly over many millions of iterations. The logic is to incorporate some notion of local geographical neighbourhood structure into the classification. Here the geographic neighbourhood is limited to the 10th nearest neighbour of each ED.

Another way of attempting the same objective is to change the updating mechanism (see Steps 6 and 7 in the basic algorithm) so that the neurons assigned to the kth nearest geographical neighbours of the ED being used for training are also updated, irrespective of whether these neurons are within the block neighbourhood of the winning neuron. Experimentation suggests that this 'OR' rule is slightly better than the 'AND' rule. Equally, restricting the neuron updating to only the geographical-neighbour related neurons yielded slightly poorer results. However, the resulting classifications seemed to offer levels of descriptive resolution equivalent to conventional cluster systems with many more clusters in them.

The principal disadvantage of neuroclassification is the computationally intensive nature of the method. If the technique is to properly handle and represent the 150,000 cases then large numbers of training iterations (Step 3) are required. In a census application an ability to represent the data is much more important than any generalisation to unseen data, since there is none. This requires many millions of training iterations; indeed, runs of up to one billion iterations have been investigated. In practice this means that parallel implementations are required, and a parallel supercomputing version is under development. However, it is worth noting that a conventional classification of 150,000 EDs may well require 200 passes through the data. This does not seem much, but it would nevertheless be equivalent to 30 million training iterations, and the conventional classifier is much harder to parallelise or vectorise in any worthwhile manner.

K-means classifiers (crisp and fuzzy)

In crisp k-means clustering, iterative optimisation is used to minimise the objective function for n observations and k clusters, which is given by the following equation,

J(U, V) = Σ (i = 1 to k) Σ (j = 1 to n) χij dij²

which involves finding the best crisp partition matrix, U, that minimises the Euclidean distance, dij, between the observations and the cluster centers, vi, calculated by the expression,

dij = sqrt( Σ (l = 1 to m) (xjl - vil)² )

where m is the number of variables in the data set. For a given observation, the partition matrix indicates cluster membership by a crisp characteristic function, χij, that assigns 1 to the cluster to which it belongs and 0 to all others. The fuzzy c-means algorithm (also referred to as fuzzy k-means) is a direct fuzzification of this crisp methodology. The fuzzy partition matrix is soft and therefore allows membership values to range between 0 and 1. As a result, partial membership is allowed in more than one cluster at the same time, with the restriction that the memberships across all clusters for a given observation must total 1. The objective function for iterative optimisation in the fuzzy c-means algorithm is modified as follows:

J(U, V) = Σ (i = 1 to k) Σ (j = 1 to n) (μij)^f dij²

where the crisp characteristic function, χij, is replaced by a fuzzy membership function, μij, and f is a fuzziness factor that ranges from 1 to 2. A value of f = 1 results in a crisp classification. As this value increases, the fuzziness of the classification increases until μij = 1/k for all membership values. When this occurs, all or the majority of the observations belong to each cluster equally, resulting in too much fuzziness. There is currently no established theory as to the optimal choice of this parameter (Bezdek et al., 1984), since it is probably related in part to the actual data set. However, Feng and Flowerdew (1998a) used a value of 1.25, which produced satisfactory results in clustering Lancashire data at the ED level.

Before applying the classification, the number of clusters must be chosen. Although there are different heuristics or cluster validity techniques available to aid in determining this number, the choice can still be a fairly subjective task and may require some experimentation. Furthermore, the intended purpose of the classification should be considered in the choice. For example, to characterise household types using data from the census, a number must be chosen that provides adequate discrimination while remaining small enough to ensure sensible cluster center labelling (Openshaw, 1983). Once the number is chosen, the fuzzy c-means algorithm can be implemented as follows:

  1. Randomly initialise the partition matrix of size n observations by k clusters with values ranging between 0 and 1.
  2. Select a fuzziness factor and raise each element of the partition matrix by this value.
  3. Scale each row of the partition matrix so it totals to 1.
  4. Calculate the cluster centers using the following formula:

    vi = Σ (j = 1 to n) (μij)^f xj / Σ (j = 1 to n) (μij)^f

    where this equation is a fuzzified version of the crisp k-means centre calculation, in which the crisp characteristic function is replaced by the fuzzy membership function.

  5. Update each element of the partition matrix as follows:

    μij = 1 / Σ (l = 1 to k) (dij / dlj)^(2/(f-1))

    and then scale the elements in each row so that they total to 1.

  6. Check the termination criteria. If a maximum number of iterations specified by the user has been exceeded or if the difference between Jt and Jt-1 is less than a specified criterion for convergence, stop the algorithm. Otherwise go to step 3 and repeat until the termination criteria are satisfied.
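
A minimal Python sketch of this loop, following the steps above, is given below (illustrative only; it assumes f > 1, since the membership update divides by f - 1):

  # A minimal sketch of the fuzzy c-means loop in steps 1-6 above
  # (illustrative only; assumes a fuzziness factor f > 1).
  import numpy as np

  def fuzzy_c_means(X, k, f=1.25, max_iter=300, tol=1e-6, seed=0):
      rng = np.random.default_rng(seed)
      U = rng.random((len(X), k))                   # step 1: random n x k partition
      U /= U.sum(axis=1, keepdims=True)             # step 3: rows sum to 1
      J_old = np.inf
      for _ in range(max_iter):
          Uf = U ** f                               # step 2: apply fuzziness factor
          V = (Uf.T @ X) / Uf.sum(axis=0)[:, None]  # step 4: fuzzy cluster centres
          d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
          d = np.fmax(d, 1e-12)                     # guard against zero distances
          J = (Uf * d ** 2).sum()                   # objective, for the stopping test
          U = 1.0 / d ** (2.0 / (f - 1.0))          # step 5: update memberships...
          U /= U.sum(axis=1, keepdims=True)         # ...and rescale rows to sum to 1
          if abs(J_old - J) < tol:                  # step 6: termination criteria
              break
          J_old = J
      return V, U                                   # centres and final partition matrix

Hardening the final partition matrix is then a one-liner, e.g. U.argmax(axis=1) for the max-membership rule described below.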

The fuzzy c-means method produces a matrix containing the cluster centers and a final partition matrix, which contains the membership values of each observation in each cluster. This partition matrix can also be defuzzified or hardened using the max-membership method or the nearest-center classifier (Ross, 1995). In the max-membership method, the largest membership value for each observation is assigned a value of unity and all the others are assigned zero. In the nearest-center classifier, each data point is assigned to the class whose cluster center is nearest, i.e., at the minimum Euclidean distance from that data point.

No Hierarchical Typologies

None of these classifiers creates hierarchical classifications; e.g., a 50 cluster solution that can be agglomerated into 25 or 10 "higher order" categories. The reason is simple. Stan is a fervent believer that whilst hierarchical classifications look nice (viz. dendrogram plots), they are far poorer than classifications created from scratch for each smaller number of clusters.


4. How is census geography organised?

The Census geography is composed of a hierarchy of spatial units. There are four basic levels within this hierarchy.

                    Wales   England   Scotland
Counties/Regions        8        46         12
Districts              37       366         56
Wards                 945     8,985      1,158
EDs/OAs             6,330   106,866     38,255

At the lowest level is the enumeration district (ED), or the Output Area (OA) in Scotland. These form the 'building bricks' for the main statistical output of the 1991 census. The difference between the two is that the ED was designed around how many households an individual enumerator could visit in a day in a particular region, while the OA is composed of aggregations of unit postcodes. EDs are primarily designed for data collection, whereas OAs are designed for data output. Both OAs and EDs nest within electoral wards, which form the primary output unit for the census statistics.

It is well known that EDs are a non-ideal geography (because of their variability in size, their social heterogeneity, and the lack of a completely accurate relationship with postcode geography), whilst census data, collected only once every 10 years, lack some important questions (e.g., income) and are a mix of 100% and 10% coded variables. However, EDs are the smallest spatial units for which the 1991 census was published, so they are the best that can be managed in the 1990s.

You need to bear in mind that the objects being classified here are not people but areas (i.e., EDs). ED profiles are composed of an aggregation of the census characteristics of all the individuals who lived there in April 1991.


5. What data were used in the classifications?

GB Profiles was developed from a list of 80 variables. These are listed as follows:

Aged 0 to 4
Aged 5 to 14
Aged 15 to 24
Aged 25 to 44
Aged 45 to 64
Aged 65 to 74
Aged 75 to 84
Aged 85
Total married population
Single population
Total retired (pensioners)
Working Women (excluding Govt Sch.) (S08)
Total Lone Parents
Students (16+) in term-time addresses
White
Black
Indian
Pakistani
Bangladeshi
Chinese and Others
Black (grps) and Owner & privately rented
Indian, Pakistani, Ban'deshi and Owner & privately rented
Chinese & others and Owner & privately rented
Black (grps) and council rented
Indian, Pakistani, Ban'deshi and council rented
Chinese & others and council rented
Movers last year
Pensioner migrants
Owned Outright
Mortgaged
Private Rented
Rented from HA, LA, NT
Detached Housing
Semi-detached Housing
Terraced Housing
Flats
Bedsits
No central heating
Lacking bath and shower
No car
2 or more cars
Hlds with more than 1.5 ppr
Number of Hhlds with 7+ rooms
Couple hhld, aged 16-24 without child(ren)
Couple hhld, aged 16-24 with child(ren)
Couple hhld, aged 25-34 without child(ren)
Couple hhld, aged 25-34 with child(ren)
Couple hhld, aged 35-54 without child(ren)
Couple hhld, aged 35-54 with child(ren)
Couple hhld, aged 55-75 plus
No Family Hhlds & Owner
No Family Hhlds & Council
Marr'd + cohabiting Couple no children & Owner
Marr'd + cohabiting Couple no children & Council
Marr'd + cohabiting Couple + dependent children & Owner
Marr'd + cohabiting Couple + dependent children & Council
2+ Family Hhlds & Owner
2+ Family Hhlds & Council
Hhlds with dependants
Economically active residents 16+
Self-employed
Total e.a. unemployed
Agric/Forestry/Fishing
Energy, Water & Mining
Manufacturing
Construction
Distribution & Catering
Transport
Banking & Finance
Professional (1,2,3,4)
Intermediate & Junior Non-manual (5,6)
Manual (8,9,12; 7,10; 11)
Agricultural (13, 14, 15)
Armed Forces (16)
Workers with higher degrees
Workers with other qualifications
Total persons with LLI (S12)
Train or Bus to work
Car to work
Work at home
Total Imputed Residents
Medical & Care Ests.
Detention centres & Defence Est.
Education Ests.

The selection of these variables is documented in Blake and Openshaw (1995). In brief it was based on an amalgam of the following:

Note that the selection of variables is subjective, reflecting a desire to provide a representative coverage of the range of available 1991 census variables. There is no unique or correct single set of variables to use; the final selection reflects personal experience, intuition, prejudice, common sense, and undisclosed value systems. Classification is an art rather than a science; indeed, the principal scientific aspect concerns its replicability.

The data were assembled for all census EDs in England, Wales, and Scotland, ignoring EDs with suppressed data. The result was a set of 145,736 EDs.


6. What are the issues involved in postcode geography?

The UK Postcode is a summary of an address in a form which can be read by a computer and thus enables mail to be sorted automatically. It consists of a group of letters and numbers whose format conforms to a set of standards. Every address in the UK to which the Royal Mail delivers has been given a postcode. Most postcodes represent a small group of households and therefore subdivide the UK into small areas.

The Postcode geography, like the Census geography, consists of a group of nested spatial units, which are uniquely labelled using a hierarchical code. There are four levels in this hierarchy, broken down as follows:

Level in spatial hierarchy          Number of units
Postcode Areas                                  120
Postcode Districts                            2,679
Postcode Sectors                              8,820
Residential Unit Postcodes                1,397,754
Non-Residential Unit Postcodes              151,765
Large User Unit Postcodes                   171,541
Total Unit Postcodes                      1,721,060
Residential delivery points              23,845,162

The Post Office works by sending mail to district sorting offices, from where it is redistributed to individual delivery points. The advantage of sorting mail at such a local level is that local knowledge of the area can be used to correct mistakes. This is one of the strengths of the postcode system.

This two stage process is represented in the Postcode itself, in that it has two parts separated by a space. The first part is called the Outward Code and contains information that allows the mail to reach the district sorting office. The second part is called the Inward Code and is used to target the mail to a small number of delivery points.

The Postcode

The Outward Code: The largest postal unit is the Postcode Area. Most of these are (or were) centred on major nodes in the national transport network. They are generally denoted by two alphabetic characters, chosen wherever possible to be a mnemonic for the place (e.g. OX is the Area code for Oxfordshire).

Each Postcode Area is subdivided into Postcode Districts, each denoted by a number ranging between 0 and 99; thus OX5 is a postcode district covering the Cherwell area in Oxfordshire.

The Inward Code: The postcode sector is indicated by a single digit, running from 1 to 9 and then 0. Hence, OX5 2 is the Postcode Sector which includes the village of Islip, in the Cherwell District. The full postcode is produced by adding two final alphabetic characters; OX5 2SH is a group of 48 households in Islip.

Types of Postcode

There are three types of Unit Postcode: residential, non-residential, and large user (see the table above). Organisations which receive more than 25 items of mail each day in urban areas, or more than 50 items in rural areas, are normally given their own Large User Postcode. PO Boxes are also categorised as Large User Postcodes.

The Central Postcode Directory

The Central Postcode Directory is a computerised directory that links each postcode in the UK to its ward, district, county and National Grid reference. The CPD was created by the Census Offices in conjunction with the Royal Mail. The first address within the postcode is used in making the linkage, and all other addresses within the postcode are allocated the same ward, district and county. The directory is updated by the Post Office twice a year and is estimated to omit only about 0.1 per cent of postcodes. Unfortunately, the version purchased by the academic community (held by both the ESRC Data Archive, Essex, and at the MCC at MIDAS) is updated less frequently. This file is central to the Leeds GAS as it allows the postcodes that are entered to be linked to the census information.

Postcode Formats

The format of the Postcode is not truly common across the whole of Britain. The following Table lists the variety of different formats that exist.

Format     Example    Number of Outcodes in the form
AN NAA     S3 9JD     70
ANN NAA    S34 3AB    262
AAN NAA    OX5 2SH    1052
AANN NAA   DN16 9AA   1482
ANA NAA    W1P 1PA    9 (only in W1)
AANA NAA   EC1A 1HQ   49 (only London districts EC1-4, SW1, WC1-2)
AAA NAA    GIR 0AA    One-off postcode used by the National Giro Bank

Other conventions are that the letter J is not used in either of the first two alphabetic positions of the postcode, and Q, V and X are not used in the first alphabetic position. The letters I and Z are not used in the second alphabetic position. There are also restrictions on the letters used in the inward code, which means that there are 400 letter pairs available.
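
These formats are regular enough to be checked mechanically. The fragment below is a rough sketch of an outward/inward splitter based on the table above; it is illustrative only, does not enforce the letter restrictions just described, and the one-off GIR 0AA code would need special-casing:

  # A rough postcode splitter based on the formats in the table above
  # (illustrative only; letter restrictions and GIR 0AA not handled).
  import re

  POSTCODE = re.compile(
      r"^(?P<outward>[A-Z]{1,2}[0-9][0-9A-Z]?)\s+(?P<inward>[0-9][A-Z]{2})$"
  )

  def split_postcode(pc):
      m = POSTCODE.match(pc.strip().upper())
      return (m.group("outward"), m.group("inward")) if m else None

  print(split_postcode("OX5 2SH"))    # ('OX5', '2SH')
  print(split_postcode("EC1A 1HQ"))   # ('EC1A', '1HQ')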

Linkage with Census EDs

The 1991 census EDs do not perfectly match postcode geography. A fraction of postcodes can be assigned to more than one census ED. The postcode directory also changes over time, owing to re-organisation by the Post Office and to new developments and demolitions since 1991. So matching current postcodes to 1991 EDs will unavoidably result in a small number of mis-matches and failures to find postcodes in the postcode-ED directory. Fortunately, this is not often a serious problem, although there could be localised exceptions.


7. What classification should you use?

The key decision is the number of clusters or area types to use. The choice of method is far less important, particularly if you bear in mind the view that the world's best classification with 50 clusters (perhaps produced on the world's fastest supercomputer using the world's best classifier) may be superior to any other system with the same number of clusters, but inferior to virtually any system with 60 or more clusters in it.

The number of clusters is really a measure of the level of generalisation. What is 'best' for one application need not be 'best' for another. There is also a problem in measuring 'best'! So you need to experiment with several numbers of clusters. Fortunately, LGAS lets you do this quite easily.


8. What is special about GB Profiles?

It is the only generally available small area census classification of a geodemographic type that exists for academic researchers to use. It was produced using state of the art methods at least as good as those used to create the various commercially sold geodemographic systems. However, it also offers some additional benefits:

  1. It is free to registered census users from HEIs in the UK
  2. It offers an extremely broad range of levels of resolution, with between 5 and 1000 clusters
  3. It is easy for social scientists and others to access and use via Leeds GAS
  4. One of the GB Profile classifications has now been added to the ESRC's Sample of Anonymised Records (SARs), and you can also apply this classification to your own data
  5. You can apply these classifications to files coded with census area codes and postcodes, and
  6. A profile ranking report writing facility is also provided


9. What is the purpose of the GB Profile classifications?

In GB Profiles the aim was to provide a good description of the characteristics of UK residential areas based on a broad and representative coverage of 1991 census data and performed at the finest geographical scale. The aim was to ensure that the UK research community had access to small area census classifications at least as good as those available to the commercial sector, and ideally of a better quality.

Care was taken not to exaggerate any particular topics, though maybe this is not entirely avoidable. Classification is a data descriptive device. Its value to the end-user is not independent of the purpose it was designed to meet. The choice of variables and the selection of the geographical scale of the classification are reflections of purpose, but only in a vague, ambiguous, and general sort of way.


10. What's good and bad about census data?

A small area census classification is based on census data and therefore is affected by the "goods" and "bads" of census data.

Some of the "good" aspects are:

Some "bad" aspects are:


11. Which classification should you use?

See question 7.


12. What kinds of outputs are produced by LGAS?

See below.


13. How can you measure the performance of the classification?

A small number of performance statistics are provided that attempt to summarise the overall levels of discrimination being provided. They are:

  1. Within cluster sum of squares (% loss)
  2. The area of the plot in the gains chart

The % loss is an indication of how much of the variance of the numerator (or rate variable) occurs within the clusters and is therefore lost by the classification. So the % loss is what the classification has lost; hence smallest is best. The quantity 100 - % loss is a crude measure of how much of the total variance has been "explained" by the classification.

The area of the plot in the gains chart is a measure of the degree of segmentation power provided by the classification. This too is expressed as a percentage: 50% is broadly random and 100% is impossibly good. The statistic has a relative interpretation and is mainly useful for comparing classifications.
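
The two statistics might be computed along the following lines. This is a sketch under assumed definitions (% loss as within-cluster over total sum of squares, and the gains area by the trapezium rule over clusters ranked by rate); LGAS's exact formulas may differ in detail:

  # Sketches of the two performance statistics under assumed
  # definitions; LGAS's exact formulas may differ in detail.
  import numpy as np

  def pct_loss(y, labels):
      """Within-cluster sum of squares as a % of the total sum of squares."""
      total = ((y - y.mean()) ** 2).sum()
      within = sum(((y[labels == c] - y[labels == c].mean()) ** 2).sum()
                   for c in np.unique(labels))
      return 100.0 * within / total               # smaller is better

  def gains_area(num, den, labels):
      """Area under the gains curve as a %; 50 is broadly random."""
      clusters = np.unique(labels)
      n = np.array([num[labels == c].sum() for c in clusters])
      d = np.array([den[labels == c].sum() for c in clusters])
      order = (n / d).argsort()[::-1]             # rank clusters, best rate first
      x = np.concatenate([[0.0], np.cumsum(d[order]) / d.sum()])
      y = np.concatenate([[0.0], np.cumsum(n[order]) / n.sum()])
      return 100.0 * (((y[1:] + y[:-1]) / 2) * np.diff(x)).sum()  # trapezium rule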


14. What does the segmentation analysis do?

If you run the segmentation analysis on your dataset, it will produce a table with several columns like the following example:

Segmentation Analysis based on the UK Population
Cluster Number   Data Cases   Percentage counts   Segmentation Index   Label
      2               1             1.7                  156           Cold, but otherwise not deprived
      3              10            16.7                   87           Not deprived
      4               8            13.3                   89           Not deprived
      5               3             5.0                  100           Housebound & ill in rented property
      6               8            13.3                   90           Not deprived
      7               1             1.7                  279           Showerless sardines
      8               3             5.0                  118           Smelly Britain
     12               2             3.3                   96           Cold renters
     13              24            40.0                  107           Rented, high unemployment, lone parents, immobile

The first column contains the cluster number although not every cluster will necessarily be represented in the data set you upload to LGAS. This analysis was run on a 15 cluster classification but clusters 1, 9-11 and 14-15 were not present.

The second column contains the number of EDs or postcodes in each cluster, referred to as Data Cases in the table.

This number is used to calculate the Percentage counts in column three, which is simply the percentage of your data cases (i.e., EDs or postcodes) that fall in cluster j, so the column sums to 100:

Perc countj = 100 * Data Casesj / Total Data Cases

The fourth column contains the segmentation index, a useful descriptive measure that expresses the cluster mean value as a percentage of the global mean. It is commonly used for comparing the prevalence of the variable of interest (viz the designated numerator) in a cluster with that cluster's share of either the number of EDs in the whole of the UK or the designated denominator variable. Taking the simplest case, in which you upload only postcodes or EDs to LGAS, the segmentation index is calculated as:

Seg Indexj = 100 * Perc countj / UK Perc countj

It produces an easy-to-interpret index, where 100 means average, values greater than 100 indicate a higher than expected concentration in a cluster, and values less than 100 a lower than expected concentration. Both extremes are of greatest interest in geodemographic analysis. In LGAS the cluster means are estimated using a bootstrap to ensure robustness. If the number of EDs in a cluster is large then this will probably have little effect, but for small clusters it may well be essential.
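
In code, these two columns reduce to a few lines. The sketch below assumes you have a cluster code for each of your cases and for every ED in the UK; the bootstrap step used by LGAS is omitted:

  # A sketch of the Percentage count and Segmentation Index columns
  # (the bootstrap used by LGAS is omitted for clarity).
  import numpy as np

  def segmentation_table(your_codes, uk_codes):
      rows = {}
      for j in np.unique(your_codes):
          perc = 100.0 * (your_codes == j).mean()   # % of your cases in cluster j
          uk_perc = 100.0 * (uk_codes == j).mean()  # cluster j's share of UK EDs
          rows[j] = (perc, 100.0 * perc / uk_perc)  # (Perc count, Seg Index)
      return rows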

An index such as this ignores the variability of values around the mean (use a Z score instead!) but provides a simple indication of whether the value for a cluster is extremely high or low compared with the average. It is the really large values that will get you excited (size matters), but very small values may be interesting too. After all, knowing that something is missing or deficient is almost as useful as knowing that there is a massive excess.

The final column contains the labels associated with each cluster.


15. I don't have just EDs or postcodes. I also have numerator and/or denominator data to analyse. What difference does this make to the way I interpret the segmentation analysis?

When you upload numerator and denominator data, the segmentation index is calculated as:

Seg Indexj = 100 * Aj / Bj

where

Aj is the mean value for the numerator in cluster j or the average rate if there is a denominator, and

Bj is the global mean that corresponds to the variable used to compute Aj.


16. What are Z scores?

A generally useful statistic to report for each cluster is a Z statistic. This is calculated as follows for each cluster j on selected variables:

Zscorej = (A - B) / S

where

A is the mean percentage value for all cases assigned to cluster j

B is the global mean percentage for all the data

S is the standard deviation of A for cluster j

Note that the data used to compute A depend on whether you supplied numerators or just cases. If you used both numerator and denominator variables then this A mean would be a rate (maybe even a response rate). Conventionally, S would be based on all the data, not just the data assigned to cluster j. Here it is used as a local measure of within-cluster variability because that seems more useful.

Typically Zscore values vary from -3 to +3. Values above zero indicate a higher than average value for the cluster. Values far above +3 indicate extremely high values. Negative values imply a deficiency.

The values of the Zscore are relative to the global data mean but take into account the variability within the cluster. A high positive Zscore indicates that the classification/cluster(s) in question has been successful at segmenting (or isolating) the variable in question. A high negative value is similarly important except that the variable in question is largely missing from the cluster (i.e. it is under-represented).

In LGAS the values of A and S are estimated using a bootstrap procedure to ensure a more robust approach that will be less influenced by non-normal data distribution and/or small number problems and/or questions related to data unreliability.
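
As an illustration, a bootstrapped Z score for a single cluster might look like the sketch below; the details of LGAS's own bootstrap are not documented here, so treat them as assumptions:

  # A sketch of a bootstrapped cluster Z score (the details of LGAS's
  # own bootstrap are assumptions here).
  import numpy as np

  def cluster_zscore(values, labels, j, n_boot=1000, seed=0):
      rng = np.random.default_rng(seed)
      v = values[labels == j]                    # cases assigned to cluster j
      A = np.mean([rng.choice(v, size=len(v), replace=True).mean()
                   for _ in range(n_boot)])      # bootstrapped cluster mean
      B = values.mean()                          # global mean for all the data
      S = v.std(ddof=1)                          # within-cluster variability
      return (A - B) / S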


17. Why test for cluster significance?

It is often useful to know whether the index value for each cluster is significantly different (either greater or less) from what would have been produced had the classification been generated by a random method. LGAS uses a Monte Carlo significance test to answer this question. A probability threshold of 0.05 is used to limit the amount of output. Note that the significance test tests for differences from random in either direction, i.e. either above or below the value that would be expected if the classification were random.

Note that there is a problem here with multiple testing. For example, if you select a random classification with 100 clusters in it, then on average 5 of these clusters would appear significant at the 5% level by having values significantly above or below the average. However, hopefully, you would not be too misled, as their index values would only be slightly above or below 100.

This is a fairly minimal check as randomness is a very poor benchmark. It will only really become useful if the data being segmented/described contain many small values or have clusters with very low rates.
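
A test of this kind can be sketched as follows (an assumed form of the test, in which the observed cluster mean is compared with the means of many random clusters of the same size):

  # A sketch of a two-sided Monte Carlo test for one cluster's value
  # against random clusters of the same size (assumed form of the test).
  import numpy as np

  def monte_carlo_p(values, labels, j, n_sims=999, seed=0):
      rng = np.random.default_rng(seed)
      size = (labels == j).sum()
      observed = abs(values[labels == j].mean() - values.mean())
      sims = np.array([rng.choice(values, size=size, replace=False).mean()
                       for _ in range(n_sims)])
      extreme = np.abs(sims - values.mean()) >= observed
      return (extreme.sum() + 1) / (n_sims + 1)  # significant if below 0.05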


18. What is a Gains table or plot?

A gains table or plot is a useful tool that can be used to show the degree of concentration of your data by certain types of cluster. The steps are:

  1. You decide how to rank the clusters - the choices are mean value, Zscore, or index.
  2. A table of cumulative values is then computed. If you have input two variables, this table shows the cumulative percentage of the numerator compared with the cumulative percentage of the denominator. If you have input only one variable, you can choose a second variable from a list of census variables contained in LGAS. A strong level of segmentation is indicated by a big difference between these percentages; if either the classification or the data are random, the two sets of percentages will be very similar. These results can also be input into a spreadsheet and an X-Y scatterplot generated to show the relationship (see the sketch after this list). The greater the deviations, the stronger the degree of discrimination or segmentation provided by the classification.
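
A sketch of the computation (ranking by index here; ranking by mean value or Z score works the same way):

  # A sketch of the gains table: rank clusters, then accumulate
  # numerator and denominator percentages.
  import numpy as np

  def gains_table(num, den, labels):
      clusters = np.unique(labels)
      index = np.array([num[labels == c].sum() / den[labels == c].sum()
                        for c in clusters])
      rows, cum_num, cum_den = [], 0.0, 0.0
      for c in clusters[index.argsort()[::-1]]:   # highest index first
          cum_num += 100.0 * num[labels == c].sum() / num.sum()
          cum_den += 100.0 * den[labels == c].sum() / den.sum()
          rows.append((c, round(cum_num, 1), round(cum_den, 1)))
      return rows  # large gaps between the two columns = strong segmentation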


19. Why on earth would anyone want to run LGAS on a RANDOM classification?

Well, the answer is obvious really: curiosity as to what a random classification would do with your data. Note also that analysing good data with a random classification is broadly similar to analysing random data with a good classification. Note that the number of clusters in the random classification corresponds to whatever classification you selected, and that each cluster in the random classification will have an approximately equal share of your data.

Random Data

Select this option and your data are replaced by random uniformly distributed values in the range 0 to 100. This option is provided so that you can identify what happens to random data when a census-based geodemographic segmentation system is applied to it. Note that any "significant" results are false positives; or, putting it more bluntly, WRONG!

Artificially clustered data

It is sometimes interesting to know what a segmentation system would do with strongly clustered data. Unfortunately, there is no easy way to create "strongly clustered" data, whereas random data are, by comparison, very easy to generate. Clustered data require that you provide a model of the clustering process and, unfortunately, there are a large number of alternatives here, which is one reason why synthetic clustered data are still very rare.

The following algorithm is used here. If you input two variables then the data are converted to percentages. The data are then split into quartiles. Data in the first three quartiles are recoded to the range 0 to 25, and data in the upper quartile are left untouched. The resulting artificial data are used in LGAS.
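
A sketch of this recode (assuming the input values are already percentages):

  # A sketch of the artificial-clustering recode: squeeze the lower
  # three quartiles into the range 0-25 and leave the top quartile alone.
  import numpy as np

  def artificially_cluster(values):
      v = np.asarray(values, dtype=float)
      q3 = np.percentile(v, 75)                  # upper-quartile boundary
      lower = v <= q3
      lo, hi = v[lower].min(), v[lower].max()
      out = v.copy()
      out[lower] = 25.0 * (v[lower] - lo) / max(hi - lo, 1e-12)
      return out                                 # upper quartile untouched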

