Some Questions and Answers

Q. I am a simple minded user. I do not want hundreds of different systems. Which one should I use?

A. Try the 50 cluster GB Profiles Classic.

Q. How can I find out more about Geodemographics?

A. Read some of the references or come to Leeds and do an MA in Business and Service Planning, either face-to-face or part-time at a distance.

Q. How can I get help to use LGAS?

A. You should not need to. It is offered as a zero support DIY service - zero support because no one is paying us to provide any. However, we also believe you do not need any (or not much anyway!). Please let us know if you think otherwise, and particularly which parts you have problems with.

Q. Why must I use one of the fixed GB Profile classifications rather than create my own?

A. Because it is easier and quicker. In the mid 1980s it was estimated that a national classification (Super Profiles) cost £200k of computing time each time a new one was created - purely a theoretical cost because academics in the UK did not pay for computing time. The GB Profiles91 classifications took between 10 minutes and several weeks on a Sun workstation in the mid 1990s, and even in 1999 a single run could still take a few days. Soon you will be able to choose your variables, select the scale, decide on a range of clusters, and get the results back almost instantly, BUT not just yet. Indeed this would be possible today, albeit only on very expensive high performance computing hardware. So why not exercise patience and wait a few years, or else create your own classifications from scratch on your own hardware?

Q. Would a more targeted classification perform better than a general purpose one?

A. Yes and No. If you compared a classification carefully hand crafted to identify deprivation areas with a general purpose one then it is possible that the special purpose one would offer a better description for the same number of clusters. Maybe all you need do is increase the number of clusters in the general purpose classification.

Q. Are the 1991 census data too old to be useful?

A. Yes, they are old but they are still the best small area data we have for Britain. You just have to bear in mind that a small percentage of the results will be wrong, but that would also have been true in 1993, except that the percentage would have been smaller.

Q. Are lifestyle data and lifestyle census data far superior?

A. That is unproven. Basically you would have to trade-off a 1991 classification of Britain against lifestyle census based systems created in the mid to late 1990s (possibly using data spread over several years and of unproven validity). This lifestyle classification (which is not yet in the academic sector so is unavailable) would probably be good in some areas but have holes in others! It must also depend on what you want to do and what you can afford to pay.

Q. Are the variables used in GB Profiles male focused?

A. Yes. Marcus and Stan are males but Linda is not. However, we accept there may be an unconscious gender specific bias but, so far, we refuse to believe it matters much (or at all) given the other problems.

Q. None of my postcodes or ED codes on my input file were matched. Is LGAS totally naff or is it me?

A. It's you! You need to ensure your files match one of the input formats.

Q. Is there a limit on how many records I can have in my input file?

A. No, other than the implicit constraints of disk space and/or internet FTP problems.

Q. My input file has several different data values which I was hoping LGAS would analyse for me but it does not appear to handle more than two at a time.

A. That's right. So split your input data up and send LGAS multiple files. Sorry, but sometimes you have to live with annoying little restrictions imposed by others. Basically this restriction greatly simplified our task in writing the LGAS software.

Q. Some of my postcodes were not matched. What should I do?

A. There is probably nothing you can do. Postcodes change; they are re-cycled and deleted. People also write them down wrong. Provided the number lost is small (less than 10 to 20%) then forget it. If it is much larger, then you need to check the quality of your data.

Q. Do my postcodes have to be formatted in a special way with spaces etc?

A. No. LGAS will standardise them for you.
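If you are curious what "standardising" might involve, here is a minimal sketch. It is NOT the LGAS routine (which is not shown here); the function name `standardise_postcode` is made up for illustration, and it assumes a unit postcode is 5 to 7 letters and digits ending in a 3 character inward code.

```python
import re


def standardise_postcode(raw: str) -> str:
    """Return a postcode as 'OUTWARD INWARD', e.g. 'LS2 9JT'.

    Hypothetical helper, not part of LGAS: it uppercases the input,
    drops anything that is not a letter or digit, then puts a single
    space in front of the 3 character inward code.
    """
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    # Assumption: a full unit postcode has 5 to 7 characters without the space.
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"not a plausible unit postcode: {raw!r}")
    return f"{compact[:-3]} {compact[-3:]}"
```

So `standardise_postcode("ls29jt")` and `standardise_postcode(" LS2  9JT ")` both give the same string. Note that this only tidies the layout; it does not check that the postcode exists.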

Q. Where would I get some census ED data to test out LGAS?

A. From Casweb, if you are a registered census user.

Q. How does a census classification compare with a deprivation index?

A. The short answer is good. A good census classification should be able to retain a much greater degree of the multivariate profiles of areas than an index based on far fewer variables. You could try and analyse the distribution of deprivation scores by cluster code.

Q. I hate computers, I hate census data, I hate the web, and I think the internet is rubbish. Any suggestions?

A. Take up growing organic spaghetti in Tuscany and make us all jealous by your relaxed and downsized lifestyle!

Q. Are any of these GB Profile classifications better than commercial geodemographic products?

A. Maybe yes, some are. The comparisons have not yet been done and in any case they would be application specific and therefore not necessarily helpful.

Q. Why is the response of Leeds GAS sometimes slow?

A. That's because you are using our hardware, which we paid for, burning Leeds University electricity, and consuming our CPU cycles. We offer this service for "free" and so far no one has paid us any money to support it. If you want a faster response, then download it on to your machine if you can.

Q. Why does LGAS exist?

A. Well, we live in the internet age. Research outputs extend beyond publication counts. Impact via you doing something useful on our LGAS might help a little towards the Leeds University Profile, count as a dissemination of some kind, and maybe help us, one day, to obtain funding to support this service. The latter requires a pool of users and offering a free service is one way of building up a user community.

Q. Are the cluster codes a prime example of bad and dangerous statistical analysis performed by useless geographers?

A. No. If you do not like what we have done then demonstrate you can do better and we will happily add your results to LGAS!

Q. How can you link a census classification to a remote sensing database?

A. You would either have to aggregate the RS data to census ED boundaries or else create a raster representation of the space within each ED and give each the appropriate cluster code.

Q. I am a non-academic from a Local Authority or company X or government department or 10 Downing Street or a nuclear submarine under the North Pole. How can I obtain my own personal copy of Leeds GAS?

A. As academics we are not allowed (by the conditions attached to the ESRC-JISC census purchase) to sell you the classifications. So send us an e-mail and we will explain the alternatives available to you.

Q. I am concerned that expensive census data is available on a security flawed website where virtually any 5 year old hacker could steal it. Are my concerns warranted?

A. Well, yes, an expert could try to copy the files used in Leeds GAS, but whether he or she could successfully decrypt the files containing the cluster codes and match them up to postcodes or census EDs is doubtful. You could just ask the Data Archive. It would save you a lot of effort!

Q. How secure are the files going into LGAS?

A. Well, that's a risk for you. We could keep a copy! However, only you know what it means or what it relates to so what good will it do us keeping a copy of your data? So we promise not to. If you are that paranoid about the security of your postcodes or census ED codes then maybe you should try and acquire a copy of the entire system and run it locally in a locked and barred room, ideally 50 metres underground!

Q. List some example applications involving LGAS

  1. Crime incidence by postcode can be analysed by type of residential area. The ideal input would be postcode or ED code, count of crime X, total addresses/population.
  2. Disease data can be studied in a similar manner.
  3. Response analysis of a mailing campaign or questionnaire survey, likewise.
  4. You can analyse the distribution of a census variable or deprivation index by residential area type. This would answer questions such as: are the high or low values of this variable concentrated in specific types of residential area?

Q. Why does the total number of EDs in GB Profiles not match the UK total?

A. There are 725 EDs with missing information due to data suppression. These have been excluded from the analysis. Well, come on: how do you classify EDs with missing information? There is only one answer: "badly", except you do not know how badly because the data are missing!

Q. Are Census Classifications evil?

A. Well hardly! If you believe they are, then don't use them!

Q. I want to know how the variables used to create the GB Profile classifications were computed.

A. OK, you can find the list of pseudo SASPAC commands that were used to create them.

Q. Why did England and Wales postcodes not nest into census EDs?

A. Because of short-sightedness during 1991 census planning. It was easier (and cheaper) to continue the census tradition of having a unique geography for gathering census data (the census enumeration district, or ED) without worrying about postcode "boundaries". Another slight difficulty was the failure of the Post Office (who are responsible for postal geography) to view postcodes as zones (even though they are).

Q. Does it matter that postcodes do not exactly match 1991 EDs?

A. Probably not, or if at all, not much in most applications.

Q. What is segmentation?

A. Segmentation is another name for dividing your data up into sets of clusters (or segments). The hope is that you will find that the majority of your data (numerator variable) will come from relatively few clusters. In a direct mail context the numerator variable might be response, the denominator number mailed. If you rank the clusters by response rate (or Zscore or index) and express the totals of response and mailed as percentages, you may discover that 50 or 70 per cent of your response comes from clusters that contained a far lower percentage of total mailed. Similar effects in crime or disease data indicate a concentration of the data in relatively few types of residential areas. In general this is a useful form of descriptive analysis that can be used for targeting problem areas or hot spots or good prospects for various purposes.
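The ranking described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the LGAS code: the function name `gains_table` and the input layout (cluster code mapped to a pair of response and mailed counts) are invented here, and every cluster is assumed to have a non-zero mailed count.

```python
def gains_table(clusters):
    """Rank clusters by response rate and accumulate percentages.

    clusters: dict mapping cluster code -> (responses, mailed).
    Returns a list of rows (code, rate, cum_pct_response, cum_pct_mailed),
    best responding cluster first. Hypothetical helper for illustration.
    """
    total_resp = sum(resp for resp, _ in clusters.values())
    total_mailed = sum(mailed for _, mailed in clusters.values())
    # Sort by response rate (numerator / denominator), highest first.
    ranked = sorted(clusters.items(),
                    key=lambda item: item[1][0] / item[1][1],
                    reverse=True)
    rows = []
    cum_resp = 0
    cum_mailed = 0
    for code, (resp, mailed) in ranked:
        cum_resp += resp
        cum_mailed += mailed
        rows.append((code,
                     resp / mailed,
                     100.0 * cum_resp / total_resp,
                     100.0 * cum_mailed / total_mailed))
    return rows
```

Read down the last two columns: if the cumulative response percentage climbs much faster than the cumulative mailed percentage, your data are concentrated in relatively few clusters and the segmentation is doing something useful.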

Q. What is the difference between a segmentation and a classification?

A. Here we arbitrarily assume that a segmentation consists of one or more clusters. So it is the same as a classification, but in a marketing context it would also imply a targeting of certain data clusters.

Q. What value is a cluster Zscore?

A. It measures the difference between the mean value for a cluster and the global mean divided by the cluster standard deviation. It's a descriptive value that attempts to summarise the magnitude of the difference and the degree of variability of the data in a cluster. For more detailed information on calculating Zscores with LGAS, consult these help pages.
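As a rough sketch of that definition (cluster mean minus global mean, divided by the cluster standard deviation): the function name `cluster_zscores` is hypothetical, and the use of the sample standard deviation and the skipping of clusters with fewer than two areas or no spread are assumptions made here, not documented LGAS behaviour.

```python
from statistics import mean, stdev


def cluster_zscores(values, codes):
    """Zscore per cluster: (cluster mean - global mean) / cluster sd.

    values: one number per area; codes: the cluster code of each area.
    Hypothetical helper; uses the sample standard deviation (assumption).
    """
    global_mean = mean(values)
    by_cluster = {}
    for value, code in zip(values, codes):
        by_cluster.setdefault(code, []).append(value)
    zscores = {}
    for code, members in by_cluster.items():
        if len(members) < 2:
            continue  # a standard deviation needs at least two areas
        sd = stdev(members)
        if sd == 0:
            continue  # undefined when every area in the cluster is identical
        zscores[code] = (mean(members) - global_mean) / sd
    return zscores
```

A large positive or negative value says the cluster sits well away from the global mean relative to its own internal variability. It is still only descriptive.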

Q. What happens if the data being analysed by the classification are Poisson?

A. Nothing much. The bootstrap and Monte Carlo significance testing procedures are non-parametric and will work with most kinds of data. Indeed only for low incidence rate data or small number problems will these tools become useful.

Q. Why use a Monte Carlo significance test?

A. Because it will work well regardless of how the data being tested are distributed. Stan also knows how to do it and it is far simpler than any alternative.
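To show how simple, here is a minimal sketch of one kind of Monte Carlo test. It is an assumption about the general approach, not the procedure LGAS actually runs: cluster codes are shuffled at random across areas, and we count how often the shuffled rate for a chosen cluster is at least as high as the observed one. The names `cluster_rate` and `monte_carlo_p` are invented for illustration.

```python
import random


def cluster_rate(num, den, codes, target):
    """Rate (sum of numerators / sum of denominators) for one cluster."""
    n = sum(x for x, c in zip(num, codes) if c == target)
    d = sum(x for x, c in zip(den, codes) if c == target)
    return n / d if d else 0.0


def monte_carlo_p(num, den, codes, target, n_sims=999, seed=0):
    """One-sided Monte Carlo p-value for a high rate in `target`.

    Null hypothesis (assumption): cluster codes are unrelated to the data,
    so any random relabelling of the areas is equally likely.
    """
    rng = random.Random(seed)
    observed = cluster_rate(num, den, codes, target)
    shuffled = list(codes)
    as_extreme = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # random relabelling of the areas
        if cluster_rate(num, den, shuffled, target) >= observed:
            as_extreme += 1
    # The observed arrangement counts as one of the n_sims + 1 outcomes.
    return (as_extreme + 1) / (n_sims + 1)
```

Nothing in this depends on the data being Normal, Poisson or anything else, which is the whole point. With 999 simulations the smallest p-value you can get is 0.001.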

Q. What are the disadvantages of using Monte Carlo significance tests?

A. It is more expensive computationally but that no longer matters much if at all.

Q. What is the minimum amount of data needed to yield a reliable result?

A. In general more data are better than less data. The minimum number of EDs with non-zero data in a cluster before it can be analysed is 2. This threshold can be changed. Small number problems should not be too much of a problem.

Q. How do I identify an optimal number of clusters?

A. Probably with great difficulty. You will need to re-run the analysis using different numbers of clusters until you find whatever YOU consider to be the BEST one for A SPECIFIC application. Maybe it is not worth being too purist here. Since you will probably never be able to identify an optimal number why not be pragmatic and stop as soon as you find whatever you consider to be a good one.

Q. Is a neuro computing based classification better than a K Means (or CCP) based one?

A. Maybe yes, maybe no. This is something for you to determine, not for us to say or declare. Much depends on your criteria for determining "better". "Better" must relate to ease of interpretation or strength of discrimination or degree of performance as measured by a statistic of some kind.

Q. Classification is black magic!

A. Rot

Q. Classification is a scientific and objective process.

A. Rot. It's highly subjective and arbitrary in terms of how it is applied even if the technology is objective and scientific in nature.

Q. Will any data produce a great classification?

A. Rubbish. Remember the "Garbage-In-Garbage Out" rule of thumb.

Q. Is there a world best globally optimum classification method?

A. No. Some methods may appear "better" than others on theoretical grounds but quality is a subjective attribute that depends on purpose and how you choose to measure it.

Q. A geodemographic census classification provides a description of residential areas.

A. Yes. It is a descriptive and highly exploratory form of analysis.

Q. Which classification is best?

A. That is for YOU to determine.

Q. What can be done about those EDs not included in LGAS?

A. Nothing. Stan refused to classify them because of potentially erroneous results and the problem of tagging the analysis as being based on possibly misclassified data. It is best to ignore them and regard them as LGAS does, i.e. as an unclassified residual.

Q. The unclassified residual sometimes does extremely well. Why?

A. A good question about a common observation. The unclassified areas are not randomly selected but have certain features that may well result in them having some descriptive utility. For instance, unclassified postcodes that reflect post 1991 housing. Sadly in a census based system there is not much that can be done about the unclassified.

Q. What happens if some of the postcodes I am using relate to houses that were demolished in 1992 and then re-issued?

A. Nothing, except you will obtain the wrong cluster code for any postcodes affected by post census change. In other words a percentage (hopefully small) of cluster codes will be wrong. Census geodemographics is neither rocket science nor error free. On the whole it should not be too bad but do not push the results too far!

Q. Census geodemographics contains errors and so it is wrong?

A. Yes but it is only a little bit wrong! There is probably no better alternative you could employ.

Q. A classification system based on wrong data is useless?

A. No. All data are wrong to some degree. It is called error and there are several different species. The trick is to learn how to live with wrong data and classifications that are also wrong in some instances. LGAS attempts to help you by using robust methods of analysis. Ultimately you need to use a degree of commonsense and not push slightly uncertain classifications too far.

Q. In what way is a wrong classification uncertain?

A. The uncertainty relates to not knowing where precisely the wrong bits are located. Best therefore to treat the entire classification with some circumspection.

Q. What is the principal geographical problem with a census classification?

A. It handles geography badly. Cluster 27 is regarded as consisting of similar types of census EDs regardless of where they occur. What is lost is the geographical context. For instance, Cluster 27 in a town centre may be "different" from Cluster 27 in a village location or an edge of town location. Relative location is only implicit, not explicit. However, this may not matter much, or at all. It depends on what you want.

Q. Are some methods of classification better than others and does it matter?

A. Yes and No. Whether it matters depends on whether differences in classification performance impact on the application. It may not. A more critical decision might concern the choice of the number of clusters to use.

Q. I am a simple minded totally confused user - what is the simplest possible set of options to use?

A. Use the defaults and hope all will be well. Experience suggests that it usually will be ok, after all this is essentially what most users of commercial geodemographic systems do! You are in good company.

Q. Where is the intelligence in geodemographics?

A. It is put there by the system builder and reflects various critical design decisions. On the other hand, it is quite obvious that many of Britain's towns and cities are composed of residential areas belonging to a reasonably small number of distinctive types. Now isn't that discovery a reasonably intelligent one?

Q. How sensible is it to use a system with 500 or more clusters in it?

A. It can be quite useful if you have lots of data. It breaks down the basic area types into fine subtypes, maybe too fine - it depends on your purpose.

Q. Why do I have to label the clusters being used?

A. Because it is good for you! It also saves us from:

  1. having to do all the work, and
  2. being criticised for getting the labelling wrong.
However, all is not lost. We have provided our ideas of labels for some of them.

Q. How can I use a classification without knowing exactly what the different clusters are?

A. Quite easily! Treat the cluster codes as an index. Only if you really need to know the characteristics and nature of the most "interesting" clusters do you need to label them. Initially the principal question is whether or not the classification yields a worthwhile segmentation (or division) of the data. What you hope to find is that your data is concentrated in relatively few of the clusters. If it is, then it may be worth labelling the clusters.

Q. Can the same cluster be given different labels?

A. Yes. Much depends on:

  1. the person doing the labelling
  2. their degree of experience
  3. their knowledge of urban geography
  4. the variables being used to assist the process
  5. whether the cluster members are mapped, and
  6. whether the labelling is done by reference to exemplar areas.
Quantum effects exist here! The same cluster may be given a different label by the same person on a different day based on the same data.

Q. Doesn't this multiplicity of labels cause problems?

A. Why should it? It fits in well with post-modernist thinking and it avoids one person's labels and implicit value systems being imposed on others.

Q. The 100metre X,Y values that LGAS produced are not very accurate and a few seem to be wrong. Why?

A. Well that's the fault of the census data provider. In GIS terms 1991 was right at the beginning of the modern age so just be glad that there are some X,Y values attached to the 1991 census. Hopefully, the 2001 census will be an improvement.

Q. What are the benefits of a fuzzy classification?

A. It avoids the all or nothing assignment of each ED to only one cluster and allows EDs to belong to more than one simultaneously albeit with unequal degrees of membership.
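To make "unequal degrees of membership" concrete, here is a sketch using the standard fuzzy c-means membership formula. Whether LGAS uses this particular formula is an assumption; the function name `fuzzy_memberships` and the exponent `m` (the degree of fuzziness, which must be greater than 1) are illustrative only.

```python
def fuzzy_memberships(distances, m=2.0):
    """Membership of one ED in each cluster from its distances to the centres.

    Standard fuzzy c-means rule (assumed here, not confirmed for LGAS):
    membership is proportional to distance ** (-2 / (m - 1)), normalised
    so that the memberships of an ED sum to 1.
    """
    if m <= 1:
        raise ValueError("m must be greater than 1")
    on_centre = [d == 0 for d in distances]
    if any(on_centre):
        # The ED sits exactly on one or more centres: share membership there.
        share = 1.0 / sum(on_centre)
        return [share if hit else 0.0 for hit in on_centre]
    power = 2.0 / (m - 1.0)
    weights = [d ** -power for d in distances]
    total = sum(weights)
    return [w / total for w in weights]
```

An ED that is twice as far from cluster 2 as from cluster 1 gets memberships of 0.8 and 0.2 when `m` is 2. Push `m` towards 1 and the result approaches the usual all or nothing assignment; make it larger and the memberships even out.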

Q. What are the problems with fuzzy classification?

A. There are a few problems:

  1. The degree of fuzziness has to be specified in advance and how on earth is the classification builder supposed to know that?
  2. You have multiple classifications with the same numbers of clusters but different degrees of fuzziness so it involves you in more work as there is a greater choice.
  3. The resulting application of a fuzzy classification requires that the fuzziness is removed by a defuzzification procedure.
  4. Some research suggests that the benefits are more theoretical than real but this could change as more experience is gained.

Q. What types of fuzziness are there in census geodemographic classifications?

A. Openshaw (1988, 1989) identified fuzziness in both the geography space and the cluster space. In the former it concerns geographic distance in that a postcode (or ED) may be "near" to several EDs some of which are in distinctly different clusters. In the latter an ED may be "similar" to a few different clusters. Current geodemographic systems focus on targeting specific areas belonging to a set cluster(s); for example, find all areas belonging to cluster 27. A more fuzzy approach would allow queries such as find all areas "similar" to cluster 27 in classification space and "near" to cluster 27 areas in geographic space whilst being assigned to non cluster 27 codes.

Q. How does LGAS defuzzify a fuzzy classification so it can compute a rate or a value?

A. Badly - because more research is needed. Both the options (weighting by membership probability and weighting by membership probability attenuated by distance) need further development.

Defuzzifying options

A fuzzy classification allows a single ED to belong to more than one cluster albeit with different membership probabilities. This complicates the calculation of data for clusters as the classification has to be defuzzified. Only one option is provided: Membership weighted data.

The data being assigned to a cluster is multiplied by the normalised membership probability of belonging to that cluster.
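A minimal sketch of membership weighting, under stated assumptions: the function name `membership_weighted_totals` and the record layout (numerator, denominator, and a dict of cluster memberships per ED) are made up for illustration, and "normalised" is taken to mean that an ED's memberships are rescaled to sum to 1.

```python
def membership_weighted_totals(records):
    """Share each ED's data across clusters by normalised membership.

    records: iterable of (numerator, denominator, memberships), where
    memberships maps cluster code -> membership value for that ED.
    Returns cluster code -> (weighted numerator, weighted denominator).
    Hypothetical helper, not the LGAS implementation.
    """
    totals = {}
    for numerator, denominator, memberships in records:
        norm = sum(memberships.values())
        if norm == 0:
            continue  # ED belongs to no cluster at all; nothing to assign
        for code, membership in memberships.items():
            weight = membership / norm  # normalised membership
            running = totals.setdefault(code, [0.0, 0.0])
            running[0] += weight * numerator
            running[1] += weight * denominator
    return {code: (n, d) for code, (n, d) in totals.items()}
```

Because the weights for each ED sum to 1, the cluster totals still add up to the original totals; a cluster rate is then just its weighted numerator divided by its weighted denominator.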

Q. If a classification explains 85% of the variance of my rate variable is that good or what?

A. Yes that is probably good but it is still a relative measure.

Q. Is an 85% explanation equivalent to an R2 of 0.85?

A. Well you could say that but do not push the "explanation" angle too far. It is only a descriptive index.
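For the curious, a percentage like this is usually the between-cluster sum of squares as a share of the total sum of squares, which is the same number you would get as the R2 from regressing the variable on cluster membership. The sketch below assumes that definition (the exact LGAS calculation is not given here); the name `variance_explained` is hypothetical and the values are assumed not to be all identical.

```python
from statistics import mean


def variance_explained(values, codes):
    """Percentage of variance accounted for by cluster membership.

    values: one number per area; codes: the cluster code of each area.
    Computes 100 * between-cluster SS / total SS (assumed definition).
    """
    grand_mean = mean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for value, code in zip(values, codes):
        groups.setdefault(code, []).append(value)
    between_ss = sum(len(members) * (mean(members) - grand_mean) ** 2
                     for members in groups.values())
    return 100.0 * between_ss / total_ss
```

Bear in mind the figure tends to rise as you add clusters, whether or not the extra clusters mean anything, which is one more reason to treat it as a relative measure.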

Q. How can I test whether 85% is statistically significant?

A. You should not really be wanting to ask this question for a number of reasons:

  1. There is no notion of simple random sampling error,
  2. The baseline null hypothesis of randomness is probably unhelpful,
  3. Classification is not really a confirmatory approach but a descriptive tool, and
  4. Clearly you are too wedded to statistical testing for your own good - try thinking for a change.
