Fairer Council Tax: consultation analysis

Analysis of responses to the Fairer Council Tax consultation.


Annex B: Technical approach to qualitative analysis

The consultation included six questions with free-text fields. There was significant variation in the level of detail, length and topics covered by the responses. Given the scope, breadth, and scale of consultation responses, the research team followed an approach combining manual coding (to develop a codebook of themes) and automated text analysis (using Natural Language Processing models to assign themes to responses that were not initially reviewed by the research team). As manual reading of all 15,628 responses to the consultation was not possible given the consultation timelines, this combined approach was the most efficient way to fully consider all responses to the consultation while leveraging insights identified by experienced qualitative researchers.

Three different models (all designed to work with unstructured text data) were used for the automated text analysis. To remove bias from any individual model, a “wisdom of crowds” approach was used: the final set of themes was based on combining the themes assigned by each of the three models in a voting scheme. The responses and themes assigned by the models were reviewed as part of quality assurance checks by the research team, with incorrectly assigned themes corrected and feedback fed into subsequent iterations of the models. The models were run (and their outputs manually reviewed) as many times as needed to reach the desired level of accuracy; for each iteration of the model, themes assigned by the automated analysis were reviewed a minimum of three times by two different members of the research team. This procedure ensured that, in line with UK[16] and Scottish Government[17] guidelines on the conduct of consultation analyses, all responses were analysed carefully and in full.

As discussed above, the integrated manual coding and automated text analysis approach was selected as the most accurate way to consider all responses given the very large number of responses received and the tight project timelines, which made manual reading of all responses unfeasible.[18] In particular, manual coding of large volumes of text data can lose reliability and become intractable (Abbasi et al. 2016[19]). In addition, automated text analysis can help researchers identify and extract more fine-grained patterns in responses, achieving both breadth and depth in coding in a way that manual coding techniques cannot, due to their resource-intensive nature (Nelson et al. 2018).[20]

One key limitation of this approach is that themes are assigned to responses probabilistically, based on the probabilities estimated by each model. This means that automated text analysis, although extensively quality assured, will not be as rigorous or thorough as manual coding of all responses (the models can only learn from a smaller subset of manually coded responses, whereas full manual coding can draw on a researcher’s full breadth of experience). In addition, it may be difficult for automated text analysis models to correctly identify themes in more complex responses (Chen et al. 2018[21]), for example:

  • Data ambiguity: a respondent may express an attitude or opinion without clearly stating the target of that opinion, or may provide multiple, contradictory opinions.
  • Human subjectivity: different levels of understanding or sets of experiences among qualitative researchers may lead to different interpretations of the response.

In both of these examples, multiple reasonable interpretations of a respondent’s answer might be possible, making it challenging for the automated text analysis models to identify the correct “range” of themes or codes that should be applied. However, rapid advancements in natural language processing models in the past few years (for example, moving away from dictionary- or keyword-based approaches to models that can understand sentence context and meaning) have led to major improvements in accurately coding more complex text data (Grandeit et al. 2020).[22]

Finally, the physical infrastructure used to carry out the automated text analysis fully complied with the UK General Data Protection Regulation (UK GDPR) and generative AI guidelines:

  • All models used for automated text analysis are fully open-source and freely available for commercial and non-commercial use.
  • Because the models were trained and fine-tuned by the research team, the Python scripts used to run the models did not contact or require a connection with any third-party service (such as through an API) at any point during the automated text analysis.
  • To ensure all responses were considered in full within the consultation timelines, the research team deployed NLP models on the Google Cloud computational platform (GCP). Cloud servers are faster, more efficient and specifically designed for storing and processing very large text datasets.
  • GCP follows stringent, state-of-the-art data security and encryption protocols, ensuring that neither Google nor other third parties can access data stored on GCP.[23]

1. Initial manual coding

A random sample of 250 free-text responses for each open-format question was manually reviewed and coded. The sample consisted of different respondents for each consultation question, and responses were selected to prioritise longer responses (75% of responses in the sample were longer than the median response length) and to be representative of the distribution of respondent characteristics (by Council Tax band and local authority).

All responses submitted via e-mail and post were read in full, as manual reading was required to add these responses to the master dataset for analysis. In addition, all organisational responses (including those submitted online through Citizen Space), as well as all responses from local authorities with island communities[24] for Question 8, were manually reviewed in their entirety. These responses were not part of the sample of 250 given to the researchers using the procedure outlined above; rather, they constituted a separate set of responses and were not used as input for the NLP algorithms. Themes from responses submitted via e-mail and post were also added to the codebook to inform qualitative analysis by the research team.

Responses to the consultation differed in depth and approach, and while many responses included evidence to back up opinions, other responses primarily expressed preferences, concerns or expectations without further analysis. As part of the coding process, the research team’s approach to handling these differences involved:

  • Capturing the main idea regardless of whether it was expressed as a personal view or if evidence was provided to sustain the argument.
  • Reading past grammar and spelling mistakes to capture the main idea, even where the information was difficult to distil.

All coded responses were reviewed by a second coder as part of quality assurance, and regular project team meetings ensured that themes were defined consistently across researchers. Team members then added codes identified in the manually-read sample of responses to a separate codebook in Excel (a list of all themes raised by respondents across all consultation questions). Codes were organised based on a treemap format, in which codes are arranged hierarchically into main themes and subthemes. This aligns with the inductive approach to thematic analysis set out by Fereday and Muir-Cochrane (2006).[25]
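
As an illustration of this hierarchical structure, a codebook entry can be thought of as a main theme with nested subthemes, as in the minimal sketch below (in Python). The main theme names are invented for illustration and are not taken from the actual codebook.

    # Illustrative structure of a hierarchical codebook: main themes with
    # nested subthemes. The main theme names are invented for illustration
    # and are not taken from the consultation codebook.
    codebook = {
        "Affordability": [
            "Council Tax is becoming unaffordable",
            "Impact on low-income households",
        ],
        "Design of the tax": [
            "Tax bands are flawed",
            "Revaluation is needed",
        ],
    }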

2. Automated text analysis

The next step was to extend the range of qualitative codes assigned by the research team (the codebook) to the full set of consultation responses for each consultation question using Natural Language Processing (NLP) machine learning models, embedded within a larger procedure comprising human supervision and quality checks. This task is known as multilabel text classification, and is an example of a supervised learning task. In supervised learning, a dataset that has been labelled by skilled researchers is used to train machine learning models to classify a new dataset (that has not been labelled).[26]

When analysing text data, the models generate a set of probabilities for every label in the codebook.[27] These probabilities indicate the likelihood that the individual label applies to a specific response (a higher probability means the model believes the individual theme correctly applies to the response). For example, a model trained on a set of newspaper articles should be able to differentiate an article pertaining to the categories of “World Politics” and “Foreign Affairs” (perhaps an interview with a diplomat or ambassador offering their country’s point of view on a trade war with another country) from one pertaining to “Sports” (such as the weekly football results of the Premier League): it would assign relatively high probabilities to both World Politics and Foreign Affairs and a relatively low one to Sports for the first article, and the opposite for the second.
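
To make the idea of per-label probabilities concrete, the short Python sketch below uses an off-the-shelf zero-shot classifier from the open-source Hugging Face transformers library to score the newspaper example. This is an illustrative toy only: the model name and labels are assumptions for the sketch, not the fine-tuned models used for the consultation analysis, which are described in the remainder of this annex.

    # Illustrative only: an off-the-shelf zero-shot classifier returning a
    # probability for every candidate label, mirroring the newspaper example.
    # The model name and labels are assumptions for this sketch, not the
    # consultation team's fine-tuned models.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    article = "The ambassador set out her government's position on the tariff dispute."
    labels = ["World Politics", "Foreign Affairs", "Sports"]

    result = classifier(article, candidate_labels=labels, multi_label=True)
    # result["labels"] and result["scores"] pair each label with its probability;
    # "World Politics" and "Foreign Affairs" should score high, "Sports" low.
    print(dict(zip(result["labels"], result["scores"])))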

The research team deployed the most recent NLP innovations in the fields of text classification and generation. These are referred to as Large Language Models (LLMs) and act as very advanced “categorisation machines”. The underlying technology (the transformer architecture) was first developed in 2017 and is now widely used across both industry and academia for NLP tasks such as multilabel text classification.

LLMs are based on vectors, or numerical representations of words that capture their meaning based on the context they are used in: for example, “bright” has different meanings in the sentences “the student is bright” and “the lightbulb is bright”. These models take sentences and transform them into vectors (this process is called embedding), allowing text to be processed in a way that is both quantitative and qualitative. Because LLMs have been trained on billions of webpages, tweets and Wikipedia articles, they learn to understand context, meaning and semantics, as well as slang, jargon, dialects and the ever-evolving set of expressions of a language that vary with the age, generation and social group of a language user. LLMs can also work around spelling errors, if these are not too extreme and the language deployed is still understandable to a human reader. In a nutshell, these models are able to, in some sense, “understand” natural language.
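
As an illustration of the embedding step described above, the sketch below uses the open-source sentence-transformers library to turn two sentences into vectors and compare them. The specific model name is an assumption for the example and is not necessarily the embedding model used in this analysis.

    # Illustrative sketch of embedding: sentences become numerical vectors
    # whose similarity reflects meaning in context. The model name is an
    # example, not necessarily the one used for this consultation.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = ["the student is bright", "the lightbulb is bright"]
    vectors = model.encode(sentences)          # one vector per sentence

    # Cosine similarity close to 1 means very similar meaning; the two uses
    # of "bright" above are related but not identical, so the score is lower.
    print(util.cos_sim(vectors[0], vectors[1]))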

To analyse responses to this consultation, the research team used three widely used open-source models: BERT, GPT-2 and a few-shot learning model.

  • BERT and GPT-2[28] are particularly adept at understanding natural language: BERT uses an “encoder” to read and understand different aspects of the input text (meaning, structure, semantics), while GPT-2 uses a “decoder” to generate text. These models can be trained to carry out multilabel text classification through a process called fine-tuning: by providing a large sample of labelled consultation responses as input, the model can be “taught” to predict labels for a set of unlabelled consultation responses that it has not previously seen (a sketch of this fine-tuning step is given after this list).
  • The third model deployed, few-shot learning, is adept at fine-tuning with a smaller input dataset. Rather than predicting labels based on flows of information across a “multi-layered” model like GPT-2, the few-shot learning model makes predictions by comparing the embeddings of the input text with the embeddings of the candidate labels in the input dataset, learning to extend the classification to further responses.
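
The sketch below illustrates what the fine-tuning step can look like in code, using the open-source Hugging Face transformers library to train an encoder model for multilabel theme classification. The theme names, example response and hyperparameters are illustrative assumptions and do not reproduce the research team’s exact configuration.

    # Minimal fine-tuning sketch (illustrative): an encoder model is trained to
    # predict multiple themes per response. Themes, example text and
    # hyperparameters are assumptions, not the team's actual configuration.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    THEMES = ["Council Tax is becoming unaffordable", "Tax bands are flawed"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=len(THEMES),
        problem_type="multi_label_classification",  # sigmoid + binary cross-entropy
    )

    # One labelled training example: a response and a multi-hot label vector.
    batch = tokenizer(["My bill keeps rising and the bands are out of date."],
                      return_tensors="pt", truncation=True, padding=True)
    labels = torch.tensor([[1.0, 1.0]])  # both themes apply to this response

    optimiser = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss = model(**batch, labels=labels).loss  # loss computed internally
    loss.backward()
    optimiser.step()

    # After fine-tuning, a sigmoid turns the model's scores into per-theme
    # probabilities for new, unlabelled responses.
    probabilities = torch.sigmoid(model(**batch).logits)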

There are two reasons for using multiple models for automated text analysis. First, it allows the best model to be chosen from a set of competing alternatives. Second, it opens up the possibility of combining model labels to leverage a “Wisdom of the Crowds” effect: assignment of a label is more robust, and done with a higher degree of confidence, if the label is “voted” for by two or more classifiers for the same answer rather than by only one. This minimises the bias from using a single model, as different models may perform better on specific consultation questions or in identifying specific themes.
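
A minimal sketch of the majority-voting idea is shown below. The theme names are illustrative, and the weighted variant described later in this annex would additionally multiply each model’s vote by a performance-based weight.

    # Illustrative majority vote across the three models' predicted theme sets:
    # a theme is kept if at least two of the three models assigned it.
    from collections import Counter

    def majority_vote(theme_sets, n_models=3):
        counts = Counter(theme for themes in theme_sets for theme in themes)
        return {theme for theme, votes in counts.items() if votes > n_models / 2}

    # Example (theme names are illustrative): the three models agree on one
    # theme and two of them also assign a second theme.
    predictions = [
        {"Tax bands are flawed", "Support for reform"},   # BERT
        {"Tax bands are flawed"},                          # GPT-2
        {"Tax bands are flawed", "Support for reform"},    # few-shot model
    ]
    print(majority_vote(predictions))  # both themes survive the vote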

These models (GPT-2 and BERT in particular) rely on having a reasonably large amount of data for fine-tuning: the input dataset should include a minimum number of “examples” of each theme that the model can learn from. This means that one way to boost the accuracy of the models is to increase the size of the input dataset. More specifically, for each consultation response manually coded by the research team, a number of “synthetic” responses with the same meaning as the original response were generated using an LLM. These synthetic responses were only used to provide additional examples of each theme to improve the model learning process, and no analysis was conducted on them (the analysis presented in the main body of the report is based entirely on the dataset of consultation responses received from respondents). For this consultation, the research team used the Llama LLM.
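
The sketch below indicates how such class-preserving synthetic responses might be generated with an open-source Llama model through the Hugging Face transformers library. The model name, prompt wording and generation settings are assumptions for illustration; the research team’s exact prompts and Llama variant are not reproduced here.

    # Illustrative sketch of class-preserving augmentation: an open-source
    # Llama model paraphrases a manually-coded response so that it keeps the
    # same meaning (and therefore the same themes). Model name, prompt and
    # settings are assumptions, not the team's actual configuration.
    from transformers import pipeline

    generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

    def paraphrase(response: str) -> str:
        prompt = ("Rewrite the following consultation response so that it keeps "
                  "exactly the same meaning but uses different wording:\n" + response)
        output = generator(prompt, max_new_tokens=120, do_sample=True)
        return output[0]["generated_text"]

    synthetic = paraphrase("Council Tax bands no longer reflect property values.")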

Our overall approach to automated text analysis proceeded as follows:

1. Increase the size of the training dataset by using Llama to artificially generate 100 synthetic responses for each manually-coded consultation response. This strategy allowed the research team to increase the accuracy of the NLP models used for automated text analysis, as higher accuracy levels are correlated with larger input datasets used for fine-tuning.

a. A variety of strategies were used to generate synthetic responses that still had the same meaning (class-preserving) as the original manually-coded response: changes in individual names, swapping adjectives, using synonyms and paraphrasing. This approach is based on Edwards et al. (2021)[29] and Guo et al. (2023).[30]

b. 10% of all synthetic responses were manually reviewed by the research team to check for quality (is the meaning the same as the original response?) and diversity (are the synthetic responses just duplicates of one another?). This ensured that the themes assigned to the synthetic responses were correct and the NLP models “learned” as much as possible (if synthetic responses were duplicates of the original responses, the NLP models would not have learned to make new connections or understand new patterns by reading the synthetic responses).

2. Fine-tune the BERT, GPT-2, and few-shot learning models by training them on the augmented training dataset (including the manually-coded and synthetically generated responses).

3. Use the fine-tuned models to output a set of probabilities of a given answer belonging to a particular theme:

a. More specifically, if the codebook (produced from responses manually coded by the research team) included a total of 50 themes, the model would estimate the probability that each of the 50 themes correctly applied to a response.

4. Use a procedure known as maximum cut (Largeron et al. 2012)[31] to select the threshold that determines which themes should be assigned to each response (only themes with a probability above the threshold would be assigned). This approach estimates a different threshold value for each response, taken as the midpoint between the two adjacent theme probabilities with the largest gap for that specific response (a sketch of this thresholding procedure is given after the numbered steps). This allows more than one theme to be assigned to a specific response. In addition, responses shorter than 25 characters were restricted to only have one theme assigned (as it was very unlikely these short responses of 5-6 words or fewer had enough space to discuss multiple ideas or topics).

5. Combine the themes for the three models using both a majority-voting and weighted majority-voting approach.

a. In the majority-voting approach, for each response, the themes selected by a majority of the three models were assigned to the response. This approach is more robust than relying on outputs from a single model, as each model produces a different probability that a specific theme is correctly applied to a response.

b. In the weighted majority-voting approach, the models were first evaluated on a subset of responses manually coded by the research team. This subset of responses was not used for fine-tuning the models (so the models could not “cheat” by having previously seen the responses). Using the performance metric for each model and consultation question, the research team could then give greater weight to the model that more accurately assigned themes for that specific consultation question.

6. Run a keyword-based search to check for obvious classification errors: for example, responses that did not mention “elderly”, “pensioner” or similar words should not be assigned the theme “Negative impact on elderly individuals or pensioners”. Results from this keyword-based search were manually checked by the research team.

7. Manually review a random sample of 50 examples of each theme for each consultation question (for example, if the codebook for a question had 15 themes, then 750 responses would be manually reviewed). For each consultation question, the sample was reviewed by the researcher who read responses for that question during the initial manual coding phase.

8. If fewer than 70% of responses in the manually-reviewed QA sample were correctly labelled, the automated text analysis would be re-run with adjusted parameters, such as a larger batch size, more training epochs, a greater number of synthetic responses in the training dataset or a revised approach to generating them.

a. The voting schemes were also updated based on findings from the manual QA checks and analysis of the overall distribution of themes assigned by each model. For example, as BERT tended to be overly “generous” in assigning themes to responses, the voting procedure was designed to mitigate the effects of large differences in the outcomes of different models (based on the selection of model weights).

b. The specific threshold of 70% was chosen based on previous experience on the performance of NLP models on other public sector consultations.

c. In general, the threshold depends on the number of themes in the codebook. If only two themes are in the codebook as potential labels, then a random assignment would have a baseline accuracy of 50%; if 20 themes are in the codebook, then random assignment would have a baseline accuracy of only 5%.

d. The research team’s “accuracy” metric defined 100% as all themes for the response being correctly assigned (in other words, the model assigns the same number of themes as a trained researcher would, and each of those themes aligns with a theme a trained researcher would record). This means that an accuracy level of 70% also includes “partially” correct responses (with at least one theme correctly assigned). For example, if an answer has two themes that should be assigned (“Council Tax is becoming unaffordable” and “Tax bands are flawed”), but the model only assigns one theme (“Tax bands are flawed”), this is calculated as 50% accuracy. The overall accuracy (compared against the 70% benchmark) was calculated by first finding the accuracy of themes for each response, then averaging the response-specific accuracy across all responses for a single consultation question (a worked sketch of this calculation is given after the numbered steps).

e. In practice, almost all of the remaining responses (up to 30%) included at least one correctly assigned theme, and these responses were more likely to be ambiguous (when manually reviewing them, different members of the research team disagreed on the correct themes to assign).

f. Fewer than 5% of responses had all themes incorrectly assigned, and these responses were then reviewed and coded manually.
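
The two helper functions below sketch, in Python, the per-response thresholding described in step 4 and the partial-credit accuracy metric described in step 8d. Variable names are illustrative; the functions show the logic set out above rather than the research team’s exact implementation.

    # Step 4 (sketch): per-response "maximum cut" threshold, taken as the
    # midpoint of the largest gap between consecutive theme probabilities.
    import numpy as np

    def max_cut_threshold(probabilities):
        p = np.sort(np.asarray(probabilities, dtype=float))[::-1]  # descending
        if p.size < 2:
            return 0.5  # degenerate case: a single candidate theme
        gaps = p[:-1] - p[1:]
        i = int(np.argmax(gaps))          # position of the largest gap
        return (p[i] + p[i + 1]) / 2.0    # midpoint of that gap

    def assign_themes(probabilities, themes):
        threshold = max_cut_threshold(probabilities)
        return [t for t, p in zip(themes, probabilities) if p > threshold]

    # Step 8d (sketch): partial-credit accuracy for one response is the share
    # of researcher-assigned themes that the models also assigned;
    # question-level accuracy is the mean across responses.
    def response_accuracy(model_themes, researcher_themes):
        expected = set(researcher_themes)
        return len(set(model_themes) & expected) / len(expected) if expected else 1.0

    def question_accuracy(all_model_themes, all_researcher_themes):
        scores = [response_accuracy(m, r)
                  for m, r in zip(all_model_themes, all_researcher_themes)]
        return float(np.mean(scores))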

Across all steps, themes assigned by the automated text analysis were reviewed by the research team at least three separate times: step 4 (to ensure the maximum cut procedure had correctly assigned themes and to check for outliers), step 6 (to check the results of the keyword-based search) and step 7 (the formal QA check to estimate the percentage of themes that were correctly assigned). Each review was carried out by two separate members of the research team (one data scientist and one social researcher) to ensure the research team agreed on the performance of the automated text analysis and that performance was consistently evaluated across consultation questions.

Number of responses manually reviewed by the research team, by consultation question:

Question 1: 1,977
Question 4: 1,335
Question 6: 2,235
Question 7: 2,094
Question 8: 833
Question 9: 1,024

Contact

Email: ctconsultation@gov.scot
