Department for Transport (DfT) officials have reported that a newly developed, in-house AI tool for analysing public consultation responses is operating with 'high accuracy', paving the way for estimated savings of up to £4m a year.
A new report from the department suggests these would be 'opportunity cost' savings, as staff time 'can be invested into additional activities'. However, previous DfT research into public attitudes to AI found that many feel such efficiencies would 'ultimately mean job losses for people working in the Department'.
Co-developed by the DfT and The Alan Turing Institute, the Consultation Analysis Tool (CAT) uses AI to analyse free-text consultation responses across two key stages of thematic analysis, both sketched in code below:
Theme generation: An ensemble of large language models (LLMs) extracts a set of main themes and golden insights (i.e., rare themes) from free-text responses.
Theme mapping: An ensemble of LLMs classifies which human-validated themes are mentioned in each response (i.e., multi-label classification).
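The report does not detail the CAT's internals, but the two stages correspond to a familiar pattern. Here is a minimal sketch in Python, assuming each model in the ensemble is a plain prompt-to-text callable and that outputs are pooled by majority vote - the function names, prompts and voting rule are all illustrative assumptions, not the DfT's actual design:

```python
from collections import Counter
from typing import Callable, List

# An "LLM" here is just any callable mapping a prompt to a text completion.
LLM = Callable[[str], str]

def generate_themes(responses: List[str], ensemble: List[LLM], top_k: int = 10) -> List[str]:
    """Stage 1 - theme generation: each model proposes themes, and the
    themes most models agree on are kept for human review."""
    votes: Counter = Counter()
    prompt = "List the main themes in these consultation responses:\n" + "\n".join(responses)
    for model in ensemble:
        for line in model(prompt).splitlines():
            theme = line.strip("-* ").lower()
            if theme:
                votes[theme] += 1
    return [theme for theme, _ in votes.most_common(top_k)]

def map_themes(response: str, themes: List[str], ensemble: List[LLM]) -> List[str]:
    """Stage 2 - theme mapping: multi-label classification, where a theme
    is assigned to a response if a majority of the ensemble says it is present."""
    assigned = []
    for theme in themes:
        question = f"Does this response mention the theme '{theme}'? Answer yes or no.\n\n{response}"
        yes_votes = sum(model(question).strip().lower().startswith("yes") for model in ensemble)
        if yes_votes > len(ensemble) / 2:
            assigned.append(theme)
    return assigned
```

Pooling an ensemble this way trades compute cost for robustness: a theme that only one model hallucinates is unlikely to survive the vote.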
The CAT detected around 75% of human-generated themes (recall) without any human oversight. In live pilots, initial CAT-generated themes were compared to the human-validated ones, with an overall recall of 90%. These findings suggest that the CAT performs 'with high accuracy in automated theme extraction, whilst human review remains important to ensure that all main themes are identified'.
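Recall here is the standard retrieval measure: the share of human-identified themes that the tool also surfaced. A toy illustration (the theme names are invented):

```python
def recall(tool_themes: set, human_themes: set) -> float:
    """Fraction of the human-identified themes that the tool also found."""
    return len(tool_themes & human_themes) / len(human_themes)

# Toy example: humans identified 4 themes, the tool matched 3 of them.
human = {"cost", "safety", "accessibility", "noise"}
tool = {"cost", "safety", "accessibility", "congestion"}
print(recall(tool, human))  # 0.75 - extra tool-only themes don't reduce recall
```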
In terms of theme mapping, the CAT achieved over 92% overall raw agreement with human experts, putting its reliability somewhere between 'substantial' and 'almost perfect'.
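'Substantial' and 'almost perfect' are the bands Landis and Koch attach to chance-corrected agreement scores of 0.61-0.80 and 0.81-1.00 respectively, most often computed as Cohen's kappa; the report's exact statistic is not named here, so the kappa below is an assumption. Raw agreement is simpler - the share of yes/no labelling decisions where tool and human match:

```python
def raw_agreement(tool: list, human: list) -> float:
    """Share of decisions where tool and human give the same yes/no label."""
    return sum(t == h for t, h in zip(tool, human)) / len(tool)

def cohens_kappa(tool: list, human: list) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    po = raw_agreement(tool, human)
    p_tool, p_human = sum(tool) / len(tool), sum(human) / len(human)
    pe = p_tool * p_human + (1 - p_tool) * (1 - p_human)  # chance agreement
    return (po - pe) / (1 - pe)

# Toy labels for one theme across 12 responses (1 = theme present).
tool  = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]
print(raw_agreement(tool, human))  # ~0.92, comparable to the reported figure
print(cohens_kappa(tool, human))   # ~0.83, just inside the 'almost perfect' band
```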
There was also 'no evidence' of demographic bias - that is, systematic differences in the treatment of certain groups of people. The report found 'no evidence of systematic differences in accuracy across observed demographic groups' - a proxy measure for algorithmic bias - and states that 'design features of the CAT further mitigate risk of demographic bias'.
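That check is, in effect, a comparison of labelling accuracy across demographic groups. A minimal sketch, assuming records of (group, tool label, human label) are available - a generic proxy check, not the report's actual methodology:

```python
from collections import defaultdict

def accuracy_by_group(records: list[tuple[str, int, int]]) -> dict[str, float]:
    """Per-group accuracy of tool labels against human labels. Large gaps
    between groups would be the kind of systematic difference the report
    looked for (and says it did not find)."""
    hits: dict[str, list[bool]] = defaultdict(list)
    for group, tool_label, human_label in records:
        hits[group].append(tool_label == human_label)
    return {group: sum(ok) / len(ok) for group, ok in hits.items()}

records = [("group_a", 1, 1), ("group_a", 0, 0), ("group_b", 1, 0), ("group_b", 1, 1)]
print(accuracy_by_group(records))  # {'group_a': 1.0, 'group_b': 0.5}
```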
Cost savings
'As the CAT replaces the majority of the manual work involved in thematic analysis, we estimate that it saves around 50-70% of the entire cost of responding to a medium-sized consultation,' the report notes. The remaining fixed costs include human review, survey design, project management, synthesis and report writing.
To date, the CAT has completed four consultations, which varied significantly in size. The system analysed 200,000 responses and over 8 million words, saving an estimated £0.5m compared to the cost of analysing responses manually. The report notes that the CAT also processes every response every time, unlike manual analysis by DfT staff, where only a sample of responses might be reviewed for larger consultations.
As the DfT conducts roughly 55 consultations annually, the CAT could yield savings of between £1.5m and £4m per year - assuming that the cost of a medium-sized consultation (1 million words and 15,000 responses) is £80k-£100k.
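The arithmetic behind that range is straightforward to reproduce from the report's figures, though note that scaling the medium-consultation numbers across all 55 consultations gives a floor of £2.2m rather than £1.5m - the lower published bound presumably allows for smaller consultations (an assumption here):

```python
n_consultations = 55                    # DfT consultations per year
cost_low, cost_high = 80_000, 100_000   # cost of a medium-sized consultation (£)
saved_low, saved_high = 0.50, 0.70      # share of that cost the CAT removes

low = n_consultations * cost_low * saved_low      # £2.2m
high = n_consultations * cost_high * saved_high   # £3.85m, i.e. roughly £4m
print(f"£{low/1e6:.2f}m to £{high/1e6:.2f}m per year")
```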
'The CAT saves significant full-time equivalent (FTE) hours. The human-in-the-loop theme review takes just 1-5 hours per question; even with this investment, the CAT has roughly saved 15,000 hours of work to date compared to a scenario where all responses for all consultations are rigorously analysed by humans manually. This represents a significant "opportunity cost" saving, so time can be invested into additional activities that support keeping the UK on the move.'
Caveats
The report notes: 'Ground truth is slippery in qualitative analysis: Language is subjective, experts legitimately disagree, and coding errors or variability creep into qualitative analysis.
'As a result, comparisons to human-analysed datasets are inherently contestable, which makes evaluation challenging. For example, variability in the CAT's accuracy results across datasets is partly a function of the variability in the quality and heterogeneity of the human-analysed reference datasets themselves. This renders a perfect evaluation accuracy score of 1 as practically unachievable for the CAT in many cases.'
It also 'observed variation in the accuracy across datasets, which highlights the importance of the human review stage for theme generation', adding that 'maintaining meaningful human oversight, or a human-in-the-loop, is a core ethical safeguard in AI deployment'.