How I used data visualisation with Citizens Advice content

This guest blog post is written by Ian Ansell. He is a Senior Performance Analyst who has been working in data, in some capacity, for over a decade.

Cluster coding is a really great way of grouping large datasets and visualising data, and you can use this approach to guide your content decisions. It takes a bit of logical thinking, trial and error and working with certain database platforms. But for the value it gives you, it’s worth taking the time to learn how to do it. 

Citizens Advice wanted to iterate on their content strategy and further streamline their content operations. I used code clustering to help the team see connections between various pieces of content. 

Get your source data

I started by getting data from Citizens Advice’s frontline adviser database. I used its 2 tier taxonomy: 

  1. The issue code provided by the adviser, for example: a benefits issue, 
  2. The specifics of the issue - for example, universal credit, or personal independence payment. 

I also noted the client’s ID so that the data was specific to them. 

This gave me a wealth of information around our clients’ issues. Once this was in place, I could start playing with the data! 

And now it gets interesting. 

Data patterns

I started to look at the patterns in the data. For example, let’s say we have Client A and they need help with: 

  • benefits, 
  • dealing with debt, 
  • divorce. 

This is 1 client with these 3 specific issues. Based on this client, we know that benefits are linked with debt and divorce. 

The questions started to gnaw away at me. Were there any patterns or clusters in these needs and could it be done at scale? Would this be useful to show how user needs are linked?

Technically, the approach I used is called a co-occurrence network diagram. Everyone calls it code clustering. (To be pedantic, clustering is something different. But if there was anything I learned from content design, it was that the language of your users matters.)

The concept was straightforward enough, it’s about assigning users to needs. For example, Client A (user) is assigned with the issue code (need) BEN2. Then, at some point, they also have the issue ‘DEB1’ then the data would look like this:

Client A - BEN2
Client A - DEB1

I could then see that BEN2 is ‘linked’ to DEB1. The problem was figuring out how ‘linked’ they are and then doing it at scale.

I turned to the graph database, Neo4J. I’d played around with it before as a tool for reshaping the data and making the connection between issue codes. Users and issue codes would be nodes in the database, with the edges representing the strength of the relationship.

Loading the data into the database 

The code I used to get the data into the Neo4J database was:

USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM “file:///path/to/file.csv” AS csvLine WITH csvLine, toInt(csvLine.Quantity) AS Quanity MERGE (u:User { name: csvLine.User }) MERGE (c:Code { name: csvLine.Code }) MERGE (u)-[r:HAS]->(c) SET c.ccount = coalesce(c.ccount,0) + 1

Understand your code

Let’s pause for a minute and take a look at what that actually means. These commands can help you make sense of your own data: 

USING PERIODIC COMMIT 500 - means that rather than trying to hold the whole files in memory at once, load it in batches of 500 lines.

LOAD CSV WITH HEADERS FROM “file:///path/to/file.csv” AS csvLine - loads the csv from a local location and give it the name csvLine

WITH csvLine, toInt(csvLine.Quantity) AS Quantity - Sometimes users could have the same code multiple times, so take that field and  convert it to a number. I’m not sure if NEO4J does things differently now, but back then everything was loaded as text.

MERGE (u:User { name: csvLine.User }) - Merge is a useful command where if the node doesn’t exist, then it makes a new node. But if it does exist then don’t bother and we can refer to all user nodes as ‘u’ later on.

MERGE (c:Code { name: csvLine.Code }) - Same as with clients but we differentiate the nodes by calling these ‘c’.

MERGE (u)-[r:HAS]->(c) - Here we are creating the relationship between each user and their code. That relationship is called HAS.

SET c.ccount = coalesce(c.ccount,0) + 1 - This creates a property on each code, called ‘ccount’ which shows the total number of users that ‘HAS’ any specific code.

..Still with me? Right, let’s get back to Citizens Advice’s data. 

A big cluster of data 

I now had a graph database of users and codes.

This wasn’t particularly interesting yet. All I’d really done was load my CSV file and defined the data structure. 

Building relationships between the issues 

This is where it gets fun. I wanted to remove the individual users and focus on how each code is linked to other codes. To get to that answer, I wrote this code: 

MATCH (a:Code)<-[:HAS]-(u:User)-[:HAS]->(b:Code) WHERE ID(a) > ID(b) RETURN AS Source, AS Target

And this actually means:  WHERE ID(a) > ID(b) - To stop duplication in the relationships  RETURN AS Source, AS Target - this is the starting point, but rather than writing a new relationship in the database, I decided to return the ‘strength’ of the connection between the two codes.

From the ‘ccount’ property of an issue, we know how many clients had any single issue. We can also count the number of clients who had a and b codes. 

To get mathsy, it’s (the probability of ‘a’ given ‘b’ + the probability of ‘b’ given ‘a’) divided by 2. This is the code I used: 

RETURN AS Source, AS Target, ((COUNT (c:Client)/a.ccount) + (COUNT (c:Client)/b.ccount) /2 AS Weight.

Getting closer, then. But what about the more unusual relationships? For example, you might have Client B with issues such as: 

  • tax problems, 
  • poor health.  

Should they really be presented as major elements of the visualisation?  I figured for this case, no, although this should be made clear when presenting the information.

I then moved onto looking at the size of a code. How many clients had that code? 

Here’s the code I used to work it out: 

RETURN AS Source, AS Target, ((COUNT (c:Client)1.0/a.ccount)LOG(a.ccount)+(COUNT (c:Client)1.0/b.ccount)LOG(b.ccount))/2 AS Weight

So now I had my data in a form I could use. HURRAH! 

BEN2 and DEB1 has a weight of 2
CON5 and BEN1 has a weight of 1

Drumroll please, I visualised the data…

The best visualization tool for graph data I could find (without building one) was a tool called Gephi. Plug in the data and BOOM! Yes, this is what I wanted to see.


A squirting mass of spaghetti!

It took quite a bit of trial and error. And a whole lot of tweaking to get the spider’s web into something presentable.

Sign up to our newsletter

Get content design insights sent straight to your inbox.

  • Choose what information you get: (required)