Developers Retreat - An exercise in big-ish data

Many years ago the UK police opened up the data they held on crimes to the public. You could use an interactive map which showed info on crimes, the type of crime, roughly what street it happened on and the outcome of said crime. Pretty cool, and obviously the first thing you did was look at the area you lived in to see what had been happening.

Skip forward to 2023 and a few interesting accounts I follow on Twitter had brought up the topic of pSEO. Which is short for "Programmatic Search Engine Optimisation". The concept is where you take a large dataset and make valuable and interesting content from it. Since the dataset is large, this allows you to make many, if not hundreds or thousands of pages full of content.

When I heard this, I immediately thought of the police data. The number of crimes in the UK since the last time I looked must be huge by now and I was right, it was. Around 20 million crimes! However the icing on the cake was that this data is available to download for free! I downloaded all the data and began to look through and parse it.

Working with big datasets

Now to some people reading this ~20 million rows might not seem too big, however it's probably the biggest dataset I've worked on to date. So this was a challenge I was ready to take on. It needed proper planning on the database, the table structures and indexes. Fetching a few thousand rows or smaller doesn't really have a big impact on performance, but querying millions certainly will.

Another challenge was that the crime data contained spatial data, such as latitude and longitude coordinates. I needed to be able to say "Get me all the crimes in this area / bounding box". This is where the right database for the job comes into play. Normally MySQL is my preferred choice. However it soon became clear that searching through millions of crimes in a given area took a long time, even with spatial indexes at play. Think 30 - 60+ seconds. Which is totally unacceptable given that I wanted to turn this into a website.

Cue PostgreSQL. Upon doing some research it turns out Postgres is much stronger when it comes to spatial querying and I found that the query times dropped to more workable times of 10 - 30 seconds. However, this still wasn't acceptable for a website. I do implement caching on the website which brings the times down to around 1 second per page load, but to me this wasn't good enough. I needed to find a better way to query the data, one that wasn't so resource intensive.

Optimising spatial queries

It was clear that given the size of the dataset and the time taken to get results back that this wasn't the way forward. However and quite luckily, the crime dataset contained LSOA codes. LSOA stands for "Lower Super Output Areas" Which are like IDs assigned to areas of the UK. I was able to find a list of all the LSOAs which i could use to query the crime data! So instead of saying, "Get me all the crimes in this area using the lat / lon coordinates". I could now say "Get me all of the crimes where the LSOA code is this list of LSOA codes".

The LSOA codes are grouped by region, so I was able to select all the codes for a given region. This turned a 10 - 30 query into a sub 100ms query, which is a massive improvement and something I could work with! With this breakthrough I was able to perform more complex queries which resulted in much better content. Things i've included are:

  • Number of crimes per crime type for this year vs the previous year
  • Number of crimes per electoral ward and crimes per 1000 people
  • Interactive crime map which plots the crimes to the rough area it was committed in.

Adding in housing and salary data

Although the crime data was a big dataset to work with, I felt that I needed to add in additional data, after all there are plenty of websites which show crime data. I needed to make mine stand out from the rest. I managed to find housing data and salary data estimates for each region in the UK, so now you can view house price averages for detached, semi-detached, terraced and flats for the current year all the way back to 2010. You can also view the mean and median salaries for each region, not bad at all! Plus this adds to the content - all of which I find to be quite interesting.

The plan was to build out several hundred pages with stats, graphs and tables which should hopefully prove to be interesting and engaging. I'm looking to keep people on a page for a long time and then for them to want to view other pages / areas. All of which helps with rankings and bounce rate.

The website I've built is called Developers Retreat. A matching domain name that I've been sitting on for more than 15 years, I just didn't have the right idea for it until now. There are 2 main sections to Developers Retreat:

Final thoughts

This idea came together fairly quickly and was a challenge to work with a bigger dataset than what I'm used to. I got exposure to:

  • A new Database - PostgreSQL
  • Which included working with spatial data and indexing.
  • Trying out what indexes work best and in what situations
  • Importing large datasets

More Posts

Laravel PDF API

As part of my day job I was required to make PDF's from HTML templates and expose this via an...