Scaling for Black Friday - Part 2

By Ed Hartwell Goose — January 17, 2022

Welcome back to our two-part series on Scaling for Black Friday. If you've not read our first part, visit our previous blog post. We'll wait!

Last time we explored the power of good, scalable architecture and the power of the cloud. This post opens up on a challenge we had and shares our load testing plans. Then we'll peer together into our crystal ball for what the future might hold… 🔮

Optimisation & Contention

Horizontal scaling is a powerful way of growing our platform. With providers like Amazon and Google providing compute power cheaply and quickly, it's easy to add more servers without breaking the bank.

But not all components can be scaled quite so simply. Even with the best technology there are always opportunities to optimise parts of our product to scale more efficiently.

A great example is our voucher fulfilment system. We give out tens of thousands of vouchers a day to consumers across the globe. The more efficiently we give out vouchers, the better experience our customers have.

Unfortunately, contention on vouchers can be a major bottleneck for our system performance. This is often exacerbated in periods of high traffic, as multiple users contend for a finite supply of unique vouchers. This leads to different parts of our scaled up platform "fighting" over vouchers – leading to long delays, unhappy customers and unhappier engineers.

During load testing (more on this shortly), we found voucher retrieval was an area ripe for improvement. As a result, we extensively examined, tested, reworked and finessed our SQL queries and the indexes that support them to improve performance. (You can read the nitty gritty details I posted on dba.stackexchange.com as we tackled this challenge.)

Unexpectedly, we also found the optimal strategy was two queries rather than one. While this doubled the number of queries, the two highly specific queries, each backed by an equally specific index, cut the query time from around 1 second to under 50ms.

This led to a huge performance boost which directly impacts the consumers using our product. Years later, that code is still cheerfully running in production too.
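
To make the two-query idea concrete, here is a minimal sketch of one way such a split can look. It is an illustration only, not the production SQL (those details are in the dba.stackexchange.com post). The `vouchers` table, its columns, the index name and the `claim_voucher` helper are all hypothetical, and SQLite stands in for MySQL so the example runs on its own. The first query finds an unclaimed voucher through a narrow index; the second claims it by primary key, with a guard so that a competing request loses cleanly and retries instead of waiting.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical vouchers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vouchers ("
        " id INTEGER PRIMARY KEY,"
        " campaign_id INTEGER NOT NULL,"
        " code TEXT NOT NULL,"
        " consumer_id INTEGER)"  # NULL means the voucher is still unclaimed
    )
    # Narrow index that serves the 'find an unclaimed voucher' lookup.
    conn.execute(
        "CREATE INDEX idx_vouchers_available ON vouchers (campaign_id, consumer_id)"
    )
    conn.executemany(
        "INSERT INTO vouchers (campaign_id, code) VALUES (?, ?)",
        [(1, "SAVE10-A"), (1, "SAVE10-B")],
    )
    conn.commit()
    return conn


def claim_voucher(conn, campaign_id, consumer_id, max_attempts=5):
    """Claim one unclaimed voucher using two small queries.

    Returns the voucher code, or None if the supply is exhausted or every
    attempt lost the race to another claimant.
    """
    for _ in range(max_attempts):
        # Query 1: pick a candidate voucher for this campaign.
        row = conn.execute(
            "SELECT id, code FROM vouchers"
            " WHERE campaign_id = ? AND consumer_id IS NULL LIMIT 1",
            (campaign_id,),
        ).fetchone()
        if row is None:
            return None  # nothing left to give out
        voucher_id, code = row

        # Query 2: claim it by primary key. The IS NULL guard means that if
        # another request claimed it first, zero rows change and we retry.
        cursor = conn.execute(
            "UPDATE vouchers SET consumer_id = ?"
            " WHERE id = ? AND consumer_id IS NULL",
            (consumer_id, voucher_id),
        )
        conn.commit()
        if cursor.rowcount == 1:
            return code
    return None
```

Calling `claim_voucher(make_demo_db(), 1, 101)` returns one of the two demo codes; once both are claimed, further calls return `None`. Whether a pattern like this helps in practice depends on the database engine, the indexes and the isolation level, so it needs measuring under realistic load.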

Combining optimisations and auto scaling ultimately enabled us to reduce the amount of preparation we did for Black Friday 2021. Instead, we relied on the excellent work by the AWS team to ensure our platform would automatically scale to meet users' needs.

Black Friday 2021 was a roaring success. While a couple of team members kept an eye on our monitoring and alerting systems to ensure nothing was anomalous, the majority of the team stayed focussed on extending our platform.

The next Black Friday is never too far away though, so let's look forward to 2022 and how we plan to make it a success.

Careful Planning

AWS technology plays a major part in our architecture, but our Black Friday is not without planning. Each Black Friday (and really, every day!) is about careful planning and prediction. While our compute platform, our queues and services like CloudFront can auto scale, we still scale up our MySQL databases for the big day.*

* Actually, traffic for Black Friday often starts the week before and rolls through Cyber Monday and into early December - so really it's the big fortnight.

Black Friday is a moving target. We continue to deploy throughout November and December and we don't feature freeze. We're bringing new clients onto our platform throughout the month too.

Load testing is at the core of our testing strategy. We have used JMeter and oliverlloyd/jmeter-ec2 in the past, but recently switched over to Locust. We found Locust easier to use, and easier to monitor at larger volumes of requests, whereas JMeter had a tendency to crash and was hard to monitor once our volumes grew beyond a few hundred requests per second.

Locust enabled us to simulate wide ranges of different consumer behaviour, across clients using different features. Load testing a single component or a single user persona would be easy. But our experience tells us that the fascinating bugs and problems come from reflecting real user behaviour and load profiles on our system.

Our voucher example above was found in production only as a result of examining real traffic.

We start each load testing exercise by reviewing logs of requests from the last 30 days and reviewing our findings from previous Black Fridays. We also look carefully at what has changed since last year. Do we have a new feature that hasn't been "battle-tested" that will need close examination?
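
As a rough illustration of that first step, the sketch below turns a list of logged requests into a traffic profile and then into integer weights for a load test. The log format (pairs of endpoint and client), the endpoint names and both helper functions are assumptions made up for this example, not a description of our actual tooling.

```python
from collections import Counter


def load_profile(requests, min_share=0.01):
    """Return {endpoint: share of traffic}, dropping endpoints below min_share.

    `requests` is an iterable of (endpoint, client) pairs, a hypothetical
    simplification of real request logs.
    """
    counts = Counter(endpoint for endpoint, _client in requests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        endpoint: count / total
        for endpoint, count in counts.most_common()
        if count / total >= min_share
    }


def to_task_weights(profile, scale=100):
    """Convert traffic shares into integer weights, with a minimum of 1."""
    return {
        endpoint: max(1, round(share * scale))
        for endpoint, share in profile.items()
    }


# Made-up sample standing in for 30 days of request logs.
sample_requests = (
    [("/vouchers/claim", "client_a")] * 6
    + [("/offers", "client_b")] * 3
    + [("/health", "client_a")] * 1
)

profile = load_profile(sample_requests)
weights = to_task_weights(profile)
```

Weights like these could then be passed to Locust's `@task(weight)` decorator so that simulated users hit endpoints in roughly the same proportions as real traffic. A real profile would also need to capture request sequences and per-client differences, which this sketch ignores.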

By building these thorough load tests, examining our assumptions and comparing to real data we can build a picture of our platform's strengths and weaknesses. This enables us to focus our team on researching and implementing efficiency improvements, just as we did with vouchers above.

Our planning often begins in late summer and runs right through to the day itself. We scale up, double check our load testing, ensure our monitoring is correct and our disaster plans are in place. There's a bit of finger crossing too 🤞

Fortunately, we've 'survived' each Black Friday and are looking forward to the next already.

As our platform grows, so will the challenge of ensuring that all its parts interoperate seamlessly and perform well.

Looking to the future

As our company grows, so will our platform. With it will come new hurdles. The volume of data we hold, for instance, is approaching the tens of billions of rows and problems like data migrations are getting harder. Our scale and presence globally will continue to grow too.

Key parts of our future will include:

Jumping into serverless technology to serve users even quicker "at edge", globally
Reimagining our data stores for the future. That will mean embracing even more NoSQL as our data volumes grow, as well as reimagining data processing to support real time data analytics
Exploring and increasing our usage of advanced architecture options like decoupling, event-driven design and asynchronous compute to let us get to the next level of scale
More advanced and continuous performance testing to catch bottlenecks before they reach end users
... amongst many more!

These are ambitious goals, and we're looking for engineers who can help us achieve them. If reading this has got you feeling fired up about tackling the many challenges and opportunities that engineering a platform like ours presents, get in touch via the form below or visit our careers site. We'd love to meet you.