Scaling for Black Friday - Part 1
Black Friday is the busiest day of the year for the Mention Me platform and with it comes a whole host of challenges.
If you're interested in what makes a successful Black Friday for the engineering team of a growing martech scale-up, read on. In this two-part series we'll first explore the challenge Black Friday brings and review how our platform performed. Later, we'll look at a specific optimisation challenge and share a few secrets about the future of our architecture too…🤫
The Challenge
We've talked in the past about how Black Friday can be an opportunity to run powerful marketing campaigns for our clients - but today is about looking at the technology that powers it and the challenges Black Friday brings.
Each year, more traffic goes through our platform. Our total number of clients has grown from ~20 in 2015 to over 450 in 2021. And the scale of those clients has grown too, including high-volume clients like Asos, Missguided, Farfetch and PrettyLittleThing.
This means our technology stack can't sit still - what worked for us in 2015 isn’t fit for purpose in 2022.
Our graph below brings this to life, showing the volume of transactions through the platform each year since 2016. You can see the increase in traffic volumes, and each Black Friday sets a new record. 2020 was of course an outlier year, but we expect to resume growing the graph from 2022 📈
All in all this translates to large volumes of HTTP requests: at peak we have delivered over 900 requests per second (rps) to consumers across the globe. That's a substantial increase compared to 2015, when we were nearer 100 rps.
Each request can then translate into many calls to key components of our backend, including queues, databases, caches and NoSQL stores.
In addition to the scale challenge, we have key performance and reliability criteria. We aim to serve the majority of those requests within ~100ms and our platform is available 24/7 – making the challenge even more difficult.
Let's explore the architecture that enables us to do this below.
Our Architecture
Our platform has evolved over the years. We’re constantly making changes. Right now, we deploy multiple times a week, and are aiming to get that down to multiple times an hour. We live in AWS and use GCP for our data platform.
One of the more unusual aspects of the Mention Me platform is that our traffic is write-heavy.
We can serve a lot of our content via HTTP and in-memory caches (we use AWS' CloudFront and ElastiCache extensively). But, unlike a typical e-commerce or transactional website, most of our content is heavily personalised and unique to the referring individual. After all, we can't give everyone the same referral link.
Serving every customer a unique and personalised experience requires both reading and writing to our datastores. Equally, we need to track and analyse the minutiae of each referral program to give our clients the best reporting, and that requires writing a lot of data. This gives us some unique architectural challenges.
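For the content that can be cached, the pattern is essentially cache-aside: check the cache first and only fall through to the datastore on a miss. Here's a minimal illustrative sketch (not our production code), with a plain Python dict standing in for ElastiCache and a stub standing in for the database:

```python
import time

class CacheAside:
    """Toy cache-aside store: a dict stands in for ElastiCache (Redis)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._cache = {}          # key -> (value, expires_at)
        self.backend_reads = 0    # counts fall-throughs to the database

    def _load_from_database(self, key):
        # Stand-in for a real datastore read.
        self.backend_reads += 1
        return f"content-for-{key}"

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.monotonic() < expires_at:
                return value      # cache hit: no database round trip
        value = self._load_from_database(key)
        self._cache[key] = (value, time.monotonic() + self.ttl)
        return value

store = CacheAside()
store.get("landing-page")   # first read misses and hits the database
store.get("landing-page")   # second read is served from memory
print(store.backend_reads)  # 1
```

The catch, as above, is that this only helps for shared content; a referral link that is unique per person defeats the cache, which is why so much of our traffic ends up as writes.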
We lean on asynchronous processing to achieve this. We use AWS Simple Queue Service (SQS) to push writes into queues for later processing, aiming to do only the bare minimum needed to serve a user request there and then. This helps keep our response times very low.
Email is our best example. It takes time to generate an email, connect to our email provider and send it - so, to make sure the user isn’t affected, we do it asynchronously.
But there are many other async processes across our platform. This includes data processing for analytics, giving out rewards and tracking referral progress. Each of these is pushed into its own queue, which can be independently scaled (or, in case of emergency, paused).
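The shape of the pattern is simple enough to sketch. In this toy version (names are hypothetical, and Python's stdlib Queue with a worker thread stands in for SQS and its consumers), the request handler only enqueues the slow work, such as sending a referral email, and responds immediately:

```python
import queue
import threading

email_queue = queue.Queue()   # stands in for an SQS queue
sent = []                     # records what the worker "sent"

def email_worker():
    """Background consumer: drains the queue and sends email out of band."""
    while True:
        message = email_queue.get()
        if message is None:            # sentinel: shut the worker down
            break
        # The slow part (render email, talk to the email provider)
        # happens here, off the request path.
        sent.append(f"sent referral email to {message['to']}")
        email_queue.task_done()

def handle_referral_request(user_email):
    """Request handler: do the bare minimum, enqueue the rest."""
    email_queue.put({"to": user_email})   # fast: just enqueue
    return {"status": "ok"}               # respond to the user immediately

worker = threading.Thread(target=email_worker)
worker.start()

response = handle_referral_request("friend@example.com")
email_queue.join()       # wait for the worker (for demo purposes only)
email_queue.put(None)    # stop the worker
worker.join()
print(response, sent)
```

The user's response time is bounded by the enqueue, not by the email provider, and the queue absorbs bursts when traffic spikes.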
Speaking of Scaling...
Prior to 2021 we relied on manual scaling. Mention Me's traffic hasn't tended to fluctuate significantly or suddenly, so manually scaled EC2 instances served us well. We knew this would only last so long, and in 2021 our team put in a significant amount of work to embrace Docker on AWS Elastic Container Service (ECS), using Fargate, auto-scaling and spot instances for improved cost efficiency. We'll do a deep dive on our deployment in another post soon, as in late 2021 we reduced the time it takes from ~45 minutes to around 5 minutes 🚀
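The intuition behind target-tracking auto-scaling is easy to sketch: keep the task count roughly proportional to how far the observed metric (say, average CPU) is from its target. This is a simplified illustration of that idea, not AWS's exact algorithm, and the parameter values are made up:

```python
import math

def desired_task_count(current_tasks, metric_value, target_value,
                       min_tasks=2, max_tasks=50):
    """Illustrative target-tracking rule: scale the number of tasks in
    proportion to observed metric / target, clamped to [min, max]."""
    raw = current_tasks * (metric_value / target_value)
    return max(min_tasks, min(max_tasks, math.ceil(raw)))

# 10 tasks running at 90% average CPU against a 60% target -> scale out
print(desired_task_count(10, 90, 60))   # 15
# 10 tasks idling at 30% against the same target -> scale in
print(desired_task_count(10, 30, 60))   # 5
```

The clamp matters: a minimum keeps you resilient to a single task failing, and a maximum caps cost if a metric misbehaves.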
The above helped us to scale the platform as a whole. However, it's not the only part of the story.
In our next blog post we'll deep dive into an example optimisation of a component of the Mention Me platform. And we'll explore what the future of Black Friday looks like at Mention Me.
See you soon 👋
Part 2 of our series has been published - click through to read on.