Taming Dexerto.com - One Million Hits At A Time
*** I gave a talk about this project at a Craft CMS Manchester meetup. Slides here ***
Towards the end of 2018 I was asked to help out with a site that was experiencing massive traffic growth year on year and needed some TLC to ensure it kept performing under increasing pressure. The site was Dexerto.com, a gaming and esports journalism website.
Initially our calendars didn't quite match up so it wasn't until Jan 2019 that I could get involved in the project. What I found was a Craft CMS site suffering as a victim of its own success.
Traffic Is Good
Dexerto's unique hits have been growing at 800% year-on-year. They are clearly doing something right. All of this traffic has allowed them to expand their enterprise into multiple territories as they've optimised their content, SEO and social media presence.
The site continues to go from strength to strength, continually building a loyal audience. This audience has recently grown large enough to generate unique hits in the millions and total request counts in the tens of millions. Daily.
This volume of traffic obviously affords them a lot of opportunities!
Traffic Is Bad
Dexerto was struggling under the load of all this traffic. This was culminating in daily outages during peak traffic hours, caused by a variety of issues but all resulting from too much load on a traditional server architecture. I'm sure any PHP dev out there will be sympathetic given this level of traffic.
The site is built using Craft CMS. I doubt there are many Craft sites out there at the moment handling this level of traffic and there's a good reason for that: traditionally architected PHP apps don't deal well with this type of load.
To its credit, Craft performed admirably, managing to serve unique hits into the hundreds of thousands per day, spread over three load-balanced application servers, before it started falling over. But something had to be done to keep the site alive as the traffic continued to grow.
Humble Beginnings
The site had originally been built, much like any other Craft site, with utility as the predominant driving force. The content structure had developed over time as business use cases became apparent. Templates had gradually grown around the need to serve unique content, functionality and targeted ads to users in different territories and on different devices.
The result was a site generating 300+ database queries per page load, with minimal template-level caching in place.
I can feel you wincing at these numbers. I did too.
** As an aside, I think finding the balance between optimisation and reacting to evolving business requirements is really hard when your business is growing this quickly. Their huge growth is in part a result of pushing hard towards the latter. Too many users is a good, if stressful, problem to have. **
To offset the significant load placed on the database, the site had recently benefited from the addition of nginx-based full page caching. This had been set up to cache indiscriminately with a fixed 60s TTL. At the time of implementation this additional caching layer had saved the day, but it didn't come without downsides:
- Admin-only front end functionality was occasionally leaking out to users via the cache
- The cache didn't respect cache headers sent by Craft so per-page cache TTLs weren't possible
- The site had such a large catalogue of articles that, even with this cache in place, the database was still serving 4000+ queries per second
- CSRF tokens attached to some form submissions and ajax requests were broken
The site was being served through CloudFlare, which was caching static assets (CSS, JS, images) but just acting as a pass-through proxy for HTML content.
The Goal
With traffic at this volume and a growth trajectory of 800% year-on-year there are only three sensible options to consider:
- Full page caching on a CDN with long cache TTLs and a cache busting mechanism to allow for content updates
- Static site generation and distribution of resulting HTML via a CDN
- Not using PHP
These options are in order of potential time required to implement and, as the site was already suffering downtime, the quickest fix was the most attractive. We therefore decided the best bang-for-buck would come from enabling full page caching with the existing CloudFlare distribution. CF would respect cache headers sent by Craft which would allow us to control the cache TTL for every page individually. CF's API also allows for arbitrary cache busting which would make it easy for us to clear the cache for particular articles and associated pages when they were created and updated.
Once implemented, this system would reduce the total number of requests reaching the Dexerto servers by around 80% based on our anticipated cache TTLs.
There were just a few hoops we had to jump through to get there.
Problem 1: Dynamic Content
All of the site's pages contain 'trending' content. By its very nature, this content will change over time, sometimes quite rapidly. We therefore need to exempt this content from any long TTL full page caching. There's no simple way to achieve part-page caching in CloudFlare as it has no awareness or ability to control the specific templates which will be rendered as part of a user request.
To get around this problem, any dynamic content was removed from the main page and replaced with an empty box, ready to load content into. When a page loads we perform an ajax call to a separate URL which returns just the trending content, which can then be inserted into the space on the page. This trending-specific request can be given a much lower cache TTL, allowing it to update much more frequently than the page into which it is being placed.
An added bonus of this approach is that we've given the trending content its own URL and, as the trending content is the same across all pages, the same URL is used to add it to all pages. The cached trending content is therefore only being cached once for the entire site and any associated database queries are only being executed once each time the trending content cache expires, rather than for every single page on the site.
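The client-side half of this is a single ajax call. Here's a minimal sketch, assuming a hypothetical /ajax/trending endpoint and placeholder element ID (the real names on the site differ):

```js
document.addEventListener('DOMContentLoaded', () => {
  const placeholder = document.getElementById('trending-placeholder');
  if (!placeholder) return;

  // The endpoint returns a pre-rendered HTML fragment and sets its own
  // short Cache-Control max-age, so CloudFlare caches it independently
  // of the long-TTL pages it gets injected into
  fetch('/ajax/trending')
    .then((response) => response.text())
    .then((html) => {
      placeholder.innerHTML = html;
    });
});
```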
Problem 2: CSRF Tokens
Craft outputs CSRF tokens into the HTML and therefore, when requests are served from CloudFlare's cache, any CSRF tokens are shared amongst all users. This will clearly result in CSRF token mismatches whenever a user submits a request to our back end in order to perform an action such as subscribing to the mailing list.
We solved this using a similar method to the dynamic content. We output CSRF placeholders in the HTML and populate them by performing an ajax request to the server which is completely uncached. This request just returns the current user's specific CSRF token, which can be dropped into the placeholders ready for when the user submits a form.
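Here's a sketch of the client side, assuming a hypothetical uncached /ajax/csrf endpoint that returns the current session's token name and value as JSON:

```js
// Hypothetical endpoint; served with no-cache headers so every user
// receives their own token rather than a shared, cached one
fetch('/ajax/csrf', { credentials: 'same-origin' })
  .then((response) => response.json())
  .then((data) => {
    // Fill every CSRF placeholder input with the user's real token,
    // ready for whenever they submit a form
    document
      .querySelectorAll(`input[name="${data.csrfTokenName}"]`)
      .forEach((input) => {
        input.value = data.csrfTokenValue;
      });
  });
```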
Problem 3: Mobile Specific Content
Several parts of the site were making use of Craft's isMobileBrowser() in order to change specific content and ad placements between mobile and desktop devices.
By caching requests in CloudFlare we lose any differentiation between mobile and desktop users. If a mobile user is the first to hit a URL, their content is what will be cached and served to all subsequent users. We therefore needed to make all the templates device agnostic.
Some parts of this were simply a case of replacing template conditionals with media queries to show or hide content appropriately on the client side. However, serving different ad placements to different device types proved more difficult.
The ad technology being used was Google Publisher Tag, a very common client-side ad placement system. Luckily we found that GPT includes a media-query-esque method of configuring ad placements and the type of ads which can be loaded into them. After a bit of experimentation we were able to standardise the ad config code across all device types in order to allow caching across all devices.
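For the curious, the feature in question is GPT's size mapping API: you declare viewport breakpoints up front and GPT picks the appropriate ad sizes in the browser, so the same cached HTML works on every device. A sketch with placeholder ad unit and slot names (not Dexerto's real config):

```js
googletag.cmd.push(function () {
  // Map viewport widths to the ad sizes allowed at each breakpoint;
  // the choice happens client-side, so the cached page stays device agnostic
  var mapping = googletag.sizeMapping()
    .addSize([1024, 0], [[728, 90], [970, 250]]) // desktop viewports
    .addSize([0, 0], [[320, 50], [300, 250]])    // anything smaller
    .build();

  googletag
    .defineSlot('/1234/example-unit', [[728, 90], [320, 50]], 'div-gpt-example')
    .defineSizeMapping(mapping)
    .addService(googletag.pubads());

  googletag.enableServices();
});
```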
Problem 4: Cache Busting
If article pages on the site are being cached by CloudFlare for one hour, it could take anywhere up to that long for updates made in the back end to start showing up for users. This isn't acceptable in the context of fast-paced journalistic content.
We decided to implement a custom cache busting mechanism which took the form of a Craft plugin. Whenever an article is created or updated in the Craft control panel, this plugin figures out all of the potential URLs that might have been affected by the change. In the case of an article being updated this could affect:
- Any URLs for the article itself
- The home page
- The article author's overview page
- Any associated category or tag index pages
Once this list of URLs has been established they are dispatched to CloudFlare's API as a cache invalidation request which results in these URLs being removed from the cache.
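The plugin itself is PHP, but the purge it dispatches is a single call to CloudFlare's purge_cache endpoint. Sketched here as a fetch request to show the shape of the payload (the zone ID, token and URLs are all placeholders):

```js
const zoneId = 'YOUR_ZONE_ID';     // placeholder
const apiToken = 'YOUR_API_TOKEN'; // placeholder

// Ask CloudFlare to drop a specific list of URLs from its cache:
// POST /zones/:zone_id/purge_cache with a "files" array purges by URL
fetch(`https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiToken}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    files: [
      'https://www.dexerto.com/example-article', // the article itself
      'https://www.dexerto.com/',                // the home page
    ],
  }),
});
```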
I've found that this process takes between 5 and 10 seconds in the majority of cases, which is much more palatable than an hour!
The Result
After solving each of these problems we were able to activate full page caching in CloudFlare whilst maintaining support for all of the existing functionality on the site.
Average TTFB dropped from around 7s to 80ms!
Outages dropped from multiple times per day to (so far) zero.
Total database load dropped by about 75%. (This is still quite high given our caching strategy, but most of the remaining load is caused by view count tracking, CSRF token calls and a few uncachable actions.)
The team now has breathing space to consider their next move - whether that be re-architecting the site towards a more static approach or continuing to develop a system utilising CloudFlare's caching system as its base.
Oh, and they can actually start thinking about feature and functionality development again! Much more rewarding than worrying about servers every day.
A Failure: Craft Template Caching
I mentioned earlier that we were trying to find the quickest fix to prevent the site from regularly falling over. When I first started working on the codebase an initial review revealed that there was very little template caching in place, resulting in every page generating 300+ database queries per request. I thought the quickest and easiest win in this situation would be to add template caching around any modules which were shared over multiple pages (navigation, sidebars, trending lists, etc.).
I added this functionality to the site's templates, being very careful to make the shared module caches global and to set custom cache keys which would allow the module caches to be shared between pages. This approach brought the number of database hits per page request down to around 90, a massive improvement. I also ensured that query caching was disabled as we didn't need that on top of the pure HTML caching.
I somewhat naively assumed that conditional cache busting based on element updates wouldn't pose too much of an issue, as I was defining my caches with global keys which contained variables with only tens of potential values. This should have limited the number of conditions to approximately:
number of defined cache blocks × potential key variable values × number of elements referenced in cache blocks
And indeed, during local testing I was able to generate several hundred cache conditions in the database before the number plateaued.
When these changes were pushed to live everything looked great. Page load times decreased, database load fell significantly and everything worked as expected. However, a couple of days later the database attempted to perform an automated backup and crashed.
The cache criteria table had grown into the GB range, and dumping it was causing the entire server to fall over 🤦
We quickly truncated the offending table and disabled all cache tags in the site to prevent the same happening again.
Unfortunately, I wasn't able to determine exactly why the cache conditions were growing at such an alarming rate, especially as the cache blocks weren't linked to unique URLs. There also doesn't seem to be any option in Craft to disable the conditional cache busting functionality - if I'd been able to just cache the HTML for a fixed period that would have been fine, but alas, it's not possible.
A lesson for future projects: if you have high traffic, be extremely careful when making use of Craft's cache blocks in templates. You will probably want to use something like Cache Flag instead, a plugin which does exactly what I actually needed (but only learned about afterwards).