Using CDN to Accelerate Drupal API Sourcing for GatsbyJS

95x faster sourcing of Drupal JSON API data for GatsbyJS, thanks to KeyCDN.

Champions of this story: Drupal CMS, KeyCDN and GatsbyJS

Blazing fast frontend

For those of us building decoupled websites and apps, Drupal CMS provides a full-featured, state-of-the-art JSON API. It ships with the core (as of version 8) and requires very little configuration.

GatsbyJS, using the gatsby-source-drupal plugin, makes it easy to source the Drupal data and exposes it conveniently via GraphQL.

Elegance and flexibility. And — after deployment — also the celebrated "blazing fast" performance.

Freezing slow builds

On the other hand, the build performance, as the site grows bigger and more complex, becomes agonizingly slow. To construct a representative static web frontend, Gatsby needs to request all of the relevant data from Drupal at build time. This is not a problem for small and simple sites, but sites with tens of thousands of nodes and many fields per node will suffer from the fact that the JSON API calls are expensive.

Recently I've been working on a frontend for the acclaimed Encyclopedia of World Problems and Human Potential. It's supposed to be just a sandbox to run data analysis and visualization experiments (see askthefox.org), but I needed to fetch pretty much all data from its Drupal 9 backend. Now, the Encyclopedia has almost 100 thousand nodes, with some content types sporting more than 30 different fields, and several tens of thousands of taxonomy entries.

While the backend runs on a respectable 8 vCPU server with 32GB RAM and an SSD drive, it took Gatsby more than 30 minutes to do a fresh fetch of all the Drupal data via JSON API (the whole build took almost 45 min).
To make things worse, the server suffered enormously under the load: the whole virtual machine and Drupal's functionality slowed down to a crawl, and sometimes even crashed.

Tweaking gatsby-source-drupal options like concurrentFileRequests and concurrentAPIRequests helped to prevent critical overload of the backend, but it obviously could not improve performance.

Disabling unnecessary resources using the Drupal module JSON:API Extras at /admin/config/services/jsonapi/resource_types must have helped, but the speedup was too small to measure.

It's important to note that we are talking here about fetching fresh data. Gatsby has evolved to be pretty smart about build caching, incremental builds, and the Slice API, so repeated builds are much faster. I definitely recommend using the Fastbuilds option (rebuilding only where the data has changed) provided by the Drupal module Gatsby to achieve significantly faster builds.

However, unavoidably, now and then a full build is necessary. And it's unacceptable for it to take three quarters of an hour.

The obvious idea was to cache the Drupal JSON resources.

Cache to the rescue

First I looked at caching on the Drupal side. The module JSON:API Boost (in combination with the module Warmer) makes it possible to warm the Drupal cache, so that the API responses are quicker and less taxing for the server. I set it up at /admin/config/development/warmer/settings and configured its cron job at /admin/config/system/cron/jobs/manage/warmer_cron, but after running the queues I was not able to measure any significant build speed improvement.
NB: one cannot use this module in combination with Key Auth access control. JSON:API Boost provides no way to add a header to its requests, therefore all requests result in "401 Access Denied". There's a feature request to add the option of adding headers to requests.

Then I turned my attention to the possibilities of external caching.

Needing to speed up the consumption of API data, I first considered using an API gateway. I took the time to set up an instance of AWS API Gateway and got it to work handsomely, but eventually I decided it was a bit too complex, opaque and limited (e.g. caching for max 3600 seconds). Everything I needed it to do could be done in a more flexible and even cheaper way using more basic components such as a Content Delivery Network (CDN).

Request throttling, a feature common in API gateways to protect the backend from overload, is not something that a CDN would offer. On the other hand, requests to a CDN are cheap and fast, and the backend is by definition insulated from high usage and many forms of DoS, so throttling as such was not a requirement for me.

Since my Drupal backend runs on AWS infrastructure, when I needed a CDN I absentmindedly reached for CloudFront. It worked beautifully. I created a distribution with Drupal as the origin and set up authentication via a header. Preliminary tests showed some build speed improvement over fetching from the naked origin.

For an even higher response rate I wanted to use a caching behavior called stale-while-revalidate. This approach to proxy caching ensures that each request gets a fast response as long as the data is still cached. That means that sometimes slightly aged (i.e. "stale") data is served immediately, while the CDN makes an asynchronous request to the origin to fetch fresh data. After painstaking research I was surprised to find that, at the time of writing, this feature was not yet implemented on the CloudFront CDN!
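
The decision a cache makes under stale-while-revalidate can be sketched roughly as follows. This is only an illustration of the idea, not KeyCDN's or any other CDN's actual implementation; the names decide and CachePolicy and the durations are made up for the example.

```typescript
// Possible ways a cache can answer a request for an entry it already holds.
type CacheDecision = "fresh" | "stale-while-revalidate" | "expired";

// Illustrative policy; both field names are assumptions for this sketch.
interface CachePolicy {
  maxAgeSeconds: number;               // how long an entry counts as fresh
  staleWhileRevalidateSeconds: number; // extra window in which stale data may be served
}

// Decide how to answer, based only on the age of the cached entry.
function decide(ageSeconds: number, policy: CachePolicy): CacheDecision {
  if (ageSeconds < policy.maxAgeSeconds) {
    return "fresh"; // serve from cache, no origin request needed
  }
  if (ageSeconds < policy.maxAgeSeconds + policy.staleWhileRevalidateSeconds) {
    return "stale-while-revalidate"; // serve the stale copy now, refresh in the background
  }
  return "expired"; // too old: the client has to wait for the origin
}

const policy: CachePolicy = { maxAgeSeconds: 300, staleWhileRevalidateSeconds: 3600 };
console.log(decide(120, policy));  // within the freshness window
console.log(decide(900, policy));  // stale, but still served immediately
console.log(decide(7200, policy)); // outside both windows
```

The point of the middle branch is that the client never waits for the slow origin as long as a reasonably recent copy exists.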

Disgruntled, I searched for a CDN that does support this much requested feature. Three CDNs came out on top of my list: Cloudflare (needlessly complex, unattractive), Fastly (slick but essentially more a monster Varnish than a classic CDN), and KeyCDN. I went for the last of these. KeyCDN is no-nonsense, simple yet powerful, and comes with superb documentation. To learn more about its implementation of stale-while-revalidate, see this article.

I set up a CDN zone for the Drupal origin URL and entered its URL as baseUrl in the options of the Gatsby plugin gatsby-source-drupal, but tests still showed no performance improvement! Then I realized that Gatsby only fetched the root API URL /jsonapi via the CDN, parsed it and followed the links within. The problem was that the links in /jsonapi pointed to the Drupal origin, not to the CDN.

It was necessary to rewrite the API URLs fetched from Drupal, something that Gatsby was not able to do. I posted a discussion question and a ticket, and I've created a pull request to add a new option proxyUrl to be used for CDN.
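
The kind of rewriting needed can be sketched as below. This is a simplified illustration, not the code from the pull request: the helper names rewriteToProxy and stripTrailingSlash are hypothetical, and the example.com and kxcdn.com URLs are placeholders.

```typescript
// Remove a single trailing slash so that prefixes compare cleanly.
function stripTrailingSlash(url: string): string {
  return url.charAt(url.length - 1) === "/" ? url.slice(0, -1) : url;
}

// Point a link at the CDN if (and only if) it targets the Drupal origin.
function rewriteToProxy(link: string, baseUrl: string, proxyUrl: string): string {
  const origin = stripTrailingSlash(baseUrl);
  const proxy = stripTrailingSlash(proxyUrl);
  const isOriginLink = link === origin || link.indexOf(origin + "/") === 0;
  return isOriginLink ? proxy + link.slice(origin.length) : link;
}

const baseUrl = "https://www.example.com/";    // Drupal origin (placeholder)
const proxyUrl = "https://example.kxcdn.com/"; // CDN zone URL (placeholder)

// Links as they might appear in a JSON:API response from the origin.
const originLinks: string[] = [
  "https://www.example.com/jsonapi/node/article",
  "https://www.example.com/jsonapi/node/article?page%5Boffset%5D=50",
  "https://other.example.org/untouched",
];

const cdnLinks = originLinks.map((link) => rewriteToProxy(link, baseUrl, proxyUrl));
console.log(cdnLinks.join("\n"));
```

Links that do not start with the origin URL are left alone, so anything pointing to a third-party host keeps working.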

The PR was accepted into the GatsbyJS codebase and published as part of Gatsby v5.2 on 2022-11-25 (and also backported to Gatsby v4).

    Setting up Drupal 9

    Enable core module JSON:API. 

    Install Drupal module JSON:API Extras and enable modules JSON:API Extras and Defaults. Make sure to enable “Include count in collection queries” at /admin/config/services/jsonapi/extras.

    Install Drupal module Gatsby and enable modules Gatsby and Gatsby JSON:API Extras. Configure Fastbuilds for quick rebuilds based on only changed data.

    If you wish to control access to the Drupal API, install and enable the Drupal module Key Auth. At /admin/config/services/key-auth set "Header" as the only detection method and decide its name and length. Create a user account to answer the API requests and give it sufficient permissions. Generate a key for the user from the user page and note it down; we will need it in the GatsbyJS settings.

    For granular control of CDN caching we need to set particular Cache-Control headers on the origin. Install and enable the module HTTP Response Headers, go to /admin/config/system/response-headers and add a response header named "Cache-Control" with the following value: "max-age=300, s-maxage=300, stale-while-revalidate=3600, stale-if-error=3600" (without quotes). This instructs the CDN to keep all API responses cached for 300 seconds (5 min). If a request arrives after the data has expired but within a further 3600 seconds (1 hour), it is served immediately from the cache, and the CDN refreshes the cache asynchronously from the Drupal origin. Similarly, should the Drupal API return an error code, responses are served from the cache for up to 3600 seconds (1 hour).
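
To make the resulting time windows explicit, here is a small sketch that parses the header value above and derives them. The parser is deliberately minimal and the names parseCacheControl and CacheDirectives are assumptions for this example; it is not how KeyCDN processes the header.

```typescript
// Map of directive name to its value in seconds.
interface CacheDirectives {
  [directive: string]: number;
}

// Parse the numeric directives of a Cache-Control value; valueless ones are ignored.
function parseCacheControl(header: string): CacheDirectives {
  const result: CacheDirectives = {};
  for (const part of header.split(",")) {
    const trimmed = part.trim();
    const eq = trimmed.indexOf("=");
    if (eq < 0) continue; // e.g. "no-store" carries no number
    const seconds = parseInt(trimmed.slice(eq + 1), 10);
    if (!isNaN(seconds)) result[trimmed.slice(0, eq).toLowerCase()] = seconds;
  }
  return result;
}

const headerValue = "max-age=300, s-maxage=300, stale-while-revalidate=3600, stale-if-error=3600";
const directives = parseCacheControl(headerValue);

// A shared cache such as a CDN prefers s-maxage over max-age.
const freshFor = directives["s-maxage"] ?? directives["max-age"] ?? 0;
const staleUntil = freshFor + (directives["stale-while-revalidate"] ?? 0);

console.log(`fresh for ${freshFor} s, may be served stale until age ${staleUntil} s`);
```

With the values used here, a response is fresh for 5 minutes and can still be answered instantly from the CDN for another hour after that.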

    Since we need to send our own Cache-Control header with every JSON API response, it is crucial to DISABLE Drupal's own caching. Go to /admin/config/development/performance and set "Browser and proxy cache maximum age" (i.e. header max-age) to "<no caching>".
    Download and enable module Content Length Header since KeyCDN is unable to cache responses that do not specify that header.

    Setting up KeyCDN

    At https://app.keycdn.com/zones/index create a new zone and configure it as follows (similar to my needs):

    • Zone Name: /choose a unique string, will become part of the CDN URL/
    • Zone Status: active
    • Zone Type: pull
    • Origin URL: https://www.yourdrupalsite.com
    • Origin Shield: enabled /ensures central caching of responses, as opposed to caching per CDN POP/
    • Max Expire (minutes): 1440 /just a default value; will be overridden by our Cache-Control headers set at /admin/config/system/response-headers/
    • Ignore Cache Control, Ignore Query String, Forward Host Header, Cache Key Scheme, Cache Key Host, Cache Key Cookie, Cache Key Device, Cache Key WebP, Cache Key Country: disabled /all of these/
    • Cache Brotli: enabled /verify your backend using https://tools.keycdn.com/brotli-test/
    • Cache Cookies: enabled /caching even if cookies are present/
    • Strip Cookies: enabled /strips cookies from the origin server/
    • X-Pull Key: KeyCDN /default/
    • Canonical Header: disabled /seems useless for API content/
    • Robots.txt: enabled 
    • Custom Robots.txt: /see Example 1 at https://www.keycdn.com/support/what-is-a-robots-txt-file to block crawlers from indexing the content in your zone/
    • Optimize for HLS, Generic Error Pages, Image Processing, Force Download, CORS: disabled /all of these/
    • Gzip: enabled
    • Expire: 0 /means header honoring - very important/
    • Block Bad Bots: enabled
    • Allow Empty Referrer: enabled
    • Block Referrer: disabled
    • Secure Token: disabled
    • SSL: shared
    • Force SSL: enabled

    Setting up Gatsby 5

    The setup of the GatsbyJS Plugin gatsby-source-drupal is very well described at https://www.gatsbyjs.com/plugins/gatsby-source-drupal/.

    Use the options parameter baseUrl to specify your Drupal origin base URL (e.g. 'https://www.example.com/') and the new proxyUrl parameter to specify your CDN base URL (e.g. 'https://example.kxcdn.com/').

    Parameter apiBase is usually 'jsonapi'.

    Provide the header name and the key configured in Key Auth above as a header under the options parameter headers.

    You can tweak parameter concurrentAPIRequests to a rate suitable for your Drupal server.
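
Put together, the plugin entry might look roughly like this. It is a sketch only: the URLs, the header name "api-key", the key itself and the concurrency value are placeholders to be replaced with your own, and in a real gatsby-config file the object would be exported rather than logged.

```typescript
// Sketch of the gatsby-source-drupal entry for the `plugins` array in gatsby-config.
const drupalSource = {
  resolve: "gatsby-source-drupal",
  options: {
    baseUrl: "https://www.example.com/",    // Drupal origin (placeholder)
    proxyUrl: "https://example.kxcdn.com/", // KeyCDN zone URL (placeholder)
    apiBase: "jsonapi",
    headers: {
      // Header name as set in Key Auth (assumed here to be "api-key").
      // In practice, load the key from an environment variable instead of hardcoding it.
      "api-key": "YOUR_KEY_AUTH_KEY",
    },
    // Assumed value; pick a rate your Drupal server can sustain.
    concurrentAPIRequests: 10,
  },
};

const config = { plugins: [drupalSource] };
console.log(JSON.stringify(config, null, 2));
```

Keeping baseUrl pointed at the origin and adding proxyUrl separately means the CDN can be switched off again by removing a single line.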

    Conclusion

    According to a test conducted on 2023-01-11, sourcing directly from Drupal lasted 1850 seconds, sourcing via (populated) CDN only 19.5 seconds. 
    We can conclude that thanks to the use of KeyCDN and the new proxyUrl parameter in gatsby-source-drupal, my GatsbyJS site is now able to fetch almost 100k rather complex Drupal 9 nodes via the JSON API about 95 times faster than directly from Drupal!
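
For the record, the factor follows directly from the two measurements quoted above; the variable names in this small check are just for illustration.

```typescript
const directSeconds = 1850; // sourcing directly from Drupal
const cdnSeconds = 19.5;    // sourcing via the populated CDN

const speedup = directSeconds / cdnSeconds;
const savedMinutes = (directSeconds - cdnSeconds) / 60;

console.log(`speedup: ${speedup.toFixed(1)}x`);
console.log(`time saved per full sourcing run: ${savedMinutes.toFixed(1)} min`);
```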


    GatsbyJS has recently introduced a new product called Valhalla that is able to continuously synchronize data from all the source APIs your site needs into a single API, called Content Hub. It seems to focus primarily on unifying diverse APIs (there's also a common GraphQL layer), but it can also effectively solve the problem of slow or poorly scaling origin APIs that we have been tackling in this article. For sites hosted on Gatsby Cloud, the Content Hub API is close to the deployment server, and therefore very fast.
