Creating an architecture presentation the other day helped to crystallise some thoughts in my head on integrating a CMS like Tridion with Content Delivery Networks (CDNs).
I have been involved with such matters before (see my SDL Tridion World Article on how to technically integrate a CDN through Storage Extensions) but I thought it was worth sharing my ideas on the considerations when working with a CMS and CDN.
SDL Tridion’s Enterprise Content Management features are a good match for companies with a truly global digital presence and audience. Such companies are also those most likely to benefit from the scaling features offered by a global Content Delivery Network, so Tridion + CDN is a hot topic.
I think the problem can most simply be boiled down to the following three questions:
- What assets do you want to cache using a CDN?
- How do the assets get into the CDN cache?
- How do we invalidate the cache when assets are (re/un)published?
Let's address them one by one.
What assets do you want to cache using a CDN?
Using a CDN brings several benefits. A CDN puts your content physically closer to the visitor, so pages load faster and visitors have a more positive web experience. The reduction of traffic to your own servers means you have a less extensive and expensive infrastructure to maintain in-house or with your hosting provider. Finally, CDNs are designed to cope with fluctuating demand; they are better equipped to deal with traffic flashpoints caused by an event or campaign.
Understanding this is key to establishing what to cache. There is little point caching your whole site: archived or infrequently visited pages can just as easily be served from your own servers, so focus on the parts of your site(s) that carry significant load. Maybe 10% of your pages drive 90% of the site traffic, or perhaps only campaign micro-sites create heavy load. This kind of traffic analysis is a good place to start.
Then you move on to content: video, images and other high-bandwidth assets which are for the most part static (they do not change often, and do not differ depending on who is accessing them).
Then you have the pages themselves. You can probably divide these into static (content that does not depend on the context in which it is viewed), dynamic (for example a list of items that can be filtered, a dynamically generated page, or a page dependent on personalization) and transactional (forms and other pages with an explicit interaction with the user).
How do the assets get into the CDN cache?
There are two approaches here. The simplest is the Pull approach, whereby you publish the asset as normal to your own servers, and the CDN requests it with a normal HTTP request to your website. The second is to Push content to the CDN's infrastructure: here you hook into the publishing process and additionally send the assets to the CDN (using whatever API or protocol the CDN vendor makes available).
Wherever possible, I would recommend the Pull approach. You do not need to understand the workings of the CDN, and you retain control over how your assets are rendered (suppose you are publishing pages containing application code: for these to work when served from the CDN you would need to deploy your web app to the CDN, which may be complicated or even impossible).
The main advantage of the Push approach is that cache invalidation can be simpler, but as we will see in the next section, this can easily be implemented in conjunction with the Pull approach as well.
How do we invalidate the cache when assets are (re/un)published?
There are four main options:
- Have the CDN cache expire based on HTTP Header information (check out the Expires, Cache-Control and Vary headers)
- Notify the CDN that something has changed at the point of (re/un-)publishing an asset
- Have the whole site or sections of it invalidated on a schedule (e.g. nightly)
- Manually flush the cache using the CDN's GUI
The first approach is the simplest and represents a no-code integration with the CDN if you are using the Pull approach. If you wish, you can have the expiration controlled by business logic in the web application and/or configured by content editors specifying metadata on the assets or their organizational items.
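To make the header-based option concrete, here is a minimal sketch of how a web application might choose Cache-Control values per asset category. The categories and lifetimes below are illustrative assumptions, not Tridion or CDN defaults; real values should come from your own traffic analysis.

```python
# Sketch: per-asset-category Cache-Control headers for a pull-based CDN.
# The lifetimes below are illustrative assumptions, not recommendations.

def cache_headers(asset_type):
    """Return HTTP response headers for a given asset category."""
    lifetimes = {
        "static": 86400,  # images, JS, CSS: cache for a day
        "page": 600,      # published pages: cache for 10 minutes
    }
    max_age = lifetimes.get(asset_type, 0)
    if max_age == 0:
        # personalised/transactional content: never cache on the CDN
        return {"Cache-Control": "no-store"}
    return {"Cache-Control": f"public, max-age={max_age}"}
```

In practice these values could equally be driven by editor-set metadata, as described above, rather than hard-coded.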
As mentioned previously, you get the second approach for free if you implement a Push method of integration: when you push the asset to the CDN, it knows that it needs to flush any old version from the cache (be sure to handle un-publishing as well). This approach can also be implemented for Pull integrations by writing a Storage extension. Here you can hook into the publish transaction and add a CDN notification for the assets that are published (or re/un-published). The technical details on how to do this are given in my SDL Tridion World article. It's worth noting that I would abstract any functionality to connect to a CDN (to push assets or make notifications) into a separate webservice, and then hook this into your storage extension. The benefits are that you can code this in whatever technology makes sense (Storage extensions must be Java) and you can also call it from other parts of your system architecture if required (for example, hooking up an external DAM, PIM or eCommerce system to trigger page invalidation when updates are made from these).
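As a rough illustration of such a separate webservice, the sketch below wraps a hypothetical CDN purge endpoint. The URL, payload shape and bearer-token authentication are all assumptions for illustration; every CDN vendor's real API differs, so consult their documentation.

```python
# Sketch: a thin notification client for a hypothetical CDN purge API.
# The endpoint, payload format and auth scheme are assumptions; check
# your CDN vendor's actual API documentation before using this shape.
import json
import urllib.request

def build_purge_payload(urls):
    """Build the JSON body asking the CDN to invalidate the given URLs."""
    return json.dumps({"purge": sorted(urls)}).encode("utf-8")

def notify_cdn(urls, api_endpoint, api_key):
    """POST an invalidation request for (re/un-)published asset URLs."""
    req = urllib.request.Request(
        api_endpoint,
        data=build_purge_payload(urls),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200
```

A Java storage extension would then only need to make a simple HTTP call to this service, keeping all CDN-specific logic in one replaceable place.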
With a dynamic website where content is shared and linked across many pages, it is worthwhile considering the third approach. High-volume and volatile sections of the site could be flushed on a short schedule (perhaps every 10 minutes to an hour) whereas the whole site might perhaps be flushed on a nightly basis.
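A simple way to express such a schedule is a table mapping site sections to flush intervals. The section names and timings below are illustrative assumptions following the pattern just described.

```python
# Sketch: scheduled CDN flush intervals per site section, in seconds.
# Section names and timings are illustrative assumptions.

FLUSH_INTERVALS = {
    "news": 10 * 60,       # volatile section: every 10 minutes
    "campaigns": 60 * 60,  # busy section: hourly
}
NIGHTLY = 24 * 60 * 60     # default: the whole site is flushed nightly

def flush_interval(section):
    """Seconds between scheduled CDN flushes for a site section."""
    return FLUSH_INTERVALS.get(section, NIGHTLY)
```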
The last approach is always useful as a back-up. There will always be times when parts of the site need to be flushed urgently, and this can be the quickest way to do it.
Some final thoughts
The above considerations have hopefully given you plenty of food for thought if you are looking at CDN integration. What initially seems like a simple concept can quickly become complex, and perhaps there are more questions than answers, as every site has different content and different requirements and restrictions on caching.
The best advice is to be pragmatic – like any form of caching there is always a compromise to be made and you will never find a perfect solution (if you do – let me know!).
Below I list some final thoughts on the subject which I didn’t manage to fit into the sections above.
- Avoid tying your publishing model and web application architecture to a particular CDN: there are many CDN providers out there, and you may well want to switch in the future (for performance, functionality or financial reasons)
- Make sure your site works normally without a CDN. This is related to the point above, but it also makes it much easier to manage your staging or preview websites, plus other environments (dev, test, etc.) which probably will not use a CDN.
- When notifying a CDN of updates while using a Pull integration, CDNs often offer the choice between an instant update of the cache (the CDN requests the asset immediately) and an on-demand update (the CDN waits until the next time the asset is requested by a visitor). If you are doing bulk publishing, the first option can result in a high volume of traffic to your own web servers (from the CDN, requesting all the updated pages). The second option distributes the load on your servers over time.
- For static assets like JS and CSS, consider putting the version number in the file name (e.g. styles-v23.css); as well as making CDN cache invalidation easier, this also helps you overcome issues with browser caching.
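As a small illustration of this versioning trick, here is a sketch of a helper that embeds a version number in an asset filename, following the styles-v23.css convention from the example above.

```python
# Sketch: cache-busting filenames by embedding a version number,
# matching the styles-v23.css naming convention mentioned above.

def versioned_name(filename, version):
    """'styles.css', 23 -> 'styles-v23.css'"""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # no extension: just append the version suffix
        return f"{filename}-v{version}"
    return f"{stem}-v{version}.{ext}"
```

Because each new version produces a brand-new URL, neither the CDN nor the browser can ever serve a stale copy of the asset.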