Server Side Caching

Intro

As I was writing the code for this website, I decided I wanted to try implementing a caching system. Pages for the site are generated using templates, with the majority of the content coming from a database. The benefit of this method is that I only have to write a few scripts, and new content can then be added via the database instead of by making a new page. The disadvantage is that every time someone loads a page, the script asks my database which templates to load as well as for all of the data on that page. The database query and response take under 5 ms, and PHP generates the final HTML that will be sent to the client within another 5 ms. For a site of modest size and following, these times are by no means unreasonable; the bulk of any delay in loading a page comes from transmitting the data from my host to the client. However, as the site grows, more people will be viewing my various articles concurrently, and processor time will become more valuable. Database queries will have to respond as fast as possible to avoid delays for the viewer.

Since all of the data for the individual pages remains mostly constant over time (in fact, it never changes unless I personally update it), wouldn't it be useful to store that data in a file so that when a user requests the page, my server only has to print out the contents of that file? This is the job of caching. It could potentially let me bypass the database query step altogether, which would cut up to half of the processing time of generating any given page. My image watermarking script would only have to generate new versions of images when I update the actual image, instead of every time it is shown to a user.

Going into this article, I only had ideas in the back of my mind about how I would implement such a system. This article shows how I turned those ideas into a working method.

Design

One main feature I want from this system is easy integration. I don't want to have to place code in more than a couple of places in any script. There should be a function to check whether to use the cached data for any given page, and there should be functions to save not only the data but also paths to files on which the page depends. Whenever a page is requested, the script will check the last-modified timestamps of all files and database entries against the date-created timestamp of the cached data. If the timestamp on the cached data is newer than the timestamps of all the related database rows AND all of the dependent files, then the page is loaded from cache. This entails loading the cached data, inserting any non-cacheable data, and outputting the generated page to the client. Non-cacheable data is any on-the-fly dynamic data such as ads or user-specific information. If the timestamp on the cached data is not newer than all those other timestamps, then a brand new version of the page must be fully generated, and the cacheable portion of the page must be saved to a file along with the time it was last updated.
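To make the check concrete, here is a minimal sketch of what such a freshness test could look like. The function and variable names are placeholders of my own choosing, not the final code:

<?php
// A minimal sketch of the freshness check described above.
// $cacheFile      - path to the cached copy of this page
// $dependentFiles - files the page depends on (templates, images, ...)
// $latestDbChange - UNIX timestamp of the newest relevant database row
function cache_is_fresh($cacheFile, array $dependentFiles, $latestDbChange)
{
    if (!file_exists($cacheFile)) {
        return false;                  // nothing cached yet
    }

    $cachedAt = filemtime($cacheFile); // when the cached data was written

    if ($latestDbChange >= $cachedAt) {
        return false;                  // database content is newer than the cache
    }

    foreach ($dependentFiles as $file) {
        if (!file_exists($file) || filemtime($file) >= $cachedAt) {
            return false;              // a dependent file is newer than the cache
        }
    }

    return true;                       // the cache is newer than everything it depends on
}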

Coding

There are two parts to implement. The first part is to determine whether to use a cached version of a page or to generate a brand new version. The easiest way I found to pair a cache file with its intended result page was to calculate the CRC code of the page's URL. I can also append other variables (in a CSV manner) to the URL if I want to store separate cache files for different states of the page. To determine whether the data from a cache file can be used, I can check the date-modified value for each file listed in the cache object, but in order to notice content changes, the database needs an important enhancement: every row with unique, cacheable information in the content tables needs a 'timestamp' field that gets updated every time the row itself is updated. Then I only need a long (yet simple) SQL query to get the latest timestamp out of all of them.
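A rough sketch of that keying scheme and query is below. The cache directory, table names, and column names are illustrative assumptions, not my actual schema:

<?php
// Sketch of keying a cache file to a page with CRC32 (names are illustrative).
// The URL plus any extra state variables are joined CSV-style, then hashed.
function cache_filename($url, array $extraVars = array())
{
    $key = $url;
    if ($extraVars) {
        $key .= ',' . implode(',', $extraVars);   // CSV-style state variables
    }
    return sprintf('cache/%08x.cache', crc32($key));
}

// One way (in MySQL) to pull the newest 'timestamp' across several content
// tables; the table names here are assumptions, not my real schema.
$sql = "SELECT GREATEST(
            (SELECT MAX(`timestamp`) FROM articles),
            (SELECT MAX(`timestamp`) FROM images),
            (SELECT MAX(`timestamp`) FROM templates)
        ) AS latest_change";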

The second part is to either load the cached version or save a new one. If the function mentioned in the first part returns true, we load the cached version; that is, I copy the data directly from the cache file and use it as though I had just built it from fresh results from the database. If instead the function returns false, the cache file is stale and must be replaced. At a certain point while generating the page from scratch, I will save the new version to a file along with the update time and any associated files.
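In rough form, the load and save halves might look something like this (again, a hedged sketch with placeholder names; the real code also has to splice in the non-cacheable portions):

<?php
// Rough sketch of the load/save pair (serialize-based; names are placeholders).
function load_cache($cacheFile)
{
    // Returns whatever save_cache() stored: the cacheable HTML plus its file list.
    return unserialize(file_get_contents($cacheFile));
}

function save_cache($cacheFile, $cacheableHtml, array $dependentFiles)
{
    $entry = array(
        'html'    => $cacheableHtml,   // the cacheable portion of the page
        'files'   => $dependentFiles,  // files this page depends on
        'updated' => time(),           // when this cache entry was written
    );
    file_put_contents($cacheFile, serialize($entry));
}

In use, the page script runs the check from the Design section, echoes the stored HTML from load_cache() when it passes (after inserting the non-cacheable data), and falls back to full generation followed by save_cache() when it fails.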

I will post the complete source code in a few days.

Conclusion

How much time is really saved by using this method? In the worst case, the pages are updated constantly and the cached data is never valid for use; in that case I have, in effect, added processing time to every page request. However, since any given page on my website is updated only sporadically, the cached data is used very often, and all large data requests to the database are completely bypassed.

On my main page, which loads abstracts for five articles at a time, the increase in efficiency is barely noticeable to the visitor: using the cached data rather than re-generating the page saves under 5 ms. JPEGs, on the other hand, which I run through a watermarking script, have their entire processing time bypassed by the cache. So as I start to write longer articles with lots of text and pictures, the processing time for each page will drop drastically for my server. Also, as more people start to view the site, any time saved on the server is time that can be spent serving another visitor.