With a synchronous web of data being the norm in this day and age, it is no longer good enough to update a website on a per-request basis. Users expect the content and data on a website to be relevant and readily available. Whereas daily or less frequent updates would have been acceptable a few years ago, users now expect relevant content by the hour, if not by the minute. However, there is clearly a considerable overhead in processing data for display on the fly as your visitors request it.
For anyone running their own data-driven site, this would traditionally be achieved by running a cron job: a task scheduled to run on a server at regular intervals, irrespective of whether the site is being visited or not. However, running cron jobs brings two complications:
- the cost in learning how to configure and manage them (though having never done them it could be a piece of piss), and
- the cost associated with hosting a site that has cron capabilities (this functionality is chargeable as extra on my hosting account)
So what about a compromise: scheduling visits to your site to take place periodically, rather than scheduling the server to perform a task at regular intervals?
This was something I set about figuring out when I realised that a technology already existed which periodically checks a website: RSS. If we can set up an RSS feed to run a script whenever it is requested, could we harness the power of RSS to semi-automate updates on your website? Hell yeah!
Setting up a test script on a server, I created an empty RSS feed and then subscribed to the feed using Google Reader. Every time the feed was requested, the time was logged in a separate text file. Like clockwork, it transpires, Google Reader (other feed readers are available, boys and girls) was requesting the feed every three hours on the dot. It might freak out if more complex scripts are attempted, but in principle this might prove one way of scheduling relatively simple scripts to run with relative frequency if you don’t have access to cron jobs on your server.
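The original test script isn't shown, so here is a rough Python sketch of the same idea: a function that appends the request time to a text file and then returns an empty RSS document. The names `handle_feed_request` and `EMPTY_FEED`, the log file location, and the feed contents are all made up for illustration; how you wire it to a URL depends on your host.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# A syntactically valid RSS 2.0 feed with no items (contents are placeholders).
EMPTY_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Scheduler feed</title>
    <link>http://example.com/</link>
    <description>Empty feed used only to trigger a script</description>
  </channel>
</rss>
"""


def handle_feed_request(log_path: Path, now: Optional[datetime] = None) -> str:
    """Record when the feed was requested, then return the empty feed body."""
    now = now or datetime.now(timezone.utc)
    # Append one ISO 8601 timestamp per request so the polling interval
    # of the feed reader can be read off the log afterwards.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(now.isoformat() + "\n")
    # Any scheduled work would be kicked off here before responding.
    return EMPTY_FEED
```

Comparing consecutive lines in the log is how you would confirm how often a given reader actually polls.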
The cron syntax is a bit archaic, but it’s easy enough to look up when it’s needed. This seems like an interesting hack, but it’s really fragile – you’re relying on an external service to work properly.
If you need to avoid cron, why not just update data when it’s requested by a user, and then cache it for a certain length of time or number of requests?
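As a rough sketch of what I mean by caching for a length of time (Python, with a made-up `TimedCache` class; `fetch` stands in for whatever expensive update you would otherwise hand to cron):

```python
import time
from typing import Any, Callable, Optional


class TimedCache:
    """Re-run `fetch` only when the cached value is older than `ttl_seconds`."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # The visitor who triggers this pays the cost of the update.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

This keeps the value in memory, which only helps if your server process is long-lived; on typical shared hosting you would store the value and timestamp in a file or database instead.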
A (gs) account from Media Temple does this very well, and has (at no extra cost) access to easy to set up cron tasks.
In the <bad> old days, we used Opera, which had a ‘Refresh this page in x mins’ option for web pages. Great to leave running on a server and let it kick in.
One word of warning. Any server function that can be reached via an HTTP page is accessible to any source, and could be triggered more often than Google Reader’s schedule, leading to sites tipping over. That’s the benefit of cron: the tasks are private, running as a different user, on the server.
It probably goes without saying that you should exercise caution in what you do via this method .. i.e. non-destructive shizzle.
And you would probably want some kind of mechanism to ensure if the feed got spidered etc it wasn’t adversely getting hit constantly .. so some kind of check on last update before doing anything expensive.
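Something like this, roughly (a Python sketch; `run_if_due` and the timestamp file are invented names, and `task` is whatever expensive thing the feed request would trigger):

```python
import time
from pathlib import Path
from typing import Callable, Optional


def run_if_due(
    stamp_path: Path,
    min_interval: float,
    task: Callable[[], None],
    now: Optional[float] = None,
) -> bool:
    """Run `task` only if at least `min_interval` seconds have passed
    since the time recorded in `stamp_path`. Returns True if it ran."""
    now = time.time() if now is None else now
    try:
        last_run: Optional[float] = float(stamp_path.read_text())
    except (FileNotFoundError, ValueError):
        last_run = None  # no usable record, so treat the task as never run

    if last_run is not None and now - last_run < min_interval:
        return False  # hit too soon: skip the expensive work

    # Record the run before starting so later hits are turned away while
    # the task is still working. This read-then-write is not atomic, so two
    # simultaneous requests could still both get through.
    stamp_path.write_text(str(now))
    task()
    return True
```

With a three-hour `min_interval`, a spider hammering the feed would just get the cheap early return.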
Cool idea though. And yeah, cron jobs are pretty simple to set up, and for those who are inclined to like those ‘panel’ type server things it’s probably even more trivial.
To be honest, you can probably ignore my first 3 paragraphs .. I’m ignoring the hard work I’m not getting done :D.
Was a hair/hare-brained idea spawned in the pub, so appreciate the feedback
(in time-honoured fashion, I was more interested in whether it could be done than whether it should) ;)
Of course anything that affects the data should be through POST requests – thinking more here of data processing, especially for third-party data eg. API crawling/caching etc.
The idea first arose when thinking about a site using the Twitter API, which limits the frequency of requests, so tying updates to an automated script seemed like a logical approach, but I don’t have cron jobs with 1&1 (but otherwise happy with them for hosting).
Obviously, as Dan said, as an RSS feed it would be susceptible to any HTTP request, but there could be ways round this (off the top of my head: if only one reader subscribed to a particular feed and the provenance of its requests was known, you could lock down processing to that particular request source).
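One way this could look, sketched in Python. Note this swaps in a different technique from checking the request source: it puts a secret token in the feed URL you give the reader, on the assumption that IP addresses and user-agent strings can change or be spoofed. The `token` parameter name and `is_trusted_request` are made up for the example.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def is_trusted_request(url: str, expected_token: str) -> bool:
    """Return True only if the URL carries the expected secret token."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first, if any.
    supplied = query.get("token", [""])[0]
    # compare_digest takes the same time whether or not the strings match,
    # so the token can't be guessed one character at a time from timings.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

You would subscribe the reader to something like `/feed.xml?token=...` and only run the processing when the check passes, serving the plain empty feed otherwise. The token is only as private as the URL, so it would still want HTTPS and a feed reader that doesn't publish its subscriptions.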
But then also likely talking out my ass!
I think it’s a valid idea, because you are automating the process cheaply with an RSS reader which you probably have open anyway .. it’s easier than, say, having a local cron job on your computer that periodically accesses a server-side script etc ..
It’s probably better (depending on traffic trends) to delegate it to page request and cache the result (which allows you to get over API limits) .. so only the first request over the cache lifespan would be slightly longer, but you can always fork the request so the page load isn’t affected .. and if it’s super important to be fresh, then investing in a cron job is probably worth it..
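A minimal sketch of the "fork the request" part, in Python and using a background thread rather than a real process fork (an assumption on my part; the class name and `fetch` are hypothetical). The visitor always gets whatever is cached straight away, and a stale cache is refreshed behind the scenes:

```python
import threading
import time
from typing import Any, Callable, Optional


class StaleWhileRefreshing:
    """Serve the cached value immediately; refresh it in the background
    once it is older than `ttl_seconds`."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float,
        initial: Any = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = initial
        self._fetched_at: Optional[float] = None  # None: never refreshed
        self._lock = threading.Lock()
        self._worker: Optional[threading.Thread] = None

    def _refresh(self) -> None:
        value = self._fetch()  # slow call happens outside the lock
        with self._lock:
            self._value = value
            self._fetched_at = self._clock()

    def get(self) -> Any:
        with self._lock:
            now = self._clock()
            stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
            busy = self._worker is not None and self._worker.is_alive()
            if stale and not busy:
                # Start at most one refresh at a time; do not wait for it.
                self._worker = threading.Thread(target=self._refresh, daemon=True)
                self._worker.start()
            return self._value

    def wait(self) -> None:
        """Block until any in-flight refresh finishes (handy in tests)."""
        worker = self._worker
        if worker is not None:
            worker.join()
```

The trade-off is that the request which notices the stale cache still gets the old data; only later requests see the fresh value.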
I need a drink.