Creating a cron to refresh feeds

They have: 3 posts

Joined: Oct 2007

Hello,

I've built a news site using SimplePie to pull in a set of feeds and display them on a page. The caching is working, but the problem is that the initial load is slow. After that, you can hit refresh and it loads very quickly. I'd like to eliminate that first slow load by creating a cron job, which is what I've heard many other people do. I'm very new to crons and programming in general, so hopefully someone can help me understand this.

I found one example of a guy using the following cron to automatically visit his page; he synchronized the cache duration with it, thus eliminating any slow starts:

0,15,30,45 * * * * wget -q --spider http://thewebsite.com

The problem I found was that my host doesn't support wget (or lynx, or ssh) so I've been trying to write a cron that will achieve the same results but using either curl or php. Here is what I have so far:

*/15 * * * * curl --silent --compressed http://www.mysite.com/index.php

*/15 * * * * php /home/username/public_html/index.php

Again, I found these examples from reading other resources so I'm not entirely sure what they're doing or if they're correct for my situation. I've also set the cache duration to match at 15 minutes. When I run either the curl or php cron above, in my e-mail I receive the entire html source for the specified index.php. I'm assuming this means they're working correctly, but still my news page has that initial slow load. I'm open to any thoughts or ideas. Thanks in advance.

JeevesBond

He has: 3,956 posts

Joined: Jun 2002

When you've got it working you might want to change your curl line to:
curl http://www.example.com > /dev/null 2>&1
That'll stop you getting the entire contents of the page as an e-mail (although that might be desirable for now).

Your cron jobs look correct; I don't think there's anything wrong with them. The problem is timing: how do you ensure that it's always the cron job that triggers the RSS reload? Even though you've got both set to go off every 15 minutes, that doesn't mean the cache will expire and the cron job will run at the right times.

You need to make that script only refresh the RSS feeds when the cron job is run. Otherwise you'll always have what's called a 'race condition': the two events will only rarely sync with each other.

I went and had a look at the Web site for the software you're using. Are you chevy409 in this thread? I was reading some other threads in there, and it looks like there's no way of running the script so that the cache is only refreshed on a cron run. I find it a little weird that the writer of the software has never set up a cron job.

Maybe on the main page you could set the cache duration to infinity (or as near to infinity as the software will let you, 99999 for example), then create a new page just for cron where the cache duration is set to 1. So on the page that people view the cache is never refreshed, but the page the cron job views always refreshes the cache. This does depend upon how the script caches feeds though.
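As a rough sketch of the cron side of that idea: the function below just prints a crontab entry so you can check it before installing it. Note that refresh.php is a made-up name for the cron-only page, cron_entry is a made-up helper, and I'm assuming curl is what your host allows.

```shell
#!/bin/sh
# Hypothetical helper: prints a crontab entry that requests the
# cron-only page every N minutes and throws the output away.
# "refresh.php" is an assumed name, not something SimplePie provides.

# cron_entry MINUTES URL
cron_entry() {
    printf '*/%s * * * * curl --silent %s > /dev/null 2>&1\n' "$1" "$2"
}

# Print an entry that hits the cron-only page every 15 minutes.
cron_entry 15 "http://www.mysite.com/refresh.php"
```

The public index.php would keep its very long cache duration, so only requests made by this entry should ever rebuild the cache. Whether that actually works still depends on how the script caches feeds.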

Does that make sense? Feel free to ask on that forum and refer to this page; if I'm correct I'll even write that cron tutorial the author is so desperate for.

a Padded Cell our articles site!

They have: 3 posts

Joined: Oct 2007

Hi, and thanks for the detailed response. I've been testing more today and you are indeed correct, it is the timing of the caching that is the problem. I'm using the curl script I posted above but I'll try the new cron you've provided as well. What I ended up doing was setting the cron to a short period of time, say 5 minutes, and matched the caching duration in SimplePie. I then watched the cache folder by refreshing my ftp (after deleting all previously stored cache files) to see if new cache files would appear at the 5 minute mark. I was happy to see that the cron did access the page and triggered the caching, however, I noticed that it would take at least a couple minutes for all cache files to be stored. I think that processing delay is what was not allowing the cron and cache to sync up.

Yes, I did post on the SimplePie forums, and many other places, since this seems to be a common problem that really hasn't been documented with good step-by-step instructions yet. Most people simply say "set up a cron and you're fine" but obviously it's more involved than that.

Your explanation makes perfect sense (thank you!); however, I'm not sure I follow your last paragraph, only because I'm very new to crons and back-end programming. Please let me know if I'm understanding this correctly. So the main page would be my index.php that viewers see, and the caching duration for this page would be very long so that it never refreshes. Another page, which I'm assuming would be an exact copy of the index so that it has the same feeds, which is not viewable to the public, would have its content refreshed via the cron and the resulting cache files would be stored in the same cache folder that the main index.php would use? I guess I'm not sure what the difference would be between this and having the cron access the main index.php.

All this talk about caching duration also has me wondering, what is that? It's not like the cache files are automatically deleted after a set amount of time so what does the duration mean?

Again, thank you very much for the response.

JeevesBond

He has: 3,956 posts

Joined: Jun 2002

eightgames wrote: Another page, which I'm assuming would be an exact copy of the index so that it has the same feeds, which is not viewable to the public, would have its content refreshed via the cron and the resulting cache files would be stored in the same cache folder that the main index.php would use?

Yes. The difference between this and the main, publicly viewable file is that the file cron loads sets the cache duration to something very short. Hopefully this will cause the cache to be refreshed; unfortunately I don't know whether this would work, since I've never used that software.

eightgames wrote: I guess I'm not sure what the difference would be between this and having the cron access the main index.php.

It will cut out the race condition. If you've got the cache set to expire every 15 minutes and the cron job set to run every 15 minutes, how can you be certain the cron job is going to open the file before a user does? This seems to be what's causing your problem. You could set the cron job to run at particular times, but unless you can also set the cache to expire at particular times you'll have the same problem. For example: if the cron is set to run at 15 minutes past every hour and the cache is set to expire every 15 minutes, how do you know the cache isn't going to expire at one minute to the hour, then at 14 minutes past the hour? That's a whole minute where, if a user visits that page, they'll have to wait for the cache to be refreshed.
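To put numbers on that, here is a minimal sketch of the arithmetic. It assumes cron fires on minutes that are exact multiples of the interval (as */15 does) and that minutes are counted from the top of some hour; stale_window is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# stale_window LAST_REBUILD_MINUTE CACHE_MINUTES CRON_INTERVAL_MINUTES
# Prints how many minutes the cache sits expired before the next cron run.
stale_window() {
    expires_at=$(( $1 + $2 ))
    remainder=$(( expires_at % $3 ))
    if [ "$remainder" -eq 0 ]; then
        echo 0                       # cache expires exactly on a cron run
    else
        echo $(( $3 - remainder ))   # visitors in this window wait for a rebuild
    fi
}

stale_window 59 15 15   # rebuilt at one minute to the hour: prints 1
stale_window 7 15 15    # rebuilt by a visitor at minute 7: prints 8
stale_window 0 15 15    # rebuilt by cron on the hour: prints 0
```

Once a visitor rebuilds the cache off schedule, every later expiry is off schedule too, which is why matching the two intervals isn't enough on its own.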

Although this probably won't be much of an issue unless you get lots of visitors, it's still rather sloppy and likely to cause problems.

eightgames wrote: All this talk about caching duration also has me wondering, what is that? It's not like the cache files are automatically deleted after a set amount of time so what does the duration mean?

Well, you get what a cache is, right? In the case of this software it's a copy of the RSS feeds, so it doesn't have to go and get the feeds every time the page is loaded. Every time the software loads, it looks at how long ago the cache was created; if that is longer than the 'cache duration', then the software retrieves all the feeds again, overwriting the cache. So you're correct: it's not deleted as soon as it goes out of date, but overwritten when the script is next run.
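In rough terms the check looks something like the sketch below. This is not SimplePie's actual code, just an illustration of the idea; cache_state is a made-up name and the timestamps are plain seconds.

```shell
#!/bin/sh
# Illustration of a cache duration check, not taken from SimplePie.
# cache_state CREATED NOW DURATION
# CREATED and NOW are timestamps in seconds; DURATION is in seconds.
# Prints "stale" when the cache is older than DURATION, otherwise "fresh".
cache_state() {
    age=$(( $2 - $1 ))
    if [ "$age" -gt "$3" ]; then
        echo "stale"    # the script would fetch the feeds and overwrite the cache
    else
        echo "fresh"    # the script would serve the cached copy
    fi
}

# Cache written at t=0 with a 900 second (15 minute) duration.
cache_state 0 600 900    # 10 minutes old: prints fresh
cache_state 0 1000 900   # nearly 17 minutes old: prints stale
```

Nothing happens at the moment the duration runs out; the check only runs when somebody (a visitor or cron) loads the page.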


They have: 3 posts

Joined: Oct 2007

Awesome, thanks for all the help. I've played with it some more and it was the timing of the cache duration and the cron that weren't syncing up. Right now I have it set so the cache expires 30 seconds before the cron job runs, since it seems it was the cron that was the slow one. I'm wondering if shortening the time span even more would be better, say having the cache expire a second or two before the cron runs. Either way, I think this will work for now, so thanks again for your time.
