The robotic precision of cron ensures that each scheduled job runs, on time, every time.
But cron doesn't check that the previous execution of that same job completed first -- and that can cause big trouble.
This often happens to me when I'm traveling and my backup cronjob fires while I'm on a slow uplink. It's bad news when an hourly rsync takes longer than an hour to run, and my system heads down a nasty spiral, soon with 2 or 3 or 10 rsyncs all running simultaneously. Dang.
For this reason, I found myself putting almost all of my cronjobs into a wrapper script that managed and respected a pid file lock, in the style of a typical UNIX sysvinit daemon. Unfortunately, this led to extensively duplicated lock-handling code spread across my workstations and servers.
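If you have never written one of these, the pattern looks something like the following (a minimal, illustrative sketch; the pid file path and the rsync command here are just examples, not my actual wrapper script):

#!/bin/sh
# Illustrative pidfile-locking wrapper (example only)
PIDFILE="$HOME/.cache/backup-wrapper.pid"
# If the pid file exists and that process is still alive, bail out
if [ -e "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "previous run still in progress; exiting" >&2
    exit 1
fi
echo $$ > "$PIDFILE"
trap 'rm -f "$PIDFILE"' EXIT
rsync -azP "$HOME" example.com:/srv/backup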
I'm proud to say, however, that I have now solved this problem on all of my servers, at least for myself, and perhaps for you too!
In Ubuntu 11.04 (Natty), you can now find a pair of utilities in the run-one package: run-one and run-this-one.
run-one
You can simply prepend the run-one utility to the beginning of any command (just like time or sudo). The tool calculates an md5sum, $HASH, of the command and its arguments ($0 and $@), and then tries to obtain a lock on a file at $HOME/.cache/$HASH using flock. If it can obtain the lock, your command is executed and the lock is released when it finishes. If not, another copy of your command is already running, and run-one quietly exits non-zero.
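To make the technique concrete, here is a rough, hand-rolled approximation of the same idea using md5sum and flock(1) directly (a simplified sketch for illustration only, not the actual run-one source):

CMD="rsync -azP $HOME example.com:/srv/backup"
# Hash the full invocation so each distinct command gets its own lock file
HASH=$(echo -n "$CMD" | md5sum | cut -d' ' -f1)
mkdir -p "$HOME/.cache"
# flock -n fails immediately if another process already holds the lock
flock -n "$HOME/.cache/$HASH" $CMD || echo "already running, exiting" >&2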
I can now be safely assured that there will only ever be one copy of this cronjob running on my local system as $USER at a time:
*/60 * * * * run-one rsync -azP $HOME example.com:/srv/backup
If a copy is already running, subsequent calls of the same invocation will quietly exit non-zero.
run-this-one
run-this-one is a slightly more forceful take on the same idea. Using pgrep, it finds any matching invocations owned by the user in the process table and kills those first, then continues just as run-one does (establishing the lock and executing your command).
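Conceptually, the effect is something like the following hand-rolled equivalent (a simplified sketch; run-this-one is more careful about matching, and the ssh command here is just an example):

CMD="ssh -N -C -D 1080 example.com"
# Kill any of my own existing instances of this exact invocation
for pid in $(pgrep -u "$USER" -x -f "$CMD"); do
    kill "$pid"
done
# Then run a fresh copy under the usual run-one lock
run-one $CMD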
I rely on a handful of ssh tunnels and proxies, but I often suspend and resume my laptop many times a day, which can cause those ssh connections to go stale and hang around for a while before the connection times out. For these, I want to kill any old instances of the invocation, and then start a fresh one.
I now use this code snippet in a wrapper script to establish my ssh SOCKS proxy and a pair of local port-forwarding tunnels (for my squid and bip proxies):
run-this-one ssh -N -C -D 1080 -L 3128:localhost:3128 \
-L 7778:localhost:7778 example.com
Have you struggled with this before? Do you have a more elegant solution? Would you use run-one and/or run-this-one to solve a similar problem?
You can find the code in Launchpad/bzr, and packages for Lucid, Maverick, and Natty in a PPA:
bzr branch lp:run-one
sudo apt-add-repository ppa:run-one/ppa
sudo apt-get update
sudo apt-get install run-one

Cheers,
:-Dustin
I've had some similar problems with backups and rsync. run-one, in particular, looks like it should make my backups go much more smoothly. I look forward to taking a look at the code. Thanks!
Hi Dustin,
Just what I was looking for. I did notice it handles the locking on files in $HOME. Would it be difficult to make it user-independent, e.g. by optionally moving the lock files to /tmp or /var/lock or something?
Fred
I have been saying I should write something similar myself for quite some time now :) Will give it a try!
Very cool! My (sometimes broken) offlineimap cron job thanks you.
Fred,
This is to prevent one user from DoS'ing another. I.e., one user could prevent another from running their backup job.
I am thinking about adding some special handling for the root user in the run-this-one tool, though.
I've always used "lckdo" for this kind of behaviour before. I look forward to seeing how run-one compares.
ReplyDeleteThanks for the pointer, Adam, I was not aware of that tool.
From the manpage, I see:
"Now that util-linux contains a similar command named flock, lckdo is deprecated, and will be removed from some future version of moreutils."
Eeek, good to know it's on the way out I guess. Will have to add that to the mental "stuff that needs changing when we dist-upgrade" list!
ReplyDeleteHey Dustin,
I used to solve this with Tim Kay's solo: http://timkay.com/solo/
which locks by opening a local port instead of writing a lockfile. I'd never thought of that :)
Perhaps something like a RUN_ONE_LOCK_DIR environment variable could be used to tell run-one where the lockfiles are kept.