python: allow only one running instance of a script

UPDATE: Thanks so much for all the feedback! I’m going to look at using flock as well, and I’ll write that up soon.

Imagine you have a script that archives a bunch of data by copying it to another box. You use cron to schedule that script to run every hour, because normally the script finishes in about thirty minutes.

But every so often, maybe when your application gets really popular, the cron job takes more than an hour. Maybe it takes three hours this one time.

And during that time, cron starts up two more copies of your script. That can cause all sorts of havoc, where two or more scripts each try to modify the same file, for example.

In this scenario, you need a way to prevent those second and third (and maybe fourth and fifth, etc) scripts from starting as long as one is already going.

It would be very helpful if, when the script started, it first checked whether another instance was already running. If one is already running, then this new script should just immediately exit. But if no other instance is running, then this script should get to work.

Here’s a simple method for doing that:

1. When the script starts, the first thing it does is look for a file in /tmp named something like /tmp/myscript.pid.

2. If that file exists, then the script reads that file. The file holds a process ID (pid). The script now checks whether any process with that pid is running.

3. If there is not a process running with this pid, then probably what happened was the old script crashed without cleaning up this pid file. So, this script should get to work. But if there is a process running with that pid, then there is already a running instance of this script, and so this script should just immediately exit. There’s a tiny risk with this approach that I’ll discuss at the end of this post.

4. Depending on what happened in step 3, the script should exit at this point, or it should get to work. Before the script gets to the real work though, it should write its own process ID into /tmp/myscript.pid.

That’s the pseudocode. Now here are two Python functions to help make it happen:


import os

def pid_is_running(pid):
    """
    Return pid if pid is still going.

    >>> import os
    >>> mypid = os.getpid()
    >>> mypid == pid_is_running(mypid)
    True
    >>> pid_is_running(1000000) is None
    True
    """

    try:
        os.kill(pid, 0)

    except OSError:
        return

    else:
        return pid

def write_pidfile_or_die(path_to_pidfile):

    if os.path.exists(path_to_pidfile):
        pid = int(open(path_to_pidfile).read())

        if pid_is_running(pid):
            print("Sorry, found a pidfile!  Process {0} is still running.".format(pid))
            raise SystemExit

        else:
            os.remove(path_to_pidfile)

    open(path_to_pidfile, 'w').write(str(os.getpid()))
    return path_to_pidfile

And here’s a trivial script that does nothing but check for a pidfile and then sleep for a few seconds:


import time

if __name__ == '__main__':

    write_pidfile_or_die('/tmp/pidfun.pid')
    time.sleep(5) # placeholder for the real work
    print('process {0} finished work!'.format(os.getpid()))

Try running this in two different terminals, and you’ll see that the second process immediately exits as long as the first process is still running.

In the worst case, this isn’t perfect

Imagine that the first process started up and the operating system gave it process ID 99. Then imagine that the process crashed without cleaning up its pidfile. Now imagine that some completely different process started up, and the operating system happens to recycle that process ID 99 again and give that to the new process.

Now, when our cron job comes around and starts up a new instance of our script, the script will read the pid file and check for a running process with process ID 99. In this scenario, the script will be misled into thinking a copy of itself is still running, and it will shut down.

So, what to do?

Well, first of all, understand this is an extremely unlikely scenario. But if you want to prevent this from happening, I suggest you make two tweaks:

1. Do your absolute best to clean up that pidfile. For example, use Python’s sys.excepthook or the atexit module to make sure that the pid file is gone.

2. Write more than just the process ID into the pid file. For example, you can use ps and then write the process name to the pid file. Then change how you check if the process exists. In addition to checking for a running process with the same pid, check for the same pid and the same data returned from ps for that process.
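The two tweaks above might look roughly like the sketch below. It is only an illustration under some assumptions: the function names are hypothetical (they are not from the code earlier in this post), the pidfile format "pid name" is made up for the example, and the process name is looked up with `ps -p PID -o comm=`, which assumes a ps that accepts those options. Note that atexit handlers do not run if the process is killed with SIGKILL, so the stale-file check is still needed.

```python
import atexit
import os
import subprocess


def process_name(pid):
    """Return the command name ps reports for pid, or None if there is none."""
    try:
        result = subprocess.run(
            ["ps", "-p", str(pid), "-o", "comm="],
            capture_output=True, text=True, check=False,
        )
    except OSError:  # ps itself is not available
        return None
    return result.stdout.strip() or None


def remove_pidfile(path):
    """Delete the pidfile, ignoring the case where it is already gone."""
    try:
        os.remove(path)
    except OSError:
        pass


def write_pidfile(path, lookup=process_name):
    """Write 'pid name' to path and register cleanup for interpreter exit."""
    pid = os.getpid()
    with open(path, "w") as f:
        f.write("{0} {1}\n".format(pid, lookup(pid)))
    atexit.register(remove_pidfile, path)


def pidfile_is_stale(path, lookup=process_name):
    """True if the recorded pid is gone or now belongs to another command."""
    with open(path) as f:
        pid_text, _, recorded_name = f.read().strip().partition(" ")
    current_name = lookup(int(pid_text))
    return current_name is None or current_name != recorded_name
```

With this in place, a recycled process ID only blocks the script if the new process also happens to have the same command name.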

Check back soon and I’ll likely whip up a simple library that offers a context manager and handles even the extreme case described above.

  • xubuntix
  • http://profiles.google.com/ionel.mc Ionel Maries Cristian

    What’s wrong with flock ? I think all linuxes have it. ( http://linux.die.net/man/1/flock )

  • Walter Dörwald

    A simpler solution is to create a lock on the script file itself:

    with open(__file__, "rb") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError as exc:
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                # some other error
                raise
            # The previous invocation of the job is still running
            return # Return without doing anything
        try:
            # do stuff here
        finally:
            fcntl.flock(f, fcntl.LOCK_UN | fcntl.LOCK_NB)

    The module ll.sisyphus (which is part of XIST (http://pypi.python.org/pypi/ll-xist/4.0)) uses this approach.

  • Evgeny

    Another approach is to try to connect to a predefined /domain/ socket and not start if the connection succeeds. Otherwise, start and listen on that socket. This has the advantage that the socket could be used for life-cycle control RPC (shutdown, reload config, etc.)

  • http://schinckel.net Matthew Schinckel

    There's also a race condition, where you check for the pid file, and then when it isn't present, you attempt to create your own. In the meantime, it's possible another instance has created the file at that point in time. You need some atomic process that 'checks-for-pidfile-or-create-with-our-pid'.
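    One way to get an atomic check-or-create step is to let the operating system do it with os.open and the O_CREAT | O_EXCL flags. The sketch below is only an illustration; the function name is hypothetical and the 0o644 mode is an assumption about what permissions you want.

```python
import os


def create_pidfile_exclusively(path):
    """Create path and write our pid into it, but only if path does not exist.

    Returns True if this process created the file, False if it already existed.
    """
    try:
        # O_EXCL makes the call fail if the file exists, so the existence
        # check and the creation happen as a single step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

    A caller that gets False back would still need to decide whether the existing file is stale before removing it and trying again.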

  • http://twitter.com/glubothemad Petr Sykora

    I like to use flock based solution much more. It's in the end a little shorter to implement and it has no race conditions. Also there is a possibility to lock something on a different base than a pid (e.g. some queue in a directory accessed by different scripts).
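    A flock-based version can be sketched as below. This is a minimal illustration, not a polished library: the function name and the idea of a dedicated lock file are assumptions, and it relies on Python 3 raising BlockingIOError when a non-blocking flock cannot be taken. The kernel drops the lock when the holding process exits, so there is no stale file to clean up.

```python
import fcntl


def try_lock(path):
    """Return an open, locked file object, or None if the lock is already held.

    The caller must keep the returned file object alive for as long as the
    lock is needed; closing it releases the lock.
    """
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f
```

    A script would call try_lock once at startup and exit right away if it gets None back.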

  • Chris Arndt

    Another possibility would be that the script uses some kind of IPC (e.g. a unix domain socket or similar) to contact potential already running instances. If it receives a valid response from the running instance, it shuts down again. Makes the implementation a bit more complex, though, since you'd need some kind of concurrency to handle the IPC requests.

  • gdamjan

    one solution I'm using is a unix domain socket in the abstract namespace, i.e. bound to a name beginning with a null byte (\0).
    it'll be cleared automatically by the kernel on any process destruction.
    this is a linux only feature though.
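    A rough sketch of the abstract-namespace idea, for Linux only. The function name and the lock name passed to it are hypothetical; the only real requirement is that every instance of the script uses the same name.

```python
import socket


def acquire_abstract_lock(name):
    """Bind a unix socket in the abstract namespace (Linux only).

    Returns the bound socket, or None if another process already holds
    the name. Keep the socket open for as long as the script runs.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # A leading null byte puts the address in the abstract namespace,
        # so no file is created on disk.
        sock.bind("\0" + name)
    except OSError:
        sock.close()
        return None
    return sock
```

    Because nothing is written to the filesystem, there is no pidfile to clean up after a crash.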

  • AdamSkutt

    The check is racy, as Matt states. You need to do a low-level os.open and specify O_CREAT|O_EXCL when writing out the PID, and then handle the failure gracefully. This is safe for almost all cases, though it'd fail if the directory is NFSv2 mounted and/or pretty old. The effort to fix that case isn't really worth it. You also need to provide a mode of no more than 0775 (really 664 since it's not an executable) to prevent others from modifying the PID once it is written to the file.

    You're also vulnerable to an issue where an attacker writes out a PID file with an invalid PID, causing the script to run multiple times anyway. You can mitigate this by walking the entire path and performing a set of ownership, permission, and type checks to ensure this is impossible.

    However, you're still vulnerable to a denial of service attack for the same reason. There's no reliable & portable way to solve the DOS problem besides not putting the PID file in /tmp. Put the PID file in a directory where no one else can write to the file. These things are usually put in /var/run (or similar, though /run now) for a reason.

    Your doctest case for pid_is_running isn't reliable. 1 million is a perfectly valid PID.

  • AdamSkutt

    flock(2) has tons of portability and reliability problems, so I'm not a fan. While not perfect either, another technique is to create a Unix domain socket. These provide reliable process check semantics, and remove the whole “PID reuse” issue.

  • AdamSkutt

    Not all UNIX does though, and it frequently doesn't work right, and there's no way to tell if it doesn't work right. You just have to know, and you have no way (in code) to find out.

  • AdamSkutt

    You don't actually need to talk, merely being able to connect to a Unix domain socket proves the process on the other end is alive. You still need to add code to handle the connect/disconnect servicing, though. Writing a thread to do just that is pretty trivial though.

  • AdamSkutt

    This isn't portable, you can't assume EWOULDBLOCK == EAGAIN. You need to check for the former, as I'm not aware of any platforms where EAGAIN is correct (though I could be wrong). Likewise, EACCES shouldn't ever come from flock.

  • AdamSkutt

    *Sigh* Stupid disqus ate my comment. Let's hope it doesn't show up again later.

    Your code does contain a race condition, as Matt states. To close it, you must do a low-level os.open with O_CREAT|O_EXCL and mode no greater than 0755 (really 0644) when writing out the new PID. Be sure to handle any failures gracefully.

    Since you use a predictable file name in /tmp, your code can also be used to overwrite any file writable by the script user. However, opening the file as I suggest above will correct that problem.

    However, the predictable name would allow an attacker to still cause two instances of the script to run by writing out a PID file with a garbage PID. You can avoid this by walking the path and doing a series of ownership, permission, and type checks on every path element.

    Finally, an attacker can launch a denial of service attack since you placed the PID file in /tmp. Alas, the only way to solve this problem is to not place the PID file in /tmp but in a safe place, where attackers simply cannot create files. System daemons and init scripts place PID files in /var/run (now /run) or similar for a reason.

    Also, your doctest for pid_is_running isn't reliable. 1 million is a perfectly valid PID.

    And please stop leaking file handles.

  • http://blog.tplus1.com Matt Wilson

    Thanks for the feedback. What do you mean about leaking filehandles?

  • AdamSkutt

    You must ensure that the close() method is called on the handle returned by open(), via a with block, or try/finally. It is bad to rely on the runtime to do it for you because the runtime does not guarantee when it will call close().
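    For the pidfile reads and writes in the post, that could look like the sketch below. The helper names read_pid and write_pid are hypothetical and are not part of the original code.

```python
import os


def read_pid(path):
    """Read the pid stored in path; the with block closes the file."""
    with open(path) as f:
        return int(f.read().strip())


def write_pid(path):
    """Write our own pid to path; the with block closes the file."""
    with open(path, "w") as f:
        f.write(str(os.getpid()))
```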

  • http://blog.tplus1.com Matt Wilson

    Yep, makes perfect sense. Maybe I'll use a context manager to make sure that close() works. Thanks for the feedback!

  • Mike Fletcher

    http://fussy.readthedocs.org/e… provides a flock-based cron-lock.

  • http://blog.tplus1.com Matt Wilson

    Thanks! Will have to look at that.

  • Benji York
  • http://blog.tplus1.com Matt Wilson

    Thanks for the link!

  • http://www.electrogsm.pl/ Thomas Andeas

    Can it be used as a context manager to make sure that close() works?

  • http://blog.tplus1.com Matt Wilson

    That's a great idea. At some point I might put this code on github, and then it can be polished up.