Rostamizadeh.Blog

A place for me to write about interesting technology topics.

Wrangling Unicorn USR2 Signals and Capistrano Deployments

I recently had an issue with my Capistrano deployments on a Rackspace server (their smallest offering at 256MB RAM) where Unicorn would receive an upgrade signal (USR2), but a new master process wasn’t started. This turned out to be a timing problem: the example Unicorn init script assumes the new master comes up faster than my low-powered VPS could manage. Hopefully the information below will aid other people in their quests to get zero downtime (or close to it) deployments with Capistrano and Unicorn.

Technology

  • Unicorn 4.3.1
  • Rackspace Server (256MB RAM), Ubuntu 12.04
  • Capistrano 2.12.0

What It Should Be Doing…

First off…Unicorn is pretty cool! Combined with solid Capistrano deployment recipes, you can achieve fast deployments with minimal, if any, impact to users. Unicorn responds to many signals (the full list can be found here: Unicorn Signals Page), but the one I’m most interested in for incremental deployments is USR2. Per their signal page:

USR2 - reexecute the running binary. A separate QUIT should be sent to the original process once the child is verified to be up and running.

Here is my Unicorn upgrade process from a high level:

  1. Send USR2 to the master process
    • I do this by calling my Unicorn service with the upgrade argument. USR2 causes .oldbin to be appended to the pid file’s name, and a new master to start up with the latest deployed changes. As soon as the new master is started, it begins spawning workers.
  2. Use before_fork in Unicorn config to kill off old worker and master processes
    • Before each new worker is forked, the hook looks for a .oldbin pid file. If the file exists, a TTOU signal is sent to the old master to decrement its number of workers; when the last new worker is being spawned, a QUIT signal is sent instead to shut down the old master. I am using the before_fork code supplied in the example Unicorn config, like so:
Excerpt from Example Unicorn.conf.rb
before_fork do |server, worker|

  # The following is only recommended for memory/DB-constrained
  # installations.  It is not needed if your system can house
  # twice as many worker_processes as you have configured.
  #
  # # This allows a new master process to incrementally
  # # phase out the old master process with SIGTTOU to avoid a
  # # thundering herd (especially in the "preload_app false" case)
  # # when doing a transparent upgrade.  The last worker spawned
  # # will then kill off the old master process with a SIGQUIT.
  old_pid = "#{server.config[:pid]}.oldbin"
  if old_pid != server.pid
    begin
      sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
      Process.kill(sig, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
    end
  end
end
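
To make the hook's signal choice concrete, here is a small shell sketch that mirrors the ternary above. The function name signal_for_worker is made up for illustration and is not part of Unicorn; it only prints which signal the hook would send for a given (zero-based) worker number.

```shell
# signal_for_worker WORKER_NR WORKER_PROCESSES
# Hypothetical helper: prints the signal the before_fork hook would send
# to the old master while worker WORKER_NR is being forked.
signal_for_worker() (
  nr=$1
  total=$2
  if test $(( nr + 1 )) -ge "$total"
  then
    echo QUIT   # last new worker: shut the old master down
  else
    echo TTOU   # otherwise: ask the old master to drop one worker
  fi
)
```

With four workers configured, workers 0 through 2 each produce TTOU and worker 3 produces QUIT, so the old master shrinks one worker at a time before it is asked to quit.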

My Solution

I said above that my VPS lacks horsepower, but to be fair, my server isn’t slow. In fact, my Blitz.io results showed my server performs significantly better than the previous host, Heroku…however, in terms of tasks like precompiling assets (something I’ll blog about later) or responding to a USR2 signal, the server couldn’t perform as quickly as I’d like.

The template/example init.sh listed below sends a USR2 signal to the current Unicorn master process, sleeps for two seconds, pings the new master process to verify it exists, and finally sends a QUIT to the old master.

Unicorn Example init.sh from the Unofficial Unicorn Mirror on GitHub
upgrade)
  if sig USR2 && sleep 2 && sig 0 && oldsig QUIT
  then
    n=$TIMEOUT
    while test -s $old_pid && test $n -ge 0
    do
      printf '.' && sleep 1 && n=$(( $n - 1 ))
    done
    echo

    if test $n -lt 0 && test -s $old_pid
    then
      echo >&2 "$old_pid still exists after $TIMEOUT seconds"
      exit 1
    fi
    exit 0
  fi
  echo >&2 "Couldn't upgrade, starting '$CMD' instead"
  $CMD
  ;;

At first, my only change to this script was to remove the oldsig QUIT, since before_fork handles that for me. This actually worked for a while, and then I started noticing sporadic errors on deployment like:

** [out :: IP] Couldn't upgrade, starting 'cd /path/to/site/current; bundle exec unicorn -D -c /path/to/site/shared/config/unicorn.rb -E staging' instead
** [out :: IP] /etc/init.d/unicorn_staging: 94: cd: can't cd to /path/to/site/current;

and

** [out :: IP] cat: /path/to/site/shared/pids/unicorn.pid
** [out :: IP] : No such file or directory
** [out :: IP] /etc/init.d/unicorn_staging: 35: kill:
** [out :: IP] Couldn't upgrade, starting 'cd /path/to/site/current; bundle exec unicorn -D -c /path/to/site/shared/config/unicorn.rb -E staging' instead
** [out :: IP] /etc/init.d/unicorn_staging: 94: cd:
** [out :: IP] can't cd to /path/to/site/current;

My guesstimate is that 95% of the time deployments ran smoothly…but that 5% was starting to aggravate me and I wanted to figure out the issue and resolve it instead of leaving my web application in a wonky state after deployment. After thinking about it for a while, I realized that the errors most often occurred when my server was under load, and the most likely culprit was the sleep 2 being an inadequate wait period. The easiest solution would have been bumping up the sleep, but if I went down that path, what would be adequate? 4 seconds? 10 seconds? longer?
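
One way to avoid guessing is to poll for the condition and give up after an upper bound, instead of sleeping for a fixed time. A minimal sketch of that idea follows; wait_for_file is a hypothetical helper name, and the one-second poll interval is an assumption.

```shell
# wait_for_file FILE TIMEOUT
# Hypothetical helper: polls once per second until FILE exists and is
# non-empty, or until TIMEOUT seconds have passed.
# Succeeds only if FILE is non-empty when it returns.
wait_for_file() (
  file=$1
  n=$2
  while ! test -s "$file" && test "$n" -gt 0
  do
    sleep 1
    n=$(( n - 1 ))
  done
  test -s "$file"
)
```

On a fast server this returns as soon as the file shows up, and on a loaded one it simply waits longer, up to the limit.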

Like most people looking for a solution to a problem, I went to Google, and came up with this post: Address Already In Use. They were dealing with a different symptom but the same root cause. In that post, Aaron Suggs offers one possible rewrite of the init.sh upgrade task, which watches running processes using ps. I didn’t want to go down that path, so I opted to write my own Unicorn upgrade task. That said, Aaron did a nice job explaining both the problem and his solution, so his Gist is worth a read. Although the error in the post is different from the errors I was semi-regularly seeing, I had seen it before, and for me it was indirectly related to the sleep 2 problem, so I’ll mention it briefly. From my Unicorn log, here’s the error the post is talking about:

ERROR -- : adding listener failed addr=/tmp/unicorn.staging.sock (in use)
ERROR -- : retrying in 0.5 seconds (4 tries left)
ERROR -- : adding listener failed addr=/tmp/unicorn.staging.sock (in use)
ERROR -- : retrying in 0.5 seconds (3 tries left)
ERROR -- : adding listener failed addr=/tmp/unicorn.staging.sock (in use)
ERROR -- : retrying in 0.5 seconds (2 tries left)
ERROR -- : adding listener failed addr=/tmp/unicorn.staging.sock (in use)
ERROR -- : retrying in 0.5 seconds (1 tries left)
ERROR -- : adding listener failed addr=/tmp/unicorn.staging.sock (in use)
ERROR -- : retrying in 0.5 seconds (0 tries left)
ERROR -- : adding listener failed addr=/tmp/unicorn.staging.sock (in use)
/path/to/site/shared/bundle/ruby/1.9.1/gems/unicorn-4.3.1/lib/unicorn/socket_helper.rb:140:in `initialize': Address already in use - /tmp/unicorn.staging.sock (Errno::EADDRINUSE)

This is the result of Unicorn getting into an inconsistent state. In the example init.sh, when something goes wrong in the upgrade task, the normal start task is used…which under normal conditions works great. However, in the case where Unicorn is already running, the upgrade signal has failed, and the start is triggered, the start is unable to complete successfully because the Unix socket is already taken by the running Unicorn process. At least this is an easy fix…just stop the currently running Unicorn process (I like $ sudo kill -s QUIT [pid]) and fix your init script.
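
A guard against this is to check whether the pid file names a live process before falling back to start. Here is a rough sketch, assuming the script runs as a user who is allowed to signal the master; master_running is a hypothetical helper name, not something from the example init.sh.

```shell
# master_running PID_FILE
# Hypothetical helper: succeeds if PID_FILE is non-empty and the process
# it names accepts signal 0 (i.e. it exists and we may signal it).
master_running() (
  pid_file=$1
  test -s "$pid_file" || exit 1
  kill -0 "$(cat "$pid_file")" 2>/dev/null
)
```

If this succeeds, a master is still holding the socket, so running the start command would only reproduce the "Address already in use" error.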

My upgrade task rewrite probably isn’t perfect, but I haven’t had any deployment errors with it yet, and the output gives me immediate feedback about what is happening with the old and new Unicorn master processes. In other words, it’s awesome. Here’s a typical upgrade output:

** [out :: IP] Original PID:  3934
** [out :: IP] USR2 sent; Waiting for .oldbin
** [out :: IP] .
** [out :: IP] Waiting for new pid file
** [out :: IP] .
** [out :: IP] .
** [out :: IP] .
** [out :: IP] New PID:  6729
** [out :: IP] 
** [out :: IP] Unicorn successfully upgraded

My upgrade task follows these steps:

  1. If there is a pid file, set $ORIG_PID to the file’s contents
  2. Send USR2 to current master process, and if successful, then:
    1. Wait for a .oldbin pid file to be created and populated, within defined timeout period
    2. If still within the defined timeout period, wait for a new pid file to be populated and for the .oldbin pid file to be removed
    3. If the new pid file doesn’t exist, or is empty, exit 1, with master failed to start message
    4. If the .oldbin pid value is equal to the new pid file value, exit 1, with master failed to start message
    5. If .oldbin pid still exists after timeout, exit 1, with .oldbin pid still exists message
    6. Output success message, exit 0
  3. Else, in case USR2 failed, run Unicorn start task
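
The steps above can be sketched roughly as the shell function below. Treat it as an illustration of the flow rather than the exact task: the function name upgrade_unicorn, its two arguments, the exit status convention, and the one-second poll interval are all assumptions.

```shell
# upgrade_unicorn PID_FILE TIMEOUT
# Sketch only. Exit status: 0 = upgraded, 1 = upgrade failed,
# 2 = USR2 could not be sent (the caller may run its start task instead).
upgrade_unicorn() (
  pid_file=$1
  old_pid_file="$1.oldbin"
  timeout=$2
  n=$timeout

  # Step 1: remember the current master's PID.
  test -s "$pid_file" || exit 2
  orig_pid=$(cat "$pid_file")
  echo "Original PID:  $orig_pid"

  # Step 2: ask the master to re-execute itself.
  kill -USR2 "$orig_pid" 2>/dev/null || exit 2
  echo "USR2 sent; Waiting for .oldbin"

  # Step 2.1: wait for the .oldbin pid file to be created and populated.
  while ! test -s "$old_pid_file" && test "$n" -gt 0
  do
    echo '.'; sleep 1; n=$(( n - 1 ))
  done

  # Step 2.2: wait for the new pid file to appear and .oldbin to go away.
  echo "Waiting for new pid file"
  while { ! test -s "$pid_file" || test -s "$old_pid_file"; } && test "$n" -gt 0
  do
    echo '.'; sleep 1; n=$(( n - 1 ))
  done

  # Steps 2.3 and 2.4: the new pid file must exist, be non-empty,
  # and name a different process than the original master.
  new_pid=$(cat "$pid_file" 2>/dev/null)
  if test -z "$new_pid" || test "$new_pid" = "$orig_pid"
  then
    echo >&2 "New master failed to start"
    exit 1
  fi

  # Step 2.5: the old master should have removed its .oldbin pid file.
  if test -s "$old_pid_file"
  then
    echo >&2 "$old_pid_file still exists after $timeout seconds"
    exit 1
  fi

  # Step 2.6: success.
  echo "New PID:  $new_pid"
  echo "Unicorn successfully upgraded"
)
```

In an init script, the upgrade case could call this with the pid file path and timeout, and run the start command only when the status is 2, so that a failed upgrade of a running master never falls through to start.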

And here is the code:

TL;DR

  1. Use my upgrade task posted above
  2. Use the before_fork in the example Unicorn.conf
  3. Profit
