lundi 27 avril 2015

Unicorn workers timeout after "zero downtime" deploy with capistrano

I'm running a Rails 3.2.21 app and deploy to a Ubuntu 12.04.5 box using capistrano (nginx and unicorn).

I have my app set for a zero-downtime deploy (at least I thought), with my config files looking more or less like these.

Here's the problem: When the deploy is nearly done and it restarts unicorn, when I watch my unicorn.log I see it fire up the new workers, reap the old ones... but then my app just hangs for 2-3 minutes. Any request to the app at this point hits the timeout window (which I set to 40 seconds) and returns my app's 500 error page.

Here is the first part of the output from unicorn.log as unicorn is restarting (I have 5 unicorn workers):

I, [2015-04-21T23:06:57.022492 #14347]  INFO -- : master process ready
I, [2015-04-21T23:06:57.844273 #15378]  INFO -- : worker=0 ready
I, [2015-04-21T23:06:57.944080 #15381]  INFO -- : worker=1 ready
I, [2015-04-21T23:06:58.089655 #15390]  INFO -- : worker=2 ready
I, [2015-04-21T23:06:58.230554 #14541]  INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
I, [2015-04-21T23:06:58.231455 #14541]  INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
I, [2015-04-21T23:06:58.249110 #15393]  INFO -- : worker=3 ready
I, [2015-04-21T23:06:58.650007 #15396]  INFO -- : worker=4 ready
I, [2015-04-21T23:07:01.246981 #14541]  INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
I, [2015-04-21T23:07:01.561786 #14541]  INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
I, [2015-04-21T23:07:06.657913 #14541]  INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
I, [2015-04-21T23:07:06.658325 #14541]  INFO -- : master complete

Afterwards, as the app hangs for those 2-3 minutes, here is what's happening:

E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:07:40.727755 #16002]  INFO -- : worker=0 ready
I, [2015-04-21T23:07:43.212395 #16022]  INFO -- : worker=3 ready
E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:08:28.010429 #16234]  INFO -- : worker=3 ready

Eventually, after 2 or 3 minutes, the app starts being responsive again, but everything is more sluggish. You can see this very clearly in New Relic (the horizontal line marks the deploy, and the light blue area indicates Ruby):

New Relic graph during and after deploy

I have an identical staging server, and I cannot replicate the issue in staging... granted, staging is under no load (I'm the only person trying to make page requests).

Here is my config/unicorn.rb file:

root = "/home/deployer/apps/myawesomeapp/current"
working_directory root

pid "#{root}/tmp/pids/unicorn.pid"
stderr_path "#{root}/log/unicorn.log"
stdout_path "#{root}/log/unicorn.log"

shared_path = "/home/deployer/apps/myawesomeapp/shared"

listen "/tmp/unicorn.myawesomeapp.sock"
worker_processes 5
timeout 40

preload_app true

before_exec do |server|
  ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile"
end

before_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end

  old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
  if File.exists?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH

    end
  end
end

after_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
  end
end

And just to paint a complete picture, in my capistrano deploy.rb, the unicorn restart task looks like this:

namespace :deploy do
  task :restart, roles: :app, except: { no_release: true } do
    run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
  end
end

Any ideas why the unicorn workers timeout right after the deploy? I thought the point of a zero-downtime was to keep the old ones around until the new ones are spun up and ready to serve?

Thanks!

UPDATE

I did another deploy, and this time kept an eye on production.log to see what was going on there. The only suspicious thing was the following lines, which were mixed in with normal requests:

Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de

Aucun commentaire:

Enregistrer un commentaire