Friday, May 4, 2012

Zero-Downtime Deploys with JRuby


One of the most common questions I get from readers of my book is about zero-downtime deployment. That is, how do you deploy new versions of a JRuby web application without missing users' requests?

To answer this question, let's first look at how MRI-based deployments achieve zero downtime.  When a process running an MRI web server needs to load a new version, we shut it down, push the new code, and start it up again.


This leaves a gap where no requests can be handled.  But most MRI deployments use a pool of application processes, which provides a nice way around this problem.  While one process is reloading, we can rely on the other processes to service requests.  The result is a "rolling restart" in which the re-deployment of each process is staggered.
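As a toy model of that rolling restart (the class and method names here are illustrative, not any particular server's API), the staggering looks like this:

```ruby
# Toy model of a rolling restart: restart pool members one at a
# time so the pool as a whole never stops serving requests.
class AppProcess
  attr_reader :version, :running

  def initialize(version)
    @version = version
    @running = true
  end

  def restart_with(new_version)
    @running = false        # this process drops out of rotation...
    @version = new_version  # ...loads the new code...
    @running = true         # ...and rejoins the pool
  end
end

def rolling_restart(pool, new_version)
  pool.each do |process|
    process.restart_with(new_version)
    # At this point every *other* process is still running, so
    # requests keep getting served throughout the deploy.
  end
end
```

The coordination cost is hidden in that loop: something has to take each process out of the load balancer, wait for it to come back, and move on.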


In practice this is a difficult dance to coordinate.  Technologies like Passenger make it a lot easier, but under the covers it's still complicated.

JRuby deployments are different, though.  Instead of having a pool of processes, we deploy our applications to a single JRuby server process, which (ideally) never gets shut down.  The result is that our deployment has just two steps: undeploy and deploy.


However, this still leaves a gap where requests can be dropped, and we don't have other server processes that can take over while we're updating.  To fix this, we simply need to reverse the order of the steps!

A zero-downtime JRuby deployment requires that we fully deploy the new version of the application before we undeploy the old one.  Thus, we will briefly have two versions of the app running at the same time, but only one will handle requests.

The good news is that Trinidad essentially does this for us.  All we have to do is redeploy our application. It works because deep within the bowels of Trinidad is a method that looks like this:
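Here is a simplified, runnable model of that deploy-then-swap behavior (the Context and Host classes are stand-ins, not Trinidad's actual source):

```ruby
# Stand-in for a web application context: something that can be
# started and stopped and serves one version of the app.
class Context
  attr_reader :version, :state

  def initialize(version)
    @version = version
    @state   = :stopped
  end

  def start
    @state = :running
    self
  end

  def stop
    @state = :stopped
    self
  end
end

class Host
  attr_reader :active

  def initialize(context)
    @active = context.start
  end

  # Deploy the new version alongside the old one, then swap the two
  # contexts in a single step so no request window is dropped.
  def takeover(new_context)
    new_context.start                     # new version boots while the old one serves traffic
    old, @active = @active, new_context   # atomic swap of the active context
    old.stop                              # only now does the old version go away
  end
end
```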



In the takeover method, Trinidad creates a new context for the next version of the application while the old version continues to run, then swaps those contexts in one step.  The result is effectively zero-downtime deployment.

Unfortunately, not all JRuby web servers do this for us, so we may have to script the process ourselves.  Take TorqueBox, for example.  When we deploy a new version of a TorqueBox application to a running TorqueBox server, it completely undeploys the old app before loading the new version.

Getting around this is pretty easy when TorqueBox is running in a cluster (i.e. multiple TorqueBox instances across multiple physical or virtual servers).  We simply need to deploy a new version of the application to one node at a time.  When the old version is undeployed, the Apache mod_cluster proxy will stop sending it requests.

If you're really paranoid, you can manually disable a node prior to deploying the new version of your application by invoking the disable() operation on the server's jboss.as.modcluster MBean.  The screenshot below shows me doing this from the JMX console.



In my book, I show how to invoke an MBean operation programmatically from a Rake task. That way, you can easily work this step into your deployment scripts.
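A rough sketch of what such a task might wrap (the MBean name and call shape here are assumptions -- check them against your server; in a real JRuby Rake task, `connection` would be a javax.management MBeanServerConnection obtained through JRuby's Java integration, while here it is anything that responds to #invoke):

```ruby
# Hedged sketch: draining a node through its mod_cluster MBean
# before a redeploy. The ObjectName below is an assumption.
MODCLUSTER_MBEAN = "jboss.as:subsystem=modcluster"

def drain_node(connection, mbean = MODCLUSTER_MBEAN)
  # "disable" tells mod_cluster to stop routing new requests to this
  # node; invoke "enable" on the same MBean after the redeploy.
  connection.invoke(mbean, "disable", [], [])
end
```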

If you're not running TorqueBox in a cluster, the process is a little more complicated.  Rather than just dropping your Knob file into the deployment directory or relying on Capistrano to create a deployment descriptor, you'll need to create a custom deployment descriptor for each new version of your application.  An example might look like this:
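A minimal descriptor along these lines (the paths and context names are illustrative):

```yaml
# myapp-v2-knob.yml -- a versioned TorqueBox deployment descriptor
application:
  root: /opt/apps/myapp-v2   # the new version's code
web:
  context: /myapp-v2         # a context the old version isn't using
```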



When the YAML file is dropped into the $JBOSS_HOME/standalone/deployments directory, it will deploy the new version of the application under the myapp-v2/ context without undeploying the old version of the application (assuming it is not also using the myapp-v1/ context).  Then you need to configure your proxy to point to myapp-v2/ instead of myapp-v1/.  The resulting process looks like this:
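In outline (simulated here against a scratch directory so the sequence can be followed end to end; the real files live under $JBOSS_HOME/standalone/deployments, and the file names, paths, and proxy step are illustrative):

```ruby
require 'fileutils'
require 'tmpdir'

# Scratch stand-in for $JBOSS_HOME/standalone/deployments.
jboss_home  = Dir.mktmpdir
deployments = File.join(jboss_home, "standalone", "deployments")
FileUtils.mkdir_p(deployments)
FileUtils.touch(File.join(deployments, "myapp-v1-knob.yml"))   # v1 is live

# 1. Drop in the v2 descriptor; v1 keeps serving requests.
File.write(File.join(deployments, "myapp-v2-knob.yml"),
           "application:\n  root: /opt/apps/myapp-v2\nweb:\n  context: /myapp-v2\n")

# 2. Wait for the deployment scanner's .deployed marker before
#    switching (created by hand in this simulation).
FileUtils.touch(File.join(deployments, "myapp-v2-knob.yml.deployed"))

# 3. Repoint the proxy from /myapp-v1 to /myapp-v2 and reload it.

# 4. Delete the old descriptor; the scanner undeploys v1.
FileUtils.rm(File.join(deployments, "myapp-v1-knob.yml"))
```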



In my experience, if you really care about zero-downtime deployment, then you are probably running a redundant cluster anyway, so the need to orchestrate the context switching on a single node is unusual.

In any case, it's certainly possible to achieve zero-downtime deployment with JRuby.  And in most cases, it's a lot easier than with MRI.

4 comments:

  1. How do you handle deployments with migrations? As the DB may be in an inconsistent state (new tables/columns being added, data moved or normalized, data types on columns changing), how do you prevent issues with the DB while you are in between code versions?

    ReplyDelete
  2. Yea, migrations are another story. But any strategy that works with MRI should work with JRuby. I've found there are basically three cases:
    1) Migrations that are purely additive and can be run without affecting the old version of the app.
    2) Migrations that can be run in two phases (additive first, then deploy, then migrations that delete stuff).
    3) Migrations that make deep changes - usually have to bring everything down.

    ReplyDelete
  3. Unicorn uses operating-system-level signals to handle the spawning and teardown of instances of your application's master deployment. I don't think the same solution could be used directly, but USR{1,2} might be usable?

    The idea is that unicorn is really easy to use this way. I'd love torquebox to be comparable, as I love jruby, the jvm and the whole torquebox team.

    ReplyDelete
  4. With the last technique described for orchestrating the context switching on a single TorqueBox node, will TorqueBox still realize that it's the same app and re-use the Infinispan cache? Will it still run only one instance of each singleton job and service, or will there be a brief period where both the old and new versions are running? Also, what's a good way of knowing that the new version has finished deploying and that it's now safe to undeploy the old version? For that matter, even if I'm using the rolling cluster redeploy, what's a good way to know that a node has finished redeploying so I can move on to the next one?

    ReplyDelete