Zero-Downtime Deploys with a Single Server
PyDist uses a custom blue/green deploy mechanism to achieve zero-downtime deploys without significant resource overhead. A standard blue/green deploy setup requires two application instances—a live instance serving production traffic, and a standby instance where the new application version is deployed. These sit behind a proxy (usually nginx) which initially sends all traffic to the live instance, but can be hot-reloaded to start routing requests to the standby instance once it is ready:
In the above illustration, "blue" is the production instance and "green" is the standby instance at the start of the deploy. The Nginx proxy sends production traffic to the blue instance, but routes special staging URLs to the green instance. New versions of the application are deployed to a standby instance (green in this case). Once the deploy has finished and any verification of the green instance has passed, we can update the Nginx configuration to swap the roles of blue and green, so production traffic starts going to the green instance.
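As a rough illustration of the routing just described, here is a small Python function that renders an Nginx configuration for either role assignment. This is only a sketch: the instance addresses, the /staging/ prefix, and the name render_nginx_conf are assumptions made for the example, not PyDist's actual configuration.

```python
# Hypothetical addresses for the two application instances (assumed, not from PyDist).
ADDRESSES = {"blue": "10.0.0.1:8000", "green": "10.0.0.2:8000"}


def render_nginx_conf(live: str) -> str:
    """Render a proxy config that treats `live` as the production instance."""
    if live not in ADDRESSES:
        raise ValueError(f"unknown instance: {live!r}")
    standby = "green" if live == "blue" else "blue"
    blue_addr = ADDRESSES["blue"]
    green_addr = ADDRESSES["green"]
    return f"""\
upstream blue {{ server {blue_addr}; }}
upstream green {{ server {green_addr}; }}

server {{
    listen 80;

    # Staging URLs (assumed /staging/ prefix) reach the standby instance.
    location /staging/ {{ proxy_pass http://{standby}/; }}

    # All other requests are production traffic.
    location / {{ proxy_pass http://{live}; }}
}}
"""
```

Calling render_nginx_conf("blue") and render_nginx_conf("green") yields two files that differ only in which upstream each location points at, which is what makes the swap a pure configuration change.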
To facilitate this, I maintain two Nginx configurations, one of which routes production traffic to blue and the other to green. To avoid the two drifting out of sync, the green configuration is created from the blue configuration via a make rule:

nginx/nginx-green.conf: nginx/nginx-blue.conf
	cp nginx/nginx-blue.conf nginx/nginx-green.conf
	sed -i 's/blue/cyan/g' nginx/nginx-green.conf
	sed -i 's/green/blue/g' nginx/nginx-green.conf
	sed -i 's/cyan/green/g' nginx/nginx-green.conf
Note that "cyan" is essentially a temporary variable to allow swapping blue and green.
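The placeholder is needed because the sed substitutions run one after another: replacing blue with green directly would leave no way to tell the original greens from the new ones. For comparison, here is a minimal Python sketch of the same swap done in a single pass, which needs no temporary name (and so cannot collide with a file that happens to contain "cyan"). The function name is hypothetical; this is not the method PyDist uses.

```python
import re


def swap_blue_green(text: str) -> str:
    """Swap every occurrence of 'blue' and 'green' in one pass over the text."""
    swap = {"blue": "green", "green": "blue"}
    # Each match is replaced by its counterpart, so no placeholder is required.
    return re.sub(r"blue|green", lambda m: swap[m.group(0)], text)
```

Applying the function twice returns the original text, which is a handy sanity check for the generated configuration.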
Deploys overwrite the current Nginx configuration with the configuration pointing to the other application. This does not take effect until I run sudo systemctl reload nginx, after which new requests will get routed according to the new configuration. Nginx configuration reloads are atomic and do not disrupt in-flight traffic, so this deploy process results in zero downtime.
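The switchover step could be sketched in Python roughly as follows. The file layout, the function name switch_to, and the backup-and-validate steps are assumptions for illustration; the post only states that the configuration is overwritten and Nginx is reloaded. The sketch assumes active_conf is the file Nginx actually loads and that the script runs with sufficient privileges.

```python
import shutil
import subprocess
from pathlib import Path


def switch_to(color: str, conf_dir: Path, active_conf: Path,
              run=subprocess.run) -> None:
    """Install the config for `color`, validate it, then reload Nginx."""
    if color not in ("blue", "green"):
        raise ValueError(f"unknown instance: {color!r}")
    source = conf_dir / f"nginx-{color}.conf"  # hypothetical file naming
    backup = active_conf.with_name(active_conf.name + ".bak")

    # Keep a copy of the current config so a bad one can be rolled back.
    if active_conf.exists():
        shutil.copyfile(active_conf, backup)
    shutil.copyfile(source, active_conf)

    try:
        # `nginx -t` checks the configuration without applying it.
        run(["nginx", "-t"], check=True)
    except subprocess.CalledProcessError:
        if backup.exists():
            shutil.copyfile(backup, active_conf)
        raise

    # Only a successful reload changes how new requests are routed.
    run(["systemctl", "reload", "nginx"], check=True)
```

The run parameter exists only so the function can be exercised without touching a real Nginx, for example by passing a stub that records the commands.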
The downside of this process is that we are using three servers to do the work of one. PyDist is a self-funded service offering low-cost package hosting, so avoiding unnecessary infrastructure cost is important.
One improvement is to make the standby instance ephemeral, setting it up immediately before a deploy and then tearing the old instance down after connections have fully drained from it. This reduces the number of servers to 2 + ε, but significantly complicates the deploy process. Instead, PyDist does all of this on one server:
At the cost of higher memory usage on the application server, this architecture reduces resource overhead, reduces latency, and eliminates a point of failure. It also makes deploys simpler and less error-prone because there is only a single server to interact with. Because Nginx is so efficient and the staging routes see so little traffic, the application performance impact is minimal. The staging server can be stopped once the switchover is complete to reclaim even this small overhead.
Automating the Deploy Process
Deploys should be safe, fast, and painless, which means automating them down to as few commands as possible. Since I want to allow manual testing of the new instance before switching production traffic to it, this requires a minimum of two commands—one to deploy the new instance and another to update the Nginx config and switch traffic over. For PyDist's UI server this looks like:
python deploy.py [--dry-run] [--remote] [--init] [--install] [--migrate] [--autoswitch]
ssh pydist.com "sudo /mnt/pydist/switch.sh"
The deploy script is essentially a wrapper around a few calls to rsync and invoking small scripts on the server via ssh, with options such as:
- dry-run: print the commands that would be invoked without actually deploying
- remote: deploy to pydist.com instead of a local test server
- init: set up a server from scratch
- install: install/update Python dependencies (slow, so not done by default)
- migrate: run database migrations
- autoswitch: automatically switch staging and production once the deploy finishes (in which case the second command is unnecessary)
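A command-line interface with these options might be wired up as in the sketch below. The flag names come from the usage line above; everything else (the help texts, the run_command helper, and how dry-run is honored) is an assumption about how such a script could be structured, not the actual deploy.py.

```python
import argparse
import shlex
import subprocess


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags listed above."""
    parser = argparse.ArgumentParser(prog="deploy.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="print commands instead of running them")
    parser.add_argument("--remote", action="store_true",
                        help="deploy to pydist.com instead of a local test server")
    parser.add_argument("--init", action="store_true",
                        help="set up a server from scratch")
    parser.add_argument("--install", action="store_true",
                        help="install/update Python dependencies")
    parser.add_argument("--migrate", action="store_true",
                        help="run database migrations")
    parser.add_argument("--autoswitch", action="store_true",
                        help="switch staging and production after deploying")
    return parser


def run_command(cmd: list[str], dry_run: bool) -> None:
    """Print the command; execute it only when not in dry-run mode."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
```

Routing every rsync and ssh invocation through one helper like run_command is what makes a dry-run flag cheap to support: the script's logic runs unchanged and only the side effects are skipped.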
Originally I used a Bash script for the deploy script, but (as is usually the case) once I added non-trivial logic to it I came to regret that decision and rewrote it in Python. I'm pretty happy with the script now, although its interface could be simplified further if it detected on its own whether dependency updates or database migrations are necessary.
One challenge of blue/green deployments with persistent instances is that the blue and green instances swap roles with each deployment, so subsequent deploys need to target the other instance. To handle this, the Nginx configuration includes a special /bg route, which returns the string "blue" or "green" depending on which instance is serving production traffic. The deploy script queries this route and then deploys to the opposite instance.
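The target selection can be sketched in a few lines of Python. The function names and the injectable fetch parameter are hypothetical, and the sketch assumes the /bg route returns the color as plain text; the real deploy script may do this differently.

```python
from urllib.request import urlopen


def live_instance(base_url: str, fetch=None) -> str:
    """Ask the proxy which instance currently serves production traffic."""
    if fetch is None:
        def fetch(url: str) -> str:
            with urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
    color = fetch(base_url.rstrip("/") + "/bg").strip()
    # Refuse to guess if the route returns anything unexpected.
    if color not in ("blue", "green"):
        raise ValueError(f"unexpected /bg response: {color!r}")
    return color


def deploy_target(live: str) -> str:
    """Return the instance that is not serving production traffic."""
    return "green" if live == "blue" else "blue"
```

For example, deploy_target(live_instance("https://pydist.com")) would name the instance to deploy to. Failing loudly on an unexpected response matters here, since deploying to the wrong instance would overwrite the one serving production.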
Keeping it Simple
You may have noticed that I didn't mention building containers or running a CI/CD pipeline—standard features in every infrastructure blog post these days. These technologies have their place, but they come at a cost.
Containers create another layer of abstraction between you and your code, which can be helpful (if the container is easier to reason about than the underlying operating system) or harmful (if the abstraction leaks, or the underlying operating system is more familiar or convenient to work with). They are orders of magnitude larger than my application, slowing down deploys. And they come with performance penalties which are not always easy to reason about.
Of course, containers have their place. They are useful for:
- developing on a different operating system than your servers use,
- deploying the same image to a fleet of servers, at deploy time or through auto-scaling,
- bringing your resume up-to-date, and
- writing blog posts about how superior your infrastructure is.
In short, they are great for larger companies and I can wholeheartedly recommend them to my competitors.
CI/CD pipelines are more benign, but they're not really necessary when your build process is trivial enough to fold into a deploy script (in my case, it is just make) and you don't have to worry about other developers skipping proper verification. As a solo developer, I can afford to use a more holistic deploy process, combining automated tests with manual QA for large code changes, but only cursory checks when I correct a typo or publish a new blog post.