June 17, 2022
Migrating from ECS to Kubernetes
G Gordon Worley III
A successful startup depends on great engineers, but all the engineers in the world won't matter if the code they write sits on their laptops collecting dust. Code needs to ship, and it needs to ship fast and frequently so customers have the latest and greatest version of your product.
At Plaid, we ship fast and frequently by deploying all production services on Kubernetes clusters hosted in AWS. We run hundreds of services across tens of thousands of pods on dozens of clusters. With Kubernetes, our engineers can deploy their services automatically when they merge code to master, and they get features like load balancers, autoscaling, and canary deploys with little effort on their part.
But getting there was a long journey.
The earliest production deploy system at Plaid was pretty simple: build an Amazon Machine Image (AMI), shove the AMI into an autoscaling group (ASG), put the ASG behind a load balancer, and call it a day. Got a new version of the product to ship? Build a new AMI, update the ASG, and you're good to go.
This was fine in the early days when we didn't have a lot of services, but as the complexity of our deployments grew we needed to find better solutions. In 2017, Plaid had just grown large enough to have a dedicated infrastructure team (2 people!) and containerized deploys were still somewhat new. Kubernetes, Mesos, and Swarm were in their early days, but it was clear that containerization was the right move, so we went with what seemed like the best option at the time given that we were deployed on AWS: Elastic Container Service.
ECS worked well for what it was. We could spin up clusters of machines, build our services using Docker containers, and deploy them onto our clusters using CloudFormation. This worked relatively well, but we soon discovered that the operational model of ECS was not a great fit for us.
It was somewhat complicated to create new services. Each time, we had to stand up new CloudFormation stacks and Jenkins pipelines. This, in turn, meant that most engineers didn't have direct control of their services (at the time the infrastructure team got paged for everything). ECS also didn't give us a lot of visibility into what was happening with services, which led to some frustrating periods of sitting around and hoping issues would resolve. In short, as we grew, we needed a deployment system that would scale to support more teams of more engineers delivering more products to our customers.
So in December of 2018, during a hackathon, we tried out Kubernetes. It seemed promising; maybe we would migrate to it in a quarter or two (Plaid had only 15 services at the time). Then in January of 2019 Plaid acquired Quovo. They brought additional Kubernetes experience to the team, and while much of 2019 was spent integrating systems, in the summer we signed off on a plan to move all of Plaid to Kubernetes and officially kicked off the migration in January 2020.
We didn't just want to deploy on Kubernetes, since that would have expanded many of the operational issues we had with ECS. Instead we wanted to use Kubernetes to automate much of the deployment process and hand the keys over to the teams who developed their services. That way we'd remove as many roadblocks as possible between engineers writing code and customers getting a better product. This would be a move to DevOps in the sense of developers operating their own services.
We created the first version of a system we internally call either Self-Serve or Scaffold (the latter named for scaffold.yml, the file the configuration lives in). The goal was to use this file to handle all service configuration, from how many pods to run, to which databases to use, to how secrets are managed. Since most Plaid services are written in Go, we built the tooling in Go as well so that it would run easily alongside the rest of our development toolchain.
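To make the idea of a single per-service configuration file concrete, here is a minimal sketch in Go of how such a file might be modeled and validated once loaded. The type `ServiceConfig`, its fields, and the `Validate` method are assumptions made up for illustration; the real scaffold.yml schema is not shown in this post.

```go
package main

import (
	"errors"
	"fmt"
)

// ServiceConfig is a hypothetical in-memory form of a per-service config file.
// The fields are illustrative only and do not reflect the real scaffold.yml schema.
type ServiceConfig struct {
	Name      string
	Team      string
	Replicas  int      // how many pods to run
	Databases []string // logical names of databases the service uses
	Secrets   []string // names of secrets the service needs
}

// Validate returns an error describing the first problem found, or nil.
func (c ServiceConfig) Validate() error {
	if c.Name == "" {
		return errors.New("name is required")
	}
	if c.Team == "" {
		return errors.New("team is required")
	}
	if c.Replicas < 1 {
		return fmt.Errorf("replicas must be at least 1, got %d", c.Replicas)
	}
	seen := map[string]bool{}
	for _, s := range c.Secrets {
		if seen[s] {
			return fmt.Errorf("duplicate secret %q", s)
		}
		seen[s] = true
	}
	return nil
}

func main() {
	// All values below are made-up examples.
	cfg := ServiceConfig{
		Name:      "example-api",
		Team:      "example-team",
		Replicas:  3,
		Databases: []string{"example-db"},
		Secrets:   []string{"example-api-key"},
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%s: %d pods, owned by %s\n", cfg.Name, cfg.Replicas, cfg.Team)
}
```

Keeping validation next to the config type is one way to reject a bad file before anything is deployed, which matters when service teams, rather than an infrastructure team, own the file.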
With Self-Serve as the basis of our deployment model, we started migrating services to Kubernetes.
Those old migration blues
Our plan to migrate everything to Kubernetes eventually hit some roadblocks.
We'd initially hoped to have a fairly simple setup. We'd run a single Kubernetes cluster per environment, spin up namespaces for teams, and create deployments of teams' services in their namespaces. However, it quickly became clear that this wasn't going to be enough.
Many services were easy to migrate. We moved them over in just a few months. But we started hitting scaling snags.
It turned out that one cluster per environment wasn't going to work. We had services that needed to run more pods than could comfortably fit in a single cluster. So we had to start splitting the clusters up.
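The arithmetic behind a split like this can be sketched in a few lines of Go. The numbers below (12,000 pods, 5,000 pods per cluster, 20% headroom) and the function names are hypothetical and chosen only to illustrate the calculation; they are not Plaid's actual limits.

```go
package main

import "fmt"

// clustersNeeded returns how many clusters are required to run pods when each
// cluster can hold capacity pods and headroomPercent of that capacity is kept free.
// It returns 0 when there are no pods to place or no usable capacity.
func clustersNeeded(pods, capacity, headroomPercent int) int {
	usable := capacity * (100 - headroomPercent) / 100
	if pods <= 0 || usable <= 0 {
		return 0
	}
	return (pods + usable - 1) / usable // ceiling division
}

// splitPods spreads pods as evenly as possible across n clusters.
func splitPods(pods, n int) []int {
	if n <= 0 {
		return nil
	}
	out := make([]int, n)
	for i := range out {
		out[i] = pods / n
		if i < pods%n {
			out[i]++ // the first pods%n clusters take one extra pod
		}
	}
	return out
}

func main() {
	// Hypothetical figures: 12,000 pods, 5,000 pods per cluster, 20% headroom.
	n := clustersNeeded(12000, 5000, 20)
	fmt.Println(n, splitPods(12000, n)) // prints: 3 [4000 4000 4000]
}
```

Reserving headroom in the estimate leaves room for the scaling surprises that a capacity figure on paper tends to hide.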
It also turned out that, during the ECS days, some services had built custom features to take advantage of how ECS works. These did not translate smoothly to Kubernetes. For example, our service that fetches data from banks had a custom "fast deploy" feature that let its team quickly swap out code for broken bank integrations without needing to do a full deploy.
Prometheus also became a constant source of scaling issues. As we moved more services onto Kubernetes we kept putting more and more load on our Prometheus clusters. We had originally been able to scrape the Prometheus install running in each cluster into a single mega Prometheus, but we eventually passed the point where that was feasible, forcing us to switch to Thanos for Prometheus federation. That came with its own scaling challenges that we're still dealing with today as Plaid continues to deploy more services to serve more customer use cases.
And not all of our woes were technical. It was also logistically complicated to get busy teams to migrate to Kubernetes. All new services were provisioned on Kubernetes, but existing services on ECS had to schedule their migrations for times when the infrastructure team could support them and the work fit into the product roadmap. Heroic levels of multi-quarter OKR sequencing were required to get the migration across the line.
In the end, it took two years and one month to complete the migration to Kubernetes. What did we learn along the way about doing large infrastructure migrations?
There's hidden complexity everywhere. Any system that has existed for long enough will be full of hidden "hacks" that have become well-loved "features" that will need to be supported in the new system.
Whatever scaling problems you expect to run into, there will be more scaling problems you didn't anticipate. Leave room for scaling surprises and do what you can to test scale in advance of needing it.
Work with senior leadership early to get commitments from other teams. Make sure there is broad alignment on the goals of the project and that migration work will be appropriately prioritized.
The Kubernetes future
Now that we've shut down the last service running in ECS (an internal monitoring service that required everything else to be migrated first), what's next for us?
Plaid has only continued to grow, so our Kubernetes deployment needs to grow with it. One of our current big initiatives is multicluster deployments. We're redesigning our Kubernetes clusters to be largely interchangeable so that services can be deployed across multiple clusters at once. That way, if one cluster fails or needs maintenance that could impact performance, traffic can shift to pods running the same service in another cluster. This will allow us to achieve even higher reliability targets for our customers.
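As a rough illustration of the traffic-shifting idea, the Go sketch below splits traffic across clusters in proportion to the ready pods each available cluster runs for a service. It is only a toy model under stated assumptions: the `Cluster` type, the `trafficShares` function, and the proportional weighting rule are all hypothetical and do not describe how Plaid routes traffic.

```go
package main

import "fmt"

// Cluster is a hypothetical view of one cluster running a given service.
type Cluster struct {
	Name      string
	Available bool // false when the cluster has failed or is under maintenance
	ReadyPods int  // ready pods for the service in this cluster
}

// trafficShares returns the fraction of traffic each cluster should receive,
// proportional to ready pods in available clusters. Unavailable clusters get 0.
// If no available cluster has ready pods, every share is 0.
func trafficShares(clusters []Cluster) map[string]float64 {
	shares := make(map[string]float64, len(clusters))
	total := 0
	for _, c := range clusters {
		shares[c.Name] = 0
		if c.Available {
			total += c.ReadyPods
		}
	}
	if total == 0 {
		return shares
	}
	for _, c := range clusters {
		if c.Available {
			shares[c.Name] = float64(c.ReadyPods) / float64(total)
		}
	}
	return shares
}

func main() {
	clusters := []Cluster{
		{Name: "cluster-a", Available: true, ReadyPods: 10},
		{Name: "cluster-b", Available: true, ReadyPods: 10},
	}
	fmt.Println(trafficShares(clusters)) // map[cluster-a:0.5 cluster-b:0.5]

	// Take cluster-a out for maintenance: all traffic shifts to cluster-b.
	clusters[0].Available = false
	fmt.Println(trafficShares(clusters)) // map[cluster-a:0 cluster-b:1]
}
```

This only works if the clusters are interchangeable, as described above: the remaining cluster must be able to serve the same service with enough capacity to absorb the shifted traffic.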
If you're interested in working on the challenges that come from supporting a large Kubernetes install and making it easy and simple for our internal engineering teams to build on top of, we're hiring!