February 28, 2018
Link Deployment with AWS Lambda@Edge
Updated on December 11, 2020
We’re always working to make Link the best way for developers to start building on Plaid. Link is our SDK that handles user authentication, two-factor flows, and error handling. Every release must clear a quality process that includes rigorous code review, an extensive unit test suite, and integration tests across multiple platforms and devices.
Still, because thousands of developers (and tens of millions of consumers) depend on Link, we’re always on the lookout for ways to introduce more stability and safer deploy processes. It’s common for backend services to slowly roll out new code to a subset of users before a new version starts to serve all traffic, and recently we wanted to test out this practice on a front-end application like Link.
This would be relatively straightforward if we were serving Link directly from our own infrastructure, because web servers like nginx provide features for splitting traffic between versions.
But front-end apps, because they consist mostly of static assets, are often served from a CDN, which acts as a fast, geographically distributed cache. For Link, we use Amazon CloudFront. Using a CDN limits our control over individual requests and makes it hard to serve different versions from the same URL. Other companies have solved this problem by manipulating DNS resolution with weighted rules that point to CDNs holding different asset versions, but we wanted something simpler.
We came up with a new approach to this problem with Amazon’s release of Lambda@Edge, which allows running custom code for each CloudFront request. With Lambda@Edge we’ve been able to implement a new deployment system that slowly rolls out new versions of Link to our end users.
How it works
We’d already been deploying each new release of Link to version-based URLs. There’s a single entry-point URL (called link-initialize.js), which our customers use to load Link. This file points to further assets by their versioned URLs. With each new release, we overwrite link-initialize.js, which causes our users to receive the new version’s assets.
With Lambda@Edge, we’re able to capture requests to link-initialize.js and implement custom logic for selecting which version is served. For the first three hours after a release, we send the new version to a small percentage of users, while the rest continue to receive the old, stable version. We watch our error and conversion rates in the three-hour window, so we’re prepared to roll back the release if we observe any problems. If no issues arise, after three hours we automatically switch all traffic to the new version.
A simplified version (one that selects users randomly) of our Lambda function looks like this:
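A minimal sketch of such a viewer-request function might look like the following. The version numbers, rollout weight, release timestamp, and versioned URL layout are hypothetical placeholders, not the production values.

```javascript
'use strict';

// Hypothetical values for illustration only.
const STABLE_VERSION = '2.0.100';
const CANDIDATE_VERSION = '2.0.101';
const CANDIDATE_WEIGHT = 0.1; // share of traffic that gets the new version during the rollout
const RELEASED_AT = Date.parse('2018-02-28T12:00:00Z'); // assumed deploy time of the candidate
const ROLLOUT_WINDOW_MS = 3 * 60 * 60 * 1000; // three hours

// Pick a version given a number in [0, 1) and the current time in milliseconds.
function selectVersion(roll, now) {
  // Once the rollout window has passed, everyone gets the new version.
  if (now - RELEASED_AT >= ROLLOUT_WINDOW_MS) {
    return CANDIDATE_VERSION;
  }
  return roll < CANDIDATE_WEIGHT ? CANDIDATE_VERSION : STABLE_VERSION;
}

// Lambda@Edge viewer-request handler: rewrite the URI to a versioned path.
function handler(event, context, callback) {
  const request = event.Records[0].cf.request;
  const version = selectVersion(Math.random(), Date.now());
  // Assumed URL layout: /link/v2/<version>/link-initialize.js
  request.uri = `/link/v2/${version}/link-initialize.js`;
  callback(null, request);
}

exports.handler = handler;
```

Keeping the selection logic in a pure function such as `selectVersion` makes it easy to test without a CloudFront event.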
We deploy the Lambda to AWS and then enable it in our CloudFront distribution’s settings. Lambda@Edge functions can run for one of four events: viewer-request, viewer-response, origin-request, and origin-response.
viewer-* events are triggered whenever a resource is requested.
origin-* events happen on a cache miss, when CloudFront forwards the request to the underlying server (or to S3, as is the case for Link).
We need our function to run on each viewer-request so that it can select the version on a per-user basis.
Unfortunately, while the simplified Lambda above does distribute traffic according to our chosen weights, it makes this decision anew each time a user loads Link. As a result, an individual user could receive a different version on each load for the duration of the rollout. We need users to be assigned consistently to either the new version being tested or the old version.
One way we could achieve consistency is by setting a cookie when we select a version and then reading it when link-initialize.js is requested. But setting a cookie would also make this process visible to users and would require a separate viewer-response Lambda function.
As an alternative to cookies, then, we opted to hash the user’s IP address and their User-Agent string. We then convert the hash value into the [0, 1) range and use that instead of Math.random() to decide on the version served. The hash value is consistent when the user reloads the page. Even though multiple users can share the same IP address and User-Agent string, this method is good enough to partition traffic for rollout purposes.
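The hashing step can be sketched as follows. The choice of SHA-256 and of the first four digest bytes is an assumption made for illustration (any stable hash would do), and the helper names are hypothetical. The `clientIp` field and the lowercase `user-agent` header key follow the CloudFront viewer-request event format.

```javascript
'use strict';
const crypto = require('crypto');

// Map an IP address and User-Agent string to a stable number in [0, 1).
// SHA-256 is an assumed choice; the source does not name the hash function.
function hashToUnitInterval(clientIp, userAgent) {
  const digest = crypto
    .createHash('sha256')
    .update(`${clientIp}|${userAgent}`)
    .digest();
  // Read the first 4 bytes as an unsigned 32-bit integer and scale by 2^32.
  return digest.readUInt32BE(0) / 0x100000000;
}

// Pull the inputs out of a CloudFront viewer-request object.
function rollForRequest(request) {
  const uaHeader = request.headers['user-agent'];
  const userAgent = uaHeader && uaHeader.length > 0 ? uaHeader[0].value : '';
  return hashToUnitInterval(request.clientIp, userAgent);
}
```

The result of `rollForRequest(request)` can then be compared against the rollout weight in place of `Math.random()`, so the same visitor lands in the same group on every load.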
While evaluating Lambda@Edge, we noticed that a viewer-request Lambda@Edge function adds some latency, roughly on the order of 50ms, which in the worst case would double the response time of a request. Luckily, AWS CloudFront makes it possible to selectively enable Lambda@Edge functions based on URL patterns. We do this by adding a Cache Behavior for /link/v2/stable/* URLs and only run the Lambda function there. This means the small overhead applies only to the link-initialize.js request, and further assets are served without running the Lambda.
Because our versioned assets never change, we can use the browser’s caching and set long-lived Cache-Control headers. We don’t want these headers to be served for link-initialize.js. If we have to do a rollback, it’s important that users don’t continue using the version we’re rolling back, which could happen with browser caching.
Previously, we’d just set proper Cache-Control headers on S3 and these were all forwarded by CloudFront. This is more complex in our new setup. Because our Lambda redirects link-initialize.js requests to versioned URLs, the long-lived Cache-Control headers get forwarded too. We’ve solved this by adding a separate viewer-response Lambda, which overrides the Cache-Control response header.
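A minimal sketch of such a viewer-response function is shown here. The specific Cache-Control value is an assumption chosen for illustration; the source does not state which value is used in production.

```javascript
'use strict';

// Assumed policy: tell browsers not to reuse a cached link-initialize.js.
const ENTRY_POINT_CACHE_CONTROL = 'no-cache, no-store, must-revalidate';

// Lambda@Edge viewer-response handler: replace whatever Cache-Control
// header came back from the origin with the short-lived policy above.
function handler(event, context, callback) {
  const response = event.Records[0].cf.response;
  response.headers['cache-control'] = [
    { key: 'Cache-Control', value: ENTRY_POINT_CACHE_CONTROL },
  ];
  callback(null, response);
}

exports.handler = handler;
```

Because this function is attached only to the entry-point URL pattern, the versioned assets keep their long-lived headers.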
We’ve been using our new deployment process for a few months and released more than 30 versions of Link this way.
In the process, we’ve learned that the deploy system only solves half the problem: The key is in setting up proper metrics to monitor during the rollout period. For example, we once released a version that broke Link for an older version of Internet Explorer with very few users. Because our metrics were not granular enough to capture the issue, we ended up finding out from a customer bug report. In response, we revamped our error monitoring and built new monitoring dashboards.
The increased confidence allows us not only to sleep better—knowing that in the worst case, only a small subset of users would receive a broken release—but also to deploy more often and with less friction.
P.S. We’re hiring! Build tools for developers and impact how tens of millions of people manage their money.