December 19, 2024
Security at scale: Plaid’s journey to creating a key management system
Shuaiwei Cui is a software engineer with a deep passion for data and security. Anirudh (Ani) Veeraragavan is a security software engineer who is experienced in identifying and mitigating security risks by building scalable systems and processes.
Staying secure means staying ahead of what’s coming. At Plaid, we run hundreds of services across tens of thousands of pods on dozens of Kubernetes clusters hosted in AWS. Our teams ship fast and frequently, so our infrastructure landscape will soon look different from how it does today. In the face of all this change, Plaid’s Security team is tasked with enabling the business by managing risk in an efficient and proactive manner. Building an internal Key Management System (Plaid KMS) was a fundamental piece of our security strategy for managing sensitive data at scale, and now that we’ve operated it for 3+ years, we’re confident that we have not only secured the present but also prepared to secure the future.
In this blog, we will describe our journey of creating and leveraging a secure Key Management System to protect sensitive data at Plaid.
Design
Cryptography is the foundation of data security. Before Plaid KMS, engineers found it hard to leverage cryptographic operations, and approaches were bespoke across 22+ services maintained by 13+ different teams. The Security team was struggling to manage existing use cases, let alone future ones, as Plaid was entering a period of massive growth. The urgency of building a Paved Road for cryptography was apparent.
While there were potential vendor solutions, such as AWS KMS, we chose to build our own KMS to address challenges unique to Plaid.
Scalability: We needed a solution that could seamlessly scale with our rapidly growing data volumes without the constraints of vendor-imposed limits. For example, AWS KMS had and still has account-level limits on API calls and number of keys, which were inadequate for our needs.
Cost efficiency: Using a vendor KMS at our scale would have incurred substantial recurring expenses. Cost efficiency is a business priority, and ideally we would not just hold costs steady but actively reduce them.
Self-serve: We wanted to empower our engineers to use secure solutions independently. By building an internal KMS, we would be able to deeply integrate with existing tools at Plaid. For example, access control of Plaid KMS-managed cryptographic keys is provided by our internal authentication and authorization platform, which our engineers are already familiar with.
While we want to avoid “not invented here syndrome,” we also acknowledge that deep ownership of critical services can accelerate business outcomes. The experience of building and operating Plaid KMS has empowered the Security Team to accelerate investment in security solutions running in the production critical path. These strategic benefits, alongside the technical benefits listed above, have made the decision well worth it.
Architecture
In the above architecture diagram, the Plaid KMS components in blue are owned by the Security team and the client services in yellow are owned by different Engineering teams. Plaid KMS uses gRPC for inter-service communication, YAML files for access control configurations, and an SQL database for long-term cryptographic keys.
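To make this a bit more concrete, the sketch below shows, in Go, one plausible shape for an access-control entry loaded from YAML and for a long-term key record in the SQL database. The type and field names are hypothetical illustrations, not Plaid's actual schema.

```go
package kmsmodel

import "time"

// AccessPolicy mirrors a hypothetical YAML access-control entry that grants
// client services permission to use a long-term key. The field names are
// illustrative, not Plaid's actual configuration format.
type AccessPolicy struct {
	KeyName         string   `yaml:"key_name"`
	OwningTeam      string   `yaml:"owning_team"`
	AllowedServices []string `yaml:"allowed_services"`
	AllowedOps      []string `yaml:"allowed_ops"` // e.g. "encrypt", "decrypt", "sign", "verify"
}

// LongTermKey is an illustrative shape for a key-encryption-key record in the
// SQL database. The key material is stored wrapped (encrypted) by the root of
// trust and is only ever unwrapped in memory.
type LongTermKey struct {
	ID              string
	Name            string
	Algorithm       string // e.g. "AES-256-GCM" or "ECDSA-P256"
	WrappedMaterial []byte // ciphertext produced by the root of trust
	CreatedAt       time.Time
	RotatedAt       time.Time
}
```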
One of the key early decisions we made was that AWS KMS provides our root of trust. Rather than operating our own hardware security modules (HSMs), we rely on AWS KMS to operate them. HSMs are hardened, tamper-resistant hardware devices that enable cryptographic operations. It's critical that these devices are tested, validated, and certified to the highest security standards, such as FIPS 140-3, so by relying on AWS, we can instead focus our efforts on Plaid-specific integration and user experience. Since we only use AWS KMS during the bootstrap process, we alleviate concerns about scalability and cost efficiency while getting the velocity benefits that come with offloading complexity to a vendor.
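As a minimal sketch of that bootstrap step, assuming the AWS SDK for Go v2 (the function name and surrounding flow are illustrative, not Plaid's actual implementation), a long-term key wrapped by the root of trust would be sent to AWS KMS once at startup and the recovered plaintext kept only in memory:

```go
package bootstrap

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kms"
)

// unwrapLongTermKey is a hypothetical bootstrap helper: it asks AWS KMS (the
// root of trust) to decrypt the wrapped key material loaded from the SQL
// database. After this call, AWS KMS is no longer on the request path.
func unwrapLongTermKey(ctx context.Context, wrapped []byte) ([]byte, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	client := kms.NewFromConfig(cfg)

	// For symmetric KMS keys, the ciphertext carries enough metadata that no
	// explicit KeyId is required in the Decrypt call.
	out, err := client.Decrypt(ctx, &kms.DecryptInput{CiphertextBlob: wrapped})
	if err != nil {
		return nil, err
	}
	return out.Plaintext, nil
}
```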
Another key early decision we made was to leverage envelope encryption, which is the process of encrypting a key with another key. During an encryption request, a randomly generated data key (or data encryption key) is used to encrypt the specified payload. The data key is then encrypted by one of the long-term keys (or key encryption keys). The encrypted payload along with the encrypted data key is then returned to the user. During a decryption request, the encrypted data key is decrypted and then used to decrypt the encrypted payload. During a sign / verify request, the long-term keys are used directly.
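The following Go sketch shows what the encrypt path can look like under envelope encryption, assuming AES-256-GCM for the data key and a hypothetical wrapDataKey callback standing in for the Plaid KMS call that encrypts the data key under a long-term key:

```go
package envelope

import (
	"context"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
)

// Envelope bundles what is needed to decrypt later: the ciphertext plus the
// data key, wrapped by a long-term key.
type Envelope struct {
	Ciphertext       []byte
	Nonce            []byte
	EncryptedDataKey []byte
}

// encrypt generates a fresh data key, encrypts the payload locally with
// AES-256-GCM, and then has the KMS wrap the data key. wrapDataKey is a
// stand-in for the actual (hypothetical) Plaid KMS RPC.
func encrypt(ctx context.Context, keyName string, payload []byte,
	wrapDataKey func(ctx context.Context, keyName string, dataKey []byte) ([]byte, error),
) (*Envelope, error) {
	dataKey := make([]byte, 32) // 256-bit data encryption key
	if _, err := rand.Read(dataKey); err != nil {
		return nil, err
	}

	block, err := aes.NewCipher(dataKey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	ciphertext := gcm.Seal(nil, nonce, payload, nil)

	// Only the small, fixed-size data key ever crosses the network.
	wrappedKey, err := wrapDataKey(ctx, keyName, dataKey)
	if err != nil {
		return nil, err
	}
	return &Envelope{Ciphertext: ciphertext, Nonce: nonce, EncryptedDataKey: wrappedKey}, nil
}
```

Decryption reverses the flow: the KMS unwraps the data key, and the payload is then decrypted locally with the same algorithm.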
Usage
With Plaid KMS, engineers are able to leverage cryptographic operations within their services in a self-serve manner. Without any intervention from the Security team, they can provision long-term keys for their use case, define appropriate access control configurations for their keys, integrate the Plaid KMS client within their services, and send requests using a simple encrypt / decrypt or sign / verify interface. Behind the scenes, we ensure that decisions such as algorithm choice, key length, rotation frequency, and more are all secure by default. By encapsulating this complexity, we enable our engineers to focus on what they do best: delivering value for our customers and consumers in a safe and secure manner.
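As a sketch of what that surface could look like from a client service's point of view (the names here are illustrative, not Plaid's actual API), everything beyond the key name stays hidden behind the interface:

```go
package kmsclient

import "context"

// Client sketches a hypothetical self-serve interface. keyName refers to a
// long-term key the team has provisioned; algorithm choice, key length, and
// rotation are all handled behind the interface.
type Client interface {
	Encrypt(ctx context.Context, keyName string, plaintext []byte) (ciphertext []byte, err error)
	Decrypt(ctx context.Context, keyName string, ciphertext []byte) (plaintext []byte, err error)
	Sign(ctx context.Context, keyName string, message []byte) (signature []byte, err error)
	Verify(ctx context.Context, keyName string, message, signature []byte) error
}
```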
Challenges
Operational excellence
As a critical service for cryptographic operations, Plaid KMS must maintain high availability and efficient performance, even in the face of fluctuating workloads. Achieving this level of operational rigor required close collaboration with other engineering teams in order to implement numerous enhancements.
Dedicated, elastic resources: Plaid KMS runs on its own dedicated network and compute resources. Additionally, all resources on the request / response path autoscale in response to increased usage. Plaid KMS is a CPU-bound service, so we found CPU utilization to be a high-signal metric for informing our autoscaling configuration.
Workload segregation: On top of having dedicated resources, we also segregated them by online and offline workloads. Online workloads, such as API requests from customers, require high availability / low latency and are prioritized. Offline workloads, such as our data analytics platform, are more fault tolerant but involve large amounts of data and requests. This separation prevents the higher-volume but less frequent offline workloads from disrupting online traffic.
Optimized API usage: Finally, we reduced the amount of traffic that goes to Plaid KMS in the first place. By taking advantage of envelope encryption, we encrypt / decrypt variable-sized payloads locally on the client service and only encrypt / decrypt the fixed-size data key using Plaid KMS. This technique not only reduces the necessary network bandwidth but also makes performance more predictable. Additionally, client services can batch decrypt multiple payloads in a single request, which enables more efficient bulk processing, as shown in the sketch below.
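A rough Go sketch of that batching pattern follows, reusing the Envelope type from the earlier envelope-encryption sketch; batchUnwrap stands in for a hypothetical batch RPC to Plaid KMS:

```go
package envelope

import (
	"context"
	"crypto/aes"
	"crypto/cipher"
)

// decryptBatch unwraps every data key in a single (hypothetical) batch call
// to Plaid KMS, then decrypts the variable-sized payloads locally with
// AES-256-GCM, so the KMS only ever handles small, fixed-size data keys.
func decryptBatch(ctx context.Context, keyName string, envelopes []*Envelope,
	batchUnwrap func(ctx context.Context, keyName string, wrapped [][]byte) ([][]byte, error),
) ([][]byte, error) {
	wrapped := make([][]byte, len(envelopes))
	for i, e := range envelopes {
		wrapped[i] = e.EncryptedDataKey
	}

	// One Plaid KMS round trip for the whole batch instead of one per payload.
	dataKeys, err := batchUnwrap(ctx, keyName, wrapped)
	if err != nil {
		return nil, err
	}

	plaintexts := make([][]byte, len(envelopes))
	for i, e := range envelopes {
		block, err := aes.NewCipher(dataKeys[i])
		if err != nil {
			return nil, err
		}
		gcm, err := cipher.NewGCM(block)
		if err != nil {
			return nil, err
		}
		pt, err := gcm.Open(nil, e.Nonce, e.Ciphertext, nil)
		if err != nil {
			return nil, err
		}
		plaintexts[i] = pt
	}
	return plaintexts, nil
}
```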
Beyond the technical improvements, we also ensure operational excellence is a part of our daily practices. We regularly review service KPIs, perform pre-mortems for high-risk changes, and invest heavily in observability. Doing the hard work upfront ensures quiet on-call rotations for our team down the line.
Legacy migrations
The most time-consuming part of the project wasn't designing or building Plaid KMS; it was the migration. While we anticipated that the migration would be complex, the number of bespoke approaches to cryptography sprinkled among client services meant that we continually ran into unforeseen challenges.
We approached the migration using a standard playbook of Derisk, Enable, and Finish. We started with the most challenging client services to ensure that Plaid KMS was operationally ready, scaled migrations for 80% of client services by providing self-serve tooling and documentation, and personally migrated the remaining 20% of long-tail client services that didn't conform to expected practices, were unstaffed, or had other issues.
Our investments in operational excellence ensured we were well prepared for the Derisk and Enable phases, but getting through the Finish phase required creativity and grit. These long-tail cases demanded close collaboration between service owners and the Security team to ensure a smooth and successful migration. While it can be tempting to ignore such cases, the benefits of a migration are only realized once you fully finish it, so we pushed onwards.
Today, all Plaid services have fully onboarded onto Plaid KMS. Any enhancements can now be implemented directly in Plaid KMS, simplifying future upgrades or migrations. Deleting all the legacy code was a milestone that took a long time to reach, but it was well worth the wait.
Lessons learned
Plaid KMS is now one of the highest-traffic services at Plaid, operating as a critical component with high-reliability standards. Our security policies mandate that all sensitive data must be protected using Plaid KMS. By executing this project, we learned a few key lessons.
Ownership is ongoing: Building a secure and easy-to-use key management system was just the beginning. Ensuring it operates as critical infrastructure requires significant effort, both during initial deployment and during ongoing usage to uphold reliability and performance standards.
Migration takes time, and strategy matters: Centralizing cryptographic operations from various services into a unified system required more than technical expertise; it demanded a well-thought-out strategy. Effective collaboration between service owners and the migrating team is vital to address the unique challenges each service may present. For particularly complex migrations, it's essential to plan thoroughly, allocate ample time, and involve both parties early to facilitate a seamless transition.
What’s next
Plaid KMS is now the standard for cryptographic operations across our services. Our next focus is extending its capabilities to further secure our data lakehouse. This expansion comes with its own set of challenges, and we’re actively working to address them.
Acknowledgments
Thanks to the rest of the team, both present and past: Rosalyn Wong, Kaiyi Li, Jiangtao Li, Stephan Pfistner, and Haike (Hank) Yuan.