July 25, 2022
AWS SSO in a DevOps first world
Ashish Kurmi
At Plaid, we believe in baking in security best practices at every step of the DevOps workflow. We have an automated CI/CD pipeline to manage AWS & Kubernetes resources and the production platform runs on it. This means in many cases, engineers at Plaid do not need to interact directly with AWS resources for daily production management and updates. However, at times, Plaid engineers use their AWS user identities for accessing AWS resources and Kubernetes clusters for development, support, and troubleshooting.
Before we enabled AWS SSO at Plaid, we used another third-party solution to federate our corporate user identity with AWS via Okta’s SAML federation. However, it offered poor support for temporary CLI/API access because it had no official CLI tool. It also blocked us from adopting some important advanced MFA protection controls that were functionally incompatible with the older solution. For these reasons, in 2021, we planned to replace our solution with AWS SSO.
Due to some constraints described below, we took an approach that differs from a standard AWS SSO deployment, although it is not uncommon in the industry. In this post, we’ll talk about how we built end-to-end automated solutions for our DevOps scenarios and our key learnings so far.
An unconventional approach to AWS SSO
In the older solution, each user role was defined as an AWS IAM role. Okta allowed us to map such IAM roles to specific Okta groups. Our initial approach was to convert all such IAM roles to SSO permission sets. However, we quickly realized that this approach would not work for a couple of reasons.
At Plaid, we author the least privileged IAM policy document for a given task. As AWS-managed policies grant broad access, we have multiple reusable snippets of custom IAM policies that allow users to achieve specific goals. For example, we have a custom IAM policy for granting read-only access to our AWS billing data. A typical user role has access to several custom policies. In the second half of 2021, AWS SSO did not support customer-managed policies for permission sets. Furthermore, it allowed only one custom inline policy per permission set, with a maximum of 10 KB of policy content. These constraints made it difficult to migrate several of our existing team IAM roles to permission sets, as highly restricted policies may include numerous resource restrictions or complex conditionals to ensure the least privileges are granted. AWS recently added support for customer-managed policies in AWS SSO, which alleviates some of these pain points.
For development and troubleshooting purposes, a few special user roles allowed certain teams to select their service IAM roles at login. As SSO creates dedicated IAM roles for user access, it won’t allow these teams to log into their service roles without performing additional manual steps.
Requirements
We have a high bar for optimizing the developer experience at Plaid. Engineers work collaboratively to reduce friction and maintain high development velocity. Wherever reasonable, we create easy-to-use self-service scenarios so engineers can complete engineering tasks and operations without relying on others. We also simplify our tooling wherever possible so that even engineers without the relevant domain knowledge can accomplish their everyday tasks. For these reasons, solutions that required manual steps or specific IAM knowledge were eliminated from consideration early on.
Solution
We knew the new federation system would need to eventually assume the existing team IAM roles until the constraints mentioned above are mitigated. To enable this scenario, we took the following approach for creating SSO permission sets.
We use Terraform for managing our AWS infrastructure including our AWS SSO deployment. We authored two internal AWS SSO Terraform modules to help us manage our AWS SSO Terraform templates with ease.
For every existing team IAM role, we created a new empty SSO permission set named {Team IAM Role Name}-Proxy. These proxy SSO permission sets don’t grant any privileges themselves. We mapped these proxy SSO permission sets to the relevant Okta groups.
Once the above Terraform change is deployed, our CI/CD pipeline creates an SSO permission set. In addition, inside all AWS accounts that this permission set is assigned to (e.g., in the awsacct_west account as shown in the screenshot), AWS creates an IAM role that represents the proxy SSO permission set.
We then update the trust policy of the existing team IAM role so that it can be assumed by this newly created IAM role. For development and troubleshooting, we also created these proxy roles for a few service IAM roles. Essentially, the sole purpose of these proxy SSO permission sets is to assume the correct team IAM roles. The existing team IAM roles already have all the access policies defined on them, so assuming them results in zero changes to the end user’s permissions when logged in via AWS SSO.
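As a rough illustration of the trust policy update described above, the following Python sketch builds a trust policy document that lets the SSO-created proxy role assume a team IAM role. It is a sketch under stated assumptions, not our actual Terraform: the function names, the account ID, and the BillingReadOnly role name are hypothetical, and the ARN pattern assumes the AWSReservedSSO_ naming that AWS uses for roles it provisions from permission sets.

```python
import json


def proxy_permission_set_name(team_role_name: str) -> str:
    """Return the proxy permission set name: {Team IAM Role Name}-Proxy."""
    return f"{team_role_name}-Proxy"


def build_trust_policy(account_id: str, team_role_name: str) -> dict:
    """Build a trust policy for the team role (hypothetical helper).

    Assumption: the role AWS provisions for a permission set lives under the
    aws-reserved/sso.amazonaws.com/ path and is named
    AWSReservedSSO_<permission set name>_<generated suffix>. Because the
    suffix is generated by AWS, the policy trusts the account and narrows
    the caller with an ArnLike condition on aws:PrincipalArn.
    """
    proxy_name = proxy_permission_set_name(team_role_name)
    sso_role_pattern = (
        f"arn:aws:iam::{account_id}:role/aws-reserved/sso.amazonaws.com/"
        f"*AWSReservedSSO_{proxy_name}_*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"ArnLike": {"aws:PrincipalArn": sso_role_pattern}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and role name, for illustration only.
    print(json.dumps(build_trust_policy("123456789012", "BillingReadOnly"), indent=2))
```

Matching on a pattern rather than a fixed role ARN is one way to avoid hard-coding the generated suffix; pinning the exact ARN after the role exists would be stricter.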
To access AWS resources, an AWS user would log into the proxy SSO permission set first. They would then assume the correct team IAM role before performing any operations.
AWS doesn’t have the functionality to automate the last assume role step in the login workflow described above. Asking Plaid engineers to manually perform the assume role operation would have resulted in user friction and dissatisfaction. To complete the entire login workflow automatically, we employed the following strategy.
CLI
The Plaid Infra team offers an internal CLI utility named megabin that allows engineers to perform common infra tasks with ease, such as bootstrapping a new backend service or accessing an RDS instance for troubleshooting. Plaid developers were already using megabin to create AWS CLI sessions with the previous solution. We extended it to allow engineers to set up their local AWS CLI environment using the proxy roles. When users set up their AWS CLI environment via megabin, the utility performs the following tasks:
Make sure that AWS Vault is installed and configured for storing and accessing CLI auth tokens in the key chain securely.
Initialize AWS credential and configuration files.
Execute the AWS CLI command to walk the user through the process to set up the AWS configuration file.
Once this is done, adjust the configuration file to assume the correct team IAM role if required.
Users only need to complete this workflow once for a given role. When this is done, AWS CLI & SDKs can automatically renew expired sessions by launching a browser renewal workflow.
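The final configuration adjustment in the workflow above could look roughly like the following Python sketch. It assumes the standard AWS CLI config file layout, where a profile that sets role_arn and source_profile assumes a role using credentials from another profile. The function name, profile names, and role ARN are hypothetical, and this is not megabin’s actual implementation.

```python
import configparser
import io


def add_team_role_profile(
    config_text: str, sso_profile: str, team_profile: str, team_role_arn: str
) -> str:
    """Return config text with a profile that assumes the team role.

    The new profile uses the existing SSO (proxy) profile as its credential
    source via the AWS CLI's role_arn / source_profile settings.
    """
    # Disable interpolation so '%' characters in values are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    source_section = f"profile {sso_profile}"
    if not parser.has_section(source_section):
        raise ValueError(f"SSO profile {sso_profile!r} is not configured yet")

    target_section = f"profile {team_profile}"
    if not parser.has_section(target_section):
        parser.add_section(target_section)
    parser[target_section]["role_arn"] = team_role_arn
    parser[target_section]["source_profile"] = sso_profile

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    existing = (
        "[profile proxy]\n"
        "sso_start_url = https://example.awsapps.com/start\n"
        "sso_region = us-west-2\n"
        "sso_account_id = 123456789012\n"
        "sso_role_name = TeamRole-Proxy\n"
    )
    print(add_team_role_profile(
        existing, "proxy", "team", "arn:aws:iam::123456789012:role/TeamRole"
    ))
```

With a profile like this in place, commands run against the team profile would pick up the proxy session first and then assume the team role, so the user does not perform the assume role step by hand.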
Web Console
To assume the correct team IAM role when using AWS’s web console, we created an internal Google Chrome extension. The extension is internally published on the Google Chrome marketplace and is installed on all Plaid-owned user machines by default. The extension gets activated for AWS web console URLs. It extracts the account ID, role name, and user name from the page using screen scraping techniques. It then checks if the user is logged in as a proxy SSO permission set. If yes, then it assumes the correct team IAM role. These steps are completed transparently without needing any input from the user.
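The extension’s decision logic can be sketched as follows. The real extension runs as browser JavaScript; this Python version only illustrates the mapping from a scraped role name to a team role, and it is not the extension’s actual code. It assumes the SSO-provisioned role appears as AWSReservedSSO_<permission set name>_<generated suffix>, and it uses AWS’s documented switch-role URL as a stand-in for however the extension performs the switch. The function names and sample values are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Assumed shape of an SSO-provisioned role name:
# AWSReservedSSO_<permission set name>_<hexadecimal suffix>
_SSO_ROLE = re.compile(r"^AWSReservedSSO_(?P<permission_set>.+)_[0-9a-fA-F]+$")
_PROXY_SUFFIX = "-Proxy"


def team_role_for(console_role_name: str) -> Optional[str]:
    """Return the team role behind a proxy permission set, else None."""
    match = _SSO_ROLE.match(console_role_name)
    if match is None:
        return None  # not an SSO-provisioned role
    permission_set = match.group("permission_set")
    if not permission_set.endswith(_PROXY_SUFFIX):
        return None  # a regular permission set; nothing to assume
    return permission_set[: -len(_PROXY_SUFFIX)]


def switch_role_url(account_id: str, console_role_name: str) -> Optional[str]:
    """Build a console switch-role URL for the mapped team role, if any."""
    team_role = team_role_for(console_role_name)
    if team_role is None:
        return None
    query = urlencode(
        {"account": account_id, "roleName": team_role, "displayName": team_role}
    )
    return f"https://signin.aws.amazon.com/switchrole?{query}"


if __name__ == "__main__":
    # Placeholder account ID and role name, for illustration only.
    print(switch_role_url(
        "123456789012", "AWSReservedSSO_BillingReadOnly-Proxy_0123456789abcdef"
    ))
```

Returning None for anything that is not a proxy permission set keeps the logic inert for users who are already logged into a regular role.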
We have published internal documentation so that users can request IAM changes by submitting PRs for user access.
After migration, for new SSO roles we prefer defining user access policies in the SSO permission sets themselves instead of creating IAM roles. Once all the constraints have been remediated via AWS SSO service updates, we will migrate all access policies to AWS SSO permission sets.
Lessons Learned
Integrate with existing DevOps tools
Because we have integrated the AWS authentication workflow into megabin, we can deliver a rich developer experience. For example, to perform certain operations in our Kubernetes environment, the user needs to authenticate with AWS first. As the user initiates this activity via megabin, megabin can create an AWS SSO session if required as part of that activity implicitly.
Add troubleshooting and support scenarios in your automation
Our new AWS access model is substantially different from the last model. When we first rolled it out, we received many user questions that had straightforward troubleshooting and remediation steps. We later extended our AWS SSO tooling to take care of most of these scenarios. For example, we added an option in our Chrome extension that allows users to define custom mappings from proxy SSO permission sets to team IAM roles to handle corner cases. We added a reset option in megabin to allow users to start from scratch. All megabin CLI scenarios include robust self-help at run time, which checks that all requirements are met and suggests solutions for common configuration challenges.
Have backup options
The way our Chrome extension extracts user details and assumes the correct team IAM role is not officially supported by AWS. As a backup, we published detailed documentation for users so they can follow the official steps manually if required. Even though our extension simplifies the way Plaid engineers access the AWS management console daily, AWS web console changes can have unintended consequences. We have had to go into firefighting mode a couple of times because AWS pushed out web console updates that changed the underlying DOM. These documents were useful for our users when our extension was down. We also added a feature to the Chrome extension that colors the assumed roles, so users can quickly tell whether or not they were successfully escalated. It also helps them use the recently assumed role list in AWS’s console to pick the correct role quickly if they’ve done it at least once via the extension.
In our experience, building custom tools on top of the AWS CLI and management portal has been largely beneficial due to increased developer velocity and better security. You can consider this approach if you use AWS SSO in your environment and want to build custom user authentication scenarios.
We are hiring for several security roles.