May 11, 2021
A New Architecture for Plaid Link: Server-Driven UI with Directed Graphs
The original design of the Plaid Link SDK was straightforward: the SDK contained all the user interface (UI) and business logic for the user experience and called a handful of internal backend endpoints to fetch localized data, authenticate users, and ultimately return a public token for the customer application to use with the Plaid API. As the number of features and fixes we deployed increased, this simple design became an impediment to velocity (code had to be changed on three different platforms), quality (client-server logic complexity was hard to test) and ease-of-integration (customers had to rebuild and resubmit mobile apps for each improvement).
While there are a number of ways¹ to mitigate these problems, we chose to rebuild the foundation behind the Plaid Link SDK by moving all logic to the backend so it could be shared across Web, iOS, and Android. More importantly, we moved all user experience logic into a no-code directed graph data model and transformed the client-side SDK into a focused rendering engine. The particular mechanism by which we married directed graphs with dynamic services is novel and unlocks myriad advantages around backwards compatibility, quality, testability, velocity, reuse, and experimentation.
Dubbed “Flexible Link”, the re-architecture set out to not only solve the existing scaling problems but to develop a product platform that would deliver years of diverse new product experiences for our customers. We’re going to dive into the problem, how we defined success and how we built Flexible Link (or “Flex Link” as we tend to call it 💪). Follow along and let us know what you think!
The overarching goal of this project is to provide a better product experience for our customers while increasing velocity for our internal teams.
Specifically, we needed to make things:
Easier for customers — requiring fewer SDK updates to get the latest features.
Better for end users — providing best-in-class native experiences per platform with fast fixes and improvements.
Faster for Plaid — providing a product platform to scale the company and its suite of products.
Safer & More Reliable — deploying changes with higher confidence, easier testing, and better support for backwards compatibility.
Oh, and we didn’t want to adjust the surface area of our SDK at all, so that we could transparently shift to the new architecture without requiring any changes from our customers.
These goals directly addressed the problems exposed by a complex codebase and the difficulties faced when deploying native experiences across many platforms. Fortunately, a singular guiding principle allowed the team to define these wide-ranging goals while sharing the same underlying solution.
When designing Flex Link, our guiding principle was to reduce complexity, in particular, by leveraging strong boundaries. In short, the idea of a strong boundary is that any piece of code — be it a single function, a mobile app view model, or an entire service interface — should be designed to never leak assumptions, expose implementation details, or require extraneous knowledge of anything else in the system. In other words, avoid coupling anything to anything at all costs.
Often discussed when designing services, modules, or component interfaces, the concept of a strong boundary more broadly blends paradigms of encapsulation and single-responsibility, borrows concepts from functional programming, often utilizes dependency injection, and leverages value types for communication. For a great deep dive into these ideas, I recommend Gary Bernhardt’s talk on boundaries.
Enforcing strong boundaries in a codebase directly combats the tendency of code to succumb to the entropy of coupling and low cohesion. It also keeps engineering velocity high because teams can leverage technical debt when needed (because that debt is quarantined), better utilize shared modules, and generally execute independently from one another.
This principle was top of mind because the team had seen the coupling of session state between the backend and the SDKs in the original codebase. Every new feature addition had increased this coupling, leading to decreased velocity and increased difficulties trying to ensure correctness, especially when maintaining backwards compatibility of old SDKs still in the wild. Coupling is a form of complexity, and complexity is what slows teams down.
Additionally, the team broadened the idea of strong boundaries to apply to separating experiential logic from pure application or business logic.
For example, think of displaying a quick survey when the user exits a session. The experiential logic controls when to show the survey, what kind of UI is displayed, in which order the questions are asked, where else the user can go from the survey, and what to show the user after the survey is done, while the business and application logic handles fetching the survey data model, logging responses, and validating correctness.
Often, experiential logic gets spread throughout a codebase across multiple controllers, reducers, or custom application routing layers. Even if each data model is nicely bound to an isolated user interface (think React components), the code backing the overall cohesive experience is typically spread out. When living in a controller layer (in the MVC paradigm), that experiential logic is coupled to both the model and the UI layer (as controllers depend on both). This makes it especially challenging to test and scale experiential logic over time to account for different platforms and experiences. This pain is more acute with Link, because we serve different experiences based on the product configured by our customer through a single SDK.
As you’ll see below, we solved this issue with the strongest boundary possible: all experiential logic is defined with static data. No code. Doing this allows us to fundamentally separate modular business and application logic from the user experience logic and derive myriad unique advantages we’ll cover in a moment.
Representing User Flows
To accomplish these goals, it was clear that we had to move the experiential logic to the backend so that it could be shared across all platforms. Given our principle of using strong boundaries, we knew not to couple assumptions about any particular user experience into the API itself. However, moving the logic to the backend solves one problem (duplicated logic) but simply moves the other problem to a new location: how do you represent a user flow in the backend? What happens when we want to represent multiple user flows but share common business logic?
Fundamentally, user flows are best represented by directed graphs where each node is an application “view”. A user starts on one node and travels to other nodes along edges that represent user actions.
Here’s a simple diagram representing a small portion of a generic social media app, represented as a user experience graph:
In this simple sample graph, there are both sequential experiences (like account creation) and more flexible core experiences (social feed, post, profile). Take a moment to think how you would represent this in your codebase. What happens as the graph complexity grows? How would you ship experiments to test tweaks in the account creation flow? What happens if you need a runtime decision on whether or not to show “find friends” or some other view after account creation? Can you easily write an end-to-end test that validates the flow you want is actually happening in production across all intended variations?
In practice, mapping what is fundamentally a graph to imperative code flow is problematic. At worst, things turn into spaghetti code with assumptions scattered across various bits of the code base. At best, abstractions and compositions help but can still lead to inflexible or complex code when you want to run experiments, test different variants, or create all new experiences leveraging the same codebase.
With Flex Link, we decided to embrace the graph and place a firm boundary between the experience and the rest of the code. All user flows are represented by a static directed graph data model, using protocol buffers for a cross-platform type-safe data model. It is stored and used by the backend to generate the rendering data that SDKs use to display UI.
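As an illustration only, here is a minimal Go sketch of what "user flow as static graph data" can look like. The type names (`Graph`, `Edge`), the `Next` and `Walk` helpers, and the node IDs borrowed from the social media example are all hypothetical. The real Workflow model is defined in protocol buffers and carries far more information.

```go
package main

import "fmt"

// Edge is a directed connection followed when the user performs Action.
type Edge struct {
	From, Action, To string
}

// Graph is plain data: a start node plus a list of edges.
type Graph struct {
	Start string
	Edges []Edge
}

// Next returns the node reached from `from` via `action`, or false if no such edge exists.
func (g Graph) Next(from, action string) (string, bool) {
	for _, e := range g.Edges {
		if e.From == from && e.Action == action {
			return e.To, true
		}
	}
	return "", false
}

// Walk applies a sequence of user actions from the start node and returns the visited path.
func (g Graph) Walk(actions []string) ([]string, error) {
	path := []string{g.Start}
	cur := g.Start
	for _, a := range actions {
		next, ok := g.Next(cur, a)
		if !ok {
			return path, fmt.Errorf("no edge from %q for action %q", cur, a)
		}
		path = append(path, next)
		cur = next
	}
	return path, nil
}

// socialGraph is a hypothetical slice of the social media example.
func socialGraph() Graph {
	return Graph{
		Start: "welcome",
		Edges: []Edge{
			{"welcome", "sign_up", "create_account"},
			{"create_account", "submit", "find_friends"},
			{"find_friends", "skip", "feed"},
			{"feed", "compose", "post"},
			{"post", "publish", "feed"},
		},
	}
}

func main() {
	path, err := socialGraph().Walk([]string{"sign_up", "submit", "skip"})
	fmt.Println(path, err)
}
```

Because the flow is data, changing the experience in this sketch means editing the edge list rather than the traversal code.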
There are many benefits to this approach:
First-Class Semantic Versioning — Each graph has a semantic version which allows us to ship variants, experiments, and entirely new flows easily by picking a different graph to use on the backend. Our product experiences can continue to evolve with new versions while we maintain support for legacy SDK versions by pinning those to certain graph versions.
Static Validation & Testing — Unit tests can ensure the validity of any graph and enforce any needed requirements. End-to-end tests can be declared with a simple static array of actions to take, validating that each node renders as expected.
Tooling — Enables no-code changes to the user experience flow or individual UI views.
Observability — Rather than assembling user funnel data from ad-hoc client events, the backend has full visibility into the path through the graph for any user cohort. The granularity of events, latency, and analytics increases dramatically and we can juxtapose this aggregated information over the graph itself to visualize hotspots in the experience or find areas to help users convert better for our customers.
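To give a feel for the static validation benefit, the sketch below checks a graph without running any session. The three rules it enforces (edges only reference declared nodes, one edge per node and action, every node reachable from the start) are assumptions chosen for illustration, as are the names `Validate` and `demoGraph`. They are not the actual rules of the Workflow Service.

```go
package main

import "fmt"

type Edge struct{ From, Action, To string }

type Graph struct {
	Start string
	Nodes []string
	Edges []Edge
}

// Validate returns every structural problem it finds in g.
func Validate(g Graph) []error {
	var errs []error
	known := map[string]bool{}
	for _, n := range g.Nodes {
		known[n] = true
	}
	if !known[g.Start] {
		errs = append(errs, fmt.Errorf("start node %q is not declared", g.Start))
	}
	seen := map[[2]string]bool{} // (node, action) pairs already used
	adj := map[string][]string{} // adjacency list for the reachability check
	for _, e := range g.Edges {
		if !known[e.From] || !known[e.To] {
			errs = append(errs, fmt.Errorf("edge %s -[%s]-> %s references an undeclared node", e.From, e.Action, e.To))
			continue
		}
		key := [2]string{e.From, e.Action}
		if seen[key] {
			errs = append(errs, fmt.Errorf("node %q has two edges for action %q", e.From, e.Action))
		}
		seen[key] = true
		adj[e.From] = append(adj[e.From], e.To)
	}
	// Breadth-first search from the start node.
	reached := map[string]bool{g.Start: true}
	queue := []string{g.Start}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		for _, m := range adj[n] {
			if !reached[m] {
				reached[m] = true
				queue = append(queue, m)
			}
		}
	}
	for _, n := range g.Nodes {
		if !reached[n] {
			errs = append(errs, fmt.Errorf("node %q is unreachable from start", n))
		}
	}
	return errs
}

// demoGraph is a small, valid, hypothetical flow.
func demoGraph() Graph {
	return Graph{
		Start: "consent",
		Nodes: []string{"consent", "search", "success"},
		Edges: []Edge{
			{"consent", "continue", "search"},
			{"search", "select", "success"},
		},
	}
}

func main() {
	g := demoGraph()
	g.Nodes = append(g.Nodes, "orphan") // introduce a deliberate mistake
	for _, err := range Validate(g) {
		fmt.Println(err)
	}
}
```

A check like this can run in a unit test on every graph version, which is the kind of guarantee that is hard to get when the flow lives in imperative code.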
We call the directed graph data model a Workflow. It’s more complex than the illustrative graph above, and we’ll cover those details in a moment.
The Workflow Service is what handles API requests from the SDKs. Its primary responsibility is to traverse and evaluate the nodes in a workflow and send the resulting UI data back to the SDK. The SDK subsequently returns data representing user actions, the workflow service continues to traverse the graph based on the data in those user actions, and the loop continues until the graph is exited (see below for a schematic of this).
The nodes in the workflow are never actually sent down to our client SDKs; rather, the Workflow Service uses the graph to generate UI data for the SDK. Thus, the SDKs only have to understand the rendering properties and have no actual concept of the workflow graph or its data model.
This architecture allows the API to be fairly straightforward, with only two required API calls: start and next. On start, the SDK sends up any client-side configuration and the service decides which graph to choose. next is the endpoint called repeatedly with the last user action until the session is done. There are, of course, performance adjustments to prevent network calls on each action, but that nuance will be saved for a later discussion.
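The two-call shape can be sketched as a tiny in-memory service. Everything here is an assumption made for illustration: the `Service` and `Rendering` types, the session ID format, and the `exit` sentinel node. It omits configuration, graph selection, and state, and only shows that the client sees renderings while the graph stays on the server.

```go
package main

import (
	"errors"
	"fmt"
)

const exitNode = "exit" // hypothetical sentinel marking the end of a flow

// Rendering is what the SDK receives: a pane to draw, never the graph itself.
type Rendering struct {
	Pane string
	Done bool
}

// Service holds the graph and the current node of each session.
type Service struct {
	start    string
	edges    map[string]map[string]string // node -> action -> next node
	sessions map[string]string            // session ID -> current node
	nextID   int
}

func NewService(start string, edges map[string]map[string]string) *Service {
	return &Service{start: start, edges: edges, sessions: map[string]string{}}
}

// Start opens a session and returns the first pane to render.
func (s *Service) Start() (string, Rendering) {
	s.nextID++
	id := fmt.Sprintf("session-%d", s.nextID)
	s.sessions[id] = s.start
	return id, Rendering{Pane: s.start}
}

// Next advances the session along the edge matching the user's action.
func (s *Service) Next(id, action string) (Rendering, error) {
	cur, ok := s.sessions[id]
	if !ok {
		return Rendering{}, errors.New("unknown or finished session")
	}
	next, ok := s.edges[cur][action]
	if !ok {
		return Rendering{}, fmt.Errorf("pane %q does not support action %q", cur, action)
	}
	if next == exitNode {
		delete(s.sessions, id)
		return Rendering{Done: true}, nil
	}
	s.sessions[id] = next
	return Rendering{Pane: next}, nil
}

// demoService wires up a two-pane hypothetical flow.
func demoService() *Service {
	return NewService("consent", map[string]map[string]string{
		"consent": {"continue": "search"},
		"search":  {"select": exitNode},
	})
}

func main() {
	svc := demoService()
	id, r := svc.Start()
	fmt.Println(r.Pane)
	for _, action := range []string{"continue", "select"} {
		r, _ = svc.Next(id, action)
		fmt.Println(r.Pane, r.Done)
	}
}
```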
The Workflow Service
There is one key piece of information we have left out. The workflow graph cannot be only static data (if only it could be that simple). Using a fully static graph leaves two key unanswered questions:
How do you interact with anything dynamic? To do anything interesting at all, the workflow graph would need to communicate with external dynamic services (financial institution APIs, databases, event streams, etc) and handle various other kinds of experience customization (feature flags, translations, experiments, customer customizations, etc).
Once support is added for that dynamism (i.e. data fetched from a database), how do you store and access that data within the graph?
Tackling these questions required the development of a novel hybrid approach to leverage the benefits of a statically-defined graph with the needs to interact with dynamic services and data.
First, Workflow graphs allow certain nodes to act as dynamic data transforms backed by functions in code.
Second, the Workflow Service maintains a per-session state store and provides a mechanism by which nodes can declare required input and output state. This declared input state is automatically injected into the data transform functions mentioned above, and the output of those functions is automatically stored for downstream nodes to use as input.
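One way to picture the injection mechanism is a state store keyed by Go type, with reflection used to match a function's parameter to stored state. This is a simplified sketch under stated assumptions: the `State` type, its `Put` and `Invoke` methods, the `Get` helper, and the `func(In) (Out, error)` shape are all hypothetical, and real nodes declare their state in the graph rather than through a Go signature alone.

```go
package main

import (
	"fmt"
	"reflect"
)

// State is a per-session store keyed by the Go type of each stored value.
type State map[reflect.Type]reflect.Value

// Put stores v under its own type.
func (s State) Put(v any) { s[reflect.TypeOf(v)] = reflect.ValueOf(v) }

// Get returns the stored value of type T, if present.
func Get[T any](s State) (T, bool) {
	var zero T
	v, ok := s[reflect.TypeOf(zero)]
	if !ok {
		return zero, false
	}
	return v.Interface().(T), true
}

// Invoke calls fn, which must have the shape func(In) (Out, error).
// In is looked up in the state by type; Out is stored for downstream nodes.
func (s State) Invoke(fn any) error {
	fv := reflect.ValueOf(fn)
	ft := fv.Type()
	errType := reflect.TypeOf((*error)(nil)).Elem()
	if ft.Kind() != reflect.Func || ft.NumIn() != 1 || ft.NumOut() != 2 || ft.Out(1) != errType {
		return fmt.Errorf("want func(In) (Out, error), got %v", ft)
	}
	in, ok := s[ft.In(0)]
	if !ok {
		return fmt.Errorf("required input state %v is missing", ft.In(0))
	}
	out := fv.Call([]reflect.Value{in})
	if err, _ := out[1].Interface().(error); err != nil {
		return err
	}
	s[ft.Out(0)] = out[0]
	return nil
}

// Locale and Greeting are hypothetical pieces of session state.
type Locale string
type Greeting struct{ Text string }

// greet is a pure transform from one piece of state to another.
func greet(l Locale) (Greeting, error) {
	if l == "fr" {
		return Greeting{Text: "Bonjour"}, nil
	}
	return Greeting{Text: "Hello"}, nil
}

func main() {
	s := State{}
	s.Put(Locale("fr"))
	if err := s.Invoke(greet); err != nil {
		fmt.Println("error:", err)
		return
	}
	g, _ := Get[Greeting](s)
	fmt.Println(g.Text)
}
```

The transform itself never touches the store, which is what keeps it easy to unit test in isolation.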
Let’s go through the details in depth to understand how it all fits together. Within a workflow, there are three kinds of nodes:
A pane is defined as one discrete step within a user experience (think “the institution search pane”, or “the success confirmation pane”). As such, these nodes represent a single pane within a workflow.
They may have any number of incoming edges.
They have one outgoing edge for each user action supported by the pane (e.g. “submit”, or “cancel”, or “select an item in a list”).
The system supports a pre-defined number of pane nodes that all SDKs know how to render. These form the building blocks of the user experience. New panes can be added as needed over time, but require an SDK update to render them with native platform UI.
Each pane instance in the graph supports declaring sets of required input state, each of which maps directly to a single pure function in Go, named Render(…), which receives the requested input state. The Render(…) method output is called a Pane Rendering, which is sent along to the client. The Pane Rendering contains all the data the SDK needs to render a pane (e.g. Bank Name, Logo, Localized Strings, Button titles, etc). Think of it as React component props.
The output of the user action is stored in the session state. The semantic session state object (e.g. an institution ID) is derived from the raw user output (e.g. a string) by an automatically-invoked pure function named
When you see a pane node in the graph, it always maps 1:1 with a pane the user will see. No other node results in UI for a user.
Although the entire workflow is represented as static data, we couldn’t actually do anything useful without interacting with external systems and code. This is precisely why the processor node exists — to allow a static graph to interact with dynamic code.
However, the minute you have code executing, it’s tempting to put experiential branching logic into that code. To combat this tendency and preserve our strong boundary between experience and code, processors can make no decisions of where to go in the graph. They are single data in/data out functions. In essence, they represent the higher-order function
They may only have one incoming edge and one outgoing edge.
Each one maps directly to a single function in Go, named Process(…), which gets any requested input state injected automatically.
The output of the processor node is always stored in the session state.
New processors can be added at any time with no impact on the SDKs.
Because they only have one incoming edge and one outgoing edge they cannot influence the path through the workflow in any way. Instead, they act as simple “transforms” of session state. Keeping processor nodes as “pure functional transforms”:
Prevents graph-walking logic from creeping into imperative code
Allows processors to be re-used across different workflows
Allows the code to remain modular and unit testable
The final node type is called a switch. Switch nodes are the mechanism by which the workflow graph can branch based on session state. They behave like traditional programming language switch statements: test a value, and pick the outgoing edge based on the cases provided.
There is no code backing a switch node; it is statically defined within the workflow graph itself. An example would be checking session state to see if the session was in the EU and picking the appropriate consent pane to render (as they differ between the US and EU).
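A switch node can be pictured as nothing more than a lookup table plus one generic evaluator. The sketch below is hypothetical: the `SwitchNode` fields, the string-keyed state map, and the `region` key are assumptions, and the real node is a protocol buffer message evaluated by the Workflow Service.

```go
package main

import "fmt"

// SwitchNode is pure data: which state key to test and which node each value selects.
type SwitchNode struct {
	StateKey string
	Cases    map[string]string // state value -> next node ID
	Default  string            // used when no case matches; empty means "no default"
}

// Evaluate picks the outgoing edge for the given session state.
// It is generic: no switch node needs its own code.
func (n SwitchNode) Evaluate(state map[string]string) (string, error) {
	v, ok := state[n.StateKey]
	if !ok {
		return "", fmt.Errorf("state %q is not set", n.StateKey)
	}
	if next, ok := n.Cases[v]; ok {
		return next, nil
	}
	if n.Default != "" {
		return n.Default, nil
	}
	return "", fmt.Errorf("no case for %s=%q and no default", n.StateKey, v)
}

// consentSwitch mirrors the EU vs. US consent example with made-up node IDs.
func consentSwitch() SwitchNode {
	return SwitchNode{
		StateKey: "region",
		Cases:    map[string]string{"EU": "consent_eu"},
		Default:  "consent_us",
	}
}

func main() {
	next, err := consentSwitch().Evaluate(map[string]string{"region": "EU"})
	fmt.Println(next, err)
}
```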
Workflow versions are canonically defined as experience.variant.major.minor.patch. Supporting versioning provides the ability to deploy experiences with the same level of rigor as we are used to with services: percentage rollouts, rollbacks, experiments, and versioned metrics with alerting.
experience: corresponds to the specific user journey or product experience. This changes infrequently, only with new end-to-end experiences.
variant: represents a particular experiment or variation in A/B testing. The mainline experience should have “default” as its variant value.
major: follows the semantic versioning rules and is incremented for breaking and incompatible changes which require a client SDK upgrade. e.g. Adding a new kind of pane.
minor: represents any other compatible change in the workflow, e.g. changing edges, node configuration, adding known node types, and anything else that doesn’t require a client SDK update.
patch: represents minor compatible bug fixes to a workflow, e.g. tweaking text, fixing a switch case value, etc.
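The five-part version lends itself to a small parser and a selection rule. The sketch below assumes a dot-separated string encoding such as `link.default.2.1.0` and assumes that an SDK can render any workflow whose major version does not exceed the highest major it supports. Both are illustrative assumptions, as are the names `ParseVersion` and `Newest`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version mirrors experience.variant.major.minor.patch.
type Version struct {
	Experience, Variant string
	Major, Minor, Patch int
}

// ParseVersion splits a dot-separated version string into its five fields.
func ParseVersion(s string) (Version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 5 {
		return Version{}, fmt.Errorf("want 5 dot-separated fields, got %d in %q", len(parts), s)
	}
	nums := make([]int, 3)
	for i, p := range parts[2:] {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid numeric field %q in %q", p, s)
		}
		nums[i] = n
	}
	return Version{Experience: parts[0], Variant: parts[1], Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

// less orders versions by major, then minor, then patch.
func less(a, b Version) bool {
	if a.Major != b.Major {
		return a.Major < b.Major
	}
	if a.Minor != b.Minor {
		return a.Minor < b.Minor
	}
	return a.Patch < b.Patch
}

// Newest returns the highest version of an experience and variant whose
// major version does not exceed what the SDK supports.
func Newest(all []Version, experience, variant string, sdkMajor int) (Version, bool) {
	var best Version
	found := false
	for _, v := range all {
		if v.Experience != experience || v.Variant != variant || v.Major > sdkMajor {
			continue
		}
		if !found || less(best, v) {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	var all []Version
	for _, s := range []string{"link.default.1.4.2", "link.default.2.0.0", "link.default.2.1.0"} {
		v, err := ParseVersion(s)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		all = append(all, v)
	}
	v, ok := Newest(all, "link", "default", 1) // an older SDK that only knows major version 1
	fmt.Println(v, ok)
}
```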
All data used for node input and output during a workflow session is stored in session state. The kinds of data stored fit into a few classes:
pre-populated at the start of every session
e.g. locale, device, and SDK metadata
populated at the start of every request
e.g. request id, IP, user agent, etc
populated with the output of pane and processor nodes
e.g. institution data, user selections, domain-specific state
As mentioned above, panes and processors declaratively define what session state they require as input, and the workflow service automatically injects the needed session state into their Render(…) or Process(…) method. Session state is also what Switch nodes are allowed to check. Static validation of the workflow graph can determine if the declared state for a node is even available as input at any point in the graph. This process is pretty magical (well, actually it’s all reflection which is neither magical nor pretty).
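The availability check can be approximated with a small dataflow pass over the graph. In this sketch a state key counts as available at a node only if every path from the start provides it, which is a stricter rule than the post states and is an assumption here. The `Node` type, the string state keys, and `CheckState` are hypothetical, and reflection is left out so the idea stays visible.

```go
package main

import (
	"fmt"
	"sort"
)

// Node declares the state it needs, the state it writes, and its successors.
type Node struct {
	ID       string
	Requires []string
	Provides []string
	Next     []string
}

type set map[string]bool

// CheckState reports every required state key that is not guaranteed to be
// available when its node is reached. `initial` is state present at session start.
func CheckState(start string, initial []string, nodes []Node) []string {
	byID := map[string]Node{}
	for _, n := range nodes {
		byID[n.ID] = n
	}
	avail := map[string]set{start: {}} // state guaranteed on entry to each reached node
	for _, k := range initial {
		avail[start][k] = true
	}
	work := []string{start}
	for len(work) > 0 {
		id := work[0]
		work = work[1:]
		out := set{} // state guaranteed after this node runs
		for k := range avail[id] {
			out[k] = true
		}
		for _, k := range byID[id].Provides {
			out[k] = true
		}
		for _, m := range byID[id].Next {
			in, seen := avail[m]
			if !seen {
				cp := set{}
				for k := range out {
					cp[k] = true
				}
				avail[m] = cp
				work = append(work, m)
				continue
			}
			// Keep only keys guaranteed on every incoming path.
			changed := false
			for k := range in {
				if !out[k] {
					delete(in, k)
					changed = true
				}
			}
			if changed {
				work = append(work, m)
			}
		}
	}
	var problems []string
	for _, n := range nodes {
		in, reachable := avail[n.ID]
		if !reachable {
			continue
		}
		for _, k := range n.Requires {
			if !in[k] {
				problems = append(problems, fmt.Sprintf("node %q requires %q, which is not guaranteed upstream", n.ID, k))
			}
		}
	}
	sort.Strings(problems)
	return problems
}

// demoNodes is a hypothetical four-node flow with consistent state declarations.
func demoNodes() []Node {
	return []Node{
		{ID: "consent", Next: []string{"search"}},
		{ID: "search", Provides: []string{"institution_id"}, Next: []string{"load"}},
		{ID: "load", Requires: []string{"institution_id"}, Provides: []string{"institution"}, Next: []string{"credentials"}},
		{ID: "credentials", Requires: []string{"institution"}},
	}
}

func main() {
	nodes := demoNodes()
	nodes[0].Next = append(nodes[0].Next, "load") // add a shortcut that skips the search pane
	for _, p := range CheckState("consent", []string{"locale"}, nodes) {
		fmt.Println(p)
	}
}
```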
The workflow data model ends up as a fairly straightforward typed list of nodes and edges.
ProcessorNode types contain some basic shared metadata about each kind of node, along with a oneof enumerating which specific pane or processor type is used for that node (examples of which are covered in the next section).
A Detailed Example
Let’s look at an example of how processors and panes interact with session state to help make the section above a bit more concrete. If you’ve had enough detail for one post, feel free to skip this section.
We’ll start with one of our actual processors: the InstitutionLoadProcessor. Its responsibility is to take an institution identifier and load the associated data for it into session state. This happens after a user selects a financial institution in the UI.
This is the proto definition (annotated for clarity):
You can see how this processor simply maps input state (the institution ID) to some output state (the full institution data).
When the workflow service walks the graph and encounters this processor, it gathers the required input state and calls the following method to perform that mapping:
That’s it. A processor which maps state and is easy to unit test. Static validation ensures that (1) our graph contains the required input data for this processor somewhere upstream and (2) the code actually implements a Process function with the correct signature.
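For readers who want something concrete to hold onto, a processor of this shape might look like the sketch below. It is not Plaid's code: the `InstitutionID` and `Institution` types, the `InstitutionStore` interface, the exact `Process` signature, and the in-memory store are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// InstitutionID is the input state: what the user selected.
type InstitutionID string

// Institution is the output state: the full record for downstream nodes.
type Institution struct {
	ID   InstitutionID
	Name string
}

// InstitutionStore is a hypothetical dependency that knows how to fetch institutions.
type InstitutionStore interface {
	Lookup(id InstitutionID) (Institution, error)
}

// InstitutionLoadProcessor maps an institution ID to its full data.
// It makes no decision about where the session goes next.
type InstitutionLoadProcessor struct {
	Store InstitutionStore
}

// Process is data in, data out: the caller supplies the input state and
// stores whatever is returned.
func (p InstitutionLoadProcessor) Process(id InstitutionID) (Institution, error) {
	if id == "" {
		return Institution{}, errors.New("institution ID is empty")
	}
	inst, err := p.Store.Lookup(id)
	if err != nil {
		return Institution{}, fmt.Errorf("loading institution %q: %w", id, err)
	}
	return inst, nil
}

// memoryStore is an in-memory stand-in that makes the processor trivial to unit test.
type memoryStore map[InstitutionID]Institution

func (m memoryStore) Lookup(id InstitutionID) (Institution, error) {
	inst, ok := m[id]
	if !ok {
		return Institution{}, errors.New("not found")
	}
	return inst, nil
}

func main() {
	p := InstitutionLoadProcessor{Store: memoryStore{
		"ins_1": {ID: "ins_1", Name: "Example Bank"},
	}}
	inst, err := p.Process("ins_1")
	fmt.Println(inst.Name, err)
}
```

Injecting the store through an interface is a design choice of this sketch. It keeps the transform free of graph knowledge and lets a test swap in fixed data.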
Processors, having only one set of input and output state, are fairly straightforward. However, because Pane nodes are generic UI components, they may have multiple input or output states to declare.
One such pane is our generic “Search and Select” pane — it contains a search bar, a list of results below it, and a button. In our transactions product, we use this pane to search across financial institutions. In our deposit switch product, we use this pane to search across employers or to search and auto-complete mailing addresses.
To accomplish rendering this pane, the Configuration proto in our graph data model contains more information than in the processor example above. Specifically, the pane needs to identify which input state it should use to render and which output state should be used as a result of user action.
The proto definition looks like this (some details omitted or annotated for clarity):
While this proto looks complex, the actual data stored in our graph is simple — it has the IDs of outbound edges, and specifies the input and output states to use.
When the service encounters this node, it looks at the configuration to understand which input state to collect, and which Render(…) method to call. In this case:
And finally, when the user taps an item in this pane, the backend receives the raw SubmitAction output and is able to call:
Putting it all together, you can imagine having a SearchAndSelect pane in the graph followed by an InstitutionLoadProcessor where the InstitutionID output of the search pane is used by the processor to load up data for the selected institution. In fact, this is exactly what we do in our production graph today:
There are many more features and details that are well beyond the scope of this article, but hopefully this gives a small sense of how the pieces start to fit together.
Client Rendering Loop
For completeness, the full lifecycle of client-server interaction is shown below.
As hinted above, engineers aren’t manually editing the graph data model. Instead, we built a lightweight workflow editor app to allow drag & drop creation and modification of workflow graphs.
Engineers can load up a graph in the editor, work on a localized area of the graph, export their new version, and have confidence that the change is good thanks to static validation and easy-to-write end-to-end tests (a topic for another day).
While not all Link changes can be accomplished solely by changing the graph (i.e. an engineer may still be writing a new processor or Render method), the code that does get written is more modular, more functional, and decoupled from any experiential logic.
In our most recent Plaiderdays, one team added live visualizations of traffic over the graph and another team leveraged the editor to make a graph demonstrating all variations of our supported panes. As this tooling is improved, we fully expect support and product teammates to be able to make graph changes and deploy them safely.
Behind this Workflow Service design lies a philosophical change in how we view our Link SDK. It is no longer an SDK that contains our various product experiences, but rather is a vehicle through which teams ship their own unique product experiences.
This decoupling of product experiences, combined with the transformation of our SDKs into flexible renderers, allows this new Link platform to support the full range of the product life cycle: some teams are looking for high-velocity iteration to find product-market fit, while others are looking to maintain core mature product experiences.
Most of the benefits of the Workflow Service are covered by the Goals section above. However, there are two areas of improvement worth explicitly mentioning: Observability and Reliability.
The graph gives new visibility into Link sessions, with a level of detail and flexibility not previously possible. The team can view per-node latency and timing with Lightstep tracing, watch key graph metrics with Prometheus, debug customer issues with detailed logging and analytics, and comparatively test the conversion funnel across different graph versions to improve the experience.
Reliability is improved via a single source of truth on the backend (as opposed to deployments having unintended consequences on older SDKs), reduced complexity where teams have less risk adding new experiences, and higher confidence in end-to-end testing because there is no more experiential logic in the SDKs themselves. When there is an issue, explicit team ownership of graphs and processors allows for faster alerting and routing to the proper team.
Flex Link is rolling out now, so we are continuing to learn and iterate. That said, there are a few lessons we wanted to share.
Choose Constraints Carefully
When designing a new system, be thoughtful and diligent about what constraints the system inherently enforces in its design and what constraints can or should be loosened over time. The right set of initial constraints can reduce the scope of a large project and reduce the risk of delays or failure.
In our system, the constraint separating experiential logic from dynamic logic is fixed for the lifetime of the system. We chose this constraint deliberately so that the system can scale with our growth. However, each constraint has consequences. In this case, such a strong boundary between the experience and dynamic code meant we had to provide enough additional tooling to make day-to-day coding easier for engineers.
Another initial constraint was that all experiences had to be built from a fixed number of Panes. This constraint was put in place to meet our goals (reducing SDK updates) but also to reduce the scope of work so that we could actually replace this critical piece of Plaid infrastructure in a reasonable amount of time. However, it is clearly a constraint that must be loosened over time. Even if our panes are generic, there will be new use cases and new UI that teams want to test, and we need a path forward to support those teams.
To meet this need, we defined a set of generic panes that would cover the existing products in Link for initial rollout. In our next phase, we’ll be implementing a mechanism for teams to experiment with any UI they want and a process for graduating those new experimental Panes into the system as a whole.
The Importance of Tooling
Investment in the right tooling compounds over time to exponentially increase engineering impact. In our case, we provide three major pieces of tooling: (1) the workflow editor, mentioned above, (2) code generators to provide easy scaffolding for new processors or new panes, and (3) a CLI “SDK client” for easy testing and debugging (this CLI client is also re-used to power our declarative end-to-end testing framework).
If anything, we underinvested in the tooling and often underestimated the amount of time required to develop it properly. In particular, the workflow editor has many potential quality-of-life improvements we haven’t done yet and many features we’d like to add for ease of use.
The main takeaway here is that tooling can be as important as the system itself and thus should be managed and planned with the same attention to detail. Tooling doesn’t just come for free.
Always Have a Customer
You can’t build a platform in a vacuum. Even a single engineer outside the platform team using it will provide invaluable feedback and challenge assumptions you didn’t even realize you made. In our case, the deposit switch team became our first key customer. Having them engage early and use the platform was a wonderful forcing function for robust prioritization.
Risk and Safety
Not only was Flex Link a large architectural change, it also required many teams to migrate their own specific features within Link — all while in production flows that cannot fail. Adequately covering all the work that goes into deploying such a change is a blog post of its own. However, there are a few notable takeaways when considering risk and safety during such a project:
Testing and validation scaffolding is critical to have in place before asking teams to migrate. A simple-to-use testing system with good examples offloads the responsibility for good coverage to the platform team rather than individual teams. In our case, the testing framework for Flex Link allowed declarative end-to-end tests that reduced the amount of code teams needed to write per test.
“If it’s not tested, it doesn’t work” is a good motto to live by. A migration like this can often present a huge opportunity for net reliability improvement by adding tests to long-standing features that may not have had proper coverage already. During this migration we were able to add a large suite of new automated tests for lesser-used portions of our flow, adding confidence these will not break for customers.
Panic buttons are critical in any system, especially one undergoing such a change. In our case, we had an instance where a client bug caused an infinite loop in API endpoint calls, which was starting to explode logging. Fortunately, we had a per-session shutoff flag which enabled us to automatically invalidate the session and stop the loop while we debugged the issue. We have other mechanisms to turn this off for specific customers or SDK versions to achieve the same safety.
No matter what, there is still inherent risk both in deployment and the overall investment in a large re-architecture like Flexible Link. Kudos to Plaid for being a place where engineering can take big bets like this.
Over the next few months all of our Link traffic will be running on Flex Link — thanks for reading!