
June 06, 2023

ConfigDB: from chaos to confidence with our unified app config stack

Tim Ruffles

TL;DR: we built a unified config system; watch the demo.

Configuring applications is a problem that sneaks up on you. For example, Plaid’s config for connecting to financial institutions started simple. We stored the config as a blob of JSON in Git and deployed it alongside our services. But over time the number of services consuming it grew to the point where deploying them all to propagate changes proved too slow and error-prone. The edit flow became unworkable as the number of people editing it increased, and the config model grew more complex.

The system we’re discussing stores the config for how we display and communicate with financial institutions.

So next we migrated it to a database-backed service. This meant updates propagated at runtime, without deployments, and we could build an edit UI. This approach grew with us from 2017 to date, but over time we observed two pain-points, neither of which ever got to the top of our priority stack. First: a trickle of misconfiguration incidents - and a scary editing experience - because we had nothing like Git's history and atomic reverts. Second: a frustrating development experience when extending config. Despite its origins as a simple CRUD service, it has become intimidatingly complex. To quote two of the engineers who worked on changing the configuration model:

Developing in the service was stressful because the slightest misconfiguration or bad migration could result in downtime, with no easy way to revert. Years ago, a teammate and I spent a few long months rearchitecting the data models. We’re quite pleased to be doing anything else now.

David Fish, Engineer

The code – weathered with years of changing invariants and business requirements – was impossible to understand.

Joanne Lee, Engineer

If this had been our only config system, the situation might not have warranted investment, but we had other systems for other config datasets with similar issues. All considered, we were doing a lot of work across the company to build and maintain multiple systems, and none of them were our desired combination of highly reliable, productive, and safe to use.

So we decided to replace them with a database specifically designed for application configuration: ConfigDB. The ‘DB’ name makes it sound like a huge project; thankfully I’m not here to tell you we built a competitor to Postgres! Instead we composed technologies we already trusted at Plaid - Git, GitHub, protobuf and S3 - into a system that met our application configuration needs across languages.

Watch the demo

This demo walks you through creating a new configuration type, authoring some data, validating it, and then exposing it over a gRPC service. Below we’ll walk through the system - feel free to watch and read in whichever order you prefer.

Constraints & Desired Features

Our system had to meet these constraints to be acceptable as a platform for all our use-cases:

  1. Availability - must guarantee availability of config in the critical path of our apps 

  2. Can handle our read load (our financial institution config sees 12,000 reads per second on the client side)

  3. Runtime propagation of updates within ~1 minute

  4. Handle datasets on the order of ~100MBs

  5. Programmatic edits, for services and to enable UIs for non-engineers

Beyond that, ideally it would have features we felt we’d benefit from, but had been able to operate without:

  1. Full history, with easy reverts

  2. Semantic data validation (e.g. mins, maxes, uniqueness, string patterns)

  3. High productivity - it should be possible to add new configuration types, or add fields to existing types, solely by editing the schema

  4. Typed and structured data model: structs, maps and lists, rather than just scalars.

  5. Review tooling to preview, discuss and approve/reject changes

  6. Configurable rollouts - ability to slowly roll out changes to config values

The table below compares the various options we had in place; none of the existing configuration stacks did a great job against the majority of our requirements. That’s why we decided to build ConfigDB. The big idea was to again make GitHub our config system of record, giving us a full history and a review workflow ‘for free’. But now we’d add systems to provide runtime update propagation and programmatic edits.

| Tool | ConfigDB | MySQL-backed services + caching | LaunchDarkly | Environment Files (baked into images) | YAML etc baked into images | Hard-coded |
|---|---|---|---|---|---|---|
| Availability | 😀 | 😀 | 😀 | 😀 | 😀 | 😀 |
| Runtime updates | | | | | | |
| Propagation speed | 😀 <= 1 minute | 😀 ~1 minute (client-side cached to meet SLA and reduce load) | 😀 ~200ms | 😔 ~30 minutes | 😔 ~30 minutes | 😔 ~30 minutes |
| Dataset size | | | | | | |
| Programmatic writes | | | | | | |
| History and reverts | | | | | | |
| Pre-write validation | 😀 | 😀 | 😐 | 😔 | 😔 | 😔 |
| Productivity | 😀 | 😔 | 😀 | 😐 | 😔 | 😀 |
| Data model | 😀 - protobuf | 😀 - full MySQL | 😔 | 😔 - scalars, CSVs | 😐 | 😔 |
| Approvals | | | | | | |
| % rollouts | | | | | | |

Developer Experience

ConfigDB data is organized into tables with schemas. We define a table’s schema by writing a .proto file:

// config/movies.proto

syntax = "proto3";
package config;
option go_package = "github.plaid.com/plaid/go.git/lib/proto/configpb";

import "configdb/configdb.proto";
import "google/protobuf/duration.proto";

message Movie {
  option (configdb.table) = {
    name: "movies",
    primary_key: "slug",
  };
  string slug = 1;
  string title = 2;
  // data is denormalized in configdb
  repeated Character characters = 3;
  google.protobuf.Duration runtime = 4;
}

message Character {
  string character = 1;
  string actor = 2;
}

Data is authored in YAML, which we parse and map into the protobuf schema (with support for nicer syntax for well-known types like durations and wrappers):

# by default, filename is the primary key, so: movies/big-lebowski.yml
slug: big-lebowski
title: The Big Lebowski

# well-known types like durations and dates have syntax support
runtime: 1h57m

characters:
  - character: The Dude
    actor: Jeff Bridges
  - character: Maude Lebowski
    actor: Julianne Moore

This data is stored in a Git repository hosted on GitHub Enterprise.

We access it via each language’s ConfigDB library. Here’s what that looks like in Go:

package yourgrpcserver

import (
    "context"

    "github.plaid.com/plaid/go.git/lib/configdb"
    "github.plaid.com/plaid/go.git/lib/proto/configpb"
)

func (s yourserver) getMovie(ctx context.Context, id string) (*MovieStats, error) {
    tx := s.cdb.Tx()

    movie, err := configpb.GetMovie(ctx, tx, id)
    if err != nil {
        return nil, err
    }

    stats := MovieStats{}
    stats.CharacterCount = len(movie.GetCharacters())
    return &stats, nil
}

type yourserver struct {
    cdb configdb.DB
}

The read API is transactional. This ensures that when a runtime update to the configuration arrives we don’t read rows from different versions and end up forming responses based on an inconsistent version of the dataset. The read is from memory, so the only error possible is a missing row.
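
For illustration, here’s a minimal sketch of that guarantee, continuing in the same package as the example above: two reads that share one transaction are served from the same snapshot of the dataset, even if an update arrives mid-request.

// Sketch continuing the example above: both reads share one Tx, so they come
// from the same version of the config even if an update lands mid-request.
func (s yourserver) compareRuntimes(ctx context.Context, idA, idB string) (bool, error) {
    tx := s.cdb.Tx()

    a, err := configpb.GetMovie(ctx, tx, idA)
    if err != nil {
        return false, err
    }
    b, err := configpb.GetMovie(ctx, tx, idB)
    if err != nil {
        return false, err
    }

    // Reads are served from memory, so the only possible error above is a
    // missing row.
    return a.GetRuntime().AsDuration() > b.GetRuntime().AsDuration(), nil
}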

You may have noticed that we have specific typed getters for each table. This gives us type-safe and ergonomic access to datasets, supporting things like composite primary keys. This is implemented via code-generation from the protobuf schema. Other supported languages - Python and TypeScript - get away with less code-gen as they’re dynamic, or have more expressive type systems.
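
To give a sense of what that generated code enables - this is an illustrative sketch, not ConfigDB’s actual output, and configdb.Tx, tx.Lookup and the ExchangeRate table are hypothetical names - a table with a composite primary key of (base_currency, quote_currency) might get a typed getter like:

// Hypothetical sketch of generated code for a table keyed on
// (base_currency, quote_currency); the real generator differs.
func GetExchangeRate(ctx context.Context, tx configdb.Tx, baseCurrency, quoteCurrency string) (*ExchangeRate, error) {
    // tx.Lookup stands in for a generic row lookup by table name and key parts.
    msg, err := tx.Lookup(ctx, "exchange_rates", baseCurrency, quoteCurrency)
    if err != nil {
        return nil, err
    }
    rate, ok := msg.(*ExchangeRate)
    if !ok {
        return nil, fmt.Errorf("exchange_rates: unexpected row type %T", msg)
    }
    return rate, nil
}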

Architecture

The YAML data is stored in a Git repo hosted on GitHub Enterprise. As new commits pass validation and are merged to the main branch of our config repo, the configpush service pulls them down, converts them to protobuf, and pushes them into S3:

The application services only rely on S3 to get access to configuration. The current version is determined by an object in S3, and all data is read from there. Neither GitHub nor any Plaid service needs to be up to allow readers to pull config - S3 is the sole read dependency. This is important as operations like upgrading GitHub Enterprise can make it unavailable for 30-60 minutes.
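
As a hedged sketch of what that S3 read path can look like - the bucket layout, the current-version pointer key, and fetchCurrentDataset are illustrative assumptions, not ConfigDB’s actual implementation:

package configpull

import (
    "context"
    "fmt"
    "io"
    "strings"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// fetchCurrentDataset resolves the small "current version" pointer object,
// then downloads the snapshot it points at. Key names are illustrative only.
func fetchCurrentDataset(ctx context.Context, client *s3.Client, bucket string) ([]byte, string, error) {
    ptr, err := client.GetObject(ctx, &s3.GetObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String("configdb/current-version"),
    })
    if err != nil {
        return nil, "", err
    }
    defer ptr.Body.Close()
    raw, err := io.ReadAll(ptr.Body)
    if err != nil {
        return nil, "", err
    }
    version := strings.TrimSpace(string(raw))

    // Each snapshot is written once per version, so this read is immutable data.
    obj, err := client.GetObject(ctx, &s3.GetObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(fmt.Sprintf("configdb/snapshots/%s.binpb", version)),
    })
    if err != nil {
        return nil, "", err
    }
    defer obj.Body.Close()
    data, err := io.ReadAll(obj.Body)
    return data, version, err
}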

Application nodes poll S3 to pull down new versions and load the rows into memory. Most of the logic lives in a Go binary that we run across languages to reduce duplication. Updates do not block queries: there is no networking in the query path, just a simple read from memory.
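
Here’s a minimal sketch of that pattern under our assumptions (the snapshot shape, poll interval and fetch helper are illustrative): readers load an immutable in-memory snapshot via an atomic pointer, and the poller swaps the pointer only when a new version appears, so queries never touch the network and never block on updates.

package configpull

import (
    "context"
    "sync/atomic"
    "time"
)

// snapshot is an immutable, fully-parsed copy of the dataset at one version.
type snapshot struct {
    version string
    rows    map[string]map[string][]byte // table -> primary key -> encoded row
}

type store struct {
    current atomic.Pointer[snapshot] // readers load this; the poller swaps it
}

// Query path: a read from memory - no networking, never blocked by updates.
func (s *store) get(table, key string) ([]byte, bool) {
    snap := s.current.Load()
    if snap == nil {
        return nil, false
    }
    row, ok := snap.rows[table][key]
    return row, ok
}

// Update path: poll for new versions and swap the pointer when one appears.
// fetch stands in for the real S3 download + protobuf parsing code.
func (s *store) poll(ctx context.Context, fetch func(context.Context) (*snapshot, error)) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            next, err := fetch(ctx)
            if err != nil {
                continue // keep serving the last good snapshot
            }
            if cur := s.current.Load(); cur == nil || cur.version != next.version {
                s.current.Store(next)
            }
        }
    }
}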

Per-container architecture: the application reads from the library, which reads from memory; the library receives updates from the configpull binary, which reads from S3.

It’s also worth noting here how the design is shaped by the much looser constraints our configuration datasets have when compared against our application datasets:

  • They’re smaller: ~100 MB at most, easily able to fit into memory and very fast to pull down within AWS

  • They change much less frequently - at most a few writes a minute

  • Eventual consistency is far less of a problem - it’s not vital that nodes agree (this also enabled us to run database-backed configuration systems with heavy client-side caching in the past)

An important principle in engineering: jump at opportunities to ‘cheat’, and solve an easier problem!

To summarize the important architectural attributes:

  • S3 is the only dependency for services to get access to an initial version of the data

  • Once a service has the initial version of the dataset, it never loses access to it. S3 can go down and the services continue to operate

  • Data is read from memory - there is no network request to fail or to impose unexpected latency

Validation & Approvals

We programmatically validate writes to either the schema or the data before merging. To support this without redeploying the Go service that performs the validation, we read the protobuf schema dynamically using the jhump/protoreflect package. The schema is then transformed into JSON Schema, a choice that lets us ensure validation behavior matches on the frontend and backend.
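
As a rough illustration of that flow - a deliberately reduced sketch, not ConfigDB’s actual converter - jhump/protoreflect lets us parse the .proto source at runtime and walk its fields to emit a JSON Schema:

package main

import (
    "encoding/json"
    "fmt"

    "github.com/jhump/protoreflect/desc"
    "github.com/jhump/protoreflect/desc/protoparse"
    "google.golang.org/protobuf/types/descriptorpb"
)

// messageToJSONSchema builds a (deliberately minimal) JSON Schema for one
// message type; the real transformation also handles nested messages,
// well-known types, and the configdb validation options.
func messageToJSONSchema(md *desc.MessageDescriptor) map[string]any {
    props := map[string]any{}
    for _, fld := range md.GetFields() {
        var t string
        switch fld.GetType() {
        case descriptorpb.FieldDescriptorProto_TYPE_STRING:
            t = "string"
        case descriptorpb.FieldDescriptorProto_TYPE_INT32,
            descriptorpb.FieldDescriptorProto_TYPE_INT64:
            t = "integer"
        case descriptorpb.FieldDescriptorProto_TYPE_BOOL:
            t = "boolean"
        default:
            t = "object"
        }
        if fld.IsRepeated() {
            props[fld.GetName()] = map[string]any{"type": "array", "items": map[string]any{"type": t}}
        } else {
            props[fld.GetName()] = map[string]any{"type": t}
        }
    }
    return map[string]any{"type": "object", "properties": props}
}

func main() {
    // Parse the schema dynamically - no redeploy needed when a table's .proto changes.
    // Import paths here are illustrative; the parser needs to resolve configdb.proto etc.
    parser := protoparse.Parser{ImportPaths: []string{"."}}
    files, err := parser.ParseFiles("config/movies.proto")
    if err != nil {
        panic(err)
    }
    for _, md := range files[0].GetMessageTypes() {
        out, _ := json.MarshalIndent(messageToJSONSchema(md), "", "  ")
        fmt.Printf("%s:\n%s\n", md.GetName(), out)
    }
}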

The syntax supports declarative validation rules, such as regular expression patterns, uniqueness, or mins and maxes. For instance, let’s say we wanted to ensure the ‘slug’ field above was a good fit for URLs using a pattern constraint:

string slug = 1 [
  (configdb.column) = {
    pattern: {value: "^[a-z]+(-[a-z]+)*$"}
  }
];

Storing the data in a GitHub repo allows us to use the same GitHub CODEOWNERS workflow we use for any other source code to enforce blocking reviews on configuration changes where necessary.

Why Protobuf?

We’re big users of protobuf and gRPC at Plaid. Anywhere we expose configuration data over an API, we do it over gRPC. This meant our previous config systems required a large amount of code simply to take data from the database/disk representation and map it into protobuf-generated types. So instead, it made sense for us to propagate the data as protobuf in ConfigDB. In cases where the data would be passed through RPCs, no conversion would be required, and we were already happy with our protobuf tooling for providing type-safe access to data across languages.
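
For example - a sketch that assumes a hypothetical moviespb service whose GetMovieResponse embeds the config.Movie message - a handler can hand the config row straight back over gRPC with no mapping layer:

// Sketch only: moviespb and its GetMovieResponse are hypothetical, but the
// point stands - the config row is already a protobuf message.
func (s yourserver) GetMovie(ctx context.Context, req *moviespb.GetMovieRequest) (*moviespb.GetMovieResponse, error) {
    tx := s.cdb.Tx()

    movie, err := configpb.GetMovie(ctx, tx, req.GetSlug())
    if err != nil {
        return nil, err
    }

    // No conversion step: return the row as-is in the response.
    return &moviespb.GetMovieResponse{Movie: movie}, nil
}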

The forward/backward compatibility guarantees of protobuf are also ideal for the data-transfer protocol that propagates changed configuration values to application containers. Applications with an older schema have no issues loading data written with a newer one - unknown fields are simply ignored. As elsewhere, our buf linting rules make sure that only safe changes can be made to the schema: an incompatible change to the type of an existing field, for instance, is rejected.
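
To make that tolerance concrete, here’s a small sketch using the generated Movie type from earlier; the appended bytes simulate a field (number 99) that this binary’s schema doesn’t know about:

package main

import (
    "fmt"

    "google.golang.org/protobuf/proto"

    "github.plaid.com/plaid/go.git/lib/proto/configpb"
)

func main() {
    // Encode a row with the schema this binary knows about.
    encoded, err := proto.Marshal(&configpb.Movie{Slug: "big-lebowski", Title: "The Big Lebowski"})
    if err != nil {
        panic(err)
    }

    // Simulate a writer on a newer schema: append a field this binary has
    // never heard of (field number 99, varint wire type, value 1).
    encoded = append(encoded, 0x98, 0x06, 0x01)

    // The older reader still decodes the row; the unknown field is carried
    // along and ignored by application code rather than causing an error.
    var movie configpb.Movie
    if err := proto.Unmarshal(encoded, &movie); err != nil {
        panic(err)
    }
    fmt.Println(movie.GetTitle()) // The Big Lebowski
}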

Finally, investing more heavily in protobuf both drew on and strengthened network effects: we benefited from our engineers’ existing knowledge of the protobuf model, and from our existing code-generation and linting tools.

Roads Not Taken

An alternative design for this system could use MySQL rather than Git as the system of record. Config could be stored as protobuf blobs in the DB and propagate via S3 in the same way. This would have provided a faster write path (with a better SLA) and a more familiar programming model. On the flip side, we’d lose the clarity of having config files for authoring and referencing, forcing us to rely on the UI even for local development. We’d also have to reimplement the features we wanted from Git’s history model and GitHub’s UI and review workflow.

Where We're At

We’re using ConfigDB in production for several services at Plaid and are happy with the results. For instance, the ability to change our API rate-limits at runtime has already proved powerful for mitigating load spikes and handling urgent requests for more capacity. Operating it has been a lot more straightforward than operating our existing MySQL-based stack. Even if we take down GitHub or the configpush service, application services continue to have access to their configuration and run unaffected.

Our next investment in ConfigDB will be a generic edit user-interface. This will use the JSON Schema generated from each table’s schema, along with ui:schema, to make CRUD edit UIs come “for free”, while allowing easy customization in cases where more control is required.

Acknowledgments

Thanks to the rest of the team: Andrew Chen, Andrew Yang, Wil Fisher, John Kim, Ioana Radu and Mike Rowland! Beyond that, thanks to the product teams who highlighted this was a pain-point - Roemer Vlasveld and Aditya Sule in particular; engineers who shared experience with similar systems; the many reviewers of the spec; and our beta-testers.

We’re also grateful to other companies that shared their experience designing Git-backed config systems, e.g. Facebook’s Configurator. While we didn’t end up with a design that resembled them particularly closely, reading about them was invaluable and gave us confidence in the approach.


Since day one, Plaid has been built with developers in mind. We’re proud to continue that mission today as we innovate and solve fintech’s most long-standing, hard-to-solve problems. If that excites you, check out our open job positions here.