December 19, 2019
Benefits of writing our own application bootloader
Nathan James Tindall
There's a lot of variability in Plaid's 75+ internal services. Most of our services are written in Go, TypeScript, and Python. They can communicate with each other synchronously (through gRPC and HTTP), as well as asynchronously (with message queues and scheduled jobs), and are deployed to multiple environments (internal testing, staging, and production). In addition to Plaid-owned components, we use several third-party tools (like ELK, Prometheus, and Mongo), adding operational complexity to our infrastructure.
Despite these variations, running each service requires a common set of configuration parameters: things like environment variables, secrets, TLS certificates, and public keys. And although each service has its own unique set of dependencies, the initialization for a given dependency like Logstash or Mongo is similar across services.
Plaid's platform team provides solutions to shared infrastructural problems, like how to write common configuration and management logic for these diverse services. While there are several open source solutions for application supervision (e.g. supervisord) and configuration (e.g. Consul or Zookeeper), we decided to build our own: despite the relatively simple set of requirements that we had, there was no out-of-the-box solution that met all of our needs, and we were not thrilled about the prospect of integrating with a whole swath of new tools. In this post, we'll talk about a layer of software (the "bootloader") that we wrote to solve this problem, and why that decision has been beneficial to Plaid.
A few years ago, we embarked on an effort to migrate away from our legacy deployment tooling (running Chef with Opsworks) to a more modern container orchestration system backed by ECS and CloudFormation. We had two main goals: bringing more of our system into Infrastructure-as-Code, and improving dev-prod parity by using the same Docker images locally as we do in production.
We wanted to avoid making changes to the applications as we were migrating them to the new containerized runtime environment. Thus, we decided not to change the application-configuration relationship: our existing solutions were fine, we just needed to port them to be compatible with our new Dockerized environment. While prototyping, we developed various bespoke scripts for performing this runtime configuration for every service we migrated, for example:
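One such script, reconstructed here as an illustrative sketch — the service name, file paths, and parameter names are hypothetical, not our actual configuration:

```bash
#!/bin/bash
set -euo pipefail

# Load hardcoded per-environment variables (hypothetical path).
source /etc/apiv2/production.env

# Read a secret out of AWS Parameter Store.
MONGO_PASSWORD="$(aws ssm get-parameter \
  --name "/production/apiv2/mongo-password" \
  --with-decryption --query 'Parameter.Value' --output text)"

# Assemble an authenticated Mongo URL for the application.
export MONGO_URL="mongodb://apiv2:${MONGO_PASSWORD}@localhost:27017/apiv2"

# Replace this shell with the service process.
exec /usr/local/bin/apiv2
```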
As we migrated a handful of services, the above code was copy-pasted a few times, with slight modifications depending upon the requirements of each service. While acceptable for the MVP, this setup was far from ideal; making more universal changes or improvements later would have been cumbersome and fragile in the long term.
To address these problems simply and uniformly across our stack, we wrote a small "bootloader" binary in Go. The binary, which is available in our base Docker image, takes an input command to run and executes the requested service with the provided configuration. With bootloader, the above script can be replaced with the following invocation:
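An invocation of roughly this shape — `--mongo` and `--public-keys` are real options described below, while the other flag names are illustrative:

```bash
bootloader \
  --env-dir /etc/apiv2/env \
  --secrets /etc/apiv2/secrets.json \
  --public-keys /etc/apiv2/keys \
  --mongo \
  -- apiv2
```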
This command tells bootloader to initialize envvars from a given directory, to load public keys, and to create an auth'd Mongo URL prior to running the apiv2 command. As time went on, our decision to create this process wrapper had several benefits. We've extended the bootloader to handle several other use cases, which we'll now describe.
One core function of the bootloader is provisioning environment variables and secrets to applications. An application has an env var file per environment, as well as a single secrets.json file specifying which values should be read from Parameter Store. In this scenario, variables like APIV2_INSTITUTIONS_REDIS_HOST have hardcoded values in the production env file, while APIV2_INSTITUTIONS_REDIS_PASSWORD is read from Parameter Store.
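As an illustrative sketch (file names, values, and layout are hypothetical), the production env file and secrets.json might look like:

```
APIV2_INSTITUTIONS_REDIS_HOST=institutions-redis.production.internal
APIV2_INSTITUTIONS_REDIS_PORT=6379
```

```json
{
  "APIV2_INSTITUTIONS_REDIS_PASSWORD": "/production/apiv2/institutions-redis-password"
}
```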
Additional configuration flags instruct the bootloader to provision other resources: --mongo tells it to make a colocated mongos instance available to the application, --public-keys indicates where the relevant TLS public keys should be placed, and so on. Many of these options reflect the specific way in which we provision our infrastructure, and we benefit from having them implemented in a way that's testable and extensible.
We make heavy use of the ELK stack for logging and dashboards, relying on long retention periods for business as well as engineering use cases. Maintaining the health of our logging system is a high priority for our team. Bootloader enabled us to dramatically improve the reliability and feature set of our logging infrastructure.
In the initial phases of our migration to a containerized infrastructure, we used the Docker syslog driver to forward logs to a cluster of Logstash instances deployed on ECS, which caused some issues. The bootloader would simply forward subprocess logs to stdout, and service containers would push the logs to the Logstash instances via the syslog driver. Docker daemon implementation details made this setup brittle: backlogs in log consumption, as well as Logstash redeploys, could block or crash applications; worse, failing to establish a connection between Docker and Logstash at startup would prevent deployments from completing, because applications were unable to start. We were able to mitigate the first issue – by using the internal log buffer configuration in the Docker syslog logging driver – but not the second, and we accepted this compromise while looking for a better solution.
The bootloader helped us move quickly to solve this problem. We decided to modify our logging pipeline to use Kinesis streams, which could be reliably consumed and processed out of band. Doing this across all of our services was only feasible due to the bootloader: we programmed it to write batches of logs to Kinesis instead of sending them directly to stdout for consumption by syslog. This resolved the second issue (of blocked deployments) by removing the Docker → Logstash dependency. As an added benefit, this approach also gave us visibility into when outgoing logs were being dropped (e.g. due to Kinesis throttling), whereas our previous system lacked this level of visibility.
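In spirit, the batching layer looks something like the following minimal Go sketch. Here `flush` stands in for a Kinesis `PutRecords` call; the real implementation also flushes on a timer and retries throttled records, and the names and structure are illustrative, not our actual code:

```go
package main

// batcher accumulates log records and flushes them in fixed-size batches.
// flush abstracts the Kinesis write; a real implementation would also
// flush on an interval and handle partial failures.
type batcher struct {
	buf     [][]byte
	maxSize int
	flush   func(records [][]byte) error
	dropped int // records the sink rejected, e.g. due to throttling
}

// add buffers one log record, flushing when the batch is full.
func (b *batcher) add(record []byte) {
	b.buf = append(b.buf, record)
	if len(b.buf) >= b.maxSize {
		b.flushNow()
	}
}

// flushNow sends the current batch; failed batches are counted rather
// than silently lost, which is what gives visibility into dropped logs.
func (b *batcher) flushNow() {
	if len(b.buf) == 0 {
		return
	}
	if err := b.flush(b.buf); err != nil {
		b.dropped += len(b.buf)
	}
	b.buf = nil
}
```

Tracking `dropped` explicitly is the key difference from the old syslog path, where records could disappear without a trace.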
This new architecture, with the bootloader as the universal logs processor, enabled us to implement several new logging instrumentation and sanitization features across our stack. Previously, our applications could cause logging cluster outages when the log volume increased significantly, when log entries were too big, or when they had extremely high field name cardinality. In the logs-management abstraction layer in the bootloader, it was fairly easy to implement a data transformation pipeline before flushing the logs out to Kinesis. We have implemented normalizations, like injecting common log fields, in addition to several language-agnostic safeguards. These guardrails have indeed prevented applications from adversely affecting our logging infrastructure since they've been deployed.
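A simplified sketch of such a guardrail follows. The field names and limits are illustrative, and the real pipeline also tracks field-name cardinality across entries rather than only per entry:

```go
package main

// Illustrative limits; real values would be tuned to the logging cluster.
const (
	maxValueBytes = 4096 // truncate oversized field values
	maxFields     = 64   // cap field count to bound mapping growth
)

// sanitize applies guardrails to one log entry before it is flushed:
// it truncates huge values, caps the number of fields, and injects
// common infrastructure fields like service and environment.
func sanitize(entry map[string]string, service, env string) map[string]string {
	out := make(map[string]string, len(entry)+2)
	for k, v := range entry {
		if len(out) >= maxFields {
			break // drop excess fields rather than poison the index
		}
		if len(v) > maxValueBytes {
			v = v[:maxValueBytes]
		}
		out[k] = v
	}
	// Normalize every entry with common fields.
	out["service"] = service
	out["environment"] = env
	return out
}
```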
We were also able to reduce log volume by implementing "trace sampling" logic in the bootloader. In services with high logging volume, bootloader groups logs together with a "trace_id", and can drop entire traces unless the trace has any log message with an "error" or "warn" level. This allows us to dynamically reduce the ingress to our Elasticsearch cluster while still keeping important logs for debugging.
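A batch-oriented sketch of the sampling decision might look like the following; the real implementation operates on a stream and buffers each trace with a timeout, but the core rule — keep a trace only if it contains a warn- or error-level entry — is the same (types and field names are illustrative):

```go
package main

// logEntry is a simplified log record for illustration.
type logEntry struct {
	TraceID string
	Level   string
	Message string
}

// sampleTraces keeps an entry only if its trace contains at least one
// "warn" or "error" entry; all other traces are dropped in full.
func sampleTraces(entries []logEntry) []logEntry {
	keep := make(map[string]bool)
	for _, e := range entries {
		if e.Level == "warn" || e.Level == "error" {
			keep[e.TraceID] = true
		}
	}
	var out []logEntry
	for _, e := range entries {
		if keep[e.TraceID] {
			out = append(out, e)
		}
	}
	return out
}
```

Dropping whole traces, rather than individual lines, preserves the full context around every warning or error while discarding uneventful request logs.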
The bootloader executes the provided startup command as a subprocess; this positions it well to supervise the underlying application, similar to the role usually played by a sidecar process. The bootloader instruments the lifecycle of the subprocess by intercepting signals sent to the process and, before forwarding them to the application, sending out infrastructural logs and events to our Prometheus-based monitoring system.
Signal processing in the bootloader enabled us to greatly reduce deployment times by implementing custom behaviors when application code can be safely reloaded. It has also helped us understand and improve the way in which we drain traffic when signals like SIGTERM are received. Overall, it has given us better visibility into, and more control over, the lifecycle of our services.
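In miniature, the supervision loop looks something like this Go sketch; the metric emission and reload behaviors are elided, and this is an assumption-laden illustration rather than our actual implementation:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runSupervised starts the service as a subprocess and relays SIGINT and
// SIGTERM to it, emitting an infrastructure log line (and, in the real
// bootloader, a lifecycle event) before each signal is forwarded.
func runSupervised(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigs)

	go func() {
		for sig := range sigs {
			// Instrument the lifecycle event before the app starts draining.
			log.Printf("bootloader: forwarding %v to subprocess", sig)
			_ = cmd.Process.Signal(sig)
		}
	}()
	return cmd.Wait()
}
```

Because the bootloader sits between the signal source and the application, it observes every shutdown before the application does.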
The bootloader not only forwards all application logs to ELK; it also lets us log from Go code when something goes wrong at the infrastructure level. A secret that fails to load, an encryption key that can't be provisioned, or a database connection string that can't be found will sound the alarm instead of going undetected.
Why we wrote our own
We take pragmatism and simplicity very seriously. We usually want to write less code and avoid falling prey to "not invented here" syndrome. However, rolling out a custom solution can be worth it when the requirements are simple and specific enough to a company’s infrastructure. While there are multiple solutions in the OSS ecosystem, they are mainly geared toward individual use cases:
Configuration. Consul (a true sidecar, operating entirely outside the application's process tree) and Zookeeper (a distributed coordination service built on the ZAB consensus protocol, a close relative of Paxos) are attractive options, but each is a distributed system that needs to be operated and maintained in its own right.
Logging. Logstash, fluentd, and Flowgger are all popular options similar to one another in architecture, but would require authoring custom filter plugins to implement log transformation and batching pipelines similar to those we implemented in the bootloader.
Application supervision. supervisord is officially recommended by Docker as PID 1 for multi-process applications in a single container and has similar signal-notification and logging functionality, but would have required extension to integrate lifecycle events with our other systems.
There is no single system which encompasses all three feature sets into a cohesive out-of-the-box package, which strengthened the argument that this tooling was a worthwhile investment for us. Additionally, by writing the code in our Go monorepo, we were able to reuse existing code we had written for plugging into external dependencies. As a consequence, the cost to build was relatively low versus the cost of integrating with and maintaining new systems. Ultimately, the bootloader is just thin enough to give us what we need, and we are still planning to lean on open source tooling or managed solutions where it makes sense.
Simple as our bootloader may be, its importance sometimes makes it feel like a scary point of failure. That's why we test it very carefully and roll out new features in a controlled way, using our feature flagging system. With great power comes great responsibility! We keep its footprint small, carefully scrutinize new changes, and uphold thorough unit and integration test suites. These caveats aside, our bootloader has served us well, and we would recommend this paradigm to teams trying to standardize application configuration across different first-party language runtimes and third-party tools.
While initially developed to aid us in a migration, the bootloader's unique position in our stack – just between the operating system and the application – has proven to be a point of technical leverage, and we are excited to have it in our toolchain. It has given us the ability to change infrastructure from a unified code location, rather than through several shell scripts or libraries that would perform the same task in different languages and deployment environments. It's proven to be extensible, and we're now using it to start applications in environments other than ECS, like AWS Lambda and Jenkins. Additionally, the cost of our in-progress migration to Kubernetes is lower, as application configuration is standardized and separate from the implementation details of application deployment.
If you're interested in building impactful technical infrastructure on our platform team, we're hiring!