June 28, 2018
Scaling a monitoring platform to over 9,600 bank integrations
Joy Zheng & Jeeyoung Kim
Updated on October 12, 2018
At Plaid, we maintain integrations with more than 9,600 banks. Our core promise to customers is reliable and homogeneous data from all of these financial institutions. From an operational perspective, that requires us to monitor 9,600 external dependencies on top of our own infrastructure. This is an inside look at monitoring at Plaid.
As the number of integrations grew from a handful to nearly ten thousand over the last few years, we increasingly found that we had outgrown our legacy, logs-based monitoring system. But given the diversity of institutions, we struggled to determine what the right replacement would be and how to get it built. This post dives into how we answered these questions and what we learned once we were done.
Step 1: Understanding the Problem
Our primary monitoring challenge lies in the combination of integration count with the heterogeneity of the integrations monitored. The financial institutions with which we integrate range from the largest banks in the United States to small, regional credit unions. In practice, this means that similar metrics have different meanings across integrations. While any monitoring system should be able to handle differential configuration, doing so across a large number of metrics becomes more complicated.
Consider hourly linkage results for two integrations with very different traffic levels. In Hour 1, both integrations look fine; there are some failed linkages, which is expected since users often misremember their password on the first try. In Hour 2, however, we see that all the linkages for Tartan Federal Credit Union have failed, but the low count makes it hard to tell whether this is a problem with the integration or just a string of incorrect passwords. More generally, due to varying traffic levels, it makes more sense to monitor success/failure counts for some integrations and percentages for others. We also look beyond success/failure metrics for failure modes such as latency spikes and data quality degradation.
Our ultimate monitoring solution needed to be able to support cases like this across all 9,600 of Plaid’s supported institutions.
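As a rough illustration of the count-versus-percentage distinction described above, here is a minimal Python sketch. The thresholds and the names HourlyStats and should_alert are hypothetical and chosen only for illustration; they are not Plaid's actual rules.

```python
from dataclasses import dataclass

# Hypothetical thresholds, picked only to illustrate the idea.
MIN_SAMPLES_FOR_RATE = 50   # below this, a failure percentage is too noisy
MAX_FAILURE_RATE = 0.5      # high-traffic rule: alert if over half of linkages fail
MAX_FAILURE_COUNT = 10      # low-traffic rule: alert on an absolute failure count


@dataclass
class HourlyStats:
    successes: int
    failures: int

    @property
    def total(self) -> int:
        return self.successes + self.failures


def should_alert(stats: HourlyStats) -> bool:
    """Use a failure rate when there is enough traffic, else a raw count."""
    if stats.total >= MIN_SAMPLES_FOR_RATE:
        return stats.failures / stats.total > MAX_FAILURE_RATE
    return stats.failures >= MAX_FAILURE_COUNT


if __name__ == "__main__":
    print(should_alert(HourlyStats(successes=900, failures=100)))  # busy bank
    print(should_alert(HourlyStats(successes=0, failures=3)))      # quiet credit union
```

With these illustrative thresholds, three failures out of three attempts at a quiet institution do not page anyone, while the same 100% failure rate at a busy institution would.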
Step 2: Do We Need a Rewrite?
The first step in building out a new monitoring pipeline was convincing ourselves that we needed a new system—that our legacy system could not scale for the next year of growth.
This legacy monitoring system, which we’d implemented more than two years ago when we had only a handful of in-house integrations, consisted of Nagios sending queries to our ElasticSearch log cluster and forwarding alerts to PagerDuty.
But this system lacked the customizability to extend cleanly as our system grew, and we found ourselves extending it with duct tape in the form of an ever-growing configuration file. Here are some of the problems we faced:
As our traffic grew, we were forced to accept dwindling log retention periods in the ElasticSearch cluster
Log queries could measure the success rate within the retention window, but there was nowhere to track metrics over time past the retention period in order to alert against gradual changes
The configuration of alerting thresholds was manual on a per-institution basis
Otherwise innocuous logging changes risked breaking monitoring queries
The same logs were used for business analytics and quality of service, despite very different latency and granularity requirements between the two
Step 3: Determining Metrics
Having convinced ourselves that our existing system would not allow us to adequately monitor all of our bank integrations, we moved to the next stage of the process: identifying a full list of areas we would ideally want to monitor. As discussed above, an especially relevant challenge was the varying degree of traffic per bank: thousands of requests per second for some but only a couple of dozen per day for others. Monitoring a low-throughput system is a very different problem, because the orders-of-magnitude difference between institutions challenges traditional assumptions about success and failure rates.
Creating our metrics wishlist made the complexity of our domain evident. Working over different metric types (such as absolute vs. percentage failures), error types, and services where errors could be thrown, we ended with over 20 categories of metrics. We bucketed these metrics into high/medium/low priority based on customer impact and cost of instrumentation, and focused our implementation on only the high-priority areas.
Step 4: Technical Requirements
Finally, given the product metrics to monitor, we determined the system’s technical requirements:
Scalability: We needed to handle a large metric count due to the multiplicative factor of [9.6k banks * N metrics each], so we wanted a system that could handle a high cardinality of metrics
Latency: We set ourselves goals of at most 30s lag from event to metric creation, and 1s query time
Usability: Engineers should be able to easily update or create metrics and alerts with a minimum of monitoring-specific domain knowledge
Then we researched options for each stage of the system:
Event Transport: We chose Kinesis (over Logstash and SQS) due to its combination of buffering in case of consumer downtime, allowance for multiple readers for event-stream reuse, and existing client libraries
Time Series Database: We chose Prometheus because it met our scaling needs without requiring an external dependency like HBase or an enterprise edition; other options included Graphite, OpenTSDB, and InfluxDB
Alerting: We chose Alertmanager over Nagios and Grafana due to the amount of flexibility it allowed for notification and alerting rules—for example, grouping and suppressing alerts based on tags—even though it required configuration via config files instead of a UI
Visualization: We chose Grafana because it could integrate not only with Prometheus, but also with other sources of data at Plaid (namely, CloudWatch metrics and ElasticSearch logs)
Now we could wire up all of those stages into a full pipeline. Working through system requirements taught us that we needed custom components for the sake of metrics generation, but not for scalability.
In particular, Prometheus operates on a pull-based model, where it queries individual servers for metrics on a regular basis; this allows for redundancy by running multiple independent Prometheus servers. For example, individual API servers might export an api_call_count metric. Then, the Prometheus query sum(rate(api_call_count[5m])) adds up the rate of API calls over all servers.
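To make the aggregation concrete, the following Python sketch approximates what sum(rate(api_call_count[5m])) computes over scraped counter samples. It is deliberately simplified: it ignores counter resets and the extrapolation that Prometheus applies at window boundaries, and the server names and sample values are made up.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample in the window.

    Simplified stand-in for rate(): no counter-reset handling, no extrapolation.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def summed_rate(per_server: Dict[str, List[Sample]]) -> float:
    """Approximate sum(rate(metric[window])) across all scraped servers."""
    return sum(simple_rate(samples) for samples in per_server.values())


# Made-up scrapes of api_call_count from two API servers over a 5-minute window.
scrapes = {
    "api-1": [(0, 1000), (150, 1300), (300, 1600)],
    "api-2": [(0, 500), (150, 650), (300, 800)],
}

if __name__ == "__main__":
    print(summed_rate(scrapes))  # calls per second across both servers
```

Because each server only reports its own counter, the sum is meaningful only when per-server rates can simply be added together, which is the assumption the next paragraph examines.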
The pull-based model works well for metrics which can be easily aggregated across multiple servers. But it doesn’t always work well for low throughput events where each server sees only partial data. For example, each task runner may only see a handful of requests for Tartan Federal Credit Union over the course of a day, making it impossible for that task runner to individually generate an accurate latency histogram.
As a result, we added custom components to enable metrics generation from a full event stream. For improved usability, we built a monitoring pipeline which could be used either with or without those custom components.
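The Python sketch below illustrates why a full event stream helps for low-throughput institutions: each task runner holds too few samples for a meaningful percentile, but the pooled samples give a usable picture. The runner names and latency values are invented, and the nearest-rank percentile is just one simple choice of estimator.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of values."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pooled_percentile(by_runner: Dict[str, List[float]], pct: float) -> float:
    """Percentile over the samples from every runner combined."""
    pooled = [v for values in by_runner.values() for v in values]
    return percentile(pooled, pct)


# Invented request latencies (seconds) for one small institution, per task runner.
latencies_by_runner = {
    "runner-a": [0.8, 1.1],
    "runner-b": [0.9],
    "runner-c": [1.0, 6.5, 1.2],
}

if __name__ == "__main__":
    for runner, values in latencies_by_runner.items():
        print(runner, percentile(values, 95))  # each runner sees a different tail
    print("pooled", pooled_percentile(latencies_by_runner, 95))
```

In this toy data, runner-b would report a 95th percentile of 0.9 seconds from a single sample, while only the pooled view reveals the 6.5-second outlier in the context of all six requests.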
On the one hand, services that can export straightforward metrics by themselves make use of a standard metrics pipeline:
This pipeline leverages Prometheus to poll services for metrics and forward alerts to Alertmanager, which sends notifications to PagerDuty and Slack. Engineers using the standard pipeline do not need to interact with or understand custom components, and often do not need to even add instrumentation. Instead, we automatically export metrics from many of our shared libraries, such as server middleware or database transaction wrappers.
This pipeline could handle all of our initial scaling requirements—such as hundreds of thousands of metrics—out of the box while being compatible with custom metric generation.
Meanwhile, other services use the custom metrics pipeline, which has the following additions:
Services send events to an Amazon Kinesis stream instead of generating their own metrics. The event types are specified by protobufs, which has the extra benefit of providing a well-defined event stream which other services can read
An event consumer with its own database reads the Kinesis stream, generates metrics, and exposes those metrics to Prometheus. This lets us add custom metric transformations as well. This ability to easily add or plug in custom, advanced metrics is critical to the system’s adaptability to domain-specific situations
After this point, Prometheus pulls the metrics and the rest of the pipeline is unchanged.
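The consumer step can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the actual implementation: events are plain dictionaries instead of protobuf messages read from Kinesis, there is no backing database, and the metric name linkage_events_total and its labels are hypothetical. The output follows the Prometheus text exposition format, which is what a scrape endpoint would serve.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (institution, outcome)


def aggregate(events: Iterable[Dict[str, str]]) -> Counter:
    """Count events per (institution, outcome) across the whole stream."""
    counts: Counter = Counter()
    for event in events:
        counts[(event["institution"], event["outcome"])] += 1
    return counts


def render(counts: Dict[Key, int], metric: str = "linkage_events_total") -> str:
    """Render the counts in the Prometheus text exposition format."""
    lines = [f"# TYPE {metric} counter"]
    for (institution, outcome), n in sorted(counts.items()):
        lines.append(
            f'{metric}{{institution="{institution}",outcome="{outcome}"}} {n}'
        )
    return "\n".join(lines) + "\n"


# Stand-in for events read from the stream (hypothetical field names).
events = [
    {"institution": "tartan_fcu", "outcome": "failure"},
    {"institution": "tartan_fcu", "outcome": "failure"},
    {"institution": "big_bank", "outcome": "success"},
]

if __name__ == "__main__":
    print(render(aggregate(events)), end="")
```

Because the consumer sees every event regardless of which server emitted it, this is also the natural place to plug in transformations that a single server could not compute on its own.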
Outcomes and Ongoing Work
Nine months later, the monitoring system has reached:
More than 700 events per second processed
190k+ metrics exported
17 services monitored
31 engineers who have contributed monitoring configuration changes (on a team of 45)
< 5 second average delay from events to metric generation
Continuing work on the monitoring system has happened in multiple areas:
Scaling: as the number of metrics grew, we added federation and pre-aggregation of metrics to Prometheus
Tooling: while Prometheus makes it easy to set alerts across entire categories of metrics, a follow-up project built more tooling to templatize tuning alerts on a per-integration basis without having to copy-paste alert configuration
Developer education: we underinvested initially given the number of monitoring components, making it hard for other developers to understand how they should interact with the monitoring system. Improved education has taken the form of internal tech talks and documentation on how to use the system
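As one possible shape for the templating mentioned under Tooling, the Python sketch below builds a Prometheus-style alerting rule per institution from shared defaults plus per-integration overrides. The alert name, metric name, labels, and thresholds are all hypothetical; the source does not describe how the actual tooling works.

```python
from typing import Any, Dict

# Shared defaults and per-integration overrides (all values are illustrative).
DEFAULTS: Dict[str, Any] = {"max_failure_ratio": 0.5, "for": "10m"}
OVERRIDES: Dict[str, Dict[str, Any]] = {
    "tartan_fcu": {"for": "1h"},  # low traffic: wait longer before alerting
}


def build_rule(institution: str, overrides: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return one alerting rule as a dict shaped like a Prometheus rule entry."""
    params = {**DEFAULTS, **overrides.get(institution, {})}
    selector = f'institution="{institution}"'
    expr = (
        f'sum(rate(linkage_events_total{{{selector},outcome="failure"}}[5m]))'
        f" / sum(rate(linkage_events_total{{{selector}}}[5m]))"
        f' > {params["max_failure_ratio"]}'
    )
    return {
        "alert": "HighLinkageFailureRatio",
        "expr": expr,
        "for": params["for"],
        "labels": {"institution": institution, "severity": "page"},
    }


if __name__ == "__main__":
    for name in ("big_bank", "tartan_fcu"):
        print(build_rule(name, OVERRIDES))
```

Keeping overrides in one small mapping means that tuning a single integration is a one-line change rather than a copied block of alert configuration.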
Over time, the monitoring pipeline has been used by an increasing number of services and teams; this means we can instrument services more thoroughly than before and catch regressions faster. Though a few services take advantage of the custom metrics pipeline, most instead export server metrics for the standard pipeline. We are also integrating the monitoring pipeline more directly into other parts of our workflow; for example, filing and autotriaging JIRA issues for regressions and automatically gating deployments on monitoring.
What We Learned
We ended up with these key takeaways from building the system and watching its initial usage:
Build the end-to-end pipeline first to start learning from real usage and avoid building unnecessary complexity. When narrowing down the metrics during the speccing phase, we prioritized some metrics that required a stateful system view, and consequently added an extra feature of backing SQL databases to track state to our metrics generator. But it turned out that in many cases we were able to either deprioritize these metrics or fetch them with higher latency from a separate analytics system instead. If we had built the end-to-end pipeline before tackling the stateful metrics above, we likely would have realized that there were alternative ways to get these metrics without writing complex custom database logic.
Make monitoring components independently usable. Some services obtain significant value from pieces of the monitoring pipeline rather than the full solution, a testament to our success in building the system to be modular.
Our fraud detection service also reads from the Kinesis event stream to obtain a full view of all API calls but skips Prometheus by generating its own metrics and sending notifications straight to Alertmanager.
A dynamic load-balancing service queries Prometheus for system health metrics and routes tasks away from failing servers in response.
We didn’t anticipate this level of reuse, but we’ve taken it as a lesson for future designs.
Tailor custom aggregation to the narrowest areas which require it. Our event consumer performs roll-ups that would be hard to do at a per-server level, but does not otherwise prevent servers that can perform their own calculations from exporting metrics directly.
Use standard components where possible. Custom components have a higher implementation cost and create an increased need for developer education to teach other teams about the system. After we shipped v1 of the new monitoring system, we realized that our teammates had a much harder time working with the custom pipeline than the standard pipeline. Keeping the two separate allowed us to make the most common tasks as simple as possible without sacrificing flexibility where needed.
Developer education is just as important as implementation. We significantly under-invested in documentation about how to interact with the monitoring system, such as wiring it to new services or adding alerts. This became one of the biggest barriers to adoption.
In conclusion, we successfully built a monitoring pipeline for thousands of heterogeneous dependencies using mostly open-source components. While system requirements necessitated a modicum of customization, they suggested that most technical requirements could be met using open-source Prometheus and Alertmanager. By minimizing the number of custom services in the pipeline, we increased usability and decreased development cost.
Monitoring is just one of the many challenges we face when integrating with nearly 10,000 financial institutions. If you think you’d enjoy tackling these challenges, take a look at the roles we're hiring for!