
May 29, 2024

Plaid's journey to a multi-cluster Elasticsearch architecture to improve reliability

Authors: Ben Masschelein-Rodgers, Santi Santichaivekin, Max Zheng, Will Yu, and Brady Wang

Plaid’s application services emit a huge quantity of logs: over 50,000 per second across thousands of Kubernetes containers. Making these available to engineers and our support team in a timely and reliable fashion is critical for everyday operations and debugging. We’ve used Elasticsearch for many years to enable this, but after years of growth, our single 120-node cluster was showing serious strain. The observability team was plagued by constant pages about delayed ingestion, unhealthy nodes, and dropped logs, and our engineers and support team bore the consequences.

Our remedy was to rethink the architecture and move to a multi-cluster Elasticsearch setup with per-service log streams, allowing us to isolate our most critical services and more flexibly manage the load across the hundreds of internal services that emit logs. Using cross-cluster search with a gateway cluster, we were able to maintain a single front-end for log queries and dashboards and saw no impact on query performance. In the months since launching, log delay has dropped by over 95%, giving our customers reliable and instant access to insights. At last, our observability team can sleep soundly at night.

In this post, we’ll cover the decision-making process, details of the new architecture and migration, and lessons learned since migration.

Background

Every engineer understands the importance of good logging for system observability. At Plaid, we rely heavily on logs for both our engineering teams and our customer support operations, which use log-based dashboards extensively to monitor system health and diagnose customer issues.

1 "log": {
2 "message": "entry 123456789 unavailable",
3 "timestamp": 1713300000000,
4 "level": "warn",
5 "hostname": "activity-logs-server-abcd-1234",
6 "method": "foo.Bar.GetEntry",
7 "service": "SERVICE_FOO_SERVER",
8 "client_id": "asdfghjkqwertyui123",
9 "environment": "testing"
10 }

A simplified example of a log. Typically every log has a textual message, level, timestamp, and various additional useful key/value metadata.

The industry standard for logging and monitoring is the ELK stack (Elasticsearch, Logstash, Kibana). This combination of tools offers powerful capabilities for collecting, parsing, and visualizing log data. Plaid initially adopted a single Elasticsearch cluster hosted on AWS to store its logs. As we experienced rapid growth, this setup began to exhibit limitations, despite performing regular upgrades and scaling the number of nodes to match the load.

The above is a visual representation of our previous architecture. Plaid services emit to one of ~10 streams, each of which is consumed by a Logstash Kubernetes pod that writes the logs to our Elastic cluster.

Challenges

One significant challenge we faced was a common one for Elastic users: logging spikes and mapping explosions, where the number of fields in an index grows without bound, driving up resource consumption and degrading performance. A simple bug causing excessive logging in one service, out of the hundreds writing to the cluster, could overwhelm it in minutes. The resulting pressure on the nodes holding that service's index degraded other indexes and cascaded into failures, delaying ingestion for everything writing to the cluster.
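As a concrete (and purely hypothetical) illustration of how a mapping explosion starts, consider a log that embeds an unbounded value, such as a request ID, in a field name. Under dynamic mapping, every new key becomes a new field in the index mapping:

package main

import "fmt"

func main() {
    requestID := "req-123456789"

    // Bad: the field name changes per request, so the index mapping gains a
    // brand-new field for every request that comes through.
    bad := map[string]any{
        fmt.Sprintf("latency_ms_%s", requestID): 42,
    }

    // Better: a fixed field name, with the variable part as a value.
    good := map[string]any{
        "latency_ms": 42,
        "request_id": requestID,
    }

    fmt.Println(bad, good)
}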

We tried to address these issues by scaling the cluster with more nodes, increasing sharding, and tuning settings, but these efforts proved insufficient in alleviating the underlying problems. We were experiencing two critical incidents per week of log ingestion delays as the system buckled under the pressure. Mitigating these incidents was a time-consuming process at best, a wild goose chase at worst. Multi-hour delays and reams of lost logs were common.

For our engineers, these delays meant lost productivity: debugging issues and safely deploying new features became difficult without access to fresh logs. For our support team, resolving incidents took significantly longer. With this loss of crucial insight into our operations, the impact on customers was a growing concern. Something had to change.

Choosing a new logging platform architecture

Although the problems we faced were far from unique, the path forward was not obvious. For smaller companies, the standard ELK stack is sufficient, while larger companies typically switch to more elaborate architectures or novel in-house solutions. Some similar-sized companies had tried other logging solutions and vendors, such as ClickHouse, Splunk, or Datadog. We considered these, but they didn’t align with our needs: wholesale migration to an entirely new platform would be much slower, with no guarantee of improvement. An internal survey also made it clear that our users were comfortable and familiar with the Elasticsearch and Kibana experience and tooling. The real problem was instability and delays.

Solution

We decided to switch to a multi-cluster architecture using cross-cluster search. This would give us the benefits of fault isolation, minimizing the blast radius if logging spikes occurred. It would allow us to separate out critical logs from less important ones, and we’d be better equipped to fine-tune cluster settings and resourcing.

To ensure a unified front-end, the other key piece of this architecture is a ‘gateway’ cluster that routes all queries to the appropriate sub-clusters. Luckily, Elasticsearch makes this simple with cross-cluster search, which allows any cluster to route a query to another cluster in the same network (given the right security configuration).

During this process we learned that Elasticsearch had changed its licensing terms, so newer versions are no longer available through AWS’s managed service. OpenSearch is the open-source fork and is more or less feature-identical. The rest of this article will refer instead to OpenSearch and OpenSearch Dashboards (the Kibana equivalent).

Our new architecture

lib/plog is our client logging library, implemented in Go, Python, and Node, which allows engineers to emit rich logs to the backend.

Bootloader implements common logic that is needed to start and run Plaid’s application services in Kubernetes. At a high level, it wraps any executable and runs it as a subprocess. It also implements a logging transport that reads the output of the subprocess, performs common transformations and filters, and then batch writes messages to Kafka.
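For a rough sense of what that looks like, here is a minimal sketch of such a logging transport (not the actual bootloader), assuming the segmentio/kafka-go client and a hypothetical per-service topic name. The real implementation also applies the transformations and filters mentioned above:

package main

import (
    "bufio"
    "context"
    "log"
    "os"
    "os/exec"
    "time"

    kafka "github.com/segmentio/kafka-go"
)

func main() {
    if len(os.Args) < 2 {
        log.Fatal("usage: bootloader <command> [args...]")
    }

    // Hypothetical broker address and topic; Plaid uses a dedicated topic per service.
    writer := &kafka.Writer{
        Addr:         kafka.TCP("kafka:9092"),
        Topic:        "logs-service-foo",
        BatchSize:    500,
        BatchTimeout: time.Second,
    }
    defer writer.Close()

    // Wrap the target executable and run it as a subprocess.
    cmd := exec.Command(os.Args[1], os.Args[2:]...)
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        log.Fatal(err)
    }
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    // Forward each emitted log line to Kafka; the writer batches internally.
    scanner := bufio.NewScanner(stdout)
    for scanner.Scan() {
        line := scanner.Bytes()
        msg := make([]byte, len(line))
        copy(msg, line) // the scanner reuses its buffer, so copy before handing off
        if err := writer.WriteMessages(context.Background(), kafka.Message{Value: msg}); err != nil {
            log.Printf("failed to write log line: %v", err)
        }
    }

    if err := cmd.Wait(); err != nil {
        log.Printf("subprocess exited with error: %v", err)
    }
}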

Kafka is our transport layer for log messages. Each service writes to a dedicated topic.

Logstash is part of the ELK stack. We run it with some custom plugins on our Kubernetes clusters. It reads from the per-service Kafka topics and writes out to the appropriate OpenSearch cluster.

OpenSearch clusters hold the indexed log data. We currently use eight of these, with each one holding a subset of the overall logging data. Our most vital services live on independent clusters, maximizing fault isolation.

Gateway cluster is a regular OpenSearch cluster that hosts OpenSearch Dashboards and conducts all the cross-cluster searches against the backing ‘data’ clusters. The gateway holds no data itself aside from dashboard metadata. Searches are routed to the correct cluster by way of index patterns: each service has a corresponding index pattern that points to the cluster holding its indexes.
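To make the routing concrete, here is a hedged sketch of what a search through the gateway looks like, using a hypothetical gateway endpoint, remote cluster alias, and index pattern. With cross-cluster search, prefixing the index pattern with the remote cluster’s alias (the alias:index-pattern form) is what tells the gateway to execute the search on that data cluster; the saved index patterns in OpenSearch Dashboards use the same form, so users never need to know where the data lives:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "strings"
)

func main() {
    // Hypothetical gateway endpoint; the query body is standard OpenSearch query DSL.
    const gateway = "https://gateway.internal:9200"
    query := `{"query": {"match": {"service": "SERVICE_FOO_SERVER"}}, "size": 10}`

    // "tier1-cluster:logs-service-foo-*" = remote cluster alias + index pattern.
    url := fmt.Sprintf("%s/%s/_search", gateway, "tier1-cluster:logs-service-foo-*")
    resp, err := http.Post(url, "application/json", strings.NewReader(query))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}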

Managing complexity

More clusters means more operational complexity, so managing this would be crucial. Our approach was to be as config-driven as possible and use Terraform for everything.

The data clusters are defined in a small Golang config file that determines their resourcing and that of the Logstash pods that feed them:

type OpenSearchConfig struct {
    InstanceTypeData   string
    InstanceTypeMaster string
    NodeCount          int
}

type GroupConfig struct {
    OpenSearchConfig    OpenSearchConfig
    LogstashPodCount    int
    LogstashCpuRequests string
}

var Tier1Group = GroupConfig{
    OpenSearchConfig: OpenSearchConfig{
        InstanceTypeData:   "im4gn.2xlarge.elasticsearch",
        InstanceTypeMaster: "r6g.large.elasticsearch",
        NodeCount:          4,
    },
    LogstashPodCount:    2,
    LogstashCpuRequests: "4000m",
}

A simplified config file for setting up a data cluster and the Logstash instance that feeds it.

With the groups defined, we created a config file covering all Plaid services that specifies which cluster each service’s logs go to, along with optional settings for the indexed data such as retention and custom mappings:

type ServiceConfig struct {
    // The enum value for the Service emitting logs
    Service coretypespb.Service
    // Determines in which cluster the logs for this service reside
    Group GroupConfig
    // Optional settings
    IndexConfig     *IndexConfig
    RetentionConfig *RetentionConfig
}

var AllServices = []ServiceConfig{
    {
        Service:     coretypespb.Service_SERVICE_BAR,
        Group:       Tier1Group,
        IndexConfig: BarCustomMappings,
    },
    {
        Service:         coretypespb.Service_SERVICE_FOO,
        Group:           Tier2Group,
        RetentionConfig: NDayRetention,
    },
}

A simplified services config file showing two services mapped to different clusters.

A small script converts these configs into Terraform code, which is applied once the change is merged. When an engineer sets up a new service, they simply add a definition to this config, run the script, and merge their code. All the infrastructure needed to start ingesting logs for the new service is created seamlessly: a Kafka topic, a new data stream on the target cluster, and an index pattern on the gateway cluster so queries resolve to the correct data cluster.
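As an idea of what that script does, here is a simplified, hypothetical sketch: it walks the service config and renders one Terraform module invocation per service. The module name, fields, and source path are illustrative, not Plaid’s actual Terraform:

package main

import (
    "fmt"
    "os"
    "text/template"
)

type ServiceConfig struct {
    Service string // simplified; the real config uses a protobuf enum
    Cluster string // derived from the service's GroupConfig
}

var AllServices = []ServiceConfig{
    {Service: "service_bar", Cluster: "tier1"},
    {Service: "service_foo", Cluster: "tier2"},
}

// One hypothetical module invocation per service; in practice this would
// create the Kafka topic, data stream, and gateway index pattern together.
const tmpl = `
module "logs_{{.Service}}" {
  source       = "./modules/service-logs"
  service      = "{{.Service}}"
  data_cluster = "{{.Cluster}}"
}
`

func main() {
    t := template.Must(template.New("tf").Parse(tmpl))
    for _, svc := range AllServices {
        if err := t.Execute(os.Stdout, svc); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }
}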

Migrating to the new system

With the design and tooling in place, all that remained was to migrate to the new setup. With some careful planning, this ended up being a fairly straightforward process: we provisioned new infrastructure using the Terraform generated by the config files and script, dual-wrote data to achieve parity, and tested rigorously to ensure high performance and data quality.

Results and Takeaways

Since cutover day four months ago, the results have far surpassed expectations. Critically, the new setup has been much more stable, and we’ve seen a reduction in log delay of over 95%.

Fresher logs empower our support team to diagnose and resolve customer issues more efficiently. Additionally, engineers benefit from shortened development cycles and quicker incident mitigation, ultimately delivering a smoother experience for our customers. And observability engineers are freed from incident management to focus on higher-leverage work to support our engineers elsewhere. Close to zero high-urgency alerts also means much less stress!

The new system launched in December 2023. Since then, incidents have fallen dramatically.

The only two substantial incidents so far were both caused by transient node deaths after memory spikes, and both self-resolved with minimal impact. Since tuning the node configuration, the issue appears to be resolved. Multiple smaller clusters seem to be fundamentally more stable than a monolith handling the same traffic.

The more granular clusters allowed us to properly rightsize them for the workloads, and we reduced infrastructure costs by 40%. This granularity also meant much safer upgrades as we can target a single cluster at a time, starting with less critical ones. This ultimately allows us to iterate faster and more safely on further improvements.

Downsides

It wouldn’t be fair to gloss over the potential downsides of this new architecture, but compared to the benefits we think they are really rather minor:

  • Index patterns are more verbose due to the gateway cluster, which caused minor confusion among our users at first.

  • Manual operations on indexes are more complex, as you need to first identify the owning data cluster, but simple tooling has mitigated this (a small sketch of such a helper follows this list).

  • There’s a hard limit of 20 clusters for cross-cluster search on AWS. We only use eight currently, so we have plenty of headroom.
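On the second point, the tooling can be as simple as a lookup from service to owning data cluster, generated from the same config shown earlier. This is a hypothetical sketch, not our actual tool:

package main

import (
    "fmt"
    "os"
)

// Simplified stand-in for the generated service -> data cluster mapping.
var serviceToCluster = map[string]string{
    "service_foo": "tier2-cluster.internal:9200",
    "service_bar": "tier1-cluster.internal:9200",
}

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: which-cluster <service>")
        os.Exit(1)
    }
    cluster, ok := serviceToCluster[os.Args[1]]
    if !ok {
        fmt.Fprintf(os.Stderr, "unknown service %q\n", os.Args[1])
        os.Exit(1)
    }
    fmt.Printf("indexes for %s live on %s\n", os.Args[1], cluster)
}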

Lessons

Optimize first, replace later

When platform engineering teams face challenges scaling their existing architectures, the temptation to rip and replace is strong. We look for the greener grass of new open-source tools or get distracted by shiny SaaS offerings with fancy marketing brochures. But whole new systems bring headaches of their own and can take much longer to adopt. Sometimes the best path is to invest in what you have. Although we did overhaul our existing logging setup, we used the same basic building blocks, which dramatically reduced implementation and adoption time. This meant faster resolution of our issues, which was ultimately better for our customers.

Multi-cluster architectures are flexible and powerful 

Single-cluster Elasticsearch designs make sense at first, especially since they offer both horizontal and vertical scaling by adding or upgrading nodes. But there are some inherent noisy neighbor problems with this architecture, since failures can always cascade across the cluster. After migrating to a multi-cluster setup, the advantages of fault isolation, targeted rightsizing and tuning, ease of upgrades, and general stability became clear. This is a lesson we’ll take to other systems since it’s not unique to this one.

Relatedly, don’t fear the complexity of multi-cluster. With the right upfront planning and investment in tooling, the burden is minimal. Our choice of a regular Elasticsearch cluster as a gateway to the sub-clusters made this especially easy.

Config is king

This one is more obvious, but it’s always worth a reminder. Especially as systems get more complex, having the state of the system encoded in configuration makes them much easier to understand and reason about. Having this in a single, centrally accessible place is also important, rather than scattered in several repositories. It also results in major time savings down the line, as any tweaks just require small modifications. Make everything a config, script it out, and reap the benefits! 

Focus on impact

Our team used to live with the burden of high-maintenance systems. We hadn’t thought deeply about investing in what seemed like mere KTLO reduction, as there was always something else to work on. However, after reviewing key on-call metrics such as our alert count, we decided to treat them as KPIs for the team and prioritize fixing them. This data-driven approach to uncovering impact has yielded real benefits for us.

Conclusion

Before hitting the reset button, consider how you can breathe new life into your systems by rethinking their architecture. For us, embracing the power and complexity of a multi-cluster design coupled with a config-centric strategy has reaped enormous benefits for our logging platform, which ultimately led to a smoother experience for our customers.

Acknowledgments

This work was made possible with support from Plaid’s Data Infrastructure, Developer Efficiency, Techops, Customer Support, and Core Infrastructure teams.