
February 03, 2026
Build less, merge faster: avoiding diamond merges with a merge queue
As the Plaid monorepo grew to hundreds of commits per day, the Developer Efficiency team found that operating without a merge queue had become increasingly costly. We set out to build a merge queue with a target median latency of five minutes. Read on to learn why the naive “build everything” approach failed, and how the work led us to contribute a bug fix to the rules_go project.
This work builds on our earlier effort to speed up CI by migrating our Go services and container builds to Bazel, which cut CI times in half and significantly reduced build overhead – Goodbye Dockerfile, Hello Bazel.
Why we needed a merge queue
Operating without a merge queue
It might seem surprising that we waited this long to invest in a merge queue. For context, the Plaid monorepo is home to hundreds of projects written in Go, TypeScript, Ruby, Bash, Terraform, and more, with Go alone accounting for at least 75% of the codebase. Most engineering teams reach for a merge queue much earlier, often once they have a few dozen active contributors. We were able to postpone it thanks to two pragmatic guardrails that kept our main branch stable:
Auto-merge in CI – the monorepo CI brings the PR branch up to date with the latest default branch at the start of every run, before running any validations. This keeps tests aligned with the current state of the codebase. The trade-off is that CI may fail even if everything passes locally, because new changes on the default branch might conflict with the work-in-progress code in the PR.
This is effective at reducing diamond merges when the PR merges shortly after its CI run. A stale PR, on the other hand, becomes more likely to break the main branch the longer it stays open.
PR recency enforcer – a simple daily job that checks every open PR by comparing its merge base against the latest commit on the release branch. Any PR more than two days behind is marked as stale and requires the author to rebase before it can merge. The job was introduced in 2021 to help reduce diamond merges. (Both guardrails are sketched below.)
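In rough terms, the two guardrails boil down to a handful of git operations. The sketch below is illustrative only (it assumes the default branch is main and shows the checks for a single PR checkout); it is not our actual CI code.

# Guardrail 1: auto-merge in CI – bring the PR branch up to date before running validations.
git fetch origin main
git merge --no-edit origin/main   # surface conflicts with the default branch early

# Guardrail 2: PR recency enforcer – flag PRs whose merge base has fallen too far behind.
merge_base=$(git merge-base origin/main HEAD)
age_days=$(( ($(date +%s) - $(git show -s --format=%ct "$merge_base")) / 86400 ))
if [ "$age_days" -gt 2 ]; then
  echo "PR is stale: merge base is $age_days days behind; please rebase."
fi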
With these guardrails, we were able to scale to roughly 700 commits a week, with only minutes between commits during peak hours. In early 2025, however, we began to see multiple breakages a month, each requiring hours of engineering time and coordinated effort to resolve. A large portion of breakages were due to diamond merges.
What’s a diamond merge?
Also known as a “green-green” merge, a diamond merge occurs when two PRs each pass CI on their own, but break the main branch once they are both merged. Let’s look at a simple example to illustrate the problem.
PR 1 renames a helper function as part of a refactor, while PR 2 adds a new call site that uses the old name. Individually, both PRs pass CI. But if both merge in rapid succession, the result is broken:
LoadConfig no longer exists (PR 1 removed it)
Init() still calls it (from PR 2)
The main branch fails to build after both PRs land

// Main branch in helpers.go
func LoadConfig() string {
    return "config"
}

// PR 1: Rename the helper in helpers.go
func LoadConfiguration() string { // renamed
    return "config"
}

// PR 2: Add a new callsite in logic.go
func Init() {
    _ = helpers.LoadConfig() // new usage of the old name
}
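Once both PRs land, compiling the affected package on main fails with an error along these lines (output abbreviated and illustrative):

go build ./...
# ./logic.go: undefined: helpers.LoadConfig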
This is a well-known problem that has finally caught up with our monorepo. A merge queue is the standard solution: once the first PR merges, the next one is rebased onto the updated main branch; if it fails validation in the merge queue, it is removed until the author updates it to align with the latest main branch.
First attempt: build everything
At Plaid, we use GitHub as our source control provider, so in theory, enabling a merge queue could be as easy as clicking a checkbox. In our case, the merge queue would run the entire CI pipeline: linters, image builds, unit tests across multiple language ecosystems, and integration tests. However, new commits merge every few minutes during peak hours, and the complete set of CI tests can take up to 30 minutes, making it impossible to process all changes in time. Enabling the merge queue wasn’t going to be “just a checkbox” if we wanted to keep the overhead under 5 minutes.
As a first step toward simplifying what ran in the merge queue, we decided to exclude complex integration tests and focus on builds instead. Given that the Go ecosystem had already been migrated to Bazel and that Go accounts for the majority of the codebase, it seemed reasonable that if we simply ran bazel build on all affected targets, we could benefit from Bazel caching and incremental builds, cover roughly 95% of common build failures, and still stay under the 5-minute target.
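Conceptually, the naive validation step looked something like the sketch below. The file-to-label mapping is glossed over (file_to_label is a hypothetical helper; a real implementation has to respect package boundaries) and the query is simplified, but the shape is: find everything affected by the change, then build all of it.

# Naive merge queue validation (sketch, not our actual pipeline).
# Map the PR's changed files to Bazel source-file labels, e.g. lib/foo/foo.go -> //lib/foo:foo.go.
changed_labels=$(git diff --name-only origin/main...HEAD | file_to_label)   # hypothetical helper
# Everything that transitively depends on those files is considered affected,
# and all of it gets built, relying on the remote cache for anything unchanged.
bazel query "rdeps(//..., set(${changed_labels}))" | xargs bazel build --keep_going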
What we observed was surprising. Once the merge queue was enabled, CI runners began exhausting disk space and hitting I/O throttling. The runners were provisioned with 400 GB gp3 EBS volumes capped at 8,000 IOPS, which had previously been sufficient, but quickly became a bottleneck under the increased load. The remote Bazel cache also saw a sharp increase in writes: we expected a 5–10% load increase, but ended up having to double the caching cluster capacity to keep it stable after briefly overloading our vendor-managed infrastructure. This showed up directly in merge queue performance, with a median duration of around 6 minutes and p95 exceeding 12 minutes – well beyond our 5-minute target.
Simply building all affected targets wasn’t a good enough solution. We had to find a way to further reduce the scope of our merge queue while keeping it effective.
Second attempt: reducing the build scope
We decided to look at the type of diamond merges we had encountered in recent months. Interestingly, nearly all diamond merges originated in the Go ecosystem. Other languages and tools almost never caused merge queue failures, largely due to a smaller number of contributors. When failures did occur, they were overwhelmingly trivial compilation errors and, less frequently, Go linting issues. Diamond merges rarely manifested as test failures.
The naive approach built all affected Bazel targets, including Go libraries, tests, binaries, and the container images that package those binaries. In practice, many of these targets were redundant in a merge queue context. Go-related Bazel targets are highly correlated in terms of buildability: if a service binary fails to build, the container image that packages it will also fail; conversely, when an image fails to build, the root cause is almost always the underlying binary. Building container images therefore added cost without providing additional signal. Removing image builds reduced disk pressure and resource contention, but the merge queue was still too slow.
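One way to express that filtering is to keep only Go rules in the affected set with a kind() query. The exact rule kinds to exclude (for example, oci_image vs. container_image) depend on the repository's image rules, so treat this as a sketch:

# Drop container images and other packaging targets; keep only Go rules.
changed_labels="//lib/foo:foo.go"   # example; derived from the PR's changed files as before
bazel query "kind('go_(library|binary|test)', rdeps(//..., set(${changed_labels})))" \
  | xargs bazel build --keep_going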
Key optimization: compile-only validation
To further reduce the scope, we focused on what actually needed to be built.
Consider a simple example: a Go library, a binary with a no-op main function, and two no-op test functions (one in package foo, one in package foo_test).
1load("@io_bazel_rules_go//go:def.bzl", "go_binary", "go_library", "go_test")2
3go_library(4 name = "foo",5 srcs = ["foo.go"],6 visibility = ["//visibility:public"],7)8
9go_binary(10 name = "foo_binary",11 srcs = ["main.go"],12 deps = [":foo"],13)14
15go_test(16 name = "foo_test",17 srcs = ["foo_test.go"],18 embed = [":foo"],19)In Bazel, a go_binary is produced from a go_library; in practice, failures to build a binary almost always originate in the underlying library. Go static binary linking adds overhead, but is unlikely to surface diamond merge issues, so go_binary targets can be safely skipped.
The same principle applies to go_test. Under the hood, go_test produces object archives and a linked test binary. Bazel exposes these as separate output groups. By building only the compilation_outputs output group, we were able to validate that tests compile without paying the cost of linking.
Simply adding the --output_groups argument significantly reduces the size of the output and the total number of required actions. Compare:
Before: bazel build //lib/foo/...
23 actions
4.6M output size
foo_binary: 1.5M
foo_test: 3.1M
foo.x: 724B
After: bazel build //lib/foo/... --output_groups=compilation_outputs
10 actions
58K output size
foo.a: 3.9K
foo_binary.a: 21K
foo_test.internal.a: 33K
foo_test_test.external.a: 558B
This approach was noticeably faster and less disk-intensive, with essentially the same correctness tradeoffs: compilation errors, the dominant cause of diamond merges, are still caught, while the cost of linking is skipped.
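Putting the pieces together, the merge queue validation ends up roughly as below: only Go libraries and tests are selected, and only their compilation outputs are built (same assumptions and simplifications as the earlier sketches).

# Compile-only merge queue validation (sketch): no binaries, no images, no linking.
changed_labels="//lib/foo:foo.go"   # example; derived from the PR's changed files
bazel query "kind('go_(library|test)', rdeps(//..., set(${changed_labels})))" \
  | xargs bazel build --keep_going --output_groups=compilation_outputs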
Implementing this optimization required fixing a previously unknown, long-standing bug in rules_go, which we contributed upstream, making the approach available to all Bazel Go users (bazel-contrib/rules_go#4437).
When we evaluated --output_groups=compilation_outputs, we discovered that it failed to catch compilation errors in Go tests (e.g. foo_test). As a result, syntax or type errors in *_test.go files could silently pass a compile-only merge queue run. We traced the issue to how go_test populated the compilation_outputs output group and confirmed with the maintainers that this behavior was unintended. Surprisingly, the bug had existed since the output group was first introduced in 2018 (bazel-contrib/rules_go#1756), unnoticed until this use case surfaced.
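As a rough illustration of the failure mode (not the exact reproduction we used): before the fix, introducing an obvious type error into a test file would not fail a compile-only build of the test target.

# Append an obvious type error to the test sources.
cat >> lib/foo/foo_test.go <<'EOF'
func broken() int { return "not an int" }
EOF
# Before the rules_go fix, this still succeeded, because the test sources were
# not included in the compilation_outputs output group.
bazel build //lib/foo:foo_test --output_groups=compilation_outputs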
Results
The progression in merge queue latency reflects the sequence of changes we made over time:
In June 2025, we introduced a merge queue in no-op mode, which established a baseline overhead of roughly 40 seconds per merge.
Shortly after, we enabled a naive “build everything” configuration, assuming Bazel caching would keep validation fast. Instead, merge queue latency spiked, with median duration climbing to ~6 minutes and p95 exceeding 12 minutes.
Over the following weeks, we iteratively narrowed the scope of what ran in the merge queue.
By late August, after tightening the scope, we brought the median merge queue time down to 1.8 minutes, where it has remained stable since.
This project is representative of the kind of work the Platform teams do at Plaid: operating at scale, identifying systemic bottlenecks, and making targeted investments that improve reliability and velocity across the organization. If you’re interested in solving these kinds of problems, keep an eye on Platform engineering roles across Plaid at plaid.com/careers.