
March 04, 2025
Goodbye Dockerfile, Hello Bazel: Doubling Our CI Speed
In the first half of 2024, Plaid’s Developer Efficiency team set out to speed up our largest CI pipeline without disrupting developer workflows—and ended up cutting CI times by 50%, shrinking container images by 90%, and making local iteration up to 5x faster.
In this post, we’ll walk through how we migrated 150+ Go services from a single Dockerfile to Bazel’s rules_oci, leveraging ephemeral BUILD files to minimize migration overhead. Along the way, we’ll share tips for:
Choosing a remote cache
Dealing with performance hurdles in Gazelle
Minimizing impact on our developers’ day-to-day workflows
Motivation
In December 2023, our Go monorepo was home to 150+ Go services. They were thoroughly tested in CI with the help of dozens of high-fidelity end-to-end integration tests based on docker-compose. To guarantee reproducibility, every integration test was designed to build all affected components before every run. Historically, all Go services shared a single container image with a single binary capable of running any one of the product’s required services. While controversial, this design worked well for years, as Go is known for its fast compilation times. Additionally, building a single container image is faster and easier to maintain than building hundreds of container images before each test.
CI Slowness
As with many clever designs, this one eventually reached its limits. Internal developer experience surveys showed that developers were not happy with CI runtimes. Our monorepo CI was bottlenecked by the time it took to build this mega container image, and a p50 (median) duration of 20 minutes and a p95 duration of 35 minutes for testing changes proved to be the tipping point for engineers. There had been multiple attempts to optimize the image over the years: a complex multi-stage Dockerfile, pre-warmed Go build caches, and other tricks. Nevertheless, the image build alone still took up to 10 minutes and produced a gigantic 1.9 GB image capable of running any of the services.
It became apparent that our existing approach was hitting its limits, though simply splitting the monolithic Dockerfile into many smaller ones would bring its own challenges. While building a single container image ensures all services remain up to date, multiple images require a selective building approach that identifies which of the 150 images must be rebuilt for each pull request and which can be pulled from the internal registry. Without a robust way to handle this, 150 smaller images can be just as slow as—if not slower than—a single large image.
Can Bazel Help?
At the same time, we kept hearing about Bazel. Success stories from companies of all sizes painted a picture of migrations that slashed build times by 90%. These stories often focused on repositories centered around one or two main languages, with the speed gains coming from improved caching for executables or test results. But then we discovered that Bazel could also build container images—thanks to the open-source project rules_oci. If successful, Bazel’s strict dependency graph could bring several advantages to our CI process, including:
Accurate dependency tracking. With Bazel (and tools like bazel-diff), each pull request can be precisely mapped to the affected service images.
Remote caching for faster builds. Bazel supports remote caching, reducing how much code needs to be rebuilt across CI workers.
Optimized container image builds. With rules_oci, Bazel can build container images faster and with more granular control, without needing Docker.
Proof of Concept
To test our hypothesis, we started with a single Go service image fully built with Bazel using BUILD files generated by Gazelle. It took a senior engineer with no prior Bazel experience about 4 weeks to complete this step, and the results looked promising: the container was much smaller and faster to build. The next step was to scale the success to all services.
We aimed to minimize how much of Bazel’s complexity developers would need to see in the early stages of the project. That said, the user experience of managing the BUILD files was a concern from day one. The rough migration plan was to capitalize on the existing abstraction called devenv that our engineers interact with. Instead of running Docker commands directly, engineers were accustomed to running devenv service build <service-name>.
After enabling Bazel builds for a handful of services, we duplicated our monorepo’s CI pipeline and ran the copy on every change to the monorepo while keeping it completely invisible to engineers. This shadow pipeline was configured to run only a subset of all integration tests, focusing on those involving Bazel-enabled services. It proved invaluable throughout the migration: it helped us estimate the impact, increase operational maturity, and continuously iterate on our evolving Bazel setup. All of this was done before a single service was cut over to the new container image in production.
Going from the proof of concept to having all relevant services migrated in production took about 7 months with a team of 4 developer efficiency engineers and relatively minimal involvement from the teams that operated the services. In this blog post, we picked three themes we believe will be most helpful to those attempting a similar migration:
Moving from a single Dockerfile to per-service container images built with rules_oci
Choosing and integrating with a Bazel remote cache provider
Optimizing the developer experience of maintaining Bazel BUILD files with Gazelle
From Dockerfile to rules_oci
When we decided to split one large Dockerfile into over 150 per-service container images built with Bazel, we followed a few key principles:
Images must be as small as possible. Shifting from one image to over a hundred risked ballooning network and storage costs, so keeping each image compact was important.
Maintaining Bazel image targets must be as automated as possible. We needed to match the smoothness of existing native Docker workflows—otherwise, engineers might reject Bazel if it made their daily tasks more cumbersome.
Each image must include all data files needed by the service from the monorepo. Even a single missing file hidden deep in a dependency could cause runtime failures, so we required a method that guaranteed completeness.
To minimize image size, we combined the service binary and additional in-house tools into one multi-call binary, using an approach similar to BusyBox. Because Go produces statically linked binaries, the multi-call binary is always smaller than the sum of its members’ sizes. A “boilerplate” macro based on write_source_files was created to automatically generate main.go files for multi-call binaries and the wiring scripts that accompany them.
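As a rough illustration of how such a boilerplate macro can be wired up, a generation step plus write_source_files (from aspect_bazel_lib) keeps a rendered main.go checked in and in sync with the tool list. This is a minimal sketch, not our exact macro; the render tool, its flags, and all target names are hypothetical:

load("@aspect_bazel_lib//lib:write_source_files.bzl", "write_source_files")

# Render main.go from the list of tools that should be bundled into the
# multi-call binary (the render tool and its --tools flag are made up).
genrule(
    name = "render_multicall_main",
    outs = ["main.generated.go"],
    cmd = "$(location //tools/boilerplate:render) --tools=service,healthcheck > $@",
    tools = ["//tools/boilerplate:render"],
)

# Copy the rendered file back into the source tree so main.go stays in sync
# with the macro arguments; write_source_files also emits a diff test that
# fails CI when the checked-in file drifts.
write_source_files(
    name = "update_multicall_main",
    files = {
        "main.go": ":render_multicall_main",
    },
)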
For image targets, we created a macro that adds the necessary layer targets (pkg_tar) and an oci_image target. For bootstrapping new services, we also extended Gazelle with a custom plugin that generates a macro call in service BUILD files, allowing engineers to have their new services buildable by Bazel without having to do anything beyond running Gazelle.
Iterating on Data Files
In the beginning, we used macro attributes to enable specific container layers with data files. For example, for services that require translation files, specifying i18n = True in the service macro call would add YAML files with translations to the image.
The first iteration of the macro looked something like this (some details omitted for clarity):
1load("@rules_oci//oci:defs.bzl", "oci_image", "oci_tarball")2load("@rules_pkg//pkg:tar.bzl", "pkg_tar")3
4KNOWN_FEATURES = [5 "environment",6 "i18n",7 "migrations",8 # etc...9]10
11
12def plaid_go_service_image(13 name,14 # derived from name by default15 service_name=None,16 # the main service binary. Defaults to ":${service_name}"17 binary=None,18 # additional files to be added to the service harness layer19 files=[],20 # additional tars to be added to the service harness layer21 deps=[],22 # derived from name by default23 image_name=None,24 image_base="@base",25 ** kwfeatures,26):27
28 # Implementation details omitted for brevity...29 # Layers are created using pkg_tar()30 # according to macro arguments and "known features"31
32 oci_image(33 name="image",34 base=image_base,35 tars=[36 ":monorepo_structure_tar",37 ":service_harness_tar",38 ":service_binary_tar",39 ],40 workdir="/app",41 )42
43 oci_tarball(44 name="image_tarball",45 image=":image",46 repo_tags=[47 "plaid/devenv/service/{}:latest".format(image_name)48 ],49 )
Example invocation in a BUILD file:
# service/example/BUILD.bazel

plaid_go_service_image(
    name = "example_image",
    service_name = "example",
    files = ["//services/example/server:templates"],
    # one of the "known supported features" that control image contents
    migrations = True,
    # another "known feature"
    i18n = True,
)
We found that this approach was fragile and error-prone in practice. It was too easy to add a new library dependency without specifying the proper flag in the macro call, causing the service to fail on file access (either at startup or, in the worst case, later when lazily loaded files were first read). Therefore, we switched to a different model in which all data files were added to runfiles through the data attribute of the relevant go_library rule.
With this change, data files from all Go packages that are transitive dependencies of a service were automatically included in the service image, making it far more likely that they would be available at runtime without any extra configuration. The importance of this automation is hard to overestimate, given that most engineers were unaware of the ongoing migration and didn’t realize they might need to add a flag to plaid_go_service_image(...).
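For illustration, the pattern looks roughly like the following (the package name, import path, and file layout are made up; the load path is the conventional rules_go one and may differ in your setup). Any service binary that depends on this library automatically carries the YAML files in its runfiles:

load("@io_bazel_rules_go//go:def.bzl", "go_library")

go_library(
    name = "i18n",
    srcs = ["i18n.go"],
    # Data files land in the runfiles of every binary that depends on this
    # library, and from there in the service image.
    data = glob(["translations/*.yaml"]),
    importpath = "plaid.example/lib/i18n",  # hypothetical import path
    visibility = ["//visibility:public"],
)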
See the improved runfiles-based version of the macro. Note how a custom rule collects runtime dependencies and includes them in the final image.
1load("@rules_oci//oci:defs.bzl", "oci_image", "oci_load")2load("@rules_pkg//pkg:tar.bzl", "pkg_tar")3
4def _collect_runfiles_impl(ctx):5 inputs = ctx.files.srcs[:]6 all_files = depset(direct = inputs, transitive = [7 src[DefaultInfo].default_runfiles.files8 for src in ctx.attr.srcs9 ])10 # Only use source files here11 source_files = [f for f in all_files.to_list() if f.is_source]12 return [DefaultInfo(files = depset(source_files))]13
14# This rule creates a target15# that lists all transitive runfiles of srcs as its outputs16collect_runfiles = rule(17 doc = "collect source runfiles",18 implementation = _collect_runfiles_impl,19 attrs = {20 "srcs": attr.label_list(21 allow_files = True,22 ),23 },24)25
26def runfiles_tar(name, srcs):27 runfiles_name = "_{}_runfiles".format(name)28 collect_runfiles(name = runfiles_name, srcs = srcs)29
30 pkg_tar(31 name,32 package_dir = "/app",33 srcs = [":{}".format(runfiles_name)],34 strip_prefix = ".",35 )36
37def plaid_go_service_image(38 name,39 service_name = None,40 binary = None,41 srcs = [],42 deps = [],43 image_name = None,44 image_base = "@base"45 ):46 # Some details omitted for brevity...47
48 runfiles_tar(name = "runfiles_tar", srcs = [binary])49
50 # Use runfiles_tar as an oci_image layer... (omitted)
You can find a sample invocation below. The macro call is added by our Gazelle extension based on repository conventions. Most files in the srcs attribute are detected and added automatically; custom files can be added manually (see the # keep directive).
# service/example/BUILD.bazel

plaid_go_service_image(
    name = "example_image",
    srcs = [
        "ensure.sh",
        "start.sh",
        "//services/example/examplecron/environment:files",  # keep
    ],
)
Choosing a Remote Cache Provider
Bazel is not automatically fast. Since Bazel wraps native tools, it cannot be faster than those tools on its own. To beat the combination of Docker and Go and benefit from the improved dependency organization, we knew we needed a Bazel remote cache. However, we weren’t sure whether to deploy and maintain one of the known open-source solutions or rely on a vendor to run the caching cluster for us, so we began experimenting.
Networking Is Expensive
The first remote cache provider we worked with charged per terabyte downloaded. The shadow pipeline revealed that, once the migration was complete, our CI cluster would handle about 350TB of inbound traffic each month, on top of comparable AWS egress costs. This model was well beyond our budget, and sending that much data over the Internet was very inefficient.
We couldn’t find an open-source solution that was both highly available and cost-effective for our scale. Specifically, we needed:
All traffic to stay within our private network
Minimal maintenance overhead for our team
Reduced infrastructure costs (network, storage, and compute)
With these requirements in mind, after more research and a thorough build-buy analysis, we went with EngFlow (www.engflow.com). They offered a co-managed model: EngFlow engineers maintain the cache cluster hosted in our AWS sub-account, where it can benefit from the savings plans we already had with AWS. With this setup, we no longer had to worry about optimizing our network patterns and could instead focus on creating a seamless, secure, and efficient integration.
The Final Cache Setup
Automated certificate management. We built a system on top of AWS Private CA for automatic renewal and creation of leaf certificates needed to access the cache. The design was focused on preventing service disruptions and being as invisible to the users as possible.
Restricted write permissions. Our CI runners were the only environment allowed to write to the cache, leaving all developers with read-only access. This setup is considered good practice for preventing accidental cache poisoning (see the configuration sketch after this list).
Isolated deployment cache. If the deployment pipeline shared the same cache as CI, an attacker with write access (e.g., anyone who can open a PR) could craft a malicious cache entry. Bazel would then use it and unknowingly build a compromised container for production. We spun up a dedicated cache cluster for deployments, restricting write access to the secure deployment pipeline. By isolating deployment caching, we eliminated the risk of compromised artifacts entering production.
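To make the read/write split concrete, here is a minimal sketch of what the client-side configuration can look like. The endpoints and certificate paths are placeholders rather than our actual setup; the flags themselves are standard Bazel remote-cache options:

# .bazelrc shared by developers: read-only access to the CI cache
build --remote_cache=grpcs://cache.ci.internal.example:443
build --tls_client_certificate=/etc/bazel/certs/client.crt
build --tls_client_key=/etc/bazel/certs/client.key
build --remote_upload_local_results=false

# CI runners opt back into uploading results
build:ci --remote_upload_local_results=true

# Deployments point at a separate, write-restricted cluster
build:deploy --remote_cache=grpcs://cache.deploy.internal.example:443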
Even with the cache clusters fully managed by EngFlow, a seamless yet secure integration was a challenging task.
Maintaining BUILD Files
Gazelle is an open-source BUILD file generation tool that natively supports Go and has a large ecosystem of plugins that extend its functionality to other languages. During the POC, we timed Gazelle to finish generation in under 10 seconds on average. While this wasn’t a particularly long time, we figured it was long enough to make our developers unhappy, especially given that it needs to run on almost every commit. Later, we learned that in some situations the generation can get even slower. Additionally, with over a hundred services needing migration, we expected the migration to take multiple months, during which all developers would be forced to endure an extra generation step that slowed them down without any obvious benefit.
Ephemeral BUILD Files
Inspiration came from the 2023 talk Bazel Migration using Fully Ephemeral BUILD Files. Instead of asking all engineers to start regenerating the BUILD files, we changed our tooling to run Gazelle before running Bazel builds. Throughout the migration, we still chose to manually regenerate and check in the BUILD files once a week. This way, the checked-in BUILD files were allowed to drift somewhat out of date from the code on the main branch, but they remained fairly representative, and on-the-fly regeneration on the remote developer machines had proven to be fast enough. Going fully ephemeral was discussed and rejected, mainly due to the various customizations needed in the BUILD files and the confusion it would cause.
Gazelle on Engineers’ Laptops
Fast forward to July 2024. All relevant services had been migrated from Dockerfile to Bazel and successfully deployed to production without disruption. It was time to introduce the BUILD files to our engineers. We added a check to CI to enforce BUILD file correctness by running gazelle update -mode=diff and announced the new required check and the exact command needed to regenerate the BUILD files. It was the first time engineers would run Gazelle locally on their laptops.
A week later, reports started coming in from users complaining that running the tool was taking too long, sometimes multiple minutes. This took us by surprise – the team had not encountered any slowness in the 6 months leading up to that moment, and generation took only a handful of seconds in CI. Once we added instrumentation to our tooling, we found a median duration of about 20 seconds and a p95 duration extending to several minutes.
Learnings
It took us multiple months of trial and error to get the situation under control. Here’s what we learned:
Avoid running Gazelle with bazel run; use a prebuilt Gazelle binary instead. Although Bazel excels at incremental builds, it can add noticeable latency when fetching updated external repositories or rebuilding its analysis cache. Moreover, if you maintain a custom Gazelle plugin in the same repository, you may encounter a Catch-22: Gazelle is needed to generate the correct BUILD files, but the BUILD files must be correct for Gazelle to run.
Review and exhaustively populate your .bazelignore file. Gazelle will always scan the entire source tree, even if only a single BUILD file update is needed. This can be especially painful if the repository also contains Node.js projects, since node_modules directories will likely hold hundreds of thousands of files that Gazelle will dutifully scan to build its internal index.
Encourage IDE on-save behavior to avoid running Gazelle manually. Both IntelliJ and VSCode-based IDEs can be configured to run arbitrary commands every time a file is saved. Even though it still needs to index the entire source tree, Gazelle runs faster when invoked on just the folder containing the changed file. This can be further optimized using --index=false, but that can lead to subtle differences in the generated BUILD files; we abandoned this line of inquiry after some experimentation but may revisit it in the future.
Beware of Endpoint Detection and Response (EDR) software. Security tooling on developer laptops runs in kernel mode. While designed to have minimal impact, it can disproportionately affect programs with extensive I/O, such as Gazelle.
Hide bazel-*/ convenience symlinks from the workspace root. We discovered that both git and IDEs became noticeably slower or glitchy as they began indexing unrelated files inside the Bazel work tree. Hiding folders that engineers are unlikely to use may resolve speed-related complaints; a sketch of the relevant flags follows this list (see the Bazel command-line reference for details).
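A minimal .bazelrc sketch of the symlink clean-up, assuming the standard Bazel flags (the .bazel/ prefix is an arbitrary choice, and either option on its own is usually enough):

# Tuck convenience symlinks under a hidden directory instead of bazel-*/
build --symlink_prefix=.bazel/

# Or skip creating the symlinks entirely
# build --experimental_convenience_symlinks=ignore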
After several rounds of improvements, we reached a point where generating all BUILD files across the repository was fast enough. Over 60% of users chose to configure their IDEs to run Gazelle on-save. In hindsight, gathering performance metrics from the start would have helped us avoid surprises since laptop performance can vary widely by model and usage pattern, affecting generation times in practice.
Results
The overhaul was pulled off by just four engineers without overwhelming the rest of the organization. We split the work into two phases:
First, we built the foundation—crafting the macros, Gazelle plugin, and updates to our devenv tooling.
Then, we scaled the approach to every service using a straightforward boilerplate.
The methodical rollout let us iron out issues early and avoid big headaches when we converted dozens of services at once.
Notably, we didn’t have to teach many people how to use Bazel or write complex BUILD files. For the most part, things have just gotten faster. The only trade-off was the BUILD file maintenance, but over time, we made it faster, and it became more accepted.
Faster CI
By June 2024, we had shifted all relevant services to build with Bazel, and the difference was striking. Our CI pipeline, which once sat at a 20-minute median duration, now reliably finishes in about half that time.
Faster Local Development
Locally, engineers are used to running devenv service reload <service_name> when iterating on a service. This command stops the container, rebuilds the image, and starts the service again. Despite not focusing on the local development experience, we quickly discovered that local iteration time had improved significantly: developers saw their build loop shrink by as much as fivefold.
As one engineer jokingly said:
“When am I supposed to take my coffee breaks now?!”
Other Results
Bazel, as a build tool for Go, is a solid foundation for further CI optimizations. After the service image migration, we were able to do more:
Better dependency management. Bazel’s accurate build graph helps detect and eliminate unnecessary dependency chains. In Go, importing a single constant from a package can inadvertently pull in dozens of unneeded dependencies. This increases the size of Docker images and leads to excessive test runs. With Bazel, we’ve already identified and mitigated several such inefficiencies.
Faster unit testing without Docker. Bazel’s sandboxing capabilities serve as an effective alternative to using Docker for unit test isolation. Following the service image migration outlined in this blog, we’ve also accelerated unit test execution. Now, Plaid engineers receive their first CI feedback in under a minute, significantly improving iteration speed.
With Bazel’s caching and strict dependency tracking, we’ve turned inefficient container image builds into a streamlined, scalable system. We met our primary goal – faster CI – while also improving local development and uncovering fresh opportunities for future improvements.
Resources
Throughout the migration, these presentations had a big impact on the project, and we’d like to share them here.
BazelCon 2019 Day 2: Bazel Migration Patterns - a very helpful overview of Bazel migration patterns
Markus Hofbauer – Bazel Migration using Fully Ephemeral BUILD Files – BazelCon Community Day 2023 - an example of using ephemeral BUILD files
Optimizing Gazelle for Scale and Performance in Uber's Monorepo - Tyler French, Uber - running Gazelle at Uber scale, covering many themes similar to what we discovered.