
March 04, 2026
Building a smarter chatbot
This is Bill
Bill is an AI-powered chatbot who likes to read our documentation for fun. We built Bill during Plaiderdays, Plaid’s annual hackathon, back in the summer of 2023, and he was promising enough that we decided to put Bill into production, where he’s been keeping developers happy and informed ever since. And if you want to find out more, we have a whole other blog post that goes into more detail about how we originally built Bill.
Now when we first put Bill into production, we all assumed that he would eventually be lapped by commercial solutions and replaced within a year. Turns out though, that hasn’t happened. Bill is still pretty good! In fact, we ended up forking a version of Bill that lives in our support flow and performs tasks for signed in users, like searching logs, investigating known issues, and checking integration health. So, yes, I’ll say it: Bill may also be slowly becoming an agent.
ChatGPT's attempt to turn Bill into an agent.
Once we realized that the public version of Bill was going to stick around for a while, we decided to spend some time making him smarter. This work included some basic improvements like:
Updating our underlying AI models
Optimizing some of our prompts
Adding additional context to our code samples
Adding a reranker
And those all worked pretty well, but there was one issue that Bill always struggled with… reference docs.
The trouble with reference docs
So what we're referring to here are the documents that fall into the “API Reference” section of our documentation – these are typically big tables that contain a description of every property in every request and response object for every endpoint. As for our other documentation (like our overviews and getting started guides), we'll refer to those as prose documents from now on.
At Plaid, our reference docs tend to contain a lot of information. Almost definitely more than your typical API provider. Let’s compare a snippet of our reference docs to that of another well-respected developer platform.
Wow. That’s a lot of text!
Now, there’s a good reason behind this – much of the complexity in using Plaid isn’t really in making the calls to the API; it’s understanding the data you get back. Some of the objects that Plaid returns from the API are pretty huge. And there’s nuance here – these fields often require some explanation for you to correctly interpret the data you’re receiving. If we tried to cram all of that explanation into our getting started guides, a developer would be understandably overwhelmed. So leaving those details in our reference documentation makes the most sense.
The problem is that this didn’t make sense to Bill. It turned out that he almost never used our reference documentation to answer questions, even in situations where it definitely would have been useful.
So why is that? Well, we weren't entirely sure, but we had a few theories, which led to a few solutions that we attempted over the last several months. Let’s see how they turned out!
Attempt #1: Give Bill More Context
First, a mini-review
So before we get into our different solutions, let’s quickly review how traditional RAG (or retrieval-augmented generation) models work.
With a traditional RAG-based solution, you start by fetching all of the content your LLM might potentially need. For us, that included our documentation, help center articles, video tutorials from Plaid's YouTube channel, and so on. You then break each of these documents into smaller chunks, roughly 3 to 5 paragraphs each.
Next, you assign a “meaning” to each chunk of text by using an embedding API – essentially a process that distills the meaning of a bunch of text into a really big vector. Then you store the vector and the original text in a vector database (a database that’s really good at finding similar vectors).
So when a user asks a question, you use the same embedding API to find those chunks of your documentation that are closest in meaning to your user’s question, and therefore most likely to be relevant.
You take those chunks of information and feed them to an LLM, with the hope that it can answer the user’s question based on the information you provided.
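The whole retrieve-then-answer loop can be sketched in a few lines. This is a toy illustration, not Plaid’s actual pipeline: the bag-of-words “embedding” and the in-memory list stand in for a real embedding API and a real vector database, and the corpus snippets are made up.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding API: a bag-of-words "vector".
    # A production system would call a hosted embedding model here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def chunk(doc: str, size: int = 4) -> list[str]:
    # Break a document into chunks of roughly `size` paragraphs.
    paras = [p for p in doc.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i + size]) for i in range(0, len(paras), size)]

# A stand-in "vector database": (embedding, original text) pairs.
corpus = [
    "Link is the client-side component users interact with to connect accounts.",
    "Webhooks notify your server when new transactions are available.",
    "The sandbox environment lets you test your integration with fake credentials.",
]
index = [(toy_embed(text), text) for text in corpus]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Embed the question and return the k nearest chunks by similarity.
    q = toy_embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(retrieve("How do I get notified about new transactions?", k=1))
```

The retrieved chunks then get pasted into the LLM prompt alongside the user’s question.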
Now this approach works quite well for our prose documentation – that is, everything outside of our reference docs. But we had a theory that maybe when you take a chunk of information from our reference documentation, you lose some of the surrounding context that helps an embedding model (or an AI model) understand what it means.
For example, let’s take a look at this chunk of text grabbed from the middle of our reference docs.
Does it make sense? Kinda? But it makes a lot more sense when you understand that you’re looking at the response object of the /investments/transactions/get call, and that this is part of the securities object.
This is something that we make clear visually to a human, thanks to some sticky multi-level headers (which probably merit a whole other blog post, because these were way harder to build than you might think).
But what about an LLM? It doesn’t know what part of the page this chunk of text belongs to. So what happens when it’s trying to distill all of this down into a single meaning? Will that meaning encapsulate “The data that gets returned from a /investments/transactions/get call”? Maybe not.
So the first solution we tried was to create a separate version of our documentation that contained all of this relevant context for every property in every request and response object – something that would be too verbose for a human, but probably just right for an LLM trying to parse this information a few paragraphs at a time. So the previous page snippet ends up looking like this when Bill reads it in:
(Keep in mind, this was back before llms.txt and the idea of automatically serving markdown files was a thing – we just re-rendered our HTML in something close to plain text.)
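One way to picture the transformation: prepend the breadcrumb a human gets from those sticky headers to every property’s description. The function and field below are hypothetical (loosely modeled on the securities object mentioned above), sketching the general idea rather than our exact rendering.

```python
def contextualize(endpoint: str, object_path: list[str],
                  prop: str, description: str) -> str:
    # Prepend the context a human reader gets from the page's sticky
    # headers (which endpoint and which nested object this property
    # belongs to) so each chunk is self-describing in isolation.
    breadcrumb = " > ".join([endpoint, "response", *object_path, prop])
    return f"{breadcrumb}: {description}"

# Hypothetical property for illustration:
chunk = contextualize(
    "/investments/transactions/get",
    ["securities"],
    "ticker_symbol",
    "The security's trading symbol, if available.",
)
print(chunk)
```

Now even a chunk pulled from the middle of the page carries its own “you are here” information.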
So, how did it work?
Honestly, improvements were modest at best. Part of this is probably because the earlier example was cherry-picked – you really had to hunt for a chunk that loses the context of the larger document. A lot of our reference documentation already gives you a good idea of what part of the API you’re looking at, so there wasn’t much missing context to restore.
But also, there was another problem with how we were handling these chunks of text…
Attempt #2: Thousands of Properties
There was also a disconnect between how we humans kept thinking about embedding vectors and how they actually work. For example, let’s take a look at this piece of reference documentation that tells you a little bit about the data we get back from our Liabilities product:
You might think this would be close in embedding-vector space to user questions about “APRs” or “Overdue Payments” or “Interest Charges”, because all of that is right there in the text, right? The problem is that a traditional RAG model doesn’t work this way – we don’t assign this chunk of text to several different vectors.
How people tend to think that RAG models work.
Instead, this gets converted into a single vector that somehow tries to combine all of these different concepts together into one.
How they actually work.
At some intellectual level, of course, we always knew it worked this way. But it’s hard not to think about it working the other way.
So our next attempt was to change the way we break up these chunks of text. Instead of splitting our reference docs into chunks of text, which might encapsulate several different properties with several different meanings into a single vector, we split individual properties in our reference documents into separate chunks. In other words, lots and lots of smaller chunks.
The nice thing here is that the vector that gets generated from individual properties is much closer to the actual meaning of this property, rather than one that tries to combine 3 or 4 different properties together into one vector.
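Sketched out, per-property chunking is just a recursive walk over the schema, emitting one chunk per leaf. The schema and descriptions below are hypothetical, loosely inspired by the Liabilities example above.

```python
def property_chunks(endpoint: str, schema: dict, path: tuple = ()) -> list[str]:
    # Emit one chunk per property (recursing into nested objects) so
    # each embedding captures a single meaning, instead of blending the
    # meanings of several properties into one vector.
    chunks = []
    for name, value in schema.items():
        here = path + (name,)
        if isinstance(value, dict):
            chunks.extend(property_chunks(endpoint, value, here))
        else:
            chunks.append(f"{endpoint} response > {'.'.join(here)}: {value}")
    return chunks

# Hypothetical slice of a Liabilities-style schema:
schema = {
    "aprs": "The APRs that apply to the account.",
    "is_overdue": "True if a payment is currently overdue.",
    "last_payment": {"amount": "The amount of the last payment."},
}
for c in property_chunks("/liabilities/get", schema):
    print(c)
```

Each of those tiny strings gets its own embedding vector, so a question about “overdue payments” can land squarely on the one property that talks about overdue payments.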
The drawback here is that this didn’t work great with our original strategy of “Find n documents that are most relevant to the user’s question,” because the amount of useful information you get from n pieces of prose documentation will probably be much greater than the useful information from n tiny little pieces of reference documentation, simply due to the size.
So we tried a slightly different approach. We put our reference documents into a different namespace in our vector database, separate from our prose documentation.
Then, when the user asks a question, we grab a combination of “prose” documentation and some tiny chunks of API reference documentation.
Actually, this is a bit of a lie because this was also shortly after we added a reranker. That's probably a whole other blog post, but a reranker is essentially a slower-but-more-accurate way to determine a piece of text’s relevance. You pass a reranker the user’s question and several documents, and it will use a sophisticated strategy to more accurately tell you which of these documents is actually likely to answer the user’s question. It’s typically slow enough that most people don’t use it over their entire corpus of knowledge, but it’s a great way to re-rank a couple of dozen documents.
So, when a user asks a question, we grab around 30 pieces of prose documentation and about 50 small chunks of API reference documentation. We rerank them all, take the top 4 results from prose documentation, the top 7 results from the reference documentation, and use all of that data to answer your question.
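That “pool, rerank, then apply per-source quotas” step looks roughly like this. The word-overlap scorer is a deliberately dumb stand-in for a real cross-encoder reranker (which would be a model call per question/document pair), and the quota numbers just mirror the ones above.

```python
def rerank(question: str, docs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Stand-in for a real reranker: score each (source, text) pair by
    # naive word overlap with the question. A production reranker uses
    # a model to score each question/document pair.
    q = set(question.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d[1].lower().split())),
                  reverse=True)

def build_context(question: str, prose_hits: list[str], ref_hits: list[str],
                  prose_quota: int = 4, ref_quota: int = 7):
    # Pool both namespaces, rerank everything together, then apply
    # per-source quotas so tiny reference chunks can't crowd out prose
    # (or vice versa).
    tagged = ([("prose", d) for d in prose_hits] +
              [("ref", d) for d in ref_hits])
    ranked = rerank(question, tagged)
    prose = [d for src, d in ranked if src == "prose"][:prose_quota]
    ref = [d for src, d in ranked if src == "ref"][:ref_quota]
    return prose, ref
```

In our setup, `prose_hits` would hold ~30 candidates and `ref_hits` ~50 before the quotas cut them down to 4 and 7.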
So, how did we do? Probably a little better. Obviously, because we are now always including some kind of reference documentation, we couldn’t measure progress based on how many times reference documents were being included in our search results. It looked like there was a small bump in our internal evaluation scores, but most of the improvements that customers saw were likely due to having a better reranker at around the same time.
Attempt #3: Go Big or Go Home
So we had attempts #1 and #2 running together for a while and, while they were fine, it still didn’t feel like Bill was doing a great job of incorporating reference documents into his answers. But we also took a look around at the AI landscape and noticed that things had changed.
Specifically…
Models were able to handle larger and larger context windows (that is, the amount of information you threw at them at once)
Models were getting less expensive to use. Throwing several thousand tokens at a model cost much less than it did a year ago.
They were also getting better at the needle-in-the-haystack problem. That is, you could include more irrelevant data alongside relevant data, and models were able to pick out the important pieces and not get confused by the other data.
So we decided to try one other experiment: What if, instead of looking for just the tiny chunks of reference documentation that might be relevant, we took the opposite approach? What if we took the entire documentation for a single endpoint (or webhook) and gave that to the model?
We thought this would be easy enough to try out. Like, it’s just different chunk sizes right? Well, it turns out there are a couple of problems when you’re dealing with very very large reference documents:
Our reference docs were so big that our embedding models couldn’t handle them
The text portions of our reference docs were so big that they didn’t fit in the typical “metadata” portion of our vector database, where they normally get stored.
There are some work-arounds here, but we decided to take this opportunity to really take a step back, look at our initial assumptions, and see if there was a better approach. Maybe the answer here isn’t to use a vector database at all, but to just ask the LLM what documents it wants to see.
So this is what we tried for content ingestion: For each of our endpoints and webhooks in our reference documentation, we grab the entire documentation for that endpoint. (These days, we’re also serving markdown versions of our docs, which also helps.) We also ask the LLM to create a brief summary of the endpoint and what it does.
We store the name of the endpoint, the summary, and the original text content in a plain ol’ relational database.
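As a rough sketch, the storage side really is that simple – a table with three columns. This uses Python’s built-in sqlite3 for illustration; the table name, summary text, and placeholder content are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reference_docs (
        endpoint TEXT PRIMARY KEY,  -- e.g. '/liabilities/get'
        summary  TEXT NOT NULL,     -- brief LLM-generated description
        content  TEXT NOT NULL      -- full text of the reference doc
    )
""")
conn.execute(
    "INSERT INTO reference_docs VALUES (?, ?, ?)",
    ("/liabilities/get",
     "Returns loan and credit card details for the linked accounts.",
     "...the full reference documentation for /liabilities/get..."),
)

# At question time, only this lightweight catalog goes to the LLM;
# full content is fetched later, for just the chosen endpoints.
catalog = conn.execute(
    "SELECT endpoint, summary FROM reference_docs").fetchall()
print(catalog)
```

Because no embedding is involved, document size stops being a constraint at ingestion time.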
Now, when a user asks us a question, we still grab the top pieces of reranked prose documentation from our vector database. But then we also retrieve our pre-computed list of endpoints and summaries. We hand all of that off to our LLM and we ask it, “Based on our user’s question, this documentation, and our list of reference docs, what reference documentation would be most useful in helping you answer this user’s question?”
The LLM returns 0-2 reference documents. We then retrieve the full content of those documents from our database, and we send that information (along with those original prose chunks) to our LLM and have it generate the final answer.
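A sketch of that selection stage, with the actual LLM call stubbed out and the prompt wording paraphrased – the real prompt, model, and catalog contents differ. Note the output is validated against the catalog, so a hallucinated endpoint name is simply dropped.

```python
def select_reference_docs(llm, question: str, prose_chunks: list[str],
                          catalog: list[tuple[str, str]]) -> list[str]:
    # Stage one: show the model the question, the prose excerpts, and
    # the endpoint catalog, and ask which reference docs it wants.
    listing = "\n".join(f"- {ep}: {summary}" for ep, summary in catalog)
    prompt = (
        "Based on the user's question, these documentation excerpts, and "
        "the list of reference docs below, reply with up to 2 endpoints "
        "(one per line) whose reference docs would help, or NONE.\n\n"
        f"Question: {question}\n\n"
        "Excerpts:\n" + "\n".join(prose_chunks) + "\n\n"
        f"Reference docs:\n{listing}"
    )
    reply = llm(prompt)
    # Only keep endpoints that actually exist, capped at two.
    valid = {ep for ep, _ in catalog}
    return [ln.strip() for ln in reply.splitlines() if ln.strip() in valid][:2]

# Stubbed LLM for illustration; a real call hits a chat-completion API.
def stub_llm(prompt: str) -> str:
    return "/liabilities/get\n/not/a/real/endpoint"

chosen = select_reference_docs(
    stub_llm,
    "Why is the APR field null?",
    ["Liabilities returns detailed loan data."],
    [("/liabilities/get", "Loan and credit card details."),
     ("/accounts/get", "High-level account data.")],
)
print(chosen)
```

Stage two then pulls the full `content` for each chosen endpoint from the database and sends it, plus the prose chunks, to the LLM for the final answer.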
How did this work? Really really really well. This system is smart enough to include the right reference documentation most of the time, and when it does, it’s able to find the details that it needs to answer the user’s question. There was a significant bump in our internal evaluation scores. And, just based on my own testing, it did a much better job of giving me the correct answer when the information it needed was buried somewhere in our reference documentation.
Wait a second! What about…
So there are two fairly obvious drawbacks here:
Problem #1: Latency
We are significantly increasing the time until our user first sees a response from Bill. Instead of simply querying a vector database and then getting a response from an LLM, we’re querying a vector database, querying a relational database to generate our list of reference docs, getting a response from our LLM, querying our relational database again, and then getting a second response from our LLM as the final answer.
In practice, the database queries are so fast that they don’t matter much (especially with aggressive caching). But that additional trip to the LLM? It caused a noticeable delay. However, because the response from the first LLM call is short (it’s just listing a couple of document names), it’s not as bad as you’d think.
In fact, the extra delay has been small enough that there was no change in the rate of people quitting before they receive an answer from Bill. (And it’s not nearly as long as that one time we tried switching to a thinking model!)
Latency chart: the green line is the time until the user first received a response.
Problem #2: Cost
The other problem is that much more data is sent to the LLM. In that first query, we’re essentially sending a catalog of every Plaid endpoint, along with their associated summary and the prose documentation chunks. And in the second query, we’re sending up to two entire reference documents for our endpoints, which, as we’ve noted earlier, are big honkin’ documents.
The good news is that with some clever rearranging of text, we’re able to take advantage of prompt caching, which saves us a good chunk of change (at least on that first query). Our costs did go up by a couple of cents per answer, but it’s still much cheaper than Bill’s original cost structure two years ago. And in the bigger scheme of things, decreasing onboarding friction, reducing support ticket volumes, and helping teams build higher-quality integrations is far more valuable than a small increase in per-query cost.
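The “clever rearranging” comes down to prompt ordering: providers cache on a byte-identical prefix, so the stable material has to come first. A minimal sketch of the idea (function and parameter names are ours, not a real API):

```python
def build_stage_one_prompt(instructions: str, catalog_text: str,
                           prose_chunks: list[str], question: str) -> str:
    # Keep the stable parts (system instructions + the endpoint catalog)
    # at the front of the prompt, byte-for-byte identical on every
    # request, so the provider's prompt cache can reuse that prefix.
    # The per-question parts go last, after the cacheable region.
    static_prefix = instructions + "\n\n" + catalog_text
    dynamic_suffix = "\n\n".join(prose_chunks) + f"\n\nQuestion: {question}"
    return static_prefix + "\n\n" + dynamic_suffix
```

If the question or retrieved chunks appeared before the catalog instead, every request would have a unique prefix and the cache would never hit.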
Conclusion: RAG is dead!
What? No; no of course not. That’s the kind of thing said by mediocre blog authors who need controversial headlines. At the end of the day, we’re still retrieving data and serving it to an LLM. Or, to put it another way, Generating text Augmented by data we Retrieved. And a lot of the work around fetching that initial batch of prose documentation is the same approach that we were using back in the early days of Bill.
Instead, the important lesson we learned from this is that the landscape is changing so frequently that it’s a good idea to take a step back and question your assumptions. Our initial approach to Bill was certainly the Best Way to Do Things back in 2023. But after two years of AI advancement, some of the workarounds that were needed to deal with models’ shortcomings are no longer necessary.
And who knows what things will look like in two more years! Will we even need to break up our documentation into pieces anymore? Will Bill be answering more questions from AI builders than humans? Will we ever get around to trying out hybrid search, which might be an even better solution? Maybe!
But that's the exciting part about working with LLMs. The landscape is changing so fast that we’re continually figuring things out as we go and learning new techniques along the way. And the engineering team at Plaid is always trying out new things in the AI space, whether it’s new techniques or completely new tools. Hey, what a perfect segue to mention that if you’re interested in building new things incorporating cutting edge AI, we are hiring!