August 24, 2017

Making sense of messy bank data

Baker Shogry

Updated on November 21, 2018

At Plaid, we collect data from more than 9,600 financial institutions. That means 9,600 ways of classifying an ATM deposit, a refill at a gas station, an online purchase at Amazon, or that important bi-weekly paycheck. It also involves dealing with 9,600 ways of formatting everything from a transaction description to the business’ name to its branch address to whether the payment is recurring. Our customers use this data to build everything from budgeting tools to fraud analytics, so it’s not enough for us to return all this raw, disparate data. We need to be sure it's consistently cleaned and classified across all financial institutions. This consistency ultimately makes it possible for customers like Coinbase, Lyft, or American Express to offer great products to all consumers, regardless of where they bank.

But this is far from easy. Take this example of data before and after going through our system.

One of the more challenging aspects of this normalization task is transaction categorization. In the example above, we need to make sure that a purchase at QuikTrip is properly categorized as a payment at a gas station, no matter what bank or type of payment was used.

Developers depend on proper categorization to deliver experiences that improve consumers’ lives. For example, applications like Clarity Money depend on our categories to accurately track spending patterns over time. Apps for managing credit card points rely on merchant classification to suggest which card will provide the best deal. For most of these applications, accurate transaction categories are the difference between a delightful user experience and a product that simply doesn’t work. If we get it wrong, these products won’t deliver the value their users expect.

But at best, the information that we have to go on (the transaction description and amount) is incomplete and ambiguous. There are often many businesses that correspond to the merchant names provided by banks. Take the following example.

1CHECKCARD 1234 HARRYS TX 12345678901234567890

A quick Google search reveals that this could be a sports bar, a Greek restaurant, or a cowboy boot store.

In some cases, banks categorize transactions themselves, but these categorizations vary greatly in accuracy, coverage, and taxonomy across institutions, and mapping each bank's categories onto a standardized set is its own headache of MCC codes, NAICS codes, SIC codes, and bank particulars. Finally, in many cases, merchants can fall into a number of different categories: is Home Depot a Construction Supply Store, a Hardware Store, or something else?

In this post, we’ll take you through how we use machine learning to standardize categories across thousands of institutions.

Naive approach

The most straightforward approach to categorizing transactions would be a hand-curated set of rules. So if a description includes the word “McDonald,” we could classify it as Fast Food; if it includes the word “DirecTV”, we could classify it as a Cable Provider; and if the description includes “overdraft” we could classify it as Overdraft Fee.
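A hand-curated rule set like this can be sketched in a few lines. The snippet below is purely illustrative: the `RULES` table and the `classify_by_keyword` function are hypothetical names, and the only categories used are the three mentioned above.

```python
from typing import Optional

# Hypothetical keyword rules; the categories are the ones named in the text.
RULES = [
    ("mcdonald", "Fast Food"),
    ("directv", "Cable Provider"),
    ("overdraft", "Overdraft Fee"),
]


def classify_by_keyword(description: str) -> Optional[str]:
    """Return the category of the first rule whose keyword occurs in the description."""
    text = description.lower()
    for keyword, category in RULES:
        if keyword in text:  # plain substring match, no context considered
            return category
    return None  # no rule matched


# Made-up descriptions, for illustration only.
print(classify_by_keyword("MCDONALD'S F1234 AUSTIN TX"))  # Fast Food
print(classify_by_keyword("ATM DEPOSIT 0042"))            # None
```

The substring match is what makes the approach brittle: any description that happens to contain a keyword triggers the rule, whatever the surrounding context says.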

But even simple cases like these quickly run into problems. Take the following examples:


The naive approach would classify this as a cable provider (if you squint you’ll see “DIRECTV” in there), whereas it’s actually a donation.


This transaction could easily be labeled as Fast Food just because the account owner happens to share a last name with the founder of the restaurant chain.


Under a naive rule that only looked for the presence of the words “foreign” and “fee” this would be classified as a foreign transaction fee, when it is in fact a transfer from a foreign bank account.

We could always craft increasingly complex rules to account for these edge cases, but it quickly becomes clear that a naive keyword classifier just isn’t up to the job. When you’re writing rules around these types of ambiguities and you’re managing misspellings, this deterministic approach becomes unwieldy. And because we’re growing so quickly, adding new banks with new customers in new regions with new businesses, this approach would just get harder and harder as we scale. We would have to write new rules for every previously unseen merchant that appeared in our system. Finally, the rules can break whenever a bank arbitrarily decides to change the format of its transaction descriptions, meaning that we would have to start the whole process over again.

The Rise of the Machines

Fortunately, this is the type of task that is well suited to a machine learning solution. Rather than dealing with the pain of creating and constantly maintaining a near-infinite list of rules, we aimed to offload this pain onto a classifier, which can learn this complex set of rules all by itself and update them in response to changes in the underlying data.

The simplest type of model is a classifier based on a bag of words. As the name suggests, bag-of-words models treat a description like an unordered collection of words. They are the probabilistic counterpart to the rule-based approach mentioned above, generating predictions based on the presence of certain keywords in certain combinations.

But this model is suboptimal for our needs. First, because the model ignores ordering between words, it would fail on many of our most common transactions. For example, if “usa” is the first word in a description, the category is likely to be a vending machine, whereas if “usa” is at the end of the description it is likely to refer to the country of origin; but the bag-of-words models would treat those descriptions in the same way.
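The ordering problem is easy to see in code. In this minimal sketch, `bag_of_words` is a hypothetical helper and the two descriptions are made up; the point is only that both reduce to exactly the same representation.

```python
from collections import Counter


def bag_of_words(description: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding their order."""
    return Counter(description.lower().split())


# Made-up descriptions: "usa" leads in one and trails in the other.
leading = bag_of_words("usa golden valley")
trailing = bag_of_words("golden valley usa")

print(leading == trailing)  # True: the model cannot tell them apart
```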

Second, bag-of-words models have no concept of the degree of similarity between words. We would need a model that can successfully classify the word “hotel” to also be able to classify words with similar meanings like “hostel” or “lodge” or “pension.” That way the model would not necessarily have to see a word during training to be able to classify it correctly at prediction time.

Many of these issues could be resolved through what is called feature engineering: manually tailored inputs that give our model hints about how to behave in specific situations. An example of an engineered feature might be an indicator of whether “usa” appears at the beginning of the description. The problem, of course, is that as edge cases build up, we would be forced to create and update a long list of features. We’d then be in the same situation as before, managing a ton of rules, just one level of abstraction higher. It would be more efficient than hand-written rules, but it’s still not nearly good enough.
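As a rough illustration of that engineered feature, the sketch below adds a position indicator on top of plain word-presence features. The `featurize` function and the feature names are hypothetical, chosen only for this example.

```python
def featurize(description: str) -> dict:
    """Build word-presence features plus one hand-engineered position feature."""
    tokens = description.lower().split()
    # Bag-of-words part: one indicator per distinct token.
    features = {f"has_{token}": 1 for token in set(tokens)}
    # Engineered part: does the description begin with "usa"?
    features["starts_with_usa"] = int(bool(tokens) and tokens[0] == "usa")
    return features


print(featurize("usa golden valley")["starts_with_usa"])  # 1
print(featurize("golden valley usa")["starts_with_usa"])  # 0
```

Every new edge case needs another line like the `starts_with_usa` one, which is exactly how the feature list grows out of control.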

Going deeper

Fortunately, these shortcomings can be rectified by turning to our future overlords: artificial neural networks.

First, we use word embeddings, which express words as low-dimensional vectors of real numbers. Once we have a numeric representation of words, we can calculate distances between words and group them according to their similarity. These techniques do seem to capture some fundamental aspects of the meaning of words. In an example from word2vec, a popular neural-network-based word embedding model, if the vector representation of “man” is subtracted from “king” and then “woman” is added, we get a vector very similar to “queen”! We use these word embeddings rather than the words themselves as the inputs into our model, and this allows us to classify words even if we’ve never seen them before in our training data.
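The arithmetic behind the king/queen example can be sketched with toy vectors. The numbers below are hand-made for illustration (two axes, loosely "royalty" and "gender") and are not real word2vec output; `cosine` and `analogy` are hypothetical helper names.

```python
import math

# Hand-made 2-D toy vectors (axes: royalty, gender). NOT real word2vec values.
EMBEDDINGS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.9, 0.9),
    "girl": (-0.2, -1.0),
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(EMBEDDINGS[word], target))


print(analogy("king", "man", "woman"))  # queen
```

Real embeddings have hundreds of dimensions and are learned from data rather than written by hand, but the nearest-neighbour lookup works the same way.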

Next we construct a neural network on top of these word embeddings. We use an architecture that considers the words of the transaction description in small, sequential groups, thus learning local connections between these words. In contrast to the bag-of-words model, this allows us to take account of the ordering of words. The nice thing about neural networks is that you don’t have to bother with time-consuming feature engineering. As layer is piled upon layer in a neural network, the model becomes capable of learning complex, non-linear combinations of the original inputs, crafting its own features automatically.
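The post does not name the exact architecture, but one family that fits this description is a one-dimensional convolution over the word vectors followed by max pooling. The sketch below is an assumption along those lines, not a description of the production model: the toy embeddings and the single hand-set filter are made up, whereas in a real network the filter weights would be learned.

```python
# Made-up 2-D toy embeddings; a real model would use learned vectors.
TOY_EMBEDDINGS = {
    "usa": (1.0, 0.0),
    "golden": (0.0, 1.0),
    "valley": (0.0, 1.0),
}

# One hand-set filter spanning two consecutive words. It responds most
# strongly when a "usa"-like vector is immediately followed by a
# "golden"-like vector. In a trained network these weights are learned.
KERNEL = ((1.0, 0.0), (0.0, 1.0))


def embed(description):
    """Map each lowercase token to its toy vector, preserving word order."""
    return [TOY_EMBEDDINGS[token] for token in description.lower().split()]


def conv1d_max(embedded, kernel):
    """Slide the kernel over consecutive word vectors and keep the strongest response."""
    width = len(kernel)
    responses = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        responses.append(sum(
            weight * value
            for weights, vector in zip(kernel, window)
            for weight, value in zip(weights, vector)
        ))
    # Descriptions shorter than the kernel produce no windows at all.
    return max(responses, default=0.0)


print(conv1d_max(embed("usa golden valley"), KERNEL))  # 2.0
print(conv1d_max(embed("golden valley usa"), KERNEL))  # 1.0
```

The same three words give different feature values depending on their order, which is precisely what a bag-of-words representation cannot express.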

Our system has made some interesting predictions. It can account for ordering: “usa*golden valley” is classified as a vending machine, whereas “checkcard 1111 usa 23456 palmdale ca” is labeled as a gas station. It can classify merchants it’s never seen before: we correctly labeled “Johnston’s Saltbox” as a restaurant despite having never previously seen either of those words. Our system can deal more robustly with spelling errors and truncations: we classified “Pirates Cove Car Wa” as a Car Wash rather than some other car-related (or pirate-related) category.

Better, faster, more productive

These enhancements (and many others) account for our best-in-class accuracy and coverage. We’ve achieved these improvements despite the fact that there is huge variation over time in our underlying data as we add more and more users.

Because the data is constantly shifting as we add new banks and new customers with new types of users, the challenge of transaction categorization is always evolving and changing.

So we will too. We continue to make improvements, monitor performance, and retrain on a growing set of data, all the while experimenting with new model architectures.

At the same time, we’re extending machine learning methods to other data quality issues. We’re currently working on an engine for extracting and normalizing merchant name, address, and other bits of information from the transaction description. Check back here soon.