May 25, 2023
Why Plaid built the Transaction Enrichment Engine, and how it can help you innovate
Chris Jin and Hansen Qiu
From cash flow management and budgeting tools to rewards programs, there’s no shortage of use cases that rely on insights from an individual’s transaction history. As a consumer, though, one glance at a bank statement might leave you wondering what some transactions are, because their descriptions are a jumble of letters and numbers. Financial innovators feel that pain point just as keenly: they want to deliver personalized insights and experiences to their users, but have trouble turning the messy data into accurate, actionable insights.
To solve this problem for our customers, Plaid developed a multi-layered Transaction Enrichment Engine which powers our Transactions and Enrich products. Each day, it processes and enriches over 500 million new transactions for the world’s leading fintechs, enterprises, and financial institutions. With the volume of data processed and our deep experience and expertise working with enrichments, Plaid is better positioned than any other platform to tackle the challenges of understanding and enriching transactions. Plus, with the significant system improvements we've made over the past few quarters (which we’ll highlight below), we’re now able to provide an even wider range of best-in-class transaction insights.
Below, we will walk you through the high-level design of our Transaction Enrichment Engine. By gaining a comprehensive understanding of the transaction insights Plaid offers, you’ll be able to build the innovative products and solutions you need to deliver more value to end users.
Terminology & System Overview
Key terms to know before we dive into details of the Transaction Enrichment Engine
Token: A sequence of characters grouped together as a useful semantic unit for processing. A word could be split into multiple tokens.
Entity: A real-world object such as a merchant, person name, or location, identified and categorized within a body of text such as a transaction description.
Named Entity Recognition (NER): A method that detects and categorizes meaningful entities from a body of text.
Named Entity Linking (NEL): A method that links extracted entities with their corresponding records in a knowledge base.
The chart below shows the process through which our Transaction Enrichment Engine extracts various valuable signals and insights, such as merchant, location, and category, from a transaction. Throughout this blog post, we’ll walk you through an example transaction with the description “Debit Purchase -visa Card 1234wm Supercenter #morriltonar”.
This description appears messy on the surface but actually contains a wealth of valuable information. We’ll break down our approach step by step, and demonstrate how to turn that raw data into actionable insights.
The first step to standardizing and cleaning messy transaction data
Since transactions are all organized in distinct formats with different levels of detail, a pre-processing layer is applied to normalize the description before further steps. Although applying common techniques such as lower-casing, whitespace stripping, and punctuation processing works to some extent, these traditional techniques may not be sufficient for more complicated cases, such as the Walmart example above, where tokens are fused together with random punctuation marks.
This can make it challenging for downstream entity recognition algorithms to properly extract important entities. To overcome this challenge, we use a subword-based tokenization technique that enables us to cleanse messy transaction descriptions. The chart below illustrates how our pipeline works on a high level:
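In code, the general idea can be sketched as follows. This is an illustration only, not Plaid's production pipeline: the toy vocabulary, the function names, and the greedy longest-match strategy are all assumptions made for the example, and a real subword vocabulary would be learned from data rather than written by hand.

```python
import re

# Hypothetical toy vocabulary, hand-written for this example only.
VOCAB = {"debit", "purchase", "visa", "card", "wm", "supercenter", "morrilton", "ar"}

def normalize(description):
    """Lower-case, turn runs of punctuation into spaces, and trim whitespace."""
    text = description.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()

def split_alnum(word):
    """Split at letter/digit boundaries, e.g. '1234wm' -> ['1234', 'wm']."""
    return re.findall(r"[a-z]+|[0-9]+", word)

def segment(word, vocab):
    """Greedy longest-match split of a fused word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: keep the remainder unchanged.
            pieces.append(word[start:])
            break
    return pieces

def tokenize(description, vocab=VOCAB):
    """Normalize a raw description and split fused words into tokens."""
    tokens = []
    for word in normalize(description).split():
        for chunk in split_alnum(word):
            if chunk.isdigit() or chunk in vocab:
                tokens.append(chunk)
            else:
                tokens.extend(segment(chunk, vocab))
    return tokens

tokens = tokenize("Debit Purchase -visa Card 1234wm Supercenter #morriltonar")
```

With this toy vocabulary, the fused strings "1234wm" and "morriltonar" come apart into "1234", "wm", "morrilton", and "ar", which is the kind of separation downstream entity recognition needs.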
Named Entity Recognition
Translating transaction descriptions into valuable information
Named Entity Recognition (NER) is a crucial component of the transaction enrichment process, where we extract valuable information from transaction descriptions. Our NER engine detects various types of entities, including location, counterparty, store number, and check number, and strikes a balance between accuracy, efficiency, and interpretability. We combine several approaches to achieve the desired performance.
First, we developed several rule-based mechanisms and algorithms, which enable us to quickly and accurately extract entities from transaction descriptions.
Regex Rule Matching: Regex is a powerful tool when we need to match a transaction against a template, especially given that some financial institutions and payment processors tend to construct transaction descriptions with consistent formats, such as “[merchant name] [store id] [date] [location]”. This also enables us to quickly roll out hotfixes for any inaccuracies from the model.
Fuzzy String Matching: While regex rules are powerful for matching predefined patterns, they come with limited flexibility. For example, if a transaction description includes the truncated merchant entity 'wm supercen' instead of the full pattern 'wm supercenter', a regex rule searching for the exact string would not return a match. Attempting to capture every truncation, abbreviation, and misspelling within regex rules would inevitably make the rules more complex and harder to comprehend and manage. Therefore, we developed a string matching algorithm with built-in fuzziness to capture such cases.
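To illustrate the fuzzy matching idea, here is a minimal sketch built on Python's standard difflib module. It is not Plaid's algorithm, which this post does not describe in detail. The merchant list, the 0.85 threshold, and the prefix comparison used to tolerate truncation are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical list of known merchant patterns.
KNOWN_MERCHANTS = ["wm supercenter", "target", "starbucks"]

def similarity(candidate, pattern, min_prefix_len=4):
    """Score in [0, 1]; tolerant of misspellings and of truncated candidates."""
    full = SequenceMatcher(None, candidate, pattern).ratio()
    if len(candidate) < min_prefix_len:
        # Very short strings match too many prefixes, so skip the prefix check.
        return full
    # Truncation: compare against the pattern cut to the candidate's length.
    prefix = SequenceMatcher(None, candidate, pattern[:len(candidate)]).ratio()
    return max(full, prefix)

def fuzzy_match(candidate, patterns=KNOWN_MERCHANTS, threshold=0.85):
    """Return the best-scoring pattern, or None if nothing clears the threshold."""
    best = max(patterns, key=lambda p: similarity(candidate, p))
    return best if similarity(candidate, best) >= threshold else None

truncated = fuzzy_match("wm supercen")     # truncated merchant name
misspelled = fuzzy_match("wm supercentr")  # dropped letter
unknown = fuzzy_match("xyz")               # nothing similar
```

The threshold is the main tuning knob here: raising it favors precision, while lowering it favors coverage.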
In our example, we can easily identify “wm supercenter” as the merchant name, “morrilton” as the city, and “ar” as the state by applying our rule-based approach.
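A template rule of this kind might look like the sketch below. The pattern is hypothetical and written only for this one description format. It assumes the description has already been lower-cased and split into space-separated tokens, and the group name "digits" is a neutral placeholder because the post does not say what “1234” denotes.

```python
import re
from typing import Optional

# Hypothetical template for a single description format.
# Real rules would be maintained per institution or payment processor.
DEBIT_PURCHASE_RULE = re.compile(
    r"^debit purchase visa card (?P<digits>\d+) "
    r"(?P<merchant>.+) (?P<city>[a-z]+) (?P<state>[a-z]{2})$"
)

def apply_rule(normalized: str) -> Optional[dict]:
    """Return the named groups if the description fits the template, else None."""
    match = DEBIT_PURCHASE_RULE.match(normalized)
    return match.groupdict() if match else None

entities = apply_rule("debit purchase visa card 1234 wm supercenter morrilton ar")
no_match = apply_rule("atm withdrawal 0042")
```

Because the merchant group is greedy and the city group accepts a single word, this sketch would mislabel multi-word cities. That limitation is one reason template rules are usually paired with other methods.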
In addition to the rule-based approaches, we also host a Transformer-based language model trained on the transaction description data, which we have discussed in a previous blog post. The ML model complements the rule-based approach, which inherently has limited coverage and doesn’t cover unseen patterns in transaction data. By incorporating both approaches, our entity recognition engine strikes an optimal balance between accuracy and coverage.
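The post does not specify the model's output format. A common convention for token-level NER models is to emit one BIO label per token (B- begins an entity, I- continues it, O is outside any entity). Assuming that convention, the sketch below shows how such labels could be grouped into entities. The labels in the example are hand-written for illustration and are not real model output.

```python
def decode_bio(tokens, labels):
    """Group per-token BIO labels into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- label that does not continue the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities

# Hypothetical labels for the running example.
tokens = ["debit", "purchase", "visa", "card", "1234", "wm", "supercenter", "morrilton", "ar"]
labels = ["O", "O", "O", "O", "O", "B-MERCHANT", "I-MERCHANT", "B-CITY", "B-STATE"]
entities = decode_bio(tokens, labels)
```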
Named Entity Linking
Unlocking comprehensive insights beyond transaction strings
NER’s potential is capped by the information provided within the input text. In transaction descriptions, we often find entities that are misspelled, aliased, abbreviated, or truncated. Furthermore, there’s a plethora of additional data related to a merchant, such as the merchant logo and website URL, that would be useful to have but is not often found within the transaction data itself.
To address this challenge, we’ve created an internal Knowledge Base that hosts structured entity metadata entries, accompanied by a Named Entity Linking (NEL) search engine that connects parsed entities to their corresponding records in the Knowledge Base. This setup allows us to deliver comprehensive insights beyond just what is available from the transaction string itself.
Similar to the NER problem, we utilize a layered approach with multiple methods to strike a balance among performance, accuracy, and ease of use.
To normalize extracted entity names that don't perfectly match their corresponding Knowledge Base records, we employ regex rules and fuzzy string matching techniques before mapping the matched entity to its associated structured metadata. Here is the match we get for the Walmart transaction case:
Although we can match a substantial number of transactions made with popular merchants using the rule-based approach, we also aspire to provide comprehensive coverage for consumer purchases made at local convenience stores and restaurants. To offer best-in-class transaction insights, we needed to develop mechanisms to cover the long tail.
However, the inherent messiness of long-tail merchant names has made it difficult to maintain high precision while expanding coverage. To overcome this hurdle, we've developed a candidate selection engine that ranks the records based on a range of attributes, including but not limited to the string similarity between the parsed and standardized merchant names, and the proximity between the geographical locations. Once all candidates are ranked, the top candidate is returned as the final enrichment result.
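As a rough sketch of how such a ranking could work, the example below scores each candidate with a weighted mix of name similarity and geographic proximity. The scoring formula, the weights, the distance scale, and every record and coordinate are assumptions invented for illustration. The post does not describe the actual ranking function.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Candidate:
    record_id: str  # hypothetical Knowledge Base identifier
    name: str
    lat: float
    lon: float

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def score(parsed_name, lat, lon, cand, name_weight=0.7, scale_km=50.0):
    """Weighted mix of string similarity and proximity, both in [0, 1]."""
    name_sim = SequenceMatcher(None, parsed_name, cand.name).ratio()
    geo_sim = math.exp(-haversine_km(lat, lon, cand.lat, cand.lon) / scale_km)
    return name_weight * name_sim + (1 - name_weight) * geo_sim

def rank(parsed_name, lat, lon, candidates):
    """Return candidates sorted from best to worst score."""
    return sorted(candidates, key=lambda c: score(parsed_name, lat, lon, c), reverse=True)

# Made-up records: same name near and far, plus a similar name nearby.
candidates = [
    Candidate("kb-1", "joe's pizza", 35.16, -92.75),
    Candidate("kb-2", "joe's pizza", 40.73, -74.00),
    Candidate("kb-3", "joe's plumbing", 35.15, -92.74),
]
top = rank("joes pizza", 35.15, -92.74, candidates)[0]
```

Here the nearby pizza record ranks first because it wins on both signals. The far-away record with the identical name loses on proximity, and the nearby plumbing record loses on name similarity.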
With these named entity recognition and linking approaches, we achieved over 90% coverage for the merchants involved in transactions while maintaining 99% precision.
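The post does not define these two metrics. One common reading is that coverage is the share of transactions for which a merchant is returned at all, and precision is the share of returned merchants that are correct. Under that assumption, they could be computed as in the sketch below, where the evaluation data is made up.

```python
def coverage_and_precision(results):
    """results: list of (predicted, actual) pairs; predicted is None when no merchant is returned."""
    returned = [(pred, actual) for pred, actual in results if pred is not None]
    coverage = len(returned) / len(results) if results else 0.0
    correct = sum(1 for pred, actual in returned if pred == actual)
    precision = correct / len(returned) if returned else 0.0
    return coverage, precision

# Made-up evaluation set: 5 transactions, 4 with a prediction, 3 of those correct.
results = [
    ("walmart", "walmart"),
    ("target", "target"),
    ("starbucks", "starbucks"),
    ("shell", "chevron"),
    (None, "corner cafe"),
]
coverage, precision = coverage_and_precision(results)
```

Under this reading the two metrics pull against each other. Returning a match only when the system is confident raises precision but lowers coverage, which is why the long tail is hard.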
Enrichment Insights - Categorization
Understanding how and where users are spending their money
After performing all the upstream tasks, we have developed a comprehensive understanding of the transaction by extracting all the relevant attributes and entities available. This allows us to derive advanced insights that tell a more interesting story about the transaction, such as the Personal Finance Category (PFC). In 2021, we updated our categorization taxonomy by distilling over 600 categories in our legacy category taxonomy into 16 primary and 104 detailed sub-categories. Accurate PFC predictions enable our customers to understand where users are spending their money. This information is crucial for conducting cash flow analysis, building budgeting tools, and performing other complex insights computations.
To determine the PFC, we make use of a multi-layered approach (again!). Since insights computation is the last step within the enrichment pipeline, it can make use of all the transaction enrichments generated along the way. Here are a couple of quick examples of how we leverage computed enrichments:
Merchant records matched from our Knowledge Base come with a set category. For the Walmart example, we could easily classify it as GENERAL_MERCHANDISE_SUPERSTORES.
Heuristics help us make categorization decisions based on specific combinations of transaction attributes. For example, a small outflow amount and some weak token indicators would lead to a vending machine category prediction.
Finally, to achieve 100% coverage for transaction category prediction, we also developed an ML model to make PFC predictions for transactions that are not covered by our heuristics and rules. The model architecture is unsurprisingly similar to that of our NER models; the main difference is that we attach fully connected layers followed by a softmax layer to the base language model to solve the multi-class classification problem. We employed an innovative data augmentation technique to generate the vast amount of high-quality training data required to ensure high accuracy and prevent overfitting. As a result, we achieved an overall PFC accuracy of over 90% with the recent improvements.
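The layering described in this section can be sketched as a simple cascade: use the Knowledge Base category if there is one, otherwise try heuristics, otherwise fall back to the model. Everything specific in the sketch is an assumption: the field names, the amount threshold, the hint tokens, the EXAMPLE_ labels (placeholders, not real PFC identifiers), and the stub that stands in for the ML model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weak indicators for the vending machine heuristic.
VENDING_HINTS = {"vend", "vending"}

def categorize(txn, logits_fn, labels):
    """Return (category, source_layer) for a transaction dict."""
    # Layer 1: a matched Knowledge Base record carries a set category.
    if txn.get("kb_category"):
        return txn["kb_category"], "knowledge_base"
    # Layer 2: heuristic on attribute combinations.
    # Assumes "amount" is a positive number for outflows.
    if 0 < txn["amount"] <= 5.00 and VENDING_HINTS & set(txn["tokens"]):
        return "EXAMPLE_VENDING_MACHINE", "heuristic"
    # Layer 3: model fallback, so every transaction gets a category.
    probs = softmax(logits_fn(txn))
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], "model"

def stub_logits(txn):
    """Stand-in for a trained classifier; returns fixed scores."""
    return [0.1, 2.0, -1.0]

LABELS = ["EXAMPLE_GROCERIES", "EXAMPLE_RESTAURANT", "EXAMPLE_OTHER"]

walmart = categorize(
    {"kb_category": "GENERAL_MERCHANDISE_SUPERSTORES", "amount": 42.10, "tokens": ["wm", "supercenter"]},
    stub_logits, LABELS)
vending = categorize({"amount": 1.75, "tokens": ["vend", "0042"]}, stub_logits, LABELS)
fallback = categorize({"amount": 23.40, "tokens": ["corner", "cafe"]}, stub_logits, LABELS)
```

Returning the source layer alongside the category is a design choice made for the sketch. It makes it easy to see which layer produced each prediction when debugging.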
Building the future of finance
Providing sustainable, high-quality enriched transactions that best serve our customers’ needs is a continuous effort, and we have an exciting roadmap for the rest of 2023 and beyond! We plan to launch highly requested product features such as:
Identifying the counterparty behind income sources, which will provide greater transparency and understanding of inflow transactions.
New confidence level fields for our key enrichment fields, such as PFC and merchant name, to give customers more control over how they would like to use the transaction insights.
More advanced ML-based searching techniques to further improve our ability to intelligently grow our Knowledge Base and improve the counterparty match rate.
Plaid’s Transactions and Enrich products are built on our bespoke and finely honed Transaction Enrichment Engine, enhancing external and internal sources of transactions data, respectively. Interested in seeing how we can help bring transactional insights to your unique use cases? Reach out to our team or connect with your Plaid Account Manager.
If you’re excited about joining the Plaid team to solve these problems and many others, check out plaid.com/careers.