September 27, 2023
How we use machine learning to power accurate, real-time income verification
Wen Yao, Jeet Nagda, Akshit Annadi, & Rohan Sriram
Verifying and understanding a potential borrower's income is a critical part of the lending process. However, income verification can be time consuming and error prone, especially when relying on traditional methods like pay stubs or tax returns.
To facilitate the process, Plaid Income brings together a verification suite of three products: Payroll Income, Document Income, and Bank Income. While the two former products can extract income information directly from payroll or pay stubs, the latter relies on machine learning to automatically identify a borrower's income from their bank transactions.
In this post, we'll dive into the machine learning (ML) models that power Bank Income’s verification process and how they work together to identify and categorize income sources in a matter of seconds.
How Bank Income works
Before we get into the technical solution behind Bank Income, let's have a quick look at the product flow:
Figure 1. Bank Income flow
1. The consumer permissions access to their bank accounts that receive income.
2. The consumer selects their income from potential sources detected by Plaid.
3. The consumer reviews and permissions the income information to be shared with a lender.
4. Plaid provides the lender with the selected income sources, along with metadata such as income stream category and frequency.
All metadata are detected as part of step 2, during which we leverage a suite of both supervised and unsupervised machine learning models to identify potential income sources for consumers to choose from. Let’s take a closer look below.
Technical solution overview
At the heart of Bank Income is a series of machine learning models that work together to extract income insights from a user's bank transactions. Here's how it works at a high level:
Figure 2. Family of ML models that powers Bank Income
Transaction extraction: Our first step is to extract transactions from the user's bank accounts by leveraging Plaid’s connectivity to over 12,000 financial institutions.
Filtering: We then filter out transactions that are not income, such as account transfers and reverse charges.
Clustering: Once we have the transaction data that may constitute income, we use a clustering algorithm to group similar transactions into source streams. Take, for example, a person who works for an insurance company and also drives for a ride-sharing platform. The algorithm would detect two streams for this person: one grouping the insurance company paychecks and another grouping the ride-sharing platform payouts. Grouping transactions this way lets us analyze how often each income source recurs.
Frequency detection: The ability to determine the frequency of income sources is crucial for lenders during the credit decisioning process. We use a hybrid approach to detect frequencies of source streams, based on bank transaction posted dates. The model combines frequency rules and a variable frequency detection (VFD) algorithm based on auto-regression, achieving an impressive frequency detection rate of 95% for salary streams and 97% for government-related income streams.
Income identification: Presenting all the source streams in the income selection pane can be overwhelming for users. It may also require scrolling down and individually selecting incomes, which can be cumbersome and time consuming. To enhance the user experience, we employ a binary classification model that predicts the probability of a source stream being income, for a more streamlined income selection experience.
Income categorization: To categorize income sources, we developed a classification model consisting of 13 income categories (see API response), such as SALARY, PENSION_AND_RETIREMENT, and GIG_ECONOMY. To overcome the challenge of noisy labels, we also developed an innovative Label Enhancement model that identifies and rectifies incorrect labels, enhancing the quality and reliability of our ground truth data.
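To make the frequency detection step more concrete, here is a minimal sketch of what the rule-based half of a hybrid approach could look like. It covers only simple rules on posted dates, not the auto-regressive VFD algorithm. The function name, the rule table, and the day ranges are assumptions made for this illustration, not Plaid's production rules.

```python
from datetime import date
from statistics import median

# Hypothetical rule table: (label, min_gap_days, max_gap_days).
# The ranges are illustrative assumptions, not production values.
FREQUENCY_RULES = [
    ("WEEKLY", 6, 8),
    ("BIWEEKLY", 13, 15),
    ("MONTHLY", 28, 32),
]


def detect_frequency(posted_dates):
    """Return a frequency label based on the median gap between posted dates."""
    ordered = sorted(posted_dates)
    if len(ordered) < 3:
        return "UNKNOWN"  # too few transactions to infer a cadence
    # Gap in days between each pair of consecutive transactions.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    typical_gap = median(gaps)
    for label, low, high in FREQUENCY_RULES:
        if low <= typical_gap <= high:
            return label
    return "UNKNOWN"  # irregular streams would need a more flexible method


paychecks = [date(2023, 1, 6), date(2023, 1, 20), date(2023, 2, 3), date(2023, 2, 17)]
print(detect_frequency(paychecks))
```

Using the median gap rather than the mean keeps a single late or early deposit from changing the label, which is one reason simple rules can work well for regular salary streams.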
Bank Income identification and categorization: A deep dive
The problem of classification
We initially approached income identification as a hierarchical classification problem, assuming that stream sources with certain income categories would be classified as income and selected by the user, while all other sources would likely be non-income and not selected by the user. The data, however, presented a different story.
As depicted in Figure 3, the majority of SALARY streams are selected as income, while a small fraction of sources are not. This often arises when spouses share a joint account for depositing their earnings, but only select their own salary to share with the lender—a scenario that can occur with other categories as well, though less typically.
Conversely, sources categorized as invalid income, such as TRANSFER and CASH_DEPOSIT, can still be selected by users as their income sources. This is often the case for small business owners who get paid via third-party applications or cash—income that should nonetheless be considered valid by lenders. Users might also select income categorized as invalid in an attempt to inflate their income for credit approval purposes.
Figure 3. Number of source streams (log scale) in each income category sampled from a historical time period
Drawing upon these observations, we structured the problem into two independent steps, first predicting if the user is likely to select a given income source to share and then categorizing the type of income source.
The modeling design
To tackle each step, we developed two models: an Income Identification Model (IIM) to classify a stream as valid income the user is likely to select and an Income Categorization Model (ICM) to categorize each income source into one of the 13 categories.
Let’s first take a look at the source stream data in Figure 4. Each user may have multiple source streams, each of which consists of one or more transactions (indicated by the `num_transactions` field). The two highlighted columns are the target variables we aim to predict.
Figure 4. Example source streams
Because the two models share much of their training data preparation and featurization, we designed the training architecture shown in Figure 5 to maximize reuse of logic and modules between them. The featurization of transaction descriptions exemplifies this, as we utilize shared text embedding logic maintained in common abstractions. This approach optimizes efficiency, streamlines development, and ensures consistency and robustness throughout the model training pipelines.
Figure 5. Modeling design
Our source stream dataset comprises a mixture of numerical, categorical, time series, and text fields, each of which requires its own featurization technique. Below are the specific approaches for featurizing these different data types.
Transaction description embedding. For the transaction description field, we use classic embedding techniques that strike a balance between complexity and performance. They allow us to efficiently convert the text into numerical representations while capturing the semantic meaning of key tokens in the transaction descriptions.
Categorical data featurization. We utilize encoding to transform categorical fields, such as frequency and financial institutions, into features. We also apply additional data processing to high cardinality features to avoid large dimensionality and prevent potential model overfitting.
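One common way to keep a high-cardinality field from inflating dimensionality is to collapse rare values into a single bucket before encoding. The sketch below assumes one-hot encoding and a minimum-count cutoff. Both choices, along with the function names and the "OTHER" bucket, are assumptions for illustration, since the exact encoding is not specified here.

```python
from collections import Counter


def fit_category_vocab(values, min_count=2):
    """Keep categories seen at least min_count times; the rest map to OTHER."""
    counts = Counter(values)
    kept = sorted(value for value, count in counts.items() if count >= min_count)
    return kept + ["OTHER"]


def one_hot(value, vocab):
    """Encode a single value as a one-hot vector over the fitted vocabulary."""
    key = value if value in vocab else "OTHER"
    return [1 if key == entry else 0 for entry in vocab]


# Hypothetical institution identifiers for a handful of source streams.
institutions = ["bank_a", "bank_a", "bank_b", "bank_c", "bank_a", "bank_b"]
vocab = fit_category_vocab(institutions)
print(vocab)
print(one_hot("bank_c", vocab))  # rare institution falls into the OTHER bucket
```

The vocabulary would be fitted on training data only, so that institutions first seen at prediction time also fall into the shared bucket.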
Time series featurization. Each stream consists of a list of transactions with dates and amounts. We extract meaningful statistical measures from this time series data, and plan to include advanced temporal patterns in the future.
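As a rough illustration of the kind of statistical measures that can be extracted from a stream, the sketch below summarizes amounts and the spacing between posted dates. The feature names and the specific statistics are assumptions chosen for this example, not the production feature set.

```python
from datetime import date
from statistics import mean, median, pstdev


def time_series_features(transactions):
    """Summarize one stream given a non-empty list of (posted_date, amount) tuples."""
    ordered = sorted(transactions)  # chronological order
    dates = [posted for posted, _ in ordered]
    amounts = [amount for _, amount in ordered]
    # Days between consecutive transactions; empty for single-transaction streams.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return {
        "num_transactions": len(ordered),
        "amount_mean": mean(amounts),
        "amount_median": median(amounts),
        "amount_std": pstdev(amounts),
        "gap_days_mean": mean(gaps) if gaps else 0.0,
        "gap_days_std": pstdev(gaps) if gaps else 0.0,
    }


stream = [
    (date(2023, 1, 1), 1000.0),
    (date(2023, 1, 15), 1000.0),
    (date(2023, 1, 29), 1000.0),
]
print(time_series_features(stream))
```

A low standard deviation in both amount and gap suggests a regular paycheck, while high variation is more typical of gig payouts or ad hoc transfers.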
Source context featurization. This involves understanding the context of a source stream within the user-level data, such as the amount of the income compared to other streams. These features capture the relative importance of a specific source stream compared to other sources for a given user. To illustrate the underlying intuition, consider a source stream with recurring transactions from Zelle and a monthly amount of $1500. If this amount constitutes 90% of the total income for a user, it’s more likely to be classified as income compared to another user for whom the same amount represents only 5% of their total income.
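The Zelle example above can be expressed as a simple share-of-total feature. This is a minimal sketch. The function name, the stream names, and the other streams' amounts are hypothetical and chosen only to reproduce the 90% and 5% cases.

```python
def amount_share_by_stream(monthly_amounts):
    """Map each stream to its share of the user's total monthly inflow."""
    total = sum(monthly_amounts.values())
    if total <= 0:
        return {name: 0.0 for name in monthly_amounts}
    return {name: amount / total for name, amount in monthly_amounts.items()}


# Two hypothetical users with the same $1500 monthly Zelle stream.
user_a = {"zelle": 1500.0, "refunds": 166.67}   # Zelle is about 90% of the total
user_b = {"zelle": 1500.0, "salary": 28500.0}   # Zelle is 5% of the total

print(amount_share_by_stream(user_a)["zelle"])
print(amount_share_by_stream(user_b)["zelle"])
```

The absolute amount is identical for both users, so only a user-level feature like this share lets a model treat the two streams differently.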
Bank Income Identification Model
Before we had a machine learning model in place, we relied on simple heuristics to rank income sources using popular income categories and large amounts. Users were still required to scroll through the pane, examine each source, and make selection decisions individually.
Using the Income Identification model, we’ve designed a streamlined end-user experience that helps a user quickly identify and select what they want to share as income, improving experience and conversion.
We rely on user selection to flag income and consider this data as ground truth, but also recognize that it contains some level of noise. For instance, some users select every available source to rush the process, resulting in incorrect labels. These erroneous selections are filtered during data preprocessing. Users with fewer than three streams are also removed, since hiding sources to expedite the process is of little utility for these users.
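The two preprocessing filters described above could be sketched as follows. The data layout and function name are hypothetical, and treating every select-all user as noise is a simplifying assumption, since the exact criteria for an erroneous selection are not described here.

```python
def filter_training_users(streams_by_user, min_streams=3):
    """Drop users whose selections are unlikely to be useful training labels.

    streams_by_user maps a user id to a list of (stream_id, selected) pairs.
    """
    kept = {}
    for user_id, streams in streams_by_user.items():
        if len(streams) < min_streams:
            continue  # too few streams for hiding sources to matter
        if all(selected for _, selected in streams):
            continue  # selected everything: assumed to be a noisy label
        kept[user_id] = streams
    return kept


users = {
    "u1": [("s1", True), ("s2", False)],
    "u2": [("s1", True), ("s2", True), ("s3", True)],
    "u3": [("s1", True), ("s2", False), ("s3", False)],
}
print(sorted(filter_training_users(users)))
```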
Model performance evaluation
For the identification model, we trained an XGBoost binary classifier and evaluated it with Average Precision (AP) and dollar-amount-weighted AP. AP summarizes the precision-recall (PR) curve as the weighted mean of the precision achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Notably, the model is especially good at identifying high-income sources, which is the behavior we want. This is evident from the substantial gap between the two PR curves in Figure 6 (left).
To pinpoint where the model falls short in identifying user-selected income sources, we compared the PR curve for each income amount bucket in Figure 6 (right). We observed that the model performs satisfactorily for sources exceeding $2.5k, but its performance drops significantly as the amount decreases. This is likely due to the noisy nature of non-recurring sources with small amounts (e.g., from the gig economy or a Zelle payment), which makes it difficult to differentiate income from non-income.
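To show how the two metrics relate, here is a small self-contained sketch of AP with optional per-sample weights. Using each stream's dollar amount as its sample weight is an assumption about how a dollar-weighted AP could be computed, and the sketch treats every score as distinct rather than handling ties.

```python
def average_precision(labels, scores, weights=None):
    """AP as the sum of (recall gain) * (precision) over items ranked by score.

    With weights, each item counts in proportion to its weight (for example,
    its dollar amount) in both precision and recall. Assumes distinct scores.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    ranked = sorted(zip(scores, labels, weights), key=lambda row: -row[0])
    total_positive = sum(weight for _, label, weight in ranked if label == 1)
    if total_positive == 0:
        raise ValueError("need at least one positive label")
    seen = 0.0       # total weight ranked so far
    true_pos = 0.0   # positive weight ranked so far
    ap = 0.0
    for _, label, weight in ranked:
        seen += weight
        if label == 1:
            true_pos += weight
            precision = true_pos / seen
            ap += (weight / total_positive) * precision  # recall gain * precision
    return ap


labels = [1, 0, 1, 0]                   # 1 = user selected the stream
scores = [0.9, 0.8, 0.7, 0.1]           # hypothetical model probabilities
amounts = [3000.0, 100.0, 100.0, 100.0] # hypothetical monthly amounts

print(average_precision(labels, scores))
print(average_precision(labels, scores, weights=amounts))
```

In this toy input the large stream is ranked first, so the weighted AP comes out higher than the unweighted AP, which mirrors the kind of gap between the two curves described above.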
Figure 6. PR curve (left) and weighted PR curves (right) for the Income Identification Model
Model in production
With the capability of predicting income probability, we’re now able to offer a streamlined experience for end users. We launched an A/B test to measure the impact of this experience in production, where it saved users 17% of selection time while keeping the total income shared consistent. In other words, it streamlined the user experience without compromising the number of streams or the total amounts selected.
Bank Income Categorization Model
Before creating this model, we initiated bank income categorization with regex rules, leveraging our in-house bank transaction domain knowledge. This approach yielded high precision as rules excel at detecting patterns commonly found in income transactions. However, the rules were limited in addressing the long-tail problem, where either no discernible pattern is evident or where creating custom rules for every corner case becomes impractical and time-consuming. This is where machine learning comes into play.
Label collection and enhancement
We use manually labeled income sources to train the income categorization model and monitor performance. The raw labels can often contain errors arising from random labeling, changes in labeling options, and variations in how labels are understood. However, high-quality labels and feedback are crucial for building accurate machine learning models. To address this issue, we developed an ML-based label enhancement process that can effectively enhance label quality. Through independent validation, we’ve seen an impressive 80% reduction in label errors.
Model performance evaluation
Similar to the identification model, we use both PR and weighted PR metrics to evaluate the effectiveness of the income categorization model, which is an XGBoost multi-class classifier. Through rigorous hyperparameter tuning and cross-validation, we were able to achieve an AP of 0.969 for SALARY and 0.739 for LONG_TERM_DISABILITY on test data (Figure 7). Notably, our model demonstrates better predictive capabilities for income sources with larger amounts, as reflected in the even higher weighted AP.
Unlike the existing rule-based approach, the model lets us tune the precision-recall tradeoff. When selecting the threshold that converts the model's probability scores into a category prediction, we use both curves. As indicated by the arrows, we identified a sweet spot where the (weighted) precision plateaus once the (weighted) recall reaches a certain point. Following a similar analysis, we examined each category and set its threshold accordingly, where appropriate. By strategically sacrificing a small amount of precision, we’ve achieved substantial gains in recall, resulting in a significant enhancement of the overall performance of our income categorization.
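A simplified stand-in for this kind of threshold analysis is to pick the lowest cutoff that still meets a precision floor, which maximizes recall subject to that floor. The sketch below is not the curve-based procedure used for Figure 7. The function name, the precision floor, and the toy data are assumptions for illustration.

```python
def pick_threshold(labels, scores, min_precision):
    """Return (cutoff, precision, recall) for the lowest cutoff meeting the floor.

    Returns None if no cutoff reaches min_precision.
    """
    total_positive = sum(1 for label in labels if label == 1)
    if total_positive == 0:
        raise ValueError("need at least one positive label")
    best = None
    # Walk cutoffs from strict to lenient; recall never decreases along the way,
    # so the last cutoff that meets the floor has the highest recall.
    for cutoff in sorted(set(scores), reverse=True):
        predicted = [score >= cutoff for score in scores]
        true_pos = sum(1 for hit, label in zip(predicted, labels) if hit and label == 1)
        precision = true_pos / sum(predicted)
        recall = true_pos / total_positive
        if precision >= min_precision:
            best = (cutoff, precision, recall)
    return best


labels = [1, 1, 0, 1, 0]             # 1 = stream truly belongs to the category
scores = [0.9, 0.8, 0.6, 0.5, 0.2]   # hypothetical category probabilities
print(pick_threshold(labels, scores, min_precision=0.75))
```

Running the same selection per category, with its own floor, gives each category an independent cutoff.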
Figure 7. PR and weighted PR curves for two critical categories: SALARY and LONG_TERM_DISABILITY
Model in production
Since the model’s launch in production at the end of 2022, we’ve seen a significant 24% increase in salary recall. As a result, the categorization rate of selected income sources has also gone up significantly, enabling our customers to efficiently verify user income based on selected source streams.
To mitigate potential issues such as data drift and parameter jumps that could result in model performance degradation, we’ve implemented a model retraining pipeline with hyperparameter tuning enabled. After each retraining job, we conduct an evaluation step to analyze the new model's performance and compare it with the previous version, which informs the decision of model promotion. With the growing labeled dataset, our model continues to learn and capture new signals over time.
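The evaluation step that informs model promotion could be as simple as a per-category regression check. The following sketch assumes that both models are scored with AP on the same held-out data and that a small tolerance is acceptable. The function name, the tolerance value, and the rule itself are hypothetical.

```python
def should_promote(new_ap, previous_ap, tolerance=0.005):
    """Promote only if no category's AP drops by more than the tolerance.

    Both arguments map category name -> AP measured on the same test set.
    """
    for category, old_score in previous_ap.items():
        # A category missing from the new results counts as a regression.
        if new_ap.get(category, 0.0) < old_score - tolerance:
            return False
    return True
```

A check like this guards against a retrained model that improves a common category such as SALARY while quietly degrading a rarer one.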
Better loan underwriting
Plaid's Bank Income product is a game-changer for lenders looking to streamline the income verification process. Through advanced clustering and classification models that extract and categorize income data from a consumer's bank transactions, lenders can gain a quick and accurate view of a potential borrower's income sources. The result is a faster underwriting process and more informed decisions about creditworthiness.