Analyzing Credit Balance using User Data with Python

Bombay Brown Boy
6 min read · Jun 11, 2021


Credit! A microeconomic equalizer and a double-edged sword in the hands of the wrong regulator. Thankfully, we’re better than that. In this short article, we will use multivariable regression to analyze the credit data of 400 customers in order to predict credit defaults, user profiles, and ratings in the future. The data set is available at http://www-bcf.usc.edu/~gareth/ISL/index.html as part of the book ‘An Introduction to Statistical Learning with Applications in R’.

Before we get to work on the dataset, let’s emphasize the ‘why’ of this case. We wish to

  1. Profile Users: who’s the borrower? What trend does their credit rating/balance follow? This can help us create better fiscal products that cater to consumer needs at the correct time.
  2. Spot the movement of money: As credit providers, we want to know factors that significantly affect the income of customers which will, in turn, affect their nature of transactions.
  3. Lastly, THE OTHER DISCOVERIES: After all, Christopher Columbus’ unsuccessful search for a western maritime route to India resulted in the discovery of the Americas in 1492. Analyzing customer credit data can surface inconsistencies and underlying market trends that you can let your ‘product owner’ defend in terms of relevance to organizational objectives!

Diving into the data, we see (note that all monetary units are dollars and Income is in $10,000 units):

There are no null values in the set, so we do not need to perform missing value treatment. Some variables are categorical: “Gender”, “Student”, “Married”, and “Ethnicity”. You can describe the data to further assess counts, means, standard deviations, etc. Plotting the distribution of the “Balance” variable, which is our KEY CONCERN, we see:

The distribution shows that the largest group of users has a zero credit balance
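The checks described above (null counts, categorical columns, summary statistics) can be sketched roughly as follows. The rows here are made-up stand-ins that only mimic the column names of the Credit data, and the `Credit.csv` path mentioned in the comment is a hypothetical file name, not something the article specifies.

```python
import pandas as pd

# Made-up rows that only mimic the Credit columns (NOT the real data).
# With the real file you would start from something like
# pd.read_csv("Credit.csv"), where the path is a hypothetical name.
df = pd.DataFrame({
    "Income": [20.5, 95.0, 48.2, 130.7, 33.1],
    "Limit": [3000, 6500, 4200, 9000, 3600],
    "Rating": [250, 480, 330, 650, 290],
    "Age": [31, 77, 45, 38, 62],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Student": ["No", "Yes", "No", "No", "No"],
    "Married": ["Yes", "Yes", "No", "No", "Yes"],
    "Ethnicity": ["Caucasian", "Asian", "Asian", "African American", "Caucasian"],
    "Balance": [0, 910, 0, 1200, 150],
})

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = df.isnull().sum()

# Non-numeric columns are the categorical candidates.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Count, mean, std, min, quartiles and max for the numeric columns.
summary = df.describe()

print(missing_per_column)
print(categorical_cols)
print(summary)
```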

We can also look at the skewness of the credit balance column, which measures the asymmetry in the distribution of that variable. We can do this for both the credit balance and the active credit balance (i.e. credit balance > 0).
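As a minimal sketch of that comparison, the snippet below computes the skewness of a balance column and of its active subset with pandas. The balances are invented for illustration only; the numbers it prints say nothing about the real data set.

```python
import pandas as pd

# Invented balances with a pile-up at zero (illustrative only).
balance = pd.Series([0, 0, 0, 0, 120, 340, 560, 810, 1100, 1999], name="Balance")

# Skewness over every customer, zeros included.
skew_all = balance.skew()

# "Active" balances are the strictly positive ones.
active = balance[balance > 0]
skew_active = active.skew()

print(f"skew (all): {skew_all:.3f}")
print(f"skew (active only): {skew_active:.3f}")
```

A positive value indicates a longer right tail, which is what a cluster of zero balances plus a few large ones tends to produce.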

In the next step, we’ll take a look at how the different integer and float variables are correlated with each other by plotting the matrix below. The darker boxes indicate a higher degree of correlation, and lighter boxes a lower one:

As you can tell, there is a heavy correlation between the following pairs: balance & limit, balance & rating, balance & income, limit & rating, limit & income, and rating & income. These are relations we’d like to understand more deeply by calculating their correlation coefficients and p-values. In a desirable scenario, we want a high correlation coefficient (magnitude close to 1) and a p-value less than 0.05 for statistical significance. We get:

Limit and Rating have a high correlation coefficient. This indicates multicollinearity and suggests that the credit limit may be calculated from the credit rating.
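A correlation coefficient and its p-value for a pair of columns can be obtained with `scipy.stats.pearsonr`. The sketch below uses synthetic arrays, with `limit` built to be nearly linear in `rating` and `age` drawn independently, so the printed values illustrate the mechanics and are not the article's results.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins (assumptions, not the real Credit columns).
rating = rng.normal(350, 150, n)
limit = 15 * rating + rng.normal(0, 200, n)   # nearly linear in rating by construction
age = rng.integers(20, 90, n).astype(float)   # drawn independently of rating

# pearsonr returns the correlation coefficient and a two-sided p-value.
r_limit_rating, p_limit_rating = pearsonr(limit, rating)
r_age_rating, p_age_rating = pearsonr(age, rating)

print(f"limit vs rating: r={r_limit_rating:.3f}, p={p_limit_rating:.3g}")
print(f"age vs rating:   r={r_age_rating:.3f}, p={p_age_rating:.3g}")
```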

To resolve the issue of collinearity, we can drop either the Limit or the Rating column. Let’s drop Limit. But before that, let’s see the distribution plots for the categorical variables.

Ordinary Credit Balance Distributions
Active Credit Balance Distribution

From the above graphs, we can infer that Gender, Ethnicity, and Marital Status don’t have a significant effect on the credit or active credit balance of customers. Being a student, however, makes a notable difference, so let us explore this dynamic further using a boxplot.

The median credit balance is considerably higher for students
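The medians that a boxplot draws can also be read off numerically with a group-by. The values below are invented purely to show the call and do not reproduce the article's figure.

```python
import pandas as pd

# Invented balances (illustrative only, not the real data).
df = pd.DataFrame({
    "Student": ["No", "No", "No", "No", "Yes", "Yes", "Yes"],
    "Balance": [0, 200, 450, 900, 300, 800, 1100],
})

# Median balance per group; this is the center line of each box in a boxplot.
median_by_student = df.groupby("Student")["Balance"].median()
print(median_by_student)

# With matplotlib installed, the plot itself could be drawn with:
# df.boxplot(column="Balance", by="Student")
```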

Now we have a good idea of the relationships within our dataset. Let us drop the Limit column and perform label encoding on the categorical variables. This prepares our data for multiple linear regression (MLR). Below, we can see 2 MLR models (credit and active credit), each with Balance as the target Y and the remaining variables as predictors (X1 and X2 respectively).
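The preparation step could look roughly like this. It is a sketch on made-up rows: label encoding is done here with pandas category codes, which is one common way to do it and not necessarily the encoder the original analysis used.

```python
import pandas as pd

# Made-up rows that only mimic the Credit columns (NOT the real data).
df = pd.DataFrame({
    "Income": [20.5, 95.0, 48.2, 130.7, 33.1],
    "Limit": [3000, 6500, 4200, 9000, 3600],
    "Rating": [250, 480, 330, 650, 290],
    "Age": [31, 77, 45, 38, 62],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Student": ["No", "Yes", "No", "No", "No"],
    "Married": ["Yes", "Yes", "No", "No", "Yes"],
    "Ethnicity": ["Caucasian", "Asian", "Asian", "African American", "Caucasian"],
    "Balance": [0, 910, 0, 1200, 150],
})

# Drop Limit to remove the Limit/Rating collinearity.
prepared = df.drop(columns=["Limit"])

# Label-encode each categorical column as integer codes
# (categories are sorted alphabetically, so "No" -> 0 and "Yes" -> 1).
for col in ["Gender", "Student", "Married", "Ethnicity"]:
    prepared[col] = prepared[col].astype("category").cat.codes

# Two modelling tables: every customer, and active balances only.
X1, y1 = prepared.drop(columns=["Balance"]), prepared["Balance"]
active = prepared[prepared["Balance"] > 0]
X2, y2 = active.drop(columns=["Balance"]), active["Balance"]

print(prepared)
```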

Now let's look at the p-values of each predictor in X1 and X2 using statsmodels.

Income, rating, age, and student have p-values < 0.05, i.e. they are statistically significant
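To make clear what those p-values measure, here is a hand-rolled sketch of the same calculation using only NumPy and SciPy: fit ordinary least squares, estimate each coefficient's standard error, and convert the t-statistic into a two-sided p-value. The data and the generating coefficients are synthetic assumptions; in statsmodels, a fitted OLS result exposes the equivalent numbers through its `pvalues` attribute.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 400

# Synthetic predictors and target (assumed values, not the Credit data).
income = rng.gamma(2.0, 20.0, n)
rating = rng.normal(350, 100, n)
age = rng.integers(20, 90, n).astype(float)
student = rng.integers(0, 2, n).astype(float)
unrelated = rng.normal(size=n)  # has no effect on balance by construction
balance = (-8 * income + 4 * rating - 2 * age + 400 * student
           + rng.normal(0, 100, n))

names = ["const", "income", "rating", "age", "student", "unrelated"]
X = np.column_stack([np.ones(n), income, rating, age, student, unrelated])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, balance, rcond=None)

# Residual variance and coefficient covariance matrix.
resid = balance - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X.T @ X)

# t-statistics and two-sided p-values from the t distribution.
t_stats = beta / np.sqrt(np.diag(cov))
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
p_by_name = dict(zip(names, p_values))

for name, p in p_by_name.items():
    print(f"{name:>10}: p = {p:.3g}")
```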

Now that we’re down to 4 variables (income, rating, age, and student), let us identify the regression coefficients and intercepts for each of these cases.

MLR Models for significant predictors

On calculating the coefficient of determination R², we notice that it decreases when we use only the significant variables. Some drop is expected, because removing predictors from a least-squares model can never raise R². Here it means that the full set of predictors explains more of the variation in credit balance than the significant predictors alone.
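The comparison can be sketched as below, on synthetic data with two deliberately irrelevant predictors. The helper names `r_squared` and `adjusted_r_squared` are illustrative, and adjusted R² is included as an assumption-free companion metric because it penalizes extra predictors, which plain R² does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Synthetic predictors and target (assumed values, not the Credit data).
income = rng.gamma(2.0, 20.0, n)
rating = rng.normal(350, 100, n)
age = rng.integers(20, 90, n).astype(float)
student = rng.integers(0, 2, n).astype(float)
extra_a = rng.normal(size=n)  # irrelevant by construction
extra_b = rng.normal(size=n)  # irrelevant by construction
y = -8 * income + 4 * rating - 2 * age + 400 * student + rng.normal(0, 100, n)

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def adjusted_r_squared(r2, n_obs, n_predictors):
    """R^2 penalized for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

X_reduced = np.column_stack([income, rating, age, student])
X_full = np.column_stack([income, rating, age, student, extra_a, extra_b])

r2_reduced = r_squared(X_reduced, y)
r2_full = r_squared(X_full, y)
adj_reduced = adjusted_r_squared(r2_reduced, n, X_reduced.shape[1])
adj_full = adjusted_r_squared(r2_full, n, X_full.shape[1])

print(f"reduced: R2={r2_reduced:.4f}, adjusted={adj_reduced:.4f}")
print(f"full:    R2={r2_full:.4f}, adjusted={adj_full:.4f}")
```

Because the reduced model is nested inside the full one, its R² cannot exceed the full model's; the size of the gap, or the adjusted values, is the more informative comparison.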

We need to analyze all 4 variables at once to assess their effect on the credit balance of the consumer. To summarize, we plot the graphs below.

The findings of our study show that credit balance is affected by multiple variables that display the behavior discussed below:

  1. Income is significantly lower for students than non-students in the data set. However, being a student doesn’t affect credit ratings drastically with the 2 distributions largely overlapping.
  2. Income/ratings and Age don't seem to have a high positive correlation, meaning income and credit rating will not necessarily increase with the age of customers.
  3. Income and Credit ratings are positively correlated, especially at the denser region of income <$750,000 and Ratings <550. This high-volume segment can be targeted for maximum sales of debt service schemes using pre-approved loans above the credit limit of cardholders.
  4. Lastly, note that, contrary to ordinary intuition, the ages of students and non-students in the dataset do not differ greatly. This is a sign that the group of customers being analyzed contains a unique bracket of older individuals who are also students. This can be an interesting prospect that invites tailored student credit schemes that address key issues of health, economic prospects, and lifestyle.

Physical extensions of human trust, i.e. credit cards

As credit card providers advance towards ‘swipe as you pay’, ‘cashback’, ‘referral points’, and other reward schemes, the key practices in lending must be governed by the background of customers and their needs. If a large portion of your credit card holders are students aged 40-45 with lower incomes and credit ratings, their needs will revolve around paying tuition, the cost of room and board, health insurance, etc. This knowledge must drive decisions on interest rates, billing cycles, and even further research into why these older individuals are enrolled as students and require access to credit. Their balances at any point in the future can be estimated using their income, age, student status, and credit rating.


Bombay Brown Boy

Sr. Data Scientist @ Target | Duke MEng Artificial Intelligence