Credit Card Approval System using Machine Learning

Abhinav Saurabh
7 min read · Dec 20, 2020

In this project, we build a credit card approval system using machine learning in Python.

Assessing credit card applications correctly is very important for banks and other organisations that issue credit cards. Recent years have seen huge growth in credit cards and loans. Judging accurately which applicants to approve lets these organisations minimise losses and, at the same time, make suitable credit arrangements as required. Because the number of applicants has grown so much, a more sophisticated method is needed to automate the process and speed it up.

Automated credit card approval benefits the organisations that issue cards: with the number of applicants growing, the task of classifying applicants as eligible or not eligible for a credit card needs to be automated. This helps the organisation avoid losses from potential defaulters. Here we look not just at the bank balance but also at the applicants' personal attributes, such as gender, marital status, age and occupation. We take these attributes into account to evaluate whether a given applicant is a good customer. This can also cut the weeks-long process down to a few days, which reduces the cost of credit analysis and speeds up credit decisions.

Here we use the UCI Credit Approval Data Set to train the model.

1. Loading Dataset

import pandas as pd

# crx.data is comma-separated and has no header row
df = pd.read_csv('crx.data', header=None)
display(df)

The columns of the dataset are, in order: Gender, Age, Debt, Married, Bank Customer, Education, Ethnicity, Years Employed, Prior Default, Employed, Credit Score, Driving License, Citizenship, Zip Code, Income and Approved. We add these column names when loading the data.

headerRow = ['Gender', 'Age', 'Debt', 'Married', 'Bank Customer', 'Education', 'Ethnicity', 'Years Employed', 'Prior Default', 'Employed', 'Credit Score', 'Driving License', 'Citizenship', 'Zip Code', 'Income', 'Approved']
df = pd.read_csv('crx.data', names=headerRow)
display(df)

2. Knowing the data

# Print summary statistics
desc = df.describe()
print(desc)

This gives info about the numeric features in the dataset.

# Print DataFrame information (df.info() prints its report itself and returns None)
df.info()

As the output above shows, Debt, Years Employed, Credit Score and Income are read as numeric attributes, while the others are non-numeric by default.

# Inspect missing values in the dataset
print(df.tail(17))

We can see that missing values are represented by ? in the dataset.
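
Rather than scanning the tail of the frame by eye, the ? placeholders can also be counted per column. A minimal sketch, assuming a small made-up frame in place of df (the rows below are illustrative and not taken from crx.data):

```python
import pandas as pd

# Made-up rows for illustration; in the article this would be df.
demo = pd.DataFrame({
    'Gender': ['b', '?', 'a', 'b'],
    'Age': ['30.83', '58.67', '?', '?'],
    'Debt': [0.0, 4.46, 0.5, 1.54],
})

# Comparing against '?' gives a boolean frame; summing it counts
# the placeholders in each column.
missing_counts = (demo == '?').sum()
print(missing_counts)
```

Columns with a non-zero count are the ones that need attention in the next step.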

3. Handling the null Values

We use NumPy to replace the ? placeholders with NaN, and then fill each NaN with the mean of its feature.

import numpy as np

# Replace the '?'s with NaN
df = df.replace('?', np.nan)

Then we inspect the data frame.

# Inspect the missing values again
print(df.tail(17))

As we can see, NaN has replaced the ? values. Now we can fill the NaN values with the mean of each feature. It is important to deal with null values because they can affect machine learning models drastically, and some models cannot handle null values at all.

def handleMissingNumeric(df, colNames):
    # Convert each column to numeric and fill NaN with the column mean
    for col in colNames:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df[col] = df[col].fillna(df[col].mean())

def filterDf(df, colNames):
    # Encode each categorical value as an integer, in order of first appearance
    for cols in colNames:
        d = {}
        for i in df[cols]:
            if i not in d:
                d[i] = len(d)
        df[cols] = df[cols].map(d)

handleMissingNumeric(df, ['Age', 'Debt', 'Years Employed', 'Credit Score', 'Zip Code', 'Income'])
filterDf(df, ['Gender', 'Married', 'Bank Customer', 'Education', 'Ethnicity', 'Prior Default', 'Employed', 'Driving License', 'Citizenship', 'Approved'])
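
To make the two steps concrete, here is a minimal sketch that applies the same logic to a tiny made-up frame (the values are illustrative only): the numeric column is mean-filled, and the categorical column is numbered in order of first appearance.

```python
import numpy as np
import pandas as pd

# Made-up data: one numeric column stored as text, one categorical column.
demo = pd.DataFrame({
    'Age': ['20', np.nan, '40'],
    'Gender': ['b', 'a', 'b'],
})

# Numeric column: coerce to float, then fill NaN with the column mean.
demo['Age'] = pd.to_numeric(demo['Age'], errors='coerce')
demo['Age'] = demo['Age'].fillna(demo['Age'].mean())

# Categorical column: number the categories in order of first appearance.
codes = {}
for value in demo['Gender']:
    if value not in codes:
        codes[value] = len(codes)
demo['Gender'] = demo['Gender'].map(codes)

print(demo)
```

Here the missing Age becomes 30.0 (the mean of 20 and 40), and Gender becomes 0, 1, 0 because 'b' is seen first.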

4. Filtering the data

We look into the data and try to find out which features matter most to the target variable, i.e. Approved. To do this, we plot a correlation heatmap.

import seaborn as sns

# Pairwise correlation between all (now numeric) columns
cMatrix = df.corr()
sns.heatmap(cMatrix, annot=False, cmap='coolwarm')

We can see from the above representation that Married, Bank Customer, Education, Citizenship and Zip Code show little correlation with the target variable 'Approved'.

So we are going to drop these features.

df = df.drop([ 'Married', 'Bank Customer', 'Education', 'Citizenship', 'Zip Code'], axis = 1)

We have retained the useful features that are going to help in the prediction of the target class, i.e. Approved.
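
One way to read the heatmap numerically is to rank features by the absolute value of their correlation with the target. A minimal sketch on synthetic data, where the column names informative and noise and the 0.1 threshold are hypothetical choices for illustration, not values used in the article:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic frame: one column that tracks the target, one that does not.
demo = pd.DataFrame({
    'informative': target + rng.normal(0, 0.3, size=n),
    'noise': rng.normal(0, 1, size=n),
    'Approved': target,
})

# Absolute correlation of every feature with the target column.
corr_with_target = demo.corr()['Approved'].drop('Approved').abs()
print(corr_with_target.sort_values(ascending=False))

# Hypothetical cut-off: features below it are candidates for dropping.
threshold = 0.1
weak = corr_with_target[corr_with_target < threshold].index.tolist()
print(weak)
```

Correlation computed on arbitrarily numbered categories is only a rough guide, so a ranking like this is best treated as a first filter rather than a final verdict.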

5. Split data into Train-Test sets.

We split the data into training and testing sets, using a 70:30 ratio.

from sklearn.model_selection import train_test_split

x = ['Gender', 'Age', 'Debt', 'Ethnicity', 'Years Employed', 'Prior Default', 'Employed', 'Credit Score', 'Driving License', 'Income']
y = ['Approved']
xTrain, xTest, yTrain, yTest = train_test_split(df[x], df[y], test_size=0.30, random_state=2)
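
The effect of the split can be checked on a toy array. The sketch below uses synthetic data and also passes stratify, which the article's code does not use; it is shown only as an optional way to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, perfectly balanced labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 70:30 split; stratify=y (not used in the article) preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=2, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```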

6. Applying Model-1 Decision Tree Classifier

We apply a decision tree classifier to the dataset.

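from sklearn.tree import DecisionTreeClassifier
# 'score' is not defined in the article; accuracy_score is assumed here
from sklearn.metrics import accuracy_score as score

trainAcc = []
testAcc = []
for i in range(1, 16):
    # Fit one tree per maximum depth from 1 to 15
    dtc = DecisionTreeClassifier(max_depth=i, random_state=0)
    dtc.fit(xTrain, yTrain)
    trainPred = dtc.predict(xTrain)
    trainAcc.append(score(trainPred, yTrain) * 100)
    testPred = dtc.predict(xTest)
    testAcc.append(score(testPred, yTest) * 100)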

This gives the training and testing accuracy for each depth of the tree.

The following graph depicts the variation of testing and training accuracy according to depths.

The variation in accuracy across depths suggests that this model does not solve the problem very well: the model appears to overfit. We therefore try a different model to improve the results.

Train Accuracy: 89%, Test Accuracy: 84.6%
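
The depth sweep and the overfitting pattern can be reproduced on synthetic data. This is a minimal sketch, assuming make_classification data in place of the credit dataset, so the numbers it prints will not match the ones above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=2)

# Record (depth, train accuracy, test accuracy) for depths 1..15.
results = []
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    results.append((depth, train_acc, test_acc))

# Pick the depth with the highest test accuracy.
best_depth, best_train, best_test = max(results, key=lambda r: r[2])
print(best_depth, round(best_train, 3), round(best_test, 3))
```

A training accuracy that keeps climbing with depth while the test accuracy stalls or falls is the usual sign of overfitting.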

7. Applying Model-2 Logistic Regression Classifier

We further use Logistic Regression Classifier to improve the result.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
clf.fit(xTrain, yTrain)
trainPred = clf.predict(xTrain)
testPred = clf.predict(xTest)

# Print train and test accuracy
print('{}, {}'.format(round(score(trainPred, yTrain), 4), round(score(testPred, yTest), 4)))

Train Accuracy: 83.1%, Test Accuracy: 85.52%

Here the test accuracy is better than that of the previous model, and the train and test accuracies are much closer together. We next use GridSearchCV to find the best parameters for this model.

8. Applying GridSearch on Logistic Regression

Here we apply Grid search on our previous model to get the best parameters to optimize the accuracies further and improve the performance of the model.

We create a list of candidate values for each parameter and search for the best combination. scikit-learn provides GridSearchCV for this purpose. We pass the lists and the model to it, and it returns the results.
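
To get a feel for the size of the search, the combinations can be enumerated directly. A minimal sketch using the same tol and max_iter lists as the code that follows, with the fold count taken from cv=5:

```python
from itertools import product

tol = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
max_iter = [50, 100, 150, 200, 250, 300]

# Every (tol, max_iter) pair that the grid search will try.
combinations = list(product(tol, max_iter))

# Each pair is evaluated once per cross-validation fold.
cv = 5
total_cv_fits = len(combinations) * cv

print(len(combinations), total_cv_fits)
```

That is 48 parameter combinations and 240 cross-validation fits, plus one final refit on the whole training set, which GridSearchCV performs by default.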

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

# Candidate values for each hyperparameter
tol = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
max_iter = [50, 100, 150, 200, 250, 300]
param_grid = dict(tol=tol, max_iter=max_iter)
grid_model = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid, cv=5)

# Fit the scaler on the training set only, then reuse it on the test set
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(xTrain)
rescaledX_test = scaler.transform(xTest)

# As in the original code, the search is fit on the unscaled xTrain
grid_model_result = grid_model.fit(xTrain, yTrain)
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best parameters obtained: {'max_iter': 300, 'tol': 0.0001}

Train Accuracy: 85.51%, Test Accuracy: 86.47% with these parameters.

We can see that the accuracies are somewhat better than the previous results.
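
If the scaled features are to be used in the search, one common pattern is to put the scaler and the classifier into a Pipeline, so that the scaler is refit on the training part of every fold. This is a sketch on synthetic data with a reduced grid, not the configuration that produced the numbers above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=2)

# Scaling happens inside the pipeline, so test folds never leak into the scaler.
pipe = Pipeline([
    ('scale', MinMaxScaler(feature_range=(0, 1))),
    ('logreg', LogisticRegression()),
])

# Step-name prefixes route each parameter to the logistic regression step.
param_grid = {
    'logreg__tol': [0.0001, 0.001, 0.01],
    'logreg__max_iter': [100, 200, 300],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 4))
```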

9. Applying Model-3 Gradient Boost Classifier

We next try a Gradient Boosting Classifier. This model improves the test accuracy further, although the larger gap between train and test accuracy suggests some overfitting.

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0)
clf.fit(xTrain, yTrain)
trainPred = clf.predict(xTrain)
testPred = clf.predict(xTest)
print('Train : {}, Test : {}'.format(round(score(trainPred, yTrain), 6), round(score(testPred, yTest), 6)))

Train Accuracy: 95.35%, Test Accuracy: 87.43%

10. Applying Model-4 Ada Boost Classifier

We apply the AdaBoost Classifier, using the previous Model-2 (Logistic Regression) as the base estimator. This is a meta-estimator technique.

"An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases." (scikit-learn documentation)
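
The reweighting idea in that description can be illustrated with a single boosting round. The sketch below uses the classic discrete AdaBoost update for two classes on five made-up samples; it is an illustration of the principle, and scikit-learn's implementation may differ in details such as the learning rate.

```python
import numpy as np

# Five samples with equal starting weights.
weights = np.full(5, 1 / 5)

# Hypothetical result of one weak classifier: True where it was wrong.
misclassified = np.array([False, False, True, False, False])

# Weighted error rate of this classifier.
error = np.sum(weights[misclassified]) / np.sum(weights)

# Weight given to this classifier in the final vote.
alpha = np.log((1 - error) / error)

# Increase the weights of the misclassified samples, then renormalise.
weights = weights * np.exp(alpha * misclassified)
weights = weights / weights.sum()

print(error, alpha)
print(weights)
```

After this round the single misclassified sample carries about half of the total weight, so the next classifier is pushed to get it right.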

from sklearn.ensemble import AdaBoostClassifier

# Note: base_estimator was renamed to estimator in scikit-learn 1.2
clf = AdaBoostClassifier(base_estimator=LogisticRegression(tol=0.0001, max_iter=300), random_state=0)
clf.fit(xTrain, yTrain)
trainPred = clf.predict(xTrain)
testPred = clf.predict(xTest)
print('Train : {}, Test : {}'.format(round(score(trainPred, yTrain), 6), round(score(testPred, yTest), 6)))

Train Accuracy: 85.92%, Test Accuracy: 87.92%

CONCLUSION

We tried several models to get the maximum accuracy. We first used the Decision Tree Classifier, which gives a test accuracy of 84.6%.

Then we used Logistic Regression with the parameters found by grid search and obtained a test accuracy of 86.47%.

We also implemented a Gradient Boosting Classifier to improve accuracy further and got a test accuracy of 87.43%, which is better than the above models.

Finally, we implemented an AdaBoost Classifier with Model-2 (Logistic Regression with the grid-search parameters) as the base estimator, which slightly improved the test accuracy to 87.92%.

Link to the source-code:

https://github.com/abhinavsaurabh/Credit-Card-Approval-System

Blog authors:

  1. Abhinav Saurabh, MTech CSE, IIIT-Delhi (LinkedIn): Literature Survey, Coding, fine-tuning and Blog.
  2. Rahul Meena, MTech CSE, IIIT-Delhi (LinkedIn): Coding and Data Visualisation

Under the guidance of:

  1. Course Instructor: Dr Tanmoy Chakraborty (LinkedIn, IIITD profile)
  2. Teaching Fellow: Ms Ishita Bajaj
  3. Teaching Assistants: Shiv Kumar Gehlot, Pragya Srivastava, Chhavi Jain, Vivek Reddy, Shikha Singh and Nirav Diwan.

