Credit card application model algorithm: application scorecard


I have implemented this algorithm at my workplace for a credit card application scorecard. I am looking for experts to review it and give their feedback.

For credit card applications, we want to predict a score (probability of default) so that we can decide on applications. We define a one-year observation window and a one-year performance window. Defaults (bad) are defined as 90 days' delinquency. Each observation is labeled either good (0) or bad (1).

The risk profile is discussed with stakeholders and their input is collected; the logistic regression run can be adjusted to include variables the stakeholders consider necessary. The algorithm has two components: modelling good/bad using logistic regression, then adding reject inference to the model, again using logistic regression. You need two datasets: one with a mix of goods and bads (accepted applications), and a second with the rejected cases.

Part A - Modelling of good and bad
First, the dataset is analyzed for missing values; where missing values exist, a missing-value indicator is inserted. The dataset is then divided into two stratified samples in a 4:1 ratio: the train and validation datasets, respectively.
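
A minimal sketch of both steps, assuming a pandas DataFrame df with a binary bad target (0 = good, 1 = bad); the column names are illustrative:

```python
# Sketch only: missing-value indicators plus a 4:1 stratified split.
from sklearn.model_selection import train_test_split

# Insert a missing-value indicator for every column that has missing values
for col in df.columns[df.isna().any()]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

# Stratified 4:1 split into train and validation samples
train, valid = train_test_split(
    df,
    test_size=0.2,          # 4:1 ratio -> 20% validation
    stratify=df["bad"],     # preserve the good/bad mix in both samples
    random_state=42,
)
```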

We now prepare the data for modelling. If there are missing values, we impute them with the median (procedure STDIZE in SAS). We treat outliers by percentile capping at P1 (1st percentile) and P99 (99th percentile).
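
A sketch of the imputation and capping under the same assumptions, standing in for what PROC STDIZE does in SAS; the statistics are taken from the training sample only:

```python
import pandas as pd

def impute_and_cap(train: pd.DataFrame, valid: pd.DataFrame, num_cols):
    """Median-impute and P1/P99-cap numeric columns; stats come from train only."""
    for col in num_cols:
        med = train[col].median()
        p1, p99 = train[col].quantile([0.01, 0.99])
        for part in (train, valid):
            part[col] = part[col].fillna(med).clip(lower=p1, upper=p99)
    return train, valid
```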

The next step is collapsing the levels of a nominal variable if it has too many levels. We use a chi-square reduction method for collapsing levels, per Greenacre's method: it hierarchically clusters the levels (that is, the rows of the two-way contingency table) based on the reduction in the chi-square test of association between the categorical input variable and the target.
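
A rough sketch of the Greenacre-style collapsing (a hypothetical helper, not the SAS implementation): greedily merge the pair of levels whose merger loses the least chi-square association with the target.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def collapse_levels(series: pd.Series, target: pd.Series, max_levels: int):
    """Greedily merge levels until at most max_levels remain."""
    tab = pd.crosstab(series, target).astype(float)   # levels x {good, bad}
    groups = {lvl: [lvl] for lvl in tab.index}        # collapsed -> original levels
    while len(tab) > max_levels:
        base_chi2 = chi2_contingency(tab.values)[0]
        best = None
        levels = list(tab.index)
        for i in range(len(levels)):
            for j in range(i + 1, len(levels)):
                merged = tab.drop(index=[levels[i], levels[j]])
                merged.loc[f"{levels[i]}+{levels[j]}"] = (
                    tab.loc[levels[i]] + tab.loc[levels[j]]
                )
                loss = base_chi2 - chi2_contingency(merged.values)[0]
                if best is None or loss < best[0]:
                    best = (loss, levels[i], levels[j], merged)
        _, a, b, tab = best
        groups[f"{a}+{b}"] = groups.pop(a) + groups.pop(b)
    return groups
```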

The next step is to deal with multicollinearity. To do so, we create clusters of redundant variables based on the second eigenvalue. The principal components are produced by an eigenvalue decomposition of the correlation matrix; the eigenvalues, which lie along the diagonal of the components' covariance matrix, are the variances of the principal components. The eigenvalues are standardized so that their sum equals the number of principal components, which equals the number of variables. The first principal component explains the largest proportion of the variability, and each subsequent component explains a decreasing amount of the total variability. Variables within a cluster have high correlations with each other and, at the same time, low correlations with variables in other clusters. From each cluster we choose the one variable with the minimum 1 - R-squared ratio.
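
A hypothetical sketch of the representative-selection step only: given candidate clusters (e.g., from a VARCLUS-style split when a cluster's second eigenvalue exceeds 1), pick the variable with the lowest (1 - R² own cluster) / (1 - R² nearest other cluster) ratio. It assumes two or more clusters, each with at least two variables.

```python
import numpy as np
import pandas as pd

def pick_representatives(X: pd.DataFrame, clusters: dict) -> dict:
    """Return the variable with the minimum 1 - R^2 ratio per cluster."""
    comps = {}
    for name, cols in clusters.items():            # >= 2 variables per cluster
        Z = ((X[cols] - X[cols].mean()) / X[cols].std()).values
        _, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
        comps[name] = Z @ eigvec[:, -1]            # first principal component
    reps = {}
    for name, cols in clusters.items():
        def ratio(col):
            r2_own = np.corrcoef(X[col], comps[name])[0, 1] ** 2
            r2_next = max(np.corrcoef(X[col], comps[o])[0, 1] ** 2
                          for o in comps if o != name)
            return (1 - r2_own) / (1 - r2_next)
        reps[name] = min(cols, key=ratio)
    return reps
```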

Now we deal with non-linearity between a variable and the target. We run correlation analyses against the target and create Spearman and Hoeffding matrices, then merge the two: a low Spearman rank combined with a high Hoeffding rank indicates a non-linear relationship. For such variables we draw empirical logit plots, binning the variable and plotting the bins against the elogit.
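
SciPy provides spearmanr but no Hoeffding's D (in SAS both come from PROC CORR), so this sketch covers only the empirical-logit piece, with hypothetical names:

```python
import numpy as np
import pandas as pd

def empirical_logit(x: pd.Series, bad: pd.Series, bins: int = 10) -> pd.DataFrame:
    """Bin x and compute the empirical logit of the bad rate per bin."""
    df = pd.DataFrame({"bin": pd.qcut(x, q=bins, duplicates="drop"), "bad": bad})
    g = df.groupby("bin", observed=True)["bad"].agg(["sum", "count"])
    # +0.5 smoothing keeps the logit finite when a bin is all good or all bad
    g["elogit"] = np.log((g["sum"] + 0.5) / (g["count"] - g["sum"] + 0.5))
    return g  # plot bin vs. elogit; a non-monotone pattern suggests non-linearity
```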

Now we calculate the WOE values using the formula WOE = ln(DistrGood / DistrBad), where DistrGood = Goods / TotalGoods and DistrBad = Bads / TotalBads. IV is calculated using the formula IV = sum((DistrGood - DistrBad) × ln(DistrGood / DistrBad)) over the bins. We import the data into Excel and draw the WOE-versus-category plot there. We compare IV values using the following criterion:
    Less than 0.02: unpredictive
       0.02 to 0.1: weak
        0.1 to 0.3: medium
             0.3 +: strong
         above 0.5: suspicious, needs to be investigated
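
A minimal sketch of the WOE/IV arithmetic for one binned characteristic, assuming every bin contains both goods and bads:

```python
import numpy as np
import pandas as pd

def woe_iv(binned: pd.Series, bad: pd.Series):
    """WOE = ln(DistrGood / DistrBad) per bin; IV = sum((DG - DB) * WOE)."""
    tab = pd.crosstab(binned, bad)          # columns: 0 = good, 1 = bad
    dist_good = tab[0] / tab[0].sum()
    dist_bad = tab[1] / tab[1].sum()
    woe = np.log(dist_good / dist_bad)
    iv = float(((dist_good - dist_bad) * woe).sum())
    return woe, iv                          # compare iv against the bands above
```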

Now we run logistic regression with four different selection methods: best-subsets, forward, backward, and stepwise selection. We select the method with the highest c-statistic; this helps finalize the variables in the final model. We then list the final variables in the model: eight variables remain for the final model.
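
A hypothetical sketch of one of the four runs, greedy forward selection scored by the c-statistic (AUC), here with statsmodels rather than the SAS SELECTION= option; the other methods run analogously:

```python
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def forward_select(train, valid, candidates, target="bad", max_vars=8):
    candidates = list(candidates)
    selected = []
    while candidates and len(selected) < max_vars:
        scores = {}
        for var in candidates:
            X = sm.add_constant(train[selected + [var]])
            model = sm.Logit(train[target], X).fit(disp=0)
            Xv = sm.add_constant(valid[selected + [var]])
            scores[var] = roc_auc_score(valid[target], model.predict(Xv))
        best = max(scores, key=scores.get)   # variable adding the most AUC
        selected.append(best)
        candidates.remove(best)
    return selected
```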

We check rank ordering on the development, validation, and out-of-time samples to see whether the current model is satisfactory. Additionally, we run the model on the validation and out-of-time samples to confirm that the coefficients keep the same signs. If rank ordering is not satisfied, rebuild the model until it is. You also need to ensure that the maximum KS occurs in the third decile.
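
A sketch of the rank-ordering check, assuming p_bad is the predicted probability of default: deciles are ordered from highest to lowest risk, the bad rate should fall monotonically, and the decile where KS peaks shows up in the last column.

```python
import pandas as pd

def decile_table(p_bad: pd.Series, bad: pd.Series) -> pd.DataFrame:
    df = pd.DataFrame({"p": p_bad, "bad": bad})
    # Decile 1 = highest predicted risk
    df["decile"] = pd.qcut(df["p"].rank(method="first", ascending=False),
                           10, labels=range(1, 11))
    g = df.groupby("decile", observed=True).agg(bads=("bad", "sum"),
                                                total=("bad", "count"))
    g["goods"] = g["total"] - g["bads"]
    g["bad_rate"] = g["bads"] / g["total"]   # should decrease monotonically
    g["ks"] = (g["bads"].cumsum() / g["bads"].sum()
               - g["goods"].cumsum() / g["goods"].sum()).abs() * 100
    return g                                 # check where the max KS occurs
```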

Validate the model using the validation dataset (a mix of good and bad), correcting the sample bias with the PRIOREVENT option in the logistic regression step. Use the ROC curve, the KS statistic, and the Gini coefficient to judge model performance: KS should be near 0.5, and for a good application scorecard the Gini coefficient should be above 0.45. The cut-off is calculated from the ROC curve using Youden's index.
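
A sketch of the cut-off and headline metrics via scikit-learn; note that the maximum of Youden's J over the ROC thresholds is exactly the KS statistic:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def cutoff_and_metrics(y_true, p_bad):
    fpr, tpr, thresholds = roc_curve(y_true, p_bad)
    j = tpr - fpr                     # Youden's J = sensitivity + specificity - 1
    cutoff = thresholds[np.argmax(j)]
    ks = j.max()                      # KS statistic
    gini = 2 * roc_auc_score(y_true, p_bad) - 1
    return cutoff, ks, gini
```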

Part B - Including Reject Inference
Score the rejected dataset using the model prepared in Part A, correcting the sample bias either with weights or with the PRIOREVENT option. If an observation's predicted probability of default is above the cut-off, it is considered bad (1); if it scores below the cut-off, it is considered good (0). Mark the observations in the rejected dataset as good or bad this way. Now take the dataset of rejected observations and append it to the training dataset used for modelling in Part A.
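
A hypothetical sketch of this hard-cutoff labeling step, assuming a fitted scikit-learn-style classifier from Part A; the optional weight column supports the weighting variant:

```python
import pandas as pd

def label_rejects(rejects: pd.DataFrame, model, features, cutoff, weight=1.0):
    out = rejects.copy()
    p_bad = model.predict_proba(out[features])[:, 1]
    out["bad"] = (p_bad > cutoff).astype(int)  # above cut-off -> bad (1)
    out["weight"] = weight                     # down-weight inferred labels if desired
    return out

# augmented = pd.concat([train.assign(weight=1.0),
#                        label_rejects(rejects, model, features, cutoff, 0.5)])
```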

Now build the model using steps similar to Part A, and validate it using the validation dataset and the out-of-time dataset used in Part A.

Part C - Defining the score range

Now we run the logistic regression with the selected variables and save the Association table (c-statistic), the ParameterEstimates table (beta coefficients), and the ROC table (sensitivity and specificity information). We run procedure NPAR1WAY with the EDF option on the scored datasets and save the KolSmir2Stats table, which holds the KS value. We run the analysis with the different selection methods and compare the ROC curves of the different models. We import these tables, including the validation tables, into MS Excel for further analysis and verification.

Factor = PDO / ln(2)

Score = Offset + Factor × ln(Odds)
Offset = Score - Factor × ln(Odds)

PDO - points to double the odds; a parameter given by the user
Score - the scoring value at which you want to receive specific odds of loan repayment; a parameter given by the user
Odds - the odds of loan repayment at that specific scoring value; a parameter given by the user
Factor - a scaling parameter calculated from the formula presented above

If a scorecard were being scaled where the user wanted odds of 10,000:1 at 800 points and wanted the odds to double every 20 points (i.e., PDO = 20), the factor and offset would be:
Factor = 20 / ln(2) = 28.85
Offset = 800 - 28.8539 × ln(10,000) = 534.25
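
A small sketch that reproduces this arithmetic:

```python
import math

def scaling(pdo: float, base_score: float, base_odds: float):
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset

factor, offset = scaling(pdo=20, base_score=800, base_odds=10_000)
# factor ~ 28.85, offset ~ 534.25; then score = offset + factor * ln(odds)
```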

When WoE coding is selected for a given characteristic, the score for each bin (attribute) of that characteristic is calculated as:
Score = (Beta × WoE + alpha / m) × Factor + Offset / m
Where:
Beta - logistic regression coefficient for the characteristic that owns the given attribute
alpha - logistic regression intercept term
WoE - Weight of Evidence value for the given attribute
m - number of characteristics included in the model
Factor - scaling parameter based on the formula presented previously
Offset - scaling parameter based on the formula presented previously
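
As a sketch, the per-attribute points under this formula (the sign convention depends on whether the regression models the good or the bad outcome):

```python
def bin_points(beta: float, woe: float, alpha: float, m: int,
               factor: float, offset: float) -> float:
    """Score = (Beta * WoE + alpha / m) * Factor + Offset / m."""
    return (beta * woe + alpha / m) * factor + offset / m
```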

The neutral score is then calculated as:
 Neutral score = sum(score_i × distr_i) for i = 1 to k, where
                k - number of bins
                score_i - score assigned to the i-th bin
                distr_i - percentage of the total cases in the i-th bin
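
A sketch of the same weighted sum:

```python
def neutral_score(bin_scores, bin_distrs):
    """sum(score_i * distr_i); distr_i are fractions of total cases, summing to 1."""
    return sum(s * d for s, d in zip(bin_scores, bin_distrs))
```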

Gini Coefficient = abs(1 - sum((Gx(i) + Gx(i-1)) × (Bx(i) - Bx(i-1)))), i = 1 to k
  where k - number of categories of the analyzed predictor
     Gx(i) - cumulative distribution of “good” cases in the i-th category
     Bx(i) - cumulative distribution of “bad” cases in the i-th category
     Gx(0) = Bx(0) = 0
Gini Coefficient = 2 × AUC - 1
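
A sketch of the trapezoidal form above; on the full scored sample it should agree with 2 × AUC - 1:

```python
import numpy as np

def gini_from_bins(goods, bads):
    """goods/bads: counts per category, sorted by the predictor."""
    gx = np.concatenate([[0.0], np.cumsum(goods) / np.sum(goods)])  # Gx(0) = 0
    bx = np.concatenate([[0.0], np.cumsum(bads) / np.sum(bads)])    # Bx(0) = 0
    return abs(1 - np.sum((gx[1:] + gx[:-1]) * (bx[1:] - bx[:-1])))
```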

Monitoring the scorecard:
System stability / population stability report, comparing recent applicants against the expected distribution (from the development sample):
index = sum((%Actual - %Expected) × ln(%Actual / %Expected))
An index of less than 0.10 shows no significant change, 0.10-0.25 denotes a small change that needs to be investigated, and an index greater than 0.25 points to a significant shift in the applicant population.
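
A sketch of the index, assuming the two distributions are given as fractions over the same score bands:

```python
import numpy as np

def psi(actual_pct, expected_pct):
    a = np.asarray(actual_pct, dtype=float)
    e = np.asarray(expected_pct, dtype=float)
    return float(np.sum((a - e) * np.log(a / e)))  # <0.10 stable, >0.25 shifted
```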

Characteristic Analysis Report:
“Expected %” and “Actual %” again refer to the distributions of the development and recent samples, respectively. The index here is calculated simply as:
index = sum((%Actual - %Expected) × Points)
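
The same shape of calculation, now weighted by the points assigned to each attribute (a hypothetical helper):

```python
import numpy as np

def characteristic_index(actual_pct, expected_pct, points):
    a, e, p = (np.asarray(v, dtype=float)
               for v in (actual_pct, expected_pct, points))
    return float(np.sum((a - e) * p))  # expected score shift from this characteristic
```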

Comment from Septy Aprilliandary (Risk Modeler, Retail Rating & Modeling Dept., Credit Portfolio Risk Group at PT Bank Mandiri (Persero) Tbk.):

Hi there, I'd like to know your view on another way of calculating the neutral score, apart from the sum of distribution and score. What if my neutral score is far from 0, even as high as the highest score in the parameter? Is it common to just give 0 as the neutral score of all parameters in the scorecard when this occurs? Thank you; looking forward to your view on this.
