First Project Experience
The Business Use Case I have taken into consideration:
A Portuguese bank is facing a revenue decline, and they would like to know what action to take. After investigation, they found the root cause: their clients are not depositing as frequently as before. So the bank would like to identify existing clients that have a higher chance of subscribing to a term deposit and focus marketing efforts on those clients.
My next step was to think about how to convert this business scenario into a data science problem, so that I could then outline my approach, i.e. what kind of problem it is (regression or classification), and I came up with the following.
Data Science Problem Statement:
Predict whether a client will subscribe to a term deposit, based on analysis of the marketing campaigns the bank performed.
After reaching the problem statement, I concluded that the prediction is about whether the customer will opt for a term deposit or not, which is clearly a classification problem. Next I had to think about the evaluation metric that I would use.
Evaluation Metric:
Based on the problem statement and the classification setting, I have opted for the confusion matrix and the ROC-AUC curve.
The risk for the business lies in wrong predictions, so I will try to minimize the false negatives and false positives in the confusion matrix.
The ROC-AUC score is a number between 0 and 1. The closer it is to 1, the better the model; the closer it is to 0, the worse the model.
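As a quick illustration, both metrics can be computed with scikit-learn. The labels and scores below are made up for demonstration only, not taken from the bank dataset.

```python
# Illustrative sketch: confusion matrix and ROC-AUC with scikit-learn.
# y_true/y_pred/y_score are toy values, not real campaign data.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # did the client subscribe?
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]                     # hard class predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4, 0.3, 0.7]    # predicted probabilities

# sklearn lays the matrix out as [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# ROC-AUC is computed from the probability scores, not the hard labels
auc = roc_auc_score(y_true, y_score)
print(f"ROC-AUC: {auc:.4f}")
```

Note that the ROC-AUC takes the predicted probabilities, while the confusion matrix needs the thresholded class labels.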
Data Dictionary:
A data dictionary is simply a collection of the column names, the type of each column, and a description of it, so that we can understand the dataset and the business better.
Whenever we fetch data from SQL or any other source, a corresponding team is responsible for maintaining that data. In this case, the marketing team is responsible.
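For example, a small data dictionary can be kept right alongside the analysis as a pandas DataFrame. The column names below follow the public bank-marketing dataset; the descriptions are my own paraphrases, not an official dictionary from the marketing team.

```python
# A data dictionary kept as a small pandas DataFrame (illustrative only).
import pandas as pd

data_dict = pd.DataFrame({
    "column":      ["age", "job", "duration", "y"],
    "dtype":       ["int", "category", "int", "binary"],
    "description": [
        "client's age in years",
        "type of job (e.g. 'admin.', 'technician')",
        "last contact duration, in seconds",
        "has the client subscribed to a term deposit?",
    ],
})
print(data_dict.to_string(index=False))
```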
Libraries Used:
pandas
NumPy
Matplotlib
Seaborn
scikit-learn
Data Cleaning and Exploration:
Exploratory data analysis (EDA) is an approach to analyzing datasets by summarizing their main characteristics, often with visualizations. It includes:
· Load and prepare the dataset
· Check numeric and categorical features
· Fill null values by imputing with the mean/median for continuous features and the mode for categorical features; drop a column if more than 60% of its values are null
· Check for class imbalance and, if present, go for oversampling
· Detect outliers
· Univariate analysis of categorical and continuous columns
· Bivariate analysis of categorical columns, basically the distribution of the dependent variable over each independent variable
· Treat outliers
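The imputation and drop rules above can be sketched in a few lines of pandas. The toy DataFrame and its column names here are purely illustrative, not the actual bank data.

```python
# Minimal sketch of the cleaning rules: drop very sparse columns,
# impute median for numeric features and mode for categorical ones.
import pandas as pd

df = pd.DataFrame({
    "balance": [100.0, None, 300.0, 400.0],       # continuous feature
    "job":     ["admin", "admin", None, "tech"],  # categorical feature
    "y":       [0, 0, 0, 1],                      # target
})

# Drop columns that are more than 60% null, per the rule above
df = df.loc[:, df.isnull().mean() <= 0.60]

# Impute: median for numeric columns, mode for categorical ones
for col in df.columns:
    if df[col].dtype.kind in "if":               # integer or float column
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Check class imbalance on the target
print(df["y"].value_counts(normalize=True))
```

On real data you would fit the imputation values on the training split only, to avoid leaking information from the test set.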
Graphs used for EDA:
Univariate: for categorical data we look at the number of values per category, so we can opt for a bar graph or pie chart.
For continuous data, we look at a boxplot or histogram.
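A minimal sketch of both plot types with Matplotlib, on made-up series (the job and age values below are illustrative):

```python
# Univariate plots: bar chart for a categorical column, histogram for
# a continuous one. Data is synthetic, for illustration only.
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

jobs = pd.Series(["admin", "admin", "tech", "services", "tech", "admin"])
ages = pd.Series([25, 32, 41, 38, 52, 29])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Categorical: counts per category as a bar chart
jobs.value_counts().plot(kind="bar", ax=ax1, title="job counts")

# Continuous: distribution as a histogram (a boxplot works too)
ages.plot(kind="hist", bins=5, ax=ax2, title="age distribution")

fig.tight_layout()
fig.savefig("univariate.png")
```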
Model Building and Insights:
· Apply vanilla models to the data
· Write a function to label-encode/one-hot encode categorical variables
· Fit vanilla classification models:
· Logistic Regression
· Decision Tree Classifier
· Random Forest Classifier
· XGBClassifier
· Gradient Boosting Classifier
· Feature Selection: Recursive Feature Elimination (RFE) for each model except random forest, which has feature selection built in. RFE is basically a wrapper method that uses the model itself to identify the best features. It matters a lot when you are dealing with a large dataset, such as one with more than 1,000 columns.
· Grid Search & Hyperparameter Tuning: the job of grid search is to try every permutation and combination of the parameters you pass into it and return the one that gives the best result.
· Ensembling: using multiple machine learning models to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
· Prediction on the test data: from the above observations and plots, it can be inferred that the best performing model was XGBoost, giving a ROC-AUC score of 93.81%. While XGBoost is used a lot, it is always prudent to start with simpler algorithms and then move on to complex ones.
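The overall flow, from encoding through grid search to prediction, can be sketched as follows. This is a simplified illustration on synthetic data: the columns, parameter grid, and choice of logistic regression are my assumptions for the sketch, not the project's actual configuration.

```python
# End-to-end sketch: one-hot encode, split, tune a vanilla model with
# GridSearchCV scored on ROC-AUC, then predict probabilities on the test set.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for the bank-marketing data
df = pd.DataFrame({
    "age": [25, 40, 35, 50, 28, 60, 45, 33, 52, 38, 29, 48],
    "job": ["admin", "tech", "admin", "services", "tech", "admin",
            "services", "tech", "admin", "services", "tech", "admin"],
    "y":   [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})

# One-hot encode the categorical variable
X = pd.get_dummies(df.drop(columns="y"), columns=["job"])
y = df["y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Grid search over a small hyperparameter grid, scored on ROC-AUC
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X_train, y_train)

# Probability of the positive class (subscribing) on the held-out set
test_proba = grid.predict_proba(X_test)[:, 1]
print("best C:", grid.best_params_["C"])
```

The same skeleton applies to the tree-based models listed above: swap the estimator and the parameter grid, keep the scoring and the split the same so the ROC-AUC numbers stay comparable.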
Congratulations!