BigqueryML and Logistic Regression
In this article will explore the Logistic Regression with SQL in BigQueryML, using the Heart dataset from Kaggle.
In BigQuery, we create the dataset as HeartDataset and then within this dataset we create the table HeartData. Will upload the heart.csv data in this table.
BigQuery Project ==> HeartDataset ==> HeartData
Explore the data with Datalab:
!pip install google-cloud-bigquery
Next we write the select on the HeartData table and store in pandas dataframe as:
query=""
SELECT
*
FROM
`HeartDataset.HeartData`
"""
from google.cloud import bigquery
df = bigquery.Client().query(query).to_dataframe()"
Data Visualization
Preparing the data for Training
Next we create 3 tables in BigQuery as traindata, evaluationdata and testdata from the files loaded in the google cloud storage.
We train Model1 with age as the feature and target. The evaluation tab shows the AUC and loss as:
Recommended by LinkedIn
Next model training, will add few more features as age,sex and cp. Keep adding the features as loss decreases and accuracy increases.
Final model with age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal as the feature and target, gives us AUC around 0.93.
Lets evaluate this model on the evaluation data with ML.EVALUATE as:
SELEC
roc_auc
FROM
ML.EVALUATE(MODEL `HeartDataset.model5`,
(
SELECT
age,
sex,
cp,
trestbps,
chol,
fbs,
restecg,
thalach,
exang,
oldpeak,
slope,
ca,
thal,
target
FROM
`HeartDataset.evaluationdata`))
AUC : 0.91
Next, predict on the test data:
SELECT predicted_target, predicted_target_probs, target actua
FROM
ML.PREDICT(MODEL`HeartDataset.model5`,
(
SELECT
age,
sex,
cp,
trestbps,
chol, fbs , restecg, thalach,
exang ,oldpeak, slope, ca, thal,
target
FROM `HeartDataset.testdata`))l
Confusion Matrix:
pd.crosstab(index=df['predicted_target'], columns=df['actual'])
How many predicted correctly out of the 61 rows in the test dataset? 49 got correctly classified.
SELECT COUNT(*)
FROM (
SELECT predicted_target, predicted_target_probs, target actual
FROM
ML.PREDICT(MODEL`HeartDataset.model5`,
(
SELECT
age,
sex,
cp,
trestbps,
chol, fbs , restecg, thalach,
exang ,oldpeak, slope, ca, thal,
target
FROM `HeartDataset.testdata`)))
WHERE
predicted_target = actual;