BigqueryML and Logistic Regression
Photo by Steven Aguilar on Unsplash

BigqueryML and Logistic Regression

In this article will explore the Logistic Regression with SQL in BigQueryML, using the Heart dataset from Kaggle.

In BigQuery, we create the dataset as HeartDataset and then within this dataset we create the table HeartData. Will upload the heart.csv data in this table.

BigQuery Project ==> HeartDataset ==> HeartData

No alt text provided for this image

Explore the data with Datalab:

!pip install google-cloud-bigquery        

Next we write the select on the HeartData table and store in pandas dataframe as:

query=""

SELECT

 *

FROM

 `HeartDataset.HeartData`

"""

from google.cloud import bigquery

df = bigquery.Client().query(query).to_dataframe()"        
No alt text provided for this image

Data Visualization

No alt text provided for this image
No alt text provided for this image

Preparing the data for Training

No alt text provided for this image
No alt text provided for this image
No alt text provided for this image

Next we create 3 tables in BigQuery as traindata, evaluationdata and testdata from the files loaded in the google cloud storage.

We train Model1 with age as the feature and target. The evaluation tab shows the AUC and loss as:

No alt text provided for this image

Next model training, will add few more features as age,sex and cp. Keep adding the features as loss decreases and accuracy increases.

Final model with age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal as the feature and target, gives us AUC around 0.93.

Lets evaluate this model on the evaluation data with ML.EVALUATE as:

SELEC
  roc_auc
FROM
  ML.EVALUATE(MODEL `HeartDataset.model5`,
    (
    SELECT
      age,
      sex,
      cp,
      trestbps,
      chol,
      fbs,
      restecg,
      thalach,
      exang,
      oldpeak,
      slope,
      ca,
      thal,
      target
    FROM
      `HeartDataset.evaluationdata`))        

AUC : 0.91

Next, predict on the test data:

SELECT predicted_target, predicted_target_probs, target actua
FROM
  ML.PREDICT(MODEL`HeartDataset.model5`,
    (
      SELECT
         age,
        sex,
        cp,
        trestbps,
        chol,	fbs	, restecg,	thalach,
        exang	,oldpeak,	slope,	ca,	thal,
        target
       FROM `HeartDataset.testdata`))l        

Confusion Matrix:

pd.crosstab(index=df['predicted_target'], columns=df['actual'])        
No alt text provided for this image

How many predicted correctly out of the 61 rows in the test dataset? 49 got correctly classified.

SELECT COUNT(*)
FROM (
      SELECT predicted_target, predicted_target_probs, target actual
FROM
  ML.PREDICT(MODEL`HeartDataset.model5`,
    (
      SELECT
         age,
        sex,
        cp,
        trestbps,
        chol,	fbs	, restecg,	thalach,
        exang	,oldpeak,	slope,	ca,	thal,
        target
       FROM `HeartDataset.testdata`)))
WHERE
       predicted_target =  actual;        





To view or add a comment, sign in

More articles by Ruma Sinha

  • BigQuery and DataLab

    We can work directly with BigQuery from Jupyter notebook in DataLab. Can do the pip install as !pip install…

  • Loading data with csv file in Google Bigquery

    In Google Cloud BigQuery we can load data from the csv file. We can create a dataset under the Google Project.

Others also viewed

Explore content categories