BigqueryML and Logistic Regression

Ruma Sinha

Published Sep 4, 2021

In this article we explore logistic regression with SQL in BigQuery ML, using the Heart dataset from Kaggle.

In BigQuery, we create a dataset named HeartDataset and, within it, a table named HeartData. We then upload the heart.csv data into this table.

BigQuery Project ==> HeartDataset ==> HeartData

Explore the data with Datalab:

!pip install google-cloud-bigquery

Next, we write a SELECT on the HeartData table and store the result in a pandas DataFrame:

query = """
SELECT
  *
FROM
  `HeartDataset.HeartData`
"""

from google.cloud import bigquery

# Run the query and load the result into a pandas DataFrame
df = bigquery.Client().query(query).to_dataframe()

Data Visualization

Preparing the data for Training

Next, we create three tables in BigQuery (traindata, evaluationdata and testdata) from files loaded into Google Cloud Storage.
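The article does not say how the rows were divided among the three files. One possible approach, sketched here under the assumption of a reproducible random split, is shown below. The 60/20/20 proportions, the seed and the `split_rows` helper are illustrative choices, not taken from the original workflow.

```python
import random

def split_rows(rows, eval_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle rows reproducibly and cut them into train/evaluation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_eval = int(len(shuffled) * eval_frac)
    test = shuffled[:n_test]
    evaluation = shuffled[n_test:n_test + n_eval]
    train = shuffled[n_test + n_eval:]
    return train, evaluation, test

rows = list(range(100))  # stand-in for the rows of heart.csv
train, evaluation, test = split_rows(rows)
print(len(train), len(evaluation), len(test))
```

Each list could then be written to its own CSV file, uploaded to Cloud Storage and loaded into the matching table.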

We train Model1 with age as the only feature and target as the label. The model's Evaluation tab in BigQuery then reports the AUC and loss.


For the next model, we add a few more features: age, sex and cp. We keep adding features as long as the loss decreases and the accuracy increases.

The final model, with age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca and thal as features and target as the label, gives an AUC of around 0.93.
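The training statement itself is not shown in the article. A minimal sketch of how such a statement could be assembled is given below. It assumes the final model is the `HeartDataset.model5` used in the later queries and that it is trained on `HeartDataset.traindata`. `build_create_model_sql` is a hypothetical helper; `model_type` and `input_label_cols` are standard BigQuery ML `CREATE MODEL` options.

```python
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

def build_create_model_sql(model, table, features, label="target"):
    """Return a BigQuery ML CREATE MODEL statement for logistic regression."""
    columns = ",\n  ".join(list(features) + [label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['{label}']) AS\n"
        f"SELECT\n  {columns}\nFROM\n  `{table}`"
    )

sql = build_create_model_sql("HeartDataset.model5", "HeartDataset.traindata", FEATURES)
print(sql)
# To run it (needs project credentials):
#   from google.cloud import bigquery
#   bigquery.Client().query(sql).result()
```

Passing a shorter feature list, such as `["age"]` or `["age", "sex", "cp"]`, would produce the statements for the earlier, smaller models.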

Let's evaluate this model on the evaluation data with ML.EVALUATE:

SELECT
  roc_auc
FROM
  ML.EVALUATE(MODEL `HeartDataset.model5`,
    (
    SELECT
      age,
      sex,
      cp,
      trestbps,
      chol,
      fbs,
      restecg,
      thalach,
      exang,
      oldpeak,
      slope,
      ca,
      thal,
      target
    FROM
      `HeartDataset.evaluationdata`))

AUC : 0.91
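As a reminder of what this number measures, the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on made-up labels and scores. It illustrates the metric only and is not how BigQuery ML computes `roc_auc` internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8 of the 9 pairs are ordered correctly
```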

Next, predict on the test data:

SELECT
  predicted_target,
  predicted_target_probs,
  target AS actual
FROM
  ML.PREDICT(MODEL `HeartDataset.model5`,
    (
    SELECT
      age,
      sex,
      cp,
      trestbps,
      chol,
      fbs,
      restecg,
      thalach,
      exang,
      oldpeak,
      slope,
      ca,
      thal,
      target
    FROM
      `HeartDataset.testdata`))
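In the ML.PREDICT output, `predicted_target_probs` is an array of structs, each holding a `label` and its `prob`. Assuming the result is loaded into Python so that each value arrives as a list of dictionaries, a small helper can pull out the probability of the positive class. `positive_probability` and the sample row are hypothetical.

```python
def positive_probability(probs, positive_label=1):
    """Return the probability assigned to positive_label, or None if it is absent."""
    for entry in probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    return None

# Hypothetical value, shaped like one predicted_target_probs cell
row_probs = [{"label": 1, "prob": 0.83}, {"label": 0, "prob": 0.17}]
print(positive_probability(row_probs))
```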

Confusion Matrix:

import pandas as pd  # df holds the ML.PREDICT result, loaded with to_dataframe()
pd.crosstab(index=df['predicted_target'], columns=df['actual'])
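To make explicit what the cross-tabulation counts, here is a minimal standard-library sketch that tallies each (predicted, actual) pair. The sample lists are made up for illustration and are not the article's test data.

```python
from collections import Counter

def confusion_counts(predicted, actual):
    """Count (predicted, actual) pairs; pairs with equal values are correct predictions."""
    return Counter(zip(predicted, actual))

# Hypothetical predictions and true labels
predicted = [1, 0, 1, 1, 0, 0]
actual = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(predicted, actual)
correct = sum(n for (p, a), n in counts.items() if p == a)
print(dict(counts), correct)
```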

How many of the 61 rows in the test dataset were predicted correctly? The following query counts them: 49 were classified correctly.

SELECT
  COUNT(*)
FROM (
  SELECT
    predicted_target,
    predicted_target_probs,
    target AS actual
  FROM
    ML.PREDICT(MODEL `HeartDataset.model5`,
      (
      SELECT
        age,
        sex,
        cp,
        trestbps,
        chol,
        fbs,
        restecg,
        thalach,
        exang,
        oldpeak,
        slope,
        ca,
        thal,
        target
      FROM
        `HeartDataset.testdata`)))
WHERE
  predicted_target = actual;
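The count translates directly into test accuracy. A short calculation using the two figures reported in the article (49 correct out of 61 rows):

```python
correct, total = 49, 61  # counts reported in the article
accuracy = correct / total
print(f"accuracy = {accuracy:.3f}, misclassified = {total - correct}")
```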
