BigQuery and DataLab
We can work directly with BigQuery from Jupyter notebook in DataLab. Can do the pip install as !pip install google-cloud-bigquery and we are good to go.
In the Jupyter notebook cell we can run the SQL query as:
%%bigquery --project $PROJECT
SELECT fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
FROM `<project>.winequality.winequality`
ORDER BY quality DESC
The project is the current GCP project we are working. The table winequality contains the Wine quality data from Kaggle. This table is within the dataset of the same name. Once the query completes the output is displayed similar to pandas dataframe.
We can run a parameterized query as:
PARAMS = {"quality_val": 4}
%%bigquery --project $PROJECT --params $PARAMS
SELECT
fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
FROM `<project>.winequality.winequality`
WHERE quality = @quality_val
LIMIT 10
During the query runtime the value 4 is passed to the parameter quality_val and the query result comprise of data where quality = 4
We can directly load the query result into pandas dataframe with " %%bigquery df"
%%bigquery df --project $PROJECT
SELECT
fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
FROM `<project>.winequality.winequality`
ORDER BY quality DESC
df.describe()
The describe function of the pandas dataframe return the mean,max,min etc for all the numerical columns.
The boxplot of the columns density and quality as
df.boxplot('density','quality')