Handling JSON Data in BigQuery: Extracting Data from JSON Arrays in String Format

Handling JSON Data in BigQuery: Extracting Data from JSON Arrays in String Format

Today, I’m excited to share a scenario where we’ll tackle JSON data in BigQuery stored as a JSON array in string format. This common situation requires us to handle and manipulate JSON not in its usual array format but as a string representation.

If you work with BigQuery datasets, you may encounter this type of data, and it's crucial to know how to extract useful information from it. In this article, we’ll walk through a step-by-step process for querying this data efficiently.


What Does a JSON ARRAY in STRING Format Look Like?

Let me introduce you to the sample JSON data we'll be working with today. Here's the data we’ll focus on:

{
  "Employee_ID": 101,
  "Name": "John Doe",
  "Position": "Software Engineer",
  "Department": "Engineering",
  "Salary": 85000,
  "Location": "New York",
  "Employment_Status": "Full-Time",
  "Projects": "E-commerce Platform Revamp",
  "Skills_Details": "[{\"skill\":\"Python\", \"level\":\"Expert\", \"certified\":\"Yes\", \"years_of_experience\":5}, {\"skill\":\"Java\", \"level\":\"Intermediate\", \"certified\":\"No\", \"years_of_experience\":3}, {\"skill\":\"SQL\", \"level\":\"Advanced\", \"certified\":\"Yes\", \"years_of_experience\":4}]",
  "Work_Hours": "9 AM - 5 PM",
  "Join_Date": "2021-03-15",
  "Gender": "Male"
}        

In this data, the "Skills_Details" field contains a JSON array, but it is stored as a string. This often happens when data is exported or loaded into systems that don't handle nested JSON properly, or simply due to formatting issues.


Lets look at the output table before we fix it :

Article content
As you can see, the 'Skills_Details' column has issues with parsing the data because it contains a JSON array stored as a string.


Let’s Tackle This Step by Step :

Step 1: Create the Table

First, we need to create the table in BigQuery where this data will reside. Since we’re working with JSON, we define all the fields as appropriate data types, but notice that the Skills_Details field is a string because, for now, it’s just a string representing a JSON array.

Here’s the SQL command to create the table :

CREATE TABLE `your_project_id.your_dataset.employees` ( 
  Employee_ID INT64,
  Name STRING,
  Position STRING,
  Department STRING,
  Salary INT64,
  Location STRING,
  Employment_Status STRING,
  Projects STRING,
  Skills_Details STRING, 
  Work_Hours STRING,
  Join_Date DATE,
  Gender STRING
);        


Step 2: Loading Data

Once the table is created, we’ll load the JSON data from a Google Cloud Storage (GCS) bucket into BigQuery. In this case, the JSON file is located in a GCS bucket. We can load it into our table using the following command:

LOAD DATA INTO `your_project_id.your_dataset.employees` 
FROM FILES (
  FORMAT = 'JSON',
  URIS = ["gs://bucket_using_python/emp.json"]
);        

Step 3: Querying the Data

In this step, we are trying to extract and query the data. Since the Skills_Details field contains a JSON array in string format, we need to :

  • Extract the JSON array using JSON_EXTRACT_ARRAY().
  • Unnest the array with UNNEST() to break it into individual rows.
  • Query the individual elements within the JSON, such as skill, level, certification, and years of experience.

Here’s how we can query the data:

WITH emp_data AS (
  SELECT Employee_ID, Name, Projects, 
  JSON_EXTRACT_ARRAY(Skills_Details) AS skill_details
  FROM `your_project_id.your_dataset.employees`
)
SELECT Employee_ID, Name, Projects,
  JSON_EXTRACT_SCALAR(skill, '$.skill') AS skill,
  JSON_EXTRACT_SCALAR(skill, '$.level') AS level,
  JSON_EXTRACT_SCALAR(skill, '$.certified') AS certified,
  CAST(JSON_EXTRACT_SCALAR(skill, '$.years_of_experience') AS INT64) AS years_of_experience
FROM emp_data,
UNNEST(skill_details) AS skill;        
1. JSON_EXTRACT_ARRAY(Skills_Details): This function extracts the JSON array from the Skills_Details string. It treats the string as a JSON object.
2. UNNEST(skill_details): This is used to flatten the array into individual rows, each representing a skill.
3. JSON_EXTRACT_SCALAR(): This function pulls out specific pieces of information from each JSON object, like skill, level, certification status, and years of experience. We also convert years_of_experience into an integer so we can work with it more easily and accurately.


Bonus Step: Creating a New Table from the Query Output

As a bonus step, once we’ve successfully extracted and transformed the data using the query above, it’s often useful to create a new table from the output. This new table can be used for further analysis, reporting, or even as an intermediate step in a larger data pipeline.

So, after running the query to extract and process the data, we can create a new table to store the results. Here’s the SQL to create the new table from the query output:

CREATE OR REPLACE TABLE `your_project_id.your_dataset.transformed_employees` AS
WITH emp_data AS (
  SELECT Employee_ID, Name, Projects, 
  JSON_EXTRACT_ARRAY(Skills_Details) AS skill_details
  FROM `your_project_id.your_dataset.employees`
)
SELECT Employee_ID, Name, Projects,
  JSON_EXTRACT_SCALAR(skill, '$.skill') AS skill,
  JSON_EXTRACT_SCALAR(skill, '$.level') AS level,
  JSON_EXTRACT_SCALAR(skill, '$.certified') AS certified,
  CAST(JSON_EXTRACT_SCALAR(skill, '$.years_of_experience') AS INT64) AS years_of_experience
FROM emp_data,
UNNEST(skill_details) AS skill;        

Conclusion :

By following this approach, we can easily handle and query JSON data stored as a string array in BigQuery. It's a powerful way to extract detailed, structured information from unstructured data. In real-world scenarios, especially when dealing with complex datasets, this kind of transformation and querying process is incredibly useful.

I hope this walkthrough helps you understand how to deal with such JSON scenarios. Stay tuned for more insights into working with BigQuery and other data processing tools!

To view or add a comment, sign in

More articles by Rahul Malviya

Explore content categories