Serverless Web Scraping using Google Cloud

I recently completed a project for a client that appeared fairly straightforward on the surface, but proved to be a bit more complex given some of the constraints of the project. In short, there was a need to scrape conversational data (i.e., user forums) from a public website. The task required a bot that fires on a schedule: every 15 minutes, retrieve the newest messages and threads from the site and write the scraped results to a Cloud Storage (GCS) bucket.

Originally, I figured that this would be a straightforward stack:

  • Python (requests)
  • Apache Airflow (for the scheduler and logging)
  • Google Cloud

The wrinkle in this project proved to be the scheduling. Apache Airflow requires 4GB of memory, and when I costed out the compute needed to run the stack, the annual bill for this project would have exceeded the client's budget. While I really do like Airflow, I needed to find a (much) cheaper alternative.

Enter serverless architecture.


In the end, the solution was really straightforward and, frankly, enjoyable to deploy.

  1. I made a small change to the crawler so it behaves as a function that accepts an HTTP POST request. This code was then added to the project as a Cloud Function (a sketch of steps 1 and 3 follows this list).
  2. The Cloud Function is triggered via an HTTP request. Cloud Scheduler is Google's managed cron service, which made firing the Cloud Function every 15 minutes via HTTP painfully simple.
  3. When executed, the Cloud Function writes the parsed data to a Cloud Storage bucket as a JSON file.
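
To make steps 1 and 3 concrete, here is a minimal sketch of what the Cloud Function entry point can look like. The forum URL, bucket name, and parsing logic below are placeholders; the real crawler's details are specific to the target site.

import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

# Placeholder values -- swap in the real forum endpoint and bucket name.
FORUM_URL = "https://example.com/forum/latest.json"
BUCKET_NAME = "example-forum-scrapes"


def scrape_forum(request):
    """HTTP-triggered entry point; Cloud Scheduler POSTs to it every 15 minutes."""
    response = requests.get(FORUM_URL, timeout=30)
    response.raise_for_status()

    # The real crawler parses out the newest threads and messages here;
    # this sketch simply passes the JSON payload through.
    payload = response.json()

    # Write the scraped results to Cloud Storage as a timestamped JSON file.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    blob_name = f"forum-scrapes/{timestamp}.json"

    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    bucket.blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )

    return f"Wrote gs://{BUCKET_NAME}/{blob_name}", 200

For step 2, the Cloud Scheduler job just needs the function's trigger URL and the cron expression */15 * * * *.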

That's it!

The reality is that for this task, Apache Airflow would have been overkill: the engine would sit idle the vast majority of the time, yet that compute time would rack up costs for the project. Moreover, Google's pricing tiers are very reasonable, and the services offer free tiers that make smaller projects very inexpensive, if not free, to deploy. In the end, even after factoring in very generous assumptions about growth, the annual cost is expected to be less than $10 USD. With a monetization strategy in place, the ROI potential of this data pipeline is through the roof.

DataOps engineers should seriously consider the pros and cons of a serverless approach in future data pipeline projects. My approach above uses Google Cloud, but I am certain that the benefits would extend to any of the other major cloud providers with proper design. Oh, did I mention that logging is built in and the project's code can be kept under source control?






