Serverless Web Scraping using Google Cloud

I recently completed a project for a client that appeared fairly straightforward on the surface, but proved to be a bit more complex given some of the constraints of the project. In short, there was a need to scrape conversational data (i.e., user forums) from a public website. The task required a bot that fires on a schedule: every 15 minutes, retrieve the newest messages and threads from the site and write the scraped results to a Cloud Storage (GCS) bucket.

Originally, I figured that this would be a straightforward stack:

  • Python (requests)
  • Apache Airflow (for the scheduler and logging)
  • Google Cloud

The wrinkle in this project proved to be the scheduling. Apache Airflow requires 4GB of memory, and when I costed out the compute needed to run the stack, the annual bill for this project would have exceeded the client's budget. While I really do like Airflow, I needed to find a (much) cheaper alternative.

Enter serverless architecture.


In the end, the solution was really straightforward and, frankly, enjoyable to deploy.

  1. I made a small change to the crawler so it behaves as a function that accepts an HTTP POST request. This code was then added to the project as a Cloud Function (a sketch of steps 1 and 3 follows this list).
  2. The Cloud Function is triggered via an HTTP request. Cloud Scheduler is Google's managed cron service, which made firing the Cloud Function every 15 minutes via HTTP painfully simple.
  3. When executed, the Cloud Function writes the parsed data to a Cloud Storage bucket as a JSON file.
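
To make steps 1 and 3 concrete, here is a minimal sketch of what the Cloud Function entry point can look like. The forum URL, bucket name, and parsing logic below are placeholders; the real crawler's details are specific to the target site.

import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

# Placeholder values -- swap in the real forum endpoint and bucket name.
FORUM_URL = "https://example.com/forum/latest.json"
BUCKET_NAME = "example-forum-scrapes"


def scrape_forum(request):
    """HTTP-triggered entry point; Cloud Scheduler POSTs to it every 15 minutes."""
    response = requests.get(FORUM_URL, timeout=30)
    response.raise_for_status()

    # The real crawler parses out the newest threads and messages here;
    # this sketch simply passes the JSON payload through.
    payload = response.json()

    # Write the scraped results to Cloud Storage as a timestamped JSON file.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    blob_name = f"forum-scrapes/{timestamp}.json"

    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    bucket.blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )

    return f"Wrote gs://{BUCKET_NAME}/{blob_name}", 200

For step 2, the Cloud Scheduler job just needs the function's trigger URL and the cron expression */15 * * * *.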

That's it!

The reality is that for this task, Apache Airflow would have been overkill: the engine would sit idle the vast majority of the time, yet that compute time would rack up costs for the project. Moreover, Google's pricing tiers are very reasonable, and the services offer free tiers that make smaller projects very inexpensive, if not free, to deploy. In the end, even after factoring in very generous assumptions about growth, the annual cost is expected to be less than $10 USD. With a monetization strategy in place, the ROI potential of this data pipeline is through the roof.

DataOps engineers should seriously consider the pros and cons of a serverless approach in future data pipeline projects. My approach above uses Google Cloud, but I am certain that the benefits would extend to any of the other major cloud providers with proper design. Oh, did I mention that logging is built in and the project's code can be kept under source control?






