Serverless Web Scraping using Google Cloud
I recently completed a project for a client that appeared fairly straightforward on the surface but proved to be more complex given some of the project's constraints. In short, the client needed to scrape conversational data (i.e., user forums) from a public website. The bot had to run on a schedule: every 15 minutes, it would attempt to retrieve the newest messages and threads from the site and write the scraped results to a Google Cloud Storage (GCS) bucket.
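To make the task concrete, here is a minimal sketch of what a single scheduled run might look like. It is not the code from the client project. The HTML markup (`<div class="post" data-id="...">`), the blob naming scheme, and the helper names `PostParser`, `parse_posts`, and `run_once` are all assumptions made for illustration, and the network fetch and the bucket upload are passed in as plain functions so the logic can be tried without any cloud credentials.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collects text from <div class="post" data-id="..."> blocks (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._current = None  # the post being read, or None when outside a post
        self._depth = 0       # nesting depth of <div> tags inside the current post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            classes = (attrs.get("class") or "").split()
            if tag == "div" and "post" in classes:
                self._current = {"id": attrs.get("data-id"), "text": ""}
                self._depth = 1
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the post.
                self._current["text"] = " ".join(self._current["text"].split())
                self.posts.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_posts(html):
    """Return a list of {"id", "text"} dicts found in the page."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


def run_once(fetch, upload, seen_ids, now=None):
    """One scheduled run: fetch the page, keep unseen posts, upload them as JSON Lines.

    fetch() returns the page HTML; upload(name, body) writes one object.
    Returns the number of new posts written.
    """
    now = now or datetime.now(timezone.utc)
    new_posts = [p for p in parse_posts(fetch()) if p["id"] not in seen_ids]
    if not new_posts:
        return 0
    name = now.strftime("forum/%Y/%m/%d/%H%M.jsonl")  # assumed naming scheme
    body = "\n".join(json.dumps(p, sort_keys=True) for p in new_posts) + "\n"
    upload(name, body)
    seen_ids.update(p["id"] for p in new_posts)
    return len(new_posts)
```

In a real deployment, `upload` could be a thin wrapper around the `google-cloud-storage` client (for example, `bucket.blob(name).upload_from_string(body)`), and `fetch` could wrap `urllib.request.urlopen`. Note that an in-memory `seen_ids` set would not survive between serverless invocations, so the set of already-scraped IDs would need to be persisted somewhere, which this sketch leaves out.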
Originally, I figured that this would be a straightforward stack:
The wrinkle in this project proved to be the scheduling. Apache Airflow requires 4 GB of memory, and when I costed out the compute required to run the stack, the annual bill would have exceeded the client's budget. While I really like Airflow, I needed to find a (much) cheaper alternative.
Enter serverless architecture.
In the end, the answer was really straightforward, and frankly, enjoyable to deploy.
That's it!
The reality is that for this task, Apache Airflow would have been overkill: the engine would sit idle for the vast majority of the time, yet that compute time would still rack up costs for the project. Moreover, Google's pricing is very reasonable, as the services offer free tiers that make smaller projects very inexpensive, if not free, to deploy. In the end, even after factoring in very generous assumptions about growth, the annual cost is expected to be less than $10 USD. With a monetization strategy in place, the ROI potential of this data pipeline is through the roof.
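To put a rough number on "idle for the vast majority of the time", the arithmetic can be sketched as follows. The 15-minute interval comes from the project; the 30 seconds per run is purely an assumption for illustration, and `schedule_stats` is a hypothetical helper name. No pricing figures are included, since those depend on the provider's current rates.

```python
def schedule_stats(interval_minutes, run_seconds):
    """Summarize how busy a fixed-interval job keeps its compute."""
    runs_per_day = 24 * 60 // interval_minutes
    busy_seconds_per_day = runs_per_day * run_seconds
    return {
        "runs_per_day": runs_per_day,
        "runs_per_year": runs_per_day * 365,
        "busy_hours_per_year": busy_seconds_per_day * 365 / 3600,
        # Share of each day an always-on scheduler would spend waiting.
        "idle_fraction": 1 - busy_seconds_per_day / 86_400,
    }


# 15-minute schedule from the article; 30 s per run is an assumed duration.
stats = schedule_stats(interval_minutes=15, run_seconds=30)
print(stats)
```

Under those assumptions the job runs 96 times a day (35,040 times a year) and is busy for about 292 hours a year, so an always-on machine would sit idle roughly 97% of the time. With a serverless function, only the busy time is billed.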
DataOps engineers should really consider reviewing the pros and cons of a serverless approach in future data pipeline projects. My approach above uses Google Cloud, but I am certain that the benefits would extend to any of the other major cloud providers with proper design. Oh, did I mention that you have logging built in and the project's code can be kept under source control?
provide the code man. Plxx
Hey Brock, nice write up. It’s cool to read about how the financial benefits of the cloud aren’t just for big companies, anyone could do this!