Scaling and Scraping

In our last post, we discussed how to use Selenium and BeautifulSoup to scrape job skills from LinkedIn. That approach works for maybe 20 jobs... maybe. That's baby stuff. How do we scrape 100 or even 1,000 jobs quickly and reliably? How do we avoid memory and storage issues? How do we store the data in a MySQL database? My friend Adit Jain and I spent the last two weeks figuring out the answers to some of those questions. If you're interested in learning how to build a production-grade process... read on!

At this point, we have a web scraper class that ingests markup from the webdriver and spits out a JSON object containing the unique ID, Company, and Skills of a job. We could make the MySQL connector insert data directly into the respective fields in the database, but that would eat into valuable compute time. As it is, scraping LinkedIn already takes a while because we have to stay under its rate limit and give the page time to load our target assets. Instead, we can use the publisher/subscriber design pattern to manage our scraper.
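For reference, a single scraped record might look something like this (the exact key names here are illustrative, not necessarily what the scraper class from the last post emits):

```python
# Illustrative example of one scraped job record; the keys and values here
# are assumptions used by the rest of the examples in this post.
job_record = {
    "job_id": "3421779012",                          # unique LinkedIn job ID
    "company": "Example Corp",
    "skills": ["Python", "SQL", "Machine Learning"],
}
```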

Before we do that, let's address a long-term behavior problem with LinkedIn. After scraping around 300 jobs, LinkedIn introduces a button that must be clicked to load more jobs. Simply scrolling to the bottom of the page won't cut it anymore. Be sure to add a Selenium check that looks for the presence of a load button and clicks it when it appears. Another problem is that the browser has to be open (not in a minimized state) for some of these interactions to work. You could use a headless browser and configure it properly to solve this problem if you want. Now back to the main content.
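Here's a minimal sketch of that check. The XPath is an assumption; inspect the page and swap in whatever selector LinkedIn is actually serving:

```python
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


def click_load_more_if_present(driver):
    # Look for LinkedIn's "see more jobs" button; the XPath below is a guess
    # and will need updating whenever LinkedIn changes its markup.
    try:
        button = driver.find_element(By.XPATH, "//button[contains(., 'See more jobs')]")
        if button.is_displayed():
            button.click()
    except NoSuchElementException:
        pass  # no button yet, so keep scrolling as before
```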

The goal is to have two processes running simultaneously. The Publisher will scrape data and "publish" it to a directory. The Subscriber will collect the data and insert it into the database. In this way, the two processes are independent, modularized, and scalable. We could develop a proxy server that runs multiple scrapers and multiple connections to the database, which would allow the server to collect data extremely quickly. It would probably need to rotate through multiple IP addresses to work at scale. Another solution is to use cloud computing (Azure, for instance) so LinkedIn can't block a range of IPs. That's a bit out of scope for this project. For now, all we need are three classes: a publisher manager, a subscriber manager, and a directory manager.

First, the Directory class. Using the os library, we can manage a folder (let's call it 'temp') to house our intermediary files. The initialization will create 'temp' if it doesn't already exist. We then need a method to count the number of files in the folder. Using shutil.rmtree(), we can build a destroy method that will remove 'temp' and its contents if necessary. Lastly, we need a push method and a pop method. The push method will open a uniquely named file in 'temp', write a json.dump(), and close the file. The pop method does the opposite: it scans 'temp' for JSON files, and if it finds one, it opens the file, performs a json.load(), closes the file, and os.remove()s it. The contents of the JSON file, a dictionary, are returned.
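A minimal sketch of what that class could look like (the class and method names are mine; the actual implementation may differ):

```python
import json
import os
import shutil
import uuid


class Directory:
    """Manages the 'temp' folder that holds intermediary JSON files."""

    def __init__(self, path="temp"):
        self.path = path
        os.makedirs(self.path, exist_ok=True)   # create 'temp' if it doesn't exist

    def count(self):
        """Number of JSON files currently waiting in 'temp'."""
        return len([f for f in os.listdir(self.path) if f.endswith(".json")])

    def destroy(self):
        """Remove 'temp' and everything in it."""
        shutil.rmtree(self.path, ignore_errors=True)

    def push(self, record):
        """Write one record (a dict or list of dicts) to a uniquely named file."""
        filename = os.path.join(self.path, f"{uuid.uuid4().hex}.json")
        with open(filename, "w") as f:
            json.dump(record, f)

    def pop(self):
        """Read, delete, and return the first available JSON file, or None."""
        for name in os.listdir(self.path):
            if name.endswith(".json"):
                full_path = os.path.join(self.path, name)
                with open(full_path) as f:
                    record = json.load(f)
                os.remove(full_path)
                return record
        return None
```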

Normally, we would write something to start and join our two processes. Because I'm working in Jupyter Notebook, simply having two notebooks running simultaneously takes care of this; however, be aware that a "proper" implementation would require managing your processes and locking the directory manager to prevent multiple processes from accessing the same resource at the same time. We don't need to worry about that for now.
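For the curious, that "proper" setup might look roughly like this (run_publisher and run_subscriber are hypothetical stand-ins for the manager loops described below, and the Lock would guard every access to 'temp'):

```python
from multiprocessing import Process, Lock


def run_publisher(lock):
    ...  # placeholder: drive the Publisher manager, acquiring lock around pushes


def run_subscriber(lock):
    ...  # placeholder: drive the Subscriber manager, acquiring lock around pops


if __name__ == "__main__":
    lock = Lock()
    pub = Process(target=run_publisher, args=(lock,))
    sub = Process(target=run_subscriber, args=(lock,))
    pub.start()
    sub.start()
    pub.join()
    sub.join()
```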

Next, we have to write a Publisher management class. Its task is to start the scraper and manage how the scraper's JSON results get put into 'temp' via the Directory class. It sounds a whole lot more complicated than it is. Basically, there are four main inputs: the number of jobs to scrape, the batch size, the max queue size, and the query. The query is pretty self-explanatory but very important; it's the GET query that we start scraping on. The number of jobs is the total number of jobs we want to scrape from our query. From my own experimentation, it looks like LinkedIn sets a hard cap at 999. The batch size is the number of jobs that get scraped before they are all written together into a single file. Larger batches can cut down on time spent writing files at the cost of holding more in memory; whether that trade-off is worth it depends on what kind of computer/server/device you're using, and it only really matters when scaling up. Finally, the max queue size is how many files can exist in 'temp' at any one time. The way we have this hooked up, our scraper consumes minimal resources because it pauses scraping until there is room in the queue to push more data. Whether you're using this on a Raspberry Pi or an industrial server, it's not a problem.
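Here's one way the Publisher could be sketched out (the scraper object and its scrape_next() method are placeholders for the scraper class from the previous post):

```python
import time


class Publisher:
    """Drives the scraper and pushes batches of results into 'temp'."""

    def __init__(self, scraper, directory, query, n_jobs=999,
                 batch_size=25, max_queue_size=10):
        self.scraper = scraper                  # the scraper from the last post
        self.directory = directory              # instance of the Directory class
        self.query = query                      # the GET query to scrape
        self.n_jobs = n_jobs                    # total jobs to scrape (LinkedIn caps ~999)
        self.batch_size = batch_size            # jobs per file pushed to 'temp'
        self.max_queue_size = max_queue_size    # max files allowed in 'temp'

    def run(self):
        batch, scraped = [], 0
        while scraped < self.n_jobs:
            # Pause scraping while the queue is full so files never pile up.
            while self.directory.count() >= self.max_queue_size:
                time.sleep(5)
            batch.append(self.scraper.scrape_next(self.query))  # hypothetical method
            scraped += 1
            if len(batch) >= self.batch_size:
                self.directory.push(batch)
                batch = []
        if batch:                               # flush any leftover partial batch
            self.directory.push(batch)
```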

Finally, we need to tackle the Subscriber manager. Its task is to wait for files to come into 'temp', pop them out, and send the data to our MySQL server. The listener part is really easy. We can use the methods from the Directory class to pop and read files when they become available. The resulting dictionaries are then passed into a function that handles the MySQL connection with our database.
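A sketch of that listener loop, assuming an insert_jobs() function like the one shown in the next section handles the database side:

```python
import time


class Subscriber:
    """Waits for files in 'temp', pops them, and forwards the records to MySQL."""

    def __init__(self, directory, insert_jobs, poll_interval=5):
        self.directory = directory          # instance of the Directory class
        self.insert_jobs = insert_jobs      # callable that writes records to MySQL
        self.poll_interval = poll_interval  # seconds to wait between checks

    def run(self):
        while True:
            records = self.directory.pop()
            if records is None:
                time.sleep(self.poll_interval)  # nothing to do yet; wait and retry
            else:
                self.insert_jobs(records)       # hand the batch off to the database
```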

MySQL is the database server we chose to use for this project. MySQL is often used interchangeably with SQL, and it should be mentioned that these are two separate things. SQL is simply a query language that acts as a syntactical framework for communicating with relational databases. The data on our MySQL server is hosted on Adit Jain’s hard drive (temporarily). The operating system (OSX) takes care of managing the storage, while SQL statements are what our Subscriber sends to the MySQL server. Using the mysql.connector library in Python let us manage the database largely without having to touch any server-side tools, aside from configuring the database and its security via MySQL Workbench. Essentially, our Subscriber runs SQL INSERTs on each JSON dictionary, and the data is then stored in the database.
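Here's a rough sketch of what that insert function could look like (the connection settings and the table/column names are assumptions; adjust them to match your own schema):

```python
import mysql.connector


def insert_jobs(records):
    """Insert a batch of scraped job dicts into a MySQL table."""
    # Connection details are placeholders; use your own host/user/password/database.
    conn = mysql.connector.connect(
        host="localhost", user="scraper", password="secret", database="linkedin"
    )
    cursor = conn.cursor()
    # INSERT IGNORE assumes job_id has a UNIQUE index, so re-scraped jobs are skipped.
    sql = "INSERT IGNORE INTO jobs (job_id, company, skills) VALUES (%s, %s, %s)"
    for job in records:
        cursor.execute(sql, (job["job_id"], job["company"], ", ".join(job["skills"])))
    conn.commit()
    cursor.close()
    conn.close()
```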

Voila! We have a pretty awesome scraper. As mentioned above, you can only squeeze out a maximum of 999 jobs from a single query. It looks like "Data Scientist in United States" has 15,967 results at the time I'm writing this. The only way to get at all of them is to segment queries by major city. I would Google the top 16 tech hubs and use those locations in my 16 queries to capture most of those results.
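Something as simple as this would cover it (the city list is hypothetical; use whatever hubs your search turns up):

```python
# Hypothetical list of tech hubs; swap in the locations you actually want to cover.
cities = ["San Francisco Bay Area", "New York City", "Seattle", "Austin"]

queries = [f"Data Scientist in {city}" for city in cities]
# Each query then gets its own Publisher run, up to ~999 jobs apiece.
```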

Now... what to do with the data??? Hmm... yeah... no, sorry... until next time. Stay tuned :D

Bill Hefley, Ph.D., COP, CMBE: this is sort of a teaser for the next 1-2 parts. Let's see what we can do with all of this data!
