Getting Data from the Web
Software engineers tend to hate web scraping. But the first rule of Data Science is: if you want to solve a data problem, you need to get the data – so ignore them and do it anyway. In this article I discuss one effective way to get data from the web.
Every project begins with a need, and those needs often arise unexpectedly. One day, a colleague from Marketing brought me his data problem. He had found the information he needed on a website, but the records were spread across many pages and could not be easily downloaded.
The emergence of the ‘Data Science’ role has helped bring focus to data problems by making clear to colleagues whom they can approach for help with their information challenges. I find small problems like this to be great opportunities to learn how different roles in your company consume data and what they need. They are also good occasions to try out new tools you may be interested in. In this case, the problem called for a smart web scraper. There are several options but, ever since a colleague brought it to my attention, I had been looking to try the web parser from Kimono Labs [Note: the Kimono Labs team has been acqui-hired by Palantir and has discontinued their server product, but still offers a desktop version for download. There are also alternatives such as ParseHub]. Kimono provides a Chrome plug-in that captures data from your web browser and turns it into a JSON API or CSV download.
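To give a concrete sense of what you get on the other end, here is a minimal sketch of consuming such a scraper-generated JSON API and flattening it to CSV in Python. The endpoint URL and field names are hypothetical placeholders, not Kimono’s actual output format:

```python
import csv
import requests

API_URL = "https://example.com/api/listings"  # hypothetical endpoint

response = requests.get(API_URL, timeout=30)
response.raise_for_status()
records = response.json()  # assume the API returns a JSON list of record dicts

FIELDS = ["title", "price", "url"]  # hypothetical field names
with open("listings.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow({k: record.get(k, "") for k in FIELDS})
```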
Three problems arose while gathering the data. The first was the need to aggregate results across multiple pages. Sometimes you can simply set the number of results displayed high enough, but in this case we needed to handle pagination. Fortunately, once I pointed it at the “next page” button, Kimono’s parser was able to crawl across pages properly.
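If you are rolling your own scraper instead of using a tool like Kimono, the same “follow the next button” logic looks roughly like this with requests and BeautifulSoup. The URL and CSS selectors are invented for illustration:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/results?page=1"  # hypothetical starting page
titles = []

while url:
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    # Collect one field from each result row (".result .title" is a placeholder).
    titles.extend(el.get_text(strip=True) for el in soup.select(".result .title"))

    # Follow the "next page" control if it exists ("a.next" is a placeholder).
    next_link = soup.select_one("a.next")
    url = urljoin(url, next_link["href"]) if next_link else None

print(f"Collected {len(titles)} rows across all pages")
```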
The next issue was the presence of “master-detail” records, where the search results display an overview and you must click into each record to view its details. Kimono handles this in three steps. First, you create one API for the master records. Then you create a second API that collects the fields from the detail view. Finally, you chain the APIs together, using the output of the master API to drive the detail API. In this way, Kimono was able to extract data from all the records.
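The master-detail pattern is easy to mirror in hand-written code as well: one pass collects the detail-page links, a second pass extracts fields from each. A minimal sketch, again with hypothetical URLs and selectors:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

MASTER_URL = "https://example.com/results"  # hypothetical master page

# Step 1 ("master API"): gather the link to each record's detail page.
master = BeautifulSoup(requests.get(MASTER_URL, timeout=30).text, "html.parser")
detail_urls = [urljoin(MASTER_URL, a["href"])
               for a in master.select("a.record-link")]  # placeholder selector

# Step 2 ("detail API"): visit each link and extract the detail fields.
records = []
for url in detail_urls:
    detail = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    records.append({
        "name": detail.select_one("h1.name").get_text(strip=True),   # placeholder
        "phone": detail.select_one(".phone").get_text(strip=True),   # placeholder
    })
```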
Finally, I came to a problem that Kimono could not handle without some help. It works by extracting data from the DOM – the page’s underlying structure – and occasionally, when you select a field onscreen that is actually composed of other subfields, it can return the wrong result. However, the parser allows you to specify manually, via a bit of CSS, how to identify what you really want. A little web debugging and a manual override fixed the problem. If this proves a stumbling block, it’s the kind of problem the web designer on your team can solve in a few minutes – data problems tend to foster collaboration.
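The fix amounts to writing a more specific selector than the one the point-and-click tool inferred. Here is a self-contained illustration of the idea with BeautifulSoup, on invented HTML:

```python
from bs4 import BeautifulSoup

# Invented HTML: a "price" field that is really composed of two subfields.
html = """
<div class="listing">
  <span class="price"><s>$120</s> <span class="sale">$89</span></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Selecting the composite element returns both subfields mashed together.
print(soup.select_one(".price").get_text(" ", strip=True))   # $120 $89

# A more specific CSS selector isolates the subfield we actually want.
print(soup.select_one(".price .sale").get_text(strip=True))  # $89
```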
So, three small problems later, I had the archive of data in hand; I could hand it off to my grateful coworker, and I had tested a tool that would prove useful on other problems.
Software engineers tend to hate web scraping – mention it as a possible solution in front of a group of them and watch the horrified looks you get. Scraping depends on the visual presentation of the page, and a scraper’s crawl can break whenever a web designer comes through to refresh and reorganize the site’s presentation. Software engineers reflexively prefer a proper API, such as a RESTful JSON API. If a proper API is available, you should use it – but if not, don’t stop there. Data Science lives in the real world, and the real world is messy. Data Scientists have learned some truths from experience:
- The browser is the dominant interface to the web. It is the only interface guaranteed to have the data you want; APIs are often missing some or all of what you need.
- Most data is not that clean, regardless of the source. Data from even a highly structured SQL database usually contains all sorts of problems that require cleanup, despite the best efforts of front-end and back-end engineers to prevent this (users are tremendously clever at entering data in odd ways).
- Even “stable” APIs break regularly.
- Visual presentation of data changes less often than people think.
The first rule of Data Science is this: if you want to solve a data problem, you need to get the data. So as a Data Scientist or Data Engineer, you become skilled at getting the data you need through whatever means necessary. Sooner or later, solving a problem will require web scraping. But the good news is that web-scraping tools have improved greatly in sophistication. If the data you need is only available through a web page, don’t let that stop you – just scrape it and move on.
Note that scraping may violate a website’s terms of use, which may restrict your ability to use the data. Or it may not – check whether the site has a policy, and whether your intended use of the data actually violates it.
I'm a consulting Data Scientist. If you are grappling with Data Science or advanced technology needs and could use some help, let me know at drostcheck [at] leopardllc.com.