Python 'Beautiful Soup 4'... Yummy
Beautiful Soup Module
This is the second article of a Web Scraping series. If you haven't read the first article, which can be found here, just go and read it first.. you'll have fun (I guess ?!!). This article will be about Beautiful Soup 4, a module that makes working with fetched HTML far easier, simpler and more awesome. If you read the first article, you will remember that using requests alone has a huge disadvantage: it hands you the fetched HTML page as one plain string, which is a very tedious thing to work with. That problem is exactly what today's module, Beautiful Soup 4, solves.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations. You should know that Beautiful Soup 3 is no longer being developed, and that Beautiful Soup 4 is recommended for all new projects.
Parsing an HTML file
"Parse" is a fancy word for “read”. So, here we are going to discuss how to read an HTML page. To fetch the HTML, we have to use the “requests” module, and you must also have internet access. In the following example, we are going to read the following URL: http://www.behindthename.com/names
We are going to use this website because it's crawl-able according to its robots.txt file. We are going to talk about robots.txt files and web scraping terminology in another article. But for now, let's see how we are going to crawl this website:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> page = requests.get("http://www.behindthename.com/names")
>>> soup = BeautifulSoup(page.content, 'html.parser')
Now, soup is an object representing the HTML page that we have parsed. Some people suggest using "lxml" instead of "html.parser" because it's a bit faster, so it's worth mentioning.
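Switching parsers is just a matter of changing the second argument. A quick sketch (note that lxml is a third-party package you'd have to install separately, e.g. with pip install lxml):
>>> # html.parser ships with Python, no extra install needed
>>> soup = BeautifulSoup(page.content, 'html.parser')
>>> # lxml is usually faster, but must be installed first
>>> soup = BeautifulSoup(page.content, 'lxml')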
Now, you can use Beautiful Soup's amazing functions to navigate the HTML file:
>>> soup.prettify()
#Returns the whole HTML page as one nicely indented string.
>>> soup.title
#Returns the title tag of the html page
<title>Behind the Name: All Names</title>
>>> soup.title.string
#Returns the text inside the title tag
Behind the Name: All Names
>>> soup.title.parent.name
# Returns the “name” of the parent object of the “title” tag, which is:
head
We should also learn about the tags that exist inside any HTML page, and how to extract them. Note that accessing a tag by name like this returns only the first matching tag in the file:
>>> soup.head
# Returns the head tag as an object.
>>> type(soup.head)
<class 'bs4.element.Tag'>
>>> soup.body
# Returns the <body> tag, which holds nearly all of the page's visible content.
>>> soup.title
# The <title> tag of the HTML page, as we have said before.
>>> soup.h1
# Returns the first <h1> header in the file.
>>> soup.p
# Returns the first paragraph <p> tag in the file.
>>> soup.div
# Returns the first <div>, the generic block container tag.
>>> soup.img
# Returns the first <img> (image) tag in the file.
>>> soup.a
# Returns the first clickable link <a> tag in the file.
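Once you have a tag object, you can also poke at its attributes. A minimal sketch (what the attributes actually contain depends on the page, so treat the comments as illustrative):
>>> link = soup.a
>>> link.name
'a'
>>> link.get('href')
# Returns the link's URL, or None if the tag has no href attribute.
>>> link.attrs
# Returns a dictionary of all the tag's attributes.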
Extracting links
You can easily extract the links inside your HTML file using a few functions. Let’s see how:
>>> soup.a
#Returns the first link tag <a> in the html file.
>>> soup.find_all("a")
#Returns all the link tags in the html file as a list.
>>> soup.a.get('href')
#Returns the URL (the href attribute) of the first link.
We can extract all the URLs found within a page using just two lines:
>>> for link in soup.find_all("a"):
...     print(link.get("href"))
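Many of those hrefs may be relative paths rather than full URLs. If you want absolute URLs, you can join each one against the page's address using the standard library's urljoin; a minimal sketch:
>>> from urllib.parse import urljoin
>>> base = "http://www.behindthename.com/names"
>>> for link in soup.find_all("a"):
...     href = link.get("href")
...     if href is not None:
...         print(urljoin(base, href))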
Extracting using “id” or “class”
But for each page, there are a ton of <a> tags, <p> tags and <div> tags. How are we going to distinguish between them? And how are we going to get the one that we are interested in? This is easy: we can do it using either the "id" attribute or the "class" attribute of these tags.
Here, we are going to extract all the <div> blocks that have a “class” attribute equal to “info”:
>>> names = soup.find_all('div', class_= 'info')
#OR
>>> names = soup.find_all('div', {"class" : 'info'})
I, myself, prefer the second form because it's more general: you can specify any other attribute. So, if you want to extract a <div> block that has an attribute called "role" equal to "banner", we can do that like so:
>>> names = soup.find_all('div', {"role" : 'banner'})
And, of course, the same goes for the “id” attribute.
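For instance, a minimal sketch (the id value "main" below is just a made-up example; substitute whatever id the page really uses):
>>> banner = soup.find('div', {"id" : 'main'})
>>> # Or, since an id should be unique in a page, the keyword shortcut:
>>> banner = soup.find(id='main')
>>> if banner is not None:
...     print(banner.get_text())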
Extracting a table
When you want to extract a <table> from a web page, things are a little different. Let’s extract the table at https://www.behindthename.com/name/aabraham/related using the following snippet:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.behindthename.com/name/aabraham/related"
>>> page = requests.get(url)
>>> page.encoding = 'utf-8'
>>> soup = BeautifulSoup(page.content, 'lxml')
>>> body = soup.find('div', class_= 'body')
>>> related = []
>>> table = body.find('table')
>>> if table is not None:
...     # get the table rows
...     table_rows = table.find_all("tr")
...     for tr in table_rows:
...         # get the table columns
...         td = tr.find_all('td')
...         for i in td:
...             # get the text inside each column
...             related.append(i.text)
...
>>> # print the whole list
>>> print(related)
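One thing to keep in mind: the snippet above flattens every cell into one long list. If you would rather preserve the table's structure, a small variation (same assumptions as above) collects each row as its own sub-list:
>>> rows = []
>>> if table is not None:
...     for tr in table.find_all("tr"):
...         # one sub-list of cell texts per table row
...         rows.append([td.text for td in tr.find_all("td")])
...
>>> print(rows)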
Now, I think you have enough knowledge to start scraping!!! In my opinion, web scraping can be pretty easy with enough practice. I recommend getting your hands dirty by scraping these two websites because:
- They are open to scraping (according to their robots.txt files).
- They have very useful data that can be used for all sorts of purposes.
- I've scraped them myself, which gives you something to check your work against.
These two websites are:
- Behind The Name, which contains about 21,193 names as of the last update on 31 May 2018. My GitHub repo containing my scraper and the extracted data can be found here. I made this scraper about two years ago, so excuse my modest attempt.
- Project Gutenberg, which contains about 56,920 books as of the last update in April 2018. This is my GitHub repo containing my multithreaded scraper, which scraped the whole website in about 5 hours.