A Simplistic Approach to Web Scraping


Have you ever wondered about scraping a website?

Yes, I have. And here is my attempt at scraping data and bringing it down to a tabular format.

Let me share my experience, to help a wider audience achieve that smoothly.


I thought of scraping an apparel website, but did not want to go for Amazon or Flipkart.

So I thought, let me scrape “Koovs.com” (just a wild thought).


Basic steps:

1.      Pick the URL/website you want to scrape.

2.      Make the driver for your chosen browser available on your system. As I am using Chrome, the link to ChromeDriver is:

https://chromedriver.chromium.org/downloads

Place the driver in the folder where your Python code is kept.

3.      Install the “BeautifulSoup” package in your Python environment. You can create a BeautifulSoup object and use its functions like find, find_all, etc. to get the content of a particular HTML tag.

4.      Keep a basic understanding of HTML; it will help you extract the data, which sits in tags embedded in the website's pages.

5.      Use exception handling while extracting attributes, so that if an attribute is not available for a particular item, the resulting “AttributeError” is handled by the code itself.

6.      Build three functions:

a.      get_url => This gets you the URL for the item you are planning to extract info about.

For example, if your search term is “Shoes”, your URL would be: https://www.koovs.com/Shoes

The URL pattern varies from website to website. One needs to search manually on the website to find the exact URL for an item.

b.      extract_record => This helps you extract info like item name, brand, price, etc. using the related HTML tags. One needs to find a unique way of segregating the search records and then embed that in the code to extract the relevant tags.


c.      main => This is the driver function that initiates get_url and extract_record. It carries the loop that iterates over every page to extract each item and its detail information.
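Steps 3–5 can be sketched in just a few lines. The HTML below is a hypothetical listing snippet, and the class names (imageView, infoView, brandName, sizeInfo) are made up for illustration; real sites will differ:

```python
from bs4 import BeautifulSoup

# Hypothetical product-listing snippet; real class names vary per site.
html = """
<li class="imageView">
  <a href="/product/123">Running Shoes</a>
  <div class="infoView">
    <span class="brandName">Nike</span>
    <span class="discountPrice">Rs. 1,999</span>
  </div>
</li>
"""

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('li', 'imageView')           # first tag with this class
print(item.a.text)                            # -> Running Shoes

info = item.find('div', 'infoView')
print(info.find('span', 'brandName').text)    # -> Nike

# Step 5: a missing attribute raises AttributeError (find returns None)
try:
    size = info.find('span', 'sizeInfo').text
except AttributeError:
    size = ''                                 # handled gracefully
print(repr(size))                             # -> ''
```

The same find/find_all pattern, wrapped in try/except, is what the full script below relies on.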


Python Code:

import csv
from bs4 import BeautifulSoup
from selenium import webdriver


def get_url(search_text):
    """Generate the search URL; the trailing {} is later filled with the page number."""
    # Note: the pagination parameter is an assumption; adjust it to the site's actual URL pattern.
    template = 'https://www.koovs.com/{}?page={{}}'
    return template.format(search_text)


def extract_record(item):
    """Extract and return data from a single record"""
    atag = item.a
    description = atag.text.strip()
    url = 'https://www.koovs.com' + atag.get('href')

    # Defaults, in case an attribute is missing for this item
    brandName = productName = discountPrice = ''
    try:
        Product_Info_parent = item.find('div', 'infoView')
        brandName = Product_Info_parent.find('span', 'product_title clip-text brandName').text
        productName = Product_Info_parent.find('span', 'product_title clip-text productName').text
        # product price
        discountPrice = Product_Info_parent.find('span', 'discountPrice').text
    except AttributeError:
        pass  # keep the defaults for whatever was missing

    result = (description, brandName, productName, discountPrice, url)
    print(result)
    return result


def main(search_term):
    """Run main program"""
    # start the webdriver (chromedriver sits next to this script, see step 2)
    driver = webdriver.Chrome(executable_path='chromedriver.exe')

    records = []
    url = get_url(search_term)
    for page in range(1, 3):  # iterate over the first two result pages
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('li', 'imageView')
        for item in results:
            record = extract_record(item)
            if record:
                records.append(record)
    driver.close()  # close the browser only after all pages are done

    # save data to csv file
    with open('Koovs_Scraped_results.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['description', 'brandName', 'productName', 'discountPrice', 'url'])
        writer.writerows(records)


var = input("enter search term ")
main(var)
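The CSV-saving step at the end can also be exercised in isolation. The records below are made-up stand-ins in the same (description, brandName, productName, discountPrice, url) shape that extract_record returns:

```python
import csv

# Made-up records in the shape extract_record returns:
# (description, brandName, productName, discountPrice, url)
records = [
    ('Running Shoes', 'BrandA', 'Road Runner', 'Rs. 1,999', 'https://www.koovs.com/p/1'),
    ('Canvas Shoes', 'BrandB', 'City Walker', 'Rs. 999', 'https://www.koovs.com/p/2'),
]

with open('demo_results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['description', 'brandName', 'productName', 'discountPrice', 'url'])
    writer.writerows(records)

# Read the file back to verify the rows survived the round trip
with open('demo_results.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows[0])        # header row
print(len(rows) - 1)  # -> 2 data rows
```

The newline='' and encoding='utf-8' arguments matter: the first prevents blank lines between rows on Windows, the second keeps non-ASCII characters (e.g. the ₹ symbol) intact.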


Output after scraping:

(Screenshot: the scraped results in tabular form)

Thanks to “Izzy Analytics” (YouTube).

Happy Learning!
