Web Scraping Using Python

Web scraping is the process of extracting data from websites using software. In Python, this is commonly done with libraries such as Requests, BeautifulSoup, Scrapy, and Selenium. Below is an overview of the process and the tools involved.

1. Why Use Web Scraping?

Web scraping is useful for:

  • Gathering data for analysis
  • Automating repetitive tasks such as price tracking
  • Collecting information for machine learning projects

Caution: Always check the website’s robots.txt file to make sure that scraping is allowed, and respect the site’s terms of service.
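
A quick programmatic check is possible with the standard library's urllib.robotparser; a minimal sketch (the URL and user-agent string are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# True if the given user agent is allowed to fetch the page
print(rp.can_fetch("MyScraperBot", "https://example.com/some-page"))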

2. Tools for Web Scraping in Python

a. Requests

The requests library is used to send HTTP requests to a webpage and fetch its content.

import requests

url = "https://example.com"
response = requests.get(url)
print(response.text)

b. BeautifulSoup

The BeautifulSoup library is used to parse HTML or XML documents and extract information.

Installation

pip install beautifulsoup4

Basic Example

from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract specific data
title = soup.find('title').text
print("Page Title:", title)

Common Methods

  • find(): Extracts the first matching tag.
  • find_all(): Extracts all matching tags.
  • .text: Extracts the text within a tag.
  • .get('attribute'): Extracts a specific attribute (like href).
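
A short sketch tying these methods together, using a small hard-coded HTML snippet for illustration:

from bs4 import BeautifulSoup

html = '<div><a href="https://example.com">Example</a><a href="https://example.org">Another</a></div>'
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')       # first matching <a> tag
all_links = soup.find_all('a')    # list of every matching <a> tag
print(first_link.text)            # text inside the tag: "Example"
print(first_link.get('href'))     # attribute value: "https://example.com"
print(len(all_links))             # 2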

c. Scrapy

A powerful framework for advanced web scraping and crawling.

Installation

pip install scrapy

Basic Usage

Scrapy operates by defining a “spider” to scrape data.

scrapy startproject myproject
cd myproject
scrapy genspider example example.com

In the spider file:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'link': item.css('a::attr(href)').get(),
            }

Run the spider:

scrapy crawl example
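
Scrapy can also write the scraped items directly to a file through its feed exports, for example:

scrapy crawl example -o items.json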

d. Selenium

Selenium drives a real browser, which makes it suitable for interacting with JavaScript-heavy websites whose content is rendered dynamically.

Installation

pip install selenium

Setup with a Browser Driver

Selenium needs a browser driver such as ChromeDriver. Recent versions of Selenium (4.6 and later) can download a matching driver automatically; with older setups, download the driver manually and pass its path when creating the browser.

Basic Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# With Selenium 4.6+, webdriver.Chrome() with no arguments also works
driver = webdriver.Chrome(service=Service("path/to/chromedriver"))
driver.get("https://example.com")

# Interact with the webpage
title = driver.title
print("Page Title:", title)

driver.quit()
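
On JavaScript-heavy pages, content often appears only after scripts run, so an explicit wait is usually needed before reading it. A minimal sketch using WebDriverWait (the CSS selector is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for an element matching the placeholder selector to appear
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)
print(element.text)

driver.quit()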

3. Process of Web Scraping

a) Understand the Target Website

  1. Identify the URL.
  2. Inspect the website's structure using the browser's developer tools (Ctrl+Shift+I or F12).

b) Fetch the Content

Use libraries like requests or Selenium to fetch the HTML of the webpage.

c) Parse and Extract Data

Parse the fetched HTML using BeautifulSoup or Scrapy to extract the required data.

d) Store the Data

Save the scraped data into a CSV file, database, or any preferred format.

Example: Save Data to CSV

import csv

data = [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
with open('data.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(data)
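
The same records can go into a database instead; a minimal sketch using the standard library's sqlite3 module (the file and table names are illustrative):

import sqlite3

data = [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]

conn = sqlite3.connect('data.db')
conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people (name, age) VALUES (:name, :age)", data)
conn.commit()
conn.close()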

4. Best Practices

  1. Respect Website Rules:
    • Check robots.txt to see what is allowed.
    • Avoid overloading the server with frequent requests.
  2. User-Agent: Use a custom User-Agent header to mimic browser requests.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)

  3. Error Handling: Handle HTTP errors gracefully.

if response.status_code == 200:
    print("Success")
else:
    print("Failed:", response.status_code)

  4. Rate Limiting: Pause between requests to avoid being blocked.

import time
time.sleep(2)
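
In practice the pause goes between requests in a loop; a minimal sketch (the list of URLs is a placeholder):

import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # ... process the response here ...
    time.sleep(2)  # wait 2 seconds before the next request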

5. Example: Full Web Scraping Workflow

Here’s a complete example of scraping job listings:

from bs4 import BeautifulSoup
import requests
import csv

url = "https://example-job-listings.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

jobs = []
for job in soup.find_all('div', class_='job'):
    title = job.find('h2').text
    link = job.find('a')['href']
    jobs.append({'title': title, 'link': link})

# Save to CSV
with open('jobs.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(jobs)

print("Scraping complete! Data saved to jobs.csv")