Aggregating Articles Using Scrapy in Python

Set Up the Development Environment for Scrapy

Install Python

Make sure you have Python 3 installed. You can download it from the official website, python.org.

To verify the installation, run:

python --version

Create a Virtual Environment

Navigate to your project directory and create a virtual environment.

cd your_project_directory
python -m venv venv

Activate the virtual environment:

On Windows:

venv\Scripts\activate

On Unix or macOS:

source venv/bin/activate

Install Scrapy

With the virtual environment activated, install Scrapy:

pip install scrapy

Verify Scrapy Installation

Run the Scrapy version command:

scrapy version

Set Up Scrapy Project

Create a new Scrapy project in your working directory:

scrapy startproject article_scraper

Navigate to your project folder:

cd article_scraper

Create a Spider

Generate a spider for scraping:

scrapy genspider article_spider example.com
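
This creates spiders/article_spider.py from Scrapy's default template. The generated file looks roughly like this (the exact template varies by Scrapy version):

import scrapy

class ArticleSpiderSpider(scrapy.Spider):
    name = "article_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass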

Verify Directory Structure

Your directory structure should look like this:

article_scraper/
    scrapy.cfg
    article_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            article_spider.py

Your environment is now set up and you are ready to start scraping articles with Scrapy.
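
Before writing any spider code, you can also experiment with selectors interactively using the Scrapy shell (the URL below is only a placeholder):

scrapy shell 'https://example.com'
>>> response.css('h1::text').get()
>>> response.css('a::attr(href)').getall()

Exit the shell with exit() or Ctrl-D when you are done.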

Practical Guide to Scraping Large Volumes of Articles Using Scrapy

Install Scrapy

pip install scrapy

Create a New Scrapy Project

scrapy startproject article_scraper
cd article_scraper

Define the Spider

Create a new spider in article_scraper/spiders/articles.py.

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://example.com/documentation']

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2.title::text').get(),
                'link': article.css('a::attr(href)').get(),
            }

        next_page = response.css('a.next-page::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Configure Item Pipeline

Enable and configure the item pipeline in article_scraper/settings.py.

ITEM_PIPELINES = {
   'article_scraper.pipelines.ArticleScraperPipeline': 300,
}

Create the pipeline class in article_scraper/pipelines.py.

class ArticleScraperPipeline:
    def process_item(self, item, spider):
        return item  # Customize as needed for data processing and saving
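
As one possible customization (a minimal sketch, not part of the generated project), the pipeline could drop items that are missing a title so incomplete records never reach the output:

# pipelines.py (hypothetical variation)
from scrapy.exceptions import DropItem

class ArticleScraperPipeline:
    def process_item(self, item, spider):
        # Discard articles without a title; everything else passes through unchanged
        if not item.get('title'):
            raise DropItem(f"Missing title in {item!r}")
        return item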

Run the Spider

Run the spider to start scraping articles.

scrapy crawl articles -o articles.json

This command will save the scraped data to articles.json.
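
Two feed-export details worth knowing (behaviour of Scrapy 2.0 and later): -o appends to an existing file, which can corrupt a .json array across repeated runs, while -O overwrites it. For repeated runs, JSON Lines is a safer append target:

scrapy crawl articles -O articles.json    # overwrite the file on each run
scrapy crawl articles -o articles.jl      # append newline-delimited JSON safely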

Conclusion

This practical implementation covers installing Scrapy, setting up a basic spider, configuring the item pipeline, and running the spider to scrape articles from a documentation website. Customize as needed for your specific scraping requirements.

Create a New Scrapy Project

scrapy startproject articlescraper

Directory Structure

articlescraper/
    scrapy.cfg
    articlescraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Create a Spider

articlescraper/spiders/articles_spider.py:

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://www.documentationwebsite.com/articles',
    ]

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2.title::text').get(),
                'author': article.css('span.author::text').get(),
                'date': article.css('span.date::text').get(),
                'content': article.css('div.content').get(),
            }

        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Update Settings

articlescraper/settings.py:

# Configure the user-agent to avoid being blocked
USER_AGENT = 'articlescraper (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy
CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website
DOWNLOAD_DELAY = 1

# Enable and configure the AutoThrottle extension
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable and configure HTTP caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
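
The project template also enables robots.txt handling by default; keeping it explicit alongside the settings above makes the crawl's politeness policy easy to audit:

# Respect robots.txt rules of the target site (the default in new projects)
ROBOTSTXT_OBEY = True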

Run the Spider

cd articlescraper
scrapy crawl articles -o articles.json

Output

This will generate a file articles.json containing the scraped articles:

[
    {
        "title": "First Article",
        "author": "Author Name",
        "date": "2023-01-01",
        "content": "Content of the first article"
    },
    ...
]

The implementation provided sets up a Scrapy project called articlescraper, creates a spider to scrape articles, configures necessary settings, and demonstrates running the spider to collect and store data in JSON format.

# Standalone Scrapy script for defining target URLs for scraping articles

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Spider to define target URLs for scraping articles
class DocumentationSpider(CrawlSpider):
    name = 'documentation_spider'
    
    # Start URL
    start_urls = ['https://example-documentation-website.com']

    # Rules for following links
    rules = (
        Rule(LinkExtractor(allow=('/articles/', )), callback='parse_article', follow=True),
    )

    # Method to parse each article page
    def parse_article(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
            'content': response.css('div.article-content').get(),
        }

# Configure and run the spider
process = CrawlerProcess(settings={
    'FEEDS': {
        'articles.json': {'format': 'json'},
    },
})

process.crawl(DocumentationSpider)
process.start()

This script:

Defines a start_urls list with the starting documentation website.
Uses LinkExtractor within a Rule to match URL patterns (allow) that identify article pages.
Extracts and yields article data in the parse_article method.
Configures CrawlerProcess to run the crawl and save the output to articles.json, which means the script can be run directly with Python instead of through scrapy crawl.

# my_spider.py

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        # Add the initial URL(s) to start crawling from
        'http://example.com/documentation'
    ]

    def parse(self, response):
        # Extract article links and follow them
        for article in response.css('a.article-link::attr(href)').getall():
            yield response.follow(article, self.parse_article)

        # Follow pagination links if they exist
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_article(self, response):
        # Extract the necessary information from each article
        yield {
            'title': response.css('h1::text').get(),
            'content': ''.join(response.css('div.article-content ::text').getall())
        }

# Save this file in the spiders/ directory of your Scrapy project

Run the spider using the command below, which will save the scraped data into a JSON file:

scrapy crawl articles -o articles.json

Part 6: Parse and Extract Article Data

This section will provide a concise implementation to parse and extract article data using Scrapy.

Step 1: Define the Item Structure

Create an items.py file in your Scrapy project and define the fields for the article data you want to extract.

# items.py
import scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publication_date = scrapy.Field()
    content = scrapy.Field()
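
Optionally, an ItemLoader can populate ArticleItem while normalizing values, for example stripping stray whitespace. This is a sketch that assumes the myproject package and the CSS selectors used in the spider below:

# loaders.py (optional helper)
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

from myproject.items import ArticleItem

class ArticleLoader(ItemLoader):
    default_item_class = ArticleItem
    # Keep the first matched value and strip surrounding whitespace from text fields
    default_output_processor = TakeFirst()
    title_in = MapCompose(str.strip)
    author_in = MapCompose(str.strip)

Inside parse_article you would then build the item with loader = ArticleLoader(response=response), add fields with loader.add_css('title', 'h1.article-title::text'), and finish with yield loader.load_item().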

Step 2: Parse and Extract Data in the Spider

Update your spider to parse the article pages and extract the desired data.

# my_spider.py
import scrapy
from myproject.items import ArticleItem

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com/']  # Specify the target URL

    def parse(self, response):
        # Follow links to article pages
        for href in response.css('a.article-link::attr(href)'):
            yield response.follow(href, self.parse_article)

        # Follow pagination links
        for href in response.css('a.next-page::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_article(self, response):
        item = ArticleItem()
        item['title'] = response.css('h1.article-title::text').get()
        item['author'] = response.css('span.author-name::text').get()
        item['publication_date'] = response.css('time.pub-date::attr(datetime)').get()
        item['content'] = response.css('div.article-content').getall()
        yield item

Step 3: Pipeline for Storing Extracted Data

To store the extracted data, update the pipelines.py file.

# pipelines.py
import json

class JsonWriterPipeline:

    def open_spider(self, spider):
        self.file = open('articles.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
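
Note that this pipeline writes one JSON object per line (JSON Lines) rather than a single JSON array, so read the file back line by line, for example:

import json

with open('articles.json') as f:
    articles = [json.loads(line) for line in f if line.strip()]
print(len(articles), "articles loaded")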

Step 4: Enable Pipelines and Configure Settings

Update the settings.py file in your Scrapy project to enable the pipeline.

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}

Conclusion

This implementation allows you to parse and extract article data using Scrapy. The spider is configured to follow article links and pagination links, extract essential details, and store the data in a JSON file.

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://example.com/articles/page/1',
    ]

    def parse(self, response):
        # Extract articles on the page
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'author': article.css('span.author::text').get(),
                'date': article.css('time::attr(datetime)').get(),
                'content': article.css('div.content').get(),
            }
        
        # Find the link to the next page
        next_page = response.css('a.next_page::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Add this Spider class to the spiders directory of your Scrapy project and run it with scrapy crawl articles. The spider starts from the first page and follows the “next” link until no further “next” link is found; articles are extracted from each page and stored according to the defined fields (one way to configure where the output goes is sketched below).
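
If you prefer configuring the output in settings rather than with the -o flag, the FEEDS setting (available in Scrapy 2.1 and later) is one option; the file name here is just an example:

# settings.py
FEEDS = {
    'articles.json': {'format': 'json'},
}
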
# spiders/articles_spider.py

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'http://example.com/page1',
        'http://example.com/page2',
        # Add all the target URLs or handle pagination in the parse function.
    ]

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2.title::text').get(),
                'author': article.css('span.author::text').get(),
                'publication_date': article.css('span.pub_date::text').get(),
                'content': article.css('div.content::text').get(),
            }

        # Handle pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# pipelines.py

import csv
import json

class JsonWriterPipeline:

    def open_spider(self, spider):
        self.file = open('articles.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

class CsvWriterPipeline:

    def open_spider(self, spider):
        self.file = open('articles.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(['title', 'author', 'publication_date', 'content'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['title'], item['author'], item['publication_date'], item['content']])
        return item

# settings.py

# Enable Item Pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
    'myproject.pipelines.CsvWriterPipeline': 400,
}

# main.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.articles_spider import ArticlesSpider

# Load the project settings so the pipelines enabled above actually run,
# then override the user agent for this crawl.
settings = get_project_settings()
settings.set('USER_AGENT', 'my-crawler (http://example.com)')

process = CrawlerProcess(settings)
process.crawl(ArticlesSpider)
process.start()  # run with: python main.py (from the project root)

A Practical Guide to Scraping Large Volumes of Articles from a Documentation Website Using Scrapy in Python

Part 9: Handle Errors and Exceptions

Implementing robust error and exception handling in your Scrapy project ensures that your spider is resilient and can handle unexpected situations gracefully. Below is the practical implementation to handle errors and exceptions in your existing Scrapy project.

Edit the Spider to Handle Exceptions

Open your my_spider.py (replace my_spider with your actual spider’s name) and modify it to include error handling. Scrapy spiders already expose a self.logger attribute, so no extra logging setup is required inside the spider.

import scrapy
from scrapy import Request

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'http://example.com/start-url'
    ]

    def parse(self, response):
        try:
            # Attempt to parse the necessary data
            articles = response.css('.article')
            for article in articles:
                try:
                    title = article.css('h2::text').get().strip()
                    url = article.css('a::attr(href)').get()
                    yield {
                        'title': title,
                        'url': response.urljoin(url)
                    }
                except AttributeError as e:
                    # .get() returned None because a selector matched nothing
                    self.logger.error(f"Failed to parse article: {e}")

            # Handle pagination and route request-level failures to the errback
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield Request(response.urljoin(next_page), callback=self.parse, errback=self.errback)
        except Exception as e:
            self.logger.error(f"Error parsing response: {e}")

    def errback(self, failure):
        # Called for requests that fail at the download level (DNS errors, timeouts, bad HTTP codes)
        self.logger.error(f"Request failed: {failure}")

Add Retry Middleware in settings.py

Modify settings.py to configure Scrapy's built-in RetryMiddleware, which handles temporary network issues.

# settings.py

# Enable Retry Middleware
RETRY_ENABLED = True
RETRY_TIMES = 3  # Number of times to retry failed requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # HTTP codes to retry
DOWNLOAD_TIMEOUT = 15  # Timeout for each request

Enable Logging in settings.py

Enhance logging to track down issues easily.

# settings.py

LOG_ENABLED = True
LOG_LEVEL = 'ERROR'  # Change to 'DEBUG' for more detailed logs
LOG_FILE = 'scraping_errors.log'  # Save logs to a file

Global Exception Handling in the Project

Add a custom spider middleware for global error handling. Open your project's middlewares.py (the Scrapy project template already creates this file) and add the following:

# middlewares.py

from logging import getLogger

class HandleAllExceptionsMiddleware:
    def __init__(self):
        self.logger = getLogger('HandleAllExceptionsMiddleware')

    def process_spider_input(self, response, spider):
        return None

    def process_spider_exception(self, response, exception, spider):
        self.logger.error(f"Unhandled exception: {exception}")
        return None

Then update settings.py to enable the new middleware (replace my_project with your project's package name):

# settings.py

SPIDER_MIDDLEWARES = {
    'my_project.middlewares.HandleAllExceptionsMiddleware': 543,
}

This setup ensures that all exceptions are logged appropriately, allowing you to monitor and fix issues as they arise in your web scraping operations.
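
The errback defined above only fires for requests that explicitly reference it. To cover the initial requests generated from start_urls as well, you can override start_requests; this is a small sketch, not part of the original spider:

# Add inside the ArticleSpider class shown above
def start_requests(self):
    # Attach the errback to the initial requests generated from start_urls,
    # so download failures on the start URLs go through the same error logging
    for url in self.start_urls:
        yield Request(url, callback=self.parse, errback=self.errback)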

Run and Test the Scrapy Spider

Step 1: Command to Run the Spider

From your command line, navigate to your Scrapy project directory and execute:

$ scrapy crawl <spider_name>

Replace <spider_name> with the actual name of your spider (the name attribute defined in its class).

Step 2: Sample Code for a Scrapy Spider (Assuming you have one defined)

Navigate to your spiders directory and ensure you have a spider file (e.g., articles_spider.py).
Here is a minimalistic example of how it might look:

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2::text').get(),
                'content': article.css('div.content::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Step 3: Check Output in JSON or CSV (Based on Your Configuration)

If you want to store the output in a file, you can run:

$ scrapy crawl articles -o output.json

or

$ scrapy crawl articles -o output.csv

Step 4: Verify the Output

Open the output.json or output.csv file in any text editor or spreadsheet application to ensure the data is scraped correctly.
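
For a quick programmatic check (a small sketch assuming the JSON export from the command above), load the file and confirm that items were scraped and carry the expected keys:

import json

with open('output.json') as f:
    items = json.load(f)  # the json feed format produces a single JSON array

assert len(items) > 0, "No items were scraped"
assert all('title' in item for item in items), "Some items are missing a title"
print(f"Scraped {len(items)} items")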

Step 5: Automated Testing (Optional)

Create a test script. In the project root (the directory containing scrapy.cfg), create a file test_spider.py:

import unittest
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class TestSpider(unittest.TestCase):

    def setUp(self):
        self.process = CrawlerProcess(get_project_settings())

    def test_spider(self):
        self.process.crawl('articles')
        self.process.start()
        # Here you can add checks to validate the output file

if __name__ == "__main__":
    unittest.main()

Run the test using:

$ python test_spider.py

Conclusion

Following these steps, you will be able to run and test your Scrapy spider as part of your project to scrape large volumes of articles from a documentation website.
