Set Up the Development Environment for Scrapy
Install Python
Make sure you have Python installed. You can download it from the official website.
To verify the installation, run:
python --version
Create a Virtual Environment
Navigate to your project directory and create a virtual environment.
cd your_project_directory
python -m venv venv
Activate the virtual environment:
On Windows:
venv\Scripts\activate
On Unix or MacOS:
source venv/bin/activate
Install Scrapy
With the virtual environment activated, install Scrapy:
pip install scrapy
Verify Scrapy Installation
Run the Scrapy version command:
scrapy version
Set Up Scrapy Project
Create a new Scrapy project in your working directory:
scrapy startproject article_scraper
Navigate to your project folder:
cd article_scraper
Create a Spider
Generate a spider for scraping:
scrapy genspider article_spider example.com
Verify Directory Structure
Your directory structure should look like this:
article_scraper/
    scrapy.cfg
    article_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            article_spider.py
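The generated article_spider.py contains a minimal spider skeleton. It should look roughly like this (the exact template varies slightly between Scrapy versions):
import scrapy

class ArticleSpiderSpider(scrapy.Spider):
    name = "article_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass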
Your environment is now set up and you are ready to start scraping articles with Scrapy.
Practical Guide to Scraping Large Volumes of Articles Using Scrapy
Install Scrapy
pip install scrapy
Create a New Scrapy Project
scrapy startproject article_scraper
cd article_scraper
Define the Spider
Create a new spider in article_scraper/spiders/articles.py:
import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://example.com/documentation']

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2.title::text').get(),
                'link': article.css('a::attr(href)').get(),
            }
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Configure Item Pipeline
Enable and configure the item pipeline in article_scraper/settings.py:
ITEM_PIPELINES = {
    'article_scraper.pipelines.ArticleScraperPipeline': 300,
}
Create the pipeline class in article_scraper/pipelines.py:
class ArticleScraperPipeline:
    def process_item(self, item, spider):
        return item  # Customize as needed for data processing and saving
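As one possible customization (a sketch, not part of the original pipeline), the pipeline could validate items and drop any that come back without a title:
# article_scraper/pipelines.py -- illustrative variant
from scrapy.exceptions import DropItem

class ArticleScraperPipeline:
    def process_item(self, item, spider):
        # Discard items that were scraped without a title
        if not item.get('title'):
            raise DropItem("Missing title in item")
        return item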
Run the Spider
Run the spider to start scraping articles.
scrapy crawl articles -o articles.json
This command will save the scraped data to articles.json.
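As an alternative (a sketch that assumes Scrapy 2.1 or newer), the same export can be configured once in article_scraper/settings.py via the FEEDS setting, so every run writes to articles.json without the -o flag:
# article_scraper/settings.py
FEEDS = {
    'articles.json': {'format': 'json'},
}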
Conclusion
This practical implementation covers installing Scrapy, setting up a basic spider, configuring the item pipeline, and running the spider to scrape articles from a documentation website. Customize as needed for your specific scraping requirements.
Create a New Scrapy Project
scrapy startproject articlescraper
Directory Structure
articlescraper/
    scrapy.cfg
    articlescraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Create a Spider
articlescraper/spiders/articles_spider.py:
import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://www.documentationwebsite.com/articles',
    ]

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2.title::text').get(),
                'author': article.css('span.author::text').get(),
                'date': article.css('span.date::text').get(),
                'content': article.css('div.content').get(),
            }
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Update Settings
articlescraper/settings.py:
# Configure the user-agent to avoid being blocked
USER_AGENT = 'articlescraper (+http://www.yourdomain.com)'
# Configure maximum concurrent requests performed by Scrapy
CONCURRENT_REQUESTS = 16
# Configure a delay for requests for the same website
DOWNLOAD_DELAY = 1
# Enable and configure the AutoThrottle extension
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable and configure HTTP caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run the Spider
cd articlescraper
scrapy crawl articles -o articles.json
Output
This will generate a file articles.json containing the scraped articles:
[
    {
        "title": "First Article",
        "author": "Author Name",
        "date": "2023-01-01",
        "content": "Content of the first article"
    },
    ...
]
The implementation provided sets up a Scrapy project called articlescraper, creates a spider to scrape articles, configures necessary settings, and demonstrates running the spider to collect and store data in JSON format.
# Standalone Scrapy script that defines the target URLs for scraping articles
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Spider that defines the target URLs for scraping articles
class DocumentationSpider(CrawlSpider):
    name = 'documentation_spider'

    # Start URL
    start_urls = ['https://example-documentation-website.com']

    # Rules for following links
    rules = (
        Rule(LinkExtractor(allow=('/articles/',)), callback='parse_article', follow=True),
    )

    # Method to parse each article page
    def parse_article(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
            'content': response.css('div.article-content').get(),
        }

# Configure and run the spider
process = CrawlerProcess(settings={
    'FEEDS': {
        'articles.json': {'format': 'json'},
    },
})
process.crawl(DocumentationSpider)
process.start()
The script defines a start_urls list with the starting documentation website, uses a LinkExtractor inside a Rule to match specific URL patterns (allow) that identify article pages, implements parse_article to extract and yield article data, and uses CrawlerProcess to start the crawling process and save the output to articles.json.
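Because the script drives the crawl itself through CrawlerProcess, it is run directly with Python rather than with scrapy crawl. Assuming it is saved as, say, crawl_docs.py (an illustrative filename, not one from the guide), run:
python crawl_docs.py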
# my_spider.py
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        # Add the initial URL(s) to start crawling from
        'http://example.com/documentation'
    ]

    def parse(self, response):
        # Extract article links and follow them
        for article in response.css('a.article-link::attr(href)').getall():
            yield response.follow(article, self.parse_article)
        # Follow pagination links if they exist
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_article(self, response):
        # Extract the necessary information from each article
        yield {
            'title': response.css('h1::text').get(),
            'content': ''.join(response.css('div.article-content ::text').getall())
        }

# Save this file in the spiders/ directory of your Scrapy project
# To run the spider, use the following command:
# scrapy crawl articles -o articles.json
# The scraped data will be stored in articles.json
Run the spider using the command below, which will save the scraped data into a JSON file:
scrapy crawl articles -o articles.json
Part 6: Parse and Extract Article Data
This section will provide a concise implementation to parse and extract article data using Scrapy.
Step 1: Define the Item Structure
Create an items.py file in your Scrapy project and define the fields for the article data you want to extract.
# items.py
import scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publication_date = scrapy.Field()
    content = scrapy.Field()
Step 2: Parse and Extract Data in the Spider
Update your spider to parse the article pages and extract the desired data.
# my_spider.py
import scrapy
from myproject.items import ArticleItem

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com/']  # Specify the target URL

    def parse(self, response):
        # Follow links to article pages
        for href in response.css('a.article-link::attr(href)'):
            yield response.follow(href, self.parse_article)
        # Follow pagination links
        for href in response.css('a.next-page::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_article(self, response):
        item = ArticleItem()
        item['title'] = response.css('h1.article-title::text').get()
        item['author'] = response.css('span.author-name::text').get()
        item['publication_date'] = response.css('time.pub-date::attr(datetime)').get()
        item['content'] = response.css('div.article-content').getall()
        yield item
Step 3: Pipeline for Storing Extracted Data
To store the extracted data, update the pipelines.py file.
# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('articles.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Step 4: Enable Pipelines and Configure Settings
Update the settings.py file in your Scrapy project to enable the pipeline.
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}
Conclusion
This implementation allows you to parse and extract article data using Scrapy. The spider is configured to follow article links and pagination links, extract essential details, and store the data in a JSON file.
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://example.com/articles/page/1',
    ]

    def parse(self, response):
        # Extract articles on the page
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'author': article.css('span.author::text').get(),
                'date': article.css('time::attr(datetime)').get(),
                'content': article.css('div.content').get(),
            }
        # Find the link to the next page
        next_page = response.css('a.next_page::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Place this spider in the spiders directory in your Scrapy project, then use scrapy crawl articles to run the spider.
# spiders/articles_spider.py
# Only scrapy is needed here; JSON/CSV output is handled by the pipelines below.
import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'http://example.com/page1',
        'http://example.com/page2',
        # Add all the target URLs or handle pagination in the parse function.
    ]

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2.title::text').get(),
                'author': article.css('span.author::text').get(),
                'publication_date': article.css('span.pub_date::text').get(),
                'content': article.css('div.content::text').get(),
            }
        # Handle pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
# pipelines.py
import csv
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('articles.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

class CsvWriterPipeline:
    def open_spider(self, spider):
        self.file = open('articles.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(['title', 'author', 'publication_date', 'content'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['title'], item['author'], item['publication_date'], item['content']])
        return item
# settings.py
# Enable Item Pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
    'myproject.pipelines.CsvWriterPipeline': 400,
}
# main.py
from scrapy.crawler import CrawlerProcess
from myproject.spiders.articles_spider import ArticlesSpider

process = CrawlerProcess({
    'USER_AGENT': 'my-crawler (http://example.com)'
})
process.crawl(ArticlesSpider)
process.start()
A Practical Guide to Scraping Large Volumes of Articles from a Documentation Website Using Scrapy in Python
Part 9: Handle Errors and Exceptions
Implementing robust error and exception handling in your Scrapy project ensures that your spider is resilient and can handle unexpected situations gracefully. Below is the practical implementation to handle errors and exceptions in your existing Scrapy project.
Edit the Spider to Handle Exceptions
Open your my_spider.py (replace my_spider with your actual spider’s name) and modify it to include error handling.
import scrapy
from scrapy import Request

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'http://example.com/start-url'
    ]

    # scrapy.Spider already provides self.logger, so no extra logger setup is needed
    def parse(self, response):
        try:
            # Attempt to parse the necessary data
            articles = response.css('.article')
            for article in articles:
                try:
                    title = article.css('h2::text').get().strip()
                    url = article.css('a::attr(href)').get()
                    yield {
                        'title': title,
                        'url': response.urljoin(url)
                    }
                except AttributeError as e:
                    # Raised when a selector returns None (e.g. a missing title)
                    self.logger.error(f"Failed to parse article: {e}")

            # Handle pagination; pass the errback so failed requests are logged
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield Request(response.urljoin(next_page), callback=self.parse, errback=self.errback)
        except Exception as e:
            self.logger.error(f"Error parsing response: {e}")

    def errback(self, failure):
        self.logger.error(f"Request failed: {failure}")
Add Retry Middleware in settings.py
Modify settings.py to include and configure the Retry Middleware to handle temporary network issues.
# settings.py
# Enable Retry Middleware
RETRY_ENABLED = True
RETRY_TIMES = 3 # Number of times to retry failed requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429] # HTTP codes to retry
DOWNLOAD_TIMEOUT = 15 # Timeout for each request
Enable Logging in settings.py
Enhance logging to track down issues easily.
# settings.py
LOG_ENABLED = True
LOG_LEVEL = 'ERROR' # Change to 'DEBUG' for more detailed logs
LOG_FILE = 'scraping_errors.log' # Save logs to a file
Global Exception Handling in the Project
Add a custom middleware for global error handling. Open the project's middlewares.py (scrapy startproject creates one; create the file if it does not exist) and add the following:
# middlewares.py
from logging import getLogger

class HandleAllExceptionsMiddleware:
    def __init__(self):
        self.logger = getLogger('HandleAllExceptionsMiddleware')

    def process_spider_input(self, response, spider):
        return None

    def process_spider_exception(self, response, exception, spider):
        self.logger.error(f"Unhandled exception: {exception}")
        return None

# Update settings.py to include the new middleware
# settings.py
SPIDER_MIDDLEWARES = {
    'my_project.middlewares.HandleAllExceptionsMiddleware': 543,
}
This setup ensures that all exceptions are logged appropriately, allowing you to monitor and fix issues as they arise in your web scraping operations.
Run and Test the Scrapy Spider
Step 1: Command to Run the Spider
From your command line, navigate to your Scrapy project directory and execute:
$ scrapy crawl <spider_name>
Replace <spider_name> with the actual name of your spider.
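If you are unsure of the name, the spiders registered in the project can be listed with:
$ scrapy list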
Step 2: Sample Code for a Scrapy Spider (Assuming you have one defined)
Navigate to your spiders directory and ensure you have a spider file (e.g., articles_spider.py).
Here is a minimalistic example of how it might look:
import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for article in response.css('div.article'):
            yield {
                'title': article.css('h2::text').get(),
                'content': article.css('div.content::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Step 3: Check Output in JSON or CSV (Based on Your Configuration)
If you want to store the output in a file, you can run:
$ scrapy crawl articles -o output.json
or
$ scrapy crawl articles -o output.csv
Step 4: Verify the Output
Open the output.json or output.csv file in any text editor or spreadsheet application to ensure the data is scraped correctly.
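For a quick programmatic check (a minimal sketch that assumes the spider was run with -o output.json, which produces a JSON array), load the file and inspect a few records:
import json

with open('output.json') as f:
    articles = json.load(f)

print(f"Scraped {len(articles)} articles")
if articles:
    print("First title:", articles[0].get('title'))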
Step 5: Automated Testing (Optional)
Create a test script. In the same directory, create a file test_spider.py:
import unittest
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class TestSpider(unittest.TestCase):
    def setUp(self):
        self.process = CrawlerProcess(get_project_settings())

    def test_spider(self):
        self.process.crawl('articles')
        self.process.start()
        # Here you can add checks to validate the output file

if __name__ == "__main__":
    unittest.main()
Run the test using:
$ python test_spider.py
Conclusion
Following these steps, you will be able to run and test your Scrapy spider as part of your project to scrape large volumes of articles from a documentation website.