Use Raspberry PI as your personal web crawler with Python and Scrapy
Last Updated on 24th March 2023 by peppe8o
A web crawler (also known as spider or spiderbot) is an internet bot that continually browses web pages, typically for web indexing purposes.
Typically, search engines use web crawling to scan the web and stay aware of contents, links and relations between websites. This data is processed to understand which results better fit users' queries.
Crawlers consume resources on the systems they visit. For this reason, public sites not wishing to be crawled can make this known to the crawling agent via a file named robots.txt placed under their root URL.
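For reference, a minimal robots.txt looks like the following sketch (the path is just illustrative). Well-behaved crawlers read it before fetching other pages, and Scrapy itself can honour it through its ROBOTSTXT_OBEY setting:

# applies to all crawling agents
User-agent: *
# asks crawlers not to fetch anything under /private/
Disallow: /private/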
Crawlers are also used by some websites to update their own web content and stay aligned with target sources.
In this article I’ll show you how to create and configure a simple spiderbot (which crawls the peppe8o.com home page posts) with a tiny computer like the Raspberry PI. I’ll use a Raspberry PI Zero W, but this also applies to newer Raspberry PI boards.
What Is Scrapy
From Scrapy official website:
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
Ref. https://docs.scrapy.org/en/latest/intro/overview.html
What We Need
As usual, I suggest adding all the needed hardware to your favourite e-commerce shopping cart from the start, so that at the end you will be able to evaluate the overall costs and decide whether to continue with the project or remove the items from the cart. The hardware will be only:
- Raspberry PI Zero W (including a proper power supply or a smartphone micro USB charger with at least 3A) or newer Raspberry PI board
- high speed micro SD card (at least 16 GB, at least class 10)
Step-By-Step Procedure
This procedure will guide you through installing Raspberry PI OS Lite, installing the required packages and finally setting up Scrapy via Python PIP.
Install OS
Please start from the OS installation, following the Install Raspberry PI OS Lite guide.
At the end, please be sure that your system is up to date:
sudo apt update
sudo apt upgrade
Install Scrapy
Raspberry PI OS Lite already includes Python 3, so you don’t need any specific setup to have Python working.
Instead, we need to install some required packages. From terminal:
sudo apt install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
As you can see, we also have the python3-pip package in the previous command, so we can now install Scrapy with pip. Please note that, differently from a standard Python package installation, you need to use “sudo” to make the scrapy command available directly from the terminal.
sudo pip3 install scrapy
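As a side note, if you prefer to avoid “sudo”, a user-level installation should also work, as long as ~/.local/bin (the folder where pip places user-installed scripts) is in your PATH. A minimal sketch of this alternative:

pip3 install --user scrapy
# make the scrapy command reachable from the terminal
export PATH="$HOME/.local/bin:$PATH"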
During the Scrapy installation, some dependency version errors may occur. At the date of this article, an error comes with the cryptography version:
pyopenssl 19.1.0 has requirement cryptography>=2.8, but you'll have cryptography 2.6.1 which is incompatible.
To solve this kind of error, simply install the required version of the reported package. For example, I will fix my error with the following command:
sudo pip3 install cryptography==2.8
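You can then double-check which version has actually been installed (an optional verification step):

# prints, among other details, the installed version number
pip3 show cryptography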
Finally, check that your Scrapy installation is OK:
pi@raspberrypi:~ $ scrapy version
Scrapy 2.1.0
Create Your First Spiderbot
Scrapy is not complicated, but it requires a bit of study on the Scrapy tutorial pages. It can be run from an interactive shell (with the command “scrapy shell http://url…”). This method is usually the best way to identify the tags you want to extract inside pages.
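For example, a quick interactive session against the home page crawled in this article could look like the following sketch (the selector is the same one used in the spider below, and the returned text depends on the live page):

scrapy shell https://peppe8o.com
>>> response.css('h2.entry-title a::text').get()
>>> exit()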
Another good practice is visiting the URL you want to crawl with dev tools (for example, in Chrome -> Options -> Tools -> Developer Tools). Then, analyze the web page elements and identify the ones you want to extract.
Once you have identified your targets, you can build a simple crawler to run manually or prepare a complete crawling project. Let’s try an example of a single standalone crawler, manually launched, which extracts post titles, summaries and dates from the peppe8o.com home page.
You can either download my script “myspider.py” to your Raspberry PI from my download area:
wget https://peppe8o.com/download/python/myspider.py
Or manually create the spider configuration file:
nano myspider.py
Insert the following code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "peppe8o"
    start_urls = [
        'https://peppe8o.com',
    ]

    def parse(self, response):
        # each article element inside the post wrapper is a home page post
        for posts in response.css('div.post-wrapper article'):
            yield {
                'Title': posts.css('h2.entry-title a::text').get(),
                'Description': posts.css('div.post-content p::text').get(),
                'Date': posts.css('time::text').get(),
            }
Run this spider by simply typing, from the terminal:
scrapy runspider myspider.py -o peppe8o.json
The final “-o peppe8o.json” option writes the results to an output file named peppe8o.json.
When the crawler ends its tasks, you will find the new file (created if it doesn’t already exist) and you can see its content:
nano peppe8o.json
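The file content will look similar to the following (titles, descriptions and dates here are purely illustrative placeholders):

[
{"Title": "Some post title", "Description": "A short summary of the post...", "Date": "24 March 2023"},
{"Title": "Another post title", "Description": "Another short summary...", "Date": "20 March 2023"}
]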
Each run will append the newly downloaded records to this file.
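Please note that appending to the same .json file makes it invalid JSON as a whole (you would get several concatenated arrays). Starting from Scrapy 2.0 you can use the capital “-O” option instead, which overwrites the output file at each run:

scrapy runspider myspider.py -O peppe8o.json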
Enjoy!