Use Raspberry PI as your personal web crawler with Python and Scrapy

Last Updated on 24th March 2023 by peppe8o

A web crawler (also known as a spider or spiderbot) is an internet bot that continually browses web pages, typically for web indexing purposes.
Search engines typically use web crawling to scan the web and keep track of contents, links and relations between websites. These data are then processed to understand which results best fit users’ queries.

Crawlers consume resources on the systems they visit. For this reason, public sites not wishing to be crawled can make this known to the crawling agent via a file named robots.txt placed under their root URL.
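
As a reference, a minimal robots.txt looks like the following (the path below is purely illustrative):

User-agent: *
Disallow: /private/

Scrapy can be told to honour these rules through its ROBOTSTXT_OBEY setting, which is enabled by default in projects generated with “scrapy startproject”.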

Crawlers are also used by some websites to update their own web content and stay aligned with target sources.

In this article I’ll show you how to create and configure a simple spiderbot (which crawls peppe8o.com home page posts) with a tiny computer like a Raspberry PI. I’ll use a Raspberry PI Zero W, but this also applies to newer Raspberry PI boards.

What Is Scrapy

From the Scrapy official website:


Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Ref. https://docs.scrapy.org/en/latest/intro/overview.html

What We Need

Raspberry PI Zero WH board

As usual, I suggest adding all the needed hardware to your favourite e-commerce shopping cart now, so that at the end you will be able to evaluate the overall costs and decide whether to continue with the project or remove the items from your shopping cart. So, the hardware will be only:

Raspberry PI Zero W board (or a newer Raspberry PI board)
micro SD card
Raspberry PI power supply

Step-By-Step Procedure

This procedure will guide you through installing Raspberry PI OS Lite, installing the required packages and finally setting up Scrapy via Python PIP.

Install OS

Please start from the OS installation, following the Install Raspberry PI OS Lite guide.

At the end, please be sure that your system is up to date:

sudo apt update
sudo apt upgrade

Install Scrapy

Raspberry PI OS Lite already includes Python 3, so you don’t need any specific setup to have Python working.
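
You can verify this from the terminal:

python3 --version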

However, we need to install some required packages. From the terminal:

sudo apt install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

As you can see, the previous command also installs the python3-pip package, so we can now install Scrapy with pip. Please note that, differently from a standard Python package installation, you need to use “sudo” to make the scrapy command available directly from the terminal.

sudo pip3 install scrapy

During the Scrapy installation, some dependency version errors may occur. At the date of this article, an error comes with the cryptography version:

pyopenssl 19.1.0 has requirement cryptography>=2.8, but you'll have cryptography 2.6.1 which is incompatible.

To solve this kind of error, simply install the required version of the package notified. For example, I will fix my error with the following command:

sudo pip3 install cryptography==2.8

Finally, check that your Scrapy installation is OK:

pi@raspberrypi:~ $ scrapy version
Scrapy 2.1.0

Create Your First Spiderbot

Scrapy is not complicated, but it requires a bit of study on the Scrapy tutorial pages. It can be run from an interactive shell (with the command “scrapy shell http://url…”). This method is usually the best way to identify the tags you want to extract from pages.
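
As a minimal sketch, a shell session to test a selector could look like the following (the CSS selector is the one used later in this article’s spider, while the returned title is just an illustrative placeholder):

scrapy shell https://peppe8o.com
>>> response.css('h2.entry-title a::text').get()
'A post title from the home page'
>>> exit()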

Another good practice is visiting the URL you want to crawl with your browser’s dev tools (for example, in Chrome -> Options -> Tools -> Developer Tools). Then analyze the web page elements and identify the ones you want to extract.
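
For reference, the simplified HTML structure that the spider below expects on the peppe8o.com home page looks roughly like this (element contents are illustrative):

<div class="post-wrapper">
  <article>
    <h2 class="entry-title"><a href="...">Post title</a></h2>
    <time>Post date</time>
    <div class="post-content"><p>Post summary</p></div>
  </article>
</div>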

Once you have identified your targets, you can build a simple crawler to run manually or prepare a complete crawling project. Let’s try an example of a single standalone crawler, manually launched, which extracts post titles, summaries and dates from the peppe8o.com home page.

You can either download my script “myspider.py” to your Raspberry PI from my download area:

wget https://peppe8o.com/download/python/myspider.py

Or manually create the spider script:

nano myspider.py

Insert the following code:

import scrapy

class QuotesSpider(scrapy.Spider):
  # Spider name, used by Scrapy to identify this crawler
  name = "peppe8o"
  # List of URLs where the crawl starts
  start_urls = [
    'https://peppe8o.com',
  ]

  def parse(self, response):
    # Loop over each post block found in the home page
    for posts in response.css('div.post-wrapper article'):
      # Extract title, summary and date from each post via CSS selectors
      yield {
        'Title': posts.css('h2.entry-title a::text').get(),
        'Description': posts.css('div.post-content p::text').get(),
        'Date': posts.css('time::text').get(),
      }

Run this spider by simply typing the following from the terminal:

scrapy runspider myspider.py -o peppe8o.json

The final “-o peppe8o.json” option writes the results to an output file named peppe8o.json.
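
Scrapy infers the export format from the file extension, so the same spider can also produce, for example, a CSV file:

scrapy runspider myspider.py -o peppe8o.csv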

When the crawler ends its tasks, you will find the new file (created if not already existing) and you can see its content:

nano peppe8o.json

Each run will append the newly downloaded records to this file.
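
As a reference, the file will contain one JSON object per post, similar to the following (values are illustrative placeholders):

[
{"Title": "A post title", "Description": "The post summary...", "Date": "The post date"},
{"Title": "Another post title", "Description": "Another summary...", "Date": "Another date"}
]

Also note that, starting from Scrapy 2.1, you can use “-O” (capital O) instead of “-o” to overwrite the output file on each run instead of appending to it.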

Enjoy!
