Making a spider 🕷

Spiders are a big part of why everything on the internet is so easy to find. Here's how to make one.

October 10, 2020
2 minute read

Not an arachnid, a web crawler (otherwise known as a spider). Spiders are a big part of why everything on the internet is so easy to find. A spider visits a website, scans it for links, opens those links, scans those pages for more links, and so on. It stores all of this data in a database and makes it accessible to you. Google gets all of its data from its own spider, called Googlebot. It might sound complicated, but it's surprisingly easy: thanks to a Python framework called Scrapy, it takes only a few lines of code to scrape information off a website and keep on crawling.
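
The core loop is simple enough that you can sketch it with nothing but Python's standard library. Here's a rough, bare-bones version just to show the idea (no error handling, no politeness delays, no robots.txt, and the 20-page cap is arbitrary); Scrapy does all of this for you, and does it far better:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(start_url, max_pages=20):
    to_visit = deque([start_url])  # pages waiting to be fetched
    seen = {start_url}             # pages already queued, so we never loop
    fetched = 0
    while to_visit and fetched < max_pages:
        url = to_visit.popleft()
        body = urlopen(url).read().decode('utf-8', errors='ignore')
        fetched += 1
        print(url)                 # a real spider would store the page here
        parser = LinkParser()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith(start_url) and absolute not in seen:
                seen.add(absolute)         # stay on the same site, skip repeats
                to_visit.append(absolute)


crawl('https://ismaeelakram.com/')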

To start off, install Scrapy from PyPI with the following command: pip install scrapy. Then, create a new Scrapy project with scrapy startproject coolspider.
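Depending on your Scrapy version, the generated project layout should look roughly like this:

coolspider/
    scrapy.cfg
    coolspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

After the project is made, go to coolspider/coolspider/spiders/ and make a new Python file called coolspider.py. It should look like this: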

import scrapy

class CoolSpider(scrapy.Spider):
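    # name is what you pass to "scrapy crawl", allowed_domains keeps the spider
    # from wandering off to other sites, and start_urls is where the crawl begins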
    name = 'coolspider'
    allowed_domains = ['ismaeelakram.com']
    start_urls = ['https://ismaeelakram.com/']
    
    def parse(self, response):
        print(response.body)

This is the simplest possible spider. All it does is take each URL out of start_urls and print the body of the response. To run it, run scrapy crawl coolspider, where coolspider is the name attribute you set inside the class (not the filename). If you run this, you'll notice that it only prints the body of one page; it doesn't crawl. To make it crawl, we have to look for links in the response body and then parse those pages too. You can accomplish this by changing the spider to:

import scrapy


class CoolSpider(scrapy.Spider):
    name = 'coolspider'
    allowed_domains = ['ismaeelakram.com']
    start_urls = ['https://ismaeelakram.com/']

    def parse(self, response):
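        # write down the URL of every page we visit, then follow each link on it;
        # response.follow() resolves relative URLs and sends every new page
        # back through this same parse() method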
        with open('urls.txt', 'a') as f:
            f.write(f"{response.url}\n")
        for href in response.css('a::attr(href)'):
            yield response.follow(href, callback=self.parse)

With this new code, the spider begins at https://ismaeelakram.com/ and writes the URL down in a text file called urls.txt. Then, for every link on that page, it continues the same pattern. If you run it now, it will take a little longer, but the text file will be created. The contents of my urls.txt look like this:

https://ismaeelakram.com/
https://ismaeelakram.com/will-crypto-currencies-take-over-the-world/
https://ismaeelakram.com/walmarts-robot-rage/
https://ismaeelakram.com/the-man-unbroken-by-solitary/
https://ismaeelakram.com/intels-fear-of-missing-out/
https://ismaeelakram.com/bigger-upwards-not-outwards/
https://ismaeelakram.com/why-i-like-to-host-my-website-on-aws-servers/
https://ismaeelakram.com/
https://ismaeelakram.com/about/
https://ismaeelakram.com/august-view-smart-doorbell/
https://ismaeelakram.com/the-corbyn-question/
https://ismaeelakram.com/how-creepy-is-your-smart-speaker/
https://ismaeelakram.com/googles-flutter-and-dart/

The crawler is a success! Now, instead of writing the URL to urls.txt, you can do whatever you want with the information on each page: add it to a database, look for keywords, and more. You can also try crawling bigger and more comprehensive websites. The limits are endless with Scrapy, so try building everything you can.
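
For example, here's a hedged sketch of the same spider, reworked so that parse() yields each page's URL and title as an item instead of writing to a text file (the title selector is just a generic lookup of the HTML <title> tag; swap in whatever data you actually care about):

import scrapy


class CoolSpider(scrapy.Spider):
    name = 'coolspider'
    allowed_domains = ['ismaeelakram.com']
    start_urls = ['https://ismaeelakram.com/']

    def parse(self, response):
        # yield a dict of scraped data; Scrapy collects these items
        # and can export them for you
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # keep crawling exactly as before
        for href in response.css('a::attr(href)'):
            yield response.follow(href, callback=self.parse)

If you run it with scrapy crawl coolspider -o pages.json, Scrapy should export every item to a JSON file, which is usually more useful than a flat text file. And if you point it at a much bigger site, running it with a setting like scrapy crawl coolspider -s DEPTH_LIMIT=2 keeps the crawl from wandering too deep.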
