Building an image crawler using Python and Scrapy
Have you ever needed to pull data from a website that doesn’t provide an API? In that case, you can pull the data straight out of the HTML. This tutorial will teach you how to scrape websites so that you can get the data you want from third-party websites without using APIs.
Scrapy is an open source web scraping and crawling framework written in Python. We’ll learn how to use Scrapy to crawl and scrape websites.
You should be comfortable writing code in Python. You should also know how to use regular expressions (regex). A great tutorial for learning regex can be found on Regexone.
You need the following tools:
- Python
- Scrapy
Once you have installed both Python and Scrapy, make sure they are on your PATH environment variable. Here is a detailed installation guide for both Python and Scrapy.
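One quick way to confirm that both tools are reachable is to look them up on PATH from Python itself. The sketch below uses only the standard library. The helper name `missing_tools` is made up for illustration, and the executable names are assumptions (on some systems Python is installed as `python3`).

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # "python" and "scrapy" are the command names assumed by this tutorial.
    missing = missing_tools(["python", "scrapy"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Both tools are available.")
```

If either name is reported as missing, revisit the installation step before continuing.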
Creating a project
Once you’ve set up the above tools, you are ready to dive into creating a crawler. Let’s start by creating a Scrapy project. Fire up your terminal and enter:
$ scrapy startproject imagecrawler
This will create a directory for you with the following structure.
    imagecrawler/
        scrapy.cfg            # deploy configuration file
        imagecrawler/         # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py
Building your first spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the response to extract data.
Create a new file called pexels_scraper.py in the spiders folder with the following content:
    import scrapy


    class PexelsScraper(scrapy.Spider):
        name = "pexels"

        def start_requests(self):
            url = "https://www.pexels.com/"
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            # Print the URL and the raw body of the response
            print(response.url, response.body)
Let’s look at what the names in the code above mean:
name: identifies the Spider. You’ll use this name to start crawling.
start_requests(): returns an iterable of Requests that’ll get executed.
parse(): parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
To run the code we wrote above, open your terminal, cd to the imagecrawler directory and enter the following command:
$ scrapy crawl pexels
This will start the crawler and print the URL and the body of the response it got back. Then the crawler will stop, because we haven’t yet told it how to follow the links it encounters on a page. We’ll look at that in the next section.
Recursively crawling the website
Now that we’ve set up the project, let’s look at the website we’ll scrape: Pexels, a website that provides high-quality, completely free stock photos. Pexels has an API, but it is limited to 200 requests per hour. We’ll crawl the website for images, the URL of the page we found each image on, and the tags associated with it.
Let’s go to Pexels.com and open an image. Let us first examine the URL structure that Pexels uses for each image. It is of the form: https://www.pexels.com/photo/asphalt-blur-clouds-dawn-392010/
All links containing photos have the following in common:
- They start with https://www.pexels.com/photo/
- They have an id at the end of the link: 392010 in the example above
We’ll extract all links from the pages we visit. Then we’ll filter out every link that does not match the given prefix. We’ll use the id to keep track of the links we’ve already visited so that we do not crawl the same pages repeatedly.
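The filter-and-remember idea can be tried on its own before it goes into a spider. The following is a minimal sketch using only the standard library. The helper names `photo_id` and `new_photo_links` are made up for illustration, and the non-photo link in the sample list is hypothetical.

```python
import re

# Prefix shared by photo pages, and the numeric id that ends the URL path.
PHOTO_PREFIX = re.compile(r"^https://www\.pexels\.com/photo/")
PHOTO_ID = re.compile(r"(\d+)/?$")


def photo_id(url):
    """Return the numeric id at the end of a photo URL, or None."""
    if not PHOTO_PREFIX.match(url):
        return None
    match = PHOTO_ID.search(url)
    return int(match.group(1)) if match else None


def new_photo_links(links, seen_ids):
    """Keep photo links whose id is not in `seen_ids`, recording new ids."""
    fresh = []
    for url in links:
        pid = photo_id(url)
        if pid is None or pid in seen_ids:
            continue
        seen_ids.add(pid)
        fresh.append(url)
    return fresh


seen = set()
links = [
    "https://www.pexels.com/photo/asphalt-blur-clouds-dawn-392010/",
    "https://www.pexels.com/photo/asphalt-blur-clouds-dawn-392010/",  # duplicate
    "https://www.pexels.com/license/",  # hypothetical non-photo link
]
print(new_photo_links(links, seen))
# ['https://www.pexels.com/photo/asphalt-blur-clouds-dawn-392010/']
```

Calling `new_photo_links(links, seen)` a second time returns an empty list, because the id is already in the set.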
First, import the three modules we’ll need for the above tasks: re, LinkExtractor and Selector. Then we need a regex URL matcher that’ll match the common prefix. We also need a function to extract image ids from URLs. Modify the PexelsScraper class so that it looks like the following:
    import re

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.selector import Selector


    class PexelsScraper(scrapy.Spider):
        name = "pexels"

        # Define the regex we'll need to filter the returned links
        url_matcher = re.compile(r'^https://www\.pexels\.com/photo/')

        # Create a set that'll keep track of ids we've crawled
        crawled_ids = set()

        def start_requests(self):
            url = "https://www.pexels.com/"
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            body = Selector(text=response.text)
            link_extractor = LinkExtractor(allow=PexelsScraper.url_matcher)
            next_links = [link.url for link in link_extractor.extract_links(response)
                          if not self.is_extracted(link.url)]

            # Crawl the filtered links
            for link in next_links:
                yield scrapy.Request(link, self.parse)

        def is_extracted(self, url):
            # Image urls are of type: https://www.pexels.com/photo/asphalt-blur-clouds-dawn-392010/
            image_id = int(url.split('/')[-2].split('-')[-1])
            if image_id not in PexelsScraper.crawled_ids:
                PexelsScraper.crawled_ids.add(image_id)
                return False
            return True
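Because the spider now follows every photo link it finds, it can issue a lot of requests in a short time. Below is a hedged sketch of lines you might add to the project’s settings.py to keep the crawl polite and bounded. The setting names are standard Scrapy settings, but the values are illustrative assumptions, not recommendations from Pexels.

```python
# settings.py (fragment): the values below are illustrative assumptions

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True

# Stop the spider after this many responses, handy while experimenting.
CLOSESPIDER_PAGECOUNT = 100
```

A small page limit is convenient while testing, since a recursive crawl otherwise keeps going until it runs out of new links.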
Looking at the website structure
We are now getting only the pages we wanted to get. Next, we want to get the image URLs and the associated tags for the images. For this, we need to take a look at how the HTML pages for images look. Go to an image page on Pexels, right click on the image and click Inspect Element. You’ll see something like this:
We can see that the img tag has a class image-section__image. We’ll use this information to extract this tag. The URL of the image is in the src attribute and the tags we need are in the alt attribute. Now let us modify our PexelsScraper class to extract these things and print them out to the console.
For this we’ll create 2 regex patterns that’ll extract the src and alt attributes for us. Then we’ll use the css method in the Selector class to extract img tags with class image-section__image. Finally we’ll extract the URL and tags and print them to the screen.
Add the following variables to the PexelsScraper class:
    src_extractor = re.compile('src="([^"]*)"')
    tags_extractor = re.compile('alt="([^"]*)"')
Now modify the parse method such that it prints the required URL and tags:
    def parse(self, response):
        body = Selector(text=response.text)
        # body.css().extract() returns a list which might be empty
        images = body.css('img.image-section__image').extract()
        for image in images:
            src = PexelsScraper.src_extractor.search(image)
            alt = PexelsScraper.tags_extractor.search(image)
            if src is None:
                continue
            img_url = src.group(1)
            # Split the alt text into lowercase words, dropping commas
            alt_text = alt.group(1) if alt else ''
            tags = [tag.replace(',', '').lower() for tag in alt_text.split()]
            print(img_url, tags)

        link_extractor = LinkExtractor(allow=PexelsScraper.url_matcher)
        next_links = [link.url for link in link_extractor.extract_links(response)
                      if not self.is_extracted(link.url)]

        # Crawl the filtered links
        for link in next_links:
            yield scrapy.Request(link, self.parse)
The body.css('img.image-section__image').extract() call gives us all the img tags with class image-section__image in a list.
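To see what the two regex patterns do to one of those list entries, here is a small standalone sketch. The helper name `parse_img_tag` and the sample tag are hypothetical. The sketch also assumes that the serialized HTML may contain escaped characters such as `&amp;`, so it unescapes the attribute values.

```python
import html
import re

src_extractor = re.compile(r'src="([^"]*)"')
tags_extractor = re.compile(r'alt="([^"]*)"')


def parse_img_tag(tag_html):
    """Return (image_url, tags) pulled from one serialized <img> tag."""
    src = src_extractor.search(tag_html)
    alt = tags_extractor.search(tag_html)
    # Serialized HTML may escape "&" as "&amp;", so unescape the values.
    url = html.unescape(src.group(1)) if src else None
    alt_text = html.unescape(alt.group(1)) if alt else ""
    tags = [word.replace(",", "").lower() for word in alt_text.split()]
    return url, tags


# Hypothetical tag, shaped like the markup described above.
sample = (
    '<img class="image-section__image" '
    'src="https://images.pexels.com/photos/132894/pexels-photo-132894.jpeg?w=940&amp;h=650" '
    'alt="Red and Grey Fish on Ice">'
)
print(parse_img_tag(sample))
```

Using search instead of findall returns only the first match, which is enough here because each entry holds a single img tag.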
Now you can run the spider and test it out! Open your terminal and enter the following:
$ scrapy crawl pexels
You’ll get output similar to the following:
    https://images.pexels.com/photos/132894/pexels-photo-132894.jpeg?w=940&h=650&auto=compress&cs=tinysrgb ['red', 'and', 'grey', 'fish', 'on', 'ice']
    https://images.pexels.com/photos/343812/pexels-photo-343812.jpeg?w=940&h=650&auto=compress&cs=tinysrgb ['cuisine', 'delicious', 'diet']
    https://images.pexels.com/photos/274595/pexels-photo-274595.jpeg?w=940&h=650&auto=compress&cs=tinysrgb ['adult', 'beautiful', 'beauty']
    https://images.pexels.com/photos/230824/pexels-photo-230824.jpeg?w=940&h=650&auto=compress&cs=tinysrgb ['orange', 'squash', 'beside', 'stainless', 'steel', 'bowl']
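Printing is fine for a first test, but a spider can also hand its results back to Scrapy as dicts. The sketch below shows one possible shape for such a dict. The helper name `build_item`, the key names and the sample page URL are assumptions made for illustration.

```python
def build_item(page_url, img_url, tags):
    """Bundle one scraped image into a plain dict that parse() could yield."""
    return {
        "page_url": page_url,   # page the image was found on
        "image_url": img_url,   # value of the src attribute
        "tags": list(tags),     # cleaned words from the alt attribute
    }


item = build_item(
    "https://www.pexels.com/photo/example-132894/",  # hypothetical page URL
    "https://images.pexels.com/photos/132894/pexels-photo-132894.jpeg",
    ["red", "and", "grey", "fish", "on", "ice"],
)
print(item["tags"])
```

Inside parse, yielding `build_item(response.url, img_url, tags)` instead of printing would pass the dict to Scrapy, and running `scrapy crawl pexels -o images.json` would then write the yielded items to a JSON file.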
So in around 50 lines of code, we were able to get a web crawler (which scrapes a website for images) up and running. This was just a tiny example of what you can do with a web crawler. Whole businesses run on web scraping. For example, most product price comparison websites use crawlers to get their data.
Now that you have the basic knowledge of how to build a crawler, go and try building your own crawler!