While it’s all well and good to know how to retrieve pages, parse them, and build a crude web crawler, there’s no reason to reinvent the wheel when there are plenty of open-source libraries that package all of this up for you. The best-known Python library for scraping is Scrapy.
In this post, I’ll use Scrapy to scrape the American Gem Society, the same site we scraped in the last post.
What is Scrapy?
Scrapy is a modular framework for crawling and parsing the web. “Modular” here means that each component of Scrapy is not just customizable, but can be re-written on its own and plugged back in. By far, the best way to learn about the library is via the official documentation. (What we cover here is but a fraction of the feature set offered by Scrapy.)
We develop with Scrapy by building “spiders” – bits of code that crawl through the web. A spider defines what we want to scrape and in some cases how we want to scrape it. As spiders traverse the web, they hand back “items” – the data we have extracted from the web. Of course, this is a gross simplification. Scrapy has support for cookie management, downloading different types of files, multi-server clustering, throttling, server proxies, etc…
Installing Scrapy
In theory, installing Scrapy is as easy as pip install Scrapy – but there are all sorts of things that can go wrong, so I defer to the official install guide. Once Scrapy is installed, you should be able to run scrapy at your command line. Running scrapy startproject ags will create the folder structure and basic files you need for a working spider. In this case, ags will be the root folder. Head over to the spiders folder inside the project package (ags/ags/spiders) and create a new file, ags_spider.py.
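For reference, the generated skeleton looks roughly like this (exact files vary a bit across Scrapy versions):
ags/
    scrapy.cfg              # deploy configuration
    ags/                    # the project's Python package
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ags_spider.py   # <-- the file we're about to create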
from scrapy.spiders.crawl import CrawlSpider

class AmericanGemSpider(CrawlSpider):
    name = 'ags'
    allowed_domains = ['www.americangemsociety.org']
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']
Let’s break this down. We have defined a new spider, AmericanGemSpider, that extends from Scrapy’s CrawlSpider. The CrawlSpider class provides some basic functionality for crawling web pages – if you want to crawl the web, use a CrawlSpider. We’ve given our spider a name, ags – every spider needs a name. We’ve then told the spider the domains it’s allowed to crawl, and the page from which to start crawling.
The allowed_domains field restricts which domains our spider is allowed to crawl. Many sites spread pages across subdomains, so we need to account for that (listing a bare domain also permits its subdomains); you might also want a spider that crawls over several different sites entirely.
start_urls is where our spider will start crawling. We’re specifying just one page, but this can be a long list of pages. Our first step is pulling out the link for each state; we could just list those state links explicitly here, but it’s far less typing to give just the main URL.
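As a purely hypothetical illustration (the domains and URLs below are made up, and none of this is needed for AGS), a spider roaming a couple of related sites and seeded from several pages might declare:
# Hypothetical example: two allowed domains (listing a bare domain also
# permits its subdomains, e.g. shop.example.com) and multiple seed pages.
allowed_domains = ['example.com', 'example.org']
start_urls = [
    'https://www.example.com/directory',
    'https://www.example.org/members/page-1',
    'https://www.example.org/members/page-2',
]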
Defining Spider Rules
The next step is to define some rules for how we find links to traverse, and what we do when we find those links. Remember, with AGS, we want to find the state-specific link for each state on the main page; on each state’s page, we want to find links for each jeweler’s page; and then when we get to the jeweler pages, we want to extract some data. We accomplished identifying links with XPath in the last post, and we can re-use the same XPath here:
state_link_xpath = '//area'
jeweler_link_xpath = '//a[@class="jeweler__link"]'
With Scrapy, we need to wrap this XPath inside “link extractors” which provide the instrumentation for converting XPath into traversable links in Scrapy. Then each of those link extractors is wrapped in a rule, which tells Scrapy what to do with the pages retrieved for each traversed link.
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule, CrawlSpider

class AmericanGemSpider(CrawlSpider):
    name = 'ags'
    allowed_domains = ['www.americangemsociety.org']
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']

    rules = [
        Rule(LxmlLinkExtractor(restrict_xpaths='//area')),
        Rule(LxmlLinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'),
             callback='parse_jeweler_page')
    ]
We’ve assigned a list of Rule objects to our spider’s rules field. Whenever our spider downloads a page, it will use our rules to find links. On the starting page, it will find each of the state links with the first rule. Scrapy will download each of those pages, and then run the rules over those newly downloaded pages. The XPath for the first rule doesn’t identify any links on those state pages, as expected; the XPath for the second rule does – it finds all of the jeweler-specific links.
With this second rule, we’ve specified a callback function. When Scrapy downloads the pages for each of the jeweler-specific links, it will hand the response to a new function we’ve defined, parse_jeweler_page, where we can analyze the HTML and extract the data. If we didn’t specify a callback here, then the default behavior would be to use the rules to find links on the page, and continue traversing. Neither of our rules would find any links on the jeweler pages, so that would effectively terminate the crawl. Given we’ve specified a callback, we get access to the HTML responses, and can use XPath to extract the data.
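As an aside, Rule also accepts a follow flag: follow defaults to True when no callback is given and False when one is. If we wanted to both extract data from the jeweler pages and keep applying the rules to any links found on them, a hypothetical variant of the rules list would look like this (not needed for AGS, since the jeweler pages have no further links we care about):
rules = [
    Rule(LxmlLinkExtractor(restrict_xpaths='//area')),
    # Hypothetical variant: extract data AND keep following the rules on
    # jeweler pages. With a callback set, follow defaults to False, so it
    # has to be turned on explicitly.
    Rule(LxmlLinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'),
         callback='parse_jeweler_page',
         follow=True),
]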
Parsing the Page
Here is the parse_jeweler_page method where we convert the HTML into a dictionary of extracted data:
def parse_jeweler_page(self, response):
    yield {
        f: response.xpath(s).extract_first() for f, s in {
            'name': '//h1[@class="page__heading"]/text()',
            'grading': '//p[@class="appraiser__grading"]/strong/text()',
            'certified': '//p[@class="appraiser__certified"]/following-sibling::ul//li/text()',
            'address': '//p[@class="appraiser__hours"]/text()',
            'phone': '//p[@class="appraiser__phone"]/text()',
            'website': '//p[@class="appraiser__website"]/a/@href'
        }.items()
    }
Rather than returning data, this method uses yield to hand execution back to the caller. In this example, we are returning one object per response, so there’s no real distinction between using yield versus return. In other cases, when there might be multiple objects returned, or if the callback is both extracting data and generating new links to crawl, yield makes more sense – it allows the Scrapy engine to work on the returned information while the spider continues executing.
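To make that concrete, here’s a hypothetical callback (the listing__row selectors and the pagination link are invented for the sake of the sketch) that yields several items from one response and also schedules another request:
import scrapy

def parse_listing_page(self, response):
    # Hypothetical sketch: emit one item per row on a listing page...
    for row in response.xpath('//div[@class="listing__row"]'):
        yield {'name': row.xpath('.//h2/text()').extract_first()}

    # ...and also schedule the next page for the same callback.
    next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page),
                             callback=self.parse_listing_page)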
Here we are yielding a dictionary comprehension. The dictionary returned is a transformation of the inner, hard-coded dictionary, which maps field names to the XPath needed to extract each piece of information. The response passed to our callback is a wrapped response object that exposes an xpath method, so we can simply call that with each of our XPaths and pull out the data we want. This method yields data that looks like:
{'website': u'http://www.jcjewelers.com', 'name': u'J.C. Jewelers', 'grading': u'Offers AGS Labs Diamond Grading Reports', 'phone': u' (307) 733-5933', 'address': u' 132 N Cache St, Jackson, WY 83001-8681', 'certified': u'Jan Case, CGA'}
The Complete Example
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule, CrawlSpider

class AmericanGemSpider(CrawlSpider):
    name = 'ags'
    allowed_domains = ['www.americangemsociety.org']
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']

    rules = [
        Rule(LxmlLinkExtractor(restrict_xpaths='//area')),
        Rule(LxmlLinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'),
             callback='parse_jeweler_page')
    ]

    def parse_jeweler_page(self, response):
        yield {
            f: response.xpath(s).extract_first() for f, s in {
                'name': '//h1[@class="page__heading"]/text()',
                'grading': '//p[@class="appraiser__grading"]/strong/text()',
                'certified': '//p[@class="appraiser__certified"]/following-sibling::ul//li/text()',
                'address': '//p[@class="appraiser__hours"]/text()',
                'phone': '//p[@class="appraiser__phone"]/text()',
                'website': '//p[@class="appraiser__website"]/a/@href'
            }.items()
        }
That’s a pretty small amount of code to extract a whole lot of data. You can run the spider by executing scrapy crawl ags – here’s sample output (mine comes from a larger multi-spider project, so the spider name and settings in the log differ from this example):
(scraper)Craigs-MBP:scraper craigperler$ scrapy crawl americangem2
2016-11-04 14:10:09 [scrapy] INFO: Scrapy 1.0.5 started (bot: boot)
2016-11-04 14:10:09 [scrapy] INFO: Optional features available: ssl, http11
2016-11-04 14:10:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'boot.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['thuzio.spider', 'crunchbase.spider', 'leafly.spider', 'shopify.spider', 'florida_bar.spider', 'indiegogo.spider', 'uktariff.spider', 'giaalumni.spider', 'americangem.spider', 'australiantpb.spider', 'walmart.spider', 'aidn.spider', 'reddit.spider', 'capropertysearch.spider', 'pokemongomap.spider', 'selfstoragefinders.spider', 'yelp.spider', 'catholicdirectory.spider', 'americangem.flat_spider'], 'BOT_NAME': 'boot', 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87', 'DOWNLOAD_DELAY': 0.5}
2016-11-04 14:10:09 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-11-04 14:10:09 [scrapy] INFO: Enabled downloader middlewares: CrackDistilMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-11-04 14:10:09 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-11-04 14:10:09 [scrapy] INFO: Enabled item pipelines:
2016-11-04 14:10:09 [scrapy] INFO: Spider opened
2016-11-04 14:10:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-04 14:10:09 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-04 14:10:10 [scrapy] DEBUG: Crawled (200) <GET https://www.americangemsociety.org/en/find-a-jeweler> (referer: None)
2016-11-04 14:10:12 [scrapy] DEBUG: Crawled (200) <GET https://www.americangemsociety.org/en/nevada-jewelers> (referer: https://www.americangemsociety.org/en/find-a-jeweler)
2016-11-04 14:10:14 [scrapy] DEBUG: Crawled (200) <GET https://www.americangemsociety.org/en/wyoming-jewelers> (referer: https://www.americangemsociety.org/en/find-a-jeweler)
2016-11-04 14:10:15 [scrapy] DEBUG: Crawled (200) <GET https://www.americangemsociety.org/en/500194> (referer: https://www.americangemsociety.org/en/nevada-jewelers)
2016-11-04 14:10:15 [scrapy] DEBUG: Scraped from <200 https://www.americangemsociety.org/en/500194>
{'website': u'http://www.tbirdjewels.com', 'name': u'T-Bird Jewels', 'grading': u'Offers AGS Labs Diamond Grading Reports', 'phone': u' (702) 256-3900', 'address': u' 1990 Village Center CirSte P6, Las Vegas, NV 89134-6242', 'certified': u'Jenny O Calleri, CGA'}
2016-11-04 14:10:19 [scrapy] DEBUG: Crawled (200) <GET https://www.americangemsociety.org/en/j-c-jewelers> (referer: https://www.americangemsociety.org/en/wyoming-jewelers)
2016-11-04 14:10:19 [scrapy] DEBUG: Scraped from <200 https://www.americangemsociety.org/en/j-c-jewelers>
{'website': u'http://www.jcjewelers.com', 'name': u'J.C. Jewelers', 'grading': u'Offers AGS Labs Diamond Grading Reports', 'phone': u' (307) 733-5933', 'address': u' 132 N Cache St, Jackson, WY 83001-8681', 'certified': u'Jan Case, CGA'}
...
Once the spider is running, we can see it crawl the start_urls page, start traversing the state-specific pages, and, as it finds the jeweler links, grab the responses, parse them, and spit out the data. In this example, the data goes nowhere; in practice, we could use Scrapy’s item pipeline to persist the objects, or do whatever we want with them.
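A minimal sketch of such a pipeline, assuming we just want to dump each item to a JSON-lines file (JewelerWriterPipeline and jewelers.jl are names I’ve made up), might look like the following; it would also need to be registered under ITEM_PIPELINES in settings.py. For quick experiments, scrapy crawl ags -o jewelers.json gets you much the same result with no extra code.
import json

class JewelerWriterPipeline(object):
    """Hypothetical pipeline: append each scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        self.outfile = open('jewelers.jl', 'w')

    def close_spider(self, spider):
        self.outfile.close()

    def process_item(self, item, spider):
        self.outfile.write(json.dumps(dict(item)) + '\n')
        return item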
There’s no reason not to use Scrapy. Whether you’re working on a simple project (such as scraping AGS) or something massive, Scrapy provides a set of tools you don’t need to write yourself. And if you need something very customized or specific, you can still start with Scrapy and replace just the components that need replacing.