In this post, we’re going to put the web scraping techniques covered in the first two posts of this series (#1 and #2) to work on a case study: scraping contact details for all of the jewelers certified by the American Gem Society. The AGS is an organization that “[helps] protect the jewelry-buying public from fraud and false advertising.” The AGS makes no secret of providing access to their member lists – they want to promote their jeweler members. Consequently, it’s quite easy to search for AGS members by zip code, state, or even name. It’s not as easy, however, to collate a full list of all members, which is what we’ll be doing in this exercise.
Now, web scraping AGS may seem like a random place to start. It would be, except that we’re starting here because the site is easily scrape-able and someone on Upwork was willing to front up to $250 for this data set. Why not try to make a buck while learning this stuff?
Robots.txt
Before we dive into it, one quick tangent on ethics. As noted in my post On Scraping, web scraping is often against a site’s terms of use. Even if the data to be scraped is not going to be used, analyzed, or sold, simply the act of scraping could violate your implicit contract with the web site. There are two things we need to check first: the public Terms of Use for the site, and the site’s robots.txt file.
Skimming over https://www.americangemsociety.org/en/, I don’t see any sign of a Terms of Use. Ordinarily, if a site is providing you some free service (a la Facebook or Twitter), or is e-commerce in any way, there’s a Terms of Use, and usually there’s a link to it in the footer.
AGS does, however, have a robots.txt file, copied here:
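User-agent: *
Disallow: /admin/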
The robots file is a list of rules that the site requests we abide by. The list is intended for scrapers and bots – any bit of code that is programmatically accessing the site. Google, for example, has a bot that crawls the web, indexing everything to make Search work for us. The robots file is not enforceable, so, unsurprisingly, any bot with malicious intent probably isn’t going to check the robots file to review the requested policy.
The AGS robots file says that for any user of the site (“User-agent: *”), you cannot programmatically access any URL under the admin path (“/admin/”). In effect, we’re free to scrape AGS, as long as we stay away from the admin pages. Fair enough. Note that robots files can be, and often are, much more complex – LinkedIn’s, for example.
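If you’d rather check this programmatically, Python’s standard library can parse robots files for you. Here’s a minimal sketch using the Python 2 robotparser module (urllib.robotparser in Python 3); the expected results simply reflect the rules above:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.americangemsociety.org/robots.txt')
rp.read()

# The admin path is disallowed for every user agent...
rp.can_fetch('*', 'https://www.americangemsociety.org/admin/')  # False

# ...but the public jeweler pages are fair game.
rp.can_fetch('*', 'https://www.americangemsociety.org/en/find-a-jeweler')  # True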
Web Scraping Case Study Objective
We want to extract from AGS the full list of certified jewelers with their contact info and certification status. For example, this is the page for Arthur Weeks & Son Jewelers in NY. Note that each field we want to extract sits on its own line, with its heading in bold. If you want to find the phone number, it might be as easy as finding the bolded “Phone:” text, and then grabbing the next bit of content. Before we jump into parsing this page, however, we need to figure out how to get to it in the first place. If we want all of the jewelers, we need a list of jewelers or their URLs, and as we saw, AGS only provides access to jewelers by zip, state, or search.
At first blush, search by state seems promising. We could list out each state, and construct a link based on the state name. If we click on NY state, for example, it opens up a link to https://www.americangemsociety.org/en/newyork-jewelers. Perhaps we can just construct these URLs by listing out every state.
The page for each state presents another problem, as they contain lists of jewelers with some details and some links. We would need to figure out how to get from a state’s page to each individual jeweler’s page.
Traversing the Mental Map
What we’ve just done is built a mental map of a traversal scheme. We started with our endpoint, the individual jeweler pages, and worked our way backwards to a starting point. The map of states has links to summaries for each state, and those summaries contain links to each of the jeweler pages. We may be able to construct URLs for each state’s page, but we would need to do some parsing from there. Let’s take each of these pages one at a time now.
Let’s take a closer look at the state links. While it may be easy enough to type out all the states and format URLs per state, I don’t want to do that – it sounds like a lot of work. We should be able to write some code to find all the state links. Certainly, if we just pull all the links on this page, we’ll get back way more than just those for the state pages – this would include links to other areas of the site, all the stuff on the footer, and the social media icons. We need to narrow this down. Inspecting the NY link, we see this:
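The markup behind that map is something along these lines (the shape and coordinates here are illustrative; the piece we care about is the href):

<area shape="poly" coords="..." href="/en/newyork-jewelers">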
The link is actually defined with an area element, which is an image map with clickable coordinates. Looking over the page, it doesn’t look like there are any other image maps – we should be able to filter for just the area links to get the URLs for each state’s page.
import urllib2
from lxml import etree

# Fetch the "Find a Jeweler" page and parse the HTML:
response = urllib2.urlopen('https://www.americangemsociety.org/en/find-a-jeweler')
tree = etree.parse(response, etree.HTMLParser())

# Every state link on the page lives in an area element:
tree.xpath('//area/@href')
'''['/en/alabama-jewelers',
 '/en/alaska-jewelers',
 '/en/arizona-jewelers',
 '/en/arkansas-jewelers',
 ...
 '/en/westvirginia-jewelers',
 '/en/wisconsin-jewelers',
 '/en/wyoming-jewelers']'''
There you go. We can use those links and programmatically get to each state’s page. Next step, we need to see if there’s a similar shortcut to get from each state page to each jeweler’s page.
The first step is to inspect these links and see if there’s a key attribute that’s common to each of them, and not common to anything else on the page. It appears we’re in luck.
We can see that each jeweler link has the class “jeweler__link”, making it easy to use XPath or CSS to pick out just these links from the page.
XPath: //a[@class="jeweler__link"]/@href
CSS: a.jeweler__link
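As a quick sketch – assuming tree is an lxml tree parsed from one of the state pages (the same approach as the earlier snippet, just with a state URL), and that the optional cssselect package is installed for the CSS route – either selector pulls back the same links:

# Each jeweler entry is an anchor along these lines (href and text are placeholders):
#   <a class="jeweler__link" href="/en/some-jeweler">Some Jeweler</a>

# XPath: ask for the href attributes directly.
jeweler_hrefs = tree.xpath('//a[@class="jeweler__link"]/@href')

# CSS: select the anchor elements, then read each href.
jeweler_hrefs = [a.get('href') for a in tree.getroot().cssselect('a.jeweler__link')]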
Crawling
Now we know how to crawl to the pages we want to hit, and it’s just a matter of parsing those pages for the data. Let’s take a closer look at Arthur Weeks & Son.
Our goal is to generate some JSON for each jeweler – what is the name of the jeweler, what grading do they offer, who are the certified AGS members, what is their address, their phone number, and their web page – this is basically all of the information available per jeweler. We need to check each of these fields one at a time, inspecting for ways we can query for just the data we want, either using XPath or CSS.
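Concretely, each jeweler record should end up shaped something like this (all of the values here are placeholders):

{
    "name": "Some Jeweler",
    "grading": "...",
    "certified": ["Jane Smith, CGA", "John Doe, RJ"],
    "address": "123 Main St, Anytown, NY 10001",
    "phone": "(555) 555-5555",
    "website": "http://www.example.com/"
}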
The jeweler name seems easy enough. It’s the largest thing on the page, and the only text in an h1 tag.
The appraiser grading is similarly identifiable. Each grading is bold text (a strong tag) held in a p tag that has the class “appraiser__grading”.
The list of certified members is a bit more complex. Each member is held in a list item, but that list (a ul tag) is not underneath any easily identifiable tag. Instead, the list sits “next” to a p tag with the class “appraiser__certified”. In other words, to get to the list of certified members, you need to find the “appraiser__certified” tag, move to the next ul element on the same level as it, and then grab the content from its contained li items. Deconstructing this:
First, find the p tag with the “appraiser__certified” class:
//p[@class="appraiser__certified"]

From there, we need the following ul tag – the key bit of XPath here is following-sibling, which says find the next element at the same level:
./following-sibling::ul

And then we need all of the text contained in the list items within that ul:
.//li/text()

Altogether, that looks like:
//p[@class="appraiser__certified"]/following-sibling::ul//li/text()
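As a sanity check, assuming tree is the parsed jeweler page, the full query walks markup shaped roughly like the comment below (the heading text and member names are placeholders):

# <p class="appraiser__certified">Certified AGS Members:</p>
# <ul>
#   <li>Jane Smith, CGA</li>
#   <li>John Doe, RJ</li>
# </ul>
members = tree.xpath('//p[@class="appraiser__certified"]/following-sibling::ul//li/text()')
# members -> ['Jane Smith, CGA', 'John Doe, RJ']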
The address, phone number, and website are all easily accessible. (Note the address is actually held under a tag with the class appraiser__hours.)
Putting the Scraper Together
That’s it – we have everything we need to build a web crawler and scrape this data. Here’s an example that simply iterates over all the links we get, traversing to the jeweler pages, and then uses XPath to get the data:
import urllib2
from lxml import etree

domain = 'https://www.americangemsociety.org'

# Utility method for parsing the main page for each of the state links:
def get_state_links(url='https://www.americangemsociety.org/en/find-a-jeweler'):
    response = urllib2.urlopen(url)
    tree = etree.parse(response, etree.HTMLParser())
    return ['{}{}'.format(domain, href) for href in tree.xpath('//area/@href')]

# Utility method for parsing a state page and getting each of the jeweler links:
def get_jeweler_links(url):
    response = urllib2.urlopen(url)
    tree = etree.parse(response, etree.HTMLParser())
    return ['{}{}'.format(domain, href) for href in tree.xpath('//a[@class="jeweler__link"]/@href')]

# Given a URL to a jeweler's page, retrieve the HTML response and use XPath to pull out
# all of the data we want into a dictionary:
def parse_jeweler_page(url):
    response = urllib2.urlopen(url)
    tree = etree.parse(response, etree.HTMLParser())
    fields = {
        'name': '//h1[@class="page__heading"]/text()',
        'grading': '//p[@class="appraiser__grading"]/strong/text()',
        'certified': '//p[@class="appraiser__certified"]/following-sibling::ul//li/text()',
        'address': '//p[@class="appraiser__hours"]/text()',
        'phone': '//p[@class="appraiser__phone"]/text()',
        'website': '//p[@class="appraiser__website"]/a/@href'
    }
    return {k: tree.xpath(v) for k, v in fields.items()}

def crawl():
    jeweler_data = []
    # Crawl through each of the state links...
    for state_link in get_state_links():
        # For each state, crawl through each of the jeweler links...
        for jeweler_link in get_jeweler_links(state_link):
            # And for each jeweler, store the extracted data:
            jeweler_data.append(parse_jeweler_page(jeweler_link))
    return jeweler_data

# A sample of what crawl() returns:
'''[{'address': [' 4355 Montgomery HwySte 2, Dothan, AL 36303-1696'],
  'certified': ['Ronny Lisenby, RJ'],
  'grading': [],
  'name': ["Bradshaw's Jewelers"],
  'phone': [' (334) 793-6363'],
  'website': ['http://www.bradshawjewelers.com/']},
 {'address': [' 333 Fairhope Ave, Fairhope, AL 36532-2317'],
  'certified': ['Michael Brenny, CGA'],
  'grading': [],
  'name': ["Brenny's Jewelry Co."],
  'phone': [' (251) 928-3916'],
  'website': ['http://brennysjewelry.com/']},
 ...
]'''
Though we have the data here, we’re not really done yet. First, note that each of the values pulled back by XPath is a list. We don’t really want that. For example, the name of each jeweler isn’t a list of stuff, but just a string – the one name. We can use XPath to pull back the first value for each retrieved field, or we can post-process the extracted results and just grab the first element in each of these lists. Second, it’s one thing to extract data, but we probably want to persist it somewhere as well – save it to a file or database, for example. In any event, hopefully this served as a straightforward web scraping case study.
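Here’s a rough sketch of that post-processing – flattening the single-item lists, stripping stray whitespace, and writing the results out as JSON. It assumes the output shape shown above and is meant as a starting point rather than anything definitive:

import json

def clean(record):
    # Flatten the single-item lists returned by XPath and strip stray whitespace.
    cleaned = {}
    for field, values in record.items():
        if field == 'certified':
            # The certified members really are a list; just tidy each entry.
            cleaned[field] = [value.strip() for value in values]
        else:
            # Everything else should be a single value; take the first match, if any.
            cleaned[field] = values[0].strip() if values else None
    return cleaned

def save(jeweler_data, path='jewelers.json'):
    with open(path, 'w') as f:
        json.dump([clean(record) for record in jeweler_data], f, indent=2)

# Usage: save(crawl())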
The next step from here is re-visiting this same example, but leveraging Scrapy, the standard when it comes to web scraping libraries in Python. We’ll go through that in the next post.