While I have some writing momentum, I’m going to take a stab at a series teaching some web scraping. My plan is to do this through case studies of increasing complexity.
Let’s get some preliminary stuff out of the way first.
If you’re reading this on Medium, you’ll be missing out on the code samples! Full details are available on the actual blog.
When you access a web page via a browser, that browser retrieves and executes a complex set of instructions. Those instructions tell the browser what content to show where, when, and how. Some of them tell the browser to go retrieve more instructions, or more content, from other places around the web; others tell the browser to report information about you, the viewer, and your computer back to the owner of that web page. These instructions come in many forms: HTML, CSS, JavaScript, HTTP headers, and more. This post won’t be explaining what each of these is or does.
We can see the innards of these instructions using the curl command. (I’ll leave it to you to sort out how to install it if it’s not readily available from your terminal or command prompt.)
For example, requesting the root page of my domain, http://craigperler.com/, gets us the following exchange:
curl -v http://craigperler.com/
* Trying 173.254.19.193...
* Connected to craigperler.com (173.254.19.193) port 80 (#0)
> GET / HTTP/1.1
> Host: craigperler.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Tue, 25 Oct 2016 03:08:49 GMT
< Server: Apache
< Location: https://www.craigperler.com/blog
< Content-Length: 280
< Content-Type: text/html; charset=iso-8859-1
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
The document has moved <a href="https://www.craigperler.com/blog">here</a>.
<hr>
<address>Apache Server at craigperler.com Port 80</address>
</body></html>
* Connection #0 to host craigperler.com left intact
An explanation:
- Line 1 is the curl call requesting the page http://craigperler.com/.
- Lines 2 and 3 show curl resolving the domain and connecting to the remote server.
- Lines 4-8 are the request curl sends: the request line followed by the request headers, meta-data passed along to the remote webserver with the request for the page. As a specific example, line 6 identifies the “user agent”, a description of the client making the retrieval request.
- Lines 9-15 are the response status line and headers, meta-data sent back to us by the remote webserver. A web browser might take action upon reading these headers; our curl command simply prints them.
- Lines 16-24 contain the actual web page response (the body). In this case, the page is telling us that the document we really want is in another castle. More on this in a second.
- The last line marks the end of the exchange; note that curl reports the connection was left intact (kept open for reuse) rather than closed.
If you open up http://craigperler.com/ in a web browser, why don’t you see the message “The document has moved...”? It’s because your web browser is following the instructions in the response headers. “HTTP/1.1 302 Found” tells the browser that the document lives somewhere else, and “Location: https://www.craigperler.com/blog” tells it where. Consequently, your web browser automatically redirects you to the new URL, https://www.craigperler.com/blog. If you curl that guy, you’ll see another 302 redirect to https://www.craigperler.com/blog/; and if you curl that one, you’ll finally get some real content that lines up with what your browser is showing. Clearly, those HTTP headers are just as important as the actual rendered content.
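If you’d like to watch that redirect chain from code rather than a browser, here’s a quick sketch using the Requests package (introduced properly below). Requests follows redirects by default and records each intermediate hop; the outputs shown assume the same chain we just traced with curl.
import requests
# Requests follows redirects by default; each intermediate response ends up
# in response.history.
response = requests.get('http://craigperler.com')
[hop.status_code for hop in response.history]
# [302, 302]
response.url
# 'https://www.craigperler.com/blog/'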
The first step in scraping is understanding what URLs are of interest, which may not always be obvious.
Once you know what URL you want to scrape, the next step is to get a reference to that content so you can parse, analyze, and persist the details. Using curl, you could pipe the results to a file. Instead, let’s see how this looks with Python, in a few ways.
The package urllib2 will automatically follow the 302 redirects, so asking for craigperler.com lands us at the final /blog/ location.
import urllib2
response = urllib2.urlopen('http://craigperler.com')
response.code
# 200
response.url
# 'https://www.craigperler.com/blog/'
content = response.read()
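We saw with curl that request headers ride along with every request; urllib2 lets us set our own by building a Request object first. Here’s a minimal sketch, where the User-Agent string is just an arbitrary example of mine, not something the server expects:
import urllib2
# Attach custom request headers via a Request object; the User-Agent value
# here is an illustrative placeholder.
request = urllib2.Request('http://craigperler.com',
                          headers={'User-Agent': 'my-toy-scraper/0.1'})
response = urllib2.urlopen(request)
response.code
# 200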
If we want real control of the wheel, we need to dive in a bit further. The next example uses the lower-level httplib package, which won’t follow the redirect for us.
import httplib
cnx = httplib.HTTPConnection('craigperler.com')
cnx.request('GET', '/')
response = cnx.getresponse()
response.status
# 302
response.getheaders()
'''
[('content-length', '280'),
('server', 'nginx/1.10.2'),
('connection', 'keep-alive'),
('location', 'https://www.craigperler.com/blog'),
('date', 'Thu, 27 Oct 2016 03:28:38 GMT'),
('content-type', 'text/html; charset=iso-8859-1')]
'''
content = response.read()
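Since httplib won’t chase the 302 for us, following it means reading the Location header and issuing a second request by hand. Here’s a rough sketch that reuses the response object from above and assumes the same redirect target we saw earlier (note the hop to HTTPS):
import httplib
import urlparse
# Pull the redirect target out of the 302 response and request it manually.
location = response.getheader('location')
# 'https://www.craigperler.com/blog'
parsed = urlparse.urlparse(location)
cnx2 = httplib.HTTPSConnection(parsed.netloc)
cnx2.request('GET', parsed.path)
response2 = cnx2.getresponse()
response2.status
# 302 (this hop redirects once more, to the trailing-slash URL)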
We can accomplish the same via the Requests package:
import requests
response = requests.get('http://craigperler.com', allow_redirects=False)
response.status_code
# 302
content = response.content
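Since we told Requests not to follow the redirect, the target URL is still sitting in the response headers if we want to follow it ourselves:
# With redirects disabled, the new address is exposed in the Location header.
response.headers.get('Location')
# 'https://www.craigperler.com/blog'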
And lastly, the same spiel but with more code, this time using a custom HTTP handler:
import urllib
import urllib2
class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        # Instead of following the redirect, hand back the raw 302 response.
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl
opener = urllib2.build_opener(NoRedirectHandler())
urllib2.install_opener(opener)
response = urllib2.urlopen('http://craigperler.com')
response.status
# 302
content = response.read()
The reason I gave several examples here is to emphasize that there are lots of ways to retrieve content over the web, and plenty more beyond those above. Sometimes one of them will make more sense than the others, so it’s good to be aware of the toolbox.
As a final step in this intro, let’s pretend we want to scrape my blog’s landing page for post titles (just those visible on the landing page). If I continue with these writeups, a later post will dive into parsing the HTML content in depth; for now, we can hack our way to what we want with some basic Python string operations.
import urllib2
content = urllib2.urlopen('http://craigperler.com').read()
# All of the blog posts on the landing page are wrapped in anchor (<a>) tags:
before_anchor_tags = content.split('</a>')
len(before_anchor_tags)
# 141
# Let's filter those links for just those that stay on my domain:
anchors_on_my_domain = [anchor for anchor in before_anchor_tags if 'craigperler.com' in anchor]
len(anchors_on_my_domain)
# 137
# Given I've only written blog posts in the last few years,
# we can further filter out URLs using that.
anchors_on_my_domain = [anchor for anchor in anchors_on_my_domain
                        if 'craigperler.com/blog/2' in anchor]
len(anchors_on_my_domain)
# 42
# Each anchor in our list ends with a ">post title" pattern. We can use that to our advantage.
possible_posts = []
for anchor in anchors_on_my_domain:
    tokens = anchor.split('>')
    if tokens[-1].strip() != '':
        possible_posts.append(tokens[-1].strip())
len(possible_posts)
# 22
# There's still some noise (pulling in calendar links in addition to the post titles; unicode),
# but this is a solid start.
possible_posts[:5]
'''
['On Web Scraping',
'Tracking Personal Finances',
'ProjectSHERPA: a startup retrospective',
'Better Babies',
'Can Yelp Reviews Predict Real Estate Prices?']
'''
possible_posts[-5:]
# ['Side Projects', '« Apr', '21', '22', '24']
The next post will focus on other ways to parse the content.