Requests

Requests is a Python library for making HTTP requests, such as fetching web pages. It’s very useful for working with APIs, but we’ll only use its basic functionality to fetch web pages.

Basic Request

For this part of the workshop, we’ll be trying to get data from each of the lists linked from this Wikipedia list of lists of legendary creatures. To start, we’ll just use requests to get that starting page, to show how simple it can be.

# Import requests
import requests
# Define our url
url = "https://en.wikipedia.org/wiki/Lists_of_legendary_creatures"
# Make a request
# R will be a response object
R = requests.get(url)
# We can get basic information, like the HTTP response status code
# This is useful for checking whether your request went through
# or whether there was an error (200 means OK)
R.status_code
# The text attribute holds the HTML of the response
# This is what we'll feed to the library that actually does the parsing
# I'm only displaying the first 200 characters; the full text is R.text
R.text[:200]
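
If you want a more readable view of what a status code means, here is a minimal sketch using only Python's standard library. The helper name describe_status is hypothetical and not part of requests; in practice you would pass it R.status_code. Requests also offers R.raise_for_status(), which raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration; not part of requests.
    """
    # HTTPStatus raises ValueError if the code is not a known status
    status = HTTPStatus(status_code)
    if 400 <= status_code < 500:
        kind = "client error"
    elif status_code >= 500:
        kind = "server error"
    else:
        kind = "no error"
    return f"{status_code} {status.phrase} ({kind})"


# With a real response you would call describe_status(R.status_code)
print(describe_status(200))  # 200 OK (no error)
print(describe_status(404))  # 404 Not Found (client error)
```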

Extra features

That’s all we’ll need from requests for this workshop, but I want to point out another requests feature that I use a lot.

You can define HTTP GET parameters (the part at the end of a url that looks like ?foo=bar&biz=buz) in a dictionary and pass it to your request. Since websites often use these parameters in ways that matter for web scraping, such as search terms and filters, I want to give a quick overview of this functionality.
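
To make the idea of GET parameters concrete, here is a small sketch that pulls a url apart with the standard library's urllib.parse. The url is a made-up example.com address, used only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# A made-up url for illustration only
example_url = "https://example.com/search?foo=bar&biz=buz"

# Split the url into its components
parts = urlsplit(example_url)
print(parts.path)   # /search
print(parts.query)  # foo=bar&biz=buz

# Turn the query string into a dict; each value is a list,
# because a parameter can appear more than once in a url
example_params = parse_qs(parts.query)
print(example_params)  # {'foo': ['bar'], 'biz': ['buz']}
```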

# This is a long url that's hard to work with
longurl = "https://anth1130.omeka.fas.harvard.edu/elasticsearch?q=sherd&facet_itemtype=Archaeological+Find&facet_tags%5B%5D=ceramic"
# If you're just starting with Python, your first instinct for inserting 
# a variable into this url might be string concatenation.
# That's fine, and it will work, but it gets annoying after a while
searchterm = "sherd"
longurl = "https://anth1130.omeka.fas.harvard.edu/elasticsearch?q=" + searchterm + "&facet_itemtype=Archaeological+Find&facet_tags%5B%5D=ceramic"
# Here's our first request with that long url
R1 = requests.get(longurl)
R1.text[:1000]
# Here's what that definition can look like using a dict to define GET parameters
# Note how much easier it is to insert a variable into the parameters
baseurl = "https://anth1130.omeka.fas.harvard.edu/elasticsearch"
searchterm = "sherd"
params = {
    "q": searchterm,
    "facet_itemtype": "Archaeological Find",
    "facet_tags[]": "ceramic"
}
# Here's our second request, this time using a dict to define GET parameters
R2 = requests.get(baseurl, params=params)
# You can always check the url that was used to make the request
# Note how the params we defined are there, URL-encoded (the space became +, and [] became %5B%5D)
# They may not always be in the same order as in the hand-written url
R2.url
# Here's the start of the text of our response
R2.text[:1000]
# Did our two requests produce the same results?
R2.text == R1.text
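
If you want to check that two urls carry the same parameters without making any requests, one option is to compare them with the standard library. This is only a sketch: the helper name same_params is hypothetical, and urlencode is used here as an illustration of how a dict can become a query string, which may not match what requests does in every case.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

baseurl = "https://anth1130.omeka.fas.harvard.edu/elasticsearch"
params = {
    "q": "sherd",
    "facet_itemtype": "Archaeological Find",
    "facet_tags[]": "ceramic",
}

# Encode the dict as a query string: spaces become +, [] becomes %5B%5D
query = urlencode(params)
print(query)  # q=sherd&facet_itemtype=Archaeological+Find&facet_tags%5B%5D=ceramic


def same_params(url_a, url_b):
    """Return True if two urls have the same GET parameters, ignoring order.

    Hypothetical helper for illustration; not part of requests.
    """
    return parse_qs(urlsplit(url_a).query) == parse_qs(urlsplit(url_b).query)


built = baseurl + "?" + query
# The same parameters, written by hand in a different order
reordered = baseurl + "?facet_tags%5B%5D=ceramic&q=sherd&facet_itemtype=Archaeological+Find"
print(same_params(built, reordered))  # True
```

With real responses, you could call same_params(R1.url, R2.url) to compare the urls that were actually requested.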