Beautiful Soup¶
Beautiful Soup is a pretty common and easy-to-use library for parsing HTML documents, which is what we need to do in web scraping.
We’ll be using requests to fetch webpages, and BeautifulSoup to parse them. We’re still working with our list of lists of legendary creatures from Wikipedia.
We only fetched that landing page last time, but now we want to actually scrape data from each list. Looking at the lists, we can see that they have information about the names and origins of these creatures, as well as links to the wiki pages for the creatures and their culture of origin.
Setup¶
First we need to set up our libraries, starting url, and initial text, using requests.
# Import all of our tools
# In addition to BeautifulSoup and requests, we also have re, time, and pandas
# re gives us the ability to extract data with regular expressions
# pandas gives us an easy way to export our data as csv
# time gives us the ability to have Python chill out for a bit between page requests
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import time
# Define our start url
url = "https://en.wikipedia.org/wiki/Lists_of_legendary_creatures"
# get the text for our starting page
text = requests.get(url).text
Scraping data¶
We’re going to set up our “soup”, an object representing the page that we can use to select content from.
You’ll probably find it helpful to reference the Beautiful Soup documentation when working on your own scraping projects.
soup = BeautifulSoup(text, 'html.parser')
# Just verifying what kind of thing soup is
type(soup)
The first thing that we’ll need to do is to get a list of the pages that we’ll want to scrape data off of. They’re all listed here, so we can first scrape the links to those pages. This is just going to get us a collection of links to the pages that we want to scrape data from, not any of the data itself, but it’s a simple process so it’s a good introduction to Beautiful Soup.
soup.find_all('ul')
It looks like the list we’re after is the second list on the page. Since Python is 0-indexed, that will be accessed with [1].
alphaList = soup.find_all('ul')[1]
# What kind of thing is this?
type(alphaList)
On bs4 tag objects, you can use .text to get the plain text within the tag, with all of the HTML and formatting taken out. This is generally handy, but especially when you find yourself trying to extract data that has a bunch of styling tags that you don’t care about.
print(alphaList.text)
We can still use .find_all() on the tag we extracted, so we can run a .find_all('a') on our list of page links. That gives us a list of bs4 tags, but what we want are urls we can use. To do that, we can get the href attribute with ['href'].
List comprehension¶
I’ve put this together in a list comprehension. This is a common way in Python to make a list from something else you can iterate through. A less compact approach would look like this:
listPageLinks = []
for a in alphaList.find_all('a'):
    listItem = a['href']
    listPageLinks.append(listItem)
A list comprehension lets you shrink that down, so that if you have a simple operation you want to perform on every item in a list to make a new list, you can do that very easily.
listPageLinks = [a['href'] for a in alphaList.find_all('a')]
listPageLinks
So now we have a list of partial links to all of the pages that we want to iterate through and scrape data off of. Now we can take one of those pages and test out a process for extracting data from the page. If you click through the alphabetical lists of creatures, you’ll see that they have more or less the same structure, so once we figure out a process to scrape data from one page with our script, we’ll be able to apply it to all of the pages.
f-strings¶
To start, we’ll pull the first link from our list. I’m using an f-string to create the full url, which is a string formatting method introduced in Python 3.6. You can read more about it in an f-string overview from RealPython, but for our purposes all you need to know is that if you put an f before the quotation marks of your string, you can insert variables and Python code into the string, as long as you surround the code with curly braces: {}.
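Just to show the syntax on its own before we use it for real, here’s a throwaway example (the variable is made up purely for illustration): both variables and expressions go inside the curly braces.
# A throwaway f-string example: variables and expressions both go in curly braces
favorite = "Amarok"
print(f"My favorite creature is {favorite}, which has {len(favorite)} letters")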
newUrl = f'https://en.wikipedia.org{listPageLinks[0]}'
print(newUrl)
# make a new soup object to extract new data
newSoup = BeautifulSoup(requests.get(newUrl).text, 'html.parser')
type(newSoup)
Once again, we can look for unordered list tags.
newSoup.find_all('ul')
# The second ul looks like the one we want
creatureList = newSoup.find_all('ul')[1]
# We can get the first creature in the list by finding all li elements
# and getting the first one
testCreature = creatureList.find_all('li')[0]
testCreature
# Using .text gets us the plain text of the element with no HTML markup
testCreature.text
# Calling the built-in str() function on the element gets us the full raw HTML
str(testCreature)
# For the name, all we need is the text of the first link element
name = testCreature.find('a').text
name
# We can also get the href attribute of that first link
pageLink = testCreature.find('a')['href']
pageLink
Regular expressions¶
We’re about to dip our toes into regular expressions, just a little bit. Regular expressions are another way to extract information from text. Where XPath and CSS selectors rely on the structure of an HTML document, regular expressions allow you to articulate general patterns in text, whether it’s marked up in HTML or not.
Getting familiar with regular expressions takes time, but there are a few things to know that will help you get started:
- Each text character is an instruction of some kind. For most text characters, that instruction is just “match this one letter exactly”, but it’s still an instruction. Special characters carry special instructions.
- The “dot”, . , matches any character.
- “Quantifiers” change how many of the previous thing can be matched. The most common quantifiers are:
  - + matches the thing before it one or more times, so whatever is before it still has to be there, but may occur many times.
  - * matches the thing before it zero or more times, so whatever is before it is optional.
- If you want to actually match a character that normally has an instruction associated with it, you can “escape” it with \ . If you want to literally match a period, you can use \.
- You can extract just part of an expression by enclosing it with (). This allows you to use some text around the thing that you’re interested in to find it, but not keep that extra cruft in the data that you extract.
We’re just scratching the surface of regular expressions, but that should be enough to let you know what’s happening in the regular expression we’ll use. You can take a closer look at regex101, where I’ve set up the expression along with the HTML of our first list item, so that you can take a look at how it works. You can also check out the quick reference at the bottom of regex101, which has a much more thorough overview of things you can do.
If you want a structured way to learn regular expression syntax, regexone is a good tutorial site, and if you like puzzles, regex crossword is good for practice.
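To make those rules a bit more concrete before we use them, here’s a quick throwaway example (the sample string is made up just for illustration):
# A tiny standalone example of the regex pieces described above
# (the sample string is made up just for illustration)
sample = "Amarok (Inuit)"

# . matches any single character, and + means "one or more of the previous thing",
# so A.+k matches an A, then one or more characters, then a k
re.findall(r"A.+k", sample)      # ['Amarok']

# \( and \) match literal parentheses, and the () capture group means
# findall returns just the text inside the parentheses, not the whole match
re.findall(r"\((.*)\)", sample)  # ['Inuit']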
origin = re.findall(r"\((.*)\)", str(testCreature))[0]
origin
# We can make a new soup from what our regex found
# That way we can leverage the HTML structure of our document to extract info
originSoup = BeautifulSoup(origin, 'html.parser')
originName = originSoup.find('a').text
originName
# We can also get the link pretty easily
originLink = originSoup.find('a')['href']
originLink
Making a scraping function¶
Now that we’ve tested out a process for extracting information from that first list item, we can make a function to generalize the process for every list item. If you’re new to Python, you might not have defined many functions, but it’s easy to do, and lets you distill a process and make it repeatable. This function is set up to return a dictionary with the name, page link, origin names, and origin links from one of the creatures on our list. You’ll notice that origin names and links are set up as lists, because I noticed on the page that sometimes a legendary creature is associated with more than one cultural origin.
def parse_creature(creature):
    name = creature.find('a').text
    pageLink = creature.find('a')['href']
    origin = re.findall(r"\((.*)\)", str(creature))[0]
    originSoup = BeautifulSoup(origin, 'html.parser')
    originNames = [a.text for a in originSoup.find_all('a')]
    originLinks = [a['href'] for a in originSoup.find_all('a')]
    return {
        "name": name,
        "pageLink": pageLink,
        "originNames": originNames,
        "originLinks": originLinks
    }
# Let's try out this function on our test creature
# This should work, since it's the same data we used to test on
parse_creature(testCreature)
With our function created, it’s now very easy to loop through all of the list items in our list of legendary creatures that start with “A” to extract information about them.
for creature in creatureList.find_all('li'):
    print(parse_creature(creature))
# We can make this into a list with a list comprehension
aCreatures = [parse_creature(creature) for creature in creatureList.find_all('li')]
Making a spreadsheet with pandas¶
Pandas does a lot of stuff that I’m not going to get into here (that’s a topic for another workshop), but one thing that it’s very good at is taking different kinds of Python data structures and turning them into tabular data. You can often just call pd.DataFrame() on an object containing data that you want to make tabular, and as long as you’re thinking about data in the same way that pandas does, it’ll do the trick. A list of dictionaries with common keys is one structure that this method understands, so we can use it to inspect our data.
df = pd.DataFrame(aCreatures)
df.head()
Expanding the scope of the scrape¶
Now that we’ve got a function for extracting data from a list entry, and we know how to use it to iterate over all of the list items in a page, we can make a function that takes a url and does the whole scraping process for us on a single page.
This function will fetch the page, parse it with Beautiful Soup, extract a list of legendary creatures, and then extract all of the information about those creatures into a list, which the function returns as output.
def scrape_creature_page(url):
    R = requests.get(url)
    soup = BeautifulSoup(R.text, 'html.parser')
    creatureList = soup.find_all('ul')[1]
    creatures = [parse_creature(creature) for creature in creatureList.find_all('li')]
    return creatures
# Let's use this function on the third page in our dataset, as a test
scrape_creature_page(f'https://en.wikipedia.org{listPageLinks[2]}')
Errored! Looking at the error, there’s a “list index out of range”. In the stack trace, that’s happening when we’re trying to extract the origin with a regular expression. The list index used on that line is to get the first result from the regular expression. The fact that we’re getting an index error when trying to use the first item in a list of regex matches tells us that the regex didn’t turn up any results. We’ll have to adjust our script to accommodate.
In Python, you can use a try block in situations like this. That’s a way of saying “try this, and if it fails in a specific way, do something else instead of crashing out like you normally would”. In our case, we can say “try to extract the origin with a regular expression, and if you don’t find anything, set the origin names and links to an empty list, then print out the element that failed so we can look at it”.
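Before we fold that into our parsing function, here’s the bare shape of the pattern on its own, using a made-up empty list just for illustration:
# The bare try/except pattern, shown with a made-up empty list
matches = []
try:
    first = matches[0]   # raises an IndexError, since the list is empty
except IndexError:
    first = None         # instead of crashing, we fall back to a default value
print(first)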
In this notebook, I’m redefining functions as we go. This is just for learning purposes, so that the iterative process of trying stuff out, encountering errors, and fixing them in a function is explicit. When you’re scraping on your own, just change the function definition in the initial cell, rather than copying it and making a new one. Having multiple function definitions for the same function name is confusing, and makes it easier for you to have a bad time in your notebook.
def parse_creature(creature):
    name = creature.find('a').text
    pageLink = creature.find('a')['href']
    try:
        origin = re.findall(r"\((.*)\)", str(creature))[0]
        originSoup = BeautifulSoup(origin, 'html.parser')
        originNames = [a.text for a in originSoup.find_all('a')]
        originLinks = [a['href'] for a in originSoup.find_all('a')]
    except IndexError:
        originNames = []
        originLinks = []
        print(f"No origin found in {creature}")
    return {
        "name": name,
        "pageLink": pageLink,
        "originNames": originNames,
        "originLinks": originLinks
    }
scrape_creature_page(f'https://en.wikipedia.org{listPageLinks[2]}')
Hey, it worked! That third page scraped successfully, and looking at the entry where our script couldn’t find an origin, we can confirm that there was no origin to be found. That kind of verification is important in web scraping, since you might run into different repeated patterns in how data is organized on a website, and you’ll want to know whether you need to accommodate them.
Now we’re ready to try out our function on all of our links!
allCreatures = []
for link in listPageLinks:
    allCreatures.extend(scrape_creature_page(f"https://en.wikipedia.org{link}"))
    time.sleep(0.2)
Oh no, more errors! In this case, we see that it’s an attribute error: something tried to read a .text attribute on a NoneType. Looking at the stack trace, we see that we were reading .text from the result of creature.find('a'), so it looks like that call was returning None. We can approach this in a similar way to how we approached the missing origin error:
def parse_creature(creature):
    try:
        name = creature.find('a').text
        pageLink = creature.find('a')['href']
    except AttributeError:
        name = ""
        pageLink = ""
        print(f"Could not find link in {creature}")
    try:
        origin = re.findall(r"\((.*)\)", str(creature))[0]
        originSoup = BeautifulSoup(origin, 'html.parser')
        originNames = [a.text for a in originSoup.find_all('a')]
        originLinks = [a['href'] for a in originSoup.find_all('a')]
    except IndexError:
        originNames = []
        originLinks = []
        print(f"No origin found in {creature}")
    return {
        "name": name,
        "pageLink": pageLink,
        "originNames": originNames,
        "originLinks": originLinks
    }
allCreatures = []
for link in listPageLinks:
    allCreatures.extend(scrape_creature_page(f"https://en.wikipedia.org{link}"))
    time.sleep(0.2)
No errors! Looking through the printouts for entries that were missing an origin or a link, we can see that most of them really don’t have one to extract. I do see a few at the end where the format is “Creature - Origin” rather than “Creature (Origin)”, but it looks like there are only three of them, so it’s not worth rewriting the regular expression to accommodate them.
However, looking through these errors brings up a good idea. It’s helpful to keep the original text from which the info was extracted, so let’s include that in our output.
def parse_creature(creature):
    try:
        name = creature.find('a').text
        pageLink = creature.find('a')['href']
    except AttributeError:
        name = ""
        pageLink = ""
        print(f"Could not find link in {creature}")
    try:
        origin = re.findall(r"\((.*)\)", str(creature))[0]
        originSoup = BeautifulSoup(origin, 'html.parser')
        originNames = [a.text for a in originSoup.find_all('a')]
        originLinks = [a['href'] for a in originSoup.find_all('a')]
    except IndexError:
        originNames = []
        originLinks = []
        print(f"No origin found in {creature}")
    return {
        "name": name,
        "pageLink": pageLink,
        "originNames": originNames,
        "originLinks": originLinks,
        "sourceText": str(creature)
    }
Completed scrape¶
With those functions defined, our final scrape is a pretty short command that builds on all of the work we’ve done so far. These four lines will grab all of the data we’ve defined and add it to the allCreatures list, which we can export to a spreadsheet with pandas.
allCreatures = []
for link in listPageLinks:
    allCreatures.extend(scrape_creature_page(f"https://en.wikipedia.org{link}"))
    time.sleep(0.2)
With all that data extracted, we can take a look at it in pandas:
pd.DataFrame(allCreatures)
The other thing that we’ll use pandas for is exporting to CSV. If you’ve tried to use Python’s built-in csv module for this kind of thing, this will be a welcome change of pace: just one line to export to CSV.
pd.DataFrame(allCreatures).to_csv("legendary_creatures.csv")
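For comparison, here’s roughly what that export would look like with the built-in csv module (just a sketch, assuming every dictionary in allCreatures has the same keys, which is true for the output of parse_creature):
# A rough equivalent using the built-in csv module, for comparison
import csv

with open("legendary_creatures.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(allCreatures[0].keys()))
    writer.writeheader()
    writer.writerows(allCreatures)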