Selenium¶
Sometimes, you’ll encounter a page that loads content after the initial page load. You might get a very small HTML file from requests that doesn’t contain everything on the page. This happens when a page relies on JavaScript to load its contents, which can make scraping difficult. One such site is the WHO photo archive. We’re going to take a look at scraping a collection of photos of WHO staff to show how you might approach a site like this.
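If you want to check this for yourself, a minimal sketch using the requests library might look like the following (the hovThumb class is the thumbnail selector we’ll identify below):
import requests
# Fetch the page without a browser; on JS-heavy pages, the raw HTML
# often won't contain the content you see while browsing normally
html = requests.get("https://photos.hq.who.int/galleries/172/who-staff-in-office").text
print(len(html))           # often suspiciously small
print("hovThumb" in html)  # the thumbnail class we'll look for later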
We’ll use Selenium to get around this page-load issue. Selenium was built as a front-end testing tool, allowing developers to automatically verify that their applications still work as they change them. This functionality translates well to web scraping, since it creates an actual web browser window controlled by Python. That lets our scripts take advantage of the browser’s ability to execute JavaScript, and lets us scrape anything we can see when browsing a site as a normal user.
Setup¶
Selenium sets up a webdriver object that will be your primary way to interact with the Python-controlled browser. So first we’ll create that object and use it to fetch a page.
from selenium import webdriver
# Uncomment and run this cell if you're running this notebook out of WSL or another headless environment
# If that doesn't make sense to you then it probably doesn't apply to you
# I copied this out of a stack overflow post, for reference: https://stackoverflow.com/questions/46809135/webdriver-exceptionprocess-unexpectedly-closed-with-status-1
# from selenium.webdriver import FirefoxOptions
# opts = FirefoxOptions()
# opts.add_argument("--headless")
# driver = webdriver.Firefox(options=opts)
# We'll be using Firefox for this, which is why you need both Firefox and geckodriver installed: so that Selenium has a browser to drive
driver = webdriver.Firefox()
# Getting a url is pretty simple:
driver.get("https://photos.hq.who.int/galleries/172/who-staff-in-office")
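If you want to confirm the page loaded in the driver, Selenium exposes a couple of handy properties (an optional sanity check):
# Optional sanity check: where the browser is and what it thinks the page is called
print(driver.current_url)
print(driver.title)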
Examining the site¶
If we take a look at this site, we can figure out what information we might want and where it lives. There’s a gallery of photos, and when you click on one, it shows more information about the photo. If we want to collect metadata about these photos, for instance to take a look at where the photos were taken and in what context, we’d need to click on each photo and copy that information out of the pop-up. Since the URL isn’t changing when we click around, we have strong evidence that requests won’t work, although you’re welcome to try for yourself.
It looks like the images have a class called “hovThumb” on them, so we can take a look and see if that will select what we want. Selenium has a built-in ability to use CSS selectors, which you’ll see below. You can always feed the page source into Beautiful Soup if you prefer that syntax, but we’ll look at how Selenium does it.
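For reference, a minimal sketch of that Beautiful Soup route might look like this (assuming you have bs4 installed; we won’t use it below):
from bs4 import BeautifulSoup
# driver.page_source is the browser's current, JS-rendered HTML, which
# Beautiful Soup can parse like any other document
soup = BeautifulSoup(driver.page_source, "html.parser")
print(len(soup.select(".hovThumb")))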
driver.find_elements(by="css selector", value=".hovThumb")
driver.find_elements(by="css selector", value=".hovThumb")[0].click()
Hm, this is odd. If we look at the end of the stack trace, we can see that we can’t click on the element we’ve selected (the first thing with the hovThumb class), because another element obscures it. It looks like the obscuring element also has the hovThumb class, but with an additional blankDL class. Let’s try selecting elements with both classes.
driver.find_elements(by="css selector", value=".hovThumb.blankDL")[0].click()
Getting HTML contents out¶
Success! We’re able to click on that, and looking at our live browser, it looks right. Looking again in our inspector tool, it looks like we can find the HTML for the pop-up window in an element with the id previewPopupWindow. I’ll show how you might get the HTML content of that element to feed to another parser, if you prefer.
popupContent = driver.find_element(by="id", value="previewPopupWindow")
# You can use the get_attribute function to get the "innerHTML" for anything you've selected in Selenium
print(popupContent.get_attribute("innerHTML"))
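For example, a sketch of handing that popup HTML to Beautiful Soup (again assuming bs4) might look like:
from bs4 import BeautifulSoup
# Parse just the popup's inner HTML with Beautiful Soup instead
popup_soup = BeautifulSoup(popupContent.get_attribute("innerHTML"), "html.parser")
print(popup_soup.get_text()[:200])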
Selenium selectors¶
Going back to our inspector, we can find a few things that might be of interest. One is the top headline, and another is the description below it. There’s more interesting information, but that’s for you to hack on later!
Fortunately, the headline and description look like they have their own ids, so it’s very easy for us to grab text from them.
driver.find_element(by="id", value="ctblc_headline").text
driver.find_element(by="id", value="ctblc_desc").text
Now that we’ve got some information, we need to close the popup. Remember how we couldn’t click on something because it was obscured by something else? That will happen again if we leave the popup up.
driver.find_element(by="css selector", value=".btn-close-popup").click()
Defining a scraping function¶
As we did with Beautiful Soup, we’ll define a function to standardize what we just did. Assuming we have a popup open, we can use this function to extract the headline and description from it.
def get_popup_content():
    headline = driver.find_element(by="id", value="ctblc_headline").text
    description = driver.find_element(by="id", value="ctblc_desc").text
    driver.find_element(by="css selector", value=".btn-close-popup").click()
    return {'headline': headline, 'description': description}
# Let's try applying our function!
metadata = []
for link in driver.find_elements(by="css selector", value=".hovThumb.blankDL"):
    link.click()
    metadata.append(get_popup_content())
Handling errors¶
Oh no! We’ve run into a problem. Our script couldn’t find the headline element. There are two ways this could happen. One is that sometimes a popup might not have a title, which we can accommodate in our function. The other is more complicated.
Essentially, our script is running at the speed of a Python script, executing code as fast as it possibly can. The browser, however, is running at the speed of a browser, which has to wait for network calls and the like. We only want our script to go looking for data once the data is there, so we have to wait until a reliably-present element has loaded before we have the script try. To do that, we need to import a few more things from Selenium.
It may comfort you to know that I didn’t know how to do all this just because I’m so smart; I looked it up on Stack Overflow and copied over some code from someone further along. Here’s that post for reference: https://stackoverflow.com/questions/26566799/wait-until-page-is-loaded-with-selenium-webdriver-for-python. Using other people’s code is a great way to learn, since you still have to figure out how to make it work in your context.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
def get_popup_content():
    try:
        headline = driver.find_element(by="id", value="ctblc_headline").text
    except NoSuchElementException:
        headline = ""
    description = driver.find_element(by="id", value="ctblc_desc").text
    driver.find_element(by="css selector", value=".btn-close-popup").click()
    return {'headline': headline, 'description': description}
# Make sure the popup is closed before we start!
driver.find_element(by="css selector", value=".btn-close-popup").click()
metadata = []
delay = 3 # seconds
for link in driver.find_elements(by="css selector", value=".hovThumb.blankDL"):
    link.click()
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.ID, 'menuDetailsTab')))
        metadata.append(get_popup_content())
        WebDriverWait(driver, delay).until(EC.invisibility_of_element_located((By.ID, 'menuDetailsTab')))
    except TimeoutException:
        print("Loading took too much time!")
What’s blocking us now? Taking a closer look, it seems like it’s the cookie banner! You can close it in the visible browser, but it’s not a bad idea to build closing it into your script.
driver.find_element(by="css selector", value=".cc_btn_accept_all").click()
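Since the banner may already be gone if you rerun this, a slightly more defensive version of that click might look like this (a sketch, reusing the NoSuchElementException import from above):
try:
    driver.find_element(by="css selector", value=".cc_btn_accept_all").click()
except NoSuchElementException:
    pass  # the banner was already dismissed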
# Trying again, without the bar
metadata = []
delay = 3 # seconds
for link in driver.find_elements(by="css selector", value=".hovThumb.blankDL"):
    link.click()
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.ID, 'menuDetailsTab')))
        metadata.append(get_popup_content())
        WebDriverWait(driver, delay).until(EC.invisibility_of_element_located((By.ID, 'menuDetailsTab')))
    except TimeoutException:
        print("Loading took too much time!")
Hey, we finished without errors! Let’s see how much stuff we have.
len(metadata)
Preparing the final scrape¶
Ah, that’s not good. We have 50 things, but looking at the page, you can see it says there are 110 files in the gallery. Why do we only have 50?
If you scroll down the page, you’ll see it get longer as more items load. When we got our list of links to iterate through, we only saw what was there at the initial page load, so we’ll need to update our list as we go to get everything. To do that, we can keep track of how many links we’ve accessed, re-fetch the list every time we move to a new link, and just access the next link in sequence.
counter = 0
metadata = []
delay = 3 # seconds
# This line should scare you! We have to be really confident that we'll get an IndexError to use this,
# otherwise the code could get stuck in an infinite loop.
while True:
    try:
        link = driver.find_elements(by="css selector", value=".hovThumb.blankDL")[counter]
        link.click()
        counter += 1
        try:
            WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.ID, 'menuDetailsTab')))
            metadata.append(get_popup_content())
            WebDriverWait(driver, delay).until(EC.invisibility_of_element_located((By.ID, 'menuDetailsTab')))
        except TimeoutException:
            print("Loading took too much time!")
    except IndexError:
        break
len(metadata)
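As an aside, another way people handle infinite scroll is to scroll the page with execute_script until no new thumbnails appear, and only then collect the full list. Here’s a sketch of that approach (not what we did above; time.sleep is a crude stand-in for a proper wait):
import time
# Keep scrolling to the bottom until the thumbnail count stops growing
last_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(delay)  # give the lazy loader time to fire
    count = len(driver.find_elements(by="css selector", value=".hovThumb.blankDL"))
    if count == last_count:
        break
    last_count = count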
Exporting with pandas¶
That looks better! Now, your mission, should you choose to accept it (and if we have time) is to get more data from each of those popups and include it in our list of metadata. When you’re done, you can use the lines below to export the data to a spreadsheet using pandas.
import pandas as pd
df = pd.DataFrame(metadata)
df.head()
df.to_csv("who_photo_metadata.csv")
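As a starting point for that mission, extending the function just means adding more lookups. Here’s a sketch; ctblc_date is a made-up id for illustration, so you’ll need to find the real ids in your inspector:
def get_popup_content():
    try:
        headline = driver.find_element(by="id", value="ctblc_headline").text
    except NoSuchElementException:
        headline = ""
    description = driver.find_element(by="id", value="ctblc_desc").text
    # "ctblc_date" is hypothetical; check the inspector for the real id
    try:
        date = driver.find_element(by="id", value="ctblc_date").text
    except NoSuchElementException:
        date = ""
    driver.find_element(by="css selector", value=".btn-close-popup").click()
    return {'headline': headline, 'description': description, 'date': date}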