NPR Political Section Webscraping Demonstration

Presented by Phil White

The following Jupyter Notebook was the basis for the Python/Beautiful Soup demonstration.

Contact Phil White with questions: philip.white@colorado.edu

Dependencies: you will need to ensure that requests and bs4 (Beautiful Soup) are installed; csv and datetime come with the standard library. The easiest way to install these packages is to open your command prompt and type 'pip install bs4' and 'pip install requests'.

Just press shift+enter in each code block (cell) to execute each code element.

Basic workflow

  1. Use requests.get() to call a URL, and append .text to bring in the HTML of the page.

  2. bs(source, 'lxml') tells Beautiful Soup to parse the HTML.

  3. Identify the tags you want to extract data from. It is helpful to go to the page and use right-click inspect element to figure out where the item you want is nested in the HTML structure.

  4. In this example, soup.h2.text grabs the text from the first h2 tag on the page.

  5. soup.h2.a['href'] grabs the hyperlink from the anchor tag inside that first h2.

  6. We then create variables out of each line of code, naming them 'source' (the raw HTML), 'soup' (the parsed object that Beautiful Soup can navigate), 'headline' (the first h2 tag's text), and 'link' (the first link within the first h2 tag).

  7. Finally, we print them both out.

In this example, we're only grabbing the first h2 tag and its first anchor tag from the HTML document.
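Putting those steps together, a minimal sketch might look like this (the NPR politics section URL is an assumption; use whatever page you are scraping):

```python
# A minimal sketch of the basic workflow described above.
# The URL is an assumption; substitute the page you want to scrape.
import requests
from bs4 import BeautifulSoup as bs

source = requests.get('https://www.npr.org/sections/politics/').text  # raw HTML
soup = bs(source, 'lxml')     # parse the HTML

headline = soup.h2.text       # text of the first h2 tag
link = soup.h2.a['href']      # hyperlink inside that h2's anchor tag

print(headline)
print(link)
```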

Next, instead of grabbing the first h2 tag, we use 'find_all' to grab all of the h2 tags.

  1. find_all gets all of the h2 tags. In this example, we filtered by class using class_='title', which got rid of h2 tags that contained info we don't want. We made a variable out of this called 'headlines'.
  2. Then, list() was used to create an empty list, which we named linkList.
  3. Next, we create a 'for loop' to iterate over each item in the headlines list. Within the loop, it grabs each headline's title and link (same general structure as above).
  4. Finally, the loop uses the append command to add each link to linkList (see the sketch after this list).
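A sketch of that loop follows; the class filter 'title' matches the description above but may differ on the live page:

```python
# A sketch of the find_all version: gather every h2 headline and its link.
import requests
from bs4 import BeautifulSoup as bs

source = requests.get('https://www.npr.org/sections/politics/').text
soup = bs(source, 'lxml')

headlines = soup.find_all('h2', class_='title')  # all h2 tags with class="title"

linkList = list()                                # empty list to hold the links

for headline in headlines:
    title = headline.text                        # headline text
    link = headline.a['href']                    # link inside the h2's anchor tag
    linkList.append(link)                        # add each link to the list
    print(title, link)
```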

The 'with' statement opens a new CSV file. You'll need to modify the path to place the output CSV in your own file directory. The 'a' mode appends each new line created as a new row in the CSV. 'as f' assigns the open file to a variable called f.

The next line uses the csv.writer function to make a new variable, writer, that writes new rows to f, our output file.

writer.writerow is used to write data to the cells in a row. The values for each "column" go between the brackets, separated by commas, and each call to writerow writes one row. To write free text, just add single quotes around it.
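For example, a short sketch of the open/writer/writerow pattern (the file name is a placeholder, and the header row is just an illustration of writing free text):

```python
# A sketch of opening the output CSV in append mode and writing one row.
import csv

with open('npr_headlines.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    # Free text goes in quotes; here it is used to write a header row.
    writer.writerow(['headline', 'date_collected'])
```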

Finally, the loop runs through the same workflow as above: first it grabs each URL, then Beautiful Soup parses it, then we take the h1 tag from each page and write it to the CSV.

Bonus: I added a datetime function to timestamp each headline with a collected date, and wrote that as an additional column in the output CSV.

Note: Because the open command uses 'a' for append, running this again will just add new lines to your file. Using 'w' in place of 'a' will overwrite the file each time.
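Put together, a sketch of the whole collection loop might look like this (the file path is a placeholder, and datetime.date.today() stands in for whatever timestamp function you prefer):

```python
# A sketch of the full collection loop: visit each article link, grab the
# page's h1 headline, and write it to the CSV with a collection date.
import csv
import datetime
import requests
from bs4 import BeautifulSoup as bs

# Rebuild the link list (same steps as the earlier sketch).
source = requests.get('https://www.npr.org/sections/politics/').text
soup = bs(source, 'lxml')
linkList = [h2.a['href'] for h2 in soup.find_all('h2', class_='title')]

with open('npr_headlines.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for link in linkList:
        article = bs(requests.get(link).text, 'lxml')   # parse each article page
        headline = article.h1.text                      # headline from the h1 tag
        collected = datetime.date.today()               # collection date stamp
        writer.writerow([headline, collected])          # one row per article
```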

Bonus!

Import your CSV into a pandas DataFrame so you can take a look at it.

Type df into the next cell and view a pretty version of your data.
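A sketch of that step (the file path matches the placeholder used above, and pandas must also be installed):

```python
# A sketch of loading the output CSV into a pandas DataFrame.
# This assumes the header row written in the earlier sketch; otherwise pass
# names=['headline', 'date_collected'] to read_csv.
import pandas as pd

df = pd.read_csv('npr_headlines.csv')
df  # typing df in a notebook cell displays the table
```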

Ta da!!