Contact Phil White with questions: philip.white@colorado.edu
Dependencies: if you're using Python, you will need to ensure that requests, bs4 (BeautifulSoup), lxml, and pandas are installed; csv and datetime come with the standard library. The easiest way to install these packages is to open your command prompt and run 'pip install requests', 'pip install beautifulsoup4' (the bs4 package is published under this name), 'pip install lxml', and 'pip install pandas'.
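The install step above can also be done in one line (this assumes pip is on your PATH; note that the bs4 library is published on PyPI as beautifulsoup4):

```shell
pip install requests beautifulsoup4 lxml pandas
```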
Just press Shift+Enter in each code block (cell) to execute it.
#This cell imports requests, bs4, csv, datetime, and pandas
from bs4 import BeautifulSoup as bs #used to parse html
import requests #used to call urls
import csv #reads and writes csvs
from datetime import datetime #makes a timestamp
import pandas as pd #used to read the output csv into a dataframe at the end
Use requests.get() to call a URL, and append .text to the command to bring in the HTML of the page.
bs(source, 'lxml') tells BeautifulSoup to parse the HTML using the lxml parser.
Identify the tags you want to extract data from. It is helpful to go to the page and use right-click > Inspect Element to figure out where the item you want is nested in the HTML structure.
In this example, soup.h2.text grabs all of the text from between the h2 tags.
soup.h2.a['href'] grabs the hyperlink within an anchor tag.
We then create variables out of each line of code, naming them 'source' (the raw HTML), 'soup' (the parsed document that BeautifulSoup can navigate), 'headline' (the text of the first h2 tag), and 'link' (the first link within the first h2 tag).
Finally, we print them both out.
Note that in this example, we're only grabbing the first h2 and a tag from the HTML document.
source = requests.get('https://www.npr.org/sections/politics/').text #get the html
soup = bs(source, 'lxml') #make it a list that BeautifulSoup can interpret
headline = soup.h2.text #grab the text within the first h2 tag
link = soup.h2.a['href'] #grab the link within the first h2 tag
#print them both
print(headline)
print(link)
EXCLUSIVE: Governors have questions about Afghan refugees. Here's who they call https://www.npr.org/2021/10/08/1043662124/exclusive-governors-have-questions-about-afghan-refugees-heres-who-they-call
headlines = soup.find_all('h2', class_ = 'title') #grabs each h2 tag classed as 'title'
linkList = list() #creates an empty list

#this loop iterates over each h2 in the headlines list, plucks the headline text and link, then appends each link to linkList
for items in headlines:
    titles = items.text #the headline text
    links = items.a['href'] #the link within the h2 tag
    #print(titles)
    linkList.append(links)
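If you want to see how find_all filters by class without hitting NPR's servers, here is a minimal sketch using a made-up HTML snippet (the URLs and demo_ variable names are hypothetical, and html.parser is used here so lxml isn't required):

```python
from bs4 import BeautifulSoup as bs

#A small in-memory snippet (made-up data) showing the same find_all pattern
#without a network call; variables are prefixed demo_ so they don't
#overwrite the ones created above
demo_html = """
<h2 class="title"><a href="https://example.com/a">First story</a></h2>
<h2 class="title"><a href="https://example.com/b">Second story</a></h2>
<h2 class="other"><a href="https://example.com/c">Skipped story</a></h2>
"""
demo_soup = bs(demo_html, 'html.parser')

demo_links = list()
for items in demo_soup.find_all('h2', class_ = 'title'):
    demo_links.append(items.a['href'])

print(demo_links) #only the two h2 tags classed 'title' are collected
```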
The 'with' command opens a new csv. You'll need to modify the path to place the output csv into your own file directory. 'a' appends each new line created to a new row in the csv, and 'as f' creates a variable out of the open file.
The next line uses the csv.writer function to make a new variable that writes rows to f, our output file.
writer.writerow is used to write data to the cells in a row. "Columns" in the row go between the brackets, and each column is separated by a comma. To write free text, just add single quotes around it.
Finally, the loop runs through the same workflow as above: first it grabs the URL, then bs4 parses it, then we take the h1 tag from each of the pages and write it to the csv.
Bonus: I added in a datetime function to time stamp each headline with a collected date, and wrote that as an additional column in the output csv.
Note: because the open command uses 'a' for append, if you run this again it will just add new lines to your file. 'w' in place of 'a' will rewrite it each time.
with open('output.csv', 'a', newline = '', encoding = 'utf-8') as f: #opens an output csv
    writer = csv.writer(f) #creates a writing function
    writer.writerow(['Article Title', 'Date Collected']) #writes headers into the first row of our output csv
    for links in linkList: #iterates through each link in the linkList created above
        sources = requests.get(links).text #grabs the html behind each link
        soups = bs(sources, 'lxml') #BeautifulSoup reads each one
        articleTitles = soups.h1.text #grabs h1 text for each page
        today = str(datetime.today()) #creates a time stamp for each run through this iterator
        writer.writerow([articleTitles, today]) #writes the h1 text for each page to a new row and adds a timestamp on each
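To see the 'a' versus 'w' difference from the note above in isolation, here is a small hypothetical sketch that writes to a temporary file (the demo_ names and example row are made up, so it won't touch your real output.csv):

```python
import csv, os, tempfile

#'w' starts the file fresh; 'a' adds to the end of whatever is already there
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

with open(demo_path, 'w', newline = '', encoding = 'utf-8') as demo_f: #'w' rewrites the file
    csv.writer(demo_f).writerow(['Article Title', 'Date Collected'])

with open(demo_path, 'a', newline = '', encoding = 'utf-8') as demo_f: #'a' appends a new row
    csv.writer(demo_f).writerow(['Example headline', '2019-02-12'])

with open(demo_path, newline = '', encoding = 'utf-8') as demo_f:
    demo_rows = list(csv.reader(demo_f))

print(demo_rows) #the header row plus the appended row
```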
Import your csv to a Pandas dataframe so you can take a look at it.
df = pd.read_csv('output.csv', encoding = 'utf-8') #tell pandas to read a csv
df
| | Article Title | Date Collected |
|---|---|---|
| 0 | Trump Says He's Not 'Happy' With Budget Deal B... | 2019-02-12 12:55:07.943443 |
| 1 | Trump's 'Socialism' Attack On Democrats Has It... | 2019-02-12 12:55:08.303651 |
| 2 | Former Attorney General Eric Holder Close To 2... | 2019-02-12 12:55:08.650661 |
| 3 | Trump Took Fight For Border Wall To El Paso —... | 2019-02-12 12:55:08.823678 |
| 4 | Trump Supporter Violently Shoves BBC Cameraman... | 2019-02-12 12:55:09.096291 |
| 5 | 'Agreement In Principle' Reached On Border Sec... | 2019-02-12 12:55:09.417298 |
| 6 | If Trump Declares An Emergency To Build The Wa... | 2019-02-12 12:55:09.764906 |
| 7 | Rep. Ilhan Omar Apologizes 'Unequivocally' For... | 2019-02-12 12:55:09.952110 |
| 8 | Days From Another Shutdown, Here's What The Ne... | 2019-02-12 12:55:10.233329 |
| 9 | ICE Detention Beds New Stumbling Block In Effo... | 2019-02-12 12:55:10.576536 |
| 10 | Parkland Family Reflects On A Year Of Anguish ... | 2019-02-12 12:55:10.797342 |
| 11 | Border Security Funding Talks Stalled, Governm... | 2019-02-12 12:55:11.062547 |
| 12 | GOP Rep. Walter Jones, Who Spent Years Seeking... | 2019-02-12 12:55:11.377954 |
| 13 | Virginia State Leaders Hold On Tight To Office... | 2019-02-12 12:55:11.786964 |
| 14 | Minnesota Sen. Amy Klobuchar Launches 2020 Pre... | 2019-02-12 12:55:12.098970 |
| 15 | 'Watch What We're Doing': Could Maryland Gov. ... | 2019-02-12 12:55:12.413377 |
| 16 | Acting Attorney General Says He Hasn't Discuss... | 2019-02-12 12:55:12.711783 |
| 17 | Democratic Governors Pitch Pragmatism On Sidel... | 2019-02-12 12:55:13.038791 |
| 18 | Va. Lt. Gov. Fairfax Asks For FBI Investigatio... | 2019-02-12 12:55:13.369799 |
| 19 | Amid Blackface Backlash, Ralph Northam Tells S... | 2019-02-12 12:55:13.682805 |
| 20 | Ahead Of 2020 Election, Voting Rights Becomes ... | 2019-02-12 12:55:14.105414 |
| 21 | Former Rep. John Dingell Left An Enduring Heal... | 2019-02-12 12:55:14.514424 |
| 22 | Despite Few Details And Much Doubt, The Green ... | 2019-02-12 12:55:14.811829 |
| 23 | Former Rep. John Dingell, Longest-Serving Memb... | 2019-02-12 12:55:15.095036 |
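If you want pandas to treat the Date Collected column as real timestamps rather than strings, pd.to_datetime can convert it after loading. Here is a sketch using a made-up one-row frame (the demo_ names and example values are hypothetical stand-ins for the real output.csv):

```python
import pandas as pd

#Made-up one-row frame standing in for the csv loaded above
demo_df = pd.DataFrame({
    'Article Title': ['Example headline'],
    'Date Collected': ['2019-02-12 12:55:07.943443'],
})
demo_df['Date Collected'] = pd.to_datetime(demo_df['Date Collected']) #strings -> datetime64
print(demo_df['Date Collected'].dt.date.iloc[0]) #just the date part
```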