Contact Phil White with questions: philip.white@colorado.edu
Dependencies: if you're using Python, you will need to ensure that requests, bs4 (BeautifulSoup), lxml, and pandas are installed; csv and datetime come with the standard library. The easiest way to install these packages is to open your command prompt and run 'pip install requests', 'pip install beautifulsoup4' (the bs4 package is published under this name), 'pip install lxml', and 'pip install pandas'.
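The install step above can also be done in one line (this assumes pip is on your PATH; note that the bs4 library is published on PyPI as beautifulsoup4):

```shell
pip install requests beautifulsoup4 lxml pandas
```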
Just press Shift+Enter in each code block (cell) to execute it.
#This cell imports requests, bs4, csv, datetime, and pandas
from bs4 import BeautifulSoup as bs #used to parse html
import requests #used to call urls
import csv #reads and writes csvs
from datetime import datetime #makes a timestamp
import pandas as pd #used to read the output csv into a dataframe at the end
Use requests.get() to call a URL, and append .text to the command to bring in the HTML of the page.
bs(source, 'lxml') tells BeautifulSoup to parse the HTML using the lxml parser.
Identify the tags you want to extract data from. It is helpful to go to the page and use right-click > Inspect Element to figure out where the item you want is nested in the HTML structure.
In this example, soup.h2.text grabs all of the text from between the h2 tags.
soup.h2.a['href'] grabs the hyperlink within an anchor tag.
We then create variables out of each line of code, naming them 'source' (the raw HTML), 'soup' (the parsed document that BeautifulSoup can navigate), 'headline' (the text of the first h2 tag), and 'link' (the first link within the first h2 tag).
Finally, we print them both out.
Note that in this example, we're only grabbing the first h2 and a tag from the HTML document.
source = requests.get('https://www.npr.org/sections/politics/').text #get the html
soup = bs(source, 'lxml') #make it a list that BeautifulSoup can interpret
headline = soup.h2.text #grab the text within the first h2 tag
link = soup.h2.a['href'] #grab the link within the first h2 tag
#print them both
print(headline)
print(link)
EXCLUSIVE: Governors have questions about Afghan refugees. Here's who they call https://www.npr.org/2021/10/08/1043662124/exclusive-governors-have-questions-about-afghan-refugees-heres-who-they-call
headlines = soup.find_all('h2', class_ = 'title') #grabs each h2 tag classed as 'title'
linkList = list() #creates an empty list

#this loop iterates over each h2 in the headlines list, plucks the headline text and link, then appends each link to linkList
for items in headlines:
    titles = items.text #the headline text
    links = items.a['href'] #the link within the h2 tag
    #print(titles)
    linkList.append(links)
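If you want to see how find_all filters by class without hitting NPR's servers, here is a minimal sketch using a made-up HTML snippet (the URLs and demo_ variable names are hypothetical, and html.parser is used here so lxml isn't required):

```python
from bs4 import BeautifulSoup as bs

#A small in-memory snippet (made-up data) showing the same find_all pattern
#without a network call; variables are prefixed demo_ so they don't
#overwrite the ones created above
demo_html = """
<h2 class="title"><a href="https://example.com/a">First story</a></h2>
<h2 class="title"><a href="https://example.com/b">Second story</a></h2>
<h2 class="other"><a href="https://example.com/c">Skipped story</a></h2>
"""
demo_soup = bs(demo_html, 'html.parser')

demo_links = list()
for items in demo_soup.find_all('h2', class_ = 'title'):
    demo_links.append(items.a['href'])

print(demo_links) #only the two h2 tags classed 'title' are collected
```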
The 'with' command opens a new csv. You'll need to modify the path to place the output csv into your own file directory. 'a' appends each new line created to a new row in the csv, and 'as f' creates a variable out of the open file.
The next line uses the csv.writer function to make a new variable that writes rows to f, our output file.
writer.writerow is used to write data to the cells in a row. "Columns" in the row go between the brackets, and each column is separated by a comma. To write free text, just add single quotes around it.
Finally, the loop runs through the same workflow as above: first it grabs the URL, then bs4 parses it, then we take the h1 tag from each of the pages and write it to the csv.
Bonus: I added in a datetime function to time stamp each headline with a collected date, and wrote that as an additional column in the output csv.
Note: because the open command uses 'a' for append, if you run this again it will just add new lines to your file. 'w' in place of 'a' will rewrite it each time.
with open('output.csv', 'a', newline = '', encoding = 'utf-8') as f: #opens an output csv
    writer = csv.writer(f) #creates a writing function
    writer.writerow(['Article Title', 'Date Collected']) #writes headers into the first row of our output csv
    for links in linkList: #iterates through each link in the linkList created above
        sources = requests.get(links).text #grabs the html behind each link
        soups = bs(sources, 'lxml') #BeautifulSoup reads each one
        articleTitles = soups.h1.text #grabs h1 text for each page
        today = str(datetime.today()) #creates a time stamp for each run through this iterator
        writer.writerow([articleTitles, today]) #writes the h1 text for each page to a new row and adds a timestamp on each
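To see the 'a' versus 'w' difference from the note above in isolation, here is a small hypothetical sketch that writes to a temporary file (the demo_ names and example row are made up, so it won't touch your real output.csv):

```python
import csv, os, tempfile

#'w' starts the file fresh; 'a' adds to the end of whatever is already there
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

with open(demo_path, 'w', newline = '', encoding = 'utf-8') as demo_f: #'w' rewrites the file
    csv.writer(demo_f).writerow(['Article Title', 'Date Collected'])

with open(demo_path, 'a', newline = '', encoding = 'utf-8') as demo_f: #'a' appends a new row
    csv.writer(demo_f).writerow(['Example headline', '2019-02-12'])

with open(demo_path, newline = '', encoding = 'utf-8') as demo_f:
    demo_rows = list(csv.reader(demo_f))

print(demo_rows) #the header row plus the appended row
```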
Import your csv to a Pandas dataframe so you can take a look at it.
df = pd.read_csv('output.csv', encoding = 'utf-8') #tell pandas to read a csv
df
| | Article Title | Date Collected |
|---|---|---|
| 0 | Trump Says He's Not 'Happy' With Budget Deal B... | 2019-02-12 12:55:07.943443 |
| 1 | Trump's 'Socialism' Attack On Democrats Has It... | 2019-02-12 12:55:08.303651 |
| 2 | Former Attorney General Eric Holder Close To 2... | 2019-02-12 12:55:08.650661 |
| 3 | Trump Took Fight For Border Wall To El Paso —... | 2019-02-12 12:55:08.823678 |
| 4 | Trump Supporter Violently Shoves BBC Cameraman... | 2019-02-12 12:55:09.096291 |
| 5 | 'Agreement In Principle' Reached On Border Sec... | 2019-02-12 12:55:09.417298 |
| 6 | If Trump Declares An Emergency To Build The Wa... | 2019-02-12 12:55:09.764906 |
| 7 | Rep. Ilhan Omar Apologizes 'Unequivocally' For... | 2019-02-12 12:55:09.952110 |
| 8 | Days From Another Shutdown, Here's What The Ne... | 2019-02-12 12:55:10.233329 |
| 9 | ICE Detention Beds New Stumbling Block In Effo... | 2019-02-12 12:55:10.576536 |
| 10 | Parkland Family Reflects On A Year Of Anguish ... | 2019-02-12 12:55:10.797342 |
| 11 | Border Security Funding Talks Stalled, Governm... | 2019-02-12 12:55:11.062547 |
| 12 | GOP Rep. Walter Jones, Who Spent Years Seeking... | 2019-02-12 12:55:11.377954 |
| 13 | Virginia State Leaders Hold On Tight To Office... | 2019-02-12 12:55:11.786964 |
| 14 | Minnesota Sen. Amy Klobuchar Launches 2020 Pre... | 2019-02-12 12:55:12.098970 |
| 15 | 'Watch What We're Doing': Could Maryland Gov. ... | 2019-02-12 12:55:12.413377 |
| 16 | Acting Attorney General Says He Hasn't Discuss... | 2019-02-12 12:55:12.711783 |
| 17 | Democratic Governors Pitch Pragmatism On Sidel... | 2019-02-12 12:55:13.038791 |
| 18 | Va. Lt. Gov. Fairfax Asks For FBI Investigatio... | 2019-02-12 12:55:13.369799 |
| 19 | Amid Blackface Backlash, Ralph Northam Tells S... | 2019-02-12 12:55:13.682805 |
| 20 | Ahead Of 2020 Election, Voting Rights Becomes ... | 2019-02-12 12:55:14.105414 |
| 21 | Former Rep. John Dingell Left An Enduring Heal... | 2019-02-12 12:55:14.514424 |
| 22 | Despite Few Details And Much Doubt, The Green ... | 2019-02-12 12:55:14.811829 |
| 23 | Former Rep. John Dingell, Longest-Serving Memb... | 2019-02-12 12:55:15.095036 |
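If you want pandas to treat the Date Collected column as real timestamps rather than strings, pd.to_datetime can convert it after loading. Here is a sketch using a made-up one-row frame (the demo_ names and example values are hypothetical stand-ins for the real output.csv):

```python
import pandas as pd

#Made-up one-row frame standing in for the csv loaded above
demo_df = pd.DataFrame({
    'Article Title': ['Example headline'],
    'Date Collected': ['2019-02-12 12:55:07.943443'],
})
demo_df['Date Collected'] = pd.to_datetime(demo_df['Date Collected']) #strings -> datetime64
print(demo_df['Date Collected'].dt.date.iloc[0]) #just the date part
```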