LibGuides: Digital Scholarship Resources: Web Scraping

Introduction

Do you need to gather an original corpus on the web for your research? Do you have little to no programming expertise? This toolkit is designed for you. Please reach out to us if you need any assistance with any of the workflows.

What is web scraping?

Web scraping is an automated process that creates an original dataset by identifying components of a website and copying pieces of information, using a tool (software or a programming language), into another file or organized structure for use in a variety of contexts. Web scraping is used when an API is not available, or when the API does not provide the information you need or does not provide it in a format that you can work with.
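
As a rough illustration of the process described above, the following minimal Python sketch identifies one kind of component in a page (headings marked as titles) and copies them into a CSV structure. The sample HTML, the class name "title", and the function names are assumptions made up for this example; the tutorials in this guide use tools such as Beautiful Soup rather than this bare standard-library approach.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a website.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First article</h2>
  <h2 class="title">Second article</h2>
  <p>Some other text we do not need.</p>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Start recording when we enter a heading marked as a title.
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Text is only kept while we are inside a title heading.
        if self.in_title:
            self.titles[-1] += data


def scrape_titles(html):
    """Return the list of title strings found in the HTML."""
    parser = TitleCollector()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


def titles_to_csv(titles):
    """Organize the scraped titles as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["position", "title"])
    for position, title in enumerate(titles, start=1):
        writer.writerow([position, title])
    return buffer.getvalue()


titles = scrape_titles(SAMPLE_HTML)
print(titles_to_csv(titles))
```

The same two steps, finding the components and then writing them to an organized file, apply whichever tool you choose.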

A few things to know before you start.

There may be an easier, better way to get the data you need. See "Open Access Resources for Text Mining."
If you can't get the online data you need any other way, web scraping is probably the right answer. Web scraping is legal in most cases, with some limits:
- you can fairly and legally scrape any data that is publicly available and not copyrighted
- commercial use of scraped data is still limited
- you cannot scrape sites that require authentication (e.g. the Brown University Library catalog or Facebook)
Even if you're scraping a site that is legal to scrape, abide by the site's rules (e.g. see the "scraping webpages" workflow in Beautiful Soup).
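
One common place where a site states its rules for automated visitors is a robots.txt file. The sketch below assumes that this is the kind of rule meant here, and shows how Python's standard urllib.robotparser module can check whether a page may be fetched. The robots.txt content, the example.org URLs, and the user agent name "my-research-bot" are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)
USER_AGENT = "my-research-bot"  # hypothetical name for your scraper

# Ask whether this user agent may fetch each URL under the sample rules.
allowed_public = rules.can_fetch(USER_AGENT, "https://example.org/articles/1")
allowed_private = rules.can_fetch(USER_AGENT, "https://example.org/private/data")

# Seconds the site asks crawlers to wait between requests, if it says so.
delay = rules.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

For a live site, the parser's set_url() and read() methods download the file instead of parsing a string. A robots.txt check does not replace reading the site's terms of use.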

Key Audience

This toolkit is designed primarily for students in the humanities and social sciences who are not familiar with programming or have only a minimal background in programming but who need to gather an original corpus from the web. The web scraping workflows, however, could be used by anyone regardless of programming expertise or disciplinary focus.

Resources for web scraping

Objective	Knowledge required	Resource	Workflow or Tutorial
Scrape historical tweets outside of the Twitter API window (the window covers the past 7 days, or tweets going forward). Example: you want to gather tweets with the hashtag #taco from 2018-2019.	Some Python	Twint (a Python library)	Tutorial to scrape tweets outside the Twitter API
Scrape tweets within the Twitter API window (the past 7 days, or tweets going forward). Example: you want to gather tweets going forward on #COVID19.	No programming required	TAGS (Twitter Archiving Google Sheet)	Tutorial to scrape tweets with the standard API
Scrape webpages	Python	Beautiful Soup Python Library	Tutorial to scrape webpages with Beautiful Soup
Scrape newspapers using an API - New York Times	Some Python. This tutorial uses the Beautiful Soup library.	Python, Import-io or ParseHub	Tutorial to scrape NYT newspaper articles
Scrape newspapers using an API - Guardian	No programming required	OpenRefine, Import-io or ParseHub	Tutorial to scrape Guardian newspaper articles
Organize full text in a CSV into plain text files for text mining (this workflow is often used in tandem with others)	Willingness to use the terminal	Unix shell, Excel	How to turn full text into individual plain text files
Use the Internet Archive API to gather millions of resources	Willingness to use the terminal	GNU Wget	Tutorial to use the Internet Archive API
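
The workflow in the table that turns full text in a CSV into individual plain text files uses the Unix shell and Excel. For readers who prefer Python, the same idea can be sketched as follows. This is an alternative illustration, not the linked tutorial's method; the column names "id" and "text", the sample rows, and the function name are assumptions.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical CSV content: one row per document, with an id and its full text.
SAMPLE_CSV = "id,text\narticle1,First full text.\narticle2,Second full text.\n"


def csv_to_text_files(csv_text, out_dir, id_column="id", text_column="text"):
    """Write each row's text to <out_dir>/<id>.txt and return the file names."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = out_path / f"{row[id_column]}.txt"
        target.write_text(row[text_column], encoding="utf-8")
        written.append(target.name)
    return written


# Demonstrate in a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    names = csv_to_text_files(SAMPLE_CSV, tmp)
    first_text = (Path(tmp) / names[0]).read_text(encoding="utf-8")

print(names)
print(first_text)
```

To use it on your own data, read the CSV file's contents into a string and pass a folder of your choice as out_dir. The id values must be safe to use as file names.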

Open Access Resources for Text Mining

One of the first steps in deciding whether you need to scrape is understanding which datasets already exist. Here are some existing, ready-to-use datasets for text mining or data visualization.

Digital Scholarship Resources

Director, Center for Digital Scholarship

Emily Ferrier
