Web Scraping Toolkit

This is the web scraping toolkit for the Brown University Library. This guide is currently in progress; we appreciate your patience.

Introduction

Do you need to gather an original corpus on the web for your research? Do you have little to no programming expertise? This toolkit is designed for you. Please reach out to us if you need assistance with any of the workflows.

What is web scraping?

Web scraping refers to an automated process that creates an original dataset: a tool (software or a programming language) identifies components of a website and copies pieces of information into another file or organized structure for use in a variety of contexts. Web scraping is used when an API is not available, or when the API does not provide the information you need in a format you can work with.
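As a concrete illustration, here is a minimal sketch of that process using only Python's standard library: it fetches a page, identifies one kind of component (here, <h1> headings), and copies the extracted text into a file. The URL is a placeholder, not a site you should actually scrape.

    from html.parser import HTMLParser
    from urllib.request import urlopen

    # Identify one component of the page: the text inside <h1> tags.
    class HeadingParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_h1 = False
            self.headings = []

        def handle_starttag(self, tag, attrs):
            if tag == "h1":
                self.in_h1 = True

        def handle_endtag(self, tag):
            if tag == "h1":
                self.in_h1 = False

        def handle_data(self, data):
            if self.in_h1:
                self.headings.append(data.strip())

    # Placeholder URL: substitute a page you are permitted to scrape.
    html = urlopen("https://example.com").read().decode("utf-8")
    parser = HeadingParser()
    parser.feed(html)

    # Copy the extracted pieces into another file for later use.
    with open("headings.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(parser.headings))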

A few things to know before you start.

  1. There may be an easier, better way to get the data you need. See "Open Access Resources for Text Mining" below.
  2. If you can't otherwise get the online data you need, web scraping is probably the right answer. Web scraping is legal in most cases, with a few caveats:
    • you can fairly and legally scrape any data that is publicly available and not copyrighted
    • commercial use of scraped data is still limited
    • you cannot scrape sites that require authentication (e.g. the Brown University Library catalog or Facebook)
  3. Even if you're scraping a site that is legal to scrape, abide by the site's rules (e.g. see the "scraping webpages" workflow in Beautiful Soup). One such rule is the site's robots.txt file; a minimal check is sketched below.
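Most sites publish a robots.txt file listing paths that automated tools should not fetch. As a minimal sketch using only Python's standard library (the URLs below are placeholders):

    from urllib import robotparser

    # Placeholder URLs: substitute the site you actually plan to scrape.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # "*" means "any crawler"; check a specific page before fetching it.
    if rp.can_fetch("*", "https://example.com/some/page.html"):
        print("robots.txt allows fetching this page")
    else:
        print("robots.txt asks crawlers not to fetch this page")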

Key Audience

This toolkit is designed primarily for students in the humanities and social sciences who have little or no programming background but need to gather an original corpus from the web. The web scraping workflows, however, can be used by anyone, regardless of programming expertise or disciplinary focus.

Resources for web scraping

Each entry below lists the objective, the knowledge required, the resource used, and the associated workflow or tutorial.

Objective: Scrape historical tweets outside the Twitter API window (i.e., tweets older than the roughly 7 days the standard API covers).
Example: you want to gather tweets with the hashtag #taco from 2018-2019.
Knowledge required: Some Python
Resource: Twint (a Python library)
Workflow or Tutorial: Tutorial to scrape tweets outside the Twitter API
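If you already have some Python, Twint runs as a library. A minimal sketch for the #taco example above (the date range and output filename are illustrative, and Twint's options may change between versions; see the linked tutorial for details):

    import twint  # pip install twint

    # Configure a historical search well outside the standard API window.
    c = twint.Config()
    c.Search = "#taco"
    c.Since = "2018-01-01"
    c.Until = "2019-12-31"
    c.Store_csv = True
    c.Output = "taco_tweets.csv"

    twint.run.Search(c)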

Objective: Scrape tweets within the Twitter API window (roughly the last 7 days, or collected going forward).
Example: you want to gather tweets going forward on #COVID19.
Knowledge required: No programming required
Resource: TAGS (Twitter Archiving Google Sheet)
Workflow or Tutorial: Tutorial to scrape tweets with the standard API
Objective: Scrape webpages
Knowledge required: Python
Resource: Beautiful Soup (a Python library)
Workflow or Tutorial: Tutorial to scrape webpages with Beautiful Soup
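For instance, a minimal Beautiful Soup sketch that lists every link on a page (the URL is a placeholder, and both libraries must be installed first):

    import requests                # pip install requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # Placeholder URL: substitute a page you are permitted to scrape.
    response = requests.get("https://example.com")
    soup = BeautifulSoup(response.text, "html.parser")

    # find_all() returns every matching element; here, all <a> links.
    for link in soup.find_all("a"):
        print(link.get_text(strip=True), link.get("href"))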

Objective: Scrape newspaper articles using an API (New York Times)
Knowledge required: Some Python
Resource: Python (the tutorial uses the Beautiful Soup library), Import-io, or ParseHub
Workflow or Tutorial: Tutorial to scrape NYT newspaper articles
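The linked tutorial covers the details; as a rough sketch of what an Article Search API call looks like in Python (YOUR_API_KEY is a placeholder for a key from developer.nytimes.com, and the query term is illustrative):

    import requests  # pip install requests

    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    params = {"q": "tacos", "api-key": "YOUR_API_KEY"}  # placeholder key

    response = requests.get(url, params=params)
    # Each "doc" in the response describes one article.
    for doc in response.json()["response"]["docs"]:
        print(doc["headline"]["main"])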

Objective: Scrape newspaper articles using an API (Guardian)
Knowledge required: No programming required
Resource: OpenRefine, Import-io, or ParseHub
Workflow or Tutorial: Tutorial to scrape Guardian newspaper articles
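Although the linked workflow requires no programming, the Guardian's Open Platform API can also be queried directly if you prefer Python. A hedged sketch (the key and query term are placeholders):

    import requests  # pip install requests

    url = "https://content.guardianapis.com/search"
    params = {"q": "tacos", "api-key": "YOUR_API_KEY"}  # placeholder key

    response = requests.get(url, params=params)
    # Each "result" in the response describes one article.
    for result in response.json()["response"]["results"]:
        print(result["webTitle"])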

Objective: Organize full text in a CSV into individual plain text files for text mining (this workflow is often used in tandem with others)
Knowledge required: Willingness to use the terminal
Resource: Unix shell, Excel
Workflow or Tutorial: How to turn full text into individual plain text files
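The linked workflow uses the Unix shell and Excel, but the same transformation can be sketched in Python; the filename and column names below ("id" and "full_text") are assumptions about your CSV's layout:

    import csv

    # Assumed layout: each row holds one document, with an "id" column
    # and a "full_text" column holding the document's text.
    with open("articles.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            with open(f"{row['id']}.txt", "w", encoding="utf-8") as out:
                out.write(row["full_text"])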
Objective: Use the Internet Archive API to gather millions of resources
Knowledge required: Willingness to use the terminal
Resource: GNU Wget
Workflow or Tutorial: Tutorial to use the Internet Archive API
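The linked tutorial drives GNU Wget from the terminal; as a rough Python equivalent for listing item identifiers (the collection name below is a placeholder), the Archive's advanced search API can be queried directly:

    import requests  # pip install requests

    url = "https://archive.org/advancedsearch.php"
    params = {
        "q": "collection:opensource",  # placeholder collection
        "fl[]": "identifier",          # return only item identifiers
        "output": "json",
        "rows": 50,
    }

    response = requests.get(url, params=params)
    for doc in response.json()["response"]["docs"]:
        print(doc["identifier"])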

Open Access Resources for Text Mining

One of the first steps in deciding whether you need to scrape is understanding the datasets that already exist. Below are some existing, ready-to-use datasets for text mining or data visualization.