
Web Scraping Toolkit

This is the web scraping toolkit for the Brown University Library

Introduction

Do you need to gather an original corpus on the web for your research? Do you have little to no programming expertise? This toolkit is designed for you. Please reach out to us if you need any assistance with any of the workflows. 

What is web scraping?

Web scraping refers to an automated process that creates an original dataset: a tool (software or a programming language) identifies components of a website and copies pieces of information into another file or organized structure for use in a variety of contexts. Web scraping is used when an API is not available, or when the API does not provide the information you need in a format you can work with.

A few things to know before you start

  1. There may be an easier, better way to get the data you need. See "Open Access Resources for Text Mining" below.
  2. If you can't get the online data you need any other way, web scraping is probably the right answer. Web scraping is legal in most cases:
    • You can fairly and legally scrape any data that is publicly available and not copyrighted.
    • Commercial use of scraped data is still limited.
    • You cannot scrape sites that require authentication (e.g. the Brown University Library catalog or Facebook).
  3. Even if you're scraping a site that is legal to scrape, abide by its rules (e.g. see the "scraping webpages" workflow in Beautiful Soup).
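Many sites publish their crawling rules in a robots.txt file at the site root. Python's standard library can check those rules before you scrape. This is a minimal sketch; the policy text and URLs below are hypothetical stand-ins, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt policy. Real sites serve theirs at
# https://<site>/robots.txt
policy = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(policy.splitlines())

# Check whether a generic crawler ("*") may fetch each URL.
print(parser.can_fetch("*", "https://example.com/articles/1"))  # True
print(parser.can_fetch("*", "https://example.com/private/1"))   # False
```

For a live site, you would instead call parser.set_url() with the site's robots.txt address and then parser.read() before checking URLs.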

Key Audience

This toolkit is designed primarily for students in the humanities and social sciences who are not familiar with programming or have only a minimal background in programming but who need to gather an original corpus from the web. The web scraping workflows, however, could be used by anyone regardless of programming expertise or disciplinary focus. 

Resources for web scraping

Each entry below lists the objective, the knowledge required, the resource used, and a link to the workflow or tutorial.

Objective: Scrape historical tweets outside of the Twitter API window (more than 7 days in the past).

Example: you want to gather tweets with the hashtag #taco from 2018-2019.

Knowledge required: Some Python
Resource: Twint (a Python library)
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/Twitter-with-Twint-Workflow.md

Objective: Scrape tweets within the Twitter API window (the past 7 days, or going forward).

Example: you want to gather tweets going forward on #COVID19.

Knowledge required: No programming required
Resource: TAGS (Twitter Archiving Google Sheet)
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/Twitter-API-Workflow.md

Objective: Scrape webpages.

Knowledge required: Python
Resource: Beautiful Soup (a Python library)
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/Beautiful-Soup-Workflow.md
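The Beautiful Soup workflow parses a page's HTML and copies out the pieces you want. As a sketch of the idea (the HTML, tag names, and class names here are stand-ins, not taken from the workflow), extracting a list of headlines looks like this:

```python
from bs4 import BeautifulSoup  # install with: pip install beautifulsoup4

# In a real workflow the HTML would come from a downloaded page;
# this inline snippet keeps the example self-contained.
html = """
<html><body>
  <h2 class="headline">First story</h2>
  <h2 class="headline">Second story</h2>
  <p>Unrelated text</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns every matching tag; get_text() strips the markup.
headlines = [tag.get_text() for tag in soup.find_all("h2", class_="headline")]
print(headlines)  # ['First story', 'Second story']
```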

Objective: Scrape newspapers using an API (New York Times).

Knowledge required: Some Python
Resource: Python, Import.io or ParseHub
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/NYT-Collection-Workflow.md

Objective: Scrape newspapers using an API (the Guardian).

Knowledge required: No programming required
Resource: OpenRefine, Import.io or ParseHub
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/Guardian-Newspaper-Workflow.md

Objective: Organize full text in a CSV into plain text files for text mining (this workflow is often used in tandem with others).

Knowledge required: Willingness to use the terminal
Resource: Unix shell, Excel
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/Spreadsheet-Splitting-Workflow.md

Objective: Use the Internet Archive API to gather millions of resources.

Knowledge required: Willingness to use the terminal
Resource: GNU Wget
Workflow or Tutorial: https://github.com/ashleychampagne/Web-Scraping-Toolkit/blob/master/Internet-Archive-API-Workflow.md
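The spreadsheet-splitting workflow turns one CSV of full texts into one plain text file per row, which most text-mining tools expect as input. A minimal Python sketch of that idea (the column name "text", the file names, and the folder name are assumptions for illustration, not taken from the workflow):

```python
import csv
from pathlib import Path

def split_csv_to_texts(csv_path, out_dir, text_column="text"):
    """Write the text column of each CSV row to its own .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader maps each row to its column headers.
        for i, row in enumerate(csv.DictReader(f), start=1):
            (out / f"document_{i:04d}.txt").write_text(
                row[text_column], encoding="utf-8"
            )

# Example: build a tiny two-row CSV, then split it.
sample = Path("corpus.csv")
sample.write_text('title,text\nA,"First article"\nB,"Second article"\n',
                  encoding="utf-8")
split_csv_to_texts(sample, "corpus_txt")
print(sorted(p.name for p in Path("corpus_txt").iterdir()))
# ['document_0001.txt', 'document_0002.txt']
```

The workflow itself uses the Unix shell and Excel; this sketch only shows the same CSV-to-files transformation in Python.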

Open Access Resources for Text Mining

One of the first steps in deciding whether you need to scrape is understanding the datasets that already exist. Here are some ready-made datasets for text mining or data visualization.