Do you need to gather an original corpus from the web for your research? Do you have little to no programming expertise? This toolkit is designed for you. Please reach out to us if you need assistance with any of the workflows.
What is web scraping?
Web scraping refers to an automated process that creates an original dataset by identifying components of a website and copying pieces of information, using a tool (software or a programming language), into another file or organized structure for use in a variety of contexts. Web scraping is used when an API is not available, or when the API does not provide the information you need, or does not provide it in a format you can work with.
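To make that definition concrete, here is a minimal sketch of what scraping looks like in Python using the requests and Beautiful Soup libraries (Beautiful Soup comes up again in the workflows below). The URL, the CSS selector, and the output filename are all placeholders, not a real target site.

```python
# A minimal scraping sketch: fetch a page, identify the components you
# want, and copy them into an organized structure (here, a CSV file).
# The URL and the CSS selector are placeholders -- swap in the site and
# elements you actually care about.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"  # hypothetical page to scrape
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull out headline links (placeholder selector) as title/url pairs.
rows = []
for link in soup.select("h2 a"):
    rows.append({"title": link.get_text(strip=True), "url": link.get("href")})

# Save the collected pieces to a CSV for later analysis.
with open("corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```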
A few things to know before you start.
This toolkit is designed primarily for students in the humanities and social sciences who have little or no programming background but who need to gather an original corpus from the web. The web scraping workflows, however, can be used by anyone, regardless of programming expertise or disciplinary focus.
Objective | Knowledge required | Resource | Workflow or Tutorial |
---|---|---|---|
Scrape historical tweets outside the Twitter API window (older than 7 days). Example: you want to gather tweets with the hashtag #taco from 2018-2019. | Some Python | Twint (a Python library) | Tutorial to scrape tweets outside the Twitter API (sketch below) |
Scrape tweets within the Twitter API window (the past 7 days, or going forward). Example: you want to gather tweets going forward on #COVID19. | No programming required | TAGS (Twitter Archiving Google Sheet) | Tutorial to scrape tweets with the standard API |
Scrape webpages | Some Python | Beautiful Soup (a Python library) | Tutorial to scrape webpages with Beautiful Soup (see the sketch above) |
Scrape newspapers using an API - New York Times | Some Python (this tutorial uses the Beautiful Soup library) | Python, Import.io, or ParseHub | Tutorial to scrape NYT newspaper articles |
Scrape newspapers using an API - Guardian | No programming required | OpenRefine, Import.io, or ParseHub | Tutorial to scrape Guardian newspaper articles (sketch below) |
Organize full text in a CSV into plain text files for text mining (this workflow is often used in tandem with others) | Willingness to use the terminal | Unix shell, Excel | How to turn full text into individual plain text files (sketch below) |
Use the Internet Archive API to gather millions of resources | Willingness to use the terminal | GNU Wget | Tutorial to use the Internet Archive API (sketch below) |
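If you follow the Twint tutorial from the first row of the table, the configuration looks roughly like this. The hashtag, date range, and output filename are example values only.

```python
# A rough sketch of a Twint search for historical tweets, along the
# lines of the first workflow in the table above. The hashtag, dates,
# and output filename are example values.
import twint

config = twint.Config()
config.Search = "#taco"            # hashtag to collect
config.Since = "2018-01-01"        # start of the historical window
config.Until = "2019-12-31"        # end of the historical window
config.Store_csv = True            # write results to a CSV file
config.Output = "taco_tweets.csv"  # output file

twint.run.Search(config)
```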
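The Guardian tutorial in the table uses no-code tools (OpenRefine, Import.io, or ParseHub), but the same articles are available from the Guardian's Open Platform API if you are comfortable with a little Python. This sketch assumes you have signed up for a free API key; the query term is only an example.

```python
# A rough sketch of querying the Guardian's Open Platform API
# (https://open-platform.theguardian.com/) with Python, as an
# alternative to the no-code tools in the table. The API key is a
# placeholder and "climate change" is an example query.
import requests

API_KEY = "YOUR-GUARDIAN-API-KEY"  # placeholder: sign up for a free key

resp = requests.get(
    "https://content.guardianapis.com/search",
    params={"q": "climate change", "api-key": API_KEY, "page-size": 20},
)
resp.raise_for_status()

# Print the title and URL of each matching article.
for item in resp.json()["response"]["results"]:
    print(item["webTitle"], "->", item["webUrl"])
```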
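The full-text workflow in the table uses the Unix shell and Excel; as a rough Python alternative, the sketch below splits a CSV into one plain text file per row. The input filename and the column names ("id" and "full_text") are assumptions, so adjust them to match your own data.

```python
# A rough Python alternative to the shell-based workflow in the table:
# turn each row of a CSV into its own plain text file for text mining.
# Assumes a file "articles.csv" with "id" and "full_text" columns; both
# names are placeholders.
import csv
from pathlib import Path

out_dir = Path("corpus")
out_dir.mkdir(exist_ok=True)

with open("articles.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # One .txt file per row, named after the row's id column.
        (out_dir / f"{row['id']}.txt").write_text(row["full_text"], encoding="utf-8")
```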
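The Internet Archive tutorial drives the API with GNU Wget from the terminal. For reference, this sketch performs the equivalent first step in Python against the archive.org advanced-search endpoint: it gathers the identifiers of items in a collection, which can then be downloaded. The collection name is only an example.

```python
# A sketch of the first step of the Internet Archive workflow in Python
# rather than GNU Wget: list item identifiers in a collection via the
# archive.org advanced-search endpoint. The collection name is an example.
import requests

resp = requests.get(
    "https://archive.org/advancedsearch.php",
    params={
        "q": "collection:bplscas",  # example collection
        "fl[]": "identifier",       # only return item identifiers
        "rows": 50,
        "output": "json",
    },
)
resp.raise_for_status()

# Each identifier can then be fetched from archive.org/download/<identifier>.
for doc in resp.json()["response"]["docs"]:
    print(doc["identifier"])
```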
One of the first steps in deciding whether you need to scrape is understanding the datasets that already exist. Here are some existing, ready-to-deploy datasets for text mining or data visualization.