LibGuides: Text Mining Resources: Gathering Text

Gathering and Locating Text

Though it may be tempting to mine them, most library databases available at the Library have licenses that do not allow text and/or data mining.

To help researchers and students identify alternate sources for news and other textual data, this page contains a growing list of free sources whose data can be gathered using APIs or web scraping.

The library databases that do allow for text and data mining are:

ProQuest Historical Newspapers
These historical newspapers offer full-text and full-image articles for newspapers dating back to the 19th century. Only Historical Newspapers allow for text and data mining. Use the list below to search each newspaper individually.
Brown University Library subscribes to:

JSTOR
Use of the JSTOR APIs is subject to Terms and Conditions. Please see the JSTOR website for more information.
HathiTrust
Two HTRC Derived Datasets are available in addition to a bibliographic API and a Data API.

ProQuest TDM Studio
Constellate

Web Scraping Toolkit

Do you need to gather an original corpus on the web for your research? Do you have little to no programming expertise? This toolkit is designed for you. Please reach out to us if you need any assistance with any of the workflows.

What is web scraping?

Web scraping is an automated process that creates an original dataset: a tool (software or a programming language) identifies components of a website and copies pieces of information into another file or organized structure for use in a variety of contexts. Web scraping is used when an API is not available, or when the API does not provide the information you need or does not provide it in a format you can work with.
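To make the definition concrete, here is a minimal sketch in Python using only the standard library. It assumes a hypothetical page in which each headline is a link inside an h2 element with the class "headline"; the sample HTML, the class name, and the function names are all invented for illustration. A real workflow would first download the page (and check the site's terms of use), and real pages usually need selectors adapted to their own markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline"><a href="/news/1">Library opens new archive</a></h2>
  <p>Unrelated paragraph.</p>
  <h2 class="headline"><a href="/news/2">Text mining workshop announced</a></h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect (title, link) pairs from links inside <h2 class="headline">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_headline = False
        self._in_link = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "headline":
            self._in_headline = True
        elif tag == "a" and self._in_headline:
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a headline link.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._in_link = False
        elif tag == "h2":
            self._in_headline = False


def scrape_headlines(html):
    """Identify the headline components of a page and return them as rows."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Copy the extracted rows into an organized structure (CSV text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "link"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_headlines(SAMPLE_HTML)
print(to_csv(rows))
```

The two functions mirror the two halves of the definition: scrape_headlines identifies components of the page, and to_csv copies them into a structured file format that a spreadsheet or analysis tool can open.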

Web Scraping Toolkit
by Ashley Champagne. Last updated Aug 17, 2025.

Government and Court Records

Caselaw Access Project
The Caselaw Access Project (CAP) includes all official, book-published United States case law through June 2018: every volume designated as an official report of decisions by a court within the United States.
Free Legal Information Sites: George Mason Law
Links to the best places to find free legal information online. Covers US federal, US state, and international sources. Includes links to numerous sources for case law, search engines, codes, legislative information, congressional records, journal articles, news, blogs, dictionaries, and encyclopedias.

ProQuest TDM Studio

ProQuest TDM Studio is a platform that allows you to text and data mine (in other words, gather and analyze large amounts of text) content from news, scholarly and other kinds of publications that Brown subscribes to via ProQuest.

You may find ProQuest TDM Studio useful if you'd like to:

Identify trends in a publication over time
Use data visualizations to represent texts
Gather a large "corpus" or collection of texts for text analysis, machine learning, etc.
Use a web-based interface to run Python and R code using these texts
Query, transform, and export text data to your computer
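
As an illustration of the first use above (identifying trends in a publication over time), the following Python sketch counts how often a term appears per year in a tiny, invented corpus. It does not use or represent any ProQuest TDM Studio interface; the corpus, the function name, and the search term are assumptions made for the example, and a real analysis would load texts exported from your own dataset.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus of (publication year, article text) pairs.
CORPUS = [
    (1918, "Influenza spreads as the war ends."),
    (1918, "City closes schools during influenza outbreak."),
    (1919, "Peace treaty signed; influenza cases decline."),
    (1920, "Election results announced."),
]


def term_counts_by_year(corpus, term):
    """Count whole-word, case-insensitive occurrences of `term` per year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for year, text in corpus:
        counts[year] += len(pattern.findall(text))
    # Sort by year so the result reads as a timeline.
    return dict(sorted(counts.items()))


trend = term_counts_by_year(CORPUS, "influenza")
for year, count in trend.items():
    # A crude text bar chart: one "#" per occurrence.
    print(year, "#" * count, count)
```

Raw counts like these are sensitive to how many articles each year contains, so a fuller analysis would usually normalize by the number of articles or words per year before comparing years.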

A screenshot of the ProQuest TDM Studio front page.