LibGuides: Digital Scholarship Resources: Text Mining

About this guide

Text mining is a useful methodology to explore large texts or collections of texts. This guide is in development -- and I'll continue to update it. Please reach out if you'd like help in this area.

Voyant

Voyant is a web-based text reading and analysis environment. It has sample corpora and you can upload your own collection in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.

To try one sample for the purposes of this guide, we'll use the University of North Carolina library's North American Slave Narrative Collection.

Go to Voyant Tools

Upload your corpus. Navigate to the location you’ve saved the North American Slave Narrative Collection. For many of you, this will be your desktop. Go to na-slave-narratives > data > texts. Then select all texts. For many operating systems, you can select all texts by clicking ⌘ A on a Mac). For Windows 8 or 10, click “Edit” in the menu bar at the top of the window, and click “Select All” on the drop-down menu. Click “Open” on in the lower right-hand corner.

Analyze the visualizations

You should see three visualizations on your screen.

Cirrus: a word cloud that displays the highest frequency terms – the larger the term, the more frequent it is. Hover over a word to see additional information.
Summary: Basic information about the text, including the number of words, the length of documents, vocabulary density, and distinctive words for each document.
Corpus Reader: This allows you to read the text(s) in the collection – more text will appear as you scroll. You can hover over words to view their frequency and click to see more information.

When you click on a word in the Cirrus word cloud, you'll then see that the graph to your right hand side changes to that specific word. You'll be able to see each individual text file in the "Corpus (Documents)" section and find the relative frequency of the word you've chosen in each of the documents. You'll see a keyword in context view on the bottom left hand size that tells you the words that come before and after your query. The reader view in the middle of your page offers a view of the full text of a given document.

Voyant has a lot of great documentation on its site. Also, please see this helpful tutorial by Miriam Posner.

AntConc

Laurence Anthony’s Antconc is a freeware concordance program for Windows, Mac, and Linux. Download: http://www.laurenceanthony.net/software/antconc/. Antconc works on plain text files with the file appendix .txt. This section offers some basic information about the tool. If you're interested in using the tool for your research, please feel free to schedule an appointment with me and we can cover the ins and outs of the tool.

You’ll see 7 tabs across the top:

Concordance: This will show you what’s known as a Keyword in Context view (abbreviated KWIC, more on this in a minute), using the search bar below it.

Concordance Plot: This will show you a very simple visualization of your KWIC search, where each instance will be represented as a little black line from beginning to end of each file containing the search term.

File View: This will show you a full file view for larger context of a result.

Clusters: This view shows you words which very frequently appear together.

Collocates: Clusters show us words which _definitely _appear together in a corpus; collocates show words which are statistically likely to appear together.

Word list: All the words in your corpus.

Keyword List: This will show comparisons between two corpora.

Let’s get started.

Load files. To load one file for viewing, click “Open File.” To load a corpora of files, click “Open Dir.” For our purposes, click “Open Dir”

Navigate to our corpora on your desktop. After clicking the proper file, you should see the files loading into Antconc.

Search. In the search box, type the word “apple” to see how many times “apple” appears in the corpus and what words exist around it. Click “Start” when you’re ready to see this.

If you want to search for the singular and plural version of a word, such as “women” and “woman,” Antconc has “Wildcard settings” that allow for this.

Try typing wom?n into the search box.

Try typing m?n into the search box.

Why are there so many more instances of men than women? Take a look at the Concordance Plot Tool tab to see where results appear in target texts. Hover over one of the instances, and a hand will appear. Click the result to see how the “File View” or the word or phrase in context. Click on the “Clusters/N-Grams” tab and search for “wom?n” again. You’ll see each instance of the word “women” or “woman” in the context of the text.

See any words you'd rather didn't appear in AntConc? Configure Antconc to apply to texts a stopwords file

1. Click on the tab for "Tool Preferences"

2. In the window that opens, click on the left Category sidebar on "Word List"

3. Check the button for "Use a stoplist below"

4. Click on "Open" and navigate to your stopwords file. A couple good stopwords lists that you can use are: NLTK List of Stopwords; Matthew Jockers Stoplist

5. Click "Apply"

Topic Modeling

Topic modeling is a good way to explore different topics within a large corpus. If, for example, you want to explore how words relate to each other, topic modeling is a good way to explore groups of words in a document. MALLET is the most commonly used and well respected resource for topic modeling. For an excellent step-by-step guide on how to install MALLET, please see this Programming Historian tutorial. Please reach out to me if you need assistance getting MALLET working. You may also want to use the MALLET Python wrapper if you prefer to work in Python and would like the option to visualize your models using frequency distribution and other options. As an example of what topics we might find, here is what you might see if you topic modeled documents about travel:

Economy growth money barter Europe France Italy train

These topics give a general idea of what is in the corpus. We know, for example, that there are a lot of documents in the corpus that discuss tourism, the economy, and Europe. If we change the number of topics in our topic model, we'll get different results. Here are a few ways to think about interpreting your topic model:

1. Vary the number of topics within a model using an experimental, iterative approach so that your topics make more sense

This Programming Historian tutorial explains how to create different numbers of topics in MALLET. If your corpus contains a large number of texts, it is generally a good idea to have a larger number of topics. If your corpus is smaller, you generally need a smaller number of topics. The reason for this is that topic models are like microscopes. The closer the magnification, the more detailed the topics become. If you allow greater distance between your microscope and your corpus, the more topics appear within your line of vision, so the topics are broader in scope as a result. Here is how to create a topic model and test out different numbers of topics.

Open your command line or terminal by clicking Start Menu -> All Programs -> Accessories -> Command Prompt. You want to make sure the command line shows your user home directory.

Figure 1: Terminal on an Apple computer

Type cd (That is: cd-space-period-period) to change directory. Keep doing this until you’re at your user home directory as in Figure 1.

Then type cd mallet and press “tab” after you type in mallet. This will give you the full name of the mallet folder you downloaded. You can also type the full name of the mallet folder (e.g. “mallet-2.0.8”) so that your computer can find MALLET properly.

Figure 2: Terminal after changing directory to mallet and pressing tab for the full Mallet folder name.

2. Import your data

Add your own dataset to the MALLET "sample data" folder.

Your MALLET application comes with sample data. Open the sample data folder to navigate to your Mallet folder. Open the “en” folder within the “web” folder of the sample data folder. The full file path on my computer is here: /Users/ashleychampagne/mallet-2.0.8/sample-data/web/en.

Tip: If you have a Mac, hold down the option, command, and c key to copy the full path.

Figure 3: The sample-data folder within a MALLET application.

Delete files in the “en” folder and copy and paste the files downloaded from the Documenting The American South initiative collection into the “en” folder. Your “en” folder should now have all the files in our data collection.

Figure 4: The web folder within the sample-data folder.

3. Create your first first topic model

Open your command line again, type in the following code in your command line.

bin\mallet import-dir --input sample-data\web\en --output tutorialasi.mallet --keep-sequence --remove-stopwords

This creates our tutorialasi.mallet file which will help us run our topic models. Let’s input this file and create a topic model. ./bin/mallet train-topics --input tutorialasi.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorialasi_20_keys.txt --output-doc-topics tutorialasi_20_composition.txt

This code tells MALLET to create a topic model (train-topics) with twenty topics. Every other command that includes two dashes in front makes the command more specific.

Here's what the command does:

opens your tutorialasi.mallet file
trains MALLET to find 20 topics
outputs every word in your corpus of materials and the topic it belongs to into a compressed file (.gz; see www.gzip.org on how to unzip this)
outputs a text document showing you what the top key words are for each topic (tutorialasi_20_keys.txt)
Optimize-interval gives you an indication of the weight of a topic. The higher the number, the more significant that topic is within the entire topic model. It outputs a text file indicating the breakdown, by percentage, of each topic within each original text file you imported (tutorialasi_composition.txt). (You can see the full range of parameters by typing bin\mallet train-topics –help at the prompt.

Open your tutorialasi_20_keys file. This file has the top keywords in each topic. Read each of the topics.

Figure X: Topic model of twenty topics of our dataset.

Following the general practice of the digital humanities, the right number of topics for a given topic model would be when each of the topics makes sense, and when each topic expresses its own unit of information (i.e. there are no overlapping topics or topics too general to make sense). Following this logic, how does our first topic model look?

Because the LDA algorithm can only understand what words co-occur, and not what makes sense to a human reader, these topics may not look like much yet. But we know a lot from our first topic model.

We know that, out of 151 files in a corpus of First Person Narratives of the American South, significant topics include:

Slavery (i.e. topics 4, 9, 19)
Armed Forces (i.e. topics 6, 7, 12, 14).
Religion (i.e. topic 10)
Love (i.e. topic 1)

It’s typical in digital humanities for the researcher to assign a label that describes the topic, which is what we have done above.

We also know that, within the First Person Narratives of the American South, oral dialect is important (i.e. topic 18).

We can also make some interesting observations. For example, it’s interesting that the word “blackford” is a significant enough word to appear in topic 13. Blackford is the name of a slave-owning family, and John Blackford published “Ferry Hill Plantation Journal: January 4, 1938-January 15, 1839,” a work that is part of the First-Persons corpus.

What we also know, however, is that some of our topics do not really make sense. For example, topic 3 is seemingly about family and friends, but also about religion:

3 0.46577 meeting friends life years family thomas mind society good attended father love spirit children divine religious god subject year house

It’s hard to assign a clear label to this topic yet. Topic 8, as well, seems a bit too broad to assign a label.

8 0.42425 state states court president party united north county judge general law congress government war governor carolina people public convention house

Because the number of our topics are a bit too broad to assign a clear label, let’s try creating a topic model with fewer topics. For example, let’s try ten topics this time.

4. Try fewer topics

Let’s make another topic model. Use the following command to create a new topic model with ten topics.

./bin/mallet train-topics --input tutorialasi.mallet --num-topics 10 --optimize-interval 10 --output-state topic-state.gz --output-topic-keys tutorialasi_10_keys.txt --output-doc-topics tutorialasi_10_composition.txt

Open your tutorialasi_10_keys.txt file.

Take a look at the topics. Can you get a sense of what your corpus is about from these topics? How does this model differ from the previous?

Resources on topic modeling

Here are some great reads to learn about topic modeling:

Boyer, Ryan Culp Boyer. “A Human-in-the-Loop Methodology For Applying Topic Models to Identify Systems Thinking and to Perform Systems Analysis.” Masters thesis. University of Virginia, 2016.
Chang, Jonathan, Sean Gerrish, Chong Wang, Jordan L. Boyd-graber, and David M. Blei. “Reading Tea Leaves: How Humans Interpret Topic Models.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, 288–296. Cxurran Associates, Inc., 2009.
Chuang, Jason, Sonal Gupta, Christopher D. Manning, Jeffrey Heer. “Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment.” International Conference on Machine Learning (ICML), 2013.
Evans, Michael S. “A Computational Approach to Qualitative Analysis in Large Textual Datasets.” PLOS ONE 9, no. 2 (February 3, 2014): e87908.
Posner, Miriam. “Very Basic Strategies for Interpreting Results from the Topic Modeling Tool.” Miriam Posner’s Blog (blog), October 29, 2012.
Underwood, Ted. "Topic Modeling Made Just Simple Enough." (April, 2012)
Veas, Edurardo, and Cecilia di Sciascio. “Interactive Topic Analysis with Visual Analytics and Recommender Systems.” Association for the Advancement of Artificial Intelligence, 2015.
Tilton, Lauren, David Mimno, and Jessica Marie Johnson. “What Gets Counted Computational Humanities under Revision.” In Computational Humanities (Debates in DH). Minneapolis: University of Minnesota Press, 2024.
Nguyen, Dong, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters. “How We Do Things With Words: Analyzing Text as Social and Cultural Data.” Frontiers in Artificial Intelligence 3 (2020): 62. https://doi.org/10.3389/frai.2020.00062.
Morrison, Romi Ron. “Voluptuous Disintegration: A Future History of Black Computational Thought.” Digital Humanities Quarterly 016, no. 3 (n.d.). http://www.digitalhumanities.org/dhq/vol/16/3/000634/000634.html.