Text mining is a useful methodology to explore large texts or collections of texts. This guide is in development -- and I'll continue to update it. Please reach out if you'd like help in this area.
Voyant is a web-based text reading and analysis environment. It has sample corpora and you can upload your own collection in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.
To try one sample for the purposes of this guide, we'll use the University of North Carolina library's North American Slave Narrative Collection.
Go to Voyant Tools
Upload your corpus. Navigate to the location you’ve saved the North American Slave Narrative Collection. For many of you, this will be your desktop. Go to na-slave-narratives > data > texts. Then select all texts. For many operating systems, you can select all texts by clicking ⌘ A on a Mac). For Windows 8 or 10, click “Edit” in the menu bar at the top of the window, and click “Select All” on the drop-down menu. Click “Open” on in the lower right-hand corner.
Analyze the visualizations
You should see three visualizations on your screen.
When you click on a word in the Cirrus word cloud, you'll then see that the graph to your right hand side changes to that specific word. You'll be able to see each individual text file in the "Corpus (Documents)" section and find the relative frequency of the word you've chosen in each of the documents. You'll see a keyword in context view on the bottom left hand size that tells you the words that come before and after your query. The reader view in the middle of your page offers a view of the full text of a given document.
Voyant has a lot of great documentation on its site. Also, please see this helpful tutorial by Miriam Posner.
Laurence Anthony’s Antconc is a freeware concordance program for Windows, Mac, and Linux. Download: http://www.laurenceanthony.net/software/antconc/. Antconc works on plain text files with the file appendix .txt. This section offers some basic information about the tool. If you're interested in using the tool for your research, please feel free to schedule an appointment with me and we can cover the ins and outs of the tool.
You’ll see 7 tabs across the top:
Concordance: This will show you what’s known as a Keyword in Context view (abbreviated KWIC, more on this in a minute), using the search bar below it.
Concordance Plot: This will show you a very simple visualization of your KWIC search, where each instance will be represented as a little black line from beginning to end of each file containing the search term.
File View: This will show you a full file view for larger context of a result.
Clusters: This view shows you words which very frequently appear together.
Collocates: Clusters show us words which _definitely _appear together in a corpus; collocates show words which are statistically likely to appear together.
Word list: All the words in your corpus.
Keyword List: This will show comparisons between two corpora.
Let’s get started.
Load files. To load one file for viewing, click “Open File.” To load a corpora of files, click “Open Dir.” For our purposes, click “Open Dir”
Navigate to our corpora on your desktop. After clicking the proper file, you should see the files loading into Antconc.
Search. In the search box, type the word “apple” to see how many times “apple” appears in the corpus and what words exist around it. Click “Start” when you’re ready to see this.
If you want to search for the singular and plural version of a word, such as “women” and “woman,” Antconc has “Wildcard settings” that allow for this.
Try typing wom?n into the search box.
Try typing m?n into the search box.
Why are there so many more instances of men than women? Take a look at the Concordance Plot Tool tab to see where results appear in target texts. Hover over one of the instances, and a hand will appear. Click the result to see how the “File View” or the word or phrase in context. Click on the “Clusters/N-Grams” tab and search for “wom?n” again. You’ll see each instance of the word “women” or “woman” in the context of the text.
1. Click on the tab for "Tool Preferences"
2. In the window that opens, click on the left Category sidebar on "Word List"
3. Check the button for "Use a stoplist below"
4. Click on "Open" and navigate to your stopwords file. A couple good stopwords lists that you can use are: NLTK List of Stopwords; Matthew Jockers Stoplist
5. Click "Apply"
Topic modeling is a good way to explore different topics within a large corpus. If, for example, you want to explore how words relate to each other, topic modeling is a good way to explore groups of words in a document. MALLET is the most commonly used and well respected resource for topic modeling. For an excellent step-by-step guide on how to install MALLET, please see this Programming Historian tutorial. Please reach out to me if you need assistance getting MALLET working. You may also want to use the MALLET Python wrapper if you prefer to work in Python and would like the option to visualize your models using frequency distribution and other options. As an example of what topics we might find, here is what you might see if you topic modeled documents about travel:
Economy growth money barter Europe France Italy train
These topics give a general idea of what is in the corpus. We know, for example, that there are a lot of documents in the corpus that discuss tourism, the economy, and Europe. If we change the number of topics in our topic model, we'll get different results. Here are a few ways to think about interpreting your topic model:
1. Vary the number of topics within a model using an experimental, iterative approach so that your topics make more sense
This Programming Historian tutorial explains how to create different numbers of topics in MALLET. If your corpus contains a large number of texts, it is generally a good idea to have a larger number of topics. If your corpus is smaller, you generally need a smaller number of topics. The reason for this is that topic models are like microscopes. The closer the magnification, the more detailed the topics become. If you allow greater distance between your microscope and your corpus, the more topics appear within your line of vision, so the topics are broader in scope as a result. Here is how to create a topic model and test out different numbers of topics.
Open your command line or terminal by clicking Start Menu -> All Programs -> Accessories -> Command Prompt. You want to make sure the command line shows your user home directory.
Figure 1: Terminal on an Apple computer
Add your own dataset to the MALLET "sample data" folder.
Tip: If you have a Mac, hold down the option, command, and c key to copy the full path.
Figure 3: The sample-data folder within a MALLET application.
Delete files in the “en” folder and copy and paste the files downloaded from the Documenting The American South initiative collection into the “en” folder. Your “en” folder should now have all the files in our data collection.
Figure 4: The web folder within the sample-data folder.
Open your command line again, type in the following code in your command line.
bin\mallet import-dir --input sample-data\web\en --output tutorialasi.mallet --keep-sequence --remove-stopwords
This creates our tutorialasi.mallet file which will help us run our topic models. Let’s input this file and create a topic model. ./bin/mallet train-topics --input tutorialasi.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorialasi_20_keys.txt --output-doc-topics tutorialasi_20_composition.txt
This code tells MALLET to create a topic model (train-topics) with twenty topics. Every other command that includes two dashes in front makes the command more specific.
Here's what the command does:
Open your tutorialasi_20_keys file. This file has the top keywords in each topic. Read each of the topics.
Figure X: Topic model of twenty topics of our dataset.
Following the general practice of the digital humanities, the right number of topics for a given topic model would be when each of the topics makes sense, and when each topic expresses its own unit of information (i.e. there are no overlapping topics or topics too general to make sense). Following this logic, how does our first topic model look?
Because the LDA algorithm can only understand what words co-occur, and not what makes sense to a human reader, these topics may not look like much yet. But we know a lot from our first topic model.
We know that, out of 151 files in a corpus of First Person Narratives of the American South, significant topics include:
It’s typical in digital humanities for the researcher to assign a label that describes the topic, which is what we have done above.
We also know that, within the First Person Narratives of the American South, oral dialect is important (i.e. topic 18).
We can also make some interesting observations. For example, it’s interesting that the word “blackford” is a significant enough word to appear in topic 13. Blackford is the name of a slave-owning family, and John Blackford published “Ferry Hill Plantation Journal: January 4, 1938-January 15, 1839,” a work that is part of the First-Persons corpus.
What we also know, however, is that some of our topics do not really make sense. For example, topic 3 is seemingly about family and friends, but also about religion:
3 0.46577 meeting friends life years family thomas mind society good attended father love spirit children divine religious god subject year house
It’s hard to assign a clear label to this topic yet. Topic 8, as well, seems a bit too broad to assign a label.
8 0.42425 state states court president party united north county judge general law congress government war governor carolina people public convention house
Because the number of our topics are a bit too broad to assign a clear label, let’s try creating a topic model with fewer topics. For example, let’s try ten topics this time.
4. Try fewer topics
Let’s make another topic model. Use the following command to create a new topic model with ten topics.
./bin/mallet train-topics --input tutorialasi.mallet --num-topics 10 --optimize-interval 10 --output-state topic-state.gz --output-topic-keys tutorialasi_10_keys.txt --output-doc-topics tutorialasi_10_composition.txt
Open your tutorialasi_10_keys.txt file.
Take a look at the topics. Can you get a sense of what your corpus is about from these topics? How does this model differ from the previous?
Here are some great reads to learn about topic modeling:
Nguyen, Dong, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters. “How We Do Things With Words: Analyzing Text as Social and Cultural Data.” Frontiers in Artificial Intelligence 3 (2020): 62. https://doi.org/10.3389/
Morrison, Romi Ron. “Voluptuous Disintegration: A Future History of Black Computational Thought.” Digital Humanities Quarterly 016, no. 3 (n.d.). http://www.
Brown University Library | Providence, RI 02912 | (401) 863-2165 | Contact | Comments | Library Feedback | Site Map