
A few useful text mining tools

About this guide

Text mining is a useful methodology for exploring large texts or collections of texts. This guide is in development, and I'll continue to update it. Please reach out if you'd like help in this area.

Tool 1: Voyant

Voyant is a web-based text reading and analysis environment. It has sample corpora and you can upload your own collection in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.

To try one sample for the purposes of this guide, we'll use the University of North Carolina library's North American Slave Narrative Collection.

1.     Go to Voyant Tools (https://voyant-tools.org/).

Click “upload.”

2.     Upload your corpus.

Navigate to the location where you’ve saved the North American Slave Narrative Collection. For many of you, this will be your desktop. Go to na-slave-narratives > data > texts. Then select all texts. On many operating systems, you can select all texts by pressing Ctrl+A (or Command+A on a Mac). On Windows 8 or 10, click “Edit” in the menu bar at the top of the window, and click “Select All” on the drop-down menu. Click “Open” in the lower right-hand corner.

3.     Analyze the visualizations.

You should see three visualizations on your screen.

  1. Cirrus: a word cloud that displays the highest frequency terms – the larger the term, the more frequent it is. Hover over a word to see additional information.
  2. Summary: Basic information about the text, including the number of words, the length of documents, vocabulary density, and distinctive words for each document.
  3. Corpus Reader: This allows you to read the text(s) in the collection – more text will appear as you scroll. You can hover over words to view their frequency and click to see more information.
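If you'd like a feel for the numbers behind Cirrus and Summary, here is a minimal Python sketch. It is not Voyant's code: the tokenizer, the `summarize` helper, and the sample sentence are my own assumptions, and "vocabulary density" is taken here to mean unique words divided by total words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (letters, optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def summarize(text, top_n=5):
    """Return token count, type count, vocabulary density, and the most frequent terms."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    # Assumption: vocabulary density = unique words / total words.
    density = len(counts) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "vocabulary_density": density,
        "top_terms": counts.most_common(top_n),
    }


# Hypothetical sample text, only for illustration.
sample = "The river was wide. The river was cold, and the night was long."
summary = summarize(sample, top_n=3)
print(summary)
```

The terms with the highest counts are the ones a word cloud would draw largest. A real corpus would also need a stopword list, or words like "the" will dominate.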

When you click on a word in the Cirrus word cloud, the graph on the right-hand side changes to show that specific word. You'll be able to see each individual text file in the "Corpus (Documents)" section and find the relative frequency of the word you've chosen in each of the documents. The keyword-in-context view on the bottom left-hand side shows the words that come before and after your query. The reader view in the middle of the page shows the full text of a given document.
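To make "relative frequency" concrete, here is a small Python sketch. It assumes relative frequency means the term's count divided by the total number of words in each document; Voyant may scale or tokenize differently. The `relative_frequency` function and the two tiny documents are hypothetical.

```python
import re


def relative_frequency(term, documents):
    """Map each document name to the share of its tokens that equal the term."""
    term = term.lower()
    result = {}
    for name, text in documents.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        # Guard against empty documents to avoid dividing by zero.
        result[name] = tokens.count(term) / len(tokens) if tokens else 0.0
    return result


# Hypothetical two-document corpus, only for illustration.
docs = {
    "a.txt": "Freedom was the word. Freedom was the hope.",
    "b.txt": "The road north was long.",
}
freqs = relative_frequency("freedom", docs)
for name, value in freqs.items():
    print(f"{name}: {value:.3f}")
```

Dividing by document length is what lets you compare a short narrative with a long one.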

 

Voyant has a lot of great documentation on its site. Also, please see this helpful tutorial by Miriam Posner.  

Tool 2: AntConc

Laurence Anthony’s AntConc is a freeware concordance program for Windows, Mac, and Linux. Download: http://www.laurenceanthony.net/software/antconc/. AntConc works on plain text files with the file extension .txt. This section offers some basic information about the tool. If you're interested in using the tool for your research, please feel free to schedule an appointment with me and we can cover the ins and outs of the tool.

You’ll see 7 tabs across the top:

Concordance: This will show you what’s known as a Keyword in Context view (abbreviated KWIC, more on this in a minute), using the search bar below it.

Concordance Plot: This will show you a very simple visualization of your KWIC search, where each instance will be represented as a little black line from beginning to end of each file containing the search term.

File View: This will show you a full file view for larger context of a result.

Clusters: This view shows you words which very frequently appear together.

Collocates: Clusters show us words which definitely appear together in a corpus; collocates show words which are statistically likely to appear together.

Word List: All the words in your corpus.

Keyword List: This will show comparisons between two corpora.
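If it helps to see what the Concordance and Clusters tabs are computing, here is a rough Python sketch of both ideas. This is not AntConc's implementation: the `kwic` and `clusters` functions, the window sizes, and the sample sentence are my own assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def clusters(tokens, keyword, size=2):
    """Count contiguous word sequences of the given size that start with keyword."""
    grams = Counter()
    for i in range(len(tokens) - size + 1):
        if tokens[i] == keyword:
            grams[" ".join(tokens[i:i + size])] += 1
    return grams


# Hypothetical sample text, only for illustration.
text = "The old woman spoke. The old man listened, and the young woman wrote."
tokens = tokenize(text)

for left, key, right in kwic(tokens, "woman", window=2):
    print(f"{left:>12} | {key} | {right}")

print(clusters(tokens, "the").most_common())
```

The KWIC lines line up the keyword in a center column, which is what makes patterns to its left and right easy to scan. The cluster counts only include words that sit directly next to each other, which is the difference from collocates.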

Let’s get started.

Load files. To load one file for viewing, click “Open File.” To load a corpus of files, click “Open Dir.” For our purposes, click “Open Dir.”

Navigate to our corpus on your desktop. After selecting the proper folder, you should see the files loading into AntConc.

Search. In the search box, type the word “apple” to see how many times “apple” appears in the corpus and what words exist around it. Click “Start” when you’re ready to see this.

If you want to search for the singular and plural forms of a word, such as “woman” and “women,” AntConc has “Wildcard settings” that allow for this.

Try typing wom?n into the search box.

Try typing m?n into the search box.
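The wildcard searches above can be approximated in Python with a regular expression. This sketch assumes "?" stands for exactly one letter; AntConc's own wildcard settings are configurable and may behave differently. The `wildcard_to_regex` helper and the sample sentence are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern like 'wom?n' into a whole-word regex; '?' matches one letter."""
    parts = ["[a-z]" if ch == "?" else re.escape(ch) for ch in pattern.lower()]
    # \b on both sides keeps 'm?n' from matching inside 'woman'.
    return re.compile(r"\b" + "".join(parts) + r"\b")


# Hypothetical sample text, only for illustration.
text = "The woman met two women and a man; the men waited."

wom = wildcard_to_regex("wom?n").findall(text.lower())
men = wildcard_to_regex("m?n").findall(text.lower())

print(wom)
print(men)
```

Note that a pattern like m?n would also match any other three-letter word of that shape in a real corpus, so it is worth skimming the hits before counting them.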

Why are there so many more instances of men than women? Take a look at the “Concordance Plot” tab to see where results appear in the target texts. Hover over one of the instances, and a hand will appear. Click the result to open the “File View” tab and see the word or phrase in context. Click on the “Clusters/N-Grams” tab and search for “wom?n” again. You’ll see the word clusters in which “women” or “woman” appears.

 

See any words you'd rather didn't appear in AntConc? You can configure AntConc to apply a stopwords file to your texts.

 

1. Click on the tab for "Tool Preferences" 

2. In the window that opens, click on the left Category sidebar on "Word List" 

3. Check the button for "Use a stoplist below" 

4. Click on "Open" and navigate to your stopwords file. A couple of good stopwords lists that you can use are the NLTK List of Stopwords and the Matthew Jockers Stoplist.

5. Click "Apply"
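For the curious, here is roughly what a stoplist does to a word list, sketched in Python. It assumes the stopwords file has one word per line; the `load_stopwords` and `word_list` helpers, the five-entry stoplist, and the sample sentence are hypothetical stand-ins, not AntConc's code or either of the lists named above.

```python
import re
from collections import Counter


def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def word_list(text, stopwords):
    """Count the words in text, skipping anything in the stopword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# With a real file you could pass an open file handle instead, for example:
#   with open("stopwords.txt", encoding="utf-8") as handle:  # hypothetical path
#       stopwords = load_stopwords(handle)
stopwords = load_stopwords(["the", "and", "was", "a", ""])

counts = word_list("The master was cruel and the work was hard.", stopwords)
print(counts.most_common())
```

Because the filter runs before counting, the stopwords never appear in the word list at all, which is why content words rise to the top.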

Resources on topic modeling

Topic modeling is a good way to explore the topics within a large corpus. It groups words that tend to occur together in the same documents, so it is useful if, for example, you want to explore how words relate to each other. MALLET is one of the most commonly used and well-respected tools for topic modeling. For an excellent step-by-step guide on how to install MALLET, please see this Programming Historian tutorial. Please schedule an appointment with me if you'd like help installing and using MALLET.

Resources on interpreting topic models:

Boyer, Ryan Culp. “A Human-in-the-Loop Methodology For Applying Topic Models to Identify Systems Thinking and to Perform Systems Analysis.” Master's thesis. University of Virginia, 2016. https://libra2.lib.virginia.edu/downloads/r207tp34z?filename=Boyer_Thesis_Dec2016.pdf.

Chang, Jonathan, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. “Reading Tea Leaves: How Humans Interpret Topic Models.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, 288–296. Curran Associates, Inc., 2009. http://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models.pdf.

Chuang, Jason, Sonal Gupta, Christopher D. Manning, and Jeffrey Heer. “Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment.” International Conference on Machine Learning (ICML), 2013. http://vis.stanford.edu/papers/topic-model-diagnostics.

Evans, Michael S. “A Computational Approach to Qualitative Analysis in Large Textual Datasets.” PLOS ONE 9, no. 2 (February 3, 2014): e87908. https://doi.org/10.1371/journal.pone.0087908.

Posner, Miriam. “Very Basic Strategies for Interpreting Results from the Topic Modeling Tool.” Miriam Posner’s Blog (blog), October 29, 2012. http://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/.

Veas, Eduardo, and Cecilia di Sciascio. “Interactive Topic Analysis with Visual Analytics and Recommender Systems.” Association for the Advancement of Artificial Intelligence, 2015. https://www.researchgate.net/publication/279285547_Interactive_Topic_Analysis_with_Visual_Analytics_and_Recommender_Systems.