As a disclaimer: The first two paragraphs explore the nature of the word dendrogram and the names of things associated with a dendrogram. Please feel free to read this article in its entirety if you enjoy factoids. But my feelings will not be hurt if you jump down to the last three paragraphs to see how Lexos can be used on many levels and what I found when using it specifically to build a dendrogram. Lastly, I would like to say thank you to Wheaton College for making all of this possible.
Statistical overachievers might delight at the alliteration achieved in the title of this article and the possibility of some anagrammed value in the word dendrogram (which is not recognized by my recently updated word processor dictionary). But what is a dendrogram? If you are an English major, you will jump to the online site for all things wordy: The Oxford English Dictionary (OED), where dendrogram is defined as: “A branched diagram representing the apparent similarity or relationship between taxa, esp. on the basis of their observed overall similarity rather than on their phylogeny.” This helps a little. I now understand it to be a diagram showing some sort of relationship between things. I kind of got that from the ‘gram’ in dendrogram. But now what is phylogeny? Well, I jumped back into the OED and found my newest definition to be even more helpful (note sarcasm). Phylogeny is defined as: “A diagram or theoretical model of the sequence of evolutionary divergence of species or other groups of organisms from their common ancestors.” Taking that idea and transferring it to words gives me the thought that it possibly means the evolution of a word or a divergence in the usage of a word. Luckily, I looked up taxa too: “A taxonomic group or unit, esp. when its rank in the taxonomic hierarchy is not specified.” The OED didn’t quite fail me, but it wasn’t as helpful as I had hoped.
Luckily, I was provided a little information before diving headfirst into the dendrogram waters, so I understood what a dendrogram was before I decided to show off my lofty OED skills. Otherwise, I would have had to spend a few more hours searching for exactly what a dendrogram does. Out of curiosity, I did look through some other websites, and heavy linguistics seems prevalent in any discussion of a dendrogram. My simple breakdown of what a dendrogram can do is: compare relationships (in our case, word usage) and put those relationships in a simple diagram to show how closely one text is related to another. The parts of a dendrogram are named with attributes of a tree, from leaves to stand-alone branches. There is no limit to how many leaves can be in a dendrogram; it just depends on how adept a user is when setting the parameters of their dendrogram. Pairs of leaves are called clades and each pairing of clades is also named a clade, while single leaves are named simplicifolious (which means single-leafed). Now that you know some words describing a dendrogram, I have made it easier to understand by letting you peek at a picture just below.
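For the curious, here is a rough sketch in Python of the idea described above: turn each text into word-usage proportions, measure how far apart the texts are, and keep pairing the closest ones into clades. This is only an illustration and not necessarily how Lexos does it; the distance measure (Euclidean), the pairing rule (average linkage), the function names, and the four toy "texts" are all my own assumptions.

```python
from collections import Counter
from itertools import combinations
import math
import re

def word_proportions(text):
    """Return each word's share of the total word count in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def distance(p, q):
    """Euclidean distance between two word-proportion profiles."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))

def leaves(tree):
    """List the leaf labels under a node (a label, or a pair of nodes)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def build_dendrogram(texts):
    """Repeatedly merge the two closest clusters (average linkage)."""
    profiles = {name: word_proportions(t) for name, t in texts.items()}
    clusters = list(profiles)  # every text starts out as a single leaf

    def avg_dist(a, b):
        # Average distance over every leaf-to-leaf pairing of two clusters.
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: avg_dist(*ab))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append((a, b))  # the merged pair is a new clade
    return clusters[0]

# Hypothetical toy texts, invented purely for illustration.
texts = {
    "river_1": "the raft drifted down the river and the boy watched the river",
    "river_2": "the boy and the raft went down the river by night",
    "sea_1": "a whale rose from the sea and the ship chased the whale",
    "sea_2": "the ship sailed the sea hunting a whale for the captain",
}
tree = build_dendrogram(texts)
print(tree)
```

The result is a nested pair of pairs, which is really just a dendrogram written sideways: each inner pair is a clade. With these toy sentences, the two river texts pair off and the two sea texts pair off before the two clades finally join. Real texts are far longer, and choices such as which words to count and which distance to use can change the picture considerably.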
Wheaton College has dedicated a site to the Lexos tool: http://lexos.wheatoncollege.edu/upload. There is also a site dedicated to explaining many of the functions and understanding the dendrogram: http://wheatoncollege.edu/lexomics/educational-material/. Wheaton College explains what Lexos is better than I could: “Lexos is an integrated workflow of tools to facilitate the computational analyses of texts.” I won’t go into all the tools that are provided by Lexos, and I honestly couldn’t tell you about most of them. I understand that Lexos can do many things, and I have only dipped my toes in. Without a small tutorial in a classroom, I would have found Lexos very cumbersome to navigate, almost impossible. Wheaton College does have some tutorials, but they explain what Lexos can do more than they show a beginning user where to get started. But I’m just learning to swim in this very large pond of wondrous potential; others might be able to wade right in without fear. I can at least see it as a powerful tool for analyzing texts of all sizes. It can do a word cloud, much like the Wordle website, and it can do word bubbles, or BubbleViz, as it is named. This is pretty slick, but there are many settings that alter the text and could give a user some very errant data on word usage. Still, it was fun to play around with.
Let us clear the table and get at the meat of this project: the dendrogram. I picked a multitude of books from authors living and writing from the mid-1800s to the early 1900s. I chose five books by Mark Twain: The Adventures of Huckleberry Finn, The Adventures of Tom Sawyer, The Innocents Abroad, Roughing It, and The Tragedy of Pudd’nhead Wilson. I chose two books by Alexandre Dumas: The Three Musketeers and The Count of Monte Cristo. The other three books were all by different authors: Kate Chopin’s The Awakening and Selected Short Fiction, Herman Melville’s Moby Dick, and Nathaniel Hawthorne’s The Scarlet Letter. I chose the Dumas books to see if they would match (out of pure curiosity). I am a huge fan of Mark Twain, so I couldn’t help picking more of his books; the chance to see whether the man with the finest ear for written regional dialects would show a wide variance within a dendrogram was too good to pass up. Kate Chopin’s book had a similar dialectal pattern, so I wanted to see where she fell when compared to Mark Twain. Melville and Hawthorne were great friends at one point and did what amounted to modern-day workshops on their writings, so I was again curious to see if they would be similar in word usage, even though Moby Dick is several times the length of The Scarlet Letter.
I won’t go into the how-to of getting a dendrogram. But I think it is at least noteworthy to talk about the results. I’ll abbreviate the titles to make this brief. Huck Finn and Awakening are simplicifolious (out there on their own), just as I thought they would be. Dumas is by himself, and Twain’s Pudd’nhead and Tom Sawyer form a clade of Mississippi River dialect. Twain’s other two books were written mainly in his own voice, and it makes sense that they form their own clade as well. I am surprised and delighted to see Moby Dick and The Scarlet Letter in their own clade too; it may not prove I was right, but it gives me a warm and fuzzy feeling that two friends at least paired up in this dendrogram. This was a fun experiment that immediately showed a usefulness that could lead to some fruitful research later on down the road. I have attached a picture of this amazing dendrogram (clicking on the picture enlarges it for a much better view). For me, this shows how much fun can be had while still advancing academia, even if it is a slow and careful descent into the Lexomic waters.