Feb 22
2016
11:25 PM

Dreaded Dendrogram Dreams

As a disclaimer: the first two paragraphs explore the word dendrogram and the names of its parts. Please feel free to read this article in its entirety if you enjoy factoids, but my feelings will not be hurt if you jump down to the last three paragraphs to see how Lexos can be used on many levels and what I found when using it specifically to build a dendrogram. Lastly, thank you to Wheaton College for making all of this possible.

Statistical overachievers might delight in the alliteration achieved in the title of this article and the possibility of some anagrammed value in the word dendrogram (which is not recognized by my recently updated word-processor dictionary). But what is a dendrogram? If you are an English major, you will jump to the online site for all things wordy, the Oxford English Dictionary (OED), where dendrogram is defined as: “A branched diagram representing the apparent similarity or relationship between taxa, esp. on the basis of their observed overall similarity rather than on their phylogeny.” This helps a little. I now understand it to be a diagram showing some sort of relationship between things; I kind of got that from the ‘gram’ in dendrogram. But now what is phylogeny? I jumped back into the OED and found my newest definition to be even more helpful (note sarcasm). Phylogeny is defined as: “A diagram or theoretical model of the sequence of evolutionary divergence of species or other groups of organisms from their common ancestors.” Transferring that idea to words suggests something like the evolution of a word, or a divergence in how a word is used. Luckily, I looked up taxa too: “A taxonomic group or unit, esp. when its rank in the taxonomic hierarchy is not specified.” The OED didn’t quite fail me, but it wasn’t as helpful as I had hoped.

Luckily, I was given a little information before diving headfirst into the dendrogram waters, so I understood what a dendrogram was before I decided to show off my lofty OED skills. Otherwise, I would have had to spend a few more hours searching for exactly what a dendrogram does. Out of curiosity, I did look through some other websites, and heavy linguistics jargon seems to be prevalent in any discussion of a dendrogram. My simple breakdown of what a dendrogram does: it compares relationships (in our case, word usage) and puts those relationships into a simple diagram that shows how closely one text is related to another. The parts of a dendrogram are named after the parts of a tree, from leaves to stand-alone branches. There is no limit to how many leaves a dendrogram can have; it just depends on how adept the user is at setting the parameters. A pair of leaves is called a clade, each pairing of clades is also a clade, and a single leaf on its own is called simplicifolious (which means single-leafed). Now that you know some of the words describing a dendrogram, I have made it easier to understand by letting you peek at a picture just below.

[Image: Clades]

Wheaton College has dedicated a site to the Lexos tool: http://lexos.wheatoncollege.edu/upload. There is also a site dedicated to explaining many of its functions and to understanding the dendrogram: http://wheatoncollege.edu/lexomics/educational-material/. Wheaton College explains what Lexos is better than I could: “Lexos is an integrated workflow of tools to facilitate the computational analyses of texts.” I won’t go into all the tools Lexos provides, and I honestly couldn’t tell you about most of them; I understand that Lexos can do many things, and I have only dipped my toes in. Without a small tutorial in a classroom, I would have found Lexos very cumbersome to navigate, almost impossible. Wheaton College does have some tutorials, but they explain what Lexos can do more than they show a beginning user where to get started. But I’m just learning to swim in this very large pond of wondrous potential; others might be able to wade right in without fear. I can at least see that it is a powerful tool for analyzing texts of all sizes. It can make a word cloud, much like the Wordle website, and it can make word bubbles, or BubbleViz, as the feature is named. This is pretty slick, but there are many variables that change the text and could give a user some very errant data on word usage. Still, it was fun to play around with.

Let us clear the table and get at the meat of this project: the dendrogram. I picked a multitude of books from authors living and writing from the mid-1800s to the early 1900s. I chose five books by Mark Twain: The Adventures of Huckleberry Finn, The Adventures of Tom Sawyer, The Innocents Abroad, Roughing It, and The Tragedy of Pudd’nhead Wilson. I chose two books by Alexandre Dumas: The Three Musketeers and The Count of Monte Cristo. The other three books were all from different authors: Kate Chopin’s The Awakening and Selected Short Fiction, Herman Melville’s Moby Dick, and Nathaniel Hawthorne’s The Scarlet Letter. I chose the Dumas books to see if they would match (out of pure curiosity). I am a huge fan of Mark Twain, so I couldn’t help picking more of his books; and the chance to see whether the man with the finest ear for written regional dialects would show a wide variance within a dendrogram was too good to pass up. Kate Chopin’s book has a similar dialectal pattern, so I wanted to see where she fell when compared to Mark Twain. Melville and Hawthorne were great friends at one point and did what amounted to modern-day workshops on their writings, so I was again curious to see if they would be similar in word usage, even though Moby Dick is several times the length of The Scarlet Letter.

I won’t go into the how-to of getting a dendrogram, but I think the results are at least noteworthy. I’ll abbreviate the titles to keep this brief. Huck Finn and The Awakening are simplicifolious (out there on their own), just as I thought they would be. Dumas is off by himself, and Twain’s Pudd’nhead and Tom Sawyer form a clade of Mississippi River dialect. Twain’s other two books were written mainly in his own voice, and it makes sense that they form their own clade as well. I am surprised and delighted to see Moby Dick and The Scarlet Letter in their own clade too; it may not prove I was right, but it gives me a warm and fuzzy feeling that the two friends at least paired up in this dendrogram. This was a fun experiment that immediately showed a usefulness that could lead to some fruitful research down the road. I have attached a picture of this amazing dendrogram (clicking on the picture enlarges it for a much better view). For me, this shows how much fun can be had, while still advancing academia, even if it is a slow and careful descent into the Lexomic waters.

[Image: screenshot of the resulting dendrogram]

Feb 22
2016
9:40 PM

Assignment 2

Lexos is a software program developed by Wheaton College in Norton, Massachusetts, for use in the field of Digital Humanities. As a tool, Lexos can perform a handful of tasks that are quite useful for individuals involved in Digital Humanities research.

The program is designed to take a text (easily inserted by uploading files), apply a variety of editing options, and then visualize and/or analyze the piece. The editing options include scrubbing, cutting, and tokenize/count. Scrubbing allows users to alter the text’s punctuation and case, remove certain characters, remove words (stopwords), replace words (lemmas), replace characters (consolidations), and tell the program what to do with non-standard characters (special characters, e.g. ∆). Cutting makes it possible to split the text into smaller segments, with five options for how those segments are separated (segments can also overlap with each other). Tokenize/count is the simplest of the options; it shows how many words are in the document, compares single-word counts to the total word count, and can alter the way Lexos compares words to each other. The Visualize section of the program has four options: RollingWindow Graph (which I was unable to figure out), Word Cloud, Multicloud (which makes word clouds for multiple documents), and BubbleViz, which is essentially a word cloud with word frequency represented by bubble size rather than word size. Lexos’ Analyze section also has four options: Statistics, Clustering, Similarity Query, and Topword. Statistics gives information such as the number of distinct terms and the number of words occurring only once. Clustering has two options, hierarchical and K-means; these change the way data about an author’s word choice is displayed. Clustering is used to show differences in word choice, which can be used to differentiate between authors as well as to compare differences within one author’s writings (e.g. an author’s first work compared to their last).
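
To make the scrub and tokenize/count steps more concrete, here is a minimal sketch in Python of what that kind of pipeline does conceptually. It is not Lexos’s actual code, and the file name and tiny stopword list are made up purely for illustration.

```python
import re
from collections import Counter

def scrub(text, stopwords=None):
    """Roughly mimic Lexos-style scrubbing: lowercase, strip punctuation and digits, drop stopwords."""
    text = text.lower()                      # Make Lowercase
    text = re.sub(r"[^\w\s]", " ", text)     # Remove All Punctuation
    text = re.sub(r"\d+", " ", text)         # Remove Digits
    tokens = text.split()
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

# Hypothetical file name and stopword list, purely for illustration.
with open("huck_finn.txt", encoding="utf-8") as f:
    tokens = scrub(f.read(), stopwords={"the", "a", "an", "and"})

counts = Counter(tokens)                     # the tokenize/count step
print("total terms:   ", len(tokens))
print("distinct terms:", len(counts))
print(counts.most_common(10))                # ten most frequent words
```

The reason lowercasing and punctuation removal come before counting is so that, for example, “Whale,” and “whale” are treated as the same term.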

Lexos as a software program is quite confusing to the untrained eye. Many of the words used in the program, such as dendrogram (referring to the graph created by clustering), tokenize, culling, etc., limit usability to those trained in the digital humanities or instructed in how to use it; other individuals would have a very difficult time figuring out how to use the program properly. Another complaint I have is the lack of instruction within the program itself; having to leave the page to figure out what some components do is a pain. Wheaton College already provides definitions and hints for what certain components do, so why not do the same for all of the editing options and sections? The user interface itself is very simple to use, however, so once an individual gets past the challenge of the diction used in the program it is quite easy to work with. Overall, Lexos is an extremely useful tool for people in digital humanities, but it is both difficult to use and very specialized for people not involved in the digital humanities field; I would suggest trying other programs first if someone outside digital humanities wants to use it for reasons other than its primary purpose.

Feb 22
2016
9:14 PM

Assignment #2

Lexos is a unique tool that can be used in Digital Humanities and Computing Humanities, as well as by anyone else who has a need for it. As a novice user of Lexos, I did not find the initial interface hard to figure out, but it did have some tricky factors.

Lexos uses a simple approach to getting started. You simply choose a document that you would like to upload and drag it into a folder. For me, though, I kept getting an error because you cannot upload a Word file; it has to be uploaded as a .txt file. After dragging your file into the box, you go through a variety of options to specify your result. The Prepare tab allows you to make specifications about the final product having to do with the words involved. Some of the available options include removing punctuation, making all words lowercase, and removing digits; these are all listed under the Prepare tab as scrubbing. Having these options under the scrubbing tool offers some variation in the style of the final product. Another available tab is visualization, which allows you to choose the orientation and style of your word cloud. Lastly, Lexos allows for multiple styles of word-usage diagrams, such as clustering, hierarchical clustering, and Topword. The different types of diagrams enable Lexos users to see things from different perspectives and allow Lexos to appeal to a wider variety of users.

For myself, Lexos has been a helpful tool in the field I am currently enrolled in, although I think it is mostly restricted to Digital Humanities, since it is a major key in digital humanities research; not many other fields call for a program to analyze word usage. That being said, it is very helpful when that is called for. In class we used Lexos to discuss and analyze the works of writers, including Jane Austen and George Eliot. From our comparison we saw a recurring similarity in word usage across Jane Austen’s work as a whole, as well as a similarity between Austen and George Eliot. The tool we used for this analysis was under “Analyze” and then Hierarchical Clustering. This produced a dendrogram, which I had no clue about at first. The dendrogram is a graph that uses height and vertical distance to show the similarity between authors and their works. This point about the dendrogram goes back to the question of who Lexos is for: without someone qualified to guide them through the program, it may be hard for ordinary users to understand what certain terms mean and how they are best used.

Overall, Lexos offers far more advanced results than its competitors because of the precision of its functions, which is a positive for users who are familiar with the program, but can be somewhat intimidating for beginners.


Feb 22
2016
8:48 PM

Assignment #2

Lexos as a Text Mining Tool

The use of the word “tool” to describe an object implies that the object somehow makes a task or group of tasks easier. Lexos definitely makes text analysis and visualization easier from beginning to end. Lexos provides a simple user interface. The various tabs (Manage, Prepare, Visualize, and Analyze) make navigation easy, assuming, of course, that the user understands what is meant by these terms in relation to analyzing text data. Beyond navigation, the actual functionality itself is also well laid out.

Lexos makes text analysis easier starting with preparing the text. The Scrub page under the Prepare tab allows the user to clearly select which document or documents to scrub. The Prepare page provides basic scrubbing options as well as more advanced options such as stop words. In case the user is unfamiliar with some scrubbing methods, the helpful gray circles with a question mark provide a brief explanation of each one. Another helpful option is to upload a file of stop words as opposed to entering them manually. For serious text analysis there may be a plethora of stop words, in which case uploading a file of stop words may be a better option than entering them by hand. Under the Prepare tab, there is also an option to cut and an option to tokenize. Cutting allows the user to split the text into a specified number of chunks or into chunks based on a milestone string. Tokenizing allows the user to separate words based on a specified delimiter. Another amazing feature of Lexos is the ability to download the Document-Term Matrix as a .csv file. CSV files are great for work in Excel and in advanced data mining software such as Weka. According to the Lexos website, they included this feature “[to] facilitate subsequent text mining analyses beyond the scope of this site” (http://wheatoncollege.edu/lexomics/tools/). After the text is prepared through scrubbing, cutting, and/or tokenizing, Lexos provides a number of options for visualizing the text data.
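
To give a sense of what that downloadable Document-Term Matrix contains, here is a hedged sketch that builds a term-count table for a few documents with scikit-learn and pandas and writes it to a .csv. The file names are placeholders, and Lexos’s own output layout and column ordering may well differ.

```python
# Minimal sketch: build a document-term matrix and save it as CSV.
# File names are placeholders; Lexos's own CSV layout may differ.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

files = ["emma.txt", "jane_eyre.txt", "middlemarch.txt"]   # hypothetical paths
texts = [open(f, encoding="utf-8").read() for f in files]

vectorizer = CountVectorizer(lowercase=True)   # counts each term per document
matrix = vectorizer.fit_transform(texts)       # rows = documents, columns = terms

dtm = pd.DataFrame(matrix.toarray(),
                   index=files,
                   columns=vectorizer.get_feature_names_out())
dtm.to_csv("document_term_matrix.csv")         # ready for Excel, Weka, etc.
```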

Lexos can visualize the text data as a Rolling Window Graph, a word cloud, a MultiCloud, or a BubbleViz. Lexos expands beyond the basic word cloud and allows for interesting visualization techniques. In my opinion, the BubbleViz surpasses the word cloud as a visualization tool: the word cloud emphasizes the most prominent words too much, whereas the BubbleViz still exaggerates the most prominent words but leaves the less prominent words readable. As you visualize, Lexos gives you even more control through options such as minimum word length. Overall, the visualization aspect of Lexos is straightforward to use, and the option of several different visualizations allows the user to see the data in different ways, which is generally a very good thing for text analysis.

Not only does Lexos provide the ability to prepare and visualize text data, but it also supports the analysis of that data. Lexos can analyze the data through Statistics, Clustering, SimilarityQuery, or Topword. By providing the user with statistics, Lexos lets the user focus on analyzing those statistics rather than spending time calculating them. Clustering is also very important in data mining, and Lexos allows for two types: hierarchical clustering and K-means clustering. SimilarityQuery is a great analytical tool because it allows for document comparison, which is a major subfield of text mining.
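
As a rough illustration of what a similarity query does, cosine similarity over term-count vectors is one common way to score how close two documents are. Whether Lexos’s SimilarityQuery uses exactly this metric is an assumption on my part, so treat the sketch below as a generic example rather than a description of its internals.

```python
# Generic document-comparison score: cosine similarity over term counts.
# Whether Lexos's SimilarityQuery uses exactly this metric is an assumption.
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Toy example: 1.0 means an identical word profile, 0.0 means no shared words.
doc1 = "call me ishmael some years ago".split()
doc2 = "some years ago never mind how long".split()
print(round(cosine_similarity(doc1, doc2), 3))
```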

Overall, I consider Lexos to be an effective and well-rounded tool. Preparing the data, especially scrubbing, produces a more meaningful result. Visualizing the data in various ways allows for more complex analysis. The various analytical tools produce a plethora of useful information for the user. Lexos is well-organized and quite effective as a text mining tool.

Feb 22
2016
11:26 AM

Lexomics Tool (assignment 2)

The Lexos tool provides an analytical resource for experimenting with the frequency of word usage in a specific text. This tool is slightly confusing to the untaught eye, as the terminology used isn’t necessarily common knowledge. The tool allows one to upload a text, filter words out of it, and then look at their frequency alone. It also allows one to upload multiple texts and compare their word usage through a hierarchical dendrogram. (Note the difficult terminology.) Had I not been walked through how to set up the texts, I likely would have stopped using the tool, as I personally wouldn’t have understood what exactly my results were showing, so it is fair to assume that this tool was created for professional digital humanists. The tool itself, while very interesting, doesn’t really have much use outside of digital humanities; it allows one to answer questions about how an author compares with their own or others’ work, but the average person typically doesn’t have that question in mind. So the social assumption made about who is using this tool is that they are an educated person doing a research project specifically on word usage and frequency, comparing and contrasting the words of various texts.

This tool does offer much more detailed and interesting results than other tools in its realm, such as WordCloud. WordCloud only allows you to look at one text and then see the most frequently used words on a spread, with frequency depicted by the size of the word. Lexos doesn’t overemphasize the most frequently used words the way WordCloud does, and it also provides a frequency count when you scroll over a word, which WordCloud does not do. Lexos also has the option of “scrubbing” out words that aren’t interesting or necessary for the research; for example, one can take out all of the pronouns from a story to get a better picture of the content rather than of the subjects within it. Also, unlike WordCloud, Lexos provides statistical analysis of word frequency and word usage by producing a dendrogram. This feature in itself makes Lexos a much more detail-oriented and research-oriented site, one that can do more than just provide an interesting look at the words in a text.

As previously stated, this tool is for those with a question or a problem to solve, not just the typical internet user. There has to be an interest in a particular text or texts and a question pertaining to it; otherwise the results are rather meaningless. In practice with the tool, we analyzed the similarities between texts by Brontë, Jane Austen, and George Eliot; what became interesting, however, is that the two authors whose work was most similar were the two furthest apart in time. We knew they were more similar by the way they clustered in the dendrogram. With further knowledge of these texts, one would know that this is likely because the Eliot text was written as if set in the early 1800s, the time of Jane Austen’s novels, which at the least provides evidence that Eliot’s novel accomplished its goal in that sense.

Overall, it is arguable that the Lexos tool is much more useful than other tools like it on the internet. But at the same time, it is meant for people who have a purpose for the tool and some knowledge of how it works.

Feb 19
2016
5:31 PM

Dendrogram experiment

I wanted to post the results of our experiment yesterday afternoon here on the English 294 site, but WordPress wouldn’t allow me to upload an SVG image file. The best I could do was to include a link to a PDF of the results of our dendrogram experiment. I’ve uploaded the text files we worked with onto Blackboard (look in the folder “Dendrogram Experiment”).

Remember, dendrograms are a way of graphically representing the statistical similarity in word choice between two or more texts. (See the Lexomics videos for more information on this point.) Among texts from roughly the same period and genre, the dendrograms should be able to distinguish between authors. We looked at eight novels written by English women in the 19th century: Northanger Abbey (written 1798-99, published posthumously 1817), Sense and Sensibility (1811), Pride and Prejudice (1813), Mansfield Park (1814), and Emma (1815), all by Jane Austen; Jane Eyre (1847) by Charlotte Brontë and Wuthering Heights (also 1847) by her sister Emily Brontë; and Middlemarch (1871) by George Eliot.

If you want to reproduce the experiment, upload the files and apply the default scrubbing, then go to Clustering (Hierarchical Clustering) under the Analyze tab. The default options work well enough, so just click on the green Get Dendrogram button on the lower left.
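
If you would rather approximate the experiment outside Lexos, the sketch below does roughly the same thing with SciPy and scikit-learn. The file names are placeholders for your own plain-text copies of the novels, and the cosine distance and average linkage used here are common choices that may not match Lexos’s defaults exactly.

```python
# Stand-alone approximation of the classroom dendrogram experiment.
# File names are placeholders; the metric and linkage are assumptions,
# not necessarily what Lexos uses by default.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

files = ["northanger_abbey.txt", "emma.txt", "pride_and_prejudice.txt",
         "jane_eyre.txt", "wuthering_heights.txt", "middlemarch.txt"]
texts = [open(f, encoding="utf-8").read() for f in files]

vectors = TfidfVectorizer(lowercase=True).fit_transform(texts).toarray()
links = linkage(vectors, method="average", metric="cosine")

dendrogram(links, labels=[f.replace(".txt", "") for f in files])
plt.tight_layout()
plt.savefig("dendrogram.png")
```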

As we observed in class, the Brontë sisters cluster together, and all the Jane Austen novels cluster together. Interestingly, George Eliot (Mary Ann Evans) clusters more closely with Jane Austen than with the Brontë sisters.

Feb 17
2016
11:41 AM

Mystery Solved!

I believe I’ve solved the mystery of why Lexos produced results for Total Term Count that didn’t agree in all cases with the results you got from running “wc -w” in the Terminal window. The reason is that once you apply a scrubbing that includes removing stopwords (whether from an uploaded file or from a list entered by hand in the textbox), those words are gone, and they’re not coming back.

We’ll take Molly’s example of “The Raven”. I’ve downloaded a copy from the Project Gutenberg website, and I’ve edited it so that the text file contains just the words of the poem itself, not even the title. (You can download my version from Blackboard as Raven.txt.) Running “wc -w Raven.txt” in the Terminal window shows that the poem has 1072 words. If you upload the file into Lexos, apply the default scrubbing (“Remove All Punctuation”, “Make Lowercase”, and “Remove Digits”), and generate statistics, you will get a correct Total Term Count of 1072.

But go back to the Scrubbing Options menu, add the single word “is” to the stopwords textbox, and apply scrubbing. If you then generate statistics again, you will get a Total Term Count of 1059, because the 13 instances of “is” have been removed. And going back to the Scrubbing Options menu and removing “is” from the stopwords list does not change the result.
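
A quick way to double-check that arithmetic outside Lexos is to count the whitespace-separated words yourself. The sketch below assumes you have the same Raven.txt and simply drops standalone instances of “is” the way the scrubber does; the expected numbers in the comments come from this post, not from anything guaranteed by the code.

```python
# Sanity check on Total Term Count, assuming the same Raven.txt file.
with open("Raven.txt", encoding="utf-8") as f:
    words = f.read().split()           # whitespace-split, like "wc -w"

print(len(words))                      # should print 1072 for this copy

# Drop standalone "is", ignoring case and surrounding punctuation.
scrubbed = [w for w in words if w.lower().strip('.,;:!?"') != "is"]
print(len(scrubbed))                   # should print 1059 (13 fewer)
```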

The key takeaway here is that Lexos is implemented in such a way that it does not keep track of the history of changes you’ve made to the file by scrubbing. You can’t tell Lexos to put all the instances of “is” you scrubbed back into the file, because it no longer “remembers” how many there were, and where they were.

Feb 12
2016
8:16 PM

Assignment one: His Last Bow

The book I decided to do my assignment on was His Last Bow by Sir Arthur Conan Doyle. The book came out in 1917. The most prominent words in the word cloud image are Bork, Von, England, One, and Secretary. At first glance, I am quite sure that Von and Bork must be some sort of main characters, probably a protagonist and an antagonist. Some words that are intriguing due to their size are secretary, one, well, and man. When taking a closer look at the text and doing a quick search for words and their number of usages, I came to the conclusion that the word “man” was used 20 times, while the word “well” was used 8 times, although “well” is shown at a much larger scale. Diving deeper, I found that adjusting the Wordle settings to show words in lowercase only changed the view dramatically, putting words that initially seemed larger at a better scale. When I went back to look at the Wordle document again, I wanted to examine another word shown at a relatively large scale: “Watson.” Initially I would have believed that “Watson” was a character’s name, which is true, but it is quite interesting that in the book there is no formal introduction of a Watson; the story simply says Watson in a sentence, and that is his introduction. The story takes place in England and involves Von Bork, who is the main character, and at the end of the novel he and his friend Watson fly away after successfully depositing money.

[Image: Wordle word cloud for His Last Bow]

Feb 12
2016
7:50 PM

What’s One Word?

My latest lapse into societal withdrawal was spent looking at multiple images (all words) of Walt Whitman’s Leaves of Grass on the wonderfully addicting website www.wordle.net. Leaves of Grass was first published in 1855; it only took someone 153 years to upload the entire book of poems to www.gutenberg.org. That website is another phenomenal way to waste one’s time. I will add one more thing about the site: it is called Project Gutenberg, and its FAQ describes it as “The original, and oldest, etext project on the Internet, founded in 1971.” As another warning: combining the Gutenberg and Wordle websites can develop addictive Internet habits that some might call procrastination (but I call it research).

Leaves of Grass is just over 122,000 words. I selected a font named “Steelfish”, which sounded Whitmanian enough for analyzing Whitman’s work. I then chose the maximum words to be shown as 10,000 (Wordle automatically populates with 150 words). I arranged the “word cloud” (named because it is a cloud of words) into a “rounded shape” with a selection for the words to be in “preferred alphabetical order” (though some words don’t behave and edge to the left of the alphabet). I selected the words to be displayed horizontally only. The attached picture probably shows it better than I can write about it. Wordle has an amazing array of choices to play with when it comes to shaping the “word cloud”.

[Image: Leaves of Grass word cloud, 10,000 words]

After Leaves of Grass was populated into what looks like a football, the top ten words were: see, one, now, old, love, life, yet, thee, soul, earth. I checked my visual top ten by using a search function for those specific words in the Leaves of Grass text; the word processor automatically counted and identified where those words were in the text. I found that the word “long” was used more often than “soul” and “earth”, while the “word cloud” in Wordle shows “long” to be slightly smaller (indicating it was used less throughout the text). “Long” did not show up when I selected a maximum of 10 words to generate my “cloud” (but it did when I selected a maximum of 11). After looking through the poems, I would have to guess that Wordle does not count words that are hyphenated (there are 27 words that connect “long” with a hyphen).
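
One way to test the hyphen hypothesis is to count “long” both ways yourself. The sketch below assumes a local plain-text copy of the poems (the file name is a placeholder) and compares the count when hyphenated compounds are kept whole against the count when they are split apart.

```python
# Does splitting hyphenated compounds change the count for "long"?
# Assumes a local plain-text copy of the poems (placeholder file name).
import re
from collections import Counter

text = open("leaves_of_grass.txt", encoding="utf-8").read().lower()

whole = re.findall(r"[a-z]+(?:-[a-z]+)*", text)   # keep "long-lived" as one token
split = re.findall(r"[a-z]+", text)               # break hyphenated words apart

print("standalone 'long':", Counter(whole)["long"])
print("with hyphens split:", Counter(split)["long"])
print("hyphenated compounds containing 'long':",
      sum(1 for t in whole if "-" in t and "long" in t.split("-")))
```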

“See” came up as a clear winner in the Wordle picture and in my word processor; the search function was able to account for 432 usages. “One” came up 355 times. The word “one” is a strange beast in the Wordle world. It is surprising how many texts have the word “one” as a widely used word (and it has nothing to do with a numbering system). Besides the curiosity of “one” and the possibility of Wordle alienating all things hyphenated, there was nothing surprising (to me) in the picture generated by Wordle. Whitman (in text and in person) was full of “life” and “love”. He talked often of the “soul” and the old “earth” (whether he spoke of the dirt in the ground or the whole rock circling the sun). To compare another poet alive during Whitman’s time, I created a “cloud” from a text labeled as the entire works of Emily Dickinson. The words “one”, “see”, and “life” were also in her top ten. Not completely satisfied with this result, I entered what was being sold as the complete works of Edgar Allan Poe (I deleted the essays and a biography). “One”, “now”, “love”, and “soul” came up in the top twenty; “one” was in the top ten. Unless Wordle is programmed to throw in the “one” word, like a sick joke, it might be time to take a closer look at how many writers used the “one” word, in what way, and quite possibly why they used it so much. Or one would think so.

Feb 12
2016
7:48 PM

Word Cloud Assignment: The Shoes of Fortune

“The Shoes of Fortune” by Hans Christian Andersen is the story of a pair of shoes, given to mankind by a fairy, which transport anyone who puts them on to the time and place they desire. The shoes are left at a party and, upon leaving, are taken in place of a party-goer’s shoes; that party-goer is carried off on a quest, after which the shoes pass into another character’s possession and a second quest begins. When observing the word cloud for “The Shoes of Fortune”, before diving into the story, a few questions and comments come to mind. Immediately I find myself wondering where “shoes” can be found; I would assume that a fairy tale about shoes would contain a much higher count of the word than the word cloud indicates. Secondly, I would like to know the “councillor’s” and “watchman’s” involvement in the story, as they are two of the largest words. The words “Copenhagen” and “streets” indicate that the city is possibly the setting of the story. After going deeper into the story, I found that the reason “shoes” appears so small is that the shoes are turned into galoshes and referred to as such for the latter part of the book; the councillor and the watchman are two of the characters who come to possess the shoes, explaining their size. The story is also set in Copenhagen, Denmark, as I predicted. Overall this word cloud gives us the names of two major characters and the setting, while also raising some curiosity over how hard it is to find the word “shoes”, which makes the observer interested in the story’s lack of mention of its own title.

[Image: word cloud for The Shoes of Fortune]