Assignment 4


Here’s the Lexos assignment as I presented it to my students, and you can see their responses here, here, and here.

Use Lexos to analyze a corpus of texts downloaded from the Project Gutenberg website. As we discussed, it is best to analyze texts in their original language: if you analyze a translation, it’s the translator’s use of language you’ll be analyzing, not the author’s. That said, Lexos should have no problem analyzing texts in languages other than English. Scott Kleinman, the creator of Lexos, works primarily on Anglo-Saxon texts, and features like Lemmas, Consolidations, and Special Characters are intended to facilitate the handling of foreign-language text. Select at least a dozen texts, and try to draw them from two distinct groups of authors to set up a contrast. For example, last year I had students compare English authors (mostly women) and African-American authors (mostly men) from the nineteenth century.

Select Scrub (under the Prepare tab), and make sure that Remove All Punctuation, Make Lowercase, and Remove Digits are checked (these are the default selections). You should experiment with using both a Stop Words list and a Keep Words list. I’ve attached the Python Natural Language Toolkit (NLTK) stopwords list, which you can use for either purpose (just not both at the same time). Remember to click Apply Scrubbing (green button), and visually confirm that the scrubbing took effect in the Previews of Documents text box on the right-hand side of the window. Then go to Hierarchical Clustering (in the Cluster menu, under the Analyze tab). Make sure to select the Proportional Counts button under Normalize. Once that is set, click the green Get Dendrogram button, download the resulting dendrogram as a PNG file, and include it in your blog post.
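If it helps to see what these settings are doing, here is a rough Python sketch of the same ideas. It is not Lexos's actual code: the function names, the five-word list standing in for the NLTK stopwords file, the sample sentence, and the choice of Euclidean distance are all my own assumptions for illustration.

```python
import math
import string
from collections import Counter


def scrub(text, stop_words=None, keep_words=None):
    """Lowercase the text, delete ASCII punctuation and digits, then filter."""
    if stop_words is not None and keep_words is not None:
        # Mirrors the assignment: use one list or the other, not both.
        raise ValueError("use a stop-word list or a keep-word list, not both")
    text = text.lower()
    # Only ASCII punctuation is handled here; curly quotes would survive.
    table = str.maketrans("", "", string.punctuation + string.digits)
    tokens = text.translate(table).split()
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    if keep_words is not None:
        tokens = [t for t in tokens if t in keep_words]
    return tokens


def proportional_counts(tokens):
    """Divide each word's count by the document's total number of tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: n / total for word, n in Counter(tokens).items()}


def euclidean(p, q):
    """Distance between two documents' proportion tables (assumed metric)."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))


# Hypothetical stand-in for the attached NLTK stopwords list.
function_words = {"the", "and", "of", "a", "in"}
sample = "The cat and the dog, in 1850, sat by the fire."

content_tokens = scrub(sample, stop_words=function_words)   # drops listed words
function_tokens = scrub(sample, keep_words=function_words)  # keeps only them
```

Used as a stop-word list, the word list leaves only the content words; used as a keep-word list, it leaves only the function words. Proportional counts keep a long novel from looking different from a short one simply because it contains more words, and a clustering step would then group documents by the distances between their proportion tables.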

Finally, write up your findings and post them, along with your PNG dendrogram file, on the course WordPress site. You should indicate whether the dendrograms clustered your authors in the ways you expected, and how the use of stop words and keep words affected your results.

Please let me know if you have any questions.

Due Wednesday, May 2nd at 11:59 PM PDT.