Apr 12, 2016, 1:46 PM

Assignment 7: Topic Modeling

Welcome to Unit 4!

Assignments 7 and 8 both require more collaboration and coordination between students than the previous projects, and so offer something closer to the experience of a real digital humanities project. The theme for both assignments is the use of techniques that exploit the statistical properties of word frequencies in corpora of texts to provide information about topics and authorship.

Assignment 7: Overview

You will use Paper Machines, an add-on to Firefox and Zotero, to generate topic models from a corpus of texts that the class will collectively assemble from material on Project Gutenberg. You will then write a 300-word blog post describing your findings. Due on Wednesday, April 20th, at 11:59 PM PDT.

Assignment 7: Details

First, we need to perform minor surgery on Firefox. Don’t try this at home, kids!

  1. Type about:config in the Firefox location bar.
  2. Change xpinstall.signatures.required from true to false
    (right-click on the line, then select Toggle from the menu).
    This allows Firefox to install unsigned add-ons.

Then, install Paper Machines.

Make sure PDF Indexing is turned on in Zotero Preferences
(third button from the left = the gear icon, Search tab)

Take a look at Getting Started. Note especially the following caveat: “Some users have found that Paper Machines produces empty results for smaller datasets. We suggest beginning with at least 20 files before you attempt a wordcloud or relational diagram, and more like 50 to 100 before you attempt Topic Modeling.”

This is where group collaboration and coordination come in.

Rather than have each of you attempt to gather 50-100 texts, we are going to use Zotero to build up a group library from texts found on Project Gutenberg. You have all received, and some of you have accepted, the invitation to join the group English 294. If you haven’t accepted it yet, please do so now. As with our previous work involving material sourced from Project Gutenberg, the texts will have to be hand-edited to remove the Project Gutenberg header and license text, which would otherwise produce spurious results.
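
If you would like a head start on the hand-editing, here is a minimal Python sketch that trims everything outside the START and END marker lines found in many Project Gutenberg plain-text files. The marker patterns and the function name strip_gutenberg_boilerplate are assumptions made for illustration; the wording of the markers varies from file to file, so treat the result as a first pass and still check each text by eye.

```python
import re

# Assumed marker patterns: many Project Gutenberg plain-text files wrap the
# book in lines like "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***",
# but the exact wording varies, so these will not match every file.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    if finish < begin:
        # Markers appear out of order; do not guess, just return the text.
        return text.strip()
    return text[begin:finish].strip()
```

To use it, read a downloaded file into a string, pass it to strip_gutenberg_boilerplate, and save the result under a new name so the original stays available for comparison.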
