Feb 17, 2016, 11:41 AM

Mystery Solved!

I believe I’ve solved the mystery of why Lexos produced results for Total Term Count that didn’t agree in all cases with the results you got from running “wc -w” in the Terminal window. The reason is that once you apply a scrubbing that includes removing stopwords (whether from an uploaded file or from a list entered by hand in the textbox), those words are gone, and they’re not coming back.

We’ll take Molly’s example of “The Raven”. I’ve downloaded a copy from the Project Gutenberg website, and I’ve edited it so that the text file contains just the words of the poem itself, not even the title. (You can download my version from Blackboard as Raven.txt.) Running “wc -w Raven.txt” in the Terminal window shows that the poem has 1072 words. If you upload the file into Lexos, apply the default scrubbing (“Remove All Punctuation”, “Make Lowercase”, and “Remove Digits”), and generate statistics, you will get the correct Total Term Count of 1072.
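
If you want to check this outside of Lexos, here is a rough Python model of the default scrubbing and the count. This only approximates what Lexos does internally (its actual tokenizer may handle edge cases differently), but it shows why the default options leave the word count untouched:

    import re

    def default_scrub(text):
        # Approximate Lexos’s default options: “Make Lowercase”,
        # “Remove Digits”, and “Remove All Punctuation”
        text = text.lower()
        text = re.sub(r"\d", "", text)       # strip digits
        text = re.sub(r"[^\w\s]", "", text)  # strip punctuation
        return text

    with open("Raven.txt") as f:
        scrubbed = default_scrub(f.read())

    # Count the way “wc -w” does: split on whitespace
    print(len(scrubbed.split()))  # 1072, matching “wc -w Raven.txt”

None of those three operations ever deletes a whitespace-separated token, which is why the two counts agree at this stage.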

But go back to the Scrubbing Options menu, add the single word “is” to the stopwords textbox, and apply scrubbing. If you then generate statistics again, you will get a Total Term Count of 1059, because the 13 instances of “is” have been removed. And going back to the Scrubbing Options menu, removing “is” from the stopwords list, and applying scrubbing again does not change the result.

The key takeaway here is that Lexos is implemented in such a way that it does not keep track of the history of changes you’ve made to the file by scrubbing. You can’t tell Lexos to put all the instances of “is” you scrubbed back into the file, because it no longer “remembers” how many there were or where they were.
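
To make the irreversibility concrete, here is a minimal sketch of the same behavior in Python. Again, this is an approximation of what Lexos does, not its actual code, but it captures the one-way nature of the operation:

    import re

    def scrub(text, stopwords=frozenset()):
        text = re.sub(r"[^\w\s]", "", re.sub(r"\d", "", text.lower()))
        # Stopword removal deletes matching tokens outright; nothing
        # records how many there were or where they appeared
        return " ".join(t for t in text.split() if t not in stopwords)

    with open("Raven.txt") as f:
        original = f.read()

    once = scrub(original, stopwords={"is"})
    print(len(once.split()))   # 1059: the 13 instances of “is” are gone

    # Scrubbing again with an empty stopword list starts from the
    # already-stripped text, so the missing words never come back
    again = scrub(once)
    print(len(again.split()))  # still 1059

Because scrub returns only the surviving tokens, there is nothing left to restore from, which is exactly the situation Lexos is in after a stopword scrub.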
