For my project, I analyzed a selection of texts from the American bestseller lists of 1895, 1896, and 1897. Because several authors appear on these lists with more than one book, I expected to find many similarities among their works. I also assumed that widely read books written at roughly the same time would be similar, even when written by different authors.
With only basic scrubbing, the analysis turned out mostly as I expected, though some authors' books were separated and appeared dissimilar.
Surprisingly, adding the NLTK stop words list increased the overall similarity between many of the books, but it also split up most of the authors.
Using the NLTK list as keep words produced the most dramatic results. Most of the authors clustered together, but the dendrogram shows a marked difference in branch heights.
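To make the difference between the three runs concrete, here is a minimal Python sketch of what each setting does to a text before the word counts are compared. It is only an illustration, not the tool I used: `FUNCTION_WORDS` is a small hypothetical stand-in for the much longer NLTK stop words list, the `scrub` and `word_counts` helpers are my own names, and the sample sentence is made up.

```python
import re
from collections import Counter

# Hypothetical stand-in for the NLTK English stop words list.
# The real list is much longer; this is only for illustration.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in",
                  "was", "he", "she", "it", "his", "her"}

def scrub(text):
    """Basic scrubbing: lowercase, then keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(text, words=None, mode="none"):
    """Count words after scrubbing.

    mode="none": count every token.
    mode="stop": drop tokens that are in `words`.
    mode="keep": keep only tokens that are in `words`.
    """
    tokens = scrub(text)
    if mode == "stop":
        tokens = [t for t in tokens if t not in words]
    elif mode == "keep":
        tokens = [t for t in tokens if t in words]
    return Counter(tokens)

# Made-up sample sentence, not taken from any of the books.
sample = "The prisoner of the castle was bold, and he escaped in the night."

print(word_counts(sample))                               # basic scrubbing only
print(word_counts(sample, FUNCTION_WORDS, mode="stop"))  # content words remain
print(word_counts(sample, FUNCTION_WORDS, mode="keep"))  # function words remain
```

With the list used as stop words, only content words such as "prisoner" and "castle" are left, so the comparison depends on subject matter. With the same list used as keep words, only function words such as "the" and "of" are left, so the comparison depends on habits of style, which may be why the authors clustered together more clearly in that run.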