Apr 20
2016
10:15 AM

WHAT IS TOPIC MODELING?

That is a great question! The more I read about the subject, the more I fear I don’t understand it as completely as I should. It feels like I am digging a well, but I started at the top of Mount Everest; the amount of information and work that can go into using and understanding topic modeling is huge. I can see why some humanists are scared off by the subject, or by even attempting to play around with topic modeling software; they might be swimming in familiar waters, but those waters have been stirred up and are now murky, making them seem unfamiliar and even scary to be in. Using Paper Machines in conjunction with Zotero sounded fascinating, and I was excited to see the results. What didn’t excite me were the results themselves. The word cloud gave me hope, and with it a small feeling of success: I could see a few words that I would expect to have been used throughout the corpus more than others. It didn’t tell me much, but it gave me a sense of moving forward.

The frustrating thing I see with topic modeling is the apparent randomness with which the clusters of words, or topics, are generated (I will use the terms cluster(s) and topic(s) interchangeably throughout the post). It is my understanding that the algorithm is mathematically complex, so I have to trust that the software implements it correctly; my last math class was Statistics, which happened so long ago that I can only say I took the class, and my mathematical skills have since been reduced to fractions. From what I have read and understand, the algorithm can generate different clusters each time it is run, but the results will still be similar. I would compare that to two different people reading the same text and coming up with slightly different topics, while still agreeing on the major themes. In an odd way, I had hoped that the computer-generated clusters would be a bit more precise, giving me something I could readily digest and interpret from the corpus we entered into Zotero.

Having used fewer than 100 books authored in the 1800s gave me the ability to at least recognize the books by some of the clusters. Having familiarity with the books made the clusters understandable. I can see the potential of breaking a large text down into multiple pieces, which could then be more easily run through some form of topic modeling, and the results would be beneficial in understanding some of the main ideas in the text as a whole. An example from the topic modeling I used with 20 topics: “ship man captain whale sea deck men boat ye” could be linked to whaling and, more specifically, to the book Moby-Dick. With 50 topics it looked like: “whale ship man sea captain deck men war ye”. The only differences are the order of the words and that the 20-topic model had “boat” where the 50-topic model had “war”; otherwise the two clusters contained the same words. I should mention these clusters were given to me and were generated using MALLET. Paper Machines was far less successful: it generated a large number of topics, but gave me only three words for each. For example, “whale”, “ahab”, and “trapper” came up in one cluster. Even if I had not known Moby-Dick was in the corpus, I would have guessed that the cluster was about that book, as “whale” and “Ahab” are almost synonymous with Moby-Dick; but if I had never heard of the book, I would have been hard pressed to say the cluster was about anything.
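A comparison like the one between the 20-topic and 50-topic whaling clusters can also be checked in a few lines of Python rather than by eye. This is only a rough sketch: the function name compare_clusters and the overlap score (Jaccard similarity, shared words divided by all distinct words) are illustrative choices for this post, not something MALLET or Paper Machines reports.

```python
# The two clusters quoted above, exactly as the word lists appeared.
topic_20 = "ship man captain whale sea deck men boat ye"
topic_50 = "whale ship man sea captain deck men war ye"


def compare_clusters(a, b):
    """Compare two space-separated word clusters, ignoring word order.

    Hypothetical helper for illustration; returns the shared words,
    the words unique to each cluster, and a Jaccard overlap score.
    """
    words_a, words_b = set(a.split()), set(b.split())
    shared = words_a & words_b
    only_a = words_a - words_b
    only_b = words_b - words_a
    jaccard = len(shared) / len(words_a | words_b)
    return shared, only_a, only_b, jaccard


shared, only_20, only_50, jaccard = compare_clusters(topic_20, topic_50)
print("shared:", sorted(shared))
print("only in 20-topic cluster:", sorted(only_20))
print("only in 50-topic cluster:", sorted(only_50))
print("overlap score:", round(jaccard, 2))
```

Treating each cluster as a set throws away the word order, which is the point here: the two clusters share eight of their nine words, and only “boat” and “war” differ.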

My critique of Paper Machines might stem from user error. With only three words in each cluster, though, it is difficult to point the finger at myself. I am willing to say that I might have selected the wrong option and that, with each use, I became less enthused and more irritated with the project, and possibly gave up in frustration at my inability to use the software correctly. Regardless of my poor results with Paper Machines, I can see how topic modeling can be beneficial and can possibly shed new light on old ideas. I can also admit my analytical skills are not up to par when it comes to the use of topic modeling; put another way, if I extrapolated anything meaningful from the clusters that added to the larger scope of academia, it would be by accident.
