In case someone is interested in running this: as of today I was able to run the earlier version of this project, by the same author, which is available here:
https://github.com/StanfordHCI/termite/blob/master/README.old
While trying to find a way to make the current project run, I found that this .txt file works as an example input:
It is a good example that matches the format of the old version.
The old readme was friendlier when it came to making sense of how to run the code. One thing to look out for is that the setup script will throw an error when it gets to compiler-latest.zip, saying it couldn't move the file. As of today (I was surprised to see all download links still working 4 years later!), the .zip contains a jar file with the expected name+version. Simply extract that jar file and rename it to closure.jar inside the lib folder. Re-running the script will then rename it to the intended name and finish installing.
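The manual fix described above could also be scripted. This is only a rough sketch, not part of the project: the function name `extract_closure_jar` is made up, and it assumes the first .jar entry in the archive is the one the setup script wants and that the lib folder path is passed in by you.

```python
import shutil
import zipfile
from pathlib import Path


def extract_closure_jar(zip_path, lib_dir):
    """Copy the first .jar found in the archive to lib_dir/closure.jar.

    Hypothetical helper: it assumes the archive holds a single relevant
    .jar and that the setup script looks for lib/closure.jar.
    """
    lib_dir = Path(lib_dir)
    lib_dir.mkdir(parents=True, exist_ok=True)
    target = lib_dir / "closure.jar"
    with zipfile.ZipFile(zip_path) as archive:
        # Collect every archive entry whose name ends in .jar.
        jar_names = [n for n in archive.namelist() if n.lower().endswith(".jar")]
        if not jar_names:
            raise FileNotFoundError("no .jar file found in " + str(zip_path))
        # Stream the entry straight to its new name, ignoring any
        # folder structure inside the archive.
        with archive.open(jar_names[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Something like `extract_closure_jar("compiler-latest.zip", "lib")` before re-running the setup script should have the same effect as doing it by hand, assuming the zip sits in the current folder.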
When running the script, I had some issues with the config file path, but the script lets you pass the 3 paths explicitly:
./execute.py --corpus-path ~/Desktop/finance_corpus.txt carlos_lda.cfg --model-path example-project2/topic-model/ --data-path example-project2/
It will create the folders for you, or overwrite them if they exist. The paths given on the command line above, from left to right, are the same ones the .cfg file requires from top to bottom. Provide the corpus.txt from the link mentioned above (or any file that follows the format doc-id\ttext) and it should work. In practice, I ran into errors somewhere along the pipeline. This corpus, which is already tokenized, ran like a breeze instead, so I imagine it is best to tokenize with some other library before feeding text in here.
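For anyone preparing their own corpus, here is a minimal sketch of writing pre-tokenized documents in the doc-id\ttext format mentioned above. The regex tokenizer is just a stand-in for whatever tokenization library you prefer, and the names `to_corpus_line` and `write_corpus` are made up for this example; none of this comes from the project itself.

```python
import re

# Stand-in tokenizer: lowercase runs of ASCII letters and digits.
# Swap in a real tokenization library if you need something better.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def to_corpus_line(doc_id, text):
    """Return one corpus line: doc id, a tab, then space-joined tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    # Tokenizing also strips tabs and newlines from the text, so each
    # document stays on a single line with exactly one tab separator.
    return f"{doc_id}\t{' '.join(tokens)}"


def write_corpus(docs, path):
    """Write (doc_id, text) pairs to path, one document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, text in docs:
            out.write(to_corpus_line(doc_id, text) + "\n")
```

For example, `write_corpus([("doc1", "Stocks rallied today."), ("doc2", "Bond yields fell.")], "my_corpus.txt")` produces a file that can be passed to --corpus-path, assuming the script accepts any file in this format.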
Finally, there was an issue on the old project where the author of the code pointed out that a small corpus may cause an error due to running out of vocabulary or something similar.
The visualization for this file took about half an hour to finish on a 2016 MacBook with 16 GB of RAM. In contrast, running LDA with the R topicmodels package takes about 3 minutes, plus about 1 minute to load the result into another visualization project that refers to this one (LDAvis on GitHub).
I wish the visualization didn't attempt to run the entire process from the start, but instead took the data as input, as the other authors did (i.e. the matrices and a few vectors). That would make reuse a lot easier.
If anyone is interested in what the output looks like in the end, here it is:
You can also select multiple topics. Sadly, the old version does not include the document view pane and the project seems abandoned now.
