Humanities researchers and literary scholars commonly struggle to make sense of, and extract information from, patterns in large document collections. Computational techniques for finding similarities between documents in such collections continue to be developed and are invaluable for understanding these collections as single entities.
Topic modelling is a text-processing technique widely used to identify structure in document collections, and it has been applied successfully across a wide range of applications. However, it suffers from several weaknesses that make it difficult to use and may lead to unreliable results:
- It depends on preprocessing to remove stop-words and other text features: this is a common first step in many NLP applications, and the results produced are highly dependent on the quality of the selections made at this stage.
- The number of topics typically needs to be determined a priori: ideally the number of topics could be inferred from the data itself.
- Interpretation of resulting topics is difficult: topics are normally represented by a set of most representative words for the topic, but these words do not relate directly to human experience.
- Visualization of produced topics is not intuitive: the optimal way of visualizing topics is a current area of active research.
- Incorporating additional dimensions into a model (e.g. the evolution of topics over time) is not straightforward.
- Topic models scale poorly for large document collections: this is mainly due to the computational complexity of the algorithm.
There have been various efforts to overcome these limitations, with varying degrees of success. However, such solutions are not yet common practice, and taking full advantage of them requires additional knowledge and effort.
The aim of this project is to develop topic modelling tools which can help humanities researchers and literary scholars to make sense of large document collections without requiring extensive expertise in tuning model parameters.