Maintaining Topic Models for Growing Corpora

Felix Kuhr, Magnus Bender, Tanya Braun, Ralf Möller

Abstract

A reference library can be described as a corpus with an individual composition of documents. Over time, the corpus might grow because an agent decides to extend it with additional documents, e.g., new publications or new articles. Existing approaches use topic modelling techniques to compare documents within the same corpus by their topic distributions. However, for new documents, only the text is available, not a topic distribution. Thus, this paper describes three techniques for estimating the topic distributions of new, unseen documents, taking the initial documents in the corpus into account. Additionally, we present an extensive evaluation of the performance and runtime of the three techniques for various scenarios and corpora of different sizes.

Original language: English
Title of host publication: 2020 IEEE 14th International Conference on Semantic Computing (ICSC)
Number of pages: 8
Publisher: IEEE
Publication date: 02.2020
Pages: 451-458
Article number: 9031467
ISBN (Print): 978-1-7281-6333-8
ISBN (Electronic): 978-1-7281-6332-1
Publication status: Published - 02.2020
Event: 14th IEEE International Conference on Semantic Computing - San Diego, United States
Duration: 03.02.2020 - 05.02.2020
Conference number: 158497

Research Areas and Centers

  • Centers: Center for Artificial Intelligence Luebeck (ZKIL)
  • Research Area: Intelligent Systems

DFG Research Classification Scheme

  • 409-06 Information Systems, Process and Knowledge Management
