Informatics and Applications
2020, Volume 14, Issue 3, pp 129-135
USING TOPIC MODELS FOR PAIRWISE COMPARISON OF COLLECTIONS OF SCIENTIFIC PAPERS
- F. V. Krasnov
- A. V. Dimentov
- M. E. Shvartsman
Abstract
The authors propose a new technique for pairwise comparison of collections of scientific articles using a topic model. The methodology is called Comparative Topic Analysis (CTA). CTA yields not only a quantitative assessment of the similarity of two collections but also the structural differences between them. The authors developed a transparent visualization of the distance between text collections. The study compares existing approaches to topic modeling with respect to the task of comparing collections of scientific papers; probabilistic and generative topic models are considered. The requirements that text collections must meet for CTA to be applied correctly are analyzed. The CTA methodology proved highly effective at identifying structural differences between related collections. The authors developed an integral metric, the "Content Uniqueness Ratio", which allows text collections to be compared with each other. In the computational experiment, the topic model with additive regularization (ARTM) proved to be the most informative.
About this article
Cover Date
2020-09-30
DOI
10.14357/19922264200318
Print ISSN
1992-2264
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
comparative topic analysis; comparative text model; deep text analysis; topic models metrics
Authors
F. V. Krasnov, A. V. Dimentov, and M. E. Shvartsman
Author Affiliations
NAUMEN R&D, 49A Tatishcheva Str., Ekaterinburg 620028, Russian Federation
National Electronic Information Consortium, 5 Letnikovskaya Str., Moscow 115114, Russian Federation
Russian State Library, 3/5 Vozdvigenka Str., Moscow 119019, Russian Federation