Informatics and Applications

2021, Volume 15, Issue 1, pp 30-41

METHODS OF CROSS-LINGUAL TEXT REUSE DETECTION IN LARGE TEXTUAL COLLECTIONS

  • R. V. Kuznetsova
  • O. Yu. Bakhteev
  • Yu. V. Chekhovich

Abstract

The paper investigates the problem of cross-lingual text reuse detection. It proposes reducing the problem to a monolingual one: the suspicious document is translated into the language of the collection and then analyzed monolingually. A major requirement for the proposed method is robustness to machine translation ambiguity. The analysis after translation proceeds in two steps. In the first step, the authors retrieve candidate documents that are likely to be sources of the reused text. For robustness, retrieval is based on word clusters constructed using distributional semantics. In the second step, the authors compare the suspicious document with the candidates using sentence embeddings obtained from deep neural networks. Experiments were conducted for the English-Russian language pair, both on synthetic data and on articles included in the Russian Science Citation Index.
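The two-step scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-to-cluster table, the Jaccard ranking, and the function names are all assumptions made for illustration, and the sentence vectors passed to `cosine` stand in for embeddings that a neural encoder would produce.

```python
import math

# Hypothetical word-to-cluster table. In practice it would be obtained by
# clustering distributional word vectors; these entries are made up.
WORD_TO_CLUSTER = {
    "car": 0, "automobile": 0, "vehicle": 0,
    "fast": 1, "quick": 1, "rapid": 1,
    "river": 2, "stream": 2,
}


def cluster_ids(text):
    """Map a text to the set of cluster ids of its known words."""
    words = text.lower().split()
    return {WORD_TO_CLUSTER[w] for w in words if w in WORD_TO_CLUSTER}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_candidates(query, collection, top_k=1):
    """Step 1: rank documents by overlap of cluster ids with the query.

    Synonyms that share a cluster still match, which is the intended
    robustness to alternative translations of the same word.
    """
    q = cluster_ids(query)
    ranked = sorted(
        collection,
        key=lambda doc_id: jaccard(q, cluster_ids(collection[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]


def cosine(u, v):
    """Step 2: cosine similarity of two sentence vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

In this sketch a translated query such as "a rapid car" would match a document containing "the automobile is quick" even though the two share no content words, because the words fall into the same clusters. The sentences of the retrieved candidates would then be compared with the query sentences via `cosine`, with a similarity threshold chosen empirically.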
