Informatics and Applications

2021, Volume 15, Issue 1, pp 30-41

METHODS OF CROSS-LINGUAL TEXT REUSE DETECTION IN LARGE TEXTUAL COLLECTIONS

R. V. Kuznetsova
O. Yu. Bakhteev
Yu. V. Chekhovich

Abstract

The paper investigates the problem of cross-lingual text reuse detection. It proposes a monolingual approach: the suspicious document is translated into the language of the collection and then analyzed monolingually. A major requirement for the proposed method is robustness to machine translation ambiguity. The subsequent analysis is divided into two steps. In the first step, the authors retrieve candidate documents that are likely to be the source of the reused text. For robustness, the candidates are retrieved using word clusters constructed with distributional semantics. In the second step, the authors compare the suspicious document with the candidates using sentence embeddings produced by deep neural networks. The experiments were conducted for the English-Russian language pair on both synthetic data and articles indexed in the Russian Science Citation Index.
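
The two-step analysis described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the suspicious text has already been machine-translated, treats each document as a single sentence, and replaces the learned components with toy stand-ins: the `WORD_CLUSTER` table stands in for clusters built with distributional semantics, and averaged `WORD_VEC` vectors stand in for neural sentence embeddings. All names, tables, and thresholds here are hypothetical.

```python
import math

# Hypothetical word -> cluster id table (toy values). Translation variants of
# the same concept share one cluster id, which gives robustness to ambiguity.
WORD_CLUSTER = {
    "car": 0, "automobile": 0, "vehicle": 0,
    "fast": 1, "quick": 1, "rapid": 1,
    "drives": 2, "moves": 2,
    "cat": 3, "sleeps": 4, "sofa": 5,
}

# Hypothetical 2-d word vectors; averaging them is a stand-in for a trained
# neural sentence encoder, not the model used in the paper.
WORD_VEC = {
    "car": (1.0, 0.0), "automobile": (0.9, 0.1), "vehicle": (0.8, 0.2),
    "fast": (0.0, 1.0), "quick": (0.1, 0.9), "rapid": (0.1, 1.0),
    "drives": (0.7, 0.7), "moves": (0.6, 0.8),
    "cat": (-1.0, 0.0), "sleeps": (0.0, -1.0), "sofa": (-0.7, -0.7),
}


def cluster_ids(text):
    """Map a text to the set of cluster ids of its known words."""
    return {WORD_CLUSTER[w] for w in text.lower().split() if w in WORD_CLUSTER}


def retrieve_candidates(suspicious, collection, min_overlap=0.5):
    """Step 1: keep documents whose cluster sets overlap the query (Jaccard)."""
    query = cluster_ids(suspicious)
    hits = []
    for doc_id, text in collection.items():
        doc = cluster_ids(text)
        union = query | doc
        score = len(query & doc) / len(union) if union else 0.0
        if score >= min_overlap:
            hits.append(doc_id)
    return hits


def embed(sentence):
    """Toy sentence embedding: the mean of the known word vectors."""
    vecs = [WORD_VEC[w] for w in sentence.lower().split() if w in WORD_VEC]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_reuse(suspicious, collection, sim_threshold=0.95):
    """Step 2: compare the query only with the candidates from step 1."""
    query_vec = embed(suspicious)
    return [
        doc_id
        for doc_id in retrieve_candidates(suspicious, collection)
        if cosine(query_vec, embed(collection[doc_id])) >= sim_threshold
    ]


collection = {
    "a": "the automobile moves quick",
    "b": "the cat sleeps on the sofa",
}
print(detect_reuse("the car drives fast", collection))
```

The cluster lookup in the first step is what lets "car" and "automobile" match even though the translated wording differs from the source. The second step then runs the more expensive embedding comparison only on the short candidate list rather than on the whole collection.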

