Systems and Means of Informatics

2019, Volume 29, Issue 3, pp 77-91

SUPRACORPORA DATABASES IN LINGUISTIC PROJECTS

A. Yu. Egorova
I. M. Zatsman
O. S. Mamonova

Abstract

The paper considers how supracorpora databases can support linguistic studies. Such databases contain aligned parallel texts (each consisting of an original text and its translation) together with bilingual annotations of the linguistic items under study and their translation equivalents, created on the basis of the parallel texts. Each annotation, created by a linguist, records a translation model of a linguistic item. Experience from several linguistic projects carried out at the Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences showed that not all translation models that linguists extract from parallel texts during annotation are described in bilingual dictionaries and handbooks. Supracorpora databases thus allow researchers to create new knowledge about the translation equivalents of the items under study. Linguists extract this knowledge by comparing and annotating the sentences of the original text with their translations. The main aim of the paper is to describe the functions of supracorpora databases that provide linguists with this new knowledge in the process of annotation.
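The structure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, field names, and the function undocumented_models are hypothetical, and the data are placeholders. The sketch only shows how annotations built on aligned pairs could be compared against a bilingual dictionary to surface translation models the dictionary does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned fragment: an original sentence and its translation."""
    source: str
    target: str


@dataclass(frozen=True)
class Annotation:
    """A linguist's record of how one item was translated in one pair.

    All field names are assumptions made for this illustration.
    """
    pair: AlignedPair
    item: str        # linguistic item under study, as found in the original
    equivalent: str  # its translation equivalent in the translated text


def undocumented_models(annotations, dictionary):
    """Return sorted (item, equivalent) models missing from the dictionary.

    `dictionary` maps an item to the set of equivalents it already lists.
    """
    found = {(a.item, a.equivalent) for a in annotations}
    return sorted(
        (item, eq) for item, eq in found
        if eq not in dictionary.get(item, set())
    )


# Placeholder data: "X" is an item, "Y1" and "Y2" are its equivalents.
annotations = [
    Annotation(AlignedPair("source sentence 1", "target sentence 1"), "X", "Y1"),
    Annotation(AlignedPair("source sentence 2", "target sentence 2"), "X", "Y2"),
]
dictionary = {"X": {"Y1"}}

new_models = undocumented_models(annotations, dictionary)
print(new_models)
```

Here the pair ("X", "Y2") is reported because the placeholder dictionary lists only "Y1" for "X"; in a real project the comparison would depend on how the linguists define and normalize translation models.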

