Informatics and Applications
2023, Volume 17, Issue 4, pp 81-87
PARALLEL CORPUS ANNOTATION: APPROACHES AND DIRECTIONS FOR DEVELOPMENT
Abstract
Possible directions for the development of parallel corpus annotation tools are presented in light of the current state of this area. The three main approaches to research on corpus material, (i) corpus-based, (ii) corpus-driven, and (iii) corpus-illustrated, are considered, and the differences between them are briefly described. It is demonstrated that, despite the abundance of corpus annotation tools, the vast majority of them are designed for monolingual corpora and/or offer very limited functionality for annotating textual data. The broadest set of functions is provided by the supracorpora databases, and the web applications for accessing them, that are being developed at FRC CSC RAS: (i) forming the blocks of original and translated text that are necessary and sufficient for analyzing an occurrence of the studied language unit and its translation variant; (ii) identifying the occurrence of the studied language unit and its translation variant; (iii) selecting features that characterize the use of the studied language unit and its translation variant; and (iv) selecting features that characterize the translation correspondence. This set of functions supports a significant share of research tasks, but it can be extended. Three directions for extending the existing functionality are suggested that can provide a more detailed description of linguistic material.
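The four annotation functions listed in the abstract can be pictured as fields of a single annotation record. The following Python fragment is only an illustrative sketch under stated assumptions: the class names, the `mark` helper, the example sentences, and the feature labels are all hypothetical and are not taken from the supracorpora databases or their web applications.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """Character offsets [start, end) of an occurrence inside a text block."""
    start: int
    end: int


@dataclass
class AnnotatedUnit:
    """One side of a pair: a text block, the marked occurrence, its features."""
    block: str                                        # function (i): text block
    span: Span                                        # function (ii): occurrence
    features: set[str] = field(default_factory=set)   # function (iii): unit features

    def text(self) -> str:
        """Return the surface form of the marked occurrence."""
        return self.block[self.span.start:self.span.end]


@dataclass
class TranslationCorrespondence:
    """An original unit, its translation variant, and features of the pair."""
    source: AnnotatedUnit
    target: AnnotatedUnit
    features: set[str] = field(default_factory=set)   # function (iv): pair features


def mark(block: str, unit: str, features: set[str]) -> AnnotatedUnit:
    """Hypothetical helper: mark the first occurrence of `unit` in `block`."""
    start = block.find(unit)
    if start < 0:
        raise ValueError(f"{unit!r} does not occur in the block")
    return AnnotatedUnit(block, Span(start, start + len(unit)), set(features))


# Invented example pair; the feature labels are placeholders, not a real tag set.
pair = TranslationCorrespondence(
    source=mark("Он устал, но продолжал идти.", "но", {"relation:adversative"}),
    target=mark("He was tired, but he kept walking.", "but", {"relation:adversative"}),
    features={"correspondence:direct"},
)
print(pair.source.text(), "->", pair.target.text())
```

Keeping the features of each unit separate from the features of the correspondence mirrors the distinction the abstract draws between functions (iii) and (iv).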
About this article
Cover Date
2023-12-10
DOI
10.14357/19922264230411
Print ISSN
1992-2264
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Keywords
parallel corpus; corpus linguistics; corpus annotation; linguistic annotation
Authors
A. A. Goncharov
Author Affiliations
Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation