Informatics and Applications

2018, Volume 12, Issue 4, pp 96-105

USING SUPRACORPORA DATABASES FOR QUANTITATIVE ANALYSIS OF MACHINE TRANSLATIONS

N. V. Buntman
A. A. Goncharov
I. M. Zatsman
V. A. Nuriev

Abstract

The paper discusses an information technology that supports expert evaluation of machine translations. The technology has been developed to meet the following conditions: (i) every translated context contains a connective; (ii) the connectives may be either one-word (khotya 'although,' a 'and') or multiword (da esche 'and beside this,' no zato 'but instead'); and (iii) the words making up a given connective may be separated by a space (esli (space) tak 'if (space) then'). With this technology, expert evaluation of machine translations proceeds in three main stages: (i) linguistic annotation of machine translations in a supracorpora database; (ii) quantitative processing of the annotations; and (iii) linguistic analysis of the annotations and the quantitative data. The paper describes the technological aspects of the first two stages. All examples given involve multiword connectives. The source sentences chosen for machine translation were drawn from literary texts.
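The three conditions on connectives can be illustrated with a minimal sketch. It is not the authors' supracorpora database technology. The inventory `CONNECTIVES`, the set `GAPPED`, the `max_gap` parameter, and both function names are hypothetical, and the sample sentences in the test are invented transliterations used only for illustration. The sketch finds one-word, multiword, and gapped connectives in a sentence and then counts them, roughly mirroring the annotation and quantitative-processing stages.

```python
import re
from collections import Counter

# Hypothetical inventory: connective label -> ordered parts.
CONNECTIVES = {
    "khotya": ("khotya",),            # one-word connective
    "da esche": ("da", "esche"),      # multiword, contiguous
    "no zato": ("no", "zato"),        # multiword, contiguous
    "esli ... tak": ("esli", "tak"),  # multiword, parts may be separated
}
# Assumption: only these connectives allow other words between their parts.
GAPPED = {"esli ... tak"}


def _matches_from(tokens, start, parts, gap):
    """Check that parts[1:] follow tokens[start], each within `gap` extra tokens."""
    pos = start
    for part in parts[1:]:
        window = tokens[pos + 1 : pos + 2 + gap]
        if part not in window:
            return False
        pos = pos + 1 + window.index(part)
    return True


def find_connectives(sentence, max_gap=10):
    """Return labels of connectives found in a transliterated sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    found = []
    for label, parts in CONNECTIVES.items():
        gap = max_gap if label in GAPPED else 0
        for i, token in enumerate(tokens):
            if token == parts[0] and _matches_from(tokens, i, parts, gap):
                found.append(label)
    return found


def count_connectives(sentences):
    """Aggregate connective frequencies over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(find_connectives(sentence))
    return counts
```

A real system would work on expert annotations of aligned source and machine-translated contexts rather than on plain string matching; the sketch only shows why contiguous and gapped multiword connectives need different matching windows.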

