Systems and Means of Informatics
2019, Volume 29, Issue 4, pp 106-118
ON METHODS OF MACHINE TRANSLATION QUALITY ASSESSMENT
Abstract
The article discusses approaches to defining machine translation (MT) quality and several methods of translation quality assessment. Its aim is to review methods and approaches to both human and automatic evaluation of MT quality. The first part of the article describes relative human evaluation (ranking of translations) and absolute evaluation based on penalties for translation errors, as well as software tools and algorithms that simplify human assessment. Particular attention is paid to the DQF/MQM (Dynamic Quality Framework/Multidimensional Quality Metrics) error typology as the most flexible one, since it is not tied to a limited subject domain. The second part of the article reviews metrics for automatic MT quality assessment that do not rely on linguistic data, as well as correlation coefficients between human and automatic evaluations.
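As a brief aside for readers unfamiliar with such metrics, the sketch below illustrates the two ideas mentioned in the second part of the abstract: an automatic MT metric that needs no linguistic resources (here, a chrF-style character n-gram F-score) and the Pearson correlation between automatic and human scores. This sketch is not taken from the article; the function names, parameter choices, and toy data are illustrative assumptions only.

# Minimal illustrative sketch (not from the article): a string-based MT metric
# requiring no linguistic resources (chrF-style character n-gram F-score) and
# the Pearson correlation between automatic and human scores.
from collections import Counter

def char_ngrams(text, n):
    """Counter of character n-grams; runs of whitespace collapsed to one space
    (one common choice; chrF implementations differ in whitespace handling)."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_score(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-beta score, averaging precision and recall over n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

if __name__ == "__main__":
    # Toy segment-level example with hypothetical human adequacy ratings (1-5 scale).
    references = ["the cat sat on the mat", "he reads a book every day", "it is raining heavily today"]
    hypotheses = ["the cat sits on a mat", "he is reading book every day", "it rains heavy today"]
    human = [4.0, 3.5, 2.5]
    auto = [chrf_score(h, r) for h, r in zip(hypotheses, references)]
    print("chrF-style scores:", [round(s, 3) for s in auto])
    print("Pearson r(auto, human):", round(pearson(auto, human), 3))

In practice such scores would be computed over a full test set and correlated with human judgments at the segment or system level; beta greater than 1 weights recall more heavily than precision, a common choice for character-level metrics.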
About this article
Title
ON METHODS OF MACHINE TRANSLATION QUALITY ASSESSMENT
Journal
Systems and Means of Informatics
Volume 29, Issue 4, pp 106-118
Cover Date
2019-11-30
DOI
10.14357/08696527190410
Print ISSN
0869-6527
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
machine translation (MT); machine translation quality; rankings of translations and translation systems; MT quality metrics; error typologies; translation quality assessment; human evaluation of MT quality
Authors
A. K. Rychikhin
Author Affiliations
Institute of Informatics Problems, Federal Research Center "Computer Science and Control", Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation