Informatics and Applications
2022, Volume 16, Issue 2, pp 52-59
- A. A. Durnovo
- O. Yu. Inkova
- N. A. Popkova
The article examines annotation strategies in corpora with discourse markup. It shows that corpora based on Rhetorical Structure Theory (RST) contain only annotations of coherence relations, also called rhetorical relations (RR). In contrast, the Penn Discourse Treebank (PDTB) of the University of Pennsylvania annotates the markers of these relations, as does the Supracorpora Database of Connectives. The RST Signaling Corpus (RST-SC), also based on RST, annotates RR markers but cannot combine the markup of RRs and their markers in a single annotation. This problem is solved by the GUM corpus and the Supracorpora Database of Hierarchy of Logical-Semantic Relations. The latter has several advantages: it supports search, provides statistics, and allows bilingual annotations to be formed. These capabilities make it possible to identify both universal and language-specific phenomena in the discourse organization of text.
Key words
supracorpora database; text corpus annotation; discourse relations; connective
A. A. Durnovo, O. Yu. Inkova, and N. A. Popkova
 Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
 University of Geneva, 22 Bd des Philosophes, CH-1205 Geneva 4, Switzerland