Systems and Means of Informatics
2024, Volume 34, Issue 1, pp. 128-138
COLLECTIVE ENTITY RESOLUTION IN TECHNOLOGY OF CONCRETE HISTORICAL INVESTIGATION SUPPORT
- I. M. Adamovich
- O. I. Volkov
Abstract
This article continues the development of a distributed technology for supporting concrete historical investigations. The technology is based on crowdsourcing principles and is aimed at a wide range of users who are not professional historians or biographers. It is extended here with an entity resolution algorithm for processing nominative documents. The algorithm performs collective resolution, in which the entities for matching links are determined jointly. It is a modification of the greedy agglomerative clustering algorithm. The article describes the approach underlying the algorithm in detail and gives its high-level pseudocode. It analyzes the algorithm's effectiveness on data with varying degrees of name ambiguity and estimates the degree of name ambiguity in concrete historical data. The authors conclude that including the algorithm in the technology is expedient and outline directions for further research on determining the algorithm's configurable parameters.
About this article
Cover Date
2024-04-10
DOI
10.14357/08696527240111
Print ISSN
0869-6527
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
concrete historical investigation; distributed technology; entity resolution; greedy algorithm; relational similarity measure
Author Affiliations
Federal Research Center "Computer Science and Control", Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation