Systems and Means of Informatics
2023, Volume 33, Issue 3, pp 149-160
DATA CLEANSING IN THE TECHNOLOGY OF CONCRETE HISTORICAL INVESTIGATION SUPPORT
- I. M. Adamovich
- O. I. Volkov
Abstract
This article continues a series of works on a technology for supporting concrete historical research. The technology is based on the principles of co-creation and crowdsourcing and is designed for a wide range of users who are not professional historians or biographers. The article shows that it is expedient to use machine learning methods to expand the list of concrete historical research tasks solved within the framework of this technology.
Because concrete historical information is fragmented and inconsistent, data preparation is especially important. This article addresses the specifics of cleansing concrete historical data and analyzes whether mechanisms and algorithms already integrated into the technology can be used for this purpose. The main directions of data cleansing are listed, and suitable tools already included in the technology are identified for each direction. Particular attention is paid to tools for eliminating inconsistencies. The stages of data cleansing are listed, and a scheme of the interaction of all mechanisms and algorithms described in the article is given.
About this article
Cover Date
2023-11-10
DOI
10.14357/08696527230313
Print ISSN
0869-6527
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
concrete historical investigation; distributed technology; machine learning; data cleansing; data inconsistency
Authors
I. M. Adamovich and O. I. Volkov
Author Affiliations
Federal Research Center "Computer Science and Control", Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation