Systems and Means of Informatics
2023, Volume 33, Issue 4, pp 149-159
CLASS IMBALANCE IN THE TECHNOLOGY OF CONCRETE HISTORICAL INVESTIGATION SUPPORT
- I. M. Adamovich
- O. I. Volkov
Abstract
This article continues a series of works on the technology of concrete historical investigation support, which is built on the principles of co-creation and crowdsourcing and is intended for a wide range of users who are nonprofessional historians and biographers. It further develops the topic of preparing data for the machine learning algorithms used in the technology.
The special importance of binary classification for concrete historical research is shown. The problem of class imbalance in binary classification with machine learning algorithms is described, together with its consequences. It is shown that concrete historical data can be highly imbalanced. An overview of approaches to eliminating class imbalance is given. The specifics of concrete historical data are analyzed, and on this basis the oversampling approach is chosen as the most suitable for the technology. Algorithms implementing this approach are described, and their advantages and disadvantages are evaluated. The ADASYN algorithm is selected as the most promising for use in the technology. Finally, the article evaluates how far the means of controlling data noise and outliers that are already included in the technology can compensate for the sensitivity of ADASYN to outliers.
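The abstract does not give an implementation, so the following is only a minimal sketch of ADASYN-style oversampling, written from the generally published description of the algorithm and not from the authors' technology. The function name `adasyn`, the parameters `k`, `beta`, and `seed`, and the sample points are all illustrative assumptions.

```python
import math
import random


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Generate synthetic minority points in the style of ADASYN (sketch)."""
    rng = random.Random(seed)
    # Minority points come first, so minority[i] is labelled[i].
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    # Number of synthetic points needed; beta=1.0 aims at full balance.
    total = round((len(majority) - len(minority)) * beta)
    if total <= 0 or len(minority) < 2:
        return []

    # r_i: share of majority points among the k nearest neighbours of x_i.
    ratios = []
    for i, x in enumerate(minority):
        others = sorted(
            (math.dist(x, p), label)
            for j, (p, label) in enumerate(labelled)
            if j != i
        )
        ratios.append(sum(1 for _, label in others[:k] if label == 0) / k)
    norm = sum(ratios)
    if norm == 0:
        return []  # no minority point has majority neighbours

    synthetic = []
    for i, x in enumerate(minority):
        # Harder points (larger r_i) receive more synthetic samples.
        count = round(ratios[i] / norm * total)
        # Interpolate only towards the k nearest minority neighbours.
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        for _ in range(count):
            target = rng.choice(neighbours)
            lam = rng.random()
            synthetic.append(
                tuple(a + lam * (b - a) for a, b in zip(x, target))
            )
    return synthetic


# Invented toy data: three minority points, eight majority points.
minority = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0)]
majority = [(1.1, 1.0), (0.9, 1.0), (1.0, 1.1), (1.0, 0.9),
            (1.2, 1.2), (0.8, 0.8), (3.0, 3.0), (3.1, 3.0)]
new_points = adasyn(minority, majority)
print(len(new_points), "synthetic minority points")
```

In this toy data the minority point (1.0, 1.0) lies entirely among majority points, so it receives the largest share of synthetic samples. This is the behaviour behind the sensitivity to outliers noted above: an isolated, possibly erroneous minority record attracts the most new points, which is why prior noise and outlier control matters.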
About this article
Cover Date
2023-12-10
DOI
10.14357/08696527230414
Print ISSN
0869-6527
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
concrete historical investigation; distributed technology; machine learning; class imbalance; ADASYN algorithm
Authors
I. M. Adamovich and O. I. Volkov
Author Affiliations
Federal Research Center "Computer Science and Control", Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation