Informatics and Applications
2014, Volume 8, Issue 4, pp 94-109
METHODS OF ENTITY RESOLUTION AND DATA FUSION IN THE ETL-PROCESS AND THEIR IMPLEMENTATION IN THE HADOOP ENVIRONMENT
- A. E. Vovchenko
- L. A. Kalinichenko
- D. Yu. Kovalev
Abstract
Extracting entities, transforming them, and loading them into an integrated repository are the main problems of data integration. These actions make up the ETL (extract-transform-load) process. An entity is a digital representation of a real-world object (for example, information about a person). Entity resolution covers duplicate detection, deduplication, record linkage, object identification, reference matching, and other ETL-related tasks. After the entity resolution step, related entities should be merged into a single reference entity that contains the information from all of them. Data fusion is the final step in the data integration process. The paper gives an overview of entity resolution and data fusion methods. It also presents techniques for programming these methods to implement the ETL process in the Hadoop environment. That part of the paper uses the High-Level Integration Language (HIL), a declarative language focused on the resolution and fusion of entities in the Hadoop infrastructure.
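The two steps named in the abstract, entity resolution followed by merging into one reference entity, can be illustrated with a minimal single-machine sketch in Python. It is not the authors' method and not HIL code. The records, the field names, the 0.85 name-similarity threshold, and the "first non-null value wins" fusion rule are all assumptions chosen for illustration; a Hadoop implementation would distribute the pairwise comparison rather than run it in one loop.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the field names are illustrative assumptions.
records = [
    {"id": 1, "name": "John Smith", "email": "jsmith@example.com", "phone": None},
    {"id": 2, "name": "Jon Smith", "email": None, "phone": "555-0100"},
    {"id": 3, "name": "Alice Brown", "email": "abrown@example.com", "phone": None},
]


def similar(a, b, threshold=0.85):
    """Treat two names as the same entity if their string similarity is high enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def resolve(records):
    """Entity resolution: cluster records whose names match, using union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Naive all-pairs comparison; real systems use blocking to avoid this.
    for a, b in combinations(records, 2):
        if similar(a["name"], b["name"]):
            parent[find(b["id"])] = find(a["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r)
    return list(clusters.values())


def fuse(cluster):
    """Data fusion: build one reference entity, taking the first non-null value per field."""
    fused = {"source_ids": [r["id"] for r in cluster]}
    for field in ("name", "email", "phone"):
        fused[field] = next((r[field] for r in cluster if r[field] is not None), None)
    return fused


entities = [fuse(cluster) for cluster in resolve(records)]
```

Here records 1 and 2 end up in one cluster, and the fused entity combines the email from the first with the phone number from the second. Taking the first non-null value is only one possible conflict-resolution rule; others, such as voting or preferring the most recent source, would replace the body of `fuse`.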
About this article
Cover Date
2014-10-30
DOI
10.14357/19922264140412
Print ISSN
1992-2264
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
data integration; ETL; entity resolution; data fusion; big data; Hadoop; Jaql; HIL
Authors
A. E. Vovchenko, L. A. Kalinichenko, and D. Yu. Kovalev
Author Affiliations
Institute of Informatics Problems, Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
Faculty of Computational Mathematics and Cybernetics, M. V. Lomonosov Moscow State University, 1-52 Leninskiye Gory, GSP-1, Moscow 119991, Russian Federation