White, T. 2012. Hadoop: The definitive guide. 3rd ed. O'Reilly Media. 688 p.
Apache Hadoop 2.5.1. Available at: http://hadoop.apache.org/ (accessed November 01, 2014).
Naumann, F., and M. Herschel. 2010. An introduction to duplicate detection. Synthesis lectures on data management. Morgan & Claypool. Lecture No. 3. 87 p.
Christen, P. 2012. Data matching - concepts and techniques for record linkage, entity resolution, and duplicate detection. Data-centric systems and applications ser. Springer. 272 p.
Fan, W., and F. Geerts. 2012. Foundations of data quality management. Synthesis lectures on data management. Morgan & Claypool. Lecture No. 29. 217 p.
Bleiholder, J., and F. Naumann. 2009. Data fusion. ACM Computing Surveys (CSUR) 41(1). Article No. 1. doi: 10.1145/1456650.1456651.
Kopcke, H., A. Thor, and E. Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endowment 3(1-2):484-493.
Kopcke, H., and E. Rahm. 2010. Frameworks for entity matching: A comparison. Data Knowledge Engineering 69(2):197-210. doi: 10.1016/j.datak.2009.10.003.
Ganti, V., and A. Das Sarma. 2013. Data cleaning: A practical perspective. Synthesis lectures on data management. Morgan & Claypool. Lecture No. 36. 85 p.
Getoor, L., and A. Machanavajjhala. 2013. Entity resolution for big data. 19th ACM SIGKDD Conference (International) on Knowledge Discovery and Data Mining (KDD'13) Proceedings. Chicago. 1527-1527.
Bleiholder, J., and F. Naumann. 2005. Declarative data fusion - syntax, semantics, and implementation. East European Conference on Advances in Databases and Information Systems (ADBIS) Proceedings. Tallinn. 58-73.
Dong, L. X., and F. Naumann. 2009. Data fusion - resolving data conflicts in Integration. Proc. VLDB Endowment 2(2):1654-1655.
Bleiholder, J. 2010. Data fusion and conflict resolution in integrated information systems. Potsdam. D.Sc. Diss. 184 p.
Winkler, W. E. 2006. Overview of record linkage and current research directions. Research report ser. No. 2006-2. Washington, DC: Statistical Research Division, U.S. Census Bureau. 44 p. Available at: http://www.census.gov/srd/papers/pdf/rrs2006-2.pdf (accessed November 01, 2014).
Adamic, L. A., and E. Adar. 2003. Friends and neighbors on the Web. Social Networks 25:211-230.
Bilenko, M., R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. 2003. Adaptive name matching in information integration. IEEE Intell. Syst. 18(5):16-23.
Monge-Elkan distance function. Available at: http://www.gabormelli.com/RKB/Monge-Elkan_Distance_Function (accessed November 01, 2014).
Cochinwala, M., V. Kurien, G. Lalk, and D. Shasha. 2001. Efficient data reconciliation. Inform. Sci. Int. J. 137(1-4):1-15.
Bilenko, M., and R. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. 9th ACM SIGKDD Conference (International) on Knowledge Discovery and Data Mining (SIGKDD 2003) Proceedings. Washington. 39-48.
Christen, P. 2008. Automatic record linkage using seeded nearest neighbour and support vector machine classification. 14th ACM SIGKDD Conference (International) on Knowledge Discovery and Data Mining (KDD'2008) Proceedings. Las Vegas. 151-159.
Chen, Z., D. V. Kalashnikov, and S. Mehrotra. 2009. Exploiting context analysis for combining multiple entity resolution systems. 2009 ACM SIGMOD Conference (International) on Management of Data (SIGMOD 2009) Proceedings. Providence. 207-218.
Gupta, R., and S. Sarawagi. 2009. Answering table augmentation queries from unstructured lists on the Web. Proc. VLDB Endowment 2(1):289-300.
Ravikumar, P., and W. Cohen. 2004. A hierarchical graphical model for record linkage. 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004) Proceedings. Virginia. 454-461.
Tejada, S., C. A. Knoblock, and S. Minton. 2001. Learning object identification rules for information integration. Inform. Syst. Data Extraction Cleaning Reconciliation 26(8):607-633.
Sarawagi, S., and A. Bhamidipaty. 2002. Interactive deduplication using active learning. 8th ACM SIGKDD Conference (International) on Knowledge Discovery and Data Mining (KDD 2002) Proceedings. Edmonton. 269-278.
Arasu, A., M. Gotz, and R. Kaushik. 2010. On active learning of record matching packages. 2010 ACM SIGMOD Conference (International) on Management of Data Proceedings. Indianapolis. 783-794.
Bellare, K., S. Iyengar, A. G. Parameswaran, and V. Rastogi. 2012. Active sampling for entity matching. 18th ACM SIGKDD Conference (International) on Knowledge Discovery and Data Mining (KDD 2012) Proceedings. Beijing. 1131-1139.
Adam, K., E. Wu, D. Karger, S. Madden, and R. Miller. 2011. Human-powered sorts and joins. Proc. VLDB Endowment 5(1):13-24.
Wang, J., T. Kraska, M. J. Franklin, and J. Feng. 2012. CrowdER: Crowdsourcing Entity Resolution. Proc. VLDB Endowment 5(11):1483-1494.
Ananthakrishna, R., S. Chaudhuri, and V. Ganti. 2002. Eliminating fuzzy duplicates in data warehouses. 28th Conference (International) on Very Large Data Bases (VLDB 2002) Proceedings. Hong Kong. 586-597.
Fan, W., F. Geerts, X. Jia, and A. Kementsietsidis. 2007. Conditional functional dependencies for data cleaning. 2007 IEEE 23rd Conference (International) on Data Engineering Proceedings. Istanbul. 746-755.
Fan, W. 2008. Dependencies revisited for improving data quality. 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2008) Proceedings. Vancouver. 159-170.
Benjelloun, O., H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. 2009. Swoosh: A generic approach to Entity Resolution. VLDB Int. J. 18(1):255-276.
Bhattacharya, I., and L. Getoor. 2007. Collective Entity Resolution in relational data. ACM Trans. Knowledge Discovery Data (TKDD) 1(1). Article No. 5. doi: 10.1145/1217299.1217304.
Bhattacharya, I., and L. Getoor. 2007. A latent Dirichlet model for unsupervised Entity Resolution. 6th SIAM Conference (International) on Data Mining Proceedings. Maryland. 47-58.
Broecheler, M., and L. Getoor. 2010. Probabilistic similarity logic. 26th Conference on Uncertainty in Artificial Intelligence Proceedings. Corvallis. 73-82.
Rajaraman, A., and J. D. Ullman. 1996. Integrating information by outerjoins and full disjunctions. 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1996) Proceedings. Montreal. 238-248.
Dean, J., and S. Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51(1):107-113.
MapReduce tutorial. Available at: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (accessed November 01, 2014).
Apache Pig Project. Available at: http://pig.apache.org/ (accessed November 01, 2014).
The Apache Hive data warehouse. Available at: http://hive.apache.org/ (accessed November 01, 2014).
IBM InfoSphere BigInsights Version 3.0, Jaql reference. Available at: http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.jaql.doc/doc/c0057749.html (accessed November 01, 2014).
Sarma, D. A., A. Jain, A. Machanavajjhala, and P. Bohannon. 2012. An automatic blocking mechanism for large-scale de-duplication tasks. 21st ACM Conference (International) on Information and Knowledge Management Proceedings. Maui. 1055-1064.
Papadakis, G., E. Ioannou, C. Niederee, T. Palpanas, and W. Nejdl. 2012. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. 5th ACM Conference (International) on Web Search and Data Mining Proceedings. Seattle. 53-62.
Kolb, L., A. Thor, and E. Rahm. 2012. Dedoop: Efficient deduplication with Hadoop. Proc. VLDB Endowment 5(12):1878-1881.
Hernandez, M., G. Koutrika, R. Krishnamurthy, L. Popa, and R. Wisnesky. 2013. HIL: A high-level scripting language for entity integration. 16th Conference (International) on Extending Database Technology (EDBT'13) Proceedings. Genoa. 549-560.

Informatics and Applications

METHODS OF ENTITY RESOLUTION AND DATA FUSION IN THE ETL-PROCESS AND THEIR IMPLEMENTATION IN THE HADOOP ENVIRONMENT
