Systems and Means of Informatics

2015, Volume 25, Issue 1, pp 34-53

MULTICRITERIA METHOD FOR DETECTING NEAR-DUPLICATES IN A STREAM OF TEXT MESSAGES

A. Andreev
D. Berezkin
I. Kozlov
K. Simakov

Abstract

The problem of near-duplicate detection in a stream of text messages is considered. A model of a text document and a multicriteria method for identifying duplicates are proposed. The model can be flexibly adjusted to different domains. The method is based on binary classification using a support vector machine. To keep the approach efficient, the paper also provides a method for prefiltering duplicate candidates. Several experiments were carried out on data obtained from a stream of news articles. The results show the feasibility of the suggested approach.
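
The abstract does not give the details of the document model, the criteria, or the prefilter, so the following Python fragment is only an illustrative sketch of the general pipeline shape: prefilter candidates, compute several similarity criteria for each candidate pair, then apply a binary decision. Everything in it is an assumption made for illustration: the `title` and `body` fields, the word-shingle Jaccard criteria, the shared-word prefilter, and the fixed `weights` and `bias`, which stand in for a linear decision function that a trained support vector machine would supply.

```python
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def criteria(doc_a, doc_b):
    """Hypothetical criteria vector: title word overlap, body shingle overlap."""
    return [
        jaccard(shingles(doc_a["title"], 1), shingles(doc_b["title"], 1)),
        jaccard(shingles(doc_a["body"]), shingles(doc_b["body"])),
    ]


def build_index(docs):
    """Inverted index: body word -> set of ids of messages containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in set(doc["body"].lower().split()):
            index[word].add(doc_id)
    return index


def prefilter(new_doc, index, min_shared=2):
    """Ids of earlier messages sharing at least min_shared body words."""
    counts = defaultdict(int)
    for word in set(new_doc["body"].lower().split()):
        for doc_id in index.get(word, ()):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c >= min_shared)


def is_duplicate(features, weights=(1.0, 2.0), bias=-1.0):
    """Linear decision rule; weights and bias are made-up placeholders,
    not values learned by an actual SVM."""
    return sum(w * x for w, x in zip(weights, features)) + bias > 0


# Toy stream: two earlier messages and one incoming message.
docs = {
    1: {"title": "Bank raises rates",
        "body": "the central bank raised interest rates today"},
    2: {"title": "Local team wins",
        "body": "the local team won the final match"},
}
new = {"title": "Central bank raises rates",
       "body": "the central bank raised interest rates on monday"}

index = build_index(docs)
candidates = prefilter(new, index)  # only these pairs reach the classifier
duplicates = [d for d in candidates if is_duplicate(criteria(new, docs[d]))]
```

In this sketch the prefilter keeps the number of pairwise comparisons small, and the classifier sees only a short vector of criteria per pair. In a real setting the decision function would be trained on labeled duplicate and non-duplicate pairs rather than fixed by hand.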

