Systems and Means of Informatics
2015, Volume 25, Issue 1, pp 34-53
MULTICRITERIA METHOD FOR DETECTING NEAR-DUPLICATES IN A STREAM OF TEXT MESSAGES
- A. Andreev
- D. Berezkin
- I. Kozlov
- K. Simakov
Abstract
The problem of near-duplicate detection in a stream of text messages
is considered. A model of a text document and a multicriteria method for
identifying duplicates are proposed. The model can be flexibly adjusted to
different domains. The method is based on binary classification using a support
vector machine. The paper also describes a candidate prefiltering method
that keeps the approach efficient. Several experiments were carried out on
data obtained from a stream of news articles. The results show
the feasibility of the suggested approach.
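
The pipeline outlined in the abstract (prefilter candidates, compute several similarity criteria for each pair, then make a binary decision) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three criteria, the word-overlap prefilter, the function names, and the weights `W` and bias `B`. The weights stand in for a linear decision function that a trained support vector machine would supply; here they are set by hand.

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()"

def tokens(text):
    """Lowercase word tokens; a simple stand-in for a real document model."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def shingles(words, k=2):
    """Set of overlapping k-word sequences."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def criteria(doc_a, doc_b):
    """One similarity score per criterion (hypothetical criteria)."""
    wa, wb = tokens(doc_a), tokens(doc_b)
    word_sim = jaccard(set(wa), set(wb))
    shingle_sim = jaccard(shingles(wa), shingles(wb))
    length_ratio = min(len(wa), len(wb)) / max(len(wa), len(wb), 1)
    return [word_sim, shingle_sim, length_ratio]

# Hypothetical hand-set parameters; a trained linear SVM would provide these.
W = [2.0, 3.0, 1.0]
B = -3.0

def is_near_duplicate(doc_a, doc_b):
    """Binary decision: sign of the linear function w.x + b."""
    x = criteria(doc_a, doc_b)
    return sum(w * v for w, v in zip(W, x)) + B > 0

def build_index(docs):
    """Inverted index: word -> ids of stored documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in set(tokens(text)):
            index[w].add(doc_id)
    return index

def candidates(new_doc, index, min_shared=3):
    """Prefilter: stored documents sharing at least min_shared distinct words."""
    counts = defaultdict(int)
    for w in set(tokens(new_doc)):
        for doc_id in index.get(w, ()):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c >= min_shared)

if __name__ == "__main__":
    stored = {
        1: "The central bank raised interest rates on Monday.",
        2: "Local team wins the football championship final.",
    }
    incoming = "The central bank raised interest rates on Monday evening."
    index = build_index(stored)
    for doc_id in candidates(incoming, index):
        print(doc_id, is_near_duplicate(stored[doc_id], incoming))
```

The prefilter keeps the expensive pairwise comparison away from documents that share almost no vocabulary with the incoming message, so only a short candidate list reaches the classifier.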
About this article
Cover Date
2013-11-30
DOI
10.14357/08696527150103
Print ISSN
0869-6527
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
near-duplicate detection; similarity measure; binary classification
Authors
A. Andreev, D. Berezkin, I. Kozlov, and K. Simakov
Author Affiliations
Bauman Moscow State Technical University, 5 Baumanskaya 2nd Str.,
Moscow 105005, Russian Federation