Informatics and Applications
2017, Volume 11, Issue 3, pp 60-72
IMPROVING CLASSIFICATION QUALITY FOR THE TASK OF FINDING INTRINSIC PLAGIARISM
- I. O. Molybog
- A. P. Motrenko
- V. V. Strijov
Abstract
The paper addresses the classification problem in multidimensional spaces. The authors propose a supervised modification of the t-distributed stochastic neighbor embedding (t-SNE) algorithm. Unlike the original algorithm, the proposed modification does not require retraining when new data are added to the training set, and it can be easily parallelized. The method was applied to detect intrinsic plagiarism in a collection of documents. The authors also tested the algorithm on synthetic data and showed that classification quality is higher with the proposed algorithm than without dimension reduction or with other dimension reduction algorithms.
About this article
Cover Date
2017-09-30
DOI
10.14357/19922264170307
Print ISSN
1992-2264
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
data analysis; dimension reduction; nonlinear dimension reduction; manifold learning; intrinsic plagiarism detection
Authors
I. O. Molybog, A. P. Motrenko, and V. V. Strijov
Author Affiliations
Center for Energy Systems, Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, 3 Nobel Str., Moscow 143026, Russian Federation
Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation
A. A. Dorodnicyn Computing Center, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 40 Vavilov Str., Moscow 119333, Russian Federation