Informatics and Applications
2018, Volume 12, Issue 2, pp 75-82
AUTOMATIC METADATA EXTRACTION FROM SCIENTIFIC PDF DOCUMENTS
- A. V. Ogaltsov
- O. Y. Bakhteev
Abstract
The authors investigate metadata extraction from scientific PDF documents in Russian. PDF documents have a rich layout, which makes metadata difficult to extract. The authors propose a method that treats the blocks returned by a PDF parser as objects in a machine learning task. The feature space is built not only from text statistics but also from the formatting and positioning features of each block.
The authors ran computational experiments and compared their approach with a baseline.
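As an illustration of the idea described in the abstract, the sketch below maps a single parser block to a feature vector that mixes text statistics with formatting and positioning features. It is not the authors' implementation: the `Block` schema, the `block_features` function, and the particular features chosen are assumptions made for this example, and the abstract does not specify the actual feature set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A text block as a PDF parser might report it (hypothetical schema)."""
    text: str
    x: float          # left edge, in page units
    y: float          # top edge, measured from the top of the page
    font_size: float
    bold: bool


def block_features(block: Block, page_width: float, page_height: float) -> List[float]:
    """Map one block to a numeric feature vector (illustrative feature set)."""
    chars = [c for c in block.text if not c.isspace()]
    n = len(chars) or 1  # avoid division by zero on empty blocks
    return [
        # text statistics
        float(len(block.text.split())),          # word count
        sum(c.isdigit() for c in chars) / n,     # share of digits
        sum(c.isupper() for c in chars) / n,     # share of uppercase letters
        # formatting features
        block.font_size,
        1.0 if block.bold else 0.0,
        # positioning features, normalised to the page size
        block.x / page_width,
        block.y / page_height,
    ]


# A made-up title block near the top of a 600 x 800 page.
title = Block("AUTOMATIC METADATA EXTRACTION", x=60.0, y=80.0, font_size=16.0, bold=True)
features = block_features(title, page_width=600.0, page_height=800.0)
```

Vectors of this kind, one per block, could then be passed to any standard classifier that assigns each block a metadata label such as title, author, or abstract.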
About this article
Cover Date
2018-05-30
DOI
10.14357/19922264180211
Print ISSN
1992-2264
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
metadata extraction; natural language processing; layout features; information retrieval; metadescriptions
Authors
A. V. Ogaltsov and O. Y. Bakhteev
Author Affiliations
National Research University Higher School of Economics, 20 Myasnitskaya Str., Moscow 101000, Russian Federation
Antiplagiat JSC, 33 Varshavskoe Shosse, Moscow 117105, Russian Federation
Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation