Informatics and Applications

2018, Volume 12, Issue 2, pp 75-82

AUTOMATIC METADATA EXTRACTION FROM SCIENTIFIC PDF DOCUMENTS

  • A. V. Ogaltsov
  • O. Y. Bakhteev

Abstract

The authors investigate the task of metadata extraction from scientific PDF documents in Russian. PDF documents have a rich layout, which makes metadata difficult to extract. The authors propose a method that treats the text blocks returned by a PDF parser as objects in a machine learning task. The feature space is built not only from text statistics but also from the formatting and positioning features of each block. The authors performed computational experiments and compared their approach with a baseline.
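
As an illustration of the block-as-object idea, the following minimal Python sketch maps one parser block to a small feature vector that combines text statistics, formatting, and positioning. It is not the authors' implementation: the `Block` schema, the `block_features` function, and the specific features are hypothetical choices, assumed here only to show what such a feature space might look like.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One text block as a PDF parser might report it (hypothetical schema)."""
    text: str
    x0: float         # left edge, in page units
    y0: float         # top edge, measured from the top of the page
    x1: float         # right edge
    y1: float         # bottom edge
    font_size: float
    bold: bool


def block_features(block: Block, page_width: float, page_height: float) -> dict:
    """Map a block to text, formatting, and positioning features."""
    text = block.text.strip()
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    return {
        # Text statistics (illustrative choices, not taken from the paper)
        "n_words": len(words),
        "digit_ratio": sum(ch.isdigit() for ch in text) / max(len(text), 1),
        "upper_ratio": sum(ch.isupper() for ch in letters) / max(len(letters), 1),
        # Formatting
        "font_size": block.font_size,
        "is_bold": float(block.bold),
        # Positioning, normalized by page size
        "rel_top": block.y0 / page_height,
        "rel_center_x": (block.x0 + block.x1) / 2 / page_width,
        "rel_width": (block.x1 - block.x0) / page_width,
    }


# A centered, bold, upper-case block near the top of an A4-sized page
title = Block("АВТОМАТИЧЕСКОЕ ИЗВЛЕЧЕНИЕ МЕТАДАННЫХ", 100, 80, 495, 110, 16.0, True)
features = block_features(title, page_width=595, page_height=842)
```

Feature dictionaries of this kind, one per block, could then be passed to any standard classifier that assigns each block a metadata label such as title, author, or abstract; the label set and the choice of classifier are likewise assumptions of this sketch.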
