Informatics and Applications

September 2013, Volume 7, Issue 3, pp 2-13

UNSUPERVISED APPROACH TO WEB WRAPPER MAINTENANCE

  • A. M. Andreev
  • D. V. Berezkin
  • I. A. Kozlov
  • K. V. Simakov

Abstract

HTML wrapper applications rely on formatting regularities of the websites they target. Maintaining such applications therefore requires detecting markup changes in web pages. This article describes an unsupervised approach to this problem. The proposed detection method consists of two parts: a real-time part based on clustering, which treats an HTML document as a vector of features, and a time-lagged part based on comparing the distributions of those features over learning and testing sets of HTML documents. Several experiments were carried out on data obtained from a real wrapper. The results demonstrate the feasibility of the suggested approach.
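
To make the two-part idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the features, the clustering algorithm, or the distribution test, so everything here is an assumption: the feature set `TAGS` (opening-tag counts), a single centroid with a fixed `radius` standing in for the clustering step, and a two-sample Kolmogorov-Smirnov statistic standing in for the distribution comparison. All function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from html.parser import HTMLParser
import math

# Hypothetical feature set: counts of a few common opening tags.
TAGS = ("div", "table", "tr", "td", "a", "span")


class _TagCounter(HTMLParser):
    """Counts opening tags while parsing a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def features(html):
    """Map an HTML document to a vector of opening-tag counts."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    return [float(parser.counts[t]) for t in TAGS]


def centroid(vectors):
    """Component-wise mean of the learning-set vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_outlier(vec, center, radius):
    """Real-time check (assumed form): page lies outside the learned cluster."""
    return distance(vec, center) > radius


def ks_statistic(xs, ys):
    """Time-lagged check (assumed form): largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for p in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, p) / len(xs)
        fy = bisect_right(ys, p) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


# Toy data: the same content before and after a hypothetical redesign.
old_page = "<table><tr><td><a href='#'>x</a></td></tr></table>"
new_page = "<div><div><span><a href='#'>x</a></span></div></div>"

learning = [features(old_page) for _ in range(5)]
testing = [features(new_page) for _ in range(5)]
center = centroid(learning)
radius = 1.0  # hypothetical threshold; in practice it would be tuned

changed_now = is_outlier(features(new_page), center, radius)
div_index = TAGS.index("div")
div_shift = ks_statistic([v[div_index] for v in learning],
                         [v[div_index] for v in testing])
```

In this toy setup the redesigned page falls outside the learned cluster immediately, while the per-feature statistic summarizes how far a batch of later pages has drifted from the learning set. A real wrapper would need richer features and calibrated thresholds.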
