Informatics and Applications, September 2013, Volume 7, Issue 3, pp. 2-13

UNSUPERVISED APPROACH TO WEB WRAPPER MAINTENANCE
Abstract

HTML wrapper applications rely on formatting regularities of the targeted websites, so maintaining such applications requires detecting markup changes in web pages. This article describes an unsupervised approach to this problem. The proposed detection method consists of two parts: a real-time part based on clustering, which treats each HTML document as a vector of features, and a time-lagged part based on comparing the distributions of those features between the learning and testing sets of HTML documents. Several experiments were carried out with data obtained from a real wrapper. The results indicate that the suggested approach is feasible.
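
The two-part detection idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the feature set (counts of a few tags, the hypothetical `TAGS` tuple), a single-centroid distance check standing in for the clustering step, the `slack` factor, and the two-sample Kolmogorov-Smirnov statistic as the way to compare feature distributions. The abstract does not state which features, clustering algorithm, or distribution test the article uses.

```python
from bisect import bisect_right
from collections import Counter
from html.parser import HTMLParser
import math

# Hypothetical feature set: how often each of these tags opens in a page.
TAGS = ("div", "table", "tr", "td", "a", "span")


class TagCounter(HTMLParser):
    """Counts opening tags while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def features(html):
    """Represent an HTML document as a vector of tag counts."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [float(parser.counts[t]) for t in TAGS]


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_cluster(train_vectors):
    """Summarise the learning set as one cluster: centroid and radius."""
    n = len(train_vectors)
    centre = [sum(col) / n for col in zip(*train_vectors)]
    radius = max(distance(v, centre) for v in train_vectors)
    return centre, radius


def is_outlier(vector, centre, radius, slack=1.5):
    """Real-time check: flag a page that falls far outside the cluster."""
    return distance(vector, centre) > slack * radius + 1e-9


def ks_statistic(xs, ys):
    """Time-lagged check for one feature: the largest gap between the
    empirical CDFs of the learning sample xs and the testing sample ys."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )
```

In use, `fit_cluster` would be run once on feature vectors of pages the wrapper is known to handle, `is_outlier` would be applied to each newly fetched page, and `ks_statistic` would be computed per feature over a batch of recent pages; a value near 1 for some feature suggests that its distribution has shifted since the learning set was collected.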