Intelligent Web Robot for Content Extraction
Author: Wenxing HONG, Jie LI, Weiwei WANG, Yang WENG
Affiliation:

Automation Department, Xiamen University, Xiamen 361005;
College of Mathematics, Sichuan University, Chengdu 610065

    Abstract:

    The main content of a news web page contains large quantities of valuable information and is a data source for Natural Language Processing (NLP) and Information Retrieval (IR). This paper proposes a method that formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties, such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy over 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
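
    The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parser class, the function names, and the exact feature definitions (subtree text length, slash-joined tag path, depth, attribute flags) are assumptions chosen for illustration, and only the Python standard library is used. The classification step, which the paper performs with XGBoost, is not shown; a classifier would consume one numeric row per node.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}


class NodeFeatureParser(HTMLParser):
    """Collects one feature record per element node of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []  # records of the currently open elements
        self.nodes = []  # all element records, in document order

    def handle_starttag(self, tag, attrs):
        node = {
            "tag": tag,
            # Tag path: ancestor tags joined with "/", ending in this tag.
            "tag_path": "/".join([n["tag"] for n in self.stack] + [tag]),
            "depth": len(self.stack) + 1,
            "attrs": dict(attrs),
            "text_length": 0,
        }
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; this tolerates
        # unclosed tags, which are common in real news pages.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Ignore script and style bodies; they are not visible text.
        if any(n["tag"] in ("script", "style") for n in self.stack):
            return
        length = len(data.strip())
        # Credit the text to every open ancestor, so a container node
        # sees the total text length of its subtree.
        for node in self.stack:
            node["text_length"] += length


def extract_node_features(html):
    """Return a list of per-node feature records for an HTML string."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def to_feature_row(node):
    """Flatten a record into numeric features a classifier could consume."""
    attrs = node["attrs"]
    return [
        node["text_length"],
        node["depth"],
        len(attrs),
        int("class" in attrs),
        int("id" in attrs),
    ]


SAMPLE = (
    "<html><body><div class='nav'><a href='/'>Home</a></div>"
    "<div id='content'><p>First paragraph.</p><p>Second one.</p></div>"
    "</body></html>"
)

if __name__ == "__main__":
    for record in extract_node_features(SAMPLE):
        print(record["tag_path"], to_feature_row(record))
```

    In this sketch the content container accumulates far more text than the navigation block, which is the kind of signal a node classifier can learn from. The tag path is left as a string here; it would need to be encoded numerically before training.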

Get Citation

Wenxing HONG, Jie LI, Weiwei WANG, Yang WENG. Intelligent Web Robot for Content Extraction[J]. Instrumentation, 2019, 6(3): 52-58

History
  • Online: October 29, 2020
License
  • Copyright (c) 2023 by the authors. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.