Intelligent Web Robot for Content Extraction


Author: Wenxing HONG, Jie LI, Weiwei WANG, Yang WENG
Affiliation: Automation Department, Xiamen University, Xiamen 361005; College of Mathematics, Sichuan University, Chengdu 610065

Abstract:

The main content of a news web page contains large quantities of valuable information and serves as a data source for Natural Language Processing (NLP) and Information Retrieval (IR). This paper formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
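
The node-level features named in the abstract can be illustrated with a minimal sketch. It assumes Python's standard-library `html.parser`; the class `NodeFeatureExtractor`, the function `extract_features`, and the feature names (`tag_path`, `depth`, `text_length`, `has_class`, `inside_link`) are hypothetical and are not taken from the paper, whose exact feature set the abstract does not list. The classifier itself is omitted: each row would be labeled and passed to a model such as XGBoost.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeFeatureExtractor(HTMLParser):
    """Collect one feature row per element that directly contains text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.rows = []   # one dict of features per text-bearing element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append({"tag": tag, "attrs": dict(attrs), "text": []})

    def handle_data(self, data):
        # Attribute non-whitespace text to the innermost open element.
        if self.stack and data.strip():
            self.stack[-1]["text"].append(data.strip())

    def handle_endtag(self, tag):
        # Assumption: well-formed input; mismatched close tags are ignored.
        if not self.stack or self.stack[-1]["tag"] != tag:
            return
        tags = [node["tag"] for node in self.stack]
        node = self.stack.pop()
        text = " ".join(node["text"])
        if text:
            self.rows.append({
                "tag_path": "/".join(tags),
                "depth": len(tags),
                "text_length": len(text),
                "has_class": "class" in node["attrs"],
                "inside_link": "a" in tags,
            })


def extract_features(html):
    """Return feature rows for every text-bearing element in `html`."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

On a typical news page, navigation links would tend to produce short rows with `inside_link` set, while article paragraphs produce longer text under a stable tag path; this is the kind of signal a node classifier can learn from.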

Get Citation

Wenxing HONG, Jie LI, Weiwei WANG, Yang WENG. Intelligent Web Robot for Content Extraction[J]. Instrumentation, 2019, 6(3): 52-58

History

Online: October 29, 2020

License

Creative Commons Attribution-ShareAlike 4.0 International License.
