Article: http://www.qdaily.com/articles/65298.html

List page: http://www.qdaily.com/

newspaper (Released: Jun 14, 2017)

Install: pip3 install newspaper3k

GitHub: https://github.com/codelucas/newspaper

# Pass in the article URL; explicitly specifying language gives better results
In [9]: from newspaper import Article
In [10]: article = Article(url, language='zh')   # url is the article link above
In [11]: article.download()
In [12]: article.parse()

In [21]: article.html
Out[21]: '<!DOCTYPE html><html><head> <meta charset="UTF-8"> ...'
In [23]: article.title
Out[23]: '大公司头条:拼多多年活跃买家接近阿里;三星发布新款移动处理器,抢占 5G 芯片份额'
In [24]: article.text
Out[24]: '我们每天为你摘取最重要的商业新闻 ...'
In [25]: article.publish_date    # no Out -- publish_date is None for this page
In [29]: article.authors
Out[29]: []
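
The same steps as a plain script (a minimal sketch based on the session above; the URL is the article linked at the top, and the print calls are only for illustration):

from newspaper import Article

url = 'http://www.qdaily.com/articles/65298.html'

article = Article(url, language='zh')   # 'zh' because the page is Chinese
article.download()                      # fetch the raw HTML
article.parse()                         # run the extraction

print(article.title)          # extracted headline
print(article.text[:200])     # first 200 characters of the body text
print(article.publish_date)   # None here -- no date was detected
print(article.authors)        # [] here -- no author was detected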

# Pass in raw HTML instead of a URL
>>> import requests
>>> from newspaper import fulltext
>>> html = requests.get(...).text
>>> text = fulltext(html, language='zh')   # returns the extracted body text as a string
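
A self-contained version of the fulltext route (a sketch; the User-Agent header and timeout are my own additions, not something newspaper requires):

import requests
from newspaper import fulltext

url = 'http://www.qdaily.com/articles/65298.html'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text

# fulltext() skips Article.download() entirely and works on HTML you already
# have, which helps when the page needs special headers or a rendered DOM.
text = fulltext(html, language='zh')
print(text[:200])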

# List-page (source) extraction -- the results are underwhelming
In [32]: import newspaper
In [33]: paper = newspaper.build(home_url, language='zh')   # home_url is the list page above

In [34]: list(paper.category_urls())
Out[34]: []
In [36]: paper.articles
Out[36]: []
In [37]: paper.article_urls()
Out[37]: []
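
For completeness, a build call with the URL cache disabled (a sketch; memoize_articles=False is a real newspaper keyword, but using it here is my assumption to rule out the cache returning an empty list on reruns):

import newspaper

home_url = 'http://www.qdaily.com/'

# memoize_articles=False stops newspaper from skipping URLs it has already
# seen in a previous run, which otherwise makes repeat builds look empty.
paper = newspaper.build(home_url, language='zh', memoize_articles=False)

print(paper.size())                  # number of articles newspaper discovered
for article in paper.articles[:10]:  # Article objects, not yet downloaded
    print(article.url)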

readability (Released: Jan 13, 2019)

Install: pip install readability-lxml

GitHub: https://github.com/buriy/python-readability

# readability can only extract the content and the title

In [41]: import requests
In [42]: from readability import Document
In [43]: html = requests.get(url).text    # same article URL as above
In [46]: doc = Document(html)

In [48]: doc.title()
Out[48]: '大公司头条:拼多多年活跃买家接近阿里;三星发布新款移动处理器,抢占 5G 芯片份额_商业_好奇心日报'

In [50]: doc.summary()
Out[50]: '<html><body><div><div class="detail"> \n\n<h3 nocleanhtml="true">拼多多年活跃买家接近阿里<br>\n</h3>\n<p>昨夜,<a href="https://investor.pinduoduo.com/static-files/0d0c22e7-188a-4a4f-9692-a1c0c452c474" rel="nofollow">拼多多发布三季度财报</a>,7-9 月营收 142 亿人民币,比去年同期增长 89%,也是拼多多今年三个季度最高增速,此前 Q1 和 Q2 的分别为 43.89% 和 67%。这个季度拼多多净亏损缩窄至 7.84 亿元,去年同期亏损 22.'
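
Since doc.summary() returns a cleaned HTML fragment rather than plain text, an extra stripping step is usually wanted (a sketch; converting to text with lxml is my own choice here, readability itself stops at the HTML):

import requests
import lxml.html
from readability import Document

url = 'http://www.qdaily.com/articles/65298.html'
html = requests.get(url, timeout=10).text

doc = Document(html)
print(doc.title())         # full <title>, site suffix included
print(doc.short_title())   # readability's attempt to trim the suffix

# summary() gives the main-content HTML; strip the remaining tags with lxml.
content_html = doc.summary()
content_text = lxml.html.fromstring(content_html).text_content()
print(content_text[:200])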

gne (Released: Oct 7, 2021)

Install: pip install gne

GitHub: https://github.com/GeneralNewsExtractor/GeneralNewsExtractor

Docs: https://generalnewsextractor.readthedocs.io/zh_CN/latest/

In [68]: from gne import GeneralNewsExtractor
In [69]: extractor = GeneralNewsExtractor()

# noise_node_list: XPaths of noise nodes to strip before extraction
# with_body_html: also return the extracted body as HTML
In [70]: extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
Out[70]:
{'title': '大公司头条:拼多多年活跃买家接近阿里;三星发布新款移动处理器,抢占 5G 芯片份额',
 'author': '',
 'publish_time': '2020-11-13 10:36:44',
 'content': '龚方毅\n2020-11-13 10:36:44\n我们每天为你摘取最重要的商业新闻。\n拼多多年活跃买家接近阿里\n昨夜 ...',
 'images': ['http://img',]}
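
The gne flow as a script (a sketch; gne only sees the HTML you hand it, so the requests call, the timeout and the specific noise XPath are assumptions on my part):

import requests
from gne import GeneralNewsExtractor

url = 'http://www.qdaily.com/articles/65298.html'
html = requests.get(url, timeout=10).text

extractor = GeneralNewsExtractor()
result = extractor.extract(
    html,
    noise_node_list=['//div[@class="comment-list"]'],  # strip the comment block before extraction
    with_body_html=True,                               # also return the body's HTML
)

print(result['title'])
print(result['publish_time'])
print(result['content'][:200])
print(result['images'])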

# List-page extraction -- the results are not great either
>>> from gne import ListPageExtractor
>>> html = '''rendered HTML of the list page'''
>>> list_extractor = ListPageExtractor()
>>> result = list_extractor.extract(html, feature='XPath of any element inside the list')
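
A concrete call might look like this (a sketch; the XPath //div[@class="item"]//a is a made-up placeholder for one element inside the qdaily list, not a verified selector, and a JS-rendered list page may need a headless browser rather than plain requests):

import requests
from gne import ListPageExtractor

home_url = 'http://www.qdaily.com/'
html = requests.get(home_url, timeout=10).text

list_extractor = ListPageExtractor()
# feature is the XPath of any one element inside the repeated list items;
# gne infers the surrounding list structure from it.
result = list_extractor.extract(html, feature='//div[@class="item"]//a')
print(result)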