Article: http://www.qdaily.com/articles/65298.html

List page: http://www.qdaily.com/

newspaper (Released: Jun 14, 2017)

Install: pip3 install newspaper3k

GitHub: https://github.com/codelucas/newspaper

# Pass in the article URL; explicitly specifying language gives better results
In [9]: from newspaper import Article
In [10]: article = Article(url, language='zh')   # url is the article link above
In [11]: article.download()
In [12]: article.parse()

In [21]: article.html
Out[21]: '<!DOCTYPE html><html><head> <meta charset="UTF-8"> ...'
In [23]: article.title
Out[23]: '大公司头条:拼多多年活跃买家接近阿里;三星发布新款移动处理器,抢占 5G 芯片份额'
In [24]: article.text
Out[24]: '我们每天为你摘取最重要的商业新闻 ...'
In [25]: article.publish_date    # no Out -- publish_date is None for this page
In [29]: article.authors
Out[29]: []
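
The same steps as a plain script (a minimal sketch based on the session above; the URL is the article linked at the top, and the print calls are only for illustration):

from newspaper import Article

url = 'http://www.qdaily.com/articles/65298.html'

article = Article(url, language='zh')   # 'zh' because the page is Chinese
article.download()                      # fetch the raw HTML
article.parse()                         # run the extraction

print(article.title)          # extracted headline
print(article.text[:200])     # first 200 characters of the body text
print(article.publish_date)   # None here -- no date was detected
print(article.authors)        # [] here -- no author was detected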

# Pass in raw HTML instead of a URL
>>> import requests
>>> from newspaper import fulltext
>>> html = requests.get(...).text
>>> text = fulltext(html, language='zh')   # returns the extracted body text as a string
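
A self-contained version of the fulltext route (a sketch; the User-Agent header and timeout are my own additions, not something newspaper requires):

import requests
from newspaper import fulltext

url = 'http://www.qdaily.com/articles/65298.html'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text

# fulltext() skips Article.download() entirely and works on HTML you already
# have, which helps when the page needs special headers or a rendered DOM.
text = fulltext(html, language='zh')
print(text[:200])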

# List-page (source) extraction -- the results are underwhelming
In [32]: import newspaper
In [33]: paper = newspaper.build(home_url, language='zh')   # home_url is the list page above

In [34]: list(paper.category_urls())
Out[34]: []
In [36]: paper.articles
Out[36]: []
In [37]: paper.article_urls()
Out[37]: []
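
For completeness, a build call with the URL cache disabled (a sketch; memoize_articles=False is a real newspaper keyword, but using it here is my assumption to rule out the cache returning an empty list on reruns):

import newspaper

home_url = 'http://www.qdaily.com/'

# memoize_articles=False stops newspaper from skipping URLs it has already
# seen in a previous run, which otherwise makes repeat builds look empty.
paper = newspaper.build(home_url, language='zh', memoize_articles=False)

print(paper.size())                  # number of articles newspaper discovered
for article in paper.articles[:10]:  # Article objects, not yet downloaded
    print(article.url)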

readability (Released: Jan 13, 2019)

Install: pip install readability-lxml

GitHub: https://github.com/buriy/python-readability

# readability can only extract the content and the title

In [41]: import requests
In [42]: from readability import Document
In [43]: html = requests.get(url).text    # same article URL as above
In [46]: doc = Document(html)

In [48]: doc.title()
Out[48]: '大公司头条:拼多多年活跃买家接近阿里;三星发布新款移动处理器,抢占 5G 芯片份额_商业_好奇心日报'

In [50]: doc.summary()
Out[50]: '<html><body><div><div class="detail"> \n\n<h3 nocleanhtml="true">拼多多年活跃买家接近阿里<br>\n</h3>\n<p>昨夜,<a href="https://investor.pinduoduo.com/static-files/0d0c22e7-188a-4a4f-9692-a1c0c452c474" rel="nofollow">拼多多发布三季度财报</a>,7-9 月营收 142 亿人民币,比去年同期增长 89%,也是拼多多今年三个季度最高增速,此前 Q1 和 Q2 的分别为 43.89% 和 67%。这个季度拼多多净亏损缩窄至 7.84 亿元,去年同期亏损 22.'
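
Since doc.summary() returns a cleaned HTML fragment rather than plain text, an extra stripping step is usually wanted (a sketch; converting to text with lxml is my own choice here, readability itself stops at the HTML):

import requests
import lxml.html
from readability import Document

url = 'http://www.qdaily.com/articles/65298.html'
html = requests.get(url, timeout=10).text

doc = Document(html)
print(doc.title())         # full <title>, site suffix included
print(doc.short_title())   # readability's attempt to trim the suffix

# summary() gives the main-content HTML; strip the remaining tags with lxml.
content_html = doc.summary()
content_text = lxml.html.fromstring(content_html).text_content()
print(content_text[:200])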

gne (Released: Oct 7, 2021)

Install: pip install gne

GitHub: https://github.com/GeneralNewsExtractor/GeneralNewsExtractor

Docs: https://generalnewsextractor.readthedocs.io/zh_CN/latest/

In [68]: from gne import GeneralNewsExtractor
In [69]: extractor = GeneralNewsExtractor()

# noise_node_list: XPaths of noise nodes to strip before extraction
# with_body_html: also return the extracted body as HTML
In [70]: extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
Out[70]:
{'title': '大公司头条:拼多多年活跃买家接近阿里;三星发布新款移动处理器,抢占 5G 芯片份额',
 'author': '',
 'publish_time': '2020-11-13 10:36:44',
 'content': '龚方毅\n2020-11-13 10:36:44\n我们每天为你摘取最重要的商业新闻。\n拼多多年活跃买家接近阿里\n昨夜 ...',
 'images': ['http://img',]}
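
The gne flow as a script (a sketch; gne only sees the HTML you hand it, so the requests call, the timeout and the specific noise XPath are assumptions on my part):

import requests
from gne import GeneralNewsExtractor

url = 'http://www.qdaily.com/articles/65298.html'
html = requests.get(url, timeout=10).text

extractor = GeneralNewsExtractor()
result = extractor.extract(
    html,
    noise_node_list=['//div[@class="comment-list"]'],  # strip the comment block before extraction
    with_body_html=True,                               # also return the body's HTML
)

print(result['title'])
print(result['publish_time'])
print(result['content'][:200])
print(result['images'])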

# List-page extraction -- the results are not great either
>>> from gne import ListPageExtractor
>>> html = '''rendered HTML of the list page'''
>>> list_extractor = ListPageExtractor()
>>> result = list_extractor.extract(html, feature='XPath of any element inside the list')
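
A concrete call might look like this (a sketch; the XPath //div[@class="item"]//a is a made-up placeholder for one element inside the qdaily list, not a verified selector, and a JS-rendered list page may need a headless browser rather than plain requests):

import requests
from gne import ListPageExtractor

home_url = 'http://www.qdaily.com/'
html = requests.get(home_url, timeout=10).text

list_extractor = ListPageExtractor()
# feature is the XPath of any one element inside the repeated list items;
# gne infers the surrounding list structure from it.
result = list_extractor.extract(html, feature='//div[@class="item"]//a')
print(result)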