urllib.parse is used very frequently in crawler development; here is a summary of its most common uses.

urlparse (URL parsing)

In [16]: from urllib.parse import urlparse

In [17]: url = 'http://example.com/cxs?name=cxs#age'
    ...: u = urlparse(url)

In [18]: u
Out[18]: ParseResult(scheme='http', netloc='example.com', path='/cxs', params='', query='name=cxs', fragment='age')
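
ParseResult is a named tuple, so each component is available as an attribute and the original URL can be rebuilt from it. A minimal sketch, reusing the url from the example above:

from urllib.parse import urlparse, urlunparse

u = urlparse('http://example.com/cxs?name=cxs#age')
print(u.netloc)       # 'example.com'
print(u.query)        # 'name=cxs'
print(u.geturl())     # rebuilds the URL from the ParseResult
print(urlunparse(u))  # same result via urlunparse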

A quick note on the difference between query and params (see the sketch below):

  • A URL built with query looks like /xx?id=id, while one built with params looks like /xx;id (params is the part after a ';' in the last path segment)

  • Since params has to attach to a path segment, it is only usable when the path is non-empty; in practice the query form is the one you will almost always use
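
A minimal sketch showing which ParseResult field each form ends up in:

from urllib.parse import urlparse

# query form: parameters after '?'
print(urlparse('http://example.com/xx?id=1'))
# ParseResult(scheme='http', netloc='example.com', path='/xx', params='', query='id=1', fragment='')

# params form: parameters after ';' in the last path segment
print(urlparse('http://example.com/xx;id=1'))
# ParseResult(scheme='http', netloc='example.com', path='/xx', params='id=1', query='', fragment='')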

parse_qs (query string parsing)

In [19]: from urllib.parse import parse_qs

In [20]: query='a=1&b=2'
...: parse_qs(query)

Out[20]: {'a': ['1'], 'b': ['2']}

Note that a plus sign is decoded into a space, as in the example below:

In [21]: query='a=1&b=2&c=1+2'
...: parse_qs(query)

Out[21]: {'a': ['1'], 'b': ['2'], 'c': ['1 2']}
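
If a literal '+' has to survive the round trip, it needs to arrive percent-encoded as %2B; a quick check:

from urllib.parse import parse_qs

print(parse_qs('c=1+2'))    # {'c': ['1 2']}, '+' becomes a space
print(parse_qs('c=1%2B2'))  # {'c': ['1+2']}, %2B stays a literal '+'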

Converting URL parameters into a dict

from urllib.parse import urlparse, parse_qs

# res is assumed to be a response object from earlier code
query_params = parse_qs(urlparse(res.url).query)
query_params_dict = {k: v[0] for k, v in query_params.items()}
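
parse_qs returns a list for every key because the same key can appear more than once in a query string; taking v[0] keeps only the first occurrence. A small illustration:

from urllib.parse import parse_qs

params = parse_qs('tag=python&tag=spider&page=1')
print(params)                                # {'tag': ['python', 'spider'], 'page': ['1']}
print({k: v[0] for k, v in params.items()})  # {'tag': 'python', 'page': '1'}, the second 'tag' is dropped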

urlencode (query encoding)

In [22]: from urllib.parse import urlencode

In [23]: query = {
...: 'name': 'cxs',
...: 'age': 18,
...: }
...: urlencode(query)

Out[23]: 'name=cxs&age=18'
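
Two optional parameters are worth knowing: doseq=True expands list values into repeated keys, and quote_via switches the space encoding from '+' (the default quote_plus) to %20 (quote). A short sketch:

from urllib.parse import urlencode, quote

print(urlencode({'tag': ['python', 'spider']}, doseq=True))  # 'tag=python&tag=spider'
print(urlencode({'q': 'a b'}))                               # 'q=a+b'
print(urlencode({'q': 'a b'}, quote_via=quote))              # 'q=a%20b'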

quote & quote_plus (URL encoding)

In [25]: from urllib.parse import quote

In [26]: quote('a&b/c') # the slash is not encoded
Out[26]: 'a%26b/c'

In [27]: from urllib.parse import quote_plus

In [28]: quote_plus('a&b/c') # the slash is encoded
Out[28]: 'a%26b%2Fc'
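
The two also treat spaces differently, and quote takes a safe parameter (default '/') that controls which characters are left untouched. A quick sketch:

from urllib.parse import quote, quote_plus

print(quote('a b'))             # 'a%20b', space becomes %20
print(quote_plus('a b'))        # 'a+b', space becomes '+'
print(quote('a&b/c', safe=''))  # 'a%26b%2Fc', an empty safe encodes the slash as well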

unquote & unquote_plus (URL decoding)

In [29]: from urllib.parse import unquote

In [30]: unquote('a%26b%2Fc9+2') # the plus sign is not decoded
Out[30]: 'a&b/c9+2'

In [31]: from urllib.parse import unquote_plus

In [32]: unquote_plus('a%26b%2Fc9+2') # the plus sign is decoded into a space
Out[32]: 'a&b/c9 2'

urljoin (URL joining)

In [1]: from urllib.parse import urljoin

In [35]: url = "http://www.cxs.com"
In [36]: path = "/item/name"

In [37]: urljoin(url, path)
Out[37]: 'http://www.cxs.com/item/name'
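
urljoin follows the rules for resolving relative URLs, so the result depends on leading and trailing slashes; a few cases worth remembering:

from urllib.parse import urljoin

# a relative path replaces the last segment of the base path
print(urljoin('http://www.cxs.com/item/name', 'detail'))   # 'http://www.cxs.com/item/detail'
# with a trailing slash on the base, the relative path is appended
print(urljoin('http://www.cxs.com/item/name/', 'detail'))  # 'http://www.cxs.com/item/name/detail'
# a leading slash jumps back to the site root
print(urljoin('http://www.cxs.com/item/name', '/other'))   # 'http://www.cxs.com/other'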
