urllib.parse comes up constantly in crawler development, so here is a summary of its most common usages.
urlparse (URL parsing)
```python
In [16]: from urllib.parse import urlparse

In [17]: u = urlparse('http://example.com/cxs?name=cxs#age')

In [18]: u
Out[18]: ParseResult(scheme='http', netloc='example.com', path='/cxs', params='', query='name=cxs', fragment='age')
```
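The ParseResult is a named tuple, so each component can be read as an attribute, and hostname and port are derived for you. A small sketch, using a made-up URL with an explicit port:

```python
from urllib.parse import urlparse

# hypothetical URL with an explicit port, for illustration
u = urlparse('http://example.com:8080/cxs?name=cxs#age')

print(u.netloc)    # network location keeps the port: 'example.com:8080'
print(u.hostname)  # 'example.com' (lowercased, port stripped)
print(u.port)      # 8080, as an int
print(u.geturl())  # reassembles the original URL
```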
A quick note on the difference between query and params: params holds parameters attached to the last path segment with a semicolon (e.g. /path;type=a), while query is everything after the ?. In practice params is rarely seen, and you will almost always work with query.
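To make the distinction concrete, here is a sketch with a made-up URL that actually carries a path parameter:

```python
from urllib.parse import urlparse

# ';type=a' is a path parameter, '?name=b' is the query (URL is made up)
u = urlparse('http://example.com/path;type=a?name=b')

print(u.path)    # '/path'
print(u.params)  # 'type=a'
print(u.query)   # 'name=b'
```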
parse_qs (query parsing)
```python
In [19]: from urllib.parse import parse_qs

In [20]: query = 'a=1&b=2'
    ...: parse_qs(query)
Out[20]: {'a': ['1'], 'b': ['2']}
```
Note that plus signs (+) are decoded into spaces, as in the following example:
```python
In [21]: query = 'a=1&b=2&c=1+2'
    ...: parse_qs(query)
Out[21]: {'a': ['1'], 'b': ['2'], 'c': ['1 2']}
```
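Also worth knowing: repeated keys are collected into a list, and parse_qsl returns (key, value) pairs instead, preserving order and duplicates. A small sketch:

```python
from urllib.parse import parse_qs, parse_qsl

query = 'a=1&a=2&b=3'

print(parse_qs(query))   # {'a': ['1', '2'], 'b': ['3']}
print(parse_qsl(query))  # [('a', '1'), ('a', '2'), ('b', '3')]
```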
Converting URL parameters into a dict
```python
# res is an HTTP response object (e.g. a requests.Response)
query_params = parse_qs(urlparse(res.url).query)
query_params_dict = {k: v[0] for k, v in query_params.items()}
```
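One caveat with the `v[0]` flattening above: when a key repeats, everything after its first value is silently dropped, so keep the lists if duplicates matter. A sketch with a made-up URL:

```python
from urllib.parse import urlparse, parse_qs

# made-up URL with a repeated parameter
url = 'http://example.com/search?tag=a&tag=b&page=1'

query_params = parse_qs(urlparse(url).query)
flat = {k: v[0] for k, v in query_params.items()}

print(query_params)  # {'tag': ['a', 'b'], 'page': ['1']}
print(flat)          # {'tag': 'a', 'page': '1'}  -- 'b' is lost
```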
urlencode (query conversion)
```python
In [22]: from urllib.parse import urlencode

In [23]: query = {
    ...:     'name': 'cxs',
    ...:     'age': 18,
    ...: }
    ...: urlencode(query)
Out[23]: 'name=cxs&age=18'
```
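Note that non-string values like 18 above are converted with str() first. If a value is itself a list, pass doseq=True so each item becomes its own key=value pair. A sketch:

```python
from urllib.parse import urlencode

# doseq=True expands list values into repeated keys
print(urlencode({'tag': ['a', 'b'], 'page': 1}, doseq=True))  # 'tag=a&tag=b&page=1'
```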
quote & quote_plus (URL encoding)
```python
In [25]: from urllib.parse import quote

In [26]: quote('a&b/c')
Out[26]: 'a%26b/c'

In [27]: from urllib.parse import quote_plus

In [28]: quote_plus('a&b/c')
Out[28]: 'a%26b%2Fc'
```
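The difference above comes from quote treating / as safe by default (it is meant for the path part of a URL), while quote_plus escapes it and also turns spaces into +. The set of safe characters can be overridden. A sketch:

```python
from urllib.parse import quote, quote_plus

print(quote('a&b/c', safe=''))  # '/' escaped too: 'a%26b%2Fc'
print(quote('a b'))             # space becomes '%20': 'a%20b'
print(quote_plus('a b'))        # space becomes '+': 'a+b'
```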
unquote & unquote_plus (URL decoding)
```python
In [29]: from urllib.parse import unquote

In [30]: unquote('a%26b%2Fc9+2')
Out[30]: 'a&b/c9+2'

In [31]: from urllib.parse import unquote_plus

In [32]: unquote_plus('a%26b%2Fc9+2')
Out[32]: 'a&b/c9 2'
```
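In other words, unquote leaves + alone while unquote_plus turns it into a space, so pair the decoder with whichever encoder produced the string. A round-trip sketch:

```python
from urllib.parse import quote_plus, unquote_plus, unquote

s = 'a&b c'
encoded = quote_plus(s)

print(encoded)                # 'a%26b+c'
print(unquote_plus(encoded))  # round-trips back to 'a&b c'
print(unquote(encoded))       # '+' survives: 'a&b+c'
```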
urljoin (URL joining)
```python
In [1]: from urllib.parse import urljoin

In [35]: url = "http://www.cxs.com"

In [36]: path = "/item/name"

In [37]: urljoin(url, path)
Out[37]: 'http://www.cxs.com/item/name'
```
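urljoin follows browser-style resolution, so the result depends on the second argument: a leading slash replaces the whole path, a bare relative path replaces only the last segment, and an absolute URL wins outright. A sketch:

```python
from urllib.parse import urljoin

base = 'http://www.cxs.com/item/name'

print(urljoin(base, 'other'))           # last segment replaced: .../item/other
print(urljoin(base, '/other'))          # whole path replaced: .../other
print(urljoin(base, 'http://x.com/p'))  # absolute URL wins: http://x.com/p
```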
Parsing URL parameters into a dict
```python
# order_url is the URL string to parse
parsed_url = urlparse(order_url)
query_params = parse_qs(parsed_url.query)
query_params_dict = {k: v[0] for k, v in query_params.items()}
```
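The reverse direction works too: since ParseResult is a named tuple, _replace plus urlencode can swap in a modified query, and geturl rebuilds the URL. A sketch with a made-up URL:

```python
from urllib.parse import urlparse, parse_qs, urlencode

# made-up URL for illustration
url = 'http://example.com/order?page=1&size=10'

parsed = urlparse(url)
params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
params['page'] = '2'  # bump the page number

new_url = parsed._replace(query=urlencode(params)).geturl()
print(new_url)  # 'http://example.com/order?page=2&size=10'
```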