Crawler Engineering - Comparing Generic Web Content Parsers
Sample article: http://www.qdaily.com/articles/65298.html
Sample list page: http://www.qdaily.com/
newspaper (released Jun 14, 2017); install with: pip3 install newspaper3k
GitHub: https://github.com/codelucas/newspaper
```python
# Passing the language explicitly gives noticeably better results
from newspaper import Article

In [10]: article = Article(url, language='zh')
In [11]: article.download()
In [12]: article.parse()
In [21]: article.html
Out[21]: '<!DOCTYPE html><html><head> <meta charset="UT ...
```
Python Frameworks - Scrapy Notes
Initial configuration

```python
download_delay = 20  # download delay

custom_settings = {
    "HTTPERROR_ALLOWED_CODES": [404],  # treat 404 responses as allowed
    "COOKIES_ENABLED": True,
    "DOWNLOAD_DELAY": 5,
    "DOWNLOAD_TIMEOUT": 5,
    "REFERER_ENABLED": False,  # disable the automatic Referer header
    "REDIRECT_ENABLED": False,  # disable redirects
    "RETRY_HTTP_CODES": [429, 401, 403, 408, 414, 500, 502, 503, 504],  # HTTP codes to retry
    "DEFAULT_REQUEST_HEADERS": {
        ...
```
Python Programming - On Multithreading and Multiprocessing
Thread pool

```python
from concurrent.futures import ThreadPoolExecutor

def main(url):
    pass

Pool = ThreadPoolExecutor(max_workers=10)
list(Pool.map(main, links))

# Passing multiple arguments
def add(x, y):
    return x + y

nums = [(1, 2), (3, 4), (5, 6)]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(add, *zip(*nums))

# No arguments: the target must then be a zero-argument callable
import threading

def worker():
    pass

for i in range(10):
    threading.Thread(target=worker).start()
```
Process pool

```python
import multiprocessing as mp

pool = mp.Pool(processes=10)
...
```
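The snippet above is cut off; a minimal runnable sketch of the same mp.Pool pattern might look like this (the square task and the fork start method are assumptions for illustration, not from the original post):

```python
import multiprocessing as mp

def square(x):
    return x * x

# The "fork" start method (Linux/macOS) lets this run at module top level;
# on Windows use "spawn" together with an `if __name__ == "__main__"` guard.
ctx = mp.get_context("fork")
with ctx.Pool(processes=4) as pool:
    results = pool.map(square, range(5))
print(results)  # [0, 1, 4, 9, 16]
```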
FastAPI and aiohttp
Basic FastAPI usage

```python
from fastapi import FastAPI, Form, File
import uvicorn

app = FastAPI()
app = FastAPI(docs_url=None, redoc_url=None)  # disable the interactive docs

# Root path
@app.get("/")
async def root():
    return "Hello World"

# Path parameter
@app.get("/items/{item_id}")
async def read_item(item_id: int):
    return fake_items_db[item_id]

# Query parameters: GET /items?skip=1&limit=2
@app.get("/items")
async def re ...
```
Web Reverse Engineering - Google Translate's Old API
Cleaning up a reverse of Google's translation API so it is ready whenever it is needed.
There is a third-party package, googletrans, but it has not been updated in two years and no longer works, so reversing the API ourselves is the practical option.
First, capture the traffic.
Comparing several requests shows that only two parameters, f.sid and _reqid, need to be worked out; everything else is fixed.
Let's see where those two come from: open an incognito window and fire up Fiddler.
f.sid sits in the HTML returned by the first visit to https://translate.google.cn/.
f.sid only takes one request and does not change afterwards. Code to fetch it:
```python
import re
import requests

def get_sid():
    url = "https://translate.google.cn/"
    resp = requests.get(url)
    sid = re.search(r'"FdrFJe":"(.+?)",', resp.text).group(1)
    return sid
```
_reqid is different: it changes on every request, so its generation logic has to be reversed.
Start by setting a breakpoint ...
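The derivation is truncated above, so as a hedge: on Google's batchexecute-style endpoints, _reqid is commonly observed to be a random four-digit seed plus 100000 for each subsequent request in the same session. The sketch below encodes that assumption; it is not taken from the post itself:

```python
import random

class ReqidGenerator:
    """Emits _reqid values: random 4-digit seed, then +100000 per request.

    This scheme is an assumption based on typical batchexecute traffic,
    not a confirmed reconstruction of the post's analysis.
    """

    def __init__(self):
        self.seed = random.randint(1000, 9999)
        self.count = 0

    def next(self):
        reqid = self.seed + self.count * 100000
        self.count += 1
        return reqid

gen = ReqidGenerator()
first, second = gen.next(), gen.next()
```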
Assorted Scripts for Eastmoney (东方财富)
Web Reverse Engineering - JSL (加速乐) Anti-Bot on a Government Ministry Site
https://www.mps.gov.cn/n2254098/n4904352/index.html
A direct visit returns a block of JavaScript whose main job is to assign cookie values.
Copy it out and run it in the console.
Never mind the result for now; the target of this reversal is how two of the cookie values are generated.
__jsluid_s is returned by the server on the first request.
Could __jsl_clearance_s be the result of executing that first JS snippet? Compare:
```
# Result of executing the JS
__jsl_clearance_s=1654493484.607|-1|eU3CZr0IIIZOWUJfz3vNntOeewo%3D
# Value actually set in cookies
__jsl_clearance_s=1654493484.979|0|70p4p0lk5rF59kkzj6oslUuf8Qk%3D
```
Apparently not; if it were that simple this post would not exist 😬
Back to the capture: the second request also returns a JS snippet.
It opens with a large array, a telltale sign of obfuscator.io (ob) obfuscation, so deobfuscate it first:
Paste the result into the browser and jump to the end of the code:
It is now clear that the final go function is what generates the cookie ...
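The walkthrough is truncated here, but on typical JSL (加速乐) pages the deobfuscated script ends with a go({...}) call whose JSON argument drives the cookie computation. A stdlib-only sketch of pulling that argument out; the sample_js string is fabricated for illustration, and real payloads carry more fields:

```python
import json
import re

# Fabricated stand-in for the deobfuscated second response
sample_js = ';go({"bts":["1654493484.979|0|","70p4p0lk5rF59kkzj6oslUuf8Qk%3D"],"chars":"abcdef","ct":"xxxx","ha":"sha1","tn":"__jsl_clearance_s","vt":"3600"})'

def extract_go_payload(js_text):
    """Grab the JSON object passed to the final go(...) call."""
    m = re.search(r'go\((\{.*?\})\)', js_text)
    return json.loads(m.group(1))

payload = extract_go_payload(sample_js)
print(payload["tn"])  # __jsl_clearance_s
```

The brute-force over `chars` that produces the final cookie value would follow from this payload; that part is beyond what the truncated text shows.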
Activation and Cracking
Web Reverse Engineering - A Look at Bypassing Cloudflare's 5-Second Shield
cloudscraper (released Mar 15, 2022), documentation on PyPI
```python
import cloudscraper

url = "https://www.curseforge.com/"
scraper = cloudscraper.create_scraper(interpreter="nodejs")
print(scraper.get(url).text)
```

cloudflare-scrape (23 Feb 2020): too old, abandoned; https://github.com/Anorov/cloudflare-scrape
Python Programming - A Round-up of Magic Methods
iter() creates an iterator, used together with next():
```python
def dtime_iterator():
    dtime_lst = [1, 2, 3]
    return iter(dtime_lst)

it = dtime_iterator()

In [2]: next(it)
Out[2]: 1
In [3]: next(it)
Out[3]: 2
In [4]: next(it)
Out[4]: 3
```
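As a supplement (not in the original post), the same protocol can be implemented by hand with the __iter__ and __next__ magic methods; the Countdown class is a made-up example:

```python
class Countdown:
    """Counts n, n-1, ..., 1 via the iterator protocol."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # An iterator returns itself from __iter__
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

print(list(Countdown(3)))  # [3, 2, 1]
```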
__call__
Calling a function abc(x1, x2, ...) is really equivalent to type(abc).__call__(abc, x1, x2, ...)
```python
class Person:
    def __call__(self, other):
        return f'Hi {other}'

In [13]: Person()("cxs")
Out[13]: 'Hi cxs'
```
__str__ and __repr__

```python
class C ...
```
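The original snippet is cut off after class C, so here is a minimal substitute illustrating the two methods (the Point class is hypothetical, not the post's example):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        # Informal, user-facing text: used by print() and str()
        return f"({self.x}, {self.y})"

    def __repr__(self):
        # Unambiguous, developer-facing text: used in the REPL and by repr()
        return f"Point(x={self.x}, y={self.y})"

p = Point(1, 2)
print(str(p))   # (1, 2)
print(repr(p))  # Point(x=1, y=2)
```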