Python Web Scraping in Practice: Batch-Downloading Video Data from the Kuaishou Platform
Key topics
- requests
- json
- re
- pprint
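Development environment:
- Version: Anaconda 5.2.0 (Python 3.6.5)
- Editor: PyCharm
Implementation steps:
I. Analyzing the data source
(You can only write the code once you have found where the data comes from.)
1. Define the requirement (what do we want to scrape?)
- Scrape the videos that match a given keyword and save them as mp4 files
2. Use the browser's developer tools to capture and analyze traffic: where does the data really come from?
- Statically loaded pages
- Example: Biquge
- Dynamically loaded pages
- Capture the data packets with the developer tools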
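A quick way to tell the two cases apart is to check whether text you can see in the browser also appears in the raw HTML returned by the server. The sketch below assumes you already have that HTML as a string; `appears_in_raw_html` and both sample pages are hypothetical and exist only for illustration.

```python
def appears_in_raw_html(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is present in the raw HTML.

    True suggests a statically loaded page. False suggests the content is
    loaded later by JavaScript, so look for the request in the developer tools.
    """
    return visible_text.strip() in raw_html


# Hypothetical HTML snippets used only for illustration.
static_page = "<html><body><h1>Chapter 1</h1><p>Once upon a time</p></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(appears_in_raw_html(static_page, "Once upon a time"))
print(appears_in_raw_html(dynamic_page, "Once upon a time"))
```

If the check fails for content that is clearly visible in the browser, the data most likely arrives through a separate request, which is the case this tutorial deals with.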
II. Implementation process
- Find the target URL
- Send the request (GET or POST)
- Parse the data (get each video's URL and title)
- Send another request to each video URL
- Save the video
Today's target: Kuaishou (https://www.kuaishou.com)
III. A single video
Import the required modules
import json
import requests
import re
Send the request
data = {
    'operationName': "visionSearchPhoto",
    'query': "query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {\n visionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {\n result\n llsid\n webPageArea\n feeds {\ntype\nauthor {\n id\n name\n following\n headerUrl\n headerUrls {\n cdn\n url\n __typename\n }\n __typename\n}\ntags {\n type\n name\n __typename\n}\nphoto {\n id\n duration\n caption\n likeCount\n realLikeCount\n coverUrl\n photoUrl\n liked\n timestamp\n expTag\n coverUrls {\n cdn\n url\n __typename\n }\n photoUrls {\n cdn\n url\n __typename\n }\n animatedCoverUrl\n stereoType\n videoRatio\n __typename\n}\ncanAddComment\ncurrentPcursor\nllsid\nstatus\n__typename\n }\n searchSessionId\n pcursor\n aladdinBanner {\nimgUrl\nlink\n__typename\n }\n __typename\n }\n}\n",
    'variables': {
        'keyword': '張三',
        'pcursor': ' ',
        'page': "search",
        'searchSessionId': "MTRfMjcwOTMyMTQ2XzE2Mjk5ODcyODQ2NTJf5oWi5pGHXzQzMQ"
    }
}
# First attempt: no headers yet and the payload is not JSON-encoded
response = requests.post('https://www.kuaishou.com/graphql', data=data)
Add request headers
headers = {
    # Content-Type has four common formats (matching how data is encoded), including:
    #   text/xml: sends XML as a file
    #   multipart/form-data: used for file uploads
    'content-type': 'application/json',
    # Identifies the user
    'Cookie': 'kpf=PC_WEB; kpn=KUAISHOU_VISION; clientid=3; did=web_721a784b472981d650bcb8bbc5e9c9c2',
    # Browser information (makes the request look like it comes from a browser)
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
JSON serialization
# JSON is a data-interchange format; before JSON, XML was the usual way to pass data around.
# Every major language supports JSON, and JSON supports a range of data types,
# so it is widely used for everyday HTTP exchanges, data storage, and so on.
# Encode the Python object as a JSON string
data = json.dumps(data)
json_data = requests.post('https://www.kuaishou.com/graphql', headers=headers, data=data).json()
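To see why the payload has to be serialized, it may help to compare form encoding with JSON encoding on a nested dictionary. The sketch below uses the standard library's `urlencode` as a stand-in for form encoding (requests has its own form encoder, which also cannot represent a nested object). The reduced payload is a hypothetical example with the same shape as the real one.

```python
import json
from urllib.parse import urlencode

# A reduced, hypothetical payload with the same shape as the real one.
payload = {
    "operationName": "visionSearchPhoto",
    "variables": {"keyword": "demo", "page": "search"},
}

# Form encoding turns each value into text with str(), so the nested dict
# becomes the printed form of a Python dict, not a structured object.
form_body = urlencode(payload)

# json.dumps keeps the nesting intact; it pairs with the
# 'content-type: application/json' header.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

Only the JSON body can be decoded back into the original structure, which is what an endpoint expecting `application/json` needs.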
Extract values from the dictionary
feeds = json_data['data']['visionSearchPhoto']['feeds']
for feed in feeds:
    caption = feed['photo']['caption']
    photoUrl = feed['photo']['photoUrl']
    # Replace characters that are not allowed in Windows file names
    new_title = re.sub(r'[\\/:*?"<>|\n]', '-', caption)
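Chained indexing such as `json_data['data']['visionSearchPhoto']['feeds']` raises an exception as soon as one key is missing or `None`. Whether the endpoint ever returns such responses is an assumption here; the sketch below shows one defensive way to extract the same two fields. `extract_videos` and the sample data are hypothetical.

```python
def extract_videos(json_data: dict) -> list:
    """Return (caption, photoUrl) pairs, skipping incomplete feed entries."""
    search = (json_data.get("data") or {}).get("visionSearchPhoto") or {}
    videos = []
    for feed in search.get("feeds") or []:
        photo = feed.get("photo") or {}
        caption, url = photo.get("caption"), photo.get("photoUrl")
        if caption and url:
            videos.append((caption, url))
    return videos


# Hypothetical response data used only for illustration.
sample = {"data": {"visionSearchPhoto": {"feeds": [
    {"photo": {"caption": "clip one", "photoUrl": "https://example.com/1.mp4"}},
    {"photo": {"caption": "no url"}},
]}}}

print(extract_videos(sample))
print(extract_videos({"data": None}))
```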
Send another request
    # still inside the for loop
    resp = requests.get(photoUrl).content
Save the data
    # still inside the for loop; the 'video' folder must already exist
    with open('video\\' + new_title + '.mp4', mode='wb') as f:
        f.write(resp)
    print(new_title, 'downloaded successfully!')
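The save step fails if the `video` folder is missing, and it holds the whole file in memory before writing. A minimal sketch of an alternative, assuming the download is available as an iterable of byte chunks: `save_chunks` is a hypothetical helper, and the fake chunks below stand in for a real response.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], folder, name: str) -> Path:
    """Write byte chunks to <folder>/<name>.mp4, creating the folder if needed."""
    folder_path = Path(folder)
    folder_path.mkdir(parents=True, exist_ok=True)
    target = folder_path / f"{name}.mp4"
    with open(target, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
    return target


# Demonstration with fake chunks in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    path = save_chunks([b"abc", b"", b"def"], Path(tmp) / "video", "demo")
    print(path.name, path.read_bytes())
```

With requests, the chunks could come from `requests.get(url, stream=True)` followed by `response.iter_content(chunk_size=65536)`, so that large videos are written piece by piece.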
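IV. Scraping multiple pages
Import modules
# json, requests, re and the headers dictionary from section III are reused here
import concurrent.futures
import time
Send the request
def get_json(url, data):
    response = requests.post(url, headers=headers, data=data).json()
    return response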
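Clean up the title
def change_title(title):
    # Windows file names cannot contain certain special characters
    # Windows also limits file name length (about 255 characters), so long titles are shortened
    new_title = re.sub(r'[/\\|:?<>"*\n]', '_', title)
    # Titles longer than 50 characters are cut down to their first 10 characters
    if len(new_title) > 50:
        new_title = new_title[:10]
    return new_title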
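A few edge cases are worth checking when cleaning titles: an empty result, trailing dots or spaces (which Windows does not allow at the end of a file name), and very long captions. The sketch below is a hypothetical variant, `safe_filename`, not the function used in this article; the 50-character limit and the `untitled` fallback are assumptions.

```python
import re


def safe_filename(title: str, max_len: int = 50) -> str:
    """Replace forbidden characters, truncate, and never return an empty name."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', "_", title)
    # Truncate first, then drop leading/trailing spaces and dots.
    cleaned = cleaned[:max_len].strip(" .")
    return cleaned or "untitled"


print(safe_filename("a/b:c?"))
print(len(safe_filename("x" * 80)))
print(safe_filename("..."))
```

Note that two different captions can still map to the same file name, in which case the later download overwrites the earlier one.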
Extract the data
def parse(json_data):
    data_list = json_data['data']['visionSearchPhoto']['feeds']
    info_list = []
    for data in data_list:
        # Extract the title
        title = data['photo']['caption']
        new_title = change_title(title)
        url_1 = data['photo']['photoUrl']
        info_list.append([new_title, url_1])
    return info_list
Save the data
def save(title, url_1):
    resp = requests.get(url_1).content
    with open('video\\' + title + '.mp4', mode='wb') as f:
        f.write(resp)
    print(title, 'downloaded successfully!')
Main function: calls all the other functions
def run(url, data):
    """Main function: calls all the other functions."""
    json_data = get_json(url, data)
    info_list = parse(json_data)
    for title, url_1 in info_list:
        save(title, url_1)

if __name__ == '__main__':
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for page in range(1, 5):
            url = 'https://www.kuaishou.com/graphql'
            data = {
                'operationName': "visionSearchPhoto",
                'query': "query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {\n visionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {\n result\n llsid\n webPageArea\n feeds {\ntype\nauthor {\n id\n name\n following\n headerUrl\n headerUrls {\n cdn\n url\n __typename\n }\n __typename\n}\ntags {\n type\n name\n __typename\n}\nphoto {\n id\n duration\n caption\n likeCount\n realLikeCount\n coverUrl\n photoUrl\n liked\n timestamp\n expTag\n coverUrls {\n cdn\n url\n __typename\n }\n photoUrls {\n cdn\n url\n __typename\n }\n animatedCoverUrl\n stereoType\n videoRatio\n __typename\n}\ncanAddComment\ncurrentPcursor\nllsid\nstatus\n__typename\n }\n searchSessionId\n pcursor\n aladdinBanner {\nimgUrl\nlink\n__typename\n }\n __typename\n }\n}\n",
                'variables': {
                    'keyword': '曹芬',
                    # 'keyword': keyword,
                    'pcursor': str(page),
                    'page': "search",
                    'searchSessionId': "MTRfMjcwOTMyMTQ2XzE2Mjk5ODcyODQ2NTJf5oWi5pGHXzQzMQ"
                }
            }
            data = json.dumps(data)
            # Each page is handled by its own worker thread
            executor.submit(run, url, data)
    # The with block waits for every submitted task before this line runs
    print('Total time taken:', time.time() - start_time)
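One caveat with `executor.submit` is that an exception raised inside a worker stays hidden unless the returned future is inspected. The sketch below shows one way to collect results and errors per page with `concurrent.futures.as_completed`. `run_pages` and `fake_worker` are hypothetical stand-ins; no network requests are made.

```python
import concurrent.futures


def run_pages(worker, pages, max_workers=10):
    """Run worker(page) for each page; return (results, errors) keyed by page."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_page = {executor.submit(worker, page): page for page in pages}
        for future in concurrent.futures.as_completed(future_to_page):
            page = future_to_page[future]
            try:
                results[page] = future.result()
            except Exception as exc:  # keep the failure instead of losing it
                errors[page] = exc
    return results, errors


def fake_worker(page):
    """Hypothetical worker that fails on page 3 to simulate a bad response."""
    if page == 3:
        raise ValueError("simulated failure")
    return page * 10


results, errors = run_pages(fake_worker, range(1, 5))
print(sorted(results.items()))
print({page: repr(exc) for page, exc in errors.items()})
```

In the script above, the worker would be a function that builds the payload for one page and calls `run`, so a failed page can be reported or retried instead of passing unnoticed.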
The run took 57.7 seconds.