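Efficiently scraping novels from a Biquge-style site with a Python crawler
This code builds on my earlier Biquge scraper. That version sent requests from my own IP, so it had to sleep for 4 to 5 seconds before each chapter request to avoid an IP ban. Counting the time spent saving the data, each chapter took 6 to 7 seconds, which makes scraping a long novel extremely slow. This time I use proxy IPs, so no sleep is needed and the chapters can be requested in bulk.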

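1. Getting free proxy IPs

For free IPs I chose Zdaye (zdaye.com). Free IPs are short-lived, so fetch fresh ones right before use. I wrote a dedicated getip.py for this: it scrapes the thirty newest IPs and returns them as two lists of dictionaries, one in the form {'http': 'ip'} and one in the form {'https': 'ip'}.

Note: this is a separate .py file. The crawler itself imports it later.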

import requests
from lxml import etree
from time import sleep


def getip():
    """Scrape free proxies from zdaye.com.

    Returns two lists of requests-style proxy dicts:
    [{'http': 'http://ip:port'}, ...] and [{'https': 'https://ip:port'}, ...]
    """
    base_url = 'https://www.zdaye.com'
    url = 'https://www.zdaye.com/dayProxy.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
    }
    res = requests.get(url, headers=headers)
    res.encoding = "utf-8"
    dom = etree.HTML(res.text)
    # Links to the daily proxy-list posts
    sub_urls = dom.xpath('//h3[@class ="thread_title"]/a/@href')
    sub_pages = []
    for sub_url in sub_urls:
        post_url = base_url + sub_url
        # Cut the '.html' suffix by slicing. rstrip('.html') would strip any
        # trailing '.', 'h', 't', 'm' or 'l' characters, not the suffix.
        if post_url.endswith('.html'):
            post_url = post_url[:-len('.html')]
        # Each post is paginated: <post>/1.html ... <post>/10.html
        for i in range(1, 11):
            sub_pages.append(post_url + '/' + str(i) + '.html')
    http_list = []
    https_list = []
    # Only the first three pages are fetched
    for sub in sub_pages[:3]:
        sub_res = requests.get(sub, headers=headers)
        sub_res.encoding = 'utf-8'
        sub_dom = etree.HTML(sub_res.text)
        ips = sub_dom.xpath('//tbody/tr/td[1]/text()')
        ports = sub_dom.xpath('//tbody/tr/td[2]/text()')
        types = sub_dom.xpath('//tbody/tr/td[4]/text()')
        sleep(3)  # wait 3 seconds between pages
        sub_res.close()
        for ip, port, _type in zip(ips, ports, types):
            # Store every proxy twice, once per scheme
            http_list.append({'http': 'http://' + ip + ':' + port})
            https_list.append({'https': 'https://' + ip + ':' + port})
    return http_list, https_list


if __name__ == '__main__':
    http_list, https_list = getip()
    print(http_list)
    print(https_list)
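
One detail of the paging step deserves a closer look: str.rstrip removes a set of characters, not a literal suffix, so it can eat part of the path when the post name ends in one of the letters of ".html". The sketch below builds the page URLs by slicing instead. The helper name page_urls and the example paths are made up for illustration, and the URL layout is only what the code above assumes about the site.

```python
def page_urls(base_url, sub_url, pages=10):
    """Build the paginated URLs <post>/1.html ... <post>/<pages>.html."""
    post_url = base_url + sub_url
    suffix = '.html'
    # Remove the suffix by slicing: rstrip('.html') strips *characters*
    # from the set {'.', 'h', 't', 'm', 'l'}, not the literal suffix.
    if post_url.endswith(suffix):
        post_url = post_url[:-len(suffix)]
    return [f'{post_url}/{i}{suffix}' for i in range(1, pages + 1)]


# The post path below is invented for illustration.
urls = page_urls('https://example.com', '/dayProxy/ip/1234.html', pages=3)
print(urls)

# What rstrip does to a path whose name ends in some of those characters:
# the trailing "mt" is removed along with ".html".
print('/dayProxy/ip/mt.html'.rstrip('.html'))
```

A purely numeric post name happens to survive rstrip, which is why the problem is easy to miss.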

2. Implementation

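The complete code is at the end. The line from getip import getip in it imports the IP-fetching part shown above.
I collected a dozen or so common User-Agent strings and pair them at random with the thirty IPs, which gives several hundred possible combinations (17 User-Agent strings × 30 IPs = 510 with the list used below).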
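
The random pairing can be sketched as follows. This is a minimal illustration with placeholder pools, not the article's code: pick_identity is a name invented here, and the proxy addresses are dummy values.

```python
import random


def pick_identity(user_agents, proxies, rng=random):
    """Pick one User-Agent and one proxy independently at random."""
    headers = {'User-Agent': rng.choice(user_agents)}
    proxy = rng.choice(proxies)
    return headers, proxy


# Placeholder pools; the real ones would come from the User-Agent list
# and from getip().
user_agents = ['agent-a', 'agent-b', 'agent-c']
proxies = [{'http': f'http://10.0.0.{i}:8080'} for i in range(1, 5)]

headers, proxy = pick_identity(user_agents, proxies, random.Random(0))
# Number of distinct (User-Agent, proxy) pairs is the product of the sizes.
combinations = len(user_agents) * len(proxies)
print(headers, proxy, combinations)
```

Both values would then be passed to requests.get through its headers and proxies arguments.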

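I defined three functions to do the work.
biquge_get(): takes the URL of the search page. The search itself works by changing kw in the URL, as the main block shows.
-------------------------- Returns the URL of the book's index page and the book title.

get_list(): takes the URL returned by biquge_get.
--------------------- Returns the list of chapter URLs.

info_get(): takes the chapter URLs, the IP pool, the User-Agent pool and the book title.
--------------------- Saves the content of each chapter to a local file.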
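
Since the search keyword goes straight into the query string, it is safer to percent-encode it first. Below is a small sketch using the standard library. build_search_url and its encoding parameter are names introduced here, and UTF-8 is an assumption: the site may expect a different encoding such as GBK, which I have not verified.

```python
from urllib.parse import urlencode

SEARCH_URL = 'http://www.b520.cc/modules/article/search.php'


def build_search_url(kw, encoding='utf-8'):
    """Return the search URL with kw percent-encoded as searchkey.

    The encoding is an assumption; pass encoding='gbk' if the site
    turns out to expect GBK-encoded keywords.
    """
    return SEARCH_URL + '?' + urlencode({'searchkey': kw}, encoding=encoding)


print(build_search_url('python'))
print(build_search_url('小说'))
```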

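Inside info_get() I define four variables, a, b, c and d, to track whether each chapter returned any content. The comments in the code explain them in detail.
Here is the idea. The for loop runs for ten times the number of chapters, and a, b and c all start at 0.
The lookup url = li_list[a] selects the chapter to request, and incrementing a moves on to the next URL. With this many requests some will fail, so when the returned text1 is empty, a -= 1 undoes the increment and the next iteration requests the same URL again.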

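I hit a pitfall here. While testing I printed the value of a to watch progress, and it kept printing the same chapter number, 340, until the loop ended. At first I assumed the site had become unreachable. When I compared against the web page, I found that this chapter simply has no content, so the program stayed stuck on it. That is why I added two more variables, b and c.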

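1. b stores the value a had before it was updated. If b equals a in the next iteration, the previous request did not succeed, so c += 1. Some pages are themselves broken and contain no data, and those have to be skipped.
2. If c is greater than 10, more than ten requests have failed for one reason or another, so a += 1 skips the chapter. d is decremented at the same time, to avoid an index error when the loop exits later.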
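
The same skip rule can also be written with a per-chapter failure counter and a while loop. The sketch below is an alternative formulation under my own assumptions, not the code used in this article: download_all, fetch and max_retries are names invented here, and fetch is a stand-in for the real request.

```python
def download_all(urls, fetch, max_retries=10):
    """Fetch every URL in order, retrying empty results up to max_retries.

    fetch(url) is a hypothetical callable that returns a list of paragraphs,
    or an empty list when the request failed or the page has no content.
    Returns (results, skipped), where results maps index -> paragraphs.
    """
    results = {}
    skipped = []
    index = 0      # plays the role of a
    failures = 0   # plays the role of c, but is reset for every chapter
    while index < len(urls):
        paragraphs = fetch(urls[index])
        if paragraphs:
            results[index] = paragraphs
            index += 1
            failures = 0
        else:
            failures += 1
            if failures > max_retries:
                # Give up on this chapter and move on
                skipped.append(index)
                index += 1
                failures = 0
    return results, skipped


# Stub standing in for the real request: chapter "c2" is always empty,
# chapter "c1" succeeds only on its third attempt.
attempts = {}


def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == 'c2':
        return []
    if url == 'c1' and attempts[url] < 3:
        return []
    return [f'text of {url}']


results, skipped = download_all(['c0', 'c1', 'c2', 'c3'], fake_fetch)
print(sorted(results), skipped)
```

Because the loop condition is index < len(urls), no separate length variable has to be adjusted when a chapter is skipped, and the last chapter is still fetched.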

Finally there is d, which starts at the number of chapters, d = len(li_list). Once a has grown to equal d, every URL in li_list has been used and the loop has to exit.
After that, the extracted data is saved.
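
The saving step can be sketched on its own. This is an illustration under assumptions: format_chapter and save_chapters are names invented here, the heading layout only approximates the one used in the full code, and the chapters are dummy data written to a temporary directory.

```python
import os
import tempfile


def format_chapter(number, title, paragraphs):
    """Return one chapter as text: a heading line, then a tab-indented body."""
    return f'Chapter {number} {title}:\n\t' + ','.join(paragraphs) + '\n\n'


def save_chapters(path, chapters):
    """Write (title, paragraphs) pairs to path; the file is closed on exit."""
    with open(path, 'w', encoding='utf-8') as fp:
        for number, (title, paragraphs) in enumerate(chapters, start=1):
            fp.write(format_chapter(number, title, paragraphs))


path = os.path.join(tempfile.mkdtemp(), 'book.txt')
save_chapters(path, [('Intro', ['first', 'second']), ('Next', ['third'])])
with open(path, encoding='utf-8') as fp:
    saved = fp.read()
print(saved)
```

Using with open(...) closes the file even if a write raises an exception part way through.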

In the final test, on a book with 1,676 chapters, the initial speed was roughly two chapters per second.

The scrape finished in about 10 minutes in total.
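
As a rough check of these figures, the arithmetic below compares the average rate implied by the reported total with the 6 to 7 seconds per chapter of the earlier approach. The inputs are the approximate numbers quoted in this article, so the results are estimates only.

```python
chapters = 1676
total_minutes = 10            # reported total, approximate

avg_rate = chapters / (total_minutes * 60)   # chapters per second
print(f'average rate: {avg_rate:.2f} chapters/s')

# Earlier approach: 6 to 7 seconds per chapter, one chapter at a time.
old_hours = [chapters * seconds / 3600 for seconds in (6, 7)]
for seconds, hours in zip((6, 7), old_hours):
    print(f'{seconds} s/chapter -> {hours:.1f} h')
```

So the same book would have taken around three hours with the sleep-based version.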

import requests
from lxml import etree
from getip import getip
import random
import time

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
}


def biquge_get(url):
    """Print every search result for the keyword, then return the URL
    and the title of the book the user selects."""
    r = requests.get(url=url, headers=headers, timeout=20)
    r.encoding = r.apparent_encoding
    html = etree.HTML(r.text)
    # Titles, authors and links of the search results
    bookname = html.xpath('//td[@class = "odd"]/a/text()')
    bookauthor = html.xpath('//td[@class = "odd"]/text()')
    bookurl = html.xpath('//td[@class = "odd"]/a/@href')
    print('Search results:\n')
    # The author is every second text node, hence the index 2 * (n - 1)
    for n, name in enumerate(bookname, start=1):
        print(str(n) + ':', name, '\tAuthor:', bookauthor[2 * (n - 1)])
    c = input('Choose the novel to download (enter its number): ')
    book_name = str(bookname[int(c) - 1])
    print(book_name, 'retrieving chapter list')
    url2 = bookurl[int(c) - 1]
    r.close()
    return url2, book_name

def get_list(url):
    """Take the URL of a book and return the URL of every chapter."""
    r = requests.get(url=url, headers=headers, timeout=20)
    r.encoding = r.apparent_encoding
    html = etree.HTML(r.text)
    # Parse the chapter links; the first nine links are skipped
    li_list = html.xpath('//*[@id="list"]/dl//a/@href')[9:]
    return li_list


# Pool of User-Agent strings
user_agent = [
 "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
 "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
 "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
 "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
 "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
 "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
 "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
 "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
 "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
 "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
 "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"]
def info_get(li_list, ip_list, headers, book_name):
    """Download every chapter and write it to '<book_name>.txt'.

    Parameters: chapter URLs, IP pool, User-Agent pool, book title.

    a: index of the current chapter; it only advances after the HTML was
       fetched and the text was written
    b: the value a had in the previous iteration, compared with the current
       a to tell whether a has changed
    c: incremented whenever b == a; once it exceeds 10 the chapter is
       skipped and d is decremented
    d: number of chapters
    """
    print('Total: ' + str(len(li_list)) + ' chapters')
    a = 0
    b = 0
    c = 0
    d = len(li_list)
    fp = open('./' + str(book_name) + '.txt', 'w', encoding='utf-8')
    # Loop for 10x the chapter count so that retries cannot use up the loop
    for _ in range(10 * len(li_list)):
        url = li_list[a]
        # Pick a proxy whose key matches the URL scheme:
        # ip_list[0] holds the http proxies, ip_list[1] the https proxies
        if url[4:5] == "s":
            proxies = random.choice(ip_list[1])
        else:
            proxies = random.choice(ip_list[0])
        try:
            r = requests.get(url=url,
                             headers={'User-Agent': random.choice(headers)},
                             proxies=proxies,
                             timeout=5)
            r.encoding = r.apparent_encoding
            html = etree.HTML(r.text)
            titles = html.xpath('/html/body/div/div/div/div/h1/text()')
            title = titles[0] if titles else ''
            text = html.xpath('//*[@id="content"]/p/text()')
            # Drop the first two characters of every paragraph
            text1 = [line[2:] for line in text]
            # b holds the previous a. If b == a, the previous request did not
            # succeed, so count it in c. Some pages are broken and have no
            # data: after more than ten failures skip the chapter (a += 1)
            # and decrement d.
            if b == a:
                c += 1
            if c > 10:
                a += 1
                c = 0
                d -= 1
            b = a

            # Move on to the next URL. If nothing was extracted, step back
            # and request the same URL again; otherwise save the chapter.
            a += 1
            if len(text1) == 0:
                a -= 1
            else:
                # a is now the 1-based number of the chapter just fetched
                fp.write('Chapter ' + str(a) + ' ' + str(title) + ':\n' + '\t' + ','.join(text1) + '\n\n')
                print('"' + str(title) + '"', 'downloaded')
            r.close()
        except EnvironmentError:
            # Network errors (requests exceptions derive from IOError):
            # the same URL is retried in the next iteration
            pass
        # a indexes li_list; once it reaches d, stop before li_list[a]
        # runs past the end of the list
        if a == d:
            break
    fp.close()

if __name__ == '__main__':
    kw = input('Enter the novel to search for: ')
    url = f'http://www.b520.cc/modules/article/search.php?searchkey={kw}'
    bookurl, book_name = biquge_get(url)
    li_list = get_list(bookurl)
    ip_list = getip()  # (http_list, https_list)
    t1 = time.time()
    info_get(li_list, ip_list, user_agent, book_name)
    t2 = time.time()
    print('Elapsed: ' + str((t2 - t1) / 60) + ' min')
