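Python Web Scraping in Depth: Using Beautiful Soup
I. Introduction to Beautiful Soup
Beautiful Soup is a powerful parsing tool that uses the structure and attributes of a web page to parse it.
It provides functions for navigating, searching, and modifying the parse tree. Beautiful Soup converts incoming documents to Unicode automatically, so you usually do not need to worry about the document's encoding. It relies on an underlying parser to do the actual parsing; a commonly used one is lxml.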
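As a minimal sketch of the parser dependency described above, the snippet below builds a soup from an inline string. The markup and the variable names are made up for illustration, and it uses "html.parser" (the parser bundled with Python) instead of lxml only so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <body> and <html> are deliberately left unclosed.
sloppy_html = "<html><body><p class='greeting'>Hello</p>"

# The second argument names the parser; "lxml" would work the same way
# here if the lxml package is installed.
soup = BeautifulSoup(sloppy_html, "html.parser")

paragraph = soup.p
print(paragraph.string)    # text inside the <p> node
print(paragraph["class"])  # class is multi-valued, so a list is returned
```

Different parsers can repair broken markup in slightly different ways, so it is worth using the same parser everywhere in a project.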
II. Using Beautiful Soup
Test file test03.html:
<!DOCTYPE html>
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type" />
<meta content="IE=Edge" http-equiv="X-UA-Compatible" />
<meta content="always" name="referrer" />
<link rel="stylesheet" type="text/css" />
<title>百度一下,你就知道 </title>
</head>
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
</body>
</html>
1. Node selectors
As covered earlier, a web page is made up of element nodes, and extracting the content of a particular node gives us the data shown on the page. Node selectors simplify this: they let us pick out data precisely without resorting to regular expressions.
from bs4 import BeautifulSoup

# Read the file as bytes and let Beautiful Soup handle the encoding
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.head)
print(soup.head.title)
print(soup.a)
[Output]
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
<title>百度一下,你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
Analysis:
The first print statement outputs the head node of the page;
The second outputs the title node inside head. This is a nested selection, because the title node sits inside the head node;
The third outputs an a node. The source contains several a nodes, but only the first one is returned. When several nodes match, this style of selection returns only the first match and ignores the rest.
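The first-match behaviour described above can be checked with a small sketch. The inline markup is hypothetical, and "html.parser" is used only to avoid depending on lxml. It also shows that selecting a tag that does not exist yields None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> nodes and no <table> node.
html = '<div><a href="/one">first</a><a href="/two">second</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a      # only the first <a> in document order
missing = soup.table     # no such tag, so this is None

print(first_link["href"])
print(missing is None)
```

Because a missing node comes back as None, chained selections such as soup.table.tr should be guarded with a None check.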
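2. Extracting information
The data we need usually lives in a node's name, its attribute values, or its text. The following code shows how to read each of the three:
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.body.name)               # node name
print(soup.body.a.attrs["class"])   # class attribute
print(soup.body.a.attrs["href"])    # href attribute
print(soup.body.a.string)           # text content
[Output]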
body
['mnav']
http://news.baidu.com
新聞
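Analysis:
Line 1 prints the name of the body node;
Line 2 prints the class attribute of the a node (a list, because class may hold several values);
Line 3 prints the href attribute of the a node;
Line 4 prints the text of the a node.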
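The sketch below repeats the three kinds of lookup on a hypothetical one-line document and adds the get() method, which returns None (or a default you supply) instead of raising KeyError when an attribute is absent. The extra class value "top" and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "top" is an invented second class value.
html = '<a class="mnav top" href="http://news.baidu.com" name="tj_trnews">news</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

tag_name = link.name                 # node name
classes = link["class"]              # multi-valued attribute: a list
href = link.get("href")              # same as link["href"] when present
title = link.get("title")            # attribute is missing: None
title_or_default = link.get("title", "n/a")

print(tag_name, classes, href, title, title_or_default)
```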
3. Related-node selection
(1) Children and descendants
A node's direct children are available through the contents attribute (a list) and the children attribute (an iterator); all of its descendants are available through the descendants attribute (a generator). Each can be walked with a for loop.
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# print(soup.body.contents)
for i, content in enumerate(soup.body.contents):
    print(i, content)
[Output]
0
1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
2
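Items 0 and 2 in the output above are the whitespace text nodes surrounding the div. To compare contents, children and descendants without that noise, here is a minimal sketch on a hypothetical list with no whitespace between tags; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with no whitespace between tags.
html = "<ul><li>one</li><li><b>two</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

child_list = ul.contents                        # a plain list of direct children
child_names = [c.name for c in ul.children]     # iterate direct children only
# descendants also yields text nodes, so keep only Tag objects here
descendant_names = [d.name for d in ul.descendants if isinstance(d, Tag)]

print(len(child_list), child_names, descendant_names)
```

children stops at the two li nodes, while descendants also reaches the nested b node.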
(2) Parents and ancestors
The parent of a node is available through the parent attribute. For example, to get the parent of the title node in the sample:
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)
[Output]
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
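Similarly, to get all of a node's ancestors, use the parents attribute, which yields the parent, the grandparent, and so on up to the document itself.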
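As a small sketch of parents, the snippet below collects the names of every ancestor of a title node in a hypothetical minimal document. The last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parents is a generator that walks upward from the node to the document root
ancestor_names = [ancestor.name for ancestor in soup.title.parents]
print(ancestor_names)
```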
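(3) Siblings
next_sibling returns the node's next sibling;
previous_sibling returns the node's previous sibling;
next_siblings returns a generator over all of the siblings that follow the node;
previous_siblings returns a generator over all of the siblings that precede the node.
Note that whitespace between tags is itself a text node, so in a formatted file the immediate sibling of a tag is often a newline string rather than the next tag.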
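The four sibling attributes can be tried on a hypothetical row of three links. The markup has no whitespace between tags, so every sibling here is a tag; the id values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three adjacent <a> nodes, no whitespace in between.
html = '<div><a id="n1">1</a><a id="n2">2</a><a id="n3">3</a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", id="n1")
last = soup.find("a", id="n3")

next_id = first.next_sibling["id"]                          # single node
following_ids = [s["id"] for s in first.next_siblings]      # all later siblings
preceding_ids = [s["id"] for s in last.previous_siblings]   # nearest first
no_previous = first.previous_sibling                        # nothing before the first node

print(next_id, following_ids, preceding_ids, no_previous)
```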
4. Method selectors
find_all()
Finds all elements that match the given conditions. Its signature is:
find_all(name,attrs,recursive,text,**kwargs)
(1) name
Queries elements by node name. For example, to find the a tags in the sample:
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))   # the whole result list
for a in soup.find_all(name="a"):
    print(a)                     # one tag per line
[Output]
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
(2) attrs
A query can also filter on a tag's attributes. The attrs parameter takes a dictionary that maps attribute names to the values they must have.
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a", attrs={"class": "bri"}))
[Output]
[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
As you can see, once the class="bri" condition is added, only one a tag remains in the result.
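Beautiful Soup also accepts attribute filters as keyword arguments. The sketch below, on hypothetical markup, shows the class_ keyword (class is a reserved word in Python) giving the same result as the attrs dictionary. An HTML attribute that is literally called name still has to go through attrs, because the name argument of find_all() already means the tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down version of the sample's link bar.
html = (
    '<div id="u1">'
    '<a class="mnav" name="tj_trnews">news</a>'
    '<a class="bri" name="tj_briicon">more</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", attrs={"class": "bri"})
by_keyword = soup.find_all("a", class_="bri")              # equivalent filter
by_name_attr = soup.find_all("a", attrs={"name": "tj_trnews"})

print(by_dict, by_keyword, by_name_attr)
```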
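(3) text
The text parameter matches against a node's text. It accepts either a string or a compiled regular expression. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string, which newer releases prefer.
import re
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# The regular expression is searched for within each a tag's text
print(soup.find_all(name="a", text=re.compile("新聞")))
[Output]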
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>]
Only the a tag whose text contains "新聞" is returned.
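find()
find() is used in the same way, with one difference: find() returns only the first matching element as a single element (or None when nothing matches), whereas find_all() returns a list of every matching element.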
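A short sketch of the difference, using hypothetical markup: find() gives back one tag or None, find_all() gives back a list that may be empty, and the limit argument of find_all() caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links and no <img> node.
html = '<p><a class="mnav">a</a><a class="mnav">b</a><a class="bri">c</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_mnav = soup.find("a", attrs={"class": "mnav"})     # a single Tag
all_mnav = soup.find_all("a", attrs={"class": "mnav"})   # list of Tags
limited = soup.find_all("a", limit=1)                    # stop after one match

no_image = soup.find("img")        # nothing matches: None
no_images = soup.find_all("img")   # nothing matches: empty list

print(first_mnav.string, len(all_mnav), len(limited), no_image, len(no_images))
```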
5. CSS selectors
To use CSS selectors, call the select() method and pass it a CSS selector string.
For example, to select the a tags in the sample with a CSS selector:
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.select("a"))     # the whole result list
for a in soup.select("a"):
    print(a)                # one tag per line
[Output]
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
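Selecting by tag name is only the simplest case. As a hedged sketch on hypothetical markup, the snippet below combines an id selector, a class selector and an attribute selector, and uses select_one() to fetch a single element. It assumes a Beautiful Soup release recent enough to ship with the soupsieve selector engine.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down version of the sample's link bar.
html = (
    '<div id="u1">'
    '<a class="mnav" href="http://news.baidu.com">news</a>'
    '<a class="bri" href="//www.baidu.com/more/">more</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

mnav_links = soup.select("#u1 a.mnav")         # <a class="mnav"> inside id="u1"
more_link = soup.select_one("a.bri")           # first match only, or None
http_links = soup.select('a[href^="http"]')    # href starts with "http"

print(len(mnav_links), more_link["href"], [a.string for a in http_links])
```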
Getting attributes
Get the href attribute of each of the a tags above:
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select("a"):
    print(a["href"])
[Output]
http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
Getting text
To get the text of the a tags above, use either the get_text() method or the string attribute:
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select("a"):
    print(a.get_text())
    print(a.string)
[Output]
新聞
新聞
hao123
hao123
地圖
地圖
視頻
視頻
貼吧
貼吧
更多產品
更多產品
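The two approaches print the same thing above because each a tag holds a single piece of text. They diverge when a tag has more than one child, as this small sketch on hypothetical markup shows: string then returns None, while get_text() joins all of the text beneath the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has two children, a text node and a <b> node.
html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.p

whole_text = paragraph.get_text()   # text of the tag and all its descendants
direct_string = paragraph.string    # more than one child, so None
bold_string = paragraph.b.string    # <b> has exactly one text child

print(repr(whole_text), direct_string, bold_string)
```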