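I'm doing some web scraping and have run into something I can't figure out. Basically, I need to check whether the 0th element of the ResultSet releaseDate has a 'content' attribute, i.e. whether releaseDate[0]['content'] exists.
However, when 'content' is not in the tag, I get an error like this:
Traceback (most recent call last):
  File "", line 1, in
  File "imdbQuestion.py", line 18, in
    if releaseDate[0]['content']:
  File "build/bdist.macosx-10.8-intel/egg/bs4/element.py", line 879, in __getitem__
KeyError: 'content'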
How can I check whether 'content' is present in releaseDate[0] without causing an error?
Also, how do I extract the content I need from a ResultSet object?
The full code is:

import codecs

import requests
from bs4 import BeautifulSoup

file = codecs.open('imdb.txt', 'w', encoding='utf-8')

# iterate through the last digit of the IMDb title number
for increment in range(7, 10):
    imdbNum = '015008' + str(increment)
    url = 'http://www.imdb.com/title/tt' + imdbNum
    urlCode = requests.get(url)
    soup = BeautifulSoup(urlCode.content)

    # get release date
    releaseDate = soup.findAll(attrs={'itemprop': 'datePublished'})
    abc = releaseDate

    # error checking - assign '.' to releaseDate if the ResultSet is empty
    # if not empty, check if 'content' is in releaseDate[0]. if so, we are good.
    # if not, assign 'CHECK' to releaseDate
    if releaseDate:
        # this is the line that raises KeyError when the tag has no 'content'
        if releaseDate[0]['content']:
            releaseDate = releaseDate[0]['content']
        else:
            releaseDate = 'CHECK'
    else:
        releaseDate = '.'
    print(releaseDate)

file.close()
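
One possible way to approach both questions, offered as a minimal sketch rather than a definitive answer: a bs4 Tag has a get() method that returns None for a missing attribute instead of raising KeyError, and has_attr() tests for an attribute directly. A ResultSet is a subclass of list, so its elements can be indexed or iterated like any other list. The HTML snippets and the helper name extract_release_date below are made up for illustration; they stand in for the real IMDb pages so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippets standing in for three fetched pages.
PAGES = {
    "with_content": '<meta itemprop="datePublished" content="1971-06-30">',
    "without_content": '<span itemprop="datePublished">30 June 1971</span>',
    "no_match": "<p>no release date here</p>",
}


def extract_release_date(html):
    """Return the 'content' attribute of the first datePublished tag.

    Returns '.' when no tag matches and 'CHECK' when the first match has
    no 'content' attribute, mirroring the placeholders in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(attrs={"itemprop": "datePublished"})
    if not matches:
        return "."
    # Tag.get() returns None for a missing attribute instead of raising KeyError.
    content = matches[0].get("content")
    return content if content else "CHECK"


for name in sorted(PAGES):
    print("%s: %s" % (name, extract_release_date(PAGES[name])))
```

If only a yes/no answer is needed, releaseDate[0].has_attr('content') can be used in the if statement in place of releaseDate[0]['content'].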
Reposted from: http://cwsxl.baihongyu.com/