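I wrote a crawler that fetches information from a Q&A website. Since not every field is present on every page, I used several try-except statements to handle the missing ones.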
    def answerContentExtractor(loginSession, questionLinkQueue, answerContentList):
        while True:
            URL = questionLinkQueue.get()
            try:
                response = loginSession.get(URL, timeout=MAX_WAIT_TIME)
                raw_data = response.text
                # These fields must exist, or something went wrong...
                questionId = re.findall(REGEX, raw_data)[0]
                answerId = re.findall(REGEX, raw_data)[0]
                title = re.findall(REGEX, raw_data)[0]
            except requests.exceptions.Timeout, IndexError:
                print >> sys.stderr, URL + " extraction error..."
                questionLinkQueue.task_done()
                continue
            # Optional fields: fall back to an empty string when missing
            try:
                questionInfo = re.findall(REGEX, raw_data)[0]
            except IndexError:
                questionInfo = ""
            try:
                answerContent = re.findall(REGEX, raw_data)[0]
            except IndexError:
                answerContent = ""
            result = {
                'questionId': questionId,
                'answerId': answerId,
                'title': title,
                'questionInfo': questionInfo,
                'answerContent': answerContent
            }
            answerContentList.append(result)
            questionLinkQueue.task_done()
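Sometimes this code raises the following exception at runtime:

    UnboundLocalError: local variable 'IndexError' referenced before assignment

The line number in the traceback points to the second "except IndexError:" as the place where the error occurs.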
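Thanks for all the suggestions. I would love to give each of you the credit you deserve, but unfortunately I can only mark one answer as accepted...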
... the "as" keyword to catch the exception. - Karl Knechtel
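For readers following along, here is a minimal sketch of what is probably going on, assuming the question's code runs under Python 2. There, `except requests.exceptions.Timeout, IndexError:` is parsed the same way as `except requests.exceptions.Timeout as IndexError:`, so `IndexError` becomes a local variable of the function. If that handler has not run yet, the later `except IndexError:` looks up an unbound local name. The functions `broken_first` and `fixed_first` and the name `exc` below are hypothetical and only illustrate the effect; `ValueError` stands in for the timeout exception so the snippet needs no third-party library.

```python
def broken_first(items):
    """Mimics the effect of `except SomeError, IndexError:` in Python 2."""
    try:
        int("1")  # succeeds, so the handler below never runs
    except ValueError as IndexError:  # turns IndexError into a local name
        pass
    try:
        return items[0]
    except IndexError:  # the local name is unbound here
        return ""


def fixed_first(items):
    """Catches either exception type by listing them in a parenthesized tuple."""
    try:
        int("1")
    except (ValueError, IndexError) as exc:  # exc is a hypothetical name
        print("extraction error: %s" % exc)
    try:
        return items[0]
    except IndexError:  # still refers to the built-in exception
        return ""
```

Calling `broken_first([])` should raise UnboundLocalError, because matching the second handler requires evaluating the local name, while `fixed_first([])` falls back to the empty string. With a non-empty list neither function hits the second handler, which would be consistent with the error showing up only sometimes.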