How can I use Python to get the href from an <a> tag that contains JavaScript?

Question


4

I am trying to get the href from an <a> tag using Python + Selenium, but the href contains "javascript", so I cannot get the target URL.

I am using Python 3.7.3 and selenium 3.141.0.

HTML:

<a href="javascript:GoPDF('FS1546')" style="TEXT-DECORATION: Underline">Aberdeen Standard Wholesale Australian Fixed Income</a>

Code:

from selenium import webdriver
driver = webdriver.Chrome("chromedriver.exe")
driver.get("http://www.colonialfirststate.com.au/Price_performance/performanceNPrice.aspx?menutabtype=performance&CompanyCode=001&Public=1&MainGroup=IF&BrandName=FC&ProductIDs=91&Product=FirstChoice+Wholesale+Investments&ACCodes=&ACText=&SearchType=Performance&Multi=False&Hedge=False&IvstType=Investment+products&IvstGroup=&APIR=&FundIDs=&FundName=&FundNames=&SearchProdIDs=&Redirect=1")
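# find_elements_by_xpath returns a list of elements, so print each one's href
for link in driver.find_elements_by_xpath("tbody/tr[5]/td[1]/a"):
    print(link.get_attribute("href"))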

What I need is the target URL, in this form:

https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3

But it gives me:

javascript:GoPDF('FS2311')

- m.gibin

Please share your HTML. - PySaad

2 Answers

1

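Kudos to the accepted answer for doing the background work.

I would suggest using the urllib.parse tools from the standard library. URLs are not as simple as they first appear, and the people who wrote urllib are experts on the URL standards (RFC 1808 and its successor, RFC 3986).

Since you are web scraping, further down the line you may need to apply the same process to a variety of URLs, including ones with different domain names, multi-digit query components (?1234 and a whole range of other possibilities), and even fragments (?1234#example, etc.). The accepted answer will not handle all of those cases.

The following code looks more complicated at first glance, but it delegates the tricky (and potentially fragile) URL handling to urllib. It also uses more robust and flexible methods to extract the GoPDF file ID and the invariant part of the URL.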

from urllib.parse import urlparse, urlunparse


def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params,
                      url.query, url.fragment))


def get_fileid_from_js_href(href):
    """extract fileid by extracting text between single quotes"""
    return href.split("'")[1].lower()


def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])


def invariant_path(url, dropped_components=4):
    """
    return all but the dropped components of the URL 'path'
    NOTE: path components are separated by '/'
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])


js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))


$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
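
To make the point about queries and fragments concrete, here is a minimal sketch of how urlparse splits a URL into its components. The example.com URL is a made-up illustration, not one taken from the site in the question.

```python
from urllib.parse import urlparse

# Hypothetical URL with a multi-digit query and a fragment (illustration only)
parts = urlparse("https://example.com/docs/fs/1/5/fs1546.pdf?1234#example")

print(parts.netloc)    # example.com
print(parts.path)      # /docs/fs/1/5/fs1546.pdf
print(parts.query)     # 1234
print(parts.fragment)  # example
```

Because each component is parsed separately, rewriting only the path leaves the domain, query, and fragment of the model URL untouched.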

- Nic


- CodeIt · Accepted Answer

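I inspected the PDF links in the pop-up windows and worked out how the URLs are generated. They use the file name (e.g. FS2065) to build the PDF URL. A PDF URL looks like this: https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/0/fs2065.pdf?3 All the PDFs share the same path up to this part: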

https://www3.colonialfirststate.com.au/content/dam/prospects/

After that part, the rest of the path is generated from the file ID.

fs/2/0/fs2065.pdf?3
 | | |     |     ||
 | | |     |     ++--- Not needed (But you can keep if you want)
 | | |     |
 | | |     +---- File Name
 | | +---------- 4th character in the file name 
 | +------------ 3rd character in the file name 
 +-------------- First two characters in the file name

We can use this as a workaround to get the exact URL.

url = "javascript:GoPDF('FS2311')" # javascript URL  

pdfFileId = url[18:-2].lower() # extracts the file name from the Javascript URL

pdfBaseUrl = "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3"%(pdfFileId[:2],pdfFileId[2],pdfFileId[3],pdfFileId) 

print(pdfBaseUrl)
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3

See it in action at this link.
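
The fixed slice url[18:-2] only works while the href has exactly that shape. As a rough sketch, the same idea can be written with a regular expression so that unexpected hrefs are rejected rather than mis-sliced. The names BASE_URL and pdf_url_from_js_href are hypothetical, and the base path is assumed from the example URLs in this thread.

```python
import re

# Assumed base path, copied from the example PDF URLs in this thread
BASE_URL = "https://www3.colonialfirststate.com.au/content/dam/prospects"


def pdf_url_from_js_href(href):
    """Return the PDF URL for a javascript:GoPDF('...') href, or None if it does not match."""
    match = re.search(r"GoPDF\('([^']+)'\)", href)
    if match is None:
        return None
    file_id = match.group(1).lower()
    if len(file_id) < 4:
        # Too short to supply the prefix plus the 3rd and 4th characters
        return None
    return f"{BASE_URL}/{file_id[:2]}/{file_id[2]}/{file_id[3]}/{file_id}.pdf?3"


print(pdf_url_from_js_href("javascript:GoPDF('FS2311')"))
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3
```

With Selenium, the href passed in would be the value returned by get_attribute("href") on each link element.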