As many SO users requested, I have rewritten the complete code with BeautifulSoup to collect the href and src links. Here is the code:
import os
from bs4 import BeautifulSoup
from urllib.parse import urlparse
path = urlparse("http://www.example.com/dynamic/search.aspx?searchtype=cat&class_id=2566&city_id=55")
lpath = os.path.dirname(path.path)
html = u"<html class=\"\"><head id=\"pageHead\"><title>\n Beauty Salons | Best Beauty Care & Treatments | Listings @ Phonebook Online\n</title>\n <!--\n <meta http-equiv=\"Cache-Control\" content=\"no-cache, no-store, must-revalidate\" /><meta http-equiv=\"Pragma\" content=\"no-cache\" /><meta http-equiv=\"Expires\" content=\"0\" />\n -->\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><link rel=\"stylesheet\" href=\"../css_responsive/category.css\" type=\"text/css\" media=\"screen\">\n <script async=\"\" src=\"//www.google-analytics.com/analytics.js\"></script><script async=\"\" src=\"//www.google.com/adsense/search/async-ads.js\"></script><script type=\"text/javascript\" src=\"../styles/scripts/jquery-1.9.1.min.js\"></script>\n <link rel=\"shortcut icon\" type=\"image/png\" href=\"/PhoneBook.ico\">\n <!-- #Begin Css Plugin -->\n <link rel=\"stylesheet\" href=\"../css_responsive/fontsss.css\"><link rel=\"stylesheet\" href=\"../css_responsive/bootstrap-3.3.4-dist/css/bootstrap.css\" type=\"text/css\" media=\"screen\"><link rel=\"stylesheet\" href=\"../styles/scripts/fancybox/jquery.fancybox.css\" type=\"text/css\" media=\"screen\"><link rel=\"stylesheet\" href=\"../css_responsive/icon-detail.css\" type=\"text/css\" media=\"screen\">\n <!-- #Finish Css Plugin-->\n <!--<script src=\"http://www.google.com/adsense/search/ads.js\" type=\"text/javascript\"></script> -->\n <script type=\"text/javascript\" charset=\"utf-8\">\n (function (G, o, O, g, L, e) {\n G[g] = G[g] || function () {\n (G[g]['q'] = G[g]['q'] || []).push(\n arguments)\n }, G[g]['t'] = 1 * new Date; L = o.createElement(O), e = o.getElementsByTagName(\n O)[0]; L.async = 1; L.src = '//www.google.com/adsense/search/async-ads.js';\n e.parentNode.insertBefore(L, e)\n })(window, document, 'script', '_googCsa');\n </script>\n <!-- Script For Mobile Base Banner-->\n <script async=\"\" src=\"//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>\n <script>\n (adsbygoogle = 
window.adsbygoogle || []).push({\n google_ad_client: \"ca-pub-6517686434458516\",\n enable_page_level_ads: true\n });\n </script>\n <!-- Script For Mobile Base Banner END-->\n\n\n <script type=\"text/javascript\">\n function AddClass(Class, Element, HasPriority) {\n if (HasPriority == 0) {\n this.className = 'container ' + Class;\n }\n }\n </script>\n \n<meta name=\"description\" content=\"Best Beauty Salons in Abbottabad for quality beauty care and treatments. \"><meta name=\"keywords\" content=\"beauty salons,beauty care,beauty treatments\"><style type=\"text/css\">.fancybox-margin{margin-right:17px;}</style></head>\n<body style=\"text-shadow: rgba(255, 255, 255, 0.4) 0px 1px 1px; background-color: rgb(240, 240, 240);\">\n<div class=\"wapper\">\n <div class=\"pagecontent search_width c-no-t-margin\">\n <div class=\"cblock ele-margin-t-b-15 m-on-mob-hide\"><a href=\"../../default.aspx\">Home</a> > <a href=\"../../dynamic/categories.aspx\">Search by category</a> > <a href=\"../../dynamic/categories.aspx?class_id=12\">Personal Care</a> > <a href=\"../../dynamic/categories.aspx?class_id=134\">Barbers, Beauty Salons & Spas</a> > Beauty Salons in Abbottabad</div>\n <div class=\"refine\">\n <span>Refine Result</span>\n <span>Show Result With</span>\n <ul>\n <li>\n <input class=\"csortType csortTypeAll \" type=\"checkbox\" value=\"100\" name=\"\" checked=\"checked\" disabled=\"disabled\">\n <span class=\"\">All</span>\n </li>\n <li>\n <input class=\"csortType css-checkbox\" type=\"checkbox\" value=\"1\" name=\"\">\n <i class=\"icon-star-full c-icon-starfull-stroke\"></i>\n <span>Reviews</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"2\" name=\"\">\n <i class=\"icon-price-tag cColor-Red\"></i>\n <span>Deals & Coupons</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"5\" name=\"\">\n <i class=\"icon-bullhorn\"></i>\n <span>Announcements</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"3\" 
name=\"\">\n <i class=\"icon-location\"></i>\n <span>Map</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"4\" name=\"\">\n <i class=\"icon-film\"></i>\n <span>Video</span>\n </li>\n </ul>\n \n <div class=\"tab\" onclick=\"SlideTogle('Location')\">\n Search by location\n </div>\n \n <ul id=\"Location\" style=\"display: none;\">\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=1\">Karachi</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=2\">Lahore</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=56\">Islamabad</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=79\">Rawalpindi</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=49\">Faisalabad</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=81\">Gujranwala</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=78\">Peshawar</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=82\">Sialkot</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=53\">Sargodha</a></li>\n \n </ul>\n \n <div class=\"tab\" onclick=\"SlideTogle('Category')\">\n Search by category\n </div>\n \n <ul id=\"Category\" style=\"display: none;\">\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2571\">Hairstylists</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2575\">Hair Removal, Wax, Threading Body & Face</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2584\">Manicuring</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2574\">Nail Salons & Services</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2572\">Spas-Beauty, Health And Destination</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2564\">Beauty Institutes</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2569\">Estheticians</a></li>\n \n </ul>\n 
</div>\n <div id=\"cResultMainControl\">\n <div class=\"result_hldr\" id=\"cResultContainer\">\n <div class=\"h1\"><h1>Beauty Salons in Abbottabad.</h1></div>\n <div class=\"h1 page_desc cfont-12 cNo-Margin ele-pad-r-l-20 m-on-mob-hide\"><p class=\"cNo-Margin margin-t m-ele-top-no-margin \" style=\"line-height:18px;\">Best Beauty Salons in Abbottabad for quality beauty care and treatments, <a href=\"http://www.phonebook.com.pk/dynamic/search.aspx?SearchType=kl&k=bridal+makeup\" title=\"Bridal Makeup\" target=\"_blank\">bridal makeup</a>, <a href=\"http://www.phonebook.com.pk/dynamic/search.aspx?SearchType=kl&k=body+massage\" title=\"Body Massage\" target=\"_blank\">body massage</a>.</p></div>\n <div class=\"cMobileHidden col-md-12 col-xs-12 text-center overflow-visible cheight-25 margin-t\" style=\"background-color: rgb(240, 240, 240);\">\n <script async=\"\" src=\"//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>\n <!-- New Line Link Ad -->\n <ins class=\"adsbygoogle\" style=\"display:inline-block;width:468px;height:15px;background-color: rgb(240, 240, 240);\" data-ad-client=\"ca-pub-6517686434458516\" data-ad-slot=\"4522680219\"></ins>\n <script>\n (adsbygoogle = window.adsbygoogle || []).push({});\n </script>\n </div>\n <div id=\"cAlpNav\" class=\"margin-t-10 cAlpNav m-on-mob-hide\">\n <div class=\"text-center\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55\">all</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=a\">a</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=b\">b</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=c\">c</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=d\">d</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=e\">e</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=f\">f</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=g\">g</a><a 
href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=h\">h</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=i\">i</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=j\">j</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=k\">k</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=l\">l</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=m\">m</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=n\">n</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=o\">o</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=p\">p</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=q\">q</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=r\">r</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=s\">s</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=t\">t</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=u\">u</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=v\">v</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=w\">w</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=x\">x</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=y\">y</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=z\">z</a></div></div>\n <div>\n <div id=\"cListingHldr\" class=\"listing\">\n \n<div class=\"container\">\n <div class=\"comp_info\">\n <h2><a href=\"../../company/51529-Beena-Beauty-Parlour\">Beena's Beauty Parlour</a></h2>\n <!--<img class=\"margin-t\" alt=\"Comapny Rating\" src=\"../../images/Stars>.png\" />-->\n <i class=\"cfont-12 cnoPad left icon-zero-star\"></i>\n \n <span class=\"blue margin-t\">(No Review)</span>\n \n <span class=\"cfontBold margin-t cColor-Black cColor-SilverDark\">\n Main Mansehra Road, Near Radio Pakistan, Abbottabad.\n </span>\n \n <div 
class=\"inline-block cMobile-Right\">\n <ul class=\"margin-t cMobile-Text-Align-Right\">\n <li>\n <a data-fancybox-type=\"iframe\" href=\"../../dynamic/emailtocustomer.aspx?Request_ID=26207&comp_name=Beena-Beauty-Parlour&isAdvertizer=0\" class=\"other_links fancybox\">Email</a>\n </li>\n <li>\n <a title=\"Call Now\" href=\"tel:+92-992-335556\" class=\"c_circle cMobileShow\"></a>\n </li>\n <li>\n <a class=\"other_links\" href=\"../../company/51529-Beena-Beauty-Parlour\" title=\"Company Detail\">Detail</a>\n </li>\n \n </ul>\n </div>\n </div>\n <div class=\"comp_info contact_info\">\n <strong><a class=\"tel\" href=\"tel:+92-992-335556\">+92-992-335556</a></strong>\n \n </div>\n</div>\n<div class=\"container\">\n <div class=\"comp_info\">\n <h2><a href=\"../../company/86977-Unique-Beauty-Salon\">Unique Beauty Salon</a></h2>\n <!--<img class=\"margin-t\" alt=\"Comapny Rating\" src=\"../../images/Stars>.png\" />-->\n <i class=\"cfont-12 cnoPad left icon-zero-star\"></i>\n \n <span class=\"blue margin-t\">(No Review)</span>\n \n <span class=\"cfontBold margin-t cColor-Black cColor-SilverDark\">\n Palki Wedding Hall, Mandian , Abbottabad.\n </span>\n \n <div class=\"inline-block cMobile-Right\">\n <ul class=\"margin-t cMobile-Text-Align-Right\">\n <li>\n <a data-fancybox-type=\"iframe\" href=\"../../dynamic/emailtocustomer.aspx?Request_ID=61717&comp_name=Unique-Beauty-Salon&isAdvertizer=0\" class=\"other_links fancybox\">Email</a>\n </li>\n <li>\n <a title=\"Call Now\" href=\"tel:+92-313-5856739\" class=\"c_circle cMobileShow\"></a>\n </li>\n <li>\n <a class=\"other_links\" href=\"../../company/86977-Unique-Beauty-Salon\" title=\"Company Detail\">Detail</a>\n </li>\n \n </ul>\n </div>\n </div>\n <div class=\"comp_info contact_info\">\n <strong><a class=\"tel\" href=\"tel:+92-313-5856739\">+92-313-5856739</a></strong>\n \n </div>\n</div></div>\n <div id=\"cRecoredInfo\" class=\"listing dotted\">Displaying listings from 1 to 10 of 10</div>\n <div class=\"text-center 
m-pad-l-r-10\">\n <div id=\"related-suggestions\" class=\"listing inline-block text-center cPad-b-t-10\"><span class=\"left cfont-14\"><b>Related Searches:</b></span> <div class=\"newsssss left inline\" style=\"font-style: italic;font-weight:bold;\"><a href=\"search.aspx?searchtype=cat&class_id=2584\" class=\"left ele-pad-r-l-20 text-underline cfont-14\">Manicuring</a></div><div class=\"newsssss left inline\" style=\"font-style: italic;font-weight:bold;\"><a href=\"search.aspx?searchtype=cat&class_id=2575\" class=\"left ele-pad-r-l-20 text-underline cfont-14\">Hair Removal, Wax, Threading Body & Face</a></div><div class=\"newsssss left inline\" style=\"font-style: italic;font-weight:bold;\"><a href=\"search.aspx?searchtype=cat&class_id=2571\" class=\"left ele-pad-r-l-20 text-underline cfont-14\">Hairstylists</a></div>\n <div class=\"text-left ele-margin-t-b-15 left inline\"><b>Need help with your search?</b> Browse by:<a class=\"text-left ele-pad-r-l-20 text-underline\" onclick=\"hide_show('#related-locations',this);$('#related-categories').addClass('hide');\" href=\"javascript:void(0)\">other locations <img alt=\"\" class=\"margin-l\" width=\"18\" src=\"../../images/plus.png\"></a><a class=\"text-left ele-pad-r-l-20 text-underline\" onclick=\"hide_show('#related-categories',this);$('#related-locations').addClass('hide');\" href=\"javascript:void(0)\">similar categories <img alt=\"\" class=\"margin-l\" width=\"18\" src=\"../../images/plus.png\"></a></div><ul id=\"related-locations\" class=\"col-xs-12 col-sm-12 sugesstion-box hide\">\n <li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=1\" class=\"left\">Karachi</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=2\" class=\"left\">Lahore</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=56\" 
class=\"left\">Islamabad</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=79\" class=\"left\">Rawalpindi</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=49\" class=\"left\">Faisalabad</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=81\" class=\"left\">Gujranwala</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=78\" class=\"left\">Peshawar</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=82\" class=\"left\">Sialkot</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=53\" class=\"left\">Sargodha</a></li></ul>\n <ul id=\"related-categories\" class=\"col-xs-12 col-sm-12 sugesstion-box hide\">\n <li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2574\" class=\"left\">Nail Salons & Services</a></li><li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2572\" class=\"left\">Spas-Beauty, Health And Destination</a></li><li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2564\" class=\"left\">Beauty Institutes</a></li><li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2569\" class=\"left\">Estheticians</a></li></ul>\n </div>\n </div>\n <div class=\"text-center\">\n </div>\n </div>\n </div>\n </div>\n </div>\n </div>\n \n<div class=\"container-fluid bg-silver m-on-mob-hide\">\n <div class=\"row cPad-b-t-10\" style=\"border-bottom:1px solid #ECECEC;\">\n \n </div>\n</div>\n<script>\n (function (i, s, o, g, r, a, m) {\n 
i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {\n (i[r].q = i[r].q || []).push(arguments)\n }, i[r].l = 1 * new Date(); a = s.createElement(o),\n m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)\n })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');\n\n ga('create', 'UA-2028280-1', 'auto');\n ga('send', 'pageview');\n</script>\n<script type=\"text/javascript\" src=\"../css_responsive/script/global_functions.js\"></script>\n<script type=\"text/javascript\" src=\"../styles/scripts/fancybox/jquery.fancybox.js?v=2.1.5\"></script>\n<script type=\"text/javascript\" src=\"../css_responsive/bootstrap-3.3.4-dist/js/bootstrap.js\"></script>\n</body></html>"
soup = BeautifulSoup(html, "lxml")
# Print every href that is not already absolute and is not a javascript: link.
for allLinks in soup.find_all(href=True):
    if allLinks['href'] and not allLinks['href'].startswith("http") and not allLinks['href'].startswith("jav"):
        print(allLinks['href'])
# Do the same for every src attribute.
for allLinks in soup.find_all(src=True):
    if allLinks['src'] and not allLinks['src'].startswith("http") and not allLinks['src'].startswith("jav"):
        print(allLinks['src'])
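This code prints all the links to the console, and with if-elif-else I can tell "../../", "../", "/" and "//" apart and successfully convert each of them to an absolute path. The problem is that when I try to substitute them back with "re.sub", the whole HTML gets mangled again. I am parsing with BS4 rather than with regular expressions, but the problem remains. I cannot post the output here because of the character limit, but for the record, it also breaks "" or any other HTML tags. Please suggest a way to change these links and put them back where they belong. Note: the code has been minimized as akashkarothiya suggested.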
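For reference, here is a minimal sketch of the kind of result I am after, under some assumptions. It assigns the new value to each tag's attribute directly and serializes the tree with `str(soup)`, so no `re.sub` runs over the raw markup. It also leans on `urllib.parse.urljoin` to resolve "../../", "../", "/" and "//" instead of a hand-written if-elif-else. The names `absolutize_links`, `BASE_URL` and `SAMPLE_HTML` are hypothetical, the sample markup is a shortened stand-in for the real page, and `html.parser` is used only so the sketch runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-ins for the real page URL and markup (assumptions).
BASE_URL = "http://www.example.com/dynamic/search.aspx?searchtype=cat&class_id=2566&city_id=55"
SAMPLE_HTML = (
    '<html><head>'
    '<link rel="stylesheet" href="../css_responsive/category.css">'
    '<script src="//www.google-analytics.com/analytics.js"></script>'
    '</head><body>'
    '<a href="search.aspx?searchtype=cat&amp;class_id=2571">Hairstylists</a>'
    '<a href="javascript:void(0)">more</a>'
    '<a href="tel:+92-992-335556">call</a>'
    '<img src="../../images/plus.png">'
    '</body></html>'
)

# Schemes and fragments that should be left exactly as they are.
SKIP_PREFIXES = ("javascript:", "tel:", "mailto:", "#")


def absolutize_links(html, base_url, parser="html.parser"):
    """Return the HTML with every href/src resolved against base_url."""
    soup = BeautifulSoup(html, parser)
    for attr in ("href", "src"):
        # attrs={attr: True} matches any tag that carries the attribute.
        for tag in soup.find_all(attrs={attr: True}):
            value = tag[attr]
            if not value or value.startswith(SKIP_PREFIXES):
                continue
            # Edit the attribute on the parsed tree, not the raw text.
            tag[attr] = urljoin(base_url, value)
    # Serializing the tree keeps the surrounding tags well-formed.
    return str(soup)


if __name__ == "__main__":
    print(absolutize_links(SAMPLE_HTML, BASE_URL))
```

Since the edit happens on the parsed tree, the serializer is what writes the tags back out, which is why I would expect the rest of the markup to stay intact. Already absolute links pass through `urljoin` unchanged, so the `startswith("http")` check is not needed in this version.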