How to parse a <script> tag with BeautifulSoup

3

I am trying to read window.appCache from a Glassdoor reviews page:

import requests
from bs4 import BeautifulSoup

url = "https://www.glassdoor.com/Reviews/Alteryx-Reviews-E351220.htm"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(html.content, 'html.parser')
text = soup.find_all("script")[0].text

This isolates the dictionary I need, but when I try to call json.loads() on it, I get the following error:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) 

I checked the type of text and it is str.

When I write text to a file, it looks like this (only part of the output, since the full output is about 5,000 lines):

window.appCache={"appName":"reviews","appVersion":"7.14.12","initialState"
{"surveyEndpoint":"https:\u002F\u002Femployee-pulse-survey-b2c.us-east-1.prod.jagundi.com",
"i18nStrings":{"_":"JSON MESSAGE BUNDLE - do not remove",
"eiHeader.seeAllPhotos":"
See All Photos","eiHeader.viewJobs":"View Jobs",
"eiHeader.bptw.description":"This employer is a winner of the [year] Best Places to Work award. 
Winners were determined by the people who know these companies best...

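I only care about the "reviews":[ field, which is buried about halfway through the data, but I can't seem to parse the string as JSON and retrieve what I need.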


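You will most likely need to do something like json.loads(text.split('=', 1)[1].strip()), but more string manipulation may be needed to isolate the part you need. - dskrypa
Sometimes the easiest way to pull something out of a script tag (once you have the raw text) is to use a regular expression. - dskrypa
3 Answers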

2
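bs4 only parses HTML; it does not parse JavaScript (or CSS). As some comments mentioned, a common approach is to split text at the = and then parse window.appCache with json.loads, but in this case that still raises JSONDecodeError, because window.appCache contains JS functions and JS primitive values (such as undefined).
I have a function, findObj_inJS, that uses slimit to parse a string containing JavaScript code and extract objects/variables from it. For example, findObj_inJS(text, '"reviews"') would return: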
{'name': 'Native_infosite_reviews_fluid_en-US', 'id': 'div-AdSlot-native-infosite-reviews', 'fluid': True}

and findObj_inJS(text, '"reviews"', findAll=True) would return:
[
 {'name': 'Native_infosite_reviews_fluid_en-US', 'id': 'div-AdSlot-native-infosite-reviews', 'fluid': True},
 [
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 72183587, 'reviewDateTime': '2022-12-29T01:05:43.253', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 6, 'employmentStatus': None, 'jobEndingYear': None, 'jobTitle': None, 'location': None, 'originalLanguageId': None, 'pros': '-Great managers/leaders -Great benefits -Remote work', 'prosOriginal': None, 'cons': 'Typical like other companies where newbies get higher salary and you have to work your way up for promotions nothing really bad', 'consOriginal': None, 'summary': 'Great place to work, been here 4+ years', 'summaryOriginal': None, 'advice': "Don't rush too finish a project", 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71986771, 'reviewDateTime': '2022-12-19T12:56:24.837', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 0, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:46094'}, 'location': None, 'originalLanguageId': None, 'pros': 'Great people, great culture, and exciting times ahead', 'prosOriginal': None, 'cons': 'Nothing to complain about for internal issues', 'consOriginal': None, 'summary': 'Best culture ever!', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71858088, 'reviewDateTime': '2022-12-14T08:39:44.030', 'ratingOverall': 4, 'ratingCeo': None, 'ratingBusinessOutlook': None, 'ratingWorkLifeBalance': 0, 'ratingCultureAndValues': 0, 'ratingDiversityAndInclusion': 0, 'ratingSeniorLeadership': 0, 'ratingRecommendToFriend': None, 'ratingCareerOpportunities': 0, 'ratingCompensationAndBenefits': 0, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 0, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': 'Alteryx has a good comp plan and if you’re a high performer they let you maximize your earnings. The product is amazing and we have a very fanatic customer base that love what the platform does.', 'prosOriginal': None, 'cons': 'A boys club in sales leadership. We have a female CRO and the diversity ends there. 90% of sales leaders are men and are buddies of current leaders that are brought over from their past jobs. There are leaders who have HR complaints against them but still hold jobs because they’re friends with SVP. The only female segment leader isn’t even given the same title as her male peers for holding the same job, she is an RVP while her 3 male counterparts are VPs. The sexism in leadership and in the sales org is pretty blatant and has not improved. They hired a DEI leader who does not seem to want to investigate the issues in sales even though they’ve been raised by lots of reps.', 'consOriginal': None, 'summary': 'Amazing product, great benefits, sexist sales culture', 'summaryOriginal': None, 'advice': 'You need to listen to individual contributors and lower level folks, and not just rely on your SVPs and VPs to get a pulse on the org. 
Younger reps care about diversity and inclusion and real equity, not just lip service and you will struggle to get any talent under 50 years old (like you have for years) to join the company since they will prefer organizations with better policies like salesforce. The fact that we cannot recruit female sellers because our maternity policy is still not on par with our tech peers should be concerning but nobody seems to discuss that other than 1st line manager who end up giving up and settling for having 1 woman per team. At some point you will fall so behind you won’t be able to catch up with the industry and become a company with a modern culture.', 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 6, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414519'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 72218335, 'reviewDateTime': '2022-12-30T17:50:21.263', 'ratingOverall': 2, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'NEGATIVE', 'ratingWorkLifeBalance': 2, 'ratingCultureAndValues': 3, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 4, 'ratingRecommendToFriend': 'NEGATIVE', 'ratingCareerOpportunities': 2, 'ratingCompensationAndBenefits': 4, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 2, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:34553'}, 'location': None, 'originalLanguageId': None, 'pros': 'Genuinely nice people are working at Alteryx. Great vision and hands-on c-suite leaders.', 'prosOriginal': None, 'cons': "Not many nice people aren't highly-skilled people. Many of PMs did not have prior PM experience from a tech company. Middle-managers are inexperienced except a few superstar PM Directors. Coming from tech industry with many years of experience, Alteryx is an extremely frustrating workplace. The best people from the tech industry who joined the company is leaving quickly because of that. The recent acquisitions made many of us in the states to work in early morning and evening because they came with off-shore offices, and WLB declined significantly this year.", 'consOriginal': None, 'summary': 'Sales driven senior leaders with below average Product and Engineering', 'summaryOriginal': None, 'advice': 'Please hire silicon valley top talents for the middle-manager roles instead of keep hiring their former colleagues from some mediocre companies. Otherwise, you will keep losing more talented professionals. Let our CPO lead the product engineering innovation. 
Too many teams outside of PE have so much to say and influence.', 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71891292, 'reviewDateTime': '2022-12-15T09:29:44.833', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 3, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 2, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': 'Supportive Executives Supporting teams Great compensation', 'prosOriginal': None, 'cons': 'Mid-level management are overkill Enterprise Team is confused on their objective', 'consOriginal': None, 'summary': 'Great place to work', 'summaryOriginal': None, 'advice': 'Keep it simple for sales', 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71907512, 'reviewDateTime': '2022-12-15T22:28:47.670', 'ratingOverall': 3, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 3, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 4, 'ratingSeniorLeadership': 3, 'ratingRecommendToFriend': 'NEGATIVE', 'ratingCareerOpportunities': 3, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 4, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:29284'}, 'location': {'__ref': 'City:1148161'}, 'originalLanguageId': None, 'pros': '- Great culture - Diverse team - Good base salary compared to companies in the same field - Opportunities for networking - Competitive benefits package - Product adoption has been increasing over the years - Teammates always willing to help - Opportunity to learn from the best in the field', 'prosOriginal': None, 'cons': '- Lack of transparency in the workplace - Poor employee promotion/retention plan - Meritocracy is not used for promotions - Some professionals are underappreciated and undervalued, while low performers are highly recognized - Difficulties in finding meaning in the work', 'consOriginal': None, 'summary': 'Good company, but poor leadership', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 2, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71631140, 'reviewDateTime': '2022-12-05T11:39:02.247', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 1, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': "Alteryx is the most employee focused organization that I've worked for. In addition to excellent compensation and benefits, examples of how Alteryx goes above and beyond include: providing time to focus on mental health (two days off a year), encouraging employees to take time off to volunteer through Alteryx for Good, and providing opportunities for employees to grow their career through an emerging leaders program. The culture of the organization, from the leadership team on down, is very transparent, passionate, and positive which is why I think the company continues to hire and maintain some of the best and brightest in the tech space. If you have a chance to work at Alteryx, I would encourage you to make the move!", 'prosOriginal': None, 'cons': "None - I'm looking forward to 2023!", 'consOriginal': None, 'summary': 'The Best of the Best!', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414520'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71567507, 'reviewDateTime': '2022-12-02T08:53:58.140', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 1, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': 'Compensation is in line with top tech firms - I came from a fortune 100 technology company and got a raise and additional equity here. The tools provided to you are world class, and they consistently invest in helping their people be more productive. Leadership is incredible, best I have ever seen across my 12 year career in sales. Marketing and sales talk to each other, so events are clearly communicated and the sales team has input on creating events sponsored by marketing/getting marketing dollars to cover events that the company should have exposure at. On-premises product is outstanding.', 'prosOriginal': None, 'cons': 'Account penetration in certain segments can be difficult because we are unknown to them. Once your foot is in the door, the ability to solve business problems and integrate into the current technology stack is unmatched. Cloud platform is still developing, but currently does not have the capabilities to be an option for enterprise level companies without the on-premises technology supporting it. 
Still 12-24 months away.', 'consOriginal': None, 'summary': 'They get it', 'summaryOriginal': None, 'advice': "Keep doing what you're doing.", 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414562'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71924396, 'reviewDateTime': '2022-12-16T12:34:18.837', 'ratingOverall': 4, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 4, 'ratingCultureAndValues': 4, 'ratingDiversityAndInclusion': 4, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 4, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 2, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:274217'}, 'location': None, 'originalLanguageId': None, 'pros': 'Passionate employees, growing and scaling in the Data Analytics space.', 'prosOriginal': None, 'cons': 'none come to mind worth mentioning at this time', 'consOriginal': None, 'summary': 'Great company', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71542996, 'reviewDateTime': '2022-12-01T11:24:00.407', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 1, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:2766820'}, 'location': {'__ref': 'City:1146798'}, 'originalLanguageId': None, 'pros': "Alteryx has truly exceeded all my expectations! From the culture it's created for employees, to the amazing peers and leaders I work with. Sr. Leaders have a clear vision and roadmap for the Org. and I'm so excited to be able to be part of it. Best decision I've ever made! Proud to be an Alteryx employee", 'prosOriginal': None, 'cons': 'Absolutely NONE! This organization Rocks!', 'consOriginal': None, 'summary': 'Amazing Company, Amazing People!', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414563'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None}
 ]
]

I think you probably want findObj_inJS(text, '"reviews"', findAll=True)[1].

1

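json.loads() expects a string that contains a JSON document. Because text starts with window.appCache=, its value is not valid JSON.

That is not the only problem, though. I tried slicing text to drop the window.appCache= prefix: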

text = text[len("window.appCache="):]

and it then gave me this error:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 68113 (char 68112)

So I checked the value of text[68110:] and found that it complains because text really is not a valid JSON document:

undefined,"useErrorPages":true},"parsedRequest":{"urlData":{"url":"\u002FReviews\u002FAlteryx-Reviews-E351220.htm","params":{"employerId":351220,"page":null},"pagePrefix":"P","origin":"http:\u002F\u002Fwww.glassdoor.com"}},"seoConfig":{"appName":"reviews","staticPaths":[],"seoABTest":{},"pageType":"EMPLOYER_INFO","pageContentType":"REVIEWS","urlRegexMatchers":[function genericEiReviewsUrlMatcher(originalUrl) {
  var url = decodeURIComponent(originalUrl);
  var result = {
    params: {},
    helpers: {
      dos2ExperimentHelpers: _dos2ExperimentHelpers["default"]
    }
  };

  var getHumanReadableText = function getHumanReadableText(data) {
    return data.replace(/[+-]/g, ' ');
  };

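This is the result of text[68110:]. It is a JavaScript object, but not a valid JSON object.
A JSON value cannot be any of the following data types:
  • a function
  • a date
  • undefined
As you can see, text has undefined and function values for some of its fields.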
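The failure can be reproduced on a small scale. The sketch below uses a made-up string (it is a stand-in, not the real Glassdoor payload) that has the same shape: an assignment prefix, one undefined value, and a "reviews" list. It strips the prefix the way the comments suggest and shows that json.loads still stops at the first non-JSON token.

```python
import json

# Hypothetical stand-in for the script text; the real payload is far larger.
text = (
    'window.appCache={"appName":"reviews","flag":undefined,'
    '"reviews":[{"reviewId":1}]}'
)

# Drop everything up to and including the first "=".
payload = text.split("=", 1)[1].strip()

try:
    json.loads(payload)
    error = None
except json.JSONDecodeError as exc:
    error = exc

# error.pos is the character offset where the decoder gave up.
print(error)
print(payload[error.pos:error.pos + 20])
```

Printing the slice that starts at error.pos is a quick way to see which token the decoder rejected, which is the same check as text[68110:] above.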
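If you want the value of one specific field (such as the "reviews" you mentioned), I suggest parsing the string manually, with a regular expression or a similar tool.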
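One possible way to do that manual parsing without a regular expression is json.JSONDecoder.raw_decode, which decodes a single JSON value starting at a given offset and ignores whatever follows it. This is an alternative technique, not something used in the answers here. The helper name iter_json_values and the sample string are made up for illustration, and the sketch assumes compact output with no whitespace after the colon, and that the wanted value itself contains no undefined or function entries.

```python
import json


def iter_json_values(text, key):
    """Yield every valid JSON value that directly follows '"key":' in text."""
    marker = f'"{key}":'
    decoder = json.JSONDecoder()
    start = text.find(marker)
    while start != -1:
        try:
            # raw_decode parses one value at the offset and ignores the rest.
            value, _end = decoder.raw_decode(text, start + len(marker))
            yield value
        except json.JSONDecodeError:
            pass  # the value at this spot is not JSON (e.g. undefined); skip it
        start = text.find(marker, start + 1)


# Hypothetical stand-in for the script text.
sample = (
    'window.appCache={"flag":undefined,'
    '"reviews":{"name":"ad-slot"},'
    '"data":{"reviews":[{"reviewId":1,"pros":"ok"}]},'
    '"matchers":[function f(u) { return u; }]}'
)

# The key can occur more than once, so pick the first value that is a list.
review_list = next(
    v for v in iter_json_values(sample, "reviews") if isinstance(v, list)
)
print(review_list)
```

Because the key may appear several times (the first hit on the real page is reportedly an ad-slot object rather than the review list), the caller filters on the value's type instead of taking the first match.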

1

One solution is to parse the required data with the re and json modules:

import json
import pprint
import re

import requests

url = "https://www.glassdoor.com/Reviews/Alteryx-Reviews-E351220.htm"

html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

reviews = re.search(r'"reviews":(\[.*?}\])}', html, flags=re.S).group(1)
reviews = json.loads(reviews)

pprint.pprint(reviews)

Output:

[{'__typename': 'EmployerReview',
  'advice': "Don't rush too finish a project",
  'adviceOriginal': None,
  'cons': 'Typical like other companies where newbies get higher salary and '
          'you have to work your way up for promotions nothing really bad',
  'consOriginal': None,
  'countHelpful': 0,
  'countNotHelpful': 0,
  'divisionLink': None,
  'divisionName': None,
  'employer': {'__ref': 'Employer:351220'},
  'employerResponses': [],
  'employmentStatus': None,
  'isCovid19': False,
  'isCurrentJob': True,
  'isLanguageMismatch': False,
  'isLegal': True,
  'jobEndingYear': None,
  'jobTitle': None,
  'languageId': 'eng',
  'lengthOfEmployment': 6,
  'location': None,

...and so on.

Thank you! This is exactly what I needed. For future reference, could you explain what exactly the regular expression here does? - Dunc
1
@Dunc "reviews":(\[.*?}\])} will try to find the text that follows "reviews": and ends with }]} (which should be the last three characters of the "reviews" variable). - Andrej Kesely
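To see the pattern at work without hitting the site, here is a minimal sketch that runs it against a made-up string (not real Glassdoor data). The non-greedy .*? expands only until the first }]} it meets, the capture group keeps everything up to and including }], and re.S lets . match newlines inside the review text.

```python
import json
import re

# Hypothetical stand-in for the page source.
sample = (
    'window.appCache={"flag":undefined,"data":{"reviews":'
    '[{"id":1,"tags":["a","b"]},{"id":2,"tags":[]}]},'
    '"fn":function f() {}}'
)

match = re.search(r'"reviews":(\[.*?}\])}', sample, flags=re.S)
reviews = json.loads(match.group(1))

print(reviews)
```

One caveat: because the match stops at the first }]}, a review whose last field is itself a list of objects would cut the capture short, and json.loads would then fail on the truncated text.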
