How to extract numbers from a multi-line string in Python 3

Question

3

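I have been struggling to get all the digits out of a multi-line string named price (and, if possible, out of the product string too).

I am scraping product names and prices from a website with Python and writing the results to a file, which looks like this: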

Master C141,"

6

                    999

                        .
                        -
                "
Master 220,"

6

                    499

                        .
                        -
                "
Master C170,"

12

                    499
                        .
                        -
                "

I have tried many different code examples from Stack Overflow and several other sites, but none of them worked. The output I want looks like this:

Master C141, 6999

Master 220, 6499

Master C170, 12499

Here is the code:

content = driver.page_source

products=[] #List to store name of the product
prices=[] #List to store price of the product

soup = BeautifulSoup(content,"html.parser")
for a in soup.findAll('div', attrs={'class':'c-product-listing__col'}):
    name=a.find('h2', attrs={'class':'c-product-card__heading'})
    price=a.find('div', attrs={'class':'c-price-tag__price'})

    print(re.findall(r"\d+", price.text))
    
    products.append(name.text)
    prices.append(price.text)

df = pd.DataFrame({'Product Name':products,'Price':prices}) 
df.to_csv('products.txt', index=False, encoding='utf-8')

- Jiess

3 Answers

2

This answer assumes we start from the text shown in your question, held in a string named text:

import re

output = re.sub(r'\b(Master \w+,).*?(\d+).*?(\d+).*?(?=\bMaster|$)', r'\1 \2\3\n', text, flags=re.S).strip()
print(output)

This prints:

Master C141, 6999
Master 220, 6499
Master C170, 12499

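Here we simply capture the Master term and the two numbers that follow it, then combine them to produce the desired output. Note that we use the dot-all flag (re.S) so that .*? can match across line breaks.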
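To check the substitution in isolation, here is a minimal self-contained sketch. It assumes the file content from the question has been read into a string named text; the indentation in the sample is shortened, which should not matter because the pattern skips over whitespace.

```python
import re

# Sample text modelled on the file content in the question
# (indentation shortened; the pattern does not depend on it).
text = '''Master C141,"

6

    999

        .
        -
    "
Master 220,"

6

    499

        .
        -
    "
Master C170,"

12

    499
        .
        -
    "
'''

# Group 1: the product name up to the comma.
# Groups 2 and 3: the first two runs of digits after it.
# The trailing lookahead stops each match just before the next
# "Master" entry or at the end of the text.
pattern = r'\b(Master \w+,).*?(\d+).*?(\d+).*?(?=\bMaster|$)'

output = re.sub(pattern, r'\1 \2\3\n', text, flags=re.S).strip()
print(output)
```

Without flags=re.S the dot does not match newlines, so the pattern finds nothing in this text. That may explain an empty result in an online tester where the flag is not enabled.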

- Tim Biegeleisen

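Hi Tim, thanks. I could not get your code to work, though, so I tried the regex part on pythex.org, but it did not match anything... Also, could you give an example that uses the values of the objects "name" and "price" instead of the static text "Master"? The product name varies and could be anything. Thanks. - Jiess

Maybe using \w+ in place of Master would work better? Please check the demo here to see whether my answer works. You can start from there. - Tim Biegeleisen

Thanks for your contribution, Tim! - Jiess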

1

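What happens?

You already have a solution; the only problem is that you are not appending its result to your list.

How to fix?

Append the result of the regex to your list:

prices.append(re.findall(r"\d+", price.text))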

Applied to your example:

...
products=[] #List to store name of the product
prices=[] #List to store price of the product

soup = BeautifulSoup(content,"html.parser")
for a in soup.find_all('div', attrs={'class':'c-product-listing__col'}):
    name=a.find('h2', attrs={'class':'c-product-card__heading'})
    price=a.find('div', attrs={'class':'c-price-tag__price'})

    products.append(name.text)
    prices.append(re.findall(r"\d+", price.text))
...
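One thing to keep in mind is that re.findall returns a list of digit runs rather than a single string, so each entry appended this way is a list. The sketch below shows the difference and one possible way to join the pieces. The variable price_text is a hypothetical stand-in for price.text, modelled on the sample in the question.

```python
import re

# Hypothetical stand-in for price.text, modelled on the question's sample.
price_text = "\n\n6\n\n                    999\n\n                        .\n                        -\n                "

# findall returns every run of digits as a separate list item.
digit_groups = re.findall(r"\d+", price_text)

# Joining the runs gives one price string.
joined_price = "".join(digit_groups)

print(digit_groups)  # ['6', '999']
print(joined_price)  # 6999
```

If a single value per product is wanted in the CSV, appending the joined string instead of the list may be closer to the output requested in the question.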

- HedgeHog

Hi HedgeHog, I could not get this line to work: prices.append(re.findall("\d+", price.text)) so I used the following code instead: strPrice = price.text; strPrice = re.sub('\D', '', strPrice) Thanks for your help! - Jiess

- Jiess · Accepted Answer

.....
.....
content = driver.page_source

products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product

soup = BeautifulSoup(content,"html.parser")
for a in soup.find_all('div', attrs={'class':'c-product-listing__col'}):
    name=a.find('h2', attrs={'class':'c-product-card__heading'})
    price=a.find('div', attrs={'class':'c-price-tag__price'})
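    strProduct = name.text
    strPrice = price.text
    
    # Keep only the part of the product name before the first comma
    strProduct = re.match(r'[^,]+', strProduct)[0]
    # Remove every non-digit character so that only the digits of the price remain
    strPrice = re.sub(r'\D', '', strPrice)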
    
    products.append(strProduct)
    prices.append(strPrice)

df = pd.DataFrame({'Product Name':products,'Price':prices}) 
df.to_csv('products.csv', index=False, encoding='utf-8')
driver.quit()
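The two cleaning steps above can be tried without a browser or BeautifulSoup. The following is a minimal sketch in which the helper names clean_name and clean_price and the sample strings are hypothetical; only the two regular expressions come from the code above.

```python
import re

def clean_name(raw_name):
    """Return the text before the first comma, without surrounding whitespace."""
    match = re.match(r'[^,]+', raw_name)
    # re.match returns None when the string is empty or starts with a comma,
    # so fall back to the stripped input instead of indexing None.
    return match[0].strip() if match else raw_name.strip()

def clean_price(raw_price):
    """Drop every non-digit character, leaving only the digits."""
    return re.sub(r'\D', '', raw_price)

# Sample values modelled on the scraped output shown in the question.
raw_name = 'Master C141,"'
raw_price = "\n\n6\n\n                    999\n\n                        .\n                        -\n                "

print(clean_name(raw_name))    # Master C141
print(clean_price(raw_price))  # 6999
```

Because re.sub with \D removes everything that is not a digit, this approach does not depend on how the whitespace and separators are laid out in the scraped text. It would also merge any decimal part into the number, so it fits prices like the ones above that have no fractional digits.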