Python: split on spaces, but not the spaces between words or after commas

3
I want to split the following:
11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey

11/2/2019 Pending sale $959,000

into

['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
['11/2/2019', 'Pending sale', '$959,000']

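I tried using regular expressions, but I couldn't work out a re.split() pattern that splits on every space except those between words and after commas.

How can I do this?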

3 Answers

3
You can use this regex, which matches a space that is either not preceded by a letter or comma, or not followed by a letter:
(?<![a-z,]) | (?![a-z])

Demo on regex101

In Python:

import re
a = "11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey"
b = "11/2/2019 Pending sale $959,000"

print(re.split(r'(?<![a-z,]) | (?![a-z])', a, flags=re.IGNORECASE))
print(re.split(r'(?<![a-z,]) | (?![a-z])', b, flags=re.IGNORECASE))

Output:

['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
['11/2/2019', 'Pending sale', '$959,000']
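
As a reading aid, here is a minimal sketch of the same pattern compiled once in verbose mode, so each lookaround carries its own comment. The names SPLIT_RE and split_row are hypothetical and not part of the answer above; the space is written as [ ] because re.VERBOSE ignores bare whitespace in the pattern.

```python
import re

# Same pattern as above, spelled out with comments (hypothetical names).
SPLIT_RE = re.compile(
    r"""
    (?<![a-z,])[ ]   # a space NOT preceded by a letter or a comma
    |                # or
    [ ](?![a-z])     # a space NOT followed by a letter
    """,
    re.IGNORECASE | re.VERBOSE,
)

def split_row(row):
    """Split one row on the spaces that separate columns."""
    return SPLIT_RE.split(row)

print(split_row("11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey"))
print(split_row("11/2/2019 Pending sale $959,000"))
```

Compiling the pattern once also avoids repeating the flags on every call.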

OK, thanks! This works! I had tried to do it with a lookbehind assertion but couldn't figure out the syntax. Now I get it, thanks again! - Sam Skinner

0
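Where does your data come from? Is it from a CSV file? Can you change the delimiter to a comma or some other character?
As it stands, the only delimiter you can split on is the space.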

E.g.:

>>> x = '11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey'
>>> x.split(" ")
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne', 'Freeze-Manning,', 'Kevin', 'Garvey']

Note that it also splits the string "Suzanne Freeze-Manning, Kevin Garvey" into several pieces.

If your delimiter were a tab, you could simply do the following:

E.g.:

>>> x = '11/27/2019\tSold\t$900,000\t-6.2%\tSuzanne Freeze-Manning, Kevin Garvey'
>>> print(x)
11/27/2019  Sold    $900,000    -6.2%   Suzanne Freeze-Manning, Kevin Garvey
>>> x.split("\t")
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
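
If the data really can be exported with tabs, the standard csv module can read whole rows at once. This is only a sketch: the tab-separated version of the second row is made up for illustration, since the question only shows space-separated text.

```python
import csv
import io

# Assumption: both rows are available in tab-separated form.
data = (
    "11/27/2019\tSold\t$900,000\t-6.2%\tSuzanne Freeze-Manning, Kevin Garvey\n"
    "11/2/2019\tPending sale\t$959,000\n"
)

# csv.reader accepts any iterable of lines; delimiter="\t" selects tabs.
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

for row in rows:
    print(row)
```

Rows with fewer columns, such as the pending sale, simply come back as shorter lists.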

If your data always has 5 columns, as in the first string, you can stop splitting after the fourth split by passing a maxsplit of 4:

E.g.:

>>> x.split(" ",4)
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
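
The limit only helps when every row really has five columns. A quick check against both rows from the question shows the difference; the variable names here are illustrative.

```python
rows = [
    "11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey",
    "11/2/2019 Pending sale $959,000",
]

# Split each row on spaces, at most 4 times.
results = [row.split(" ", 4) for row in rows]

for result in results:
    print(result)
```

The first row comes out as wanted, but the second row has a two-word status and only three columns, so "Pending sale" is still split into two items.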

For more details on separators, see https://docs.python.org/3.6/library/stdtypes.html#str.split


0

Try this code:

import re
l = '11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey'

# Replace every space with a placeholder; use a character you are sure never appears in your string
l = l.replace(" ", '&')

# Put the space back after each comma so the words after a comma stay together
l = l.replace(',&', ', ')

# Every remaining space is now an '&'. A split must happen when the previous character
# is a digit (a price, in this case) or a percent sign, or when the next character is
# a '$' (also a price) or a digit. Any other '&' sits between two words, so it is
# turned back into a space; the captured characters around it are kept as they are.
result = re.sub(r'([^0-9, %])&([^0-9, $])', r'\1 \2', l)

# Now only the items that must be split still have an '&' between them
result = result.split('&')

print(result)
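
The same steps can be wrapped in a function and tried on both rows from the question. This is a sketch under one assumption: the unit separator character "\x1f" stands in for '&' as a placeholder that is unlikely to occur in real text. The names MARKER and split_with_marker are hypothetical.

```python
import re

# Assumption: this control character never appears in the input.
MARKER = "\x1f"

def split_with_marker(row, marker=MARKER):
    # Replace every space with the placeholder.
    marked = row.replace(" ", marker)
    # Restore the space after each comma.
    marked = marked.replace("," + marker, ", ")
    # Restore the space between two words (neither side is a digit, '%' or '$').
    m = re.escape(marker)
    marked = re.sub(rf"([^0-9, %]){m}([^0-9, $])", r"\1 \2", marked)
    # The placeholders that remain are the real column boundaries.
    return marked.split(marker)

print(split_with_marker("11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey"))
print(split_with_marker("11/2/2019 Pending sale $959,000"))
```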

Page content provided by Stack Overflow.