Remove all punctuation from a string, except when it sits between digits

Question

4

I have a text that contains words and numbers. Here is a representative example:

string = "This is a 1example of the text. But, it only is 2.5 percent of all data"

I would like to convert it to something like:

"This is a  1 example of the text But it only is  2.5  percent of all data"

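So: remove the punctuation (it could be . or , or any other character in string.punctuation) and insert a space wherever a number and a word are glued together, but keep floats such as the 2.5 in my example.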

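I used the following code:

import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before every capital letter, then normalise whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start, but not there yet!
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Split around every run of digits; this also breaks "2.5" into "2", ".", "5"
item = ' '.join(re.split(r'(\d+)', item))
print(item)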

The result is:

 >> "This is a  1 example of the text. But, it only is  2 . 5  percent of all data"

I am almost there, but I cannot figure out the last part.

- deltascience

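Your example output still contains punctuation. Do you want it removed? - Roelant

1

@Roelant I think that is the last part. - LismUK

@Roelant I just updated the post; I had made a mistake. Please take a look. I want to remove punctuation, but not in things like 2.5. - deltascience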

6 Answers

1

OK folks, here is an answer (the best one? I don't know, but it seems to work):

import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# Split two concatenated strings when the second one starts with a capital letter
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# Separate a number from the word glued to it, as in "1example"
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Magic line: strip non-word characters from both ends of each token, so the
# dot inside a float survives (a token with no word character raises AttributeError)
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
item = item.replace("  ", " ")
print(item)

- deltascience

That carries a lot of overhead. - user557597

@Federico's answer is better, have a look at it. - deltascience

0

Code:

from itertools import groupby

s1 = "This is a 1example of the text. But, it only is 2.5 percent of all data"
s2 = [''.join(g) for _, g in groupby(s1, str.isalpha)]
s3 = ' '.join(s2).replace("   ", "  ").replace("  ", " ")

# You can keep adding a replace() call for each punctuation mark
s4 = s3.replace(". ", " ").replace(", "," ").replace("; "," ").replace("- "," ").replace("? "," ").replace("! "," ").replace(" ("," ").replace(") "," ").replace('" '," ").replace(' "'," ").replace('... '," ").replace('/ '," ").replace(' “'," ").replace('” '," ").replace('] '," ").replace(' ['," ")

s5 = s4.replace("  ", " ")
print(s5)

Output:

'This is a 1 example of the text But it only is 2.5 percent of all data'

P.S. You can look through the punctuation marks you need and add each of them as another .replace() call.
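
For illustration only, the chain of .replace() calls could be generated from string.punctuation instead of being typed out by hand. This is a minimal sketch, not part of the answer above; the helper name strip_punct_before_space is made up, and it only handles a mark that is directly followed by a space, so a mark at the very end of the string or an opening mark such as " (" is left alone.

```python
import string

def strip_punct_before_space(text):
    """Replace every '<punctuation mark><space>' pair with a single space.

    Hypothetical helper: it mirrors the chained .replace() idea, but loops
    over string.punctuation so every ASCII mark is covered.
    """
    for mark in string.punctuation:
        text = text.replace(mark + " ", " ")
    return text

s = "This is a 1 example of the text. But, it only is 2.5 percent of all data"
print(strip_punct_before_space(s))
# This is a 1 example of the text But it only is 2.5 percent of all data
```

The dot in 2.5 survives only because it is followed by a digit rather than a space.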

- dot.Py

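Same as @Shivam's answer... punctuation is not just "." and ",". - deltascience

I know this is not the best solution, but you do realise you can keep adding .replace() calls? I added a few to my example to cover more punctuation marks. Still, I think the best approach is a regex pattern. - dot.Py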

0

Here is a regex approach:

([^ ]?)(?:[^\P{punct}.]|(?<!\d)\.(?!\d))([^ ]?)

Replace inside a callback:

If the length of $1 > 0 and the length of $2 > 0,
replace with $1 + space + $2;
otherwise replace with $1$2.

Expanded:

 ( [^ ]? )                     # (1)
 (?:
      [^\P{punct}.] 
   |  
      (?<! \d )
      \.
      (?! \d )
 )
 ( [^ ]? )                     # (2)

If you do not want any logic for the characters right next to the punctuation,
use (?:[^\P{punct}.]|(?<!\d)\.(?!\d)) and replace with nothing.
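
A hedged Python sketch of the callback described above. Python's built-in re module does not support \p{...} property classes, so this sketch swaps in a character class built from string.punctuation (ASCII only); that substitution and the names PATTERN, _join and remove_punctuation are assumptions made for illustration, not part of the original regex.

```python
import re
import string

# Every ASCII punctuation mark except the dot, escaped for use inside [...].
_PUNCT_NO_DOT = re.escape(string.punctuation.replace(".", ""))

# (1) optional non-space neighbour, then either a non-dot punctuation mark
# or a dot with no digit directly before or after it, then (2) another
# optional non-space neighbour.
PATTERN = re.compile(
    r"([^ ]?)(?:[" + _PUNCT_NO_DOT + r"]|(?<!\d)\.(?!\d))([^ ]?)"
)

def _join(match):
    """Keep the neighbours; put a space between them only if both exist."""
    left, right = match.group(1), match.group(2)
    if left and right:
        return left + " " + right
    return left + right

def remove_punctuation(text):
    return PATTERN.sub(_join, text)

print(remove_punctuation("the text.But, it only is 2.5 percent?"))
# the text But it only is 2.5 percent
```

Because the neighbour on the right is consumed by the match, two marks in a row (for example "..") are not both removed in a single pass.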

- user557597

0

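I am fairly new to Python, but I have some insight into regular expressions. May I suggest using the OR operator? I would use this regex: "(\d+)([a-zA-Z])|([a-zA-Z])(\d+)", and replace it with: "\1 \2"
If some special cases bother you, you can pass the back-references to a procedure and handle them one by one, perhaps by checking whether your "\1\2" can be converted to a float. TCL has this kind of feature built in, and Python should too.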
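
A minimal Python sketch of the same idea, with one change that should be stated plainly: in the alternation above, the second branch fills groups 3 and 4, so a single "\1 \2" replacement would not cover both orders. The sketch uses zero-width lookarounds instead, which need no back-references. The names BOUNDARY, space_digits_and_letters and looks_like_float are hypothetical.

```python
import re

# Zero-width position between a digit and a letter, in either order.
BOUNDARY = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")

def space_digits_and_letters(text):
    """Insert a space wherever a digit touches a letter."""
    return BOUNDARY.sub(" ", text)

def looks_like_float(token):
    """Return True if float() accepts the token, False otherwise."""
    try:
        float(token)
        return True
    except ValueError:
        return False

print(space_digits_and_letters("a 1example and abc2 at 2.5"))
# a 1 example and abc 2 at 2.5
```

The float check is the Python counterpart of the conversion test mentioned above: float() raises ValueError for a token such as "1example".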

- user1134991

0

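I tried this and it works well.

a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
a = a.replace(". ", " ").replace(", ", " ")

Note that in each replace call the punctuation mark is followed by a space. I simply replace the mark plus the space with a single space, and assign the result back because replace() returns a new string.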

- Shivam Pandya

This solution only works for . and , but string.punctuation contains many more characters. I need something that handles all punctuation... - deltascience

- Federico Piazza · Accepted Answer

You can use regex lookarounds like this:

(?<!\d)[.,;:](?!\d)

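Working demo

The idea is to build a character class that collects the punctuation marks you want to remove, and to use lookarounds so that a mark is matched only when it has no digit directly before it and no digit directly after it.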

import re

regex = r"(?<!\d)[.,;:](?!\d)"

test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"

result = re.sub(regex, "", test_str)
print(result)

The result is:

This is a 1example of the text But it only is 2.5 percent of all data
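
As a hedged extension of the lookaround idea, the sketch below covers every character in string.punctuation and also separates "1example", which is what the question asked for. It is a variation rather than the regex above: a mark is kept only when a digit sits on both sides of it, and marks are replaced with a space instead of an empty string so that "text.But" does not collapse into "textBut". The names OUTSIDE_NUMBER, DIGIT_LETTER and clean are made up for this example.

```python
import re
import string

_P = "[" + re.escape(string.punctuation) + "]"

# A mark with no digit before it, or a mark with no digit after it.
# Only a mark with a digit on BOTH sides escapes this pattern.
OUTSIDE_NUMBER = re.compile(r"(?<!\d)" + _P + r"|" + _P + r"(?!\d)")
# Zero-width position between a digit and a letter, in either order.
DIGIT_LETTER = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")

def clean(text):
    text = OUTSIDE_NUMBER.sub(" ", text)   # punctuation -> space
    text = DIGIT_LETTER.sub(" ", text)     # "1example" -> "1 example"
    return " ".join(text.split())          # collapse runs of whitespace

print(clean("This is a 1example of the text. But, it only is 2.5 percent of all data"))
# This is a 1 example of the text But it only is 2.5 percent of all data
```

One consequence of the both-sides rule is that a sentence-final dot after a number ("costs 3. Next") is removed, while a leading sign or currency symbol ("-5", "$2.50") is removed as well, so this may need adjusting for data where those matter.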