我认为我想做的是一个相当常见的任务,但我在网上找不到任何参考资料。我有带标点符号的文本,我想要一个单词列表。
"Hey, you - what are you doing here!?"
应该是。['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
但是 Python 的 str.split()
只能使用一个参数,所以在使用空格分割后会将标点符号与单词分开。有什么解决办法吗?
首先,在循环中执行任何正则表达式操作之前,始终使用re.compile(),因为它比普通操作更快。
因此,针对您的问题,首先编译模式,然后执行操作。
import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile("[\w']+")
print reg_tok.findall(DATA)
st = "Hey, you - what are you doing here!?"
# replace all the non alpha-numeric with space and then join.
new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
# output of new_string
'Hey you what are you doing here '
# str.split() will remove all the empty string if separator is not provided
new_list = new_string.split()
# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
# we can join it to get a complete string without any non alpha-numeric character
' '.join(new_list)
# output
'Hey you what are you doing'
(''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()
# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
def split_string_on_multiple_separators(input_string, separators):
buffer = [input_string]
for sep in separators:
strings = buffer
buffer = [] # reset the buffer
for s in strings:
buffer = buffer + s.split(sep)
return buffer
def split_string(source, splitlist):
output = [] # output list of cleaned words
atsplit = True
for char in source:
if char in splitlist:
atsplit = True
else:
if atsplit:
output.append(char) # append new word after split
atsplit = False
else:
output[-1] = output[-1] + char # continue copying characters until next split
return output
def tokenizeSentence_Reversible(sentence):
setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
listOfTokens = [sentence]
for delimiter in setOfDelimiters:
newListOfTokens = []
for ind, token in enumerate(listOfTokens):
ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
listOfTokens = [item for sublist in ll for item in sublist] # flattens.
listOfTokens = filter(None, listOfTokens) # Removes empty tokens: ''
newListOfTokens.extend(listOfTokens)
listOfTokens = newListOfTokens
return listOfTokens
\W+
可能适用于这种情况,但可能不适用于其他情况。filter(None, re.compile('[ |,|\-|!|?]').split( "Hey, you - what are you doing here!?")
\w
和\W
的解决方案并不是问题的答案。请注意,在您的答案中,应该删除|
(您考虑的是expr0|expr1
而不是[char0 char1…]
)。此外,没有必要对正则表达式进行compile()
。 - Eric O. Lebigot这是我尝试使用多个分隔符进行拆分的代码:
def msplit( str, delims ):
w = ''
for z in str:
if z not in delims:
w += z
else:
if len(w) > 0 :
yield w
w = ''
if len(w) > 0 :
yield w
def split_string(source,splitlist):
splits = frozenset(splitlist)
l = []
s1 = ""
for c in source:
if c in splits:
if s1:
l.append(s1)
s1 = ""
else:
print s1
s1 = s1 + c
if s1:
l.append(s1)
return l
>>>out = split_string("First Name,Last Name,Street Address,City,State,Zip Code",",")
>>>print out
>>>['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']
我最喜欢的是使用 replace()
方法。以下过程会把字符串 splitlist
中的所有分隔符都更改为第一个分隔符,然后在这个分隔符上对文本进行拆分。它还会考虑到如果 splitlist
恰好是一个空字符串的情况。它将返回一个单词列表,其中不包含空字符串。
def split_string(text, splitlist):
for sep in splitlist:
text = text.replace(sep, splitlist[0])
return filter(None, text.split(splitlist[0])) if splitlist else [text]
def get_words(s):
l = []
w = ''
for c in s.lower():
if c in '-!?,. ':
if w != '':
l.append(w)
w = ''
else:
w = w + c
if w != '':
l.append(w)
return l
以下是使用方法:
>>> s = "Hey, you - what are you doing here!?"
>>> print get_words(s)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
str.split()
方法也可以不传任何参数来使用。 - Ivan Vinogradov