Replacing substrings in a list of strings

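I am trying to clean up my sentences by removing the tags in them, which appear as an underscore followed by a word (for example "_UH"). Basically, I want to remove the string that follows each underscore, along with the underscore itself.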
['hanks_NNS sir_VBP',
'Oh_UH thanks_NNS to_TO remember_VB']

Desired output:
['hanks sir',
'Oh thanks to remember']

Here is the code I tried:
for i in text:
    k= i.split(" ")
    print (k)
    for z in k:
        if "_" in z:
            j=z.replace("_",'')
            print (j)

Current output:
ThanksNNS
sirVBP
OhUH
thanksNNS
toTO
rememberVB
RemindVB
1 Answer

Using a regular expression:

You can use re.sub(). Match the unwanted substring in each string and replace it with an empty string:

import re

text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = [re.sub(r'_\S*', r'', a) for a in text]
print(curated_text)

Output:

['hanks sir', 'Oh thanks to remember']

The regular expression:

_\S* - an underscore followed by zero or more non-whitespace characters
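
To make the pattern's behavior concrete, here is a small sketch that lists what it matches in one of the sample strings before removing those matches. The variable names (sample, matches, cleaned) are illustrative and not part of the original answer.

```python
import re

sample = 'Oh_UH thanks_NNS to_TO remember_VB'

# Each match starts at an underscore and runs up to the next whitespace.
matches = re.findall(r'_\S*', sample)
print(matches)

# Replacing every match with an empty string leaves only the words.
cleaned = re.sub(r'_\S*', '', sample)
print(cleaned)
```

Note that the pattern removes everything from the first underscore to the end of the token, so a word that legitimately contains an underscore would be truncated as well.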

Without a regular expression:

text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = [] # Outer container for holding strings in text.

for i in text:
    d = [] # Inner container for holding different parts of same string.
    for b in i.split():
        c = b.split('_')[0] # Discard second element after split
        d.append(c)         # Append first element to inner container.
    # Join the elements of the inner container and append the
    # curated string to the outer container.
    curated_text.append(' '.join(d))

print(curated_text)

Output:

['hanks sir', 'Oh thanks to remember']

The problem with your code:

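You are only replacing '_' with an empty string, when what you actually want is to replace '_' and every character after it with an empty string.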

for i in text:
    k= i.split(" ")
    print (k)
    for z in k:
        if "_" in z:
            j=z.replace("_",'') # <--- 'hanks_NNS' becomes 'hanksNNS'
            print (j)
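
One possible way to repair that loop while keeping its shape is sketched below. It assumes every tag begins at the first underscore of a token; the names cleaned, sentence, words, and token are illustrative and do not come from the question.

```python
text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']

cleaned = []
for sentence in text:
    words = []
    for token in sentence.split(" "):
        # partition() returns (before, separator, after); keep only the
        # part before the first underscore. Tokens without an underscore
        # are returned unchanged.
        words.append(token.partition("_")[0])
    # Rebuild the sentence from the cleaned words.
    cleaned.append(" ".join(words))

print(cleaned)
```

Unlike replace("_", ''), which deletes only the underscore, partition("_")[0] discards the underscore together with the tag that follows it.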

Content provided by Stack Overflow.