How to remove dates from a list in Python

Question

7

I have a list of tokenized text (list_of_words) that looks like this:

list_of_words = [
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 'vendor',
 'per',
 'mfg/recommend',
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 ...]

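I am trying to remove every date and time from this list. I tried the .remove() function, with no luck. I also tried adding wildcard patterns such as "../../...." to the stop-word list I am filtering against, but that did not work either. Finally, I tried writing the following code: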

for line in list_of_words:
    if re.search('[0-9]{2}/[09]{2}/[0-9]{4}',line):
        list_of_words.remove(line)

But that does not work either. How can I strip everything formatted as a date or a time out of my list?

- MrYuck

2

Do you want to remove one specific date and/or time format? - mng

3 Answers

9

If you want to match the time and date strings in the list, you could try the following regular expression:

[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}

Adding Python code:

import re

list_of_words = [
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 'vendor',
 'per',
 'mfg/recommend',
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet'
]
new_list = [item for item in list_of_words if not re.search(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
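
For context, one reason a loop like the one in the question misses items is that it removes elements from the list it is iterating over: after each removal the remaining items shift left, so the item right after a removed one is never examined. A minimal sketch of that effect and of the usual fix, using a made-up words list and a date-only pattern (both are assumptions for illustration, not taken from the original post):

```python
import re

# Hypothetical sample data: two dates in a row, then an ordinary word.
words = ['08/20/2014', '08/21/2014', 'vendor']
date_pattern = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

# Removing while iterating: the second date slides into index 0 after the
# first removal, and the loop moves on to index 1 without ever checking it.
mutated = list(words)
for word in mutated:
    if date_pattern.search(word):
        mutated.remove(word)

# Building a new list visits every item exactly once.
filtered = [word for word in words if not date_pattern.search(word)]

print(mutated)   # ['08/21/2014', 'vendor']
print(filtered)  # ['vendor']
```

Building a new list, as the list comprehension in this answer does, sidesteps the problem entirely.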

- BertramLAU

Your regex is great. I have used it in my answer. - Olivier Pellier-Cuit

2

@duser6188402 \d matches every Unicode digit, whereas [0-9] is limited to these 10 characters, so [0-9] is more efficient. - BertramLAU
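
A small sketch of the difference described in this comment, assuming Python 3 str patterns (where \d is Unicode-aware unless re.ASCII is passed). The Arabic-Indic sample string is an illustrative assumption, and the sketch shows only what each pattern matches, not how fast it runs:

```python
import re

# "08" written with Arabic-Indic digits (U+0660 and U+0668).
arabic_indic = '\u0660\u0668'

# \d accepts any Unicode decimal digit by default.
unicode_digits = re.fullmatch(r'\d{2}', arabic_indic) is not None
# [0-9] accepts only the ten ASCII digits.
ascii_class = re.fullmatch(r'[0-9]{2}', arabic_indic) is not None
# re.ASCII restricts \d to the ASCII digits as well.
ascii_flag = re.fullmatch(r'\d{2}', arabic_indic, flags=re.ASCII) is not None

print(unicode_digits)  # True
print(ascii_class)     # False
print(ascii_flag)      # False
```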

Compiling the regular expression with re.compile and then using the compiled pattern would be clearer and more efficient. - 2Cubed
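
A minimal sketch of what this comment suggests, reusing the pattern from the answer; the names DATE_OR_TIME and strip_dates_and_times are hypothetical. The re module also caches recently used patterns internally, so the gain is often more about readability than raw speed:

```python
import re

# Compile once, then reuse the compiled pattern for every item.
DATE_OR_TIME = re.compile(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}')

def strip_dates_and_times(words):
    """Return a new list without items that contain a date or a time."""
    return [word for word in words if not DATE_OR_TIME.search(word)]

sample = ['08/20/2014', '10:04:27', 'pm', 'complet', 'mfg/recommend']
print(strip_dates_and_times(sample))  # ['pm', 'complet', 'mfg/recommend']
```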

2

Try this:

import re

list_of_words = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complet',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 'complet']

# filter() returns a lazy iterator in Python 3, so wrap it in list().
list_of_words = list(filter(
    lambda x: not re.match(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', x),
    list_of_words))
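
One detail worth noting: re.match only anchors at the start of the string, whereas re.search (used in the other answer) scans the whole string, and neither requires the match to reach the end. A short sketch of the difference, with made-up tokens chosen purely for illustration:

```python
import re

pattern = re.compile(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}')

# Hypothetical tokens: a date with a prefix, and a date with a suffix.
prefixed = 'ref08/20/2014'
suffixed = '08/20/2014x'

print(bool(pattern.match(prefixed)))      # False: match must start at index 0
print(bool(pattern.search(prefixed)))     # True: search scans the whole string
print(bool(pattern.match(suffixed)))      # True: trailing text is allowed
print(bool(pattern.fullmatch(suffixed)))  # False: the whole string must match
```

Which behaviour is wanted depends on whether tokens that merely contain a date should be dropped too.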

- Olivier Pellier-Cuit

- Ro Yo Mi · Accepted Answer

Description

^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$

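Regular expression visualization

This regular expression will do the following:

Find strings that look like a date (12/23/2016) or a time (12:34:56)
Also find the strings am and pm, which were probably part of the preceding time in the source list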

Example

Live demo

Regular expression: https://regex101.com/r/yE8oB9/2
Python: http://codepad.org/X9D3pd7s

Sample list

08/20/2014
10:04:27
pm
complete
vendor
per
mfg/recommend
08/20/2014
10:04:27
pm
complete

List after processing

complete
vendor
per
mfg/recommend
complete

Sample Python script

import re

SourceList = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complete',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 
                 'complete']

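# Keep only the words that are not a date, a time, or an am/pm marker.
OutputList = filter(
    lambda ThisWord: not re.match(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$', ThisWord),
    SourceList)


# print() is a function in Python 3.
for ThisValue in OutputList:
    print(ThisValue)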

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture (2 times):
----------------------------------------------------------------------
      [0-9]{2}                 any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
      [:\/,]                   any character of: ':', '\/', ','
----------------------------------------------------------------------
    ){2}                     end of grouping
----------------------------------------------------------------------
    [0-9]{2,4}               any character of: '0' to '9' (between 2
                             and 4 times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    am                       'am'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    pm                       'pm'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
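
The breakdown above can also be kept next to the code by writing the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments inside it. A sketch, where the name DATE_TIME_OR_MERIDIEM is a hypothetical choice and the pattern is meant to be equivalent to the one in this answer:

```python
import re

DATE_TIME_OR_MERIDIEM = re.compile(r'''
    ^                   # beginning of the string
    (?:
        (?:
            [0-9]{2}    # two digits
            [:/,]       # followed by ':', '/' or ','
        ){2}            # ... twice
        [0-9]{2,4}      # then two to four digits
      | am              # OR the literal 'am'
      | pm              # OR the literal 'pm'
    )
    $                   # end of the string (or just before a final newline)
''', re.VERBOSE)

source = ['08/20/2014', '10:04:27', 'pm', 'complete', 'vendor', 'per',
          'mfg/recommend', '08/20/2014', '10:04:27', 'pm', 'complete']

kept = [word for word in source if not DATE_TIME_OR_MERIDIEM.match(word)]
print(kept)  # ['complete', 'vendor', 'per', 'mfg/recommend', 'complete']
```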