How to check whether a specific string exists in all folders and subfolders?

Question

5

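I have folders and files.
I also have subfolders and files.
I need to search each file for one specific string while excluding files that contain another string.
All the files are .txt files.
I need to find which files contain the string 20210624 but do not contain the string 20210625.
My output should be the file names.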

import os
match_str = ['20210624']
not_match_str =  ['20210625']
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".txt"):
            # search files with match_str `20210624` and not_match_str `20210625`

Can I use import walk?

- sim

You have to read each file after you get the list of files. Then you can check whether the number is present in the file. - PCM

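@PCM Please don't make pointless edits like this one. Adding generic tags such as [tag:operating-system], [tag:algorithm], or [tag:list] doesn't help anyone understand these questions better; most of your suggested edits are on poor questions that could perhaps be improved with a few changes, yet you chose to suggest adding a tag that adds no real value. - tripleee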

Added to my answer. - Ann Zen

4 Answers

1

Continue from here -

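if name.endswith(".txt"):
    # name is only the file name, so join it with root to get the full path
    with open(os.path.join(root, name), mode='r') as f:
        text = f.read()  # read once; a second f.read() would return ''
    if match_str[0] in text:
        print(name)  # the match string is present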

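If there are several match strings, you can loop over them with a for loop. Likewise, you can use the "not in" operator to check for the strings that must not match.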
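
A minimal sketch of that idea, for illustration only. The helper name file_matches is made up here, and using all()/any() instead of an explicit for loop is an assumption about style, not something from the original answer:

```python
def file_matches(text, match_strs, not_match_strs):
    """Return True if text contains every string in match_strs
    and none of the strings in not_match_strs (hypothetical helper)."""
    has_all_required = all(s in text for s in match_strs)
    has_any_excluded = any(s in text for s in not_match_strs)
    return has_all_required and not has_any_excluded


match_str = ['20210624']
not_match_str = ['20210625']

print(file_matches('report 20210624', match_str, not_match_str))        # True
print(file_matches('20210624 and 20210625', match_str, not_match_str))  # False
```

Inside the os.walk() loop you would call it with the text read from each file and print name when it returns True.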

- PCM

1

You can get the file names with a few simple shell commands:

find . -name "*.txt" | xargs grep -l "20210624" | xargs grep -L "20210625"

- shdxiang

1

You can do this with pathlib and glob.

import pathlib
path = pathlib.Path(path)
# rglob() also descends into subfolders; the pattern is matched against file names, not contents
maybe_valids = list(path.rglob("*20210624*.txt"))
valids = [elem for elem in maybe_valids if "20210625" not in elem.name]
print(valids)

The maybe_valids list is built from every entry whose name contains "20210624" and ends with .txt, while valids keeps only those whose names do not contain "20210625".
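
If the strings are meant to be found inside the files rather than in their names, a pathlib variant could look roughly like the sketch below. The function name find_by_content and the errors="ignore" setting are assumptions made for illustration, not part of the original answer:

```python
import pathlib


def find_by_content(folder, required="20210624", excluded="20210625"):
    """Return .txt files under folder (recursively) whose text contains
    `required` but not `excluded`. Hypothetical helper for illustration."""
    valids = []
    for elem in pathlib.Path(folder).rglob("*.txt"):
        if not elem.is_file():
            continue  # skip directories that happen to end in .txt
        # errors="ignore" is an assumption so undecodable bytes do not abort the scan
        text = elem.read_text(errors="ignore")
        if required in text and excluded not in text:
            valids.append(elem)
    return valids
```

Calling find_by_content(path) returns Path objects; use elem.name on each one if only the file name is wanted.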

- crissal

2

I think the OP wants to find the string inside the files, not in the file names. - Maaz

Quoting the OP, "my output returns the file names", so I think this is the expected behavior. - crissal

However, the OP's earlier statement, "I need to check ... whether ... exists in the files", makes the question rather ambiguous. - Maaz

- Ann Zen · Accepted Answer

You can pass recursive=True to the glob.glob() function and use the ** pattern so the program searches folders, subfolders, and so on recursively.

from glob import glob

path = 'C:\\Users\\User\\Desktop'
for file in glob(path + '\\**\\*.txt', recursive=True):
    with open(file) as f:
        text = f.read()
        if '20210624' in text and '20210625' not in text:
            print(file)

If you don't want to print the full path of each file and only need the file name, you can do the following:

from glob import glob

path = 'C:\\Users\\User\\Desktop'
for file in glob(path + '\\**\\*.txt', recursive=True):
    with open(file) as f:
        text = f.read()
        if '20210624' in text and '20210625' not in text:
            print(file.split('\\')[-1])
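
Splitting on '\\' relies on Windows-style separators. As a small aside, os.path.basename() does the same job using the separator of whichever system the code runs on. The helper name name_only below is hypothetical:

```python
import os


def name_only(file_path):
    """Return just the file name part of a path (hypothetical helper)."""
    return os.path.basename(file_path)


# os.path.join builds the path with the current system's separator
print(name_only(os.path.join('Desktop', 'notes', 'log.txt')))  # log.txt
```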

To use os.walk() instead, you can filter the files with the str.endswith() method (just as you did in your post):

import os

for path, _, files in os.walk('C:\\Users\\User\\Desktop'):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(path, file)) as f:
                text = f.read()
                if '20210624' in text and '20210625' not in text:
                    print(file)

And to search only down to a maximum subdirectory level:

import os

levels = 2
root = 'C:\\Users\\User\\Desktop'
total = root.count('\\') + levels

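for path, dirs, files in os.walk(root):
    if path.count('\\') >= total:
        # Deepest allowed level reached: prune in place so os.walk() does not
        # descend further. A break here would also skip sibling folders.
        dirs[:] = []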
    for file in files:
        if file.endswith('.txt'):
            print(os.path.join(path, file))
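
The snippet above only lists the .txt files within the depth limit. A rough sketch that combines the limit with the string check might look like this. The function name search_limited, the errors='ignore' setting, and measuring depth with os.path.relpath() (so it does not depend on '\\') are all assumptions for illustration. It limits depth by clearing dirs in place, which is how os.walk() lets a top-down walk skip subfolders:

```python
import os


def search_limited(root, levels, required='20210624', excluded='20210625'):
    """Return names of .txt files at most `levels` folders below root whose
    text contains `required` but not `excluded` (hypothetical helper)."""
    matches = []
    for path, dirs, files in os.walk(root):
        rel = os.path.relpath(path, root)
        # depth 0 is root itself, 1 is a direct subfolder, and so on
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= levels:
            dirs[:] = []  # prune in place: do not descend below the limit
        for file in files:
            if file.endswith('.txt'):
                # errors='ignore' is an assumption to survive odd encodings
                with open(os.path.join(path, file), errors='ignore') as f:
                    text = f.read()
                if required in text and excluded not in text:
                    matches.append(file)
    return matches
```

For example, search_limited('C:\\Users\\User\\Desktop', 2) would check the Desktop folder and up to two levels of subfolders beneath it.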