Python glob, but against a list of strings rather than the filesystem

80
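I want to be able to match a glob-format pattern against a list of strings, rather than against actual files in the filesystem. Is there any way to do this, or to convert a glob pattern into a regular expression?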

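I don't know whether I'm doing something wrong, but I think the author wanted a solution that matches any string, not just filenames, and the solutions here can't even extract a simple string such as max_volume from [Parsed_volumedetect_0 @ 0x7fbf12004080] max_volume: -9.3 dB. I'm trying to extract {max,mean}_volume from ffmpeg output. - vault
12 Answers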

52

The glob module uses the fnmatch module for individual path elements.

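This means the path is split into the directory name and the filename, and if the directory name contains metacharacters (any of the characters [, * or ?) it is expanded recursively.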

If you have a list of strings that are simple filenames, the fnmatch.filter() function is all you need:

import fnmatch

matching = fnmatch.filter(filenames, pattern)
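As a quick illustration (the file names below are made up for the example), fnmatch.filter() keeps the strings that match the pattern and preserves their order:

```python
import fnmatch

# Made-up sample data: plain file names, no directory parts.
filenames = ['report.txt', 'notes.md', 'summary.txt', 'image.png']

matching = fnmatch.filter(filenames, '*.txt')
print(matching)  # ['report.txt', 'summary.txt']
```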

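If they contain full paths, however, you need to do more work, because the generated regular expression does not take path segments into account (wildcards do not exclude the separators, nor are they adjusted for cross-platform path matching).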
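To make the separator problem concrete, here is a minimal sketch. The helper match_segments is hypothetical (it is not part of fnmatch), and it assumes / as the only separator. It compares the path and the pattern one segment at a time, which is roughly what glob does on a real filesystem:

```python
import fnmatch

paths = ['/foo/bar', '/foo/bar/baz']

# `*` behaves like the regex `.*`, so it happily crosses the `/` separator.
crossing = fnmatch.filter(paths, '/foo/*')
print(crossing)  # ['/foo/bar', '/foo/bar/baz']


def match_segments(path, pattern, sep='/'):
    """Hypothetical helper: compare path and pattern segment by segment."""
    path_parts = path.split(sep)
    pattern_parts = pattern.split(sep)
    # Both must have the same number of segments, and each pair must match.
    return len(path_parts) == len(pattern_parts) and all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pattern_parts)
    )


per_segment = [p for p in paths if match_segments(p, '/foo/*')]
print(per_segment)  # ['/foo/bar']
```

This simple version calls fnmatch once per segment for every path; it does not share work between paths with a common prefix.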

You can build a simple trie from the paths, then match your pattern against it:

import fnmatch
import glob
import os.path
from itertools import product


# Cross-Python dictionary views on the keys 
if hasattr(dict, 'viewkeys'):
    # Python 2
    def _viewkeys(d):
        return d.viewkeys()
else:
    # Python 3
    def _viewkeys(d):
        return d.keys()


def _in_trie(trie, path):
    """Determine if path is completely in trie"""
    current = trie
    for elem in path:
        try:
            current = current[elem]
        except KeyError:
            return False
    return None in current


def find_matching_paths(paths, pattern):
    """Produce a list of paths that match the pattern.

    * paths is a list of strings representing filesystem paths
    * pattern is a glob pattern as supported by the fnmatch module

    """
    if os.altsep:  # normalise
        pattern = pattern.replace(os.altsep, os.sep)
    pattern = pattern.split(os.sep)

    # build a trie out of path elements; efficiently search on prefixes
    path_trie = {}
    for path in paths:
        if os.altsep:  # normalise
            path = path.replace(os.altsep, os.sep)
        _, path = os.path.splitdrive(path)
        elems = path.split(os.sep)
        current = path_trie
        for elem in elems:
            current = current.setdefault(elem, {})
        current.setdefault(None, None)  # sentinel

    matching = []

    current_level = [path_trie]
    for subpattern in pattern:
        if not glob.has_magic(subpattern):
            # plain element, element must be in the trie or there are
            # 0 matches
            if not any(subpattern in d for d in current_level):
                return []
            matching.append([subpattern])
            current_level = [d[subpattern] for d in current_level if subpattern in d]
        else:
            # match all next levels in the trie that match the pattern
            matched_names = fnmatch.filter({k for d in current_level for k in d}, subpattern)
            if not matched_names:
                # nothing found
                return []
            matching.append(matched_names)
            current_level = [d[n] for d in current_level for n in _viewkeys(d) & set(matched_names)]

    return [os.sep.join(p) for p in product(*matching)
            if _in_trie(path_trie, p)]

This mouthful can quickly find matches using globs anywhere along the path:

>>> paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/foo/bar/*')
['/foo/bar/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/*/bar/b*')
['/foo/bar/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/*/[be]*/b*')
['/foo/bar/baz', '/foo/bar/bar', '/spam/eggs/baz']
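For readers unfamiliar with the data structure, the sketch below shows only the trie-building step in isolation. The function name build_trie and the fixed / separator are assumptions made for this illustration; the answer's code does the same thing inline using os.sep:

```python
def build_trie(paths, sep='/'):
    """Nest one dict per path element; a None key marks a complete path."""
    trie = {}
    for path in paths:
        node = trie
        for elem in path.split(sep):
            node = node.setdefault(elem, {})
        node[None] = None  # sentinel: a path ends here
    return trie


trie = build_trie(['/foo/bar', '/foo/baz'])
print(trie)
# {'': {'foo': {'bar': {None: None}, 'baz': {None: None}}}}
```

The leading empty-string key comes from the empty element before the first /. Shared prefixes such as /foo are stored once, so each pattern segment is tested against a given level's keys only once.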

35

On Python 3.4+ you can just use PurePath.match.

pathlib.PurePath(path_string).match(pattern)

On Python 3.3 or earlier (including 2.x), get pathlib from PyPI.

Note that to get platform-independent results (whether you need them depends on why you are running this), you should explicitly use PurePosixPath or PureWindowsPath.
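A small sketch of why the explicit flavour matters, using made-up paths: the Windows flavour matches case-insensitively and treats \ as a separator, while the POSIX flavour does neither.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Case sensitivity differs between the two flavours.
posix_case = PurePosixPath('docs/readme.txt').match('*.TXT')
windows_case = PureWindowsPath('docs/readme.txt').match('*.TXT')
print(posix_case, windows_case)  # False True

# So does the meaning of a backslash in the input string.
posix_sep = PurePosixPath(r'docs\readme.txt').match('docs/*.txt')
windows_sep = PureWindowsPath(r'docs\readme.txt').match('docs/*.txt')
print(posix_sep, windows_sep)  # False True
```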


2
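A nice thing about this approach is that it doesn't require the glob syntax (**/*) when you don't need it, for example if you just want to find a path based on its filename. - Esteban
This only works on a single string. While useful, it doesn't completely answer the OP's question: "glob-format to a list of strings". - NumesSanguis
Found a way to extend this answer with a list comprehension; see my answer. - NumesSanguis
@Esteban For some use cases that is also a weakness. If you explicitly want to find files with *.py or a.py, it returns recursive results. For example, pathlib.PurePath("a/b/c/abc.py").match("*.py") returns true, whereas I'd argue it should only return true for **/*.py. However, Mathew's solution below solves this. - Jeppe
Why doesn't the following match? pathlib.PurePath("virt/kvm/devices/vm.rst").match("virt/*") # False, but with fnmatch it works... fnmatch.filter(["virt/kvm/devices/vm.rst"], "virt/*") # ['virt/kvm/devices/vm.rst'] - schirrmacher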
2
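@schirrmacher pathlib.PurePath.match does not match across path separators, and it always matches from the end. Python's master branch supports the ** wildcard, which would work here, but I don't think it has been released yet. - Veedrac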

31

Good artists copy; great artists steal.

I stole ;)

fnmatch.translate translates the globs ? and * to the regexes . and .* respectively. I tweaked it not to.

import re

def glob2re(pat):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            #res = res + '.*'
            res = res + '[^/]*'
        elif c == '?':
            #res = res + '.'
            res = res + '[^/]'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\','\\\\')
                i = j+1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    # Inline flags must lead the pattern; Python 3.11+ rejects them elsewhere.
    return '(?ms)' + res + r'\Z'

This one works à la fnmatch.filter; both re.match and re.search can be used.
def glob_filter(names,pat):
    return (name for name in names if re.match(glob2re(pat),name))

The glob patterns and strings found on this page pass the test.

pat_dict = {
            'a/b/*/f.txt': ['a/b/c/f.txt', 'a/b/q/f.txt', 'a/b/c/d/f.txt','a/b/c/d/e/f.txt'],
            '/foo/bar/*': ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar'],
            '/*/bar/b*': ['/foo/bar/baz', '/foo/bar/bar'],
            '/*/[be]*/b*': ['/foo/bar/baz', '/foo/bar/bar'],
            '/foo*/bar': ['/foolicious/spamfantastic/bar', '/foolicious/bar']

        }
for pat in pat_dict:
    print('pattern :\t{}\nstrings :\t{}'.format(pat,pat_dict[pat]))
    print('matched :\t{}\n'.format(list(glob_filter(pat_dict[pat],pat))))

2
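Sweet! Yes, translating the pattern into one that ignores path separators is a great idea. Note that it doesn't handle os.sep or os.altsep, but that should be easy enough to adjust for. - Martijn Pieters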
2
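I usually just normalize paths to use forward slashes before doing any processing. - Jason S
This solution incorrectly allows [^abc] to match directory separators such as /. See my solution, which includes an example of fixing this and also allows the ** wildcard. - Mathew Wicks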

7

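While fnmatch.fnmatch can be used directly to check whether a pattern matches a filename, you can also use the fnmatch.translate method to generate a regex from a given fnmatch pattern: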

>>> import fnmatch
>>> fnmatch.translate('*.txt')
'.*\\.txt\\Z(?ms)'

From the documentation:

fnmatch.translate(pattern)

Return the shell-style pattern converted to a regular expression.
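To apply this to a list of strings, the translated pattern can be compiled once and reused. The sketch below uses made-up names. The exact regex text that fnmatch.translate returns varies between Python versions (newer releases wrap the pattern as (?s:...)\Z rather than appending the flags), so it is best treated as opaque:

```python
import fnmatch
import re

# Made-up sample data.
names = ['a.txt', 'dir/b.txt', 'c.py']

# Compile once, reuse for every string in the list.
regex = re.compile(fnmatch.translate('*.txt'))
matching = [name for name in names if regex.match(name)]
print(matching)  # ['a.txt', 'dir/b.txt']
```

Note that dir/b.txt matches as well: the translated * is not aware of path separators.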


3

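My solution is similar to Nizam's, with a few changes:

  1. Adds support for the ** wildcard
  2. Prevents patterns like [^abc] from matching /
  3. Updated to use fnmatch.translate() from Python 3.8.13 as a base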
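The second point is easy to reproduce with plain regexes. In this small sketch (the path is made up), a negated character class accepts / unless the slash is added to the excluded set:

```python
import re

path = '/path/to'

# Negation alone: '/' is "not a, b or c", so it is accepted.
plain = re.fullmatch(r'/path[^abc]to', path)
# With '/' added to the excluded set, the class can no longer span a boundary.
guarded = re.fullmatch(r'/path[^/abc]to', path)

print(bool(plain), bool(guarded))  # True False
```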

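WARNING:

There are some slight differences from glob.glob() that this solution suffers from (along with most of the other solutions). Feel free to suggest changes in the comments if you know how to fix them:

  1. * and ? should not match file names starting with .
  2. ** should also match 0 folders when used like /**/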
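Both differences can be illustrated in isolation. The following is only a sketch under stated assumptions, not a fix for the function in this answer: the regex is written by hand for one specific pattern, and drop_hidden is a hypothetical post-filter that assumes the pattern never names hidden files explicitly.

```python
import re

# Hand-written regex for the glob "/a/b/c/**/test.py" in which the "**/"
# part is optional, so it can also stand for zero folders.
zero_or_more = re.compile(r'(?s:/a/b/c/(?:.*/)?test\.py)\Z')

candidates = ['/a/b/c/test.py', '/a/b/c/x/y/test.py', '/a/b/c/xtest.py']
matched = [p for p in candidates if zero_or_more.match(p)]
print(matched)  # ['/a/b/c/test.py', '/a/b/c/x/y/test.py']


def drop_hidden(paths, sep='/'):
    """Hypothetical post-filter: discard paths with a dot-prefixed component."""
    return [p for p in paths
            if not any(part.startswith('.') for part in p.split(sep))]


visible = drop_hidden(['/a/b.txt', '/a/.hidden.txt', '/.git/config'])
print(visible)  # ['/a/b.txt']
```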

Code:

import re

def glob_to_re(pat: str) -> str:
    """Translate a shell PATTERN to a regular expression.

    Derived from `fnmatch.translate()` of Python version 3.8.13
    SOURCE: https://github.com/python/cpython/blob/v3.8.13/Lib/fnmatch.py#L74-L128
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            # -------- CHANGE START --------
            # prevent '*' matching directory boundaries, but allow '**' to match them
            j = i
            if j < n and pat[j] == '*':
                res = res + '.*'
                i = j+1
            else:
                res = res + '[^/]*'
            # -------- CHANGE END ----------
        elif c == '?':
            # -------- CHANGE START --------
            # prevent '?' matching directory boundaries
            res = res + '[^/]'
            # -------- CHANGE END ----------
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j]
                if '--' not in stuff:
                    stuff = stuff.replace('\\', r'\\')
                else:
                    chunks = []
                    k = i+2 if pat[i] == '!' else i+1
                    while True:
                        k = pat.find('-', k, j)
                        if k < 0:
                            break
                        chunks.append(pat[i:k])
                        i = k+1
                        k = k+3
                    chunks.append(pat[i:j])
                    # Escape backslashes and hyphens for set difference (--).
                    # Hyphens that create ranges shouldn't be escaped.
                    stuff = '-'.join(s.replace('\\', r'\\').replace('-', r'\-')
                                     for s in chunks)
                # Escape set operations (&&, ~~ and ||).
                stuff = re.sub(r'([&~|])', r'\\\1', stuff)
                i = j+1
                if stuff[0] == '!':
                    # -------- CHANGE START --------
                    # ensure sequence negations don't match directory boundaries
                    stuff = '^/' + stuff[1:]
                    # -------- CHANGE END ----------
                elif stuff[0] in ('^', '['):
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return r'(?s:%s)\Z' % res

Test cases:

Here are some test cases comparing the built-in fnmatch.translate() with the glob_to_re() above:

import fnmatch

test_cases = [
    # path, pattern, old_should_match, new_should_match
    ("/path/to/foo", "*", True, False),
    ("/path/to/foo", "**", True, True),
    ("/path/to/foo", "/path/*", True, False),
    ("/path/to/foo", "/path/**", True, True),
    ("/path/to/foo", "/path/to/*", True, True),
    ("/path/to", "/path?to", True, False),
    ("/path/to", "/path[!abc]to", True, False),
]

for path, pattern, old_should_match, new_should_match in test_cases:

    old_re = re.compile(fnmatch.translate(pattern))
    old_match = bool(old_re.match(path))
    if old_match is not old_should_match:
        raise AssertionError(
            f"regex from `fnmatch.translate()` should match path "
            f"'{path}' when given pattern: {pattern}"
        )

    new_re = re.compile(glob_to_re(pattern))
    new_match = bool(new_re.match(path))
    if new_match is not new_should_match:
        raise AssertionError(
            f"regex from `glob_to_re()` should match path "
            f"'{path}' when given pattern: {pattern}"
        )

Example:

Here is an example that uses glob_to_re() with a list of strings.

glob_pattern = "/path/to/*.txt"
glob_re = re.compile(glob_to_re(glob_pattern))

input_paths = [
    "/path/to/file_1.txt",
    "/path/to/file_2.txt",
    "/path/to/folder/file_3.txt",
    "/path/to/folder/file_4.txt",
]

filtered_paths = [path for path in input_paths if glob_re.match(path)]
# filtered_paths = ["/path/to/file_1.txt", "/path/to/file_2.txt"]

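Great work! I guess this is what you mean in the warning: the pattern "/a/b/c/**/test.py" does not match the input "/a/b/c/test.py". I'm still testing, but changing res = res + '.*' to res = res + '.*/?' and the line below it to i = j + 2 seems to fix it. Not sure how robust it is, but it works for my purposes. :) - Jeppe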

2
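I can't say how efficient it is, but it is much less verbose and much more complete than the other solutions, and possibly more secure and reliable.
Supported syntax:
  • * -- matches zero or more characters.
  • ** (actually either **/ or /**) -- matches zero or more subdirectories.
  • ? -- matches one character.
  • [] -- matches one character within the brackets.
  • [!] -- matches one character not within the brackets.
  • Because \ is used for escaping, only / can be used as a path separator.
Order of operations:
  1. Escape special RE characters in the glob.
  2. Generate the RE used to tokenize the escaped glob.
  3. Replace the escaped glob tokens with their equivalent RE.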
import re
from sys import hexversion, implementation
# Support for insertion-preserving/ordered dicts became language feature in Python 3.7, but works in CPython since 3.6.
if hexversion >= 0x03070000 or (implementation.name == 'cpython' and hexversion >= 0x03060000):
    ordered_dict = dict
else:
    from collections import OrderedDict as ordered_dict

escaped_glob_tokens_to_re = ordered_dict((
    # Order of ``**/`` and ``/**`` in RE tokenization pattern doesn't matter because ``**/`` will be caught first no matter what, making ``/**`` the only option later on.
    # W/o leading or trailing ``/`` two consecutive asterisks will be treated as literals.
    (r'/\*\*', '(?:/.+?)*'), # Edge-case #1. Catches recursive globs in the middle of path. Requires edge case #2 handled after this case.
    (r'\*\*/', '(?:^.+?/)*'), # Edge-case #2. Catches recursive globs at the start of path. Requires edge case #1 handled before this case. ``^`` is used to ensure proper location for ``**/``.
    (r'\*', '[^/]*'), # ``[^/]*`` is used to ensure that ``*`` won't match subdirs, as with naive ``.*?`` solution.
    (r'\?', '.'),
    (r'\[\*\]', r'\*'), # Escaped special glob character.
    (r'\[\?\]', r'\?'), # Escaped special glob character.
    (r'\[!', '[^'), # Requires ordered dict, so that ``\[!`` preceded ``\[`` in RE pattern. Needed mostly to differentiate between ``!`` used within character class ``[]`` and outside of it, to avoid faulty conversion.
    (r'\[', '['),
    (r'\]', ']'),
))

escaped_glob_replacement = re.compile('(%s)' % '|'.join(escaped_glob_tokens_to_re).replace('\\', '\\\\\\'))

def glob_to_re(pattern):
    return escaped_glob_replacement.sub(lambda match: escaped_glob_tokens_to_re[match.group(0)], re.escape(pattern))

if __name__ == '__main__':
    validity_paths_globs = (
        (True, 'foo.py', 'foo.py'),
        (True, 'foo.py', 'fo[o].py'),
        (True, 'fob.py', 'fo[!o].py'),
        (True, '*foo.py', '[*]foo.py'),
        (True, 'foo.py', '**/foo.py'),
        (True, 'baz/duck/bar/bam/quack/foo.py', '**/bar/**/foo.py'),
        (True, 'bar/foo.py', '**/foo.py'),
        (True, 'bar/baz/foo.py', 'bar/**'),
        (False, 'bar/baz/foo.py', 'bar/*'),
        (False, 'bar/baz/foo.py', 'bar**/foo.py'),
        (True, 'bar/baz/foo.py', 'bar/**/foo.py'),
        (True, 'bar/baz/wut/foo.py', 'bar/**/foo.py'),
    )
    results = []
    for seg in validity_paths_globs:
        valid, path, glob_pat = seg
        print('valid:', valid)
        print('path:', path)
        print('glob pattern:', glob_pat)
        re_pat = glob_to_re(glob_pat)
        print('RE pattern:', re_pat)
        match = re.fullmatch(re_pat, path)
        print('match:', match)
        result = bool(match) == valid
        results.append(result)
        print('result was expected:', result)
        print('-'*79)
    print('all results were expected:', all(results))
    print('='*79)
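The three steps listed under "Order of operations" can be seen more easily in a reduced form. This sketch is not the answer's code: the name tiny_glob_to_re and the two-entry token table are assumptions chosen to keep the example short, and it handles only * and ?.

```python
import re

# Hypothetical, reduced token table: escaped glob token -> regex fragment.
TOKENS = {r'\*': '[^/]*', r'\?': '.'}


def tiny_glob_to_re(pattern):
    # Step 1: escape everything, so glob wildcards become `\*` and `\?`.
    escaped = re.escape(pattern)
    # Step 2: build a regex that finds those escaped tokens.
    tokenizer = re.compile('|'.join(map(re.escape, TOKENS)))
    # Step 3: swap each escaped token for its regex equivalent.
    return tokenizer.sub(lambda m: TOKENS[m.group(0)], escaped)


regex = tiny_glob_to_re('fo?/*.py')
print(bool(re.fullmatch(regex, 'foo/bar.py')))      # True
print(bool(re.fullmatch(regex, 'foo/sub/bar.py')))  # False
```

Escaping first means that every character other than the known tokens is already safe to embed in a regex, which is why the replacement step can stay so small.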

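Thank you very much for this solution; it is the first one I found that works as expected. Considering the complexity, it is fairly readable. I think it would be nicer to extract the actual comparison into a separate function def glob(path, pattern). - undefined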

1

This is an extension of @Veedrac's PurePath.match answer that can be applied to a list of strings:

# Python 3.4+
from pathlib import Path

path_list = ["foo/bar.txt", "spam/bar.txt", "foo/eggs.txt"]
# convert string to pathlib.PosixPath / .WindowsPath, then apply PurePath.match to list
print([p for p in path_list if Path(p).match("ba*")])  # "*ba*" also works
# output: ['foo/bar.txt', 'spam/bar.txt']

print([p for p in path_list if Path(p).match("*o/ba*")])
# output: ['foo/bar.txt']

It is preferable to use pathlib.Path() over pathlib.PurePath(), because then you don't have to worry about the underlying filesystem.


1

Never mind, I found it. I want the fnmatch module.


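Oh wait, fnmatch doesn't handle segmentation of path names... sigh. - Jason S
Can you give some examples of cases that fnmatch cannot handle? - Bhargav Rao
The glob.glob() function applies the pattern to each path element separately. - Martijn Pieters
Oops, Martijn, you have answered along the same lines! I didn't see your answer! Dropping my draft right away :) - Bhargav Rao
@BhargavRao: the recursive glob calls have your filesystem available, so they can call os.listdir() as needed while traversing the glob. When you only have a list, you don't have that facility. My answer uses a trie structure to replicate the behaviour, and avoids recursive calls by expanding the multiple matches only at the end. I imagine that working out how to do that part is why their case can't be handled with fnmatch() alone. :-) - Martijn Pieters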

1
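Here is a glob that can deal with escaped punctuation. It does not stop on path separators. I'm posting it here because it matches the title of the question.
To use it on a list:
import re

rex = glob_to_re(glob_pattern)
rex = r'(?s:%s)\Z' % rex # Can match newline; match whole string.
rex = re.compile(rex)
matches = [name for name in names if rex.match(name)]

Here is the code: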

import re as _re

class GlobSyntaxError(SyntaxError):
    pass

def glob_to_re(pattern):
    r"""
    Given pattern, a unicode string, return the equivalent regular expression.
    Any special character * ? [ ! - ] \ can be escaped by preceding it with 
    backslash ('\') in the pattern.  Forward-slashes ('/') and escaped 
    backslashes ('\\') are treated as ordinary characters, not boundaries.

    Here is the language glob_to_re understands.
    Earlier alternatives within rules have precedence.  
        pattern = item*
        item    = '*'  |  '?'  |  '[!' set ']'  |  '[' set ']'  |  literal
        set     = element element*
        element = literal '-' literal  |  literal
        literal = '\' char  |  char other than \  [  ] and sometimes -
    glob_to_re does not understand "{a,b...}".
    """
    # (Note: the docstring above is r""" ... """ to preserve backslashes.)
    def expect_char(i, context):
        if i >= len(pattern):
            s = "Unfinished %s: %r, position %d." % (context, pattern, i)
            raise GlobSyntaxError(s)
    
    def literal_to_re(i, context="pattern", bad="[]"):
        if pattern[i] == '\\':
            i += 1
            expect_char(i, "backslashed literal")
        else:
            if pattern[i] in bad:
                s = "Unexpected %r in %s: %r, position %d." \
                    % (pattern[i], context, pattern, i)
                raise GlobSyntaxError(s)
        return _re.escape(pattern[i]), i + 1

    def set_to_re(i):
        assert pattern[i] == '['
        set_re = "["
        i += 1
        try:
            if pattern[i] == '!':
                set_re += '^'
                i += 1
            while True:
                lit_re, i = literal_to_re(i, "character set", bad="[-]")
                set_re += lit_re
                if pattern[i] == '-':
                    set_re += '-'
                    i += 1
                    expect_char(i, "character set range")
                    lit_re, i = literal_to_re(i, "character set range", bad="[-]")
                    set_re += lit_re
                if pattern[i] == ']':
                    return set_re + ']', i + 1
                
        except IndexError:
            expect_char(i, "character set")  # Trigger "unfinished" error.

    i = 0
    re_pat = ""
    while i < len(pattern):
        if pattern[i] == '*':
            re_pat += ".*"
            i += 1
        elif pattern[i] == '?':
            re_pat += "."
            i += 1
        elif pattern[i] == '[':
            set_re, i = set_to_re(i)
            re_pat += set_re
        else:
            lit_re, i = literal_to_re(i)
            re_pat += lit_re
    return re_pat

0

I wanted to add support for recursive glob patterns, such as things/**/*.py, and to have relative path matching, so that example*.py doesn't match folder/example_stuff.py.

Here is my approach:


from fnmatch import translate as fnmatch_translate
from os import path
import re

def recursive_glob_filter(files, glob):
    # Convert to regex and add start of line match
    pattern_re = '^' + fnmatch_translate(glob)

    # fnmatch does not escape path separators so escape them
    if path.sep in pattern_re and not r'\{}'.format(path.sep) in pattern_re:
        pattern_re = pattern_re.replace('/', r'\/')

    # Replace `*` with one that ignores path separators
    sep_respecting_wildcard = '[^\{}]*'.format(path.sep)
    pattern_re = pattern_re.replace('.*', sep_respecting_wildcard)

    # And now for `**` we have `[^\/]*[^\/]*`, so replace that with `.*`
    # to match all patterns in-between
    pattern_re = pattern_re.replace(2 * sep_respecting_wildcard, '.*')
    compiled_re = re.compile(pattern_re)
    return filter(compiled_re.search, files)
