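I want to match a glob-format pattern against a list of strings, rather than against actual files in the filesystem. Is there a way to do this, or to convert a glob pattern into a regular expression?

The glob module uses the fnmatch module for individual path elements.

This means the path is split into the directory name and the filename, and if the directory name contains metacharacters (any of the characters [, * or ?), these are expanded recursively.

If you have a list of strings that are simple filenames, then the fnmatch.filter() function is all you need: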
import fnmatch
matching = fnmatch.filter(filenames, pattern)
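As a minimal sketch of the call above, assuming a hypothetical list of bare filenames (no directory components), the following shows fnmatch.filter() next to fnmatch.fnmatchcase(), which tests a single name and is case-sensitive on every platform:

```python
import fnmatch

# Hypothetical sample data: bare filenames without directory components.
filenames = ["report.txt", "notes.md", "data_01.csv", "data_02.csv", "README"]

# fnmatch.filter() returns the names that match the pattern, in input order.
csv_files = fnmatch.filter(filenames, "data_??.csv")

# fnmatch.fnmatchcase() checks one name at a time; "[!d]*.*" means
# "does not start with 'd' and contains a dot".
non_data = [name for name in filenames if fnmatch.fnmatchcase(name, "[!d]*.*")]

print(csv_files)
print(non_data)
```

Note that fnmatch.filter() normalises case according to the operating system, so results for mixed-case names can differ between platforms.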
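However, if they contain full paths, you need to do more work, because the generated regular expression does not take path segments into account (wildcards do not exclude the separators, nor are they adjusted for cross-platform path matching).
You can construct a simple trie from the paths, then match your pattern against that: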
import fnmatch
import glob
import os.path
from itertools import product


# Cross-Python dictionary views on the keys
if hasattr(dict, 'viewkeys'):
    # Python 2
    def _viewkeys(d):
        return d.viewkeys()
else:
    # Python 3
    def _viewkeys(d):
        return d.keys()


def _in_trie(trie, path):
    """Determine if path is completely in trie"""
    current = trie
    for elem in path:
        try:
            current = current[elem]
        except KeyError:
            return False
    return None in current


def find_matching_paths(paths, pattern):
    """Produce a list of paths that match the pattern.

    * paths is a list of strings representing filesystem paths
    * pattern is a glob pattern as supported by the fnmatch module

    """
    if os.altsep:  # normalise
        pattern = pattern.replace(os.altsep, os.sep)
    pattern = pattern.split(os.sep)

    # build a trie out of path elements; efficiently search on prefixes
    path_trie = {}
    for path in paths:
        if os.altsep:  # normalise
            path = path.replace(os.altsep, os.sep)
        _, path = os.path.splitdrive(path)
        elems = path.split(os.sep)
        current = path_trie
        for elem in elems:
            current = current.setdefault(elem, {})
        current.setdefault(None, None)  # sentinel

    matching = []

    current_level = [path_trie]
    for subpattern in pattern:
        if not glob.has_magic(subpattern):
            # plain element, element must be in the trie or there are
            # 0 matches
            if not any(subpattern in d for d in current_level):
                return []
            matching.append([subpattern])
            current_level = [d[subpattern] for d in current_level if subpattern in d]
        else:
            # match all next levels in the trie that match the pattern;
            # skip the None sentinel, which fnmatch cannot handle
            names = {k for d in current_level for k in d if k is not None}
            matched_names = fnmatch.filter(names, subpattern)
            if not matched_names:
                # nothing found
                return []
            matching.append(matched_names)
            current_level = [d[n] for d in current_level for n in _viewkeys(d) & set(matched_names)]

    return [os.sep.join(p) for p in product(*matching)
            if _in_trie(path_trie, p)]
This mouthful can quickly find matches using globs anywhere along the path:
>>> paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/foo/bar/*')
['/foo/bar/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/*/bar/b*')
['/foo/bar/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/*/[be]*/b*')
['/foo/bar/baz', '/foo/bar/bar', '/spam/eggs/baz']
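On Python 3.4 and later you can just use PurePath.match:

pathlib.PurePath(path_string).match(pattern)

On Python 3.3 or earlier (including 2.x), get pathlib from PyPI.

Note that to get platform-independent results (which will depend on why you are running this), you need to explicitly specify PurePosixPath or PureWindowsPath.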
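A minimal sketch of applying this to a list of strings, assuming POSIX-style input; the helper name filter_paths and the sample paths are illustrative only. It also shows that a relative pattern is matched from the right-hand end of the path:

```python
from pathlib import PurePosixPath

# Hypothetical sample data using '/' as the separator.
paths = ["a/b/c/abc.py", "virt/kvm/devices/vm.rst", "setup.py"]


def filter_paths(paths, pattern):
    """Return the strings for which PurePosixPath.match(pattern) is true."""
    return [p for p in paths if PurePosixPath(p).match(pattern)]


# A relative pattern is compared against the trailing path components,
# so "*.py" also matches files in nested directories.
by_suffix = filter_paths(paths, "*.py")

# "virt/*" is compared with the last two components only, so it matches nothing here.
by_prefix = filter_paths(paths, "virt/*")

# "devices/*.rst" lines up with the last two components of the second path.
by_tail = filter_paths(paths, "devices/*.rst")

print(by_suffix, by_prefix, by_tail)
```

Because matching is anchored at the end of the path, this approach suits "find by filename" queries better than prefix queries.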
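[...] **/*), it doesn't need to. For example, if you just want to find paths by filename. - Esteban

[...] glob format to a list of strings". - NumesSanguis

[...] a.py for *.py, and then it returns results recursively. For example, this code returns true: pathlib.PurePath("a/b/c/abc.py").match("*.py"), whereas I think it should only return true for **/*.py. However, Mathew's solution below solves this. - Jeppe

[...] pathlib.PurePath("virt/kvm/devices/vm.rst").match("virt/*") # False, but it works with fnmatch... fnmatch.filter(["virt/kvm/devices/vm.rst"], "virt/*") # ['virt/kvm/devices/vm.rst'] - schirrmacher

[...] pathlib.PurePath.match does not match across path separators, and patterns are always matched from the end. Python master supports the ** wildcard, which would work here, but I don't think it has been released yet. - Veedrac

Good artists copy; great artists steal.
I stole ;)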
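fnmatch.translate translates the globs ? and * to the regular expressions . and .* respectively. I tweaked it not to, so that neither wildcard matches a path separator.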
import re

def glob2re(pat):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            #res = res + '.*'
            res = res + '[^/]*'
        elif c == '?':
            #res = res + '.'
            res = res + '[^/]'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\','\\\\')
                i = j+1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    # Flags go first: Python 3.11+ rejects global flags that are not at the start
    return '(?ms)' + res + r'\Z'
Like fnmatch.filter, this works with both re.match and re.search:

def glob_filter(names, pat):
    return (name for name in names if re.match(glob2re(pat), name))
The glob patterns and strings found on this page pass the test.
pat_dict = {
    'a/b/*/f.txt': ['a/b/c/f.txt', 'a/b/q/f.txt', 'a/b/c/d/f.txt', 'a/b/c/d/e/f.txt'],
    '/foo/bar/*': ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar'],
    '/*/bar/b*': ['/foo/bar/baz', '/foo/bar/bar'],
    '/*/[be]*/b*': ['/foo/bar/baz', '/foo/bar/bar'],
    '/foo*/bar': ['/foolicious/spamfantastic/bar', '/foolicious/bar']
}
for pat in pat_dict:
    print('pattern :\t{}\nstrings :\t{}'.format(pat, pat_dict[pat]))
    print('matched :\t{}\n'.format(list(glob_filter(pat_dict[pat], pat))))
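[...] os.sep or os.altsep, but that should be easy to adjust. - Martijn Pieters

While fnmatch.fnmatch can be used directly to check whether a pattern matches a filename, you can also use the fnmatch.translate method to generate a regular expression from a given fnmatch pattern: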
>>> import fnmatch
>>> fnmatch.translate('*.txt')
'.*\\.txt\\Z(?ms)'
From the documentation:
fnmatch.translate(pattern)
Return the shell-style pattern converted to a regular expression.
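A minimal sketch of using the translated pattern on a list of strings, with hypothetical sample paths. The exact text returned by fnmatch.translate() differs between Python versions, so the sketch compiles it rather than inspecting it:

```python
import fnmatch
import re

# Hypothetical sample data.
paths = ["/foo/bar/baz", "/foo/bar/baz/qux", "/spam/eggs/baz"]

# Compile once, then reuse the regex for every candidate string.
regex = re.compile(fnmatch.translate("/foo/*"))
matches = [p for p in paths if regex.match(p)]

# '*' becomes '.*', so it also matches across '/' separators.
print(matches)
```

This is why the plain translation is only a good fit when wildcards are allowed to span directory boundaries.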
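My solution is similar to Nizam's, but with a few changes:

- Support for ** wildcards
- Prevents patterns like [^abc] from matching /
- Uses fnmatch.translate() from Python 3.8.13 as a base

WARNING:
There are some slight differences from glob.glob() that affect this solution (along with most of the other solutions). If you know how to fix them, please suggest changes in the comments:

- * and ? should not match file names starting with .
- ** should also match 0 folders when used like /**/

Code: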
import re

def glob_to_re(pat: str) -> str:
    """Translate a shell PATTERN to a regular expression.

    Derived from `fnmatch.translate()` of Python version 3.8.13
    SOURCE: https://github.com/python/cpython/blob/v3.8.13/Lib/fnmatch.py#L74-L128
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            # -------- CHANGE START --------
            # prevent '*' matching directory boundaries, but allow '**' to match them
            j = i
            if j < n and pat[j] == '*':
                res = res + '.*'
                i = j+1
            else:
                res = res + '[^/]*'
            # -------- CHANGE END ----------
        elif c == '?':
            # -------- CHANGE START --------
            # prevent '?' matching directory boundaries
            res = res + '[^/]'
            # -------- CHANGE END ----------
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j]
                if '--' not in stuff:
                    stuff = stuff.replace('\\', r'\\')
                else:
                    chunks = []
                    k = i+2 if pat[i] == '!' else i+1
                    while True:
                        k = pat.find('-', k, j)
                        if k < 0:
                            break
                        chunks.append(pat[i:k])
                        i = k+1
                        k = k+3
                    chunks.append(pat[i:j])
                    # Escape backslashes and hyphens for set difference (--).
                    # Hyphens that create ranges shouldn't be escaped.
                    stuff = '-'.join(s.replace('\\', r'\\').replace('-', r'\-')
                                     for s in chunks)
                # Escape set operations (&&, ~~ and ||).
                stuff = re.sub(r'([&~|])', r'\\\1', stuff)
                i = j+1
                if stuff[0] == '!':
                    # -------- CHANGE START --------
                    # ensure sequence negations don't match directory boundaries
                    stuff = '^/' + stuff[1:]
                    # -------- CHANGE END ----------
                elif stuff[0] in ('^', '['):
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return r'(?s:%s)\Z' % res
Test cases:
Here are some test cases comparing the built-in fnmatch.translate() with the glob_to_re() above.
import fnmatch
import re

test_cases = [
    # path, pattern, old_should_match, new_should_match
    ("/path/to/foo", "*", True, False),
    ("/path/to/foo", "**", True, True),
    ("/path/to/foo", "/path/*", True, False),
    ("/path/to/foo", "/path/**", True, True),
    ("/path/to/foo", "/path/to/*", True, True),
    ("/path/to", "/path?to", True, False),
    ("/path/to", "/path[!abc]to", True, False),
]

for path, pattern, old_should_match, new_should_match in test_cases:
    old_re = re.compile(fnmatch.translate(pattern))
    old_match = bool(old_re.match(path))
    if old_match is not old_should_match:
        raise AssertionError(
            f"regex from `fnmatch.translate()` "
            f"{'should' if old_should_match else 'should not'} match path "
            f"'{path}' when given pattern: {pattern}"
        )

    new_re = re.compile(glob_to_re(pattern))
    new_match = bool(new_re.match(path))
    if new_match is not new_should_match:
        raise AssertionError(
            f"regex from `glob_to_re()` "
            f"{'should' if new_should_match else 'should not'} match path "
            f"'{path}' when given pattern: {pattern}"
        )
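One of the caveats listed for this answer is that ** should also match zero folders when used like /**/. The following is a minimal sketch of one possible way to get that behaviour, not the answer's own method: simple_glob_to_re is a hypothetical, deliberately reduced translator that handles only **, *, ? and literal characters (no character classes, no special handling of dot-files).

```python
import re


def simple_glob_to_re(pattern):
    """Hypothetical reduced translator: supports '**', '*', '?' and literals only."""
    parts = []
    i, n = 0, len(pattern)
    while i < n:
        # '**' is only treated as recursive at the start of a path component.
        at_component_start = i == 0 or pattern[i - 1] == "/"
        if at_component_start and pattern.startswith("**/", i):
            # Zero or more whole directory components, each ending in '/'.
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif at_component_start and pattern[i:] == "**":
            # Trailing '**' matches everything that remains.
            parts.append(r".*")
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return r"(?s:%s)\Z" % "".join(parts)


rx = re.compile(simple_glob_to_re("/a/b/c/**/test.py"))
zero_dirs = bool(rx.match("/a/b/c/test.py"))
two_dirs = bool(rx.match("/a/b/c/d/e/test.py"))
wrong_name = bool(rx.match("/a/b/c/d/mytest.py"))
```

Consuming the slash together with ** is what lets the group match zero components; the same idea could be ported into the fuller translator above.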
Example:
Here is an example that uses glob_to_re() to filter a list of strings.
glob_pattern = "/path/to/*.txt"
glob_re = re.compile(glob_to_re(glob_pattern))
input_paths = [
    "/path/to/file_1.txt",
    "/path/to/file_2.txt",
    "/path/to/folder/file_3.txt",
    "/path/to/folder/file_4.txt",
]
filtered_paths = [path for path in input_paths if glob_re.match(path)]
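# filtered_paths = ["/path/to/file_1.txt", "/path/to/file_2.txt"]

[...] "/a/b/c/**/test.py" fails to match this input: "/a/b/c/test.py". I am still testing, but changing res = res + '.*' to res = res + '.*/?' and the line below it to i = j + 2 seems to solve it. Not sure how robust that is, but it works for my purposes. :) - Jeppe

Supported syntax:

* -- matches zero or more characters.
** (actually, it is either **/ or /**) -- matches zero or more subdirectories.
? -- matches one character.
[] -- matches one character within the brackets.
[!] -- matches one character not within the brackets.

Because \ is used for escaping, only / can be used as a path separator.

import re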
from sys import hexversion, implementation

# Support for insertion-preserving/ordered dicts became language feature in Python 3.7, but works in CPython since 3.6.
if hexversion >= 0x03070000 or (implementation.name == 'cpython' and hexversion >= 0x03060000):
    ordered_dict = dict
else:
    from collections import OrderedDict as ordered_dict

escaped_glob_tokens_to_re = ordered_dict((
    # Order of ``**/`` and ``/**`` in RE tokenization pattern doesn't matter because ``**/`` will be caught first no matter what, making ``/**`` the only option later on.
    # W/o leading or trailing ``/`` two consecutive asterisks will be treated as literals.
    (r'/\*\*', '(?:/.+?)*'), # Edge-case #1. Catches recursive globs in the middle of path. Requires edge case #2 handled after this case.
    (r'\*\*/', '(?:^.+?/)*'), # Edge-case #2. Catches recursive globs at the start of path. Requires edge case #1 handled before this case. ``^`` is used to ensure proper location for ``**/``.
    (r'\*', '[^/]*'), # ``[^/]*`` is used to ensure that ``*`` won't match subdirs, as with naive ``.*?`` solution.
    (r'\?', '.'),
    (r'\[\*\]', r'\*'), # Escaped special glob character.
    (r'\[\?\]', r'\?'), # Escaped special glob character.
    (r'\[!', '[^'), # Requires ordered dict, so that ``\[!`` preceded ``\[`` in RE pattern. Needed mostly to differentiate between ``!`` used within character class ``[]`` and outside of it, to avoid faulty conversion.
    (r'\[', '['),
    (r'\]', ']'),
))

escaped_glob_replacement = re.compile('(%s)' % '|'.join(escaped_glob_tokens_to_re).replace('\\', '\\\\\\'))

def glob_to_re(pattern):
    return escaped_glob_replacement.sub(lambda match: escaped_glob_tokens_to_re[match.group(0)], re.escape(pattern))

if __name__ == '__main__':
    validity_paths_globs = (
        (True, 'foo.py', 'foo.py'),
        (True, 'foo.py', 'fo[o].py'),
        (True, 'fob.py', 'fo[!o].py'),
        (True, '*foo.py', '[*]foo.py'),
        (True, 'foo.py', '**/foo.py'),
        (True, 'baz/duck/bar/bam/quack/foo.py', '**/bar/**/foo.py'),
        (True, 'bar/foo.py', '**/foo.py'),
        (True, 'bar/baz/foo.py', 'bar/**'),
        (False, 'bar/baz/foo.py', 'bar/*'),
        (False, 'bar/baz/foo.py', 'bar**/foo.py'),
        (True, 'bar/baz/foo.py', 'bar/**/foo.py'),
        (True, 'bar/baz/wut/foo.py', 'bar/**/foo.py'),
    )
    results = []
    for seg in validity_paths_globs:
        valid, path, glob_pat = seg
        print('valid:', valid)
        print('path:', path)
        print('glob pattern:', glob_pat)
        re_pat = glob_to_re(glob_pat)
        print('RE pattern:', re_pat)
        match = re.fullmatch(re_pat, path)
        print('match:', match)
        result = bool(match) == valid
        results.append(result)
        print('result was expected:', result)
        print('-'*79)
    print('all results were expected:', all(results))
    print('='*79)
[...] would be better in def glob(path, pattern). - undefined

This is an extension of @Veedrac's PurePath.match answer that can be applied to a list of strings:
# Python 3.4+
from pathlib import Path
path_list = ["foo/bar.txt", "spam/bar.txt", "foo/eggs.txt"]
# convert string to pathlib.PosixPath / .WindowsPath, then apply PurePath.match to list
print([p for p in path_list if Path(p).match("ba*")]) # "*ba*" also works
# output: ['foo/bar.txt', 'spam/bar.txt']
print([p for p in path_list if Path(p).match("*o/ba*")])
# output: ['foo/bar.txt']
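It is preferable to use pathlib.Path() over pathlib.PurePath(), because then you don't have to worry about the underlying filesystem. Note that match() is purely lexical in both classes and never touches the filesystem.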
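Whichever class is used, the path flavor decides how separators and letter case are treated. A minimal sketch with an explicitly chosen flavor, using a hypothetical input string written with Windows-style separators:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical input: one string written with Windows-style separators.
name = r"foo\bar.txt"

# Windows flavor: '\' is a separator and matching is case-insensitive.
win_sep = PureWindowsPath(name).match("b*.txt")
win_case = PureWindowsPath("b.py").match("*.PY")

# POSIX flavor: '\' is an ordinary character and matching is case-sensitive.
posix_sep = PurePosixPath(name).match("b*.txt")
posix_case = PurePosixPath("b.py").match("*.PY")

print(win_sep, win_case, posix_sep, posix_case)
```

Picking the flavor explicitly keeps the results identical no matter which operating system runs the code.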
Never mind, I found it. I want the fnmatch module.

[...] an example of a case that fnmatch cannot handle? - Bhargav Rao

import re

rex = glob_to_re(glob_pattern)
rex = r'(?s:%s)\Z' % rex  # Can match newline; match whole string.
rex = re.compile(rex)
matches = [name for name in names if rex.match(name)]
import re as _re

class GlobSyntaxError(SyntaxError):
    pass

def glob_to_re(pattern):
    r"""
    Given pattern, a unicode string, return the equivalent regular expression.
    Any special character * ? [ ! - ] \ can be escaped by preceding it with
    backslash ('\') in the pattern.  Forward-slashes ('/') and escaped
    backslashes ('\\') are treated as ordinary characters, not boundaries.

    Here is the language glob_to_re understands.
    Earlier alternatives within rules have precedence.
        pattern = item*
        item    = '*'  |  '?'  |  '[!' set ']'  |  '[' set ']'  |  literal
        set     = element element*
        element = literal '-' literal  |  literal
        literal = '\' char  |  char other than \  [  ] and sometimes -
    glob_to_re does not understand "{a,b...}".
    """
    # (Note: the docstring above is r""" ... """ to preserve backslashes.)
    def expect_char(i, context):
        if i >= len(pattern):
            s = "Unfinished %s: %r, position %d." % (context, pattern, i)
            raise GlobSyntaxError(s)

    def literal_to_re(i, context="pattern", bad="[]"):
        if pattern[i] == '\\':
            i += 1
            expect_char(i, "backslashed literal")
        else:
            if pattern[i] in bad:
                s = "Unexpected %r in %s: %r, position %d." \
                    % (pattern[i], context, pattern, i)
                raise GlobSyntaxError(s)
        return _re.escape(pattern[i]), i + 1

    def set_to_re(i):
        assert pattern[i] == '['
        set_re = "["
        i += 1
        try:
            if pattern[i] == '!':
                set_re += '^'
                i += 1
            while True:
                lit_re, i = literal_to_re(i, "character set", bad="[-]")
                set_re += lit_re
                if pattern[i] == '-':
                    set_re += '-'
                    i += 1
                    expect_char(i, "character set range")
                    lit_re, i = literal_to_re(i, "character set range", bad="[-]")
                    set_re += lit_re
                if pattern[i] == ']':
                    return set_re + ']', i + 1
        except IndexError:
            expect_char(i, "character set")  # Trigger "unfinished" error.

    i = 0
    re_pat = ""
    while i < len(pattern):
        if pattern[i] == '*':
            re_pat += ".*"
            i += 1
        elif pattern[i] == '?':
            re_pat += "."
            i += 1
        elif pattern[i] == '[':
            set_re, i = set_to_re(i)
            re_pat += set_re
        else:
            lit_re, i = literal_to_re(i)
            re_pat += lit_re
    return re_pat
I wanted to add support for recursive glob patterns, such as things/**/*.py, and to have relative path matching, so that example*.py does not match folder/example_stuff.py.
Here is my approach:
# Assumed import: the snippet calls fnmatch_translate without defining it.
from fnmatch import translate as fnmatch_translate
from os import path
import re

def recursive_glob_filter(files, glob):
    # Convert to regex and add start of line match
    pattern_re = '^' + fnmatch_translate(glob)

    # fnmatch does not escape path separators so escape them
    if path.sep in pattern_re and not r'\{}'.format(path.sep) in pattern_re:
        pattern_re = pattern_re.replace('/', r'\/')

    # Replace `*` with one that ignores path separators
    sep_respecting_wildcard = r'[^\{}]*'.format(path.sep)
    pattern_re = pattern_re.replace('.*', sep_respecting_wildcard)

    # And now for `**` we have `[^\/]*[^\/]*`, so replace that with `.*`
    # to match all patterns in-between.
    # This relies on translate() emitting `.*.*` for `**`; check this on your
    # Python version, since newer releases may collapse consecutive `*`.
    pattern_re = pattern_re.replace(2 * sep_respecting_wildcard, '.*')
    compiled_re = re.compile(pattern_re)
    return filter(compiled_re.search, files)
[...] extract a simple string such as max_volume from [Parsed_volumedetect_0 @ 0x7fbf12004080] max_volume: -9.3 dB. I am trying to extract {max,mean}_volume from ffmpeg output. - vault