A repeated capture group only captures its last iteration. That is why you need to restructure your regular expression so that it works with re.findall:
\s*
(?:
  (?:^from\s+
    (  # Base (from (base) import ...)
      (?:[a-zA-Z_][a-zA-Z_0-9]*  # Variable name
        (?:\.[a-zA-Z_][a-zA-Z_0-9]*)*  # Attribute (.attr)
      )
    )\s+import\s+
  )
  |
  (?:^import\s|,)\s*
)
(  # Name of imported module (import (this))
  (?:[a-zA-Z_][a-zA-Z_0-9]*  # Variable name
    (?:\.[a-zA-Z_][a-zA-Z_0-9]*)*  # Attribute (.attr)
  )
)
(?:
  \s+as\s+
  (  # Variable module is imported into (import foo as bar)
    (?:[a-zA-Z_][a-zA-Z_0-9]*  # Variable name
      (?:\.[a-zA-Z_][a-zA-Z_0-9]*)*  # Attribute (.attr)
    )
  )
)?
\s*
(?=,|$)  # Ensure there is another thing being imported or it is the end of string
Try it on regex101.com
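In each tuple that re.findall returns, index 0 is the base, index 1 is (what you want) the name of the imported module, and index 2 is the variable the module is bound to: from (index 0) import (index 1) as (index 2). In the pattern itself these are capture groups 1, 2 and 3, since group 0 always refers to the whole match.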
import re
regex = r"\s*(?:(?:^from\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))\s+import\s+)|(?:^import\s|,)\s*)((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))(?:\s+as\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*)))?\s*(?=,|$)"
print(re.findall(regex, "import pandas, os, sys"))
[('', 'pandas', ''), ('', 'os', ''), ('', 'sys', '')]
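To see all three tuple positions filled in, here is a small sketch that runs the same pattern on a from ... import ... as ... line. The helper name parse_imports and the dict keys are my own, purely for illustration; the pattern is copied unchanged and only split across several string literals.

```python
import re

# Same pattern as above, split over several literals for readability.
IMPORT_RE = re.compile(
    r"\s*(?:(?:^from\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))\s+import\s+)"
    r"|(?:^import\s|,)\s*)"
    r"((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))"
    r"(?:\s+as\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*)))?"
    r"\s*(?=,|$)"
)


def parse_imports(line):
    """Hypothetical helper: turn each findall tuple into a small dict."""
    return [
        # Groups that did not take part in the match come back as '',
        # so map those to None.
        {"base": base or None, "name": name, "alias": alias or None}
        for base, name, alias in IMPORT_RE.findall(line)
    ]


print(parse_imports("from os import path as p, sep"))
# [{'base': 'os', 'name': 'path', 'alias': 'p'},
#  {'base': None, 'name': 'sep', 'alias': None}]
```

Note that the base only shows up on the first item: ^from can only match at the start of the string, so later names in the same statement are matched through the comma branch and come back with an empty base.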
You can remove the other two capture groups if you don't need them.
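A possible way to do that, as a sketch: turn the base and alias groups into plain non-capturing parts so that only the module name is captured. With a single capture group, re.findall returns a flat list of strings instead of tuples. The names NAME and MODULES_ONLY are my own and not part of the pattern above.

```python
import re

# Dotted identifier, e.g. "os" or "os.path" (same sub-pattern as in the full regex).
NAME = r"[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*"

MODULES_ONLY = re.compile(
    rf"\s*(?:^from\s+{NAME}\s+import\s+|(?:^import\s|,)\s*)"  # prefix, not captured
    rf"({NAME})"                                              # the only capture group
    rf"(?:\s+as\s+{NAME})?"                                   # optional alias, not captured
    rf"\s*(?=,|$)"
)

print(MODULES_ONLY.findall("import pandas, os as operating_system, sys"))
# ['pandas', 'os', 'sys']
```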
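Using * on a capture group will almost never produce the result you are looking for. That is generally not what regular expressions are for. Instead, the sensible approach is to grab the whole set of imported packages and then split the string on ,\s*(?=\w) or something similar. - Andris Leduskrasts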
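A minimal sketch of what this comment describes, assuming a plain single-line import statement. The first part shows a repeated capture group keeping only its last iteration; the second captures the whole list once and splits it afterwards. The function name split_imports and both patterns are illustrative choices of mine, not taken from the answer.

```python
import re

line = "import pandas, os, sys"

# A repeated capture group: the group matches three times,
# but only the last iteration is kept.
repeated = re.match(r"import\s+(?:(\w+),?\s*)+", line)
print(repeated.group(1))  # sys


def split_imports(statement):
    """Hypothetical helper: capture the whole import list, then split it."""
    whole = re.match(r"import\s+(.+)", statement)
    if whole is None:
        return []
    # Split on a comma plus optional whitespace that is followed by a word character.
    return re.split(r",\s*(?=\w)", whole.group(1).strip())


print(split_imports(line))  # ['pandas', 'os', 'sys']
```

This sketch does not handle "as" aliases or from-imports; each split item would still need to be post-processed for those.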