Regular expression to extract #hashtags from MMD metadata using Python

Question

3

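I am trying to extract all #hashtags from the "Tags: #tag1 #tag2" line of a MultiMarkdown plain-text file. (I am using Python's multiline mode.)

I have tried using a lookahead: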

^(?=Tags:\s.*)#(\w+)\b

as well as a lookbehind:

#(\w+)\b(?<=Tags:^\s)

The plain vanilla #(\w+)\b works, but it also catches any #hashtag that appears later in the document.

Any tips, help, and explanations are welcome.
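
For illustration, here is a minimal sketch of the symptoms described above. The sample string `doc` is hypothetical and only stands in for a real MultiMarkdown file.

```python
import re

# Hypothetical sample document: a metadata line followed by body text.
doc = "Tags: #tag1 #tag2\n\nBody text with a #later hashtag.\n"

# The plain pattern has no notion of lines, so it matches everywhere.
plain = re.findall(r'#(\w+)\b', doc)

# A lookahead is zero-width: after it succeeds at the start of the Tags
# line, the next character to match is still "T", not "#".
lookahead = re.findall(r'^(?=Tags:\s.*)#(\w+)\b', doc, re.MULTILINE)

print(plain)      # ['tag1', 'tag2', 'later']
print(lookahead)  # []
```

The lookahead attempt comes back empty because the assertion consumes no characters, so the pattern never advances past the "Tags:" prefix to reach a "#".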

- other_other

Try using DOTALL to force the dot to match newlines. - paulie.jvenuez

3

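You could think of it this way: extract all lines that contain "Tags:" followed by at least one tag, then extract the tags from all of the extracted lines. Otherwise, I know that the regex module for Python 3.X supports the search anchor \G. If you are using that module, you could use this regex in your script. - Jerry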
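
The first suggestion in this comment could be sketched roughly as follows, using only the standard `re` module. The function name `extract_tags` and the sample document are assumptions made for illustration, not part of the original comment.

```python
import re

def extract_tags(document):
    """Collect hashtag names from every line that starts with 'Tags:'."""
    # Step 1: re.MULTILINE lets ^ and $ anchor at each line; keep only
    # lines where "Tags:" is followed by at least one #tag.
    tag_lines = re.findall(r'^Tags:.*#\w+.*$', document, re.MULTILINE)
    # Step 2: pull the tag names out of the lines found in step 1.
    tags = []
    for line in tag_lines:
        tags.extend(re.findall(r'#(\w+)', line))
    return tags

sample = "Title: Demo\nTags: #tag1 #tag2\n\nBody with a #later hashtag.\n"
print(extract_tags(sample))  # ['tag1', 'tag2']
```

Splitting the work into two passes keeps each pattern simple, and a document without a Tags line just yields an empty list.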

2 Answers

1

First get the index position of the hashes in the input text, then use re.findall to get the repeated captures. The following example prints ['#tag1', '#tag2']:

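import re

text = "Tags: #tag1 #tag2"

# Match the "Tags" prefix up to, but not including, the first '#'.
matched = re.search(r'^Tags([^#]+)', text)
if matched:
    # The slice runs to the end of `text`, so this assumes `text`
    # holds only the Tags line.
    tag_text = text[matched.end():]
    hash_tags = re.findall(r'(#(?:[^#\s]+(?:\s*?)))', tag_text)
    print(hash_tags)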

- P̲̳x͓L̳

- blaze · Accepted Answer

import re

text = "\n\n#bogus\nTags: #foo #bar\n"

First, you need to get the line:

line = re.findall(r'Tags:.+\n', text)
# line = ['Tags: #foo #bar\n']

Finally, you need to get the tags from that line:

tags = re.findall(r'#(\w+)', line[0])
# tags = ['foo', 'bar']
tags = re.findall(r'#\w+', line[0])
# tags = ['#foo', '#bar']

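A lookbehind will not work here, because you would need to give it a pattern without a fixed width, and Python's re module only accepts lookbehind patterns that match strings of a fixed length.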
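
As a quick check of that point, here is a small sketch showing that the standard `re` module rejects a variable-width lookbehind at compile time, while a fixed-width one compiles but only reaches the first tag. The helper name `compiles` is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width lookbehind (".*" can match any length): rejected by re.
variable_ok = compiles(r'(?<=Tags:.*)#(\w+)')

# Fixed-width lookbehind compiles, but only the tag directly after
# "Tags: " satisfies it.
fixed_ok = compiles(r'(?<=Tags: )#(\w+)')
first_only = re.findall(r'(?<=Tags: )#(\w+)', "Tags: #foo #bar")

print(variable_ok, fixed_ok, first_only)  # False True ['foo']
```

This is why fetching the line first and then running a simple `#\w+` search over it tends to be the more practical route with the built-in module.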