Regex to extract multiline hash comments

Question

3

I'm currently suffering from writer's block trying to come up with an elegant solution to this problem.

Take the following example:

{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}

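From the above, I'd like to extract the code comments grouped together rather than individually. Lines belong to the same group when one comment line directly follows another. Comments always start with whitespace followed by #.

Example result: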

Capture group 1: Some information about field 1\n on multiple lines
Capture group 2: Some more info on a single line

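If possible, I'd like to solve this with a regex rather than iterating over the lines in code and evaluating each one. If you feel regex is not the right tool for this problem, please explain why.

Summary:

Thanks, everyone, for submitting so many solutions; this is a prime example of how helpful the SO community can be. I'll spend an hour of my own time answering other questions to make up for the collective time spent on this one.

Hopefully this thread will help others too.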

- SleepyCal

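It can be done, but the regex itself depends on how you want to collect the data. Are you comfortable with regex? In any case, are you asking whether regex is a good way to do this, or how to do it? - SebasSBM

While I can build and understand some regexes, I wouldn't say it's one of my strong points; for example, I don't understand how negative matching works. I'd like to know whether regex is a good solution and, if so, what the regex should be. - SleepyCal

Maybe you want to split the regex match into fragments and store each fragment in its own variable. That can be done in a few lines of code. Do you need an example? - SebasSBM

Sure, any pointer in the right direction would be much appreciated. - SleepyCal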

4 Answers

1

Assuming you want to extract specific data (hashtags, for example) from a multiline string with a single regex, you can do it like this:

#!/usr/bin/env python
# coding: utf-8

import re

# The regexp isn't 100% accurate, but you'll get the point.
# A group followed by '?' is optional: it matches 0 or 1 times.
# The lazy '.*?' keeps the first group on the first hashtag of the line.
regexp = re.compile(r'^.*?(#[A-Za-z]*)(?:.*?(#[A-Za-z]*))?.*$')

multiline_string = '''
                     The awesomeness of #MotoGP is legendary. #Bikes rock!
                     Awesome racing car #HeroComesHome epic
'''

iterable_list = multiline_string.splitlines()

for line in iterable_list:
    fragments = regexp.match(line)
    # match() returns None for a line without a hashtag (e.g. a blank
    # line), so check before calling group().
    if fragments is None:
        continue
    frag_in_str = fragments.group(1)

    # An optional group that took no part in the match yields None.
    # IndexError is raised only for a group number the pattern lacks.
    some_other_subpattern = fragments.group(2) or ''

    entire_match = fragments.group(0)

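Each parenthesized group can be extracted this way.

A good example of negating a pattern has already been posted here: How to negate specific word in regex?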
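
Since negative matching came up in the comments, here is a minimal sketch of a negative lookahead, `(?!...)`, which succeeds only at positions where the enclosed pattern does not match. The sample text and the names `sample`, `not_comment`, and `non_comment_lines` are illustrative assumptions, not part of the question or this answer.

```python
import re

# Hypothetical sample: two comment lines and two data lines.
sample = '''    # a comment
    "field1": "value",
    # another comment
    "field2": "#not a comment"'''

# (?![ \t]*#) is a negative lookahead: at the start of each line it
# asserts that optional indentation followed by '#' does NOT come next.
not_comment = re.compile(r'^(?![ \t]*#).+$', re.MULTILINE)

non_comment_lines = [line.strip() for line in not_comment.findall(sample)]
print(non_comment_lines)
```

Only the two data lines are kept; the `#` inside the value of field2 does not count, because the lookahead only inspects the start of the line.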

- SebasSBM

1

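Thanks for trying to answer this question; upvoted for your effort. It seems @kasra's answer does exactly what I need, though. - SleepyCal

I parsed the regex in a hurry earlier, so it was wrong. I've corrected it. - SebasSBM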

1

You can use a deque to hold two lines at a time and add some logic to split the comments into blocks:

src='''\
{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",


    # multiple line comments
    # supported
    # as well 
    "field3": "#this would be ignored"

  }
}
'''

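from collections import deque

d = deque([], 2)    # sliding window: previous line and current line
blocks = []
block = []
for line in src.splitlines():
    d.append(line.strip())
    if d[-1].startswith('#'):
        comment = line.partition('#')[2]
        if len(d) == 2 and d[0].startswith('#'):
            block.append(comment)    # previous line was a comment too
        else:
            block = [comment]        # start a new block
    elif d[0].startswith('#'):
        blocks.append(block)         # a run of comments just ended

# Flush a block that runs up to the last line of the input.
if d and d[-1].startswith('#'):
    blocks.append(block)

for i, b in enumerate(blocks):
    print('block {}: \n{}'.format(i, '\n'.join(b)))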

Output:

block 0: 
 Some information about field 1
 on multiple lines
block 1: 
 Some more info on a single line
block 2: 
 multiple line comments
 supported
 as well

- dawg

Thank you for taking the time to answer. I've accepted another answer above, but this is a good solution too. Upvoted. - SleepyCal

1

It can't be done entirely with a regex, but you can do it with a one-liner.

import re

# Named 'text' rather than 'str' to avoid shadowing the built-in type.
text = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX"
    # Some information about field 1
    # on multiple lines
    # Some information about field 1
    # on multiple lines
    "field3": "#this would be ignored"
  }
}"""

# Replace every run of non-comment lines, and every comment prefix, with
# a newline; comment blocks then end up separated by a blank line.
rex = re.compile(r"(^(?!\s*#.*?[\r\n]+)(.*?)([\r\n]+|$)|[\r\n]*^\s*#\s*)+", re.MULTILINE)
print(rex.sub("\n", text).strip().split('\n\n'))

Output:

['Some information about field 1\non multiple lines', 'Some more info on a single line', 'Some information about field 1\non multiple lines\nSome information about field 1\non multiple lines']

- SanD

1

Nice solution. I've accepted another answer, but I appreciate you taking the time to answer, and I've upvoted. - SleepyCal

- Mazdak · Accepted Answer

You can use the following regex with re.findall (here s holds the sample text):

>>> import re
>>> m = re.findall(r'\s*#(.*)\s*#(.*)|#(.*)[^#]*', s, re.MULTILINE)
>>> m
[(' Some information about field 1', ' on multiple lines', ''), ('', '', ' Some more info on a single line')]

To print the groups, you can do the following:

>>> for i, j in enumerate(m):
...   print('group {}:{}'.format(i, " & ".join([x for x in j if x])))
... 
group 0: Some information about field 1 &  on multiple lines
group 1: Some more info on a single line
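
The two-line pattern above stops at pairs, and because its `#(.*)` alternative is not anchored to the start of a line it can also pick up a `#` that sits inside a string value. As a hedged sketch of a regex-only grouping for runs of any length, the pattern below matches consecutive whole-line comments in one go and then strips the `#` prefixes. It assumes comments always occupy their own line, as the question states; the names `COMMENT_RUN`, `PREFIX`, `extract_comment_blocks`, and `sample` are illustrative and not from the original answer.

```python
import re

# One or more consecutive lines made of optional indentation, '#', and
# the comment text. [ \t] (not \s) keeps a blank line from being
# absorbed, so a blank line ends the run.
COMMENT_RUN = re.compile(r'(?:^[ \t]*#.*(?:\n|$))+', re.MULTILINE)
# Leading indentation plus the '#' marker and at most one space after it.
PREFIX = re.compile(r'^[ \t]*#[ \t]?')


def extract_comment_blocks(text):
    """Return one string per run of adjacent comment lines."""
    blocks = []
    for run in COMMENT_RUN.finditer(text):
        lines = [PREFIX.sub('', line) for line in run.group(0).splitlines()]
        blocks.append('\n'.join(lines))
    return blocks


sample = '''{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}'''

print(extract_comment_blocks(sample))
```

On the question's sample this yields two blocks, the first one joined with a newline, and the value of field3 is left alone because its `#` is not at the start of a line.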

However, if you need a more general approach that handles comment blocks longer than two lines, you can use itertools.groupby:

s="""{
  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""
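from itertools import groupby

# groupby splits the lines into runs of comment / non-comment lines; the
# inner comprehension leaves an empty list for each non-comment run.
comments = [[i for i in j if i.strip().startswith('#')]
            for _, j in groupby(s.split('\n'), lambda x: x.strip().startswith('#'))]

for i, j in enumerate([m for m in comments if m], 1):
    l = [t.strip(' #') for t in j]
    print('group {} :{}'.format(i, ' & '.join(l)))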

Result:

group 1 :Some information about field 1 & on multiple lines & threeeeeeeeecomment
group 2 :Some more info on a single line
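
The same groupby idea can be wrapped in a small function that returns each block as a newline-joined string, which is closer to the capture groups requested in the question. This is only a sketch: `is_comment`, `comment_blocks`, and `demo` are hypothetical names, and it assumes a comment is any line whose first non-blank character is `#`.

```python
from itertools import groupby


def is_comment(line):
    """True when the first non-blank character of the line is '#'."""
    return line.lstrip().startswith('#')


def comment_blocks(text):
    """Group adjacent comment lines and return one string per group."""
    blocks = []
    for key, run in groupby(text.splitlines(), key=is_comment):
        if key:  # skip the runs of non-comment lines
            blocks.append('\n'.join(
                line.strip().lstrip('#').strip() for line in run))
    return blocks


demo = '''    # first
    # second
    "field1": 1,

    # third
    "field2": "#ignored"'''

print(comment_blocks(demo))
```

Passing the predicate as the groupby key means each run arrives already labelled as comment or non-comment, so no empty lists need to be filtered out afterwards.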