How can I match Markdown code blocks with a regular expression?

Question

8

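I'm trying to extract code blocks from a Markdown document using a PCRE regex. For the uninitiated, a code block in Markdown is defined as follows:

To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab. A code block continues until it reaches a line that is not indented (or the end of the article).

So, given this text: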

This is a code block:

    I need capturing along with
    this line

This is a code fence below (to be ignored):

``` json
This must have three backticks
flanking it
```

I love `inline code` too but don't capture

and one more short code block:

    Capture me

So far I have this regex:

(?:[ ]{4,}|\t{1,})(.+)

But it only captures each individual line that is prefixed with at least four spaces or one tab. It doesn't capture whole blocks.
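The line-by-line behaviour is easy to reproduce outside regex101. A minimal Python sketch, assuming Python's `re` module as a stand-in for PCRE; the sample string and the names `LINE_RE`, `text` and `lines` are made up for illustration:

```python
import re

# The pattern from the question, unchanged.
LINE_RE = re.compile(r"(?:[ ]{4,}|\t{1,})(.+)")

# Hypothetical sample: a single indented block made of two lines.
text = "Intro:\n\n    first line\n    second line\n\nOutro\n"

# findall returns group 1 of every match, so there is one entry per
# indented line rather than one entry for the whole block.
lines = LINE_RE.findall(text)
print(lines)
```

Each indented line comes back as its own match, which is the problem described above.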

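What I need help with is how to set up the condition so that it captures everything after the 4 spaces or 1 tab until it reaches either a line that is not indented or the end of the text.

Here's an online work in progress: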

https://www.regex101.com/r/yMQCIG/5

- Garry Pettet

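What options have you set on the regex? If you want to analyse the text as a block rather than line by line, try /regex/m, where the m turns on the "multiline" option. - halfer

I've tried toggling the m switch on regex101.com, but it doesn't help with the regex I currently have. Updated the question to include a link to my online regex. - Garry Pettet

Turning on the single-line switch ('s') on regex101.com actually makes the regex in my question match all of the sample text, which is also incorrect... - Garry Pettet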

1

"Capture me" is indented with only three spaces; see https://www.regex101.com/r/yMQCIG/3, where it has four. - Wiktor Stribiżew

2

A regex question on Stack Overflow that comes with a prior attempt is practically the eighth wonder of the world! Well done. - halfer

3 Answers

2

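There are three ways to mark up code: 1) indenting the lines; 2) surrounding a multi-line block with 3 or more backticks; or 3) inline code.
1 and 3 are part of John Gruber's original Markdown specification.
Here is how to achieve this. You need to run 3 separate regex tests: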

1) Using indentation

 (?:\n{2,}|\A)                   # Starting at beginning of string or with 2 new lines
 (?<code_all>
     (?:
         (?<code_prefix>         # Lines must start with a tab or a tab-width of spaces
             [ ]{4}
             |
             \t
         )
         (?<code_content>.*\n+)  # with some content, possibly nothing followed by a new line
     )+
 )
 (?<code_after>
     (?=^[ ]{0,4}\S)             # Lookahead for non-space at line-start
     |
     \Z                          # or end of doc
 )
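
For readers who want to try pattern 1 without a PCRE engine, here is a hedged Python port. Assumptions: Python's `re` spells named groups `(?P<name>...)`; the `x` and `m` flags are implied by the comments and the `^` in the lookahead, so `re.VERBOSE | re.MULTILINE` is used; the names `INDENTED_RE`, `SAMPLE` and `indented_blocks` are illustrative; the sample is the text from the question.

```python
import re

INDENTED_RE = re.compile(
    r"""
    (?:\n{2,}|\A)                  # start of string, or at least 2 newlines
    (?P<code_all>
        (?:
            (?P<code_prefix>[ ]{4}|\t)   # a tab or a tab-width of spaces
            (?P<code_content>.*\n+)      # rest of the line plus its newline(s)
        )+
    )
    (?P<code_after>
        (?=^[ ]{0,4}\S)            # next line starts with non-space content
        |
        \Z                         # or end of document
    )
    """,
    re.VERBOSE | re.MULTILINE,
)

SAMPLE = (
    "This is a code block:\n\n"
    "    I need capturing along with\n"
    "    this line\n\n"
    "This is a code fence below (to be ignored):\n\n"
    "``` json\nThis must have three backticks\nflanking it\n```\n\n"
    "I love `inline code` too but don't capture\n\n"
    "and one more short code block:\n\n"
    "    Capture me\n"
)

def indented_blocks(text):
    """Return the code_all group of every indented block found in text."""
    return [m.group("code_all") for m in INDENTED_RE.finditer(text)]

print(indented_blocks(SAMPLE))
```

Note that `code_content` requires a newline after every line, so in this sketch a final code line without a trailing newline is not matched.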

2a) Using a code block with backticks (vanilla Markdown)

    (?:\n+|\A)?                         # Necessarily at the beginning of a new line or start of string
    (?<code_all>
        (?<code_start>
            [ ]{0,3}                    # Possibly up to 3 leading spaces
            \`{3,}                      # 3 code marks (backticks) or more
        )
        \n+
        (?<code_content>.*?)            # enclosed content
        \n+
        (?<!`)
        \g{code_start}                  # balanced closing block marks
        (?!`)
        [ \t]*                          # possibly followed by some space
        \n
    )
    (?<code_trailing_new_line>\n|\Z)    # and a new line or end of string

2b) Using a code block with backticks and a class specifier (extended Markdown)

    (?:\n+|\A)?                 # Necessarily at the beginning of a new line
    (?<code_all>
        (?<code_start>
            [ ]{0,3}            # Possibly up to 3 leading spaces
            \`{3,}              # 3 code marks (backticks) or more
        )
        [ \t]*                  # Possibly some spaces or tab
        (?:
            (?:
                (?<code_class>[\w\-\.]+)    # or a code class like html, ruby, perl
                (?:
                    [ \t]*
                    \{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
                )?                          # Possibly followed by class and id definition in curly braces
            )
            |
            (?:
                [ \t]*
                \{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
            )                           # Followed by class and id definition in curly braces
        )
        \n+
        (?<code_content>.*?)    # enclosed content
        \n+
        (?<!`)
        \g{code_start}          # balanced closing block marks
        (?!`)
    )
    (?:\n|\Z)                # and a new line or end of string
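
A hedged Python port of pattern 2b, as a sketch rather than a drop-in replacement. Assumptions: Python's `re` rejects duplicate group names, so the second `code_def` is renamed to the hypothetical `code_def_only`; the backreference is written `(?P=code_start)`; and `re.DOTALL` is added so that `.*?` can span several lines. `FENCE_RE`, `SAMPLE` and `m` are illustrative names, and the sample reuses the fence from the question.

```python
import re

FENCE_RE = re.compile(
    r"""
    (?:\n+|\A)?
    (?P<code_all>
        (?P<code_start>[ ]{0,3}`{3,})       # up to 3 spaces, then 3+ backticks
        [ \t]*
        (?:
            (?P<code_class>[\w\-.]+)        # a class such as json
            (?:[ \t]*\{(?P<code_def>[^}]+)\})?
            |
            [ \t]*\{(?P<code_def_only>[^}]+)\}   # renamed for Python
        )
        \n+
        (?P<code_content>.*?)               # enclosed content
        \n+
        (?<!`)
        (?P=code_start)                     # balanced closing marks
        (?!`)
    )
    (?:\n|\Z)
    """,
    re.VERBOSE | re.DOTALL,
)

SAMPLE = (
    "This is a code fence below:\n\n"
    "``` json\n"
    "This must have three backticks\n"
    "flanking it\n"
    "```\n\n"
    "I love `inline code` too\n"
)

m = FENCE_RE.search(SAMPLE)
print(m.group("code_class"))
print(m.group("code_content"))
```

In PCRE itself the duplicated `code_def` name needs duplicate names to be enabled (for example with `(?J)`), which is why the rename is only a Python-side workaround.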

3) Using 1 or more backticks for inline code

 (?<!\\)                     # Ensuring this is not escaped
 (?<code_all>
     (?<code_start>\`{1,})   # One or more backtick(s)
     (?<code_content>.+?)    # Code content in between backticks
     (?<!`)                  # Not preceded by a backtick
     \g{code_start}          # Balanced closing backtick(s)
     (?!`)                   # And not followed by a backtick
 )
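
The inline pattern can be exercised the same way. This is a sketch under the same assumptions as before (Python `re`, `(?P<name>...)` groups, `(?P=code_start)` instead of `\g{code_start}`); `INLINE_RE`, `text` and `spans` are illustrative names and the test string is invented.

```python
import re

INLINE_RE = re.compile(
    r"""
    (?<!\\)                        # the opening backtick is not escaped
    (?P<code_all>
        (?P<code_start>`{1,})      # one or more backticks
        (?P<code_content>.+?)      # content between the backticks
        (?<!`)                     # not preceded by a backtick
        (?P=code_start)            # same number of closing backticks
        (?!`)                      # and not followed by a backtick
    )
    """,
    re.VERBOSE,
)

# Hypothetical input: a plain span, a double-backtick span that contains
# a single backtick, and an escaped pair that should be ignored.
text = "I love `inline code` too, also ``a ` b`` but not \\`this\\`"

spans = [m.group("code_content") for m in INLINE_RE.finditer(text)]
print(spans)
```

The backreference is what lets a double-backtick span contain a lone backtick without ending early.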

- Jacques

The pattern in example 3 is wrong: it is identical to pattern 1. Copy-paste error? - Senipah

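Yes, copy/paste error. It should be: (?<!\\) # Ensuring this is not escaped (?<code_all> (?<code_start>`{1,}) # One or more backtick(s) (?<code_content>.+?) # Code content in between backticks (?<!`) # Not preceded by a backtick \g{code_start} # Balanced closing backtick(s) (?!`) # And not followed by a backtick ) See https://regex101.com/r/C2Vl9M/1 - Jacques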

0

Try this?

[a-z]*\n[\s\S]*?\n

From your example, it will extract:

This must have three backticks
flanking it

- tzatalin

- trincot · Accepted Answer

You should use the start/end-of-line anchors (^ and $ in combination with the m modifier). Also, the last block in your test text has only 3 leading spaces:

^((?:(?:[ ]{4}|\t).*(\R|$))+)

By using \R and the repetition, each match covers an entire block instead of a single line.

See the demo on regex101.
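
A minimal Python sketch of the same idea, for readers without a PCRE engine at hand. Python's `re` module does not support `\R`, so the line break is written as `\r?\n` here, on the assumption that only LF and CRLF line endings matter. The names `BLOCK_RE`, `SAMPLE` and `indented_blocks` are illustrative, and the sample is the question's text with the last block indented by four spaces.

```python
import re

# ^ needs re.MULTILINE so that it matches at the start of every line.
BLOCK_RE = re.compile(r"^((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)", re.MULTILINE)

SAMPLE = (
    "This is a code block:\n\n"
    "    I need capturing along with\n"
    "    this line\n\n"
    "This is a code fence below (to be ignored):\n\n"
    "``` json\nThis must have three backticks\nflanking it\n```\n\n"
    "I love `inline code` too but don't capture\n\n"
    "and one more short code block:\n\n"
    "    Capture me\n"
)

def indented_blocks(text):
    """Return one string per run of consecutive indented lines."""
    return [m.group(1) for m in BLOCK_RE.finditer(text)]

for block in indented_blocks(SAMPLE):
    print(repr(block))
```

Each match still includes the indentation itself, so the leading four spaces or tab have to be stripped in a separate step if only the code is wanted.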

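Disclaimer: The rules of Markdown are more complex than the sample text shows. For example, when (nested) lists contain code blocks, those blocks need to be prefixed with 8, 12 or more spaces. A regular expression cannot identify such code blocks, or other code blocks embedded in Markdown notation that uses a wider range of format combinations.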