Regex function to split a paragraph into sentences in Power Query

Question

I am trying to split the sample paragraph below into sentences using regex in Power Query:

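Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Dr. Adam Jones thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. However, this won't work. Qr. Test website.COM and Labs.ORG look good. Creative doesn't work.t and done.9 start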

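Split into:

Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind? Dr. Adam Jones thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.

However, this won't work.

Qr.

Test website.

COM and Labs.

ORG look good.

Creative doesn't work.t and done.

9 start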

Here is a function that lets PQ perform regex replacements:

FnRegexReplace

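// regexReplace
    let   regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
        let
            // Treat missing flags as an empty flag string
            f=if flags = null or flags ="" then "" else flags,
            // Escape backslashes, then single quotes, so the values survive inside JavaScript string literals
            l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
            l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
            // Build a small HTML page whose script writes the result of String.replace
            t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
            // Let Web.Page evaluate the script and read back the element tree it produced
            r=Web.Page(t)[Data]{0}[Children]{0}[Children],
            // Take the written text if a second child exists, otherwise return an empty string
            Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
        in Output
    in regexReplace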

I also have the following regex, provided in an earlier post, which seems to work on Regex101.

https://regex101.com/r/WEC0M9/6

Pattern: (?<!Mr|Mrs|Dr|Jr)(\.+)(\s+(?![a-z])|(?=[A-Z]))

Replace: $1\r\n (I think this could be anything, such as *)

Flags: gm

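My problem is that when I try to do this in Power Query, no result is returned:

Or the version found here, which still has the same problem.

The problem seems to be related to the lookbehind and lookahead in the regex: once the ? constructs are removed, the function at least returns a result. If anyone can suggest how best to split this paragraph with regex in PQ, as shown above, that would be great.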

M Code: 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", "$1\r\n", "gm"))
in
    #"Invoked Custom Function"

Update 1: M code using the suggested regex:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "((?:\S+\.(?:net|org|com)\b|\b[mdjs]rs?\.|\d*\.\d+|[a-z]\.(?:[a-z]\.)+|[^?.!])+(?:[.?!]+|$))[?!.\s]*)", "$1\n", "gi"))
in
    #"Invoked Custom Function"

- Nick

Unfortunately, that also returns an empty result for me. - Nick

I just tested JvdV's replacement in regex101. Not sure why the results would differ? Before: https://regex101.com/r/WEC0M9/6 (V6), and after: https://regex101.com/r/WEC0M9/9 (V9). - Anonymous

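The Web.Page environment uses a JavaScript regex engine that does not support lookbehind. You also tagged Power BI. Power Query in Power BI can run Python and R scripts, both of which support lookbehind, and Python also offers the re.split method. - Ron Rosenfeld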
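
A quick way to check the point above is to ask the engine itself. The sketch below is only an illustration: `supportsLookbehind` is a hypothetical helper, and it assumes that an engine without lookbehind rejects `(?<!...)` with a SyntaxError when the RegExp is constructed. If that happens inside the script passed to Web.Page, `document.write` would never run, which would be consistent with the empty result reported in the question.

```javascript
// Hypothetical helper: returns true when the running JavaScript engine
// accepts a negative lookbehind, false when constructing it throws.
function supportsLookbehind() {
  try {
    new RegExp("(?<!a)b");
    return true;
  } catch (e) {
    return false;
  }
}

const lookbehindAvailable = supportsLookbehind();
console.log("lookbehind supported:", lookbehindAvailable);
```

Modern engines accept the construct, so a pattern that passes on regex101 or in a current browser can still fail in an older embedded engine.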

@JvdV Thanks for the explanation. I'm just trying to learn, as I don't use regex very often. - Anonymous

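@Nick, I'm reading up on possibly changing the JavaScript-based function. I can't solve it today. Maybe tomorrow (hopefully someone else will have an answer ready for you by then). - JvdV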

2 Answers

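This regex works for most text, and it can be extended to handle problem cases as they come up.

\s*((?:\b(?:[djms]rs?|flam|liq|st)\.|\b(?:[a-z]\.){2,}|\.\d|\.(?:com|net|org)\b|[^.?!])+(?:[.?!]+|$)) (with gmi as flags)

Here flam|liq|st are examples of words that, when followed by a ., would normally cause a split. This part of the regex forces those cases to be ignored. For example, given the text "St. Bernards usually weigh 80 kg", the text would normally be split at St. Adding st to this part of the regex prevents that, so the whole sentence is captured. You can keep adding to this part to accommodate most errors. If you would like to improve this further, please post a comment or an answer.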
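
To see the effect of the abbreviation list, here is a small sketch in plain JavaScript, the regex flavor the PQ function relies on. The helper names and the `extraAbbreviations` parameter are hypothetical and only serve the illustration, and the second sentence of the sample text is made up.

```javascript
// Build the pattern from this answer with a configurable abbreviation list.
// `extraAbbreviations` is a hypothetical parameter added for this sketch.
function buildSentencePattern(extraAbbreviations) {
  const abbreviations = ["[djms]rs?"].concat(extraAbbreviations).join("|");
  const source = String.raw`\s*((?:\b(?:${abbreviations})\.|\b(?:[a-z]\.){2,}|\.\d|\.(?:com|net|org)\b|[^.?!])+(?:[.?!]+|$))`;
  return new RegExp(source, "gmi");
}

// Put each captured sentence on its own line, then split on the line breaks.
function splitSentences(text, extraAbbreviations) {
  const pattern = buildSentencePattern(extraAbbreviations);
  return text.replace(pattern, "$1\n").split("\n").filter((s) => s.length > 0);
}

const text = "St. Bernards usually weigh 80 kg. They are friendly.";

// Without "st" in the list, the text is cut right after "St."
const withoutSt = splitSentences(text, []);
// With the list from this answer, "St." stays inside its sentence.
const withSt = splitSentences(text, ["flam", "liq", "st"]);

console.log(withoutSt);
console.log(withSt);
```

The list is only as good as the abbreviations you add to it, so it needs to grow with the text you feed it.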

- Nick

- JvdV · Accepted Answer

I think the following would be helpful:

For demonstration purposes I loaded the data directly from Excel. I'm sure you can figure out how to connect your PDF;
Since the JavaScript-based function is a small HTML script, we first have to escape the apostrophe in the sample text using a replace function. Otherwise it will clash with the apostrophes used to write the script in the function (see below). If we don't, the function will error out or show nothing. The apostrophe will be shown correctly after applying the function;
I edited the pattern to catch a full sentence in the first capture group, and for this sample I replaced what is captured with the backreference to this group plus a pipe symbol to visualize the result. Note that a negative lookbehind is no longer used, since the engine does not support it. This resulted in a lengthy pattern which probably does not yet catch all the possible quirks:
```
\s*((?:\b[MDJS]rs?\.|\d*\.\d+|\S+\.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.)+|[^.?!])+(?:[.?!]+|$))
```
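
As a rough check outside Power Query, the pattern can be tried in plain JavaScript. This is only a sketch: the sample text is an illustrative assumption rather than the asker's exact data, and the variable names are made up for the example.

```javascript
// The pattern above, written as a regex literal with the g, m and i flags.
const sentencePattern = /\s*((?:\b[MDJS]rs?\.|\d*\.\d+|\S+\.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.)+|[^.?!])+(?:[.?!]+|$))/gim;

// Illustrative sample text (an assumption for this sketch).
const sample = "Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, " +
  "i.e. he paid a lot for it. Did he mind? Dr. Jones thinks he didn't.";

// Same replacement as in the M code: capture group 1 followed by a pipe.
const marked = sample.replace(sentencePattern, "$1|");

// Split on the pipe and drop the empty tail left after the last sentence.
const sentences = marked.split("|").filter((s) => s.length > 0);

console.log(sentences);
```

Because the leading `\s*` sits outside the capture group, the whitespace between sentences is dropped by the replacement.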

M-Code:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Kol", type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type","'","&apos",Replacer.ReplaceText,{"Kol"}),
    #"Invoked Custom Function" = Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Kol], "\\s*((?:\\b[MDJS]rs?\\.|\\d*\\.\\d+|\\S+\\.(?:com|net|org)\\b|[a-z]\\.(?:[a-z]\\.)+|[^.?!])+(?:[.?!]+|$))", "$1|"))
in
    #"Invoked Custom Function"

The fnRegexReplace function used:

(x,y,z)=>
let 
   Source = Web.Page(
                     "<script>var x="&"'"&x&"'"&";var z="&"'"&z&
                     "'"&";var y=new RegExp('"&y&"','gmi');
                     var b=x.replace(y,z);document.write(b);</script>")
                     [Data]{0}[Children]{0}[Children]{1}[Text]{0}
in 
   Source
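
The sketch below illustrates why the apostrophe has to be replaced and why the backslashes in the pattern are doubled. It rebuilds the script text in plain JavaScript and checks it with `new Function`. The helpers `buildScript`, `parses` and `run` are hypothetical names for this illustration, `document.write` is left out so the code runs outside a browser, and `String.raw` stands in for an M text literal on the assumption that M passes backslashes through unchanged.

```javascript
// Hypothetical helper: assembles the same script text as the M function above,
// minus the document.write call, so it can run outside a browser.
function buildScript(x, y, z) {
  return "var x='" + x + "';var z='" + z + "';" +
    "var y=new RegExp('" + y + "','gmi');" +
    "var b=x.replace(y,z);";
}

// Parse the script without running it; invalid source throws a SyntaxError.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    return false;
  }
}

// Run the script and return the value it would have written to the page.
function run(source) {
  return new Function(source + "return b;")();
}

// String.raw keeps backslashes exactly as typed.
const doubled = String.raw`\\s+`; // two backslashes reach the script text
const single = String.raw`\s+`;   // one backslash reaches the script text

// A raw apostrophe ends the quoted string early and breaks the script.
const rawApostrophe = parses(buildScript("he didn't mind", doubled, "|"));
// The replacement text used in the M step keeps the script valid.
const escapedApostrophe = parses(buildScript("he didn&apost mind", doubled, "|"));

// Doubled backslashes survive the string literal and give the regex \s+
const withDoubled = run(buildScript("a b  c", doubled, "|"));
// A single backslash is consumed by the string literal, leaving the regex s+
const withSingle = run(buildScript("a b  c", single, "|"));

console.log(rawApostrophe, escapedApostrophe, withDoubled, withSingle);
```

The function in the question does this doubling itself with Text.Replace, so passing an already doubled pattern to that version would escape the backslashes twice.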

Here is an online demo of the regex.