Regex - Convert HTML to a valid XML tag

Question

10

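I need help writing a regex function that converts an HTML string to a valid XML tag name. For example, it takes a string and does the following:

If a letter or underscore occurs in the string, keep it.
If any other character occurs, remove it from the output string.
If any other character occurs between words or letters, replace it with an underscore.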

Ex:
Input: Date Created
Output: Date_Created

Input: Date<br/>Created
Output: Date_Created

Input: Date\nCreated
Output: Date_Created

Input: Date    1 2 3 Created
Output: Date_Created

Basically, the regex function should convert the HTML string into a valid XML tag.

- Jake

3

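Your question says "I want to write", but it reads like a list of requirements waiting for someone to drop in the magic regex. It is also unclear what you consider an XML tag; none of your output examples contain any tags. - mario

@JackManey: That has over 4,000 upvotes now? Good grief. - mpen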

1

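If this only comes up occasionally and you just need a quick and dirty patch for test code for a short while, what is wrong with using a regex instead of the DOM? - Cylian

4 Answers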

2

Try this:

$result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

Explanation

"
(               # Match the regular expression below and capture its match into backreference number 1
                   # Match either the regular expression below (attempting the next alternative only if this one fails)
      [\d\s]          # Match a single character present in the list below
                         # A single digit 0..9
                         # A whitespace character (spaces, tabs, and line breaks)
   |               # Or match regular expression number 2 below (the entire group fails if this one fails to match)
      <               # Match the character “<” literally
      [^<>]           # Match a single character NOT present in the list “<>”
         +               # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      >               # Match the character “>” literally
)
"
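
A quick check against the question's examples suggests one caveat: the pattern matches a single digit or whitespace character at a time, so a run of them yields one underscore per character rather than one underscore in total. Below is a minimal sketch, assuming the goal is a single underscore per run. The function names `replace_each` and `replace_runs` are hypothetical, and the `+` quantifier on a non-capturing group is an addition to the pattern above, not part of it.

```php
<?php
// The pattern from this answer: every digit, whitespace character, or tag
// is replaced by its own underscore.
function replace_each(string $subject): string
{
    return preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);
}

// Assumed variant: a non-capturing group followed by + treats a whole run
// of digits, whitespace, and tags as one match, so it becomes one underscore.
function replace_runs(string $subject): string
{
    return preg_replace('/(?:[\d\s]|<[^<>]+>)+/', '_', $subject);
}

echo replace_each('Date<br/>Created'), "\n";      // Date_Created
echo replace_each('Date    1 2 3 Created'), "\n"; // one underscore per character in the run
echo replace_runs('Date    1 2 3 Created'), "\n"; // Date_Created
```

Neither version touches punctuation outside of tags, so whether this is enough depends on what the input can contain.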

- Cylian

2

You should be able to use:

$text = preg_replace( '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text );

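So, it uses lookbehind and lookahead assertions to check that there is a letter before and after, and replaces any run of non-letter, non-underscore characters between them.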
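
To see how the lookarounds behave on the question's inputs, here is a small sketch. The wrapper name `join_letters` is hypothetical; the pattern is the one given above. Note that the letters inside a tag name count as letters, so a tag such as `<br/>` is not removed as a unit.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function join_letters(string $text): string
{
    return preg_replace('/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text);
}

echo join_letters('Date    1 2 3 Created'), "\n"; // Date_Created
echo join_letters("Date\nCreated"), "\n";         // Date_Created
echo join_letters('Date<br/>Created'), "\n";      // Date_br_Created ("br" is kept)
echo join_letters(' Date Created '), "\n";        // outer spaces stay: no letter on both sides
```

If the input can contain markup, one option is to remove the tags first (for example with `strip_tags`) and only then apply this replacement.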

- adomnom

1

I believe the following should work.

preg_replace('/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/', '_', $string);

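The first part of the regex, [^A-Za-z_]+, matches one or more characters that are not letters or underscores. The last part does the same, except that it is optional. That lets the middle part, (.*)?, which is also optional, capture any characters (even letters and underscores) between the two blacklisted characters.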
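
Running the pattern on the question's first example is a useful sanity check: because `.*` is greedy, it consumes everything after the first non-letter character up to the end of the line, and the whole match is replaced by one underscore. The sketch below shows that, next to a simplified pattern that keeps only the leading character class. Both function names are hypothetical, and the simplified pattern is an assumption about the intent, not part of the answer.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function greedy_replace(string $string): string
{
    return preg_replace('/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/', '_', $string);
}

// Assumed simplification: replace each run of disallowed characters only.
function simple_replace(string $string): string
{
    return preg_replace('/[^A-Za-z_]+/', '_', $string);
}

echo greedy_replace('Date Created'), "\n"; // Date_  (the word "Created" is swallowed by .*)
echo simple_replace('Date Created'), "\n"; // Date_Created
```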

- Litty

- Ja͢ck · Accepted Answer

A little regex and some standard functions:

function mystrip($s)
{
        // add spaces around angle brackets to separate tag-like parts
        // e.g. "<br />" becomes " <br /> "
        // then let strip_tags take care of removing html tags
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

        // any sequence of characters that are not alphabet or underscore
        // gets replaced by a single underscore
        return preg_replace('/[^a-z_]+/i', '_', $s);
}
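
As a usage sketch, the four inputs from the question can be run through `mystrip()`. The function is repeated here only so the snippet runs on its own; the `$inputs` array is illustrative.

```php
<?php
// mystrip() repeated from the answer above so this snippet is self-contained.
function mystrip($s)
{
    // pad angle brackets with spaces, then let strip_tags remove the tags
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

    // collapse every run of non-letter, non-underscore characters
    return preg_replace('/[^a-z_]+/i', '_', $s);
}

$inputs = array(
    'Date Created',
    'Date<br/>Created',
    "Date\nCreated",
    'Date    1 2 3 Created',
);

foreach ($inputs as $input) {
    echo mystrip($input), "\n"; // each line prints Date_Created
}
```

Input that starts or ends with characters outside the allowed set comes back with a leading or trailing underscore. If that is unwanted, wrapping the result in `trim(..., '_')` is one option, at the cost of also removing underscores that were in the input.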