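I've installed FCKeditor, and when I paste text from MS Word it adds a lot of unnecessary formatting. I want to keep certain things, such as bold, italics, and bullets. I searched the web and found some solutions, but they strip everything, including the bold and italics I want to keep. Is there a way to strip only the unnecessary Word formatting?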
In case anyone wants a C# version of the accepted answer, here it is:
    // Requires: using System.Text.RegularExpressions;
    public string CleanHtml(string html)
    {
        // Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
        // Only returns acceptable HTML, and converts line breaks to <br />
        // Acceptable HTML includes HTML-encoded entities.
        html = html.Replace("&" + "nbsp;", " ").Trim(); // concat here due to SO formatting

        // Does this have HTML tags?
        if (html.IndexOf("<") >= 0)
        {
            // Make all tags lowercase
            html = Regex.Replace(html, "<[^>]+>", delegate(Match m) {
                return m.ToString().ToLower();
            });

            // Filter out anything except allowed tags
            // Problem: this strips attributes, including href from a
            // https://dev59.com/o3VC5IYBdhLWcg3wZwPj
            string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
            string WhiteListPattern = "</?(?(?=" + AcceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
            html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);

            // Make all BR/br tags look the same, and trim them of whitespace before/after
            html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
        }

        // No CRs
        html = html.Replace("\r", "");

        // Convert remaining LFs to line breaks
        html = html.Replace("\n", "<br />");

        // Trim BRs at the end of any string, and spaces on either side
        return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
    }
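Here's the solution I use to scrub incoming HTML from a rich text editor. It's written in VB.NET and I don't have time to convert it to C#, but it's pretty straightforward: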
    ' Requires: Imports System.Text.RegularExpressions
    Public Shared Function CleanHtml(ByVal html As String) As String
        ' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
        ' Only returns acceptable HTML, and converts line breaks to <br />
        ' Acceptable HTML includes HTML-encoded entities.
        html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting

        ' Does this have HTML tags?
        If html.IndexOf("<") >= 0 Then
            ' Make all tags lowercase
            html = Regex.Replace(html, "<[^>]+>", AddressOf LowerTag)

            ' Filter out anything except allowed tags
            ' Problem: this strips attributes, including href from a
            ' https://dev59.com/o3VC5IYBdhLWcg3wZwPj
            Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
            Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
            html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled)

            ' Make all BR/br tags look the same, and trim them of whitespace before/after
            html = Regex.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled)
        End If

        ' No CRs
        html = html.Replace(ControlChars.Cr, "")

        ' Convert remaining LFs to line breaks
        html = html.Replace(ControlChars.Lf, "<br />")

        ' Trim BRs at the end of any string, and spaces on either side
        Return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim()
    End Function

    Public Shared Function LowerTag(m As Match) As String
        Return m.ToString().ToLower()
    End Function
    // Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
    static string CleanWordHtml(string html)
    {
        StringCollection sc = new StringCollection();

        // get rid of unnecessary tag spans (comments and title)
        sc.Add(@"<!--(\w|\W)+?-->");
        sc.Add(@"<title>(\w|\W)+?</title>");

        // Get rid of classes and styles
        sc.Add(@"\s?class=\w+");
        sc.Add(@"\s+style='[^']+'");

        // Get rid of unnecessary tags
        sc.Add(@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");

        // Get rid of empty paragraph tags
        sc.Add(@"(<[^>]+>)+ (</\w+>)+");

        // remove bizarre v: element attached to <img> tag
        sc.Add(@"\s+v:\w+=""[^""]+""");

        // remove extra lines
        sc.Add(@"(\n\r){2,}");

        foreach (string s in sc)
        {
            html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
        }
        return html;
    }
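However, as the name and website suggest, fckeditor is a text editor. To me, that means it only shows you the characters in the file.
You can't have bold and italic formatting without some extra characters.
Edit: Ah, I see. Looking more closely at the FCKeditor website, it's an HTML editor, not one of the simple text editors I'm used to.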
One of its listed features is Word cleanup on paste, with automatic detection.
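For what it's worth, here is a rough sketch of how that built-in paste cleanup might be switched on in the editor's configuration file. The setting names are given as I remember them from the stock FCKeditor 2.x `fckconfig.js`, so treat them as assumptions and check them against the file that ships with your version. The first line is not part of a real config file; it is only there so the fragment can be loaded on its own.

```javascript
// Sketch of fckconfig.js settings (assumed FCKeditor 2.x names; verify locally).

// Stand-in so this fragment loads outside the editor; the real editor
// defines FCKConfig itself before the config file runs.
var FCKConfig = (typeof FCKConfig !== "undefined") ? FCKConfig : {};

// Offer the Word cleanup when pasted content looks like it came from Word.
// The stock config file marks this setting as IE only.
FCKConfig.AutoDetectPasteFromWord = true;

// Leave this off: forcing plain-text paste would also drop bold, italics, and lists.
FCKConfig.ForcePasteAsPlainText = false;
```

If the built-in cleanup still leaves too much Word markup behind, the server-side functions in the other answers can be run on the submitted HTML as a second pass.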