从字符串中删除HTML标签

Question

从字符串中删除HTML标签

122

我怎样能够从一个字符串中删除HTML标签，以便输出干净的文本？

let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

- arled

只需使用HTML解析器。 - The Paramagnetic Croissant

1

Led，这个问题非常有价值，但是目前的状态很可能会被关闭，因为你没有提出一个清晰的问题：这是一个不可重现的场景。我建议你按照 [ask] 重新表达你的问题。我不希望那个问题被删除。 - Tunaki

3

这个被关闭为“离题”的问题是怎么回事？它是“Swift去除HTML标签”的谷歌搜索结果中排名第一的。 - canhazbits

2

@canhazbits 我也知道！点击“重新打开”以再次提名它重新打开。 - arled

1

Swift 3：string.replacingOccurrences(of: "<[^>]+>", with: " ", options: .regularExpression, range: nil) - etayluz

10个回答

38

由于HTML不是一种正则语言（HTML是一种上下文无关的语言），因此您不能使用正则表达式。请参见：为什么不能使用正则表达式解析HTML？

我建议考虑使用NSAttributedString。

let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. <br />"    
let htmlStringData = htmlString.dataUsingEncoding(NSUTF8StringEncoding)!
let options: [String: AnyObject] = [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding]
let attributedHTMLString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let string = attributedHTMLString.string

或者，像评论中的Irshad Mohamed一样：

let attributed = try NSAttributedString(data: htmlString.data(using: .unicode)!, options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType], documentAttributes: nil)
print(attributed.string)

- Joony

7

这似乎是最干净的方法，并且效果非常好！最好让经过实战考验的Foundation框架为你处理这个问题，而不是自己编写不可靠的解析器。 - Shyam Bhat

4

清洁！

let attributed = try NSAttributedString(data: htmlString.data(using: .unicode)!, options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType], documentAttributes: nil)                             print(attributed.string)

大多数人更喜欢选择简单易懂的答案。 - Irshad Mohamed

1

谢谢您提供的解决方案！在我们删除HTML标签的同时，是否有可能保留空格和换行符？目前，新字符串中所有换行符都被忽略了。 - A_G

10

只是一个警告，使用这个东西：HTML 样式转换（属性赋值）很慢！。一位在 WWDC 的 CoreText 工程师告诉我，这已经不再维护了，并且他已经完全忘记了它。 - Allison

1

关于之前的警告，我想提醒一下：在我们因为某个方法“太慢”而放弃它之前，让我们先看看一些数据。有很多C库是你经常使用的（通常是不自觉的），它们不需要太多的维护。这并不一定是坏事。 - Joony

显示剩余3条评论

28

在Swift 4中，Mohamed的解决方案是将其作为字符串扩展。

extension String {

    func stripOutHtml() -> String? {
        do {
            guard let data = self.data(using: .unicode) else {
                return nil
            }
            let attributed = try NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
            return attributed.string
        } catch {
            return nil
        }
    }
}

- Andrew

12

我正在使用以下扩展来删除特定的HTML元素：

extension String {
    func deleteHTMLTag(tag:String) -> String {
        return self.stringByReplacingOccurrencesOfString("(?i)</?\(tag)\\b[^<]*>", withString: "", options: .RegularExpressionSearch, range: nil)
    }

    func deleteHTMLTags(tags:[String]) -> String {
        var mutableString = self
        for tag in tags {
            mutableString = mutableString.deleteHTMLTag(tag)
        }
        return mutableString
    }
}

这使得从字符串中仅删除<a>标签成为可能，例如：

let string = "my html <a href="">link text</a>"
let withoutHTMLString = string.deleteHTMLTag("a") // Will be "my  html link text"

- Antoine

@Mr Lister，有没有办法删除所有 HTML 标签并保留 <a href="">链接文本</a>？ - Mazen Kasser

6

extension String{
    var htmlStripped : String{
        return self.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)
    }
}

愉快编码

- Benny Davidovitz

5

我更喜欢使用正则表达式而不是使用NSAttributedString HTML转换，需要注意的是这种方法非常耗时且需要在主线程上运行。更多信息请查看此处：https://developer.apple.com/documentation/foundation/nsattributedstring/1524613-initwithdata

对我来说，这个方式很有效，首先我删除任何CSS内联样式，然后再删除所有HTML标签。这种方法可能不如NSAttributedString选项可靠，但对我来说速度要快得多。

extension String {
    func withoutHtmlTags() -> String {
        let str = self.replacingOccurrences(of: "<style>[^>]+</style>", with: "", options: .regularExpression, range: nil)
        return str.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)
    }
}

- Pau Ballada

3

Swift 5：

extension String {
    public func trimHTMLTags() -> String? {
        guard let htmlStringData = self.data(using: String.Encoding.utf8) else {
            return nil
        }
    
        let options: [NSAttributedString.DocumentReadingOptionKey : Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]
    
        let attributedString = try? NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
        return attributedString?.string
    }
}

用途：

let  str = "my html <a href='https://www.google.com'>link text</a>"

print(str.trimHTMLTags() ?? "--") //"my html link text"

- Viktor

请翻译以下与编程有关的内容，从英文到中文。仅返回翻译文本：来源：https://gist.github.com/hashaam/31f51d4044a03473c18a168f4999f063#gistcomment-3137487 - Viktor

这也会删除换行符，但并不是完美的答案。 - Fadi Abuzant

2

Swift 4：

extension String {
    func deleteHTMLTag(tag:String) -> String {
        return self.replacingOccurrences(of: "(?i)</?\(tag)\\b[^<]*>", with: "", options: .regularExpression, range: nil)
    }

    func deleteHTMLTags(tags:[String]) -> String {
        var mutableString = self
        for tag in tags {
            mutableString = mutableString.deleteHTMLTag(tag: tag)
        }
        return mutableString
    }
}

- Logic

2

func deleteHTMLTag() -> String { return self.replacingOccurrences(of: "(?i)</?\b[^<]*>", with: "", options: .regularExpression, range: nil) } - Anil Kumar

这个正则表达式对我来说无法去除HTML代码。例如字符串：“<b>猫喜欢</b>做某事”。没有深入研究为什么它不起作用。但是text.replacingOccurrences（of：“<[^>]+>”，....）适用于我的简单情况。 - Benjamin Piette

2

更新至Swift 4版本：

最初的回答

guard let htmlStringData = htmlString.data(using: .unicode) else { fatalError() }

let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
                .documentType: NSAttributedString.DocumentType.html
                .characterEncoding: String.Encoding.unicode.rawValue
             ]

let attributedHTMLString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let string = attributedHTMLString.string

- Lee Irvine

你在 .documentType 后面缺少了一个逗号：param。 - cwgso

1

使用XMLEvent-Based Processing和XMLParser，可在所有平台上使用Foundation，取得了一定的成功。

优点

与使用正则表达式相比，此解决方案更具争议性。
更安全，因为一些人已经提到，HTML不是一种常规语言。
线程安全（无需在主线程上运行）。

缺点

HTML虽然与XML非常相似，但它们并不相同。在尝试将其解析为XML之前，您可能需要清理您的HTML。
例如：<br>和<hr>会导致解析失败，但<br />和<hr />将被解析为\n。
这是一个基于委托的API，强制您遵守NSObject协议和基于事件的处理。
XMLParser已经很长时间没有更新了，因此缺乏我们想要的许多新的Swift功能。
XMLDocument是一个更加现代和灵活的API，也提供了Foundation，但它仅在macOS上可用。

针对我的使用情况，我创建了一个类，使我能够使用async/await和异步处理。

请随意调整以适应您自己的用例，也许可以改进原始HTML字符串的清理过程。

解决方案

import Foundation

final class Parser: NSObject, XMLParserDelegate {
    private(set) var result = ""
    private var finished: (() -> Void)?
    private var fail: ((Error) -> Void)?
    private var content = ""

    init(html: String) async throws {
        super.init()
        
        result = try await withUnsafeThrowingContinuation { [weak self] continuation in
            // tweak here as needed
            let clean = html
                .replacingOccurrences(of: "<!DOCTYPE html>",
                                      with: "",
                                      options: .caseInsensitive)
                .replacingOccurrences(of: "<br>",
                                      with: "\n",
                                      options: .caseInsensitive)
                .replacingOccurrences(of: "<hr>",
                                      with: "\n",
                                      options: .caseInsensitive)
            
            let xml = XMLParser(data: .init(("<xml>" + clean + "</xml>").utf8))
            self?.finished = { [weak self] in
                xml.delegate = nil
                self?.fail = nil
                self?.finished = nil
                
                guard let content = self?.content else { return }

                continuation
                    .resume(returning: content
                        .trimmingCharacters(in:
                                .whitespacesAndNewlines))
            }
            
            self?.fail = { [weak self] in
                xml.delegate = nil
                self?.fail = nil
                self?.finished = nil
                xml.abortParsing()

                continuation
                    .resume(throwing: $0)
            }
            
            xml.delegate = self
            
            if !xml.parse(),
                let error = xml.parserError {
                self?.fail?(error)
            }
        }
    }
    
    func parserDidEndDocument(_: XMLParser) {
        finished?()
    }
    
    func parser(_: XMLParser, parseErrorOccurred: Error) {
        fail?(parseErrorOccurred)
    }
    
    func parser(_: XMLParser, validationErrorOccurred: Error) {
        fail?(validationErrorOccurred)
    }
    
    func parser(_: XMLParser, foundCharacters: String) {
        content += foundCharacters
    }
}

使用方法和示例

利用此帖子中已经给出的一些示例

let string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"

let result = try await Parser(html: string).result
// My First Heading My first paragraph.

let string = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. <br />"

let result = try await Parser(html: string).result
// LCD Soundsystem was the musical project of producer James Murphy, co-founder of dance-punk label DFA Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of alternative dance and post punk, along with elements of disco and other styles.

let string = "my html <a href=\"\">link text</a>"

let result = try await Parser(html: string).result
// my html link text

- vicegax

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Steve Rosenberg · Accepted Answer

嗯，我尝试了你的函数，并且它在一个小例子上有效：

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

//output "  My First Heading My first paragraph. "

你能举个问题的例子吗？

Swift 4和5版本：

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)