Remove all JavaScript tags and style tags from HTML using Python and the lxml module

Question

Remove all JavaScript tags and style tags from HTML using Python and the lxml module

36

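I am using the http://lxml.de/ library to parse HTML documents. So far I have figured out how to strip tags from an HTML document (In lxml, how do I remove a tag but retain all contents?), but the method described in that post leaves all the text behind: it strips the tags without removing the script itself. I have also found a class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is not clear how to actually use this class to clean a document. Any help, perhaps a short example, would be useful to me!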

- john-charles

5 Answers

6

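Below are a few examples of removing and parsing different types of HTML elements from an XML/HTML tree.

Key suggestion: it is best not to depend on external libraries and to do everything with "native Python 2/3 code".

Here are a few examples using "native" Python...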

import re

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # .*? lazily matches any characters up to the closing tag
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # .*? lazily matches any characters up to the closing tag
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META> to </meta> and variations)
pattern = r'<[ ]*meta.*?>'  # .*? lazily matches any characters up to the first ">"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # .*? lazily matches any characters up to the end of the comment
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # .*? lazily matches any characters up to the first ">"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

Notes:

re.IGNORECASE # makes the match case-insensitive, so <script>, <SCRIPT> and <Script> all match
re.MULTILINE # only changes how ^ and $ behave; these patterns use neither, so it is harmless but not required
re.DOTALL # makes "." match newlines as well, so a pattern can span several lines
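To make the effect of the flags concrete, here is a small sketch using only the standard library. The sample markup and the names SCRIPT_PATTERN, sample, cleaned and no_dotall are made up for illustration; the pattern is the same idea as the script pattern above, written with \s instead of [ ].

```python
import re

# Same idea as the script pattern above; \s also tolerates tabs and newlines.
SCRIPT_PATTERN = r'<\s*script.*?/\s*script\s*>'

# Hypothetical input: mixed-case tags and a script body spanning several lines.
sample = """<p>before</p>
<SCRIPT type="text/javascript">
  alert("hi");
</Script>
<p>after</p>"""

# IGNORECASE handles <SCRIPT>/<Script>; DOTALL lets "." cross the newlines.
cleaned = re.sub(SCRIPT_PATTERN, '', sample, flags=re.IGNORECASE | re.DOTALL)

# Without DOTALL, "." stops at the end of the line, so nothing is removed here.
no_dotall = re.sub(SCRIPT_PATTERN, '', sample, flags=re.IGNORECASE)

print(cleaned)
print(no_dotall == sample)
```

Dropping re.MULTILINE makes no difference for these patterns, since none of them uses ^ or $.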

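I have tested this on several different HTML files, and it runs quickly and matches across newlines! Note that it does not depend on beautifulsoup or any other externally downloaded library. Hope this helps! :)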

- Asher

4

You can use the strip_elements method to remove scripts, then use the strip_tags method to remove other tags:

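from lxml import etree

# fragment is an lxml element or tree that has already been parsed
etree.strip_elements(fragment, 'script')  # removes each <script> element together with its content
etree.strip_tags(fragment, 'a', 'p') # removes only the tags and keeps their content; add other tags as needed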
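A minimal self-contained sketch of the two calls, assuming lxml is installed. The markup and the names fragment and result are made up for illustration. One detail worth knowing: strip_elements takes a with_tail keyword that defaults to True, which also drops the text following the removed element, so the sketch passes with_tail=False to keep it.

```python
from lxml import etree, html

# Hypothetical fragment: a link inside a paragraph, then a script followed by tail text.
fragment = html.fromstring(
    '<div><p>Hello <a href="#">world</a></p><script>alert(1)</script> tail text</div>'
)

# Remove <script> and everything inside it; with_tail=False keeps " tail text".
etree.strip_elements(fragment, 'script', with_tail=False)

# Remove the <a> and <p> tags themselves, but pull their text up into the parent.
etree.strip_tags(fragment, 'a', 'p')

result = html.tostring(fragment, encoding='unicode')
print(result)
```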

- cenanozen

1

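For HTML documents, when removing scripts you want to get rid of all the JavaScript, not just the <script> tags themselves, so Cleaner is a better general solution (https://dev59.com/4moy5IYBdhLWcg3wcNhO#8554251), although strip_elements is fine for XML documents. - aculich

Thanks... your answer is still a good solution for XML documents, so I added some text to my answer to clarify the XML vs. HTML use cases. - aculich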

3

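You can use the bs4 library for this purpose.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_src, "lxml")
# extract() removes each tag from soup in place; the resulting list is simply discarded
[x.extract() for x in soup.find_all(['script', 'style'])]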

- Hafiz Muhammad Shafiq

2

Surely this does the opposite / what do you do with this list? - Andy Hayden

No, because this modifies soup in place. That is, soup no longer contains those tags. - havlock

0

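You can easily use regular expressions.

For JavaScript

import re

def remove_script_code(data):
    # Match <script ...> through </script>, in any letter case and across newlines
    clean = re.compile(r'<script\b[^>]*>.*?</script\s*>', re.IGNORECASE | re.DOTALL)
    return clean.sub('', data)  # returns the cleaned string

For CSS styles

def remove_style_code(data):
    # Match <style ...> through </style>, in any letter case and across newlines
    clean = re.compile(r'<style\b[^>]*>.*?</style\s*>', re.IGNORECASE | re.DOTALL)
    return clean.sub('', data)  # returns the cleaned string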

- sudeep kharel


- aculich · Accepted Answer

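Below is an example that does what you want. For an HTML document, Cleaner is a better general solution to this problem than strip_elements, because in cases like this you want to strip out more than just the <script> tags; you also want to get rid of things like onclick=function() attributes on other tags.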

#!/usr/bin/env python

import lxml.html  # import the submodule explicitly; a bare "import lxml" does not load lxml.html by itself
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True # activate the JavaScript filter
cleaner.style = True      # activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
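To try the same idea without a network request, here is a minimal sketch that runs Cleaner on an inline string. The markup and the names dirty and clean are made up for illustration. It assumes lxml.html.clean is importable; as far as I know, recent lxml releases (5.2 and later) ship this module as the separate lxml_html_clean package, which then has to be installed alongside lxml.

```python
from lxml.html.clean import Cleaner

# Options can also be passed as keyword arguments instead of being set afterwards.
cleaner = Cleaner(javascript=True, style=True)

# Hypothetical input: a <style> block, an inline onclick handler and a <script>.
dirty = (
    '<div><style>p {color: red}</style>'
    '<p onclick="evil()">Hi</p>'
    '<script>evil()</script></div>'
)

# Given a string, clean_html returns a cleaned string.
clean = cleaner.clean_html(dirty)
print(clean)
```

The point of the example is the onclick attribute: strip_elements alone would leave it in place, whereas Cleaner removes it along with the script and style elements.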

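You can find the list of options you can set in the lxml.html.clean.Cleaner documentation; some options can simply be set to True or False (the default differs from option to option), while others take a list, like this: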

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between kill and remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
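The difference is easiest to see on a tiny input. This is a sketch under the same assumption as before (lxml.html.clean must be importable); the markup and the name out are made up for illustration.

```python
from lxml.html.clean import Cleaner

# kill_tags drops the element and its whole subtree;
# remove_tags drops only the tag and keeps its content.
cleaner = Cleaner(kill_tags=['h1'], remove_tags=['b'])

out = cleaner.clean_html('<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>')
print(out)
```

Here the heading text disappears entirely, while the word inside the b element survives as plain text in the paragraph.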