Block Bingbot from crawling my site

Question

asp.net-mvc .htaccess bots robots.txt bing

9

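I want to completely block Bing from crawling my site for now (it is hitting my site at an alarming rate, about 500 GB of data per month).

I have 1000 subdomains added to Bing Webmaster Tools, so I can't set the crawl rate for each one individually. I tried blocking it with robots.txt, but that didn't work. Here is my robots.txt: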

# robots.txt 
User-agent: *
Disallow:
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member
Disallow: bingbot
User-agent: ia_archiver
Disallow: /

- Zoinky

I have also seen bingbot do this on many of the sites I manage. It completely ignores the generic "*" rules and any crawl delay. - WooDzu

2 Answers

4

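Your robots.txt is not correct:

You need line breaks between records (a record starts with one or more User-agent lines).
Disallow: bingbot disallows crawling of URLs whose paths start with "bingbot" (e.g. http://example.com/bingbot), which is probably not what you want.
Not an error, but the empty Disallow: line is not needed, because allowing everything is already the default.

So you probably want to use: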

User-agent: *
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member

User-agent: bingbot
User-agent: ia_archiver
Disallow: /

This disallows crawling of anything for "bingbot" and "ia_archiver". All other bots are allowed to crawl everything except URLs whose paths start with /member, /cgi-bin/, or *.axd.
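One way to sanity-check a file like the one above is Python's standard `urllib.robotparser`. This is only a sketch: `example.com`, the test paths, and the crawler name `otherbot` are placeholders, and real crawlers may interpret the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The corrected robots.txt from this answer, with a blank line between records.
ROBOTS_TXT = """\
User-agent: *
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member

User-agent: bingbot
User-agent: ia_archiver
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "otherbot" is a made-up name standing in for any crawler that only
# matches the generic "*" record.
checks = [
    ("bingbot", "http://example.com/"),
    ("ia_archiver", "http://example.com/index.html"),
    ("otherbot", "http://example.com/index.html"),
    ("otherbot", "http://example.com/member/profile"),
]
for agent, url in checks:
    print(agent, url, rules.can_fetch(agent, url))
```

Passing the short token (`bingbot`) rather than a full browser-style user agent string keeps the check independent of how this particular parser trims the string before matching.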

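Note that, according to the original robots.txt specification, *.axd is interpreted literally by bots (so they would not crawl http://example.com/*.axd, but they would crawl http://example.com/foo.axd). However, many bots extend the specification and interpret * as some kind of wildcard.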
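The difference between the two interpretations can be sketched in a few lines of Python. Both helper functions are hypothetical and written only for illustration, not taken from any crawler. The rule is written as `/*.axd` with a leading slash (an assumption, so that it can be compared directly against URL paths).

```python
import re


def matches_literal(rule, path):
    """Original-spec style: the rule is a plain path prefix, '*' is just a character."""
    return path.startswith(rule)


def matches_wildcard(rule, path):
    """Extended style: '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything except '*', which becomes '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start of the path, like a prefix rule.
    return re.match(pattern, path) is not None


rule = "/*.axd"
for path in ["/*.axd", "/foo.axd", "/index.html"]:
    print(path, matches_literal(rule, path), matches_wildcard(rule, path))
```

Under the literal reading only the path `/*.axd` itself is blocked. Under the wildcard reading `/foo.axd` is blocked too.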

- unor

- Carl · Accepted Answer

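This will seriously hurt your SEO/search ranking and cause pages to be dropped from the index, so use it with care.

If you have the IIS URL Rewrite module installed (if not, install it first), you can block requests based on the user agent string.

Then add a rule like this to your web.config: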

<system.webServer>
  <rewrite>
    <rules>
      <!-- Reject any request whose User-Agent contains msnbot or BingBot -->
      <rule name="Request Blocking Rule" stopProcessing="true">
        <match url=".*" />
        <conditions>
          <add input="{HTTP_USER_AGENT}" pattern="msnbot|BingBot" />
        </conditions>
        <action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You do not have permission to view this page." />
      </rule>
    </rules>
  </rewrite>
</system.webServer>

This returns a 403 error whenever the bot requests a page on your site.
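To see which user agent strings the rule's pattern catches, here is a rough Python equivalent of the matching logic. It is only a sketch and does not involve IIS: `status_for` is a made-up helper, the sample user agent strings are illustrative, and case-insensitive matching is an assumption based on the rewrite module ignoring case in conditions by default.

```python
import re

# Same alternation as the rule's pattern attribute. IGNORECASE reflects the
# assumption that the condition is evaluated case-insensitively.
BLOCKED_AGENTS = re.compile(r"msnbot|BingBot", re.IGNORECASE)


def status_for(user_agent):
    """Return the HTTP status the blocking rule would produce for this user agent."""
    # search() looks for the pattern anywhere in the string, not just at the start.
    return 403 if BLOCKED_AGENTS.search(user_agent) else 200


samples = [
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]
for ua in samples:
    print(status_for(ua), ua)
```

Because the match is an unanchored substring search, any client whose user agent happens to contain "msnbot" or "bingbot" is blocked as well.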

Update

Looking at your robots.txt, I think it should be:

# robots.txt
User-agent: *
Disallow:
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member

User-agent: bingbot
Disallow: /

User-agent: ia_archiver
Disallow: /