How can I get the HTML of a page that is behind Cloudflare DDoS protection?

20

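I use HtmlAgilityPack to fetch web page data, but I have tried everything with a page that uses www.cloudflare.com's DDoS protection. HtmlAgilityPack cannot handle the redirect page, because it does not redirect with a meta tag or with JavaScript. My guess is that it checks, by means of a cookie, whether you have already been verified, and I failed to simulate that in C#. When I fetch the page, the HTML I get back is that of the landing page.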


Pass the cookies along: http://stackoverflow.com/a/20478716/736079 - jessehouwing
You can also use the BrowserSession class, as described here: http://refactoringaspnet.blogspot.nl/2010/04/using-htmlagilitypack-to-get-and-post.html - jessehouwing
4 Answers

14
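I ran into this problem some time ago too. The real solution is to solve the challenge the Cloudflare site gives you: you compute the correct answer using JavaScript and send it back, and you then receive a cookie/token with which you can continue to view the website. Until you do that, all you normally get is a page like this.

[image: cloudflare]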

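In the end I simply called a Python script through a shell execute. I used the module provided by this GitHub fork. It can serve as a starting point for implementing the evasion of the Cloudflare anti-DDoS page in C# as well.
FYI, the Python script I wrote for my own use just writes the cookie to a file. I later read that file from C# and store the cookie in a CookieJar so that I can keep browsing the page from within C#.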
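#!/usr/bin/env python
import sys

import cfscrape

# Solve the challenge for the URL given on the command line.
# Note: recent cfscrape releases return a (cookie_string, user_agent) tuple
# here; in that case unpack it before writing it out.
c = cfscrape.get_cookie_string(sys.argv[1])

# Save the result so that the C# side can read it back later.
with open("cookie.txt", "w") as fd:
    fd.write(str(c))
print(c)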
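
Whatever reads `cookie.txt` afterwards has to split the cookie string into individual cookies again. A rough sketch of that step, assuming the file holds a plain `name=value; name2=value2` header string; `parse_cookie_string` is a name made up for this example:

```python
from http.cookies import SimpleCookie


def parse_cookie_string(cookie_string):
    """Split a 'name=value; other=value' cookie header into a dict."""
    jar = SimpleCookie()
    jar.load(cookie_string)
    # Each entry is a Morsel; only its plain value is of interest here.
    return {name: morsel.value for name, morsel in jar.items()}
```

Calling `parse_cookie_string(open("cookie.txt").read().strip())` would then give a name-to-value mapping that can be copied into whatever cookie container the consumer uses.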

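Edit: To repeat, this has only LITTLE to do with cookies! Cloudflare forces you to solve a REAL challenge using JavaScript commands. It is not as simple as accepting a cookie and using it later on. Look at https://github.com/Anorov/cloudflare-scrape/blob/master/cfscrape/__init__.py and the roughly 40 lines of JavaScript emulation it uses to solve the challenge.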

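Edit 2: Instead of writing something to circumvent the protection, some people use a full browser object (not a headless browser) to visit the site and subscribe to certain events fired when the page has loaded. Use the WebBrowser class to create an infinitesimally small browser window and subscribe to the appropriate events.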

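Edit 3: Alright, I actually implemented a way to do this in C#. It uses the JavaScript engine Jint for .NET, available at https://www.nuget.org/packages/Jint.

The cookie-handling code is ugly because the HttpResponse class sometimes fails to pick up the cookies even though the header contains a Set-Cookie section.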

using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.Web;
using System.Collections;
using System.Threading;

namespace Cloudflare_Evader
{
    public class CloudflareEvader
    {
        /// <summary>
        /// Tries to return a WebClient with the necessary cookies installed to make requests to a Cloudflare-protected website.
        /// </summary>
        /// <param name="url">The page which is behind Cloudflare's anti-DDoS protection</param>
        /// <returns>A WebClient object or null on failure</returns>
        public static WebClient CreateBypassedWebClient(string url)
        {
            var JSEngine = new Jint.Engine(); //Use this JavaScript engine to compute the result.

            //Download the original page
            var uri = new Uri(url);
            HttpWebRequest req =(HttpWebRequest) WebRequest.Create(url);
            req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0";
            //Try to make the usual request first. If this fails with a 503, the page is behind cloudflare.
            try
            {
                //If this succeeds, no challenge page was served and a plain WebClient is enough.
                using (var res = req.GetResponse()) { } //dispose the response, its body is not needed
                return new WebClient();
            }
            catch (WebException ex) //We usually get this because of a 503 service not available.
            {
                if (ex.Response == null) //No HTTP response at all (e.g. DNS or connection failure).
                    return null;
                string html = "";
                using (var reader = new StreamReader(ex.Response.GetResponseStream()))
                    html = reader.ReadToEnd();
                //If we get on the landing page, Cloudflare gives us a User-ID token with the cookie. We need to save that and use it in the next request.
                var cookie_container = new CookieContainer();
                //using a custom function because ex.Response.Cookies returns an empty set ALTHOUGH cookies were sent back.
                var initial_cookies = GetAllCookiesFromHeader(ex.Response.Headers["Set-Cookie"], uri.Host); 
                foreach (Cookie init_cookie in initial_cookies)
                    cookie_container.Add(init_cookie);

                /* Solve the actual challenge with a bunch of regexes. Copy-pasted from the Python scraper version. */
                var challenge = Regex.Match(html, "name=\"jschl_vc\" value=\"(\\w+)\"").Groups[1].Value;
                var challenge_pass = Regex.Match(html, "name=\"pass\" value=\"(.+?)\"").Groups[1].Value;

                var builder = Regex.Match(html, @"setTimeout\(function\(\){\s+(var t,r,a,f.+?\r?\n[\s\S]+?a\.value =.+?)\r?\n").Groups[1].Value;
                builder = Regex.Replace(builder, @"a\.value =(.+?) \+ .+?;", "$1");
                builder = Regex.Replace(builder, @"\s{3,}[a-z](?: = |\.).+", "");

                //Format the javascript..
                builder = Regex.Replace(builder, @"[\n\\']", "");

                //Execute it. 
                long solved = long.Parse(JSEngine.Execute(builder).GetCompletionValue().ToObject().ToString());
                solved += uri.Host.Length; //add the length of the domain to it.

                Console.WriteLine("***** SOLVED CHALLENGE ******: " + solved);
                Thread.Sleep(3000); //This sleep IS required or Cloudflare will not give you the token!!

                //Retrieve the cookies. Prepare the URL for cookie exfiltration.
                string cookie_url = string.Format("{0}://{1}/cdn-cgi/l/chk_jschl", uri.Scheme, uri.Host);
                var uri_builder = new UriBuilder(cookie_url);
                var query = HttpUtility.ParseQueryString(uri_builder.Query);
                //Add our answers to the GET query
                query["jschl_vc"] = challenge;
                query["jschl_answer"] = solved.ToString();
                query["pass"] = challenge_pass;
                uri_builder.Query = query.ToString();

                //Create the actual request to get the security clearance cookie
                HttpWebRequest cookie_req = (HttpWebRequest) WebRequest.Create(uri_builder.Uri);
                cookie_req.AllowAutoRedirect = false;
                cookie_req.CookieContainer = cookie_container;
                cookie_req.Referer = url;
                cookie_req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0";
                //We assume that this request goes through well, so no try-catch
                var cookie_resp = (HttpWebResponse)cookie_req.GetResponse();
                //The response *should* contain the security clearance cookie!
                if (cookie_resp.Cookies.Count != 0) //first check if the HttpWebResponse has picked up the cookie.
                    foreach (Cookie cookie in cookie_resp.Cookies)
                        cookie_container.Add(cookie);
                else //otherwise, use the custom function again
                {
                    //the cookie we *hopefully* received here is the cloudflare security clearance token.
                    if (cookie_resp.Headers["Set-Cookie"] != null)
                    {
                        var cookies_parsed = GetAllCookiesFromHeader(cookie_resp.Headers["Set-Cookie"], uri.Host);
                        foreach (Cookie cookie in cookies_parsed)
                            cookie_container.Add(cookie);
                    }
                    else
                    {
                        //No security clearance? Something went wrong.. return null.
                        //Console.WriteLine("MASSIVE ERROR: COULDN'T GET CLOUDFLARE CLEARANCE!");
                        return null;
                    }
                }
                //Create a custom webclient with the two cookies we already acquired.
                WebClient modedWebClient = new WebClientEx(cookie_container);
                modedWebClient.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0");
                modedWebClient.Headers.Add("Referer", url);
                return modedWebClient;
            }
        }

        /* Credit goes to https://dev59.com/52Up5IYBdhLWcg3wtZLt 
           (user https://stackoverflow.com/users/541404/cameron-tinker) for these functions 
        */
        public static CookieCollection GetAllCookiesFromHeader(string strHeader, string strHost)
        {
            ArrayList al = new ArrayList();
            CookieCollection cc = new CookieCollection();
            if (!string.IsNullOrEmpty(strHeader)) //the Set-Cookie header may be missing entirely (null)
            {
                al = ConvertCookieHeaderToArrayList(strHeader);
                cc = ConvertCookieArraysToCookieCollection(al, strHost);
            }
            return cc;
        }

        private static ArrayList ConvertCookieHeaderToArrayList(string strCookHeader)
        {
            strCookHeader = strCookHeader.Replace("\r", "");
            strCookHeader = strCookHeader.Replace("\n", "");
            string[] strCookTemp = strCookHeader.Split(',');
            ArrayList al = new ArrayList();
            int i = 0;
            int n = strCookTemp.Length;
            while (i < n)
            {
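                //An "expires" date contains a comma itself, so re-join it with the following fragment.
                if (i + 1 < n && strCookTemp[i].IndexOf("expires=", StringComparison.OrdinalIgnoreCase) > 0)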
                {
                    al.Add(strCookTemp[i] + "," + strCookTemp[i + 1]);
                    i = i + 1;
                }
                else
                    al.Add(strCookTemp[i]);
                i = i + 1;
            }
            return al;
        }

        private static CookieCollection ConvertCookieArraysToCookieCollection(ArrayList al, string strHost)
        {
            CookieCollection cc = new CookieCollection();

            int alcount = al.Count;
            string strEachCook;
            string[] strEachCookParts;
            for (int i = 0; i < alcount; i++)
            {
                strEachCook = al[i].ToString();
                strEachCookParts = strEachCook.Split(';');
                int intEachCookPartsCount = strEachCookParts.Length;
                string strCNameAndCValue = string.Empty;
                string strPNameAndPValue = string.Empty;
                string strDNameAndDValue = string.Empty;
                string[] NameValuePairTemp;
                Cookie cookTemp = new Cookie();

                for (int j = 0; j < intEachCookPartsCount; j++)
                {
                    if (j == 0)
                    {
                        strCNameAndCValue = strEachCookParts[j];
                        if (strCNameAndCValue != string.Empty)
                        {
                            int firstEqual = strCNameAndCValue.IndexOf("=");
                            string firstName = strCNameAndCValue.Substring(0, firstEqual);
                            string allValue = strCNameAndCValue.Substring(firstEqual + 1, strCNameAndCValue.Length - (firstEqual + 1));
                            cookTemp.Name = firstName;
                            cookTemp.Value = allValue;
                        }
                        continue;
                    }
                    if (strEachCookParts[j].IndexOf("path", StringComparison.OrdinalIgnoreCase) >= 0)
                    {
                        strPNameAndPValue = strEachCookParts[j];
                        if (strPNameAndPValue != string.Empty)
                        {
                            NameValuePairTemp = strPNameAndPValue.Split('=');
                            if (NameValuePairTemp[1] != string.Empty)
                                cookTemp.Path = NameValuePairTemp[1];
                            else
                                cookTemp.Path = "/";
                        }
                        continue;
                    }

                    if (strEachCookParts[j].IndexOf("domain", StringComparison.OrdinalIgnoreCase) >= 0)
                    {
                        strPNameAndPValue = strEachCookParts[j];
                        if (strPNameAndPValue != string.Empty)
                        {
                            NameValuePairTemp = strPNameAndPValue.Split('=');

                            if (NameValuePairTemp[1] != string.Empty)
                                cookTemp.Domain = NameValuePairTemp[1];
                            else
                                cookTemp.Domain = strHost;
                        }
                        continue;
                    }
                }

                if (cookTemp.Path == string.Empty)
                    cookTemp.Path = "/";
                if (cookTemp.Domain == string.Empty)
                    cookTemp.Domain = strHost;
                cc.Add(cookTemp);
            }
            return cc;
        }
    }

    /*Credit goes to  https://dev59.com/6nI-5IYBdhLWcg3wlpeW
 (user https://stackoverflow.com/users/129124/pavel-savara) */
    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        public CookieContainer CookieContainer
        {
            get { return container; }
            set { container = value; }
        }

        private CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

This function returns a WebClient with the challenge solved and the cookies in place. You can use it as follows:
static void Main(string[] args)
{
    WebClient client = null;
    while (client == null)
    {
        Console.WriteLine("Trying..");
        client = CloudflareEvader.CreateBypassedWebClient("http://anilinkz.tv");
    }
    Console.WriteLine("Solved! We're clear to go");
    Console.WriteLine(client.DownloadString("http://anilinkz.tv/anime-list"));

    Console.ReadLine();
}
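
Stripped of the cookie bookkeeping, the C# class above does three things: it pulls the two hidden form fields out of the challenge page, adds the length of the host name to the number computed by the page's JavaScript, and sends everything to `/cdn-cgi/l/chk_jschl`. The first and last steps can be sketched compactly in Python, reusing the regular expressions and parameter names from the code above. The function names are invented for this sketch, the JavaScript evaluation itself is left out, and all of it assumes the challenge format this answer was written against:

```python
import re
from urllib.parse import urlencode, urlsplit


def extract_challenge_fields(html):
    """Return (jschl_vc, pass) from a challenge page, or None if absent."""
    vc = re.search(r'name="jschl_vc" value="(\w+)"', html)
    pw = re.search(r'name="pass" value="(.+?)"', html)
    if vc is None or pw is None:
        return None
    return vc.group(1), pw.group(1)


def build_clearance_url(page_url, jschl_vc, challenge_pass, js_result):
    """Build the GET URL that submits the answer.

    js_result is the number obtained by evaluating the page's JavaScript;
    the host-name length is added to it, as in the C# code.
    """
    parts = urlsplit(page_url)
    answer = js_result + len(parts.hostname)
    query = urlencode({
        "jschl_vc": jschl_vc,
        "jschl_answer": answer,
        "pass": challenge_pass,
    })
    return "{0}://{1}/cdn-cgi/l/chk_jschl?{2}".format(
        parts.scheme, parts.hostname, query)
```

Keeping the parsing and the URL construction in small pure functions like these makes it easier to see which part breaks when the challenge page changes.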

8
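I ran into the same problem recently. As a solution, I wrote a small portable .NET class library that provides a DelegatingHandler which manages the JS challenge automatically, so you can access protected sites with the HttpClient class without worrying about CloudFlare's protection. It does not depend on any JS engine. - El Cattivo
@El Cattivo Update: I changed the clearanceHandler's MaxRetries property to 3 and the error went away; now I get a "task was canceled" error on the line "await client.GetStringAsync(url)" every time. - eMi
@El Cattivo Update: I increased the client's timeout and that error went away, but now I get a "clearance failed after 4 attempts" error; it looks like CloudFlare changed their challenge... - eMi
I get a null reference exception: "Object reference not set to an instance of an object". - Haseeb Mir
@HaSeeBMiR, I think it would be best to open a new question with a detailed description of your code and the error message rather than trying to solve it in the comments. - Maximilian Gerhardt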

3
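If you are not using a library, here is a "simple" working way to get around Cloudflare (it sometimes fails).
  1. Open a "hidden" WebBrowser (size 1,1 or so).
  2. Open the root of the target Cloudflare site.
  3. Get the cookies from the WebBrowser.
  4. Use those cookies in a WebClient.
Make sure the UserAgent of the WebBrowser and that of the WebClient are exactly the same. If they do not match, Cloudflare will answer the WebClient's subsequent requests with a 503 error.
You will need to search on here for how to get the cookies from a WebBrowser and how to modify a WebClient so that you can set its CookieContainer, and change the UserAgent on one or both of them so that they are identical.
Since Cloudflare's cookies never seem to expire, you can serialize the cookies to some temporary location and load them every time your application runs, validating and re-fetching them if needed.
I have been using this method for a while and it works very well. I could not get the C# libraries to work with one particular Cloudflare site, while they worked on others; I have no idea why.
This also works behind the scenes on an IIS server, but you have to set up some "frowned upon" settings, namely running the application pool as SYSTEM or ADMIN and setting it to Classic mode.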
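
The caching idea (keep the cookies together with the exact UserAgent they were issued to, reload them at start-up, and fall back to a fresh browser session when they no longer validate) might look roughly like this. It is only a sketch: the JSON file layout and the function names are invented here, and "validation" is reduced to checking that the stored UserAgent matches and that a `cf_clearance` cookie is present.

```python
import json
import os


def save_session(path, cookies, user_agent):
    """Persist the cookies together with the UserAgent they were issued to."""
    with open(path, "w", encoding="utf-8") as fd:
        json.dump({"user_agent": user_agent, "cookies": cookies}, fd)


def load_session(path, user_agent):
    """Return the cached cookies, or None if they must be fetched again."""
    if not os.path.exists(path):
        return None
    try:
        with open(path, encoding="utf-8") as fd:
            data = json.load(fd)
    except (OSError, ValueError):
        return None  # unreadable or corrupt cache file
    if not isinstance(data, dict):
        return None
    # The cookies are tied to the UserAgent; a mismatch leads to a 503 later.
    if data.get("user_agent") != user_agent:
        return None
    cookies = data.get("cookies")
    if not isinstance(cookies, dict) or "cf_clearance" not in cookies:
        return None
    return cookies
```

A `None` result is the signal to run the hidden-browser steps again and call `save_session` with the fresh cookies.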

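Why does this answer assume an illegitimate attempt to get around the protection? I am looking for a legitimate way... When I try to GET a PHP file, the response I get contains [__cfduid] => something and [cf_clearance] => something, which I take to mean "OK, you can get in with these codes".... The answer to the original question should be how to use these codes from JavaScript to access the site! - jumpjack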

3
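An up-to-date answer should include the FlareSolverr project. It is meant to be deployed as a Docker container, so you only need to expose a port to run it.
It does not interfere with your project, because you do not need to import a library. It is currently maintained. The only downside I see is that you need to have Docker installed for it to work.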
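
Because FlareSolverr runs as a separate HTTP service, using it amounts to POSTing a small JSON command and reading the solved page out of the reply. The sketch below assumes the service listens on `http://localhost:8191/v1`, accepts a `request.get` command with `url` and `maxTimeout` fields, and returns the page source under `solution.response`. That matches the project's README as far as I know, but treat the endpoint and field names as assumptions and check them against the version you deploy. The helper names are made up.

```python
import json
from urllib.request import Request, urlopen

FLARESOLVERR_URL = "http://localhost:8191/v1"  # assumed default endpoint


def build_get_command(url, max_timeout_ms=60000):
    """Build the JSON body asking FlareSolverr to fetch url."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def fetch_html(url, endpoint=FLARESOLVERR_URL):
    """Send the command and return the HTML of the solved page."""
    body = json.dumps(build_get_command(url)).encode("utf-8")
    request = Request(endpoint, data=body,
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=90) as response:
        reply = json.load(response)
    # Assumption: the page source sits under solution.response.
    return reply["solution"]["response"]
```

The returned HTML can then be handed to whatever parser the project already uses.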

You can run it without Docker. - 472084

-3
To get the HTML of a page with WebClient,
I wrote the following class, which handles cookies;
just pass a CookieContainer instance to the constructor.

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Net;
using System.Text;

namespace NitinJS
{
    public class SmsWebClient : WebClient
    {
        public SmsWebClient(CookieContainer container, Dictionary<string, string> Headers)
            : this(container)
        {
            foreach (var keyVal in Headers)
            {
                this.Headers[keyVal.Key] = keyVal.Value;
            }
        }
        public SmsWebClient(bool flgAddContentType = true)
            : this(new CookieContainer(), flgAddContentType)
        {

        }
        public SmsWebClient(CookieContainer container, bool flgAddContentType = true)
        {
            this.Encoding = Encoding.UTF8;
            System.Net.ServicePointManager.Expect100Continue = false;
            ServicePointManager.MaxServicePointIdleTime = 2000;
            this.container = container;

            if (flgAddContentType)
                this.Headers["Content-Type"] = "application/json";//"application/x-www-form-urlencoded";
            this.Headers["Accept"] = "application/json, text/javascript, */*; q=0.01";// "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            //this.Headers["Accept-Encoding"] = "gzip, deflate";
            this.Headers["Accept-Language"] = "en-US,en;q=0.5";
            this.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1; rv:23.0) Gecko/20100101 Firefox/23.0";
            this.Headers["X-Requested-With"] = "XMLHttpRequest";
            //this.Headers["Connection"] = "keep-alive";
        }

        private readonly CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
                request.Timeout = 3600000; //60 * 60 * 1000 ms = 1 hour
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

Usage:

CookieContainer cookies = new CookieContainer();
SmsWebClient client = new SmsWebClient(cookies);
string html = client.DownloadString("http://www.google.com");

1
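But isn't the problem here a completely different one? He cannot access the page because Cloudflare's anti-DDoS protection redirects him to another page. A WebClient class that stores cookies automatically will not help him. Or does this.Headers["X-Requested-With"] = "XMLHttpRequest"; bypass Cloudflare's entire anti-DDoS protection mechanism? - Maximilian Gerhardt
The headers need to be modified; I suggest the OP use Fiddler to record the browser's requests and adjust the headers accordingly. I hope using this class solves the problem. - Nitin Sawant
This problem is not just about cookies. As @MaximilianGerhardt explained in his answer, you have to solve a JavaScript challenge to get past CloudFlare's anti-DDoS measures. - El Cattivo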
