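I want to use HTML Agility Pack to scrape information from this website, but the data is only loaded after a search is submitted, so I can't get at it. I need to fetch the data repeatedly, every 5 minutes.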
https://enquiry.indianrail.gov.in/ntes/NTES?action=getTrainsViaStn&viaStn=NDLS&toStn=null&withinHrs=2&trainType=ALL&6iop0ssrpi=1m1ol4ha86
If you request that URL directly, all you get back is the following:
(function(){location.reload();/*ho ho ho ho*/})()
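&6iop0ssrpi=1m1ol4ha86
is some kind of "password" (for lack of a better word). It ensures that you cannot replay the request. You could try to crack it, but it is buried in a very dense 3396-line JavaScript file, so it is hard (maybe even impossible) to work out what to send to the server to receive the data you want.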
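Even better, the server's response is never HTML. It is JSON-like (strictly speaking a JavaScript object literal, since some fields are functions) and looks like this: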
_obj_1511003507337 = {
trainsInStnDataFound:"trainRunningDataFound",
allTrains:[
{
trainNo:"14316",
startDate:"18 Nov 2017",
trainName:"INTERCITY EXP",
trnName:function(){return _LANG==="en-us"?"INTERCITY EXP":"इंटरसिटीएक्स."},
trainSrc:"NDLS",
trainDstn:"BE",
runsOn:"NA",
schArr:"Source",
schDep:"16:35, 18 Nov",
schHalt:"Source",
actArr:"Source",
delayArr:"RIGHT TIME",
actDep:"16:35, 18 Nov",
delayDep:"RIGHT TIME",
actHalt:"Source",
trainType:"MEX",
pfNo:"9"
},
{
trainNo:"12625",
startDate:"16 Nov 2017",
trainName:"KERALA EXPRESS",
trnName:function() { return _LANG === "en-us" ? "KERALA EXPRESS" : "केरलएक्स."},
trainSrc:"TVC",
trainDstn:"NDLS",
runsOn:"NA",
schArr:"13:45, 18 Nov",
schDep:"Destination",
schHalt:"Destination",
actArr:"16:56, 18 Nov",
delayArr:"03:11",
actDep:"Destination",
delayDep:"RIGHT TIME",
actHalt:"Destination",
trainType:"SUF",
pfNo:"4"
}
]
}
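If you do manage to capture that response text (for example from the browser's network tab), note that it cannot go straight into a JSON parser, because the keys are unquoted and trnName is a function. Here is a minimal sketch that pulls out the plain string fields with a regular expression instead. NtesResponseParser and Parse are names made up for this example, and it assumes that every train record starts with trainNo and that values never contain escaped quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class NtesResponseParser
{
    // Matches simple key:"value" pairs such as trainNo:"14316".
    // Function-valued fields such as trnName are skipped, because
    // their value does not start with a double quote.
    private static readonly Regex Pair =
        new Regex("(\\w+)\\s*:\\s*\"([^\"]*)\"");

    public static List<Dictionary<string, string>> Parse(string body)
    {
        var trains = new List<Dictionary<string, string>>();
        Dictionary<string, string> current = null;

        foreach (Match m in Pair.Matches(body))
        {
            string key = m.Groups[1].Value;
            string value = m.Groups[2].Value;

            // Assumption: trainNo is the first field of every record.
            if (key == "trainNo")
            {
                current = new Dictionary<string, string>();
                trains.Add(current);
            }

            // Pairs seen before the first trainNo (e.g. trainsInStnDataFound)
            // are ignored.
            if (current != null)
            {
                current[key] = value;
            }
        }
        return trains;
    }
}
```

Calling NtesResponseParser.Parse(responseText) gives you one dictionary per train, so something like trains[0]["pfNo"] reads the platform number. A regex is fragile compared with a real parser, so treat this as a starting point only.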
Here is a solution that uses Selenium to get the HTML and the data.
using System;
using System.Threading;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

namespace test
{
    class Program
    {
        public static void Main(string[] args)
        {
            IWebDriver driver = new FirefoxDriver();
            try
            {
                driver.Navigate().GoToUrl("https://enquiry.indianrail.gov.in");

                // Step 1: click the link with id 'ui-id-2'.
                Console.WriteLine("Step 1");
                driver.FindElement(By.XPath("//a[@id='ui-id-2']")).Click();
                Thread.Sleep(10000);

                // Step 2: type the station into the 'viaStation' input.
                Console.WriteLine("Step 2");
                driver.FindElement(By.XPath("//input[@id='viaStation']")).SendKeys("NEW DELHI [NDLS]");
                Thread.Sleep(2000);

                // Step 3: submit the search.
                Console.WriteLine("Step 3");
                driver.FindElement(By.XPath("//button[@id='viaStnGoBtn']")).Click();

                // Press a key when the HTML is fully loaded in the browser.
                Console.ReadKey();

                // Hand the rendered page over to HTML Agility Pack.
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(driver.PageSource);

                // SelectNodes returns null (not an empty collection) when nothing matches.
                HtmlNodeCollection nodeCol = doc.DocumentNode.SelectNodes("//body//tr[@class='altBG']");
                if (nodeCol == null)
                {
                    Console.WriteLine("No result rows found.");
                }
                else
                {
                    foreach (HtmlNode node in nodeCol)
                    {
                        Console.WriteLine("Trip:");
                        foreach (HtmlNode child in node.ChildNodes)
                        {
                            Console.WriteLine("\t" + child.InnerText);
                        }
                    }
                }
                //Console.WriteLine(doc.DocumentNode.InnerHtml);
                Console.ReadKey();
            }
            finally
            {
                // Close the browser even if one of the steps throws.
                driver.Quit();
            }
        }
    }
}
The Thread.Sleep() calls should not be necessary; I only put them in as a precaution. You can also speed things up by using a headless driver such as PhantomJS.
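For the "every 5 minutes" part of the question, here is a rough sketch of how the same flow could run unattended. It swaps in two things the code above does not use: headless Firefox (instead of PhantomJS) and Selenium's WebDriverWait in place of the fixed sleeps and the key press. NtesPoller, ExtractRows and WaitFor are names invented for this example, the 60-second timeout is an arbitrary guess, and I have not verified it against the live site. If the station autocomplete needs a moment after typing, a short pause after SendKeys may still be required.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Support.UI;

// Hypothetical class, for illustration only.
public static class NtesPoller
{
    // Turns each result row (tr with class 'altBG') into one line of cell texts.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var lines = new List<string>();
        var rows = doc.DocumentNode.SelectNodes("//tr[@class='altBG']");
        if (rows == null) return lines; // SelectNodes returns null when nothing matches
        foreach (HtmlNode row in rows)
        {
            var cells = row.ChildNodes
                .Where(n => n.NodeType == HtmlNodeType.Element)
                .Select(n => n.InnerText.Trim());
            lines.Add(string.Join(" | ", cells));
        }
        return lines;
    }

    // Waits until an element matching 'by' exists and is visible, then returns it.
    private static IWebElement WaitFor(WebDriverWait wait, By by)
    {
        return wait.Until(d =>
        {
            var found = d.FindElements(by);
            return found.Count > 0 && found[0].Displayed ? found[0] : null;
        });
    }

    public static void Main()
    {
        var options = new FirefoxOptions();
        options.AddArgument("-headless"); // no visible browser window

        using (IWebDriver driver = new FirefoxDriver(options))
        {
            // Assumed timeout; tune it to how slow the site really is.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(60));
            while (true)
            {
                try
                {
                    driver.Navigate().GoToUrl("https://enquiry.indianrail.gov.in");
                    WaitFor(wait, By.Id("ui-id-2")).Click();
                    WaitFor(wait, By.Id("viaStation")).SendKeys("NEW DELHI [NDLS]");
                    WaitFor(wait, By.Id("viaStnGoBtn")).Click();
                    WaitFor(wait, By.XPath("//tr[@class='altBG']"));

                    foreach (string line in ExtractRows(driver.PageSource))
                    {
                        Console.WriteLine(line);
                    }
                }
                catch (WebDriverTimeoutException)
                {
                    Console.WriteLine("Timed out waiting for the page; trying again next cycle.");
                }
                Thread.Sleep(TimeSpan.FromMinutes(5));
            }
        }
    }
}
```

Keeping ExtractRows separate from the browser code means you can check the parsing against a saved copy of the page without starting Firefox at all.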