Are there any URL parsers in Java?

6

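I know Java has a URL class, but I need methods to get the page's file extension (html, php, asp, etc.), the country of the domain (ca, au, br, jp, fr, etc.), the page type (.net, .org, .gov, etc.), and other information. I have implemented some of these with string manipulation, but I think a class dedicated to this purpose would be more reliable.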


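You can write your own algorithm to determine these URL parts. - Neigyl R. Noval
Writing your own class will give you everything you need. - Ash Burlaczenko
There is a site called the Public Suffix List (http://publicsuffix.org/) that lists top-level domains in detail. The list is long, which makes resolving the correct top-level domain complicated. If you don't need to validate the top-level domain, it may be simpler. - John Yeary
3 Answers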

5
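I created a simple Java class that makes parsing URLs in Java easier: https://github.com/juliuss/urlplus It can be used to build URLs and modify them programmatically. It also handles relative URLs.
As the unit tests show, it is quite comprehensive.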
// build a URL
URL u = new URL("http://www.shopobot.com/?search=ipod");

// check the parts of the url were set correctly
assertEquals(u.getProtocol().name(), "http");

u.setFragment("login");
assertEquals(u, "http://www.shopobot.com/?search=ipod#login");

// add a parameter
u.addParameter("abc", "123");
assertEquals(u, "http://www.shopobot.com/?search=ipod&abc=123#login");

// add a duplicate parameter
u.addParameter("abc", "456");
assertEquals(u, "http://www.shopobot.com/?search=ipod&abc=123&abc=456#login");

// remove a parameter
u.removeParameter("search");
assertEquals(u, "http://www.shopobot.com/?abc=123&abc=456#login");

// reset fragment
u.setFragment("");
assertEquals(u, "http://www.shopobot.com/?abc=123&abc=456");

// test an encoded parameter
u.addParameter("encoding", "this code = awesome!");
assertEquals(u, "http://www.shopobot.com/?abc=123&abc=456&encoding=this+code+%3D+awesome%21");

// remove both duplicate parameters
u.removeParameter("abc");
assertEquals(u, "http://www.shopobot.com/?encoding=this+code+%3D+awesome%21");

// change host and port
u.setHost("localhost").setPort(8080);
assertEquals(u, "http://localhost:8080/?encoding=this+code+%3D+awesome%21");

// remove a parameter and add a page number (int parameter)
u.removeParameter("encoding").addParameter("page", 2);
assertEquals(u, "http://localhost:8080/?page=2");

// set the path
u.setPath("electronics/");
assertEquals(u, "http://localhost:8080/electronics/?page=2");
u.setPath("/electronics/");
assertEquals(u, "http://localhost:8080/electronics/?page=2");

// increment a parameter 3 times
u.incrementParameter("page").incrementParameter("page").incrementParameter("page");
assertEquals(u, "http://localhost:8080/electronics/?page=5");
// make sure the correct page number is returned
assertEquals(u.getParameter("page", 1), 5);

// set the page number to 2 and remove it -- setting it to 1
// since 1 is considered default, it is removed completely
u.setParameter("page", 2).decrementParameter("page");
assertEquals(u, "http://localhost:8080/electronics/");

// make sure that page will not be decremented since we're at 1
u.decrementParameter("page");
assertEquals(u, "http://localhost:8080/electronics/");

// test that defaults work
assertEquals(u.getParameter("page", 1), 1);
assertEquals(u.getParameter("page", 10), 10);

// test relative paths
u.setPath("/electronics/photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/")), "electronics/photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/electronics/")), "photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/electronics/photography/")), "");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/electronics/mp3-players/")), "../photography/");
// make sure when paths match, but authority doesn't results in full url return
assertEquals(u.toStringRelative(new URL("http://www.shopobot.com/electronics/photography/")), "http://localhost:8080/electronics/photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:80/electronics/photography/")), "http://localhost:8080/electronics/photography/");
assertEquals(u.toStringRelative(new URL("https://localhost:8080/electronics/photography/")), "http://localhost:8080/electronics/photography/");

// try some more complicated relative paths
u.setHost("x.com").setPath("/a/b/c/d/e.html").setPort(80);
assertEquals(u.toStringRelative(new URL("http://x.com/")),"a/b/c/d/e.html");
assertEquals(u.toStringRelative(new URL("http://x.com/a/b")),"c/d/e.html");
assertEquals(u.toStringRelative(new URL("http://x.com/a/b?q=1")),"/a/b/c/d/e.html");
u.addParameter("q", 1);
assertEquals(u.toStringRelative(new URL("http://x.com/a/b/c/d/e.html")),"?q=1");
assertEquals(u.toStringRelative(new URL("http://x.com/a/b/c/d/e/f/g/h")),"../../../../e.html?q=1");
assertEquals(u.toStringRelative(new URL("x.com/x/y/z/")),"../../../a/b/c/d/e.html?q=1");
assertEquals(u.toStringRelative(new URL("x.com/a/b/c/d/x/y/e.html")),"../../../e.html?q=1");
u.addParameter("f", "a b c");
assertEquals(u.toStringRelative(new URL("x.com/a/b/c/d/x/y/e.html")),"../../../e.html?q=1&f=a+b+c");
u.setFragment("hello").removeParameter("f");
assertEquals(u.toStringRelative(new URL("x.com/a/b/c/d/x/y/e.html")),"../../../e.html?q=1#hello");
assertEquals(u.toStringFull(),"/a/b/c/d/e.html?q=1#hello");

//test parameters with relative paths
u = new URL("facebook.com");
u.addParameter("test", "hi");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi")),"");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi&hello=hey")),"?test=hi");
u.addParameter("hello", "hey");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi&hello=hey")),"");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi&hello=hey#wow")),"?test=hi&hello=hey");
assertEquals(u.toStringRelative(new URL("facebook.com/")),"?test=hi&hello=hey");
assertEquals(u.toStringRelative(new URL("facebook.com/#yo")),"?test=hi&hello=hey");
u = new URL("facebook.com/#yo");
assertEquals(u.toStringRelative(new URL("facebook.com/")),"#yo");

//test relative paths with parameter changes
u = new URL("example.com/?param=1");
assertEquals(u.toStringRelative(new URL("example.com/?param=2")),"?param=1");
u = new URL("example.com/?param=1&param=2");
assertEquals(u.toStringRelative(new URL("example.com/?param=1&param=4")),"?param=1&param=2");
u.removeParameter("param");
assertEquals(u.toStringRelative(new URL("example.com/?param=1&param=4")),"/");

// build a new URL to test empty and null parameter values
u = new URL("http://www.google.com/");
u.addParameter("test", "");
assertEquals(u, "http://www.google.com/?test");
assertEquals(u.getParameter("test", "this is not returned"), "");
u.addParameter("this is a test", null);
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.addParameter("", "");
assertEquals(u, "http://www.google.com/?test&this+is+a+test");    
u.addParameter(null, "");
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.addParameter("", null);
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.addParameter(null, null);
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.removeParameter("this is a test");
assertEquals(u, "http://www.google.com/?test");
u.removeParameter("");
assertEquals(u, "http://www.google.com/?test");
String[] nullGuy = null;
u.removeParameter(nullGuy);
assertEquals(u, "http://www.google.com/?test");
u.removeParameter("test");
assertEquals(u, "http://www.google.com/");
u.addParameter(" "," ");
assertEquals(u, "http://www.google.com/?+=+");
u.addParameter("+","+");
assertEquals(u.getParameter("+", ""), "+");
assertEquals(u, "http://www.google.com/?+=+&%2B=%2B");
u.removeParameter(" ").removeParameter("+");
assertEquals(u, "http://www.google.com/");

//test fragment encoding
u.setFragment("short");
assertEquals(u, "http://www.google.com/#short");
assertEquals(u.getFragment(),"short");
u.setFragment("/this/is/a/#/<long>/( fragment )/");
assertEquals(u, "http://www.google.com/#/this/is/a/%23/%3Clong%3E/(+fragment+)/");
u.setFragment(null);
assertEquals(u, "http://www.google.com/");

u = new URL("www.wikipedia.org/wiki/USA");
assertEquals(u.matchesAuthority("org"), true);
assertEquals(u.matchesAuthority(".org"), true);
assertEquals(u.matchesAuthority("pedia.com"), false);
assertEquals(u.matchesAuthority("wikipedia.org"), true);
assertEquals(u.matchesAuthority("uwikipedia.org"), false);
assertEquals(u.matchesAuthority(".wikipedia.org"), true);
assertEquals(u.matchesAuthority("en.wikipedia.org"), false);
u.setHost("sub.en.wiki.com");
assertEquals(u.matchesAuthority("com"), true);
assertEquals(u.matchesAuthority("wiki.com"), true);
assertEquals(u.matchesAuthority("en.wiki.com"), true);
assertEquals(u.matchesAuthority("sub.en.wiki.com"), true);
assertEquals(u.matchesAuthority("asub.en.wiki.com"), false);
assertEquals(u.matchesAuthority("a.sub.en.wiki.com"), false);
assertEquals(u.matchesAuthority("sub.en.wiki.com","asub.en.wiki.com"), true);
assertEquals(u.matchesAuthority("a.sub.en.wiki.com","asub.en.wiki.com"), false);

//test no protocol on factory style methods
u = URL.get("www.wikipedia.org/wiki/USA");
u = URL.get("www.wikipedia.org/wiki/USA", u);

u = new URL("shopobot.com");
u.setParameter("will this <#> be encoded?","we've gone batshit crazy! seriously!");
u.setFragment("what's our # again?");
assertEquals(u.toString(),"http://shopobot.com/?will+this+%3C%23%3E+be+encoded%3F=we%27ve+gone+batshit+crazy%21+seriously%21#what's+our+%23+again?");
assertEquals(u.getParameter("will this <#> be encoded?", ""), "we've gone batshit crazy! seriously!");
assertEquals(u.getFragment(), "what's our # again?");

u = new URL("www.en.shopobot.com");
assertEquals(u.getAuthoritySize(), 4);
assertEquals(u.getAuthority(-1),"");
assertEquals(u.getAuthority(0),"");
assertEquals(u.getAuthority(1),"com");
assertEquals(u.getAuthority(2),"shopobot.com");
assertEquals(u.getAuthority(3),"en.shopobot.com");
assertEquals(u.getAuthority(4),"www.en.shopobot.com");
assertEquals(u.getAuthority(5),"www.en.shopobot.com");

u = new URL("en.wikipedia.org:90210/a/b/c/d/e.html?test=true");
assertEquals(u.getChildDirectory("a"),"b");
assertEquals(u.getChildDirectory("b"),"c");
assertEquals(u.getChildDirectory("c"),"d");
assertEquals(u.getChildDirectory("d"),"e.html");
assertEquals(u.getChildDirectory("e.html"),"");
assertEquals(u.getChildDirectory("g"),"");

assertEquals(u.getParentDirectory("a"),"");
assertEquals(u.getParentDirectory("b"),"a");
assertEquals(u.getParentDirectory("c"),"b");
assertEquals(u.getParentDirectory("d"),"c");
assertEquals(u.getParentDirectory("e.html"),"d");
assertEquals(u.getParentDirectory("e"),"");

//test relative url creation
u = new URL("http://www.example.com");
URL u2 = u.resolveRelative("q.html");
assertEquals(u2.toString(), "http://www.example.com/q.html");
u2 = u.resolveRelative("/q.html");
assertEquals(u2.toString(), "http://www.example.com/q.html");
u = new URL("http://www.example.com/abc/");
u2 = u.resolveRelative("q.html");
assertEquals(u2.toString(), "http://www.example.com/abc/q.html"); 

1
Very nice. It also solved my URL encoding problem. Thanks! - Erich

2
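I'm not sure there is a specific class that does what you're asking. Start by looking at the URL class and the post below. Can you share a link to a URL parsing implementation? I think you will need to combine the data returned by the URL class with your own parsing to extract the small pieces it doesn't expose. That should be easy, since it sounds like you want everything after the last dot in the host and in the path (if those exist at all, which is not guaranteed).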
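
A minimal sketch of that idea, using only the standard java.net.URI class: take the host and the path from the parsed URI, then keep whatever follows the last dot in each. The class name UrlParts and both method names are hypothetical, chosen for illustration. The sketch assumes a well-formed absolute URL and does not consult the Public Suffix List, so it returns only the final host label (for an IP address host that is just the last number).

```java
import java.net.URI;

public class UrlParts {

    /** Text after the last dot in the last path segment, or "" if there is none. */
    static String fileExtension(String url) {
        String path = URI.create(url).getPath();   // null for opaque URIs such as mailto:
        if (path == null) {
            return "";
        }
        // Look only at the last segment so dots in directory names are ignored.
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        return dot < 0 ? "" : lastSegment.substring(dot + 1);
    }

    /** Last dot-separated label of the host, or "" if the URI has no host. */
    static String lastHostLabel(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        int dot = host.lastIndexOf('.');
        return dot < 0 ? host : host.substring(dot + 1);
    }

    public static void main(String[] args) {
        String url = "http://www.example.com.br/shop/index.php?id=7#top";
        System.out.println(fileExtension(url));  // extension of the last path segment
        System.out.println(lastHostLabel(url));  // final label of the host
    }
}
```

Both methods return an empty string when the part is missing, which matches the caveat above that neither a dot in the host nor one in the path is guaranteed.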

1

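There is no such class. Some of these things (such as country codes) are problematic and ambiguous, and usually cannot be determined from the URL alone. They are not parsing; they are lookup or inference. Other things (such as file extensions) are undefined for most pages.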
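
To illustrate the lookup point, here is a rough sketch that checks a host label against the ISO 3166 two-letter country codes shipped with the JDK (Locale.getISOCountries()). The class name CountryLookup and its method names are hypothetical. This is only an approximation: country-code top-level domains do not map one-to-one onto ISO codes (for example, the United Kingdom uses .uk while its ISO code is GB), and a country-code domain says nothing reliable about where a site is actually hosted or operated.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CountryLookup {

    // ISO 3166 alpha-2 codes known to the running JDK, in upper case.
    private static final Set<String> ISO_COUNTRIES =
            new HashSet<>(Arrays.asList(Locale.getISOCountries()));

    /** True if the label, ignoring case, is an ISO 3166 alpha-2 country code. */
    static boolean isIsoCountryCode(String label) {
        return ISO_COUNTRIES.contains(label.toUpperCase(Locale.ROOT));
    }

    /** English display name for the label's country, or "" if it is not an ISO code. */
    static String countryName(String label) {
        if (!isIsoCountryCode(label)) {
            return "";
        }
        Locale region = new Locale.Builder()
                .setRegion(label.toUpperCase(Locale.ROOT))
                .build();
        return region.getDisplayCountry(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        for (String label : new String[] {"br", "jp", "com", "org"}) {
            System.out.println(label + " -> country code? " + isIsoCountryCode(label)
                    + " " + countryName(label));
        }
    }
}
```

The result is a table lookup on a label that has already been extracted, not something parsed out of the URL itself.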

