Finally Pinned Down my Java Issue
I finally pinned down the Java issue that I have been dealing with, documented in this post.
I finally hit on it while doing some work to lazy-load data in a key object: performance was good when I was partly done and went to hell when I was fully done. Back-tracking and digging through the code, cutting out pieces as I went, I narrowed it down to this:
Set<URL> urlSet = new HashSet<URL>();
for (...) { // loop over the extracted links
    String href = "...";
    urlSet.add(new URL(href)); // This kills the JVM
}
What I am doing is extracting the URLs from a piece of HTML and using a Java Set to de-duplicate them. For some reason the add() call slows the JVM down to the point where it is just idling from the CPU’s point of view (2%-5% usage).
However this works fine as a work-around:
Set<String> hrefSet = new HashSet<String>();
for (...) { // loop over the extracted links
    String href = "...";
    hrefSet.add(href); // de-duplicate on the plain string
}
List<URL> urlList = new ArrayList<URL>();
for (String href : hrefSet) {
    urlList.add(new URL(href));
}
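For anyone who wants to try the work-around in isolation, here is a minimal sketch of it as a self-contained helper. The class name HrefDedup, the method toUniqueUrls and the example.com links are made up for illustration and are not from my real code; I also used a LinkedHashSet so the original link order is kept, which the snippet above does not do.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; names are not from the original code.
final class HrefDedup {

    // De-duplicates the raw href strings first, then parses each survivor once.
    static List<URL> toUniqueUrls(List<String> hrefs) throws MalformedURLException {
        // String.hashCode() and equals() only look at characters, so the set
        // never touches the network.
        Set<String> hrefSet = new LinkedHashSet<String>(hrefs);
        List<URL> urlList = new ArrayList<URL>(hrefSet.size());
        for (String href : hrefSet) {
            // The URL objects go into a list, so they are never hashed here.
            urlList.add(new URL(href));
        }
        return urlList;
    }

    public static void main(String[] args) throws MalformedURLException {
        List<String> hrefs = Arrays.asList(
                "http://example.com/a",
                "http://example.com/b",
                "http://example.com/a");
        System.out.println(toUniqueUrls(hrefs).size()); // prints 2
    }
}
```

One thing to keep in mind: de-duplicating on strings compares the exact text, so two links that differ only in letter case or a trailing slash count as different entries.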
I am using the latest version of the JVM on Linux CentOS 5.0, Dual 2.4GHz Xeon, 4GB RAM:
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Server VM (build 11.0-b15, mixed mode)
Finally, and not least, thanks to a colleague for suggesting testing the code with strings as opposed to URLs.
Update: my colleague pushed me to check the Java source code, and it turns out that the java.net.URL class uses the java.net.URLStreamHandler.hashCode() method, which does a DNS lookup on line 337 to work out the hash code:
InetAddress addr = getHostAddress(u);
Nice…






Oh man! I should have recognized this when I read your blog post. I could have saved you some time and frustration. Sorry.
The java.net.URL problem is well-known. Basically you should never use it as a key in a collection, because its equals() and hashCode() functions are too slow to be useful.
Workarounds include using the URI class, which is nice and fast, or a String like you noted.
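A minimal sketch of the URI variant, assuming the hrefs are already well-formed (URI.create throws IllegalArgumentException otherwise). The class name UriDedup, the method toUniqueUris and the example.com links are hypothetical and only there for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; names are not from the original post.
final class UriDedup {

    // Collects the hrefs into a set keyed by java.net.URI instead of java.net.URL.
    static Set<URI> toUniqueUris(List<String> hrefs) {
        Set<URI> uriSet = new LinkedHashSet<URI>();
        for (String href : hrefs) {
            // URI hashes and compares its parsed components; no host lookup is involved.
            uriSet.add(URI.create(href));
        }
        return uriSet;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://example.com/a",
                "http://example.com/b",
                "http://example.com/a");
        System.out.println(toUniqueUris(hrefs).size()); // prints 2
    }
}
```

If a URL is needed later, for example to open a connection, uri.toURL() converts a single entry at that point.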
Thanks. Actually, I had been tracking this down for a while, and the two people I showed it to had no idea (in their defense, I think they probably discounted my making such a mistake). I am glad I was able to solve it in the end; I can’t think what Sun was thinking when they put that in the URL class. I have checked the other places where I used Sets of URLs and changed those too.