Friday, June 19, 2009

A (Subtle?) Difference in Regular Expressions Between Java and Perl

I wrote a bunch of Perl code before Java finally got decent regular expressions. (I wrote lots of C++ before that but C++ didn’t have built in regular expressions either).

For some reason, on a number of occasions my Java regular expressions never worked right and I never fully realized why.

David has an interesting post pointing out how in Java regex’s – Carriage Return is not included in .* by default.

However, when I saw his example, I finally understood my confusion. In Perl, a rage means – “Does this pattern exist somewhere in my target string”?

So the following Perl code:

$str = "word in middle of line";
if ($str =~ /middle/) {
print "match"
}



will print “match”



You can force a regex in Perl to mean match from the beginning of the line by putting line markers into your string.



So the code:




$str = "word in middle of line";
if ($str =~ /^middle/) { print "match"}




won’t print out “match”


However, in Java, regex’s have to match against the whole string and you need .* on both ends if you want the Perl behavior. So the code:


String str = "word in middle of line";
System.out.println("First if");
if (str.matches("middle")) {
System.out.println("match");
}
System.out.println("Second if");
if (str.matches(".*middle.*")) {
System.out.println("match");
}



Will print out:


First if
Second if
match

Sunday, December 14, 2008

Closures are out of Java 7

Well it appears we will are going to have longer to wait for Closures in Java.

A status of Java 7 was given at Devoxx. Hamlet D’Arcy has a good summary here.

As mentioned, it seems that Closures are out. Personally I am a huge fan of closures and would love to be able to use them instead of the ugly, anonymous interface , pretend-its-almost-as-good workarounds one has to do in Java today. But even I have to admit that I understand the dangers and the downside in adding Closures to the language (see Generics)

JSR 294 (Modularization) is in. We will need to wait and see how close that ends up to OSGi or not.

The other interesting tidbit was the current time frame for Java 7 is Q1 2010.

Monday, December 1, 2008

Sharing a URI Context Between Wars in Tomcat

Today’s segment goes to some Tomcat configuration issues. Specifically, on how Tomcat sets the base of the URI which you can use to access your War based application,

In general, Tomcat takes the name of your war as the base of the URI. So if you have a war called app1.war (with a file called page1.jsp in the root) and you drop that war into the Tomcat webapps directory, you can then access your jsp at:

http://localhost:8080/app1/page1.jsp

If we have another war called app2.war (again with our favorite page1.jsp in the root), we can put that war in Tomcat and then access the page at

http://localhost:8080/app2/page1.jsp

But lets say we want the URI access to both wars to start with the same URI base. For example, we want app1.war to be accessible at the app1 url but we want the other war to be accessible at some URI under the app1 root (say app1/app2/page1.jsp)

This is actually something we actually needed to support as we refactored some of our application and for backward compatibility couldn’t change the access URIs.

In Tomcat this seemed hard to do. Our application ran in JBoss, so we were initially able to solve the problem by wrapping our app2.war into an ear and having our application.xml look like this:

<application>
    <display-name>App2 Stuff</display-name>
    <module>
        <web>
          <web-uri>app2.war</web-uri>
          <context-root>/app1/app2</context-root>
        </web>
    </module>
</application>

This is an ok solution if you are in a container like JBoss. But, if you aren’t, and while we are using JBoss, we want to move our application into a standalone Tomcat, this won’t work. So we needed to find a better solution.

It turns out that it’s not that hard though there seems to be very little documentation about it. 

The trick is the character #.

All you need to do is rename the war from app2.war to app1#app2.war

Tomcat then maps the war into the URI we want:

http://localhost:8080/app1/app2/page1.jsp

In some of the documentation and by fooling around with the Tomcat admin application, there seems to be other ways in Tomcat to do this with various xml files, but I couldn’t get them to work and was starting to give up hope.

Then we found the magic #. You may not need this often but when you do, it’s good to know about it.

Tuesday, November 18, 2008

Axis and Xerces can cause Perm problems

Away from the Groovy for a bit and back to my regular Java day job.

We have a Web Service that can occasionally return a large data structure. The server end runs inside a Tomcat and is served by Axis 1.4. Our client runs inside Tomcat in JBoss and also uses Axis 1.4 for its client code.

Recently, we experienced a situation that after the client called the web service to receive a large object a few times, we had a Out Of Memory (OOM) in the Perm of our JBoss (which as a reminder was the client side).

Since, we of course had the  crucial JVM flag (-XX:+HeapDumpOnOutOfMemoryError) on, so we got a memory dump to analyze.

Generally, analyzing perm issues is very difficult, but using Memory Analyzer Tool (did I mention how great that tool is?) we did find a suspiciously large object:

symboltable_full

You can see here that there was an instance of

org.apache.xerces.util.SymbolTable

that took up close to 225 MB.

A little drill down into the object showed that most of the entries in the SymbolTable (well since there were more then a million entries I can’t say that I really checked most of the entries but most of the entries I did check) were related to Soap. You can see one entry here:

symboltable_entry_zoom

You can see that the value in the entry is ns921375:severity which is a Soap wrapper type tag.

So, that object definitely seemed related to our OOM. The only problem was tying it to the Perm. Strings are suspects for perm problems, since in Sun’s JVMs interned strings get stored in the Perm space. The jmap command can give info on perm space used by a running jvm but unfortunately that functionality is only working on the Solaris version of the jvm. So I needed another way to try to tie the object to the Perm space and as the say: Use the Source, Luke

So, from looking at the Xerces code, what I learned is that Xerces uses the SymbolTable as a mechanism to reuse Strings. When the parser gets a new String it can pass it into the SymbolTable to get the canonical instance.

This makes sense in an XML Parser. A lot of tags will probably appear multiple times in an XML file and so a lot of memory can be saved if the parser can reuse the String objects. Also, (and possibly more importantly)  if the same String is continuously reused, then the parser can rely on the much faster == for comparing objects rather then calling equals.

The use of the term canonical instance was on purpose. The javadoc for String describes the intern function as:

public String intern()

Returns a canonical representation for the string object.

A pool of strings, initially empty, is maintained privately by the class String.



When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.




So, calling intern on a String basically does what I described the SymbolTable does.



The Xerces code in fact refers to this and says that:


The symbol table performs the same task as String.intern() with the following differences:


  • A new string object does not need to be created in order to retrieve a unique reference. Symbols can be added by using a series of characters in a character array.


  • Users of the symbol table can provide their own symbol hashing implementation. For example, a simple string hashing algorithm may fail to produce a balanced set of hashcodes for symbols that are mostly unique. Strings with similar leading characters are especially prone to this poor hashing behavior.



All this sounds good. The exact details of the reasons are less important to me but the point is someone there thought about intern and decided that it wasn’t good enough.



But a little digging into the Xerces source code turned up the following tidbit in the code for SymbolTable. SymbolTable has an inner class of type Entry which is where the Strings are put into to be stored in the SymbolTable. Entry’s constructor looks like this:



public Entry(String symbol, Entry next) {
this.symbol = symbol.intern();
characters = new char[symbol.length()];
symbol.getChars(0, characters.length, characters, 0);
this.next = next;
}


What we see here is that every Entry that gets created calls intern on the String as well.  After building a uniqueness mechanism in SymbolTable, they still relied on intern and as I mentioned before, intern Strings are stored in the perm space.



So, the bottom line here is that as large XML files with lots of unique Strings are parsed, the SymbolTable will slowly fill up your perm space as well. 



Soap envelopes are full of tags like ns921375:severity which are going to be unique and so will fill up your perm.



Final Note: We changed our soap function to return the object as an attachment and it seemed to solve the problem. 

Wednesday, November 5, 2008

Groovy Sort Follow-up

Here is another Groovy post, but after last weeks post about sorting in Groovy, I just wanted to make a couple of points on that.

First I had left with a question:

how Groovy distinguishes between the Closure version of the Comparator and the straight Closure which returns the Comparable object

My guess was that there is special code to deal with it, and a little digging into the Groovy source code showed this to be correct.

The magic happens in the file src\main\org\codehaus\groovy\runtime\DefaultGroovyMethods.java:
public static List sort(Collection self, Closure closure) { 
List list = asList(self); 
// use a comparator of one item or two 
int params = closure.getMaximumNumberOfParameters(); 
if (params == 1) { 
   Collections.sort(list, new OrderBy(closure)); 
} else { 
   Collections.sort(list, new ClosureComparator(closure)); 

return list; 

As you can see, the Groovy Java support code does in fact check what type of Closure it gets, and then builds an appropriate Comparator to pass into the regular Java sort function.

At work, a standard question that always comes up as we consider doing a production project in Groovy is: How does Groovy perform versus Java. It is true that the code is much nicer in Groovy, but how much slower will it run?

As we can see above, both Groovy and Java end up in the exact same function. The only difference is the work the Comparator will do. In Java, it is a straightforward compare. In Groovy it is either doing closure work or some other kind of comparator.

I timed a number of different possibilities in Groovy and Java. (This of course comes with the standard caveat of Beware of Microbenchmarks).

In order to avoid problems I did the following:

  • I randomly built lists with 20,000 Things to sort. (see thel last Groovy post as a reminder on what Things are). I created the same lists for all the sorting code at the same time.
  • After sorting the lists, I printed out the first element of each list to ensure that the optimizer wasn’t cutting it out or something.
  • I created and sorted lists 3 times to “warm up” the code
  • I then ran 100 times and averaged the runs.

So keeping those risks in mind, I timed the following codes (the time in parens is the average time for sorting the list using the given code):
  1. Basic Java sort (15 ms):
    Collections.sort(inList,new Comparator<Thing>() {
    public int compare(Thing o1, Thing o2) {
    return (o1.val.compareTo(o2.val));
    }
    });



  2. Basic Groovy Closure (487 ms):

    inList.sort {
    it.val
    }



  3. Groovy using the two param closure (596 ms):

    inList.sort {o1, o2 ->
    o1.val.compareTo(o2.val)
    }

  4. Groovy using a dynamically built Comparator (612 ms):

    inList.sort([compare:{a,b->
    a.val.compareTo(b.val) } ] as Comparator)



  5. Groovy using a prebuilt Comparator (444 ms):

    The Groovy Comparator:

    class StringComparator implements Comparator {
    int compare(o1, o2) {
    return o1.val.compareTo(o2.val)
    }
    }
    StringComparator sc = new StringComparator();

    And here is the sort code:

    inList.sort(sc)



Some interesting things can be seen here:


  • Java is shockingly faster – so much so that I am worried that I am falling for one of the micro-benchmark problems


  • The dynamic comparator versus the pre-built comparator is 50% slower – probably caused by the extra redirection and lookup the dynamic one requires.


  • I am also surprised that the two param closure was slower then the one param version. I would have thought that they would do similar work.

Sunday, October 26, 2008

Sorting in Groovy vs. Java

When you switch from Java to Groovy, one of the thing things you notice right away is how Closures affect things like sorting. As an example, lets pretend we have Things

public class Thing {
public int i1;
public String val;
public Thing(String s, int i) {
val = s;
i1 = i;
}
// for easy list printing
public String toString() {
return val +"-"+i1;
}
}

(Lets ignore the public member and all – this is just to make the code shorter)

Now lets make a list of Things in Java

List<Thing> l = new ArrayList<Thing>();
l.add(new Thing("hello", 3));
l.add(new Thing("there", 1));
l.add(new Thing("apple", 5));
l.add(new Thing("orange", 10));

and in Groovy

l = [new Thing("hello", 3),
new Thing("there", 1),
new Thing("apple", 5),
new Thing("orange", 10)];

Of course, the default way to sort in Java is fairly straightforward, though of course it doesn’t work here:

Collections.sort(l);

This won’t even compile since our Thing doesn’t implement Comparable. Groovy actually executes this no problem (though the order it sorts it in is anyone’s guess.) Running this snippet works:

println()
l.sort()
println “After Sort:”
println(l)

and prints the following result:

[hello-3, there-1, apple-5, orange-10]
After sort:
[apple-5, orange-10, hello-3, there-1]

But as you can see there is no way of knowing the sort order. So of course that is where implementing Comparable kicks in. I won’t go into that since that is basic Java and if we defined Thing as Comparable it would work as expected in Groovy as well.
Java, of course, also allows us to pass a Comparator to the sort function and this will work without changing Thing:

System.out.println(l);
Collections.sort(l, new Comparator<Thing> () {
public int compare(Thing o1, Thing o2) {
return o1.i1 - o2.i1; // sort by the int value
}
});
System.out.println("After Sorting:");
System.out.println(l);

Which will yield:

[hello-3, there-1, apple-5, orange-10]
After Sorting:
[there-1, hello-3, apple-5, orange-10]
Groovy has a similar option:

println(l)
l.sort([compare:{a,b-> Math.abs(a.i1) - Math.abs(b.i1) } ] as Comparator)
println "After sort:"
println(l)

Note the syntax for defining the Comparator on the fly in line 2. We are creating a map with functions and then converting it to a Comparator object. Very cool and powerful on the one hand but on the other kind of wacky for someone just out of Java. (Perhaps the topic for another post)You can also do this via a Closure syntax. (This is the way you would do this for these examples – the Comparator object is better for cases where you need to define a Comparator and pass it around). Here is the Closure way:

l.sort { o1, o2 ->
o1.i1 - o2.i1; // sort by the int value
}

This has been sorting with passing in a Comparator way. For that, Groovy just offers alternative (better?) syntax over Java. Groovy, however, does offer another way to use the sort function. There is version of sort that takes a much simplified Closure that won’t require us to remember exactly how to use Java’s Comparator – i.e. a positive or negative value for which way to sort.

The GDK Docs describe the function as:

sort
public List sort(Closure closure)

Sorts this Collection using the given closure as a comparator. The closure is passed each item from the collection, and is assumed to return a comparable value (i.e. an int).
Parameters:
closure - a Closure used as a comparator.
Returns:
a newly created sorted List

So, you pass a Closure to sort that turns the Object you want into a Comparable object. In our example if we want to sort our Things based on their int values you can just say:

l.sort { it.i1 }

Or for the String , just return it:

l.sort { i.val }
This works great to sort the list in the ascending order. Using this to sort the list in reverse order takes some maneuvering. For int this is still fairly straightforward.

l.sort { -1 * it.i1 }

I couldn’t get a good similar trick for Strings though – so you may need to use the Comparator or just call reverse on the returned list:

l.sort{ it.val }.reverse()

As you can see the Closure thing, can really shorten up the syntax and gives more power to what you can do in less code.

One downside of the whole Closure thing is that you can’t always tell what type of Closure a function expects to receive. If you use an editor with command completion you can only see that a function wants a Closure

image
and you have no way of knowing how many parameters will be passed to the Closure and what the function expects the Closure to return. In this case, the Javadocs explained what is needed but that is more difficult to look at or find sometimes rather than just being obvious from a method signature. Part of this is a function of a dynamic languages and not just the Closure issue.

Another technical question is that I am also not sure how Groovy distinguishes between the Closure version of the Comparator and the straight Closure which returns the Comparable object. My current guess is that there is code in the Closure version to check the number of parameters the Closure passed in expects to receive and calls it accordingly – because straight method dispatching should send them both to the same function. I will have to check into that.

Tuesday, October 7, 2008

Great tool to analyze a JVM memory dump

In the old days of Java, when you had an OutOfMemory exception, you had to begin a very painful exercise of trying to determine the source of the error. This was difficult to do since one had to reproduce and try to isolate the cause of the problem.

Now, we have heap dumps and there are a number of ways to get heap dumps out of a JVM but at a bare minimum you need to have Sun JVM's great flag on at all times:
-XX:+HeapDumpOnOutOfMemoryError

This causes the JVM to dump a snapshot of the current heap when there is an OutOfMemory exception.
This flag causes no runtime overhead, since it only comes into play when the OOM happens - the JVM then checks to see if it is set and if so will dump the heap.

There are a number of tools available to examine a heap dump.
The most basic comes with the JDK itself (starting with JDK 6): jhat
Its usage is fairly straightforward (don't forget the -J parameter to enlarge the heap of jhat itself) but its fairly basic.
It's UI is a basic HTML one.


It has a query language which is useful. It is based on Javascript for fancier queries but I had a hard time with the syntax whenever I tried to do anything fancy. It's documentation is really lacking in that regard.
However, since jhat is open source it's relatively easy to write queries in Java and compile them into jhat.

Another option is a profiler like Yourkit
It can open hprof files and its UI is pretty good as well (definitely better then jhat's)
It's commercial so that of course is a huge drawback. In addition, it is fairly slow in opening large dumps.

The best option is a relatively new tool. (and its the reason I decided to write this post - everyone needs to know about this tool) Actually it has been around for a bit but it has recently been added to Eclipse.
I am talking about Eclipse's Memory Analyzer Tool. It used to be known as SAP Memory Analyzer but was recently added to Eclipse. The new version is like its SAP predecessor but it just seems slightly more polished and ready for prime time.
It opens dumps fairly quickly and uses less memory too (It does create lots of index files on the disk but that's not that big of a deal)

It also has a query language and while it's similar to jhat's query language (they both call it OQL though this one is not Javascript based), it just seems more straightforward to use. Plus the built in queries just seem more useful and plentiful. I haven't tried adding queries to MAT though.

You can download a 32 bit version or a 64 bit version if you want to open larger heaps here.