Friday, June 19, 2009

A (Subtle?) Difference in Regular Expressions Between Java and Perl

I wrote a bunch of Perl code before Java finally got decent regular expressions. (I wrote lots of C++ before that, but C++ didn’t have built-in regular expressions either.)

For some reason, on a number of occasions my Java regular expressions never worked right and I never fully realized why.

David has an interesting post pointing out that in Java regexes, a carriage return is not matched by .* by default.

However, when I saw his example, I finally understood my confusion. In Perl, a regex match asks: “Does this pattern exist somewhere in my target string?”

So the following Perl code:

$str = "word in middle of line";
if ($str =~ /middle/) {
    print "match";
}

will print “match”

You can force a regex in Perl to match only from the beginning of the line by putting an anchor such as ^ into your pattern.

So the code:

$str = "word in middle of line";
if ($str =~ /^middle/) { print "match"}

won’t print out “match”

However, in Java, String.matches() has to match against the whole string, and you need .* on both ends if you want the Perl behavior. So the code:

String str = "word in middle of line";
if (str.matches("middle")) {
    System.out.println("First if");
}
if (str.matches(".*middle.*")) {
    System.out.println("Second if");
}

Will print out:

Second if


mcherm said...

And I always disliked the PERL behavior. Sometimes you have to search for containment, sometimes you want to match the whole string. In the Java world, you can use ".*xxx.*" to get the other behavior, and in the PERL world you can use "\Axxx\Z" to get the other behavior (I think... my PERL is rusty). But I've always found it more intuitive for the DEFAULT behavior to be WHOLE STRING matching, not containment... so I prefer Java's choice here.
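The "anchor a containment search to get whole-string matching" idea can be sketched on the Java side as well. This is only an illustration: the class name AnchorDemo and the helper wholeString are made up here, and it uses \z (absolute end of input) rather than \Z, which also tolerates a trailing line terminator.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Hypothetical helper: a containment search (find) that is forced to
    // cover the whole input by wrapping the regex in \A ... \z anchors.
    static boolean wholeString(String regex, String input) {
        return Pattern.compile("\\A(?:" + regex + ")\\z").matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "word in middle of line";
        System.out.println(wholeString("middle", str));      // false
        System.out.println(wholeString("middle", "middle")); // true
        System.out.println(str.matches("middle"));           // false, same answer
    }
}
```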

Doug Meyer said...

And I always disliked Java.
But there may be an implied difference between "$str =~ /regex/" and "str.matches('string')".
Java's "matches" method makes it look like you are comparing equality, whereas Perl makes it obvious that you are performing a regular expression operation.
I don't know if Java is "converting" the string into a regex under the hood, but in my opinion, it is doing it wrong.

laz said...

To make the Java version behave like the Perl version you could use:


That is what String.matches is a shortcut for, but using matches() instead of find(). Why there is no String.find I have no idea.

Also, to have .* include carriage returns, use Pattern.compile("middle", Pattern.DOTALL).
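The snippet for the find() suggestion is missing from the comment above. A minimal sketch of what the description points at, assuming the intended chain was Pattern.compile(...).matcher(...).find(); the class and helper names are invented for illustration.

```java
import java.util.regex.Pattern;

class FindDemo {
    // Perl-style containment: true if the regex matches anywhere in input.
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Same search, but with DOTALL so that '.' also matches line terminators.
    static boolean containsDotAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "word in middle of line";
        System.out.println(contains("middle", str)); // true
        System.out.println(str.matches("middle"));   // false

        String twoLines = "word in\r\nmiddle of line";
        System.out.println(contains("word.*middle", twoLines));       // false
        System.out.println(containsDotAll("word.*middle", twoLines)); // true
    }
}
```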

EvanCarroll said...

I think Perl also does this right. I find the majority of use cases are for partial matches.

I think Doug's reasoning here is sound too...

laz said...

I should have read the link to David's Dev Notes before posting earlier. I see that he mentions Pattern.DOTALL there. According to the Java doc for Pattern, it is also possible to indicate which flags to set using a non-capturing parenthesis notation. So to get .* to match carriage returns you can surround the regular expression like so: (?s:middle)
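A small sketch of the two ways to switch DOTALL on, under the assumption that a bare carriage return is the line terminator in question. The class name and the matchesDotAll helper are hypothetical.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // Hypothetical helper: wraps the regex in a (?s:...) group, which turns
    // DOTALL on for that group, so plain String.matches can be used.
    static boolean matchesDotAll(String regex, String input) {
        return input.matches("(?s:" + regex + ")");
    }

    public static void main(String[] args) {
        String str = "first line\rsecond line";

        // By default '.' stops at line terminators, and \r is one of them.
        System.out.println(str.matches("first.*second.*")); // false

        // Flag passed to Pattern.compile.
        System.out.println(Pattern.compile("first.*second.*", Pattern.DOTALL)
                .matcher(str).matches());                   // true

        // Same flag written inside the pattern itself.
        System.out.println(matchesDotAll("first.*second.*", str)); // true
    }
}
```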

yt said...

Thanks for the interesting discussion here.
As you guys have been saying in your comments here - you can get both behaviors from Java and Perl.
It's a matter of which is the default and which is the one you need to work at.

mcherm - I don't know about the \A and \Z syntax.
I always used the ^xxx$ syntax to mean beginning of line and end of line.
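For what it's worth, the two notations are not quite interchangeable in Java. As a rough sketch (class and helper names invented for illustration): with default flags ^ and $ anchor to the whole input, with Pattern.MULTILINE they also match at line boundaries, while \A and \z always refer to the input as a whole.

```java
import java.util.regex.Pattern;

class LineAnchorDemo {
    // find() with default flags: ^ and $ anchor to the whole input.
    static boolean findDefault(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // find() with MULTILINE: ^ and $ also match at line boundaries.
    static boolean findMultiline(String regex, String input) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "top\nmiddle\nbottom";
        System.out.println(findDefault("^middle$", text));       // false
        System.out.println(findMultiline("^middle$", text));     // true
        System.out.println(findMultiline("\\Amiddle\\z", text)); // false
    }
}
```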

Doug - Java is turning the matches call into a regex call as laz wrote. But I guess you can interpret the different syntax as a message on the intent of the function.

I guess as in a lot of things it comes down to what you learned first.
I first really used regex's heavily in Perl and so got used to the containment concept.

DuncanKinnear said...

This is not really a case of Java doing it differently to Perl. This is a case of Java doing it differently to most other languages that handle regex's.

In fact I'm struggling to think of any language that implies the start and end of string in the regex. Anyone? The Perl behaviour is the same as PHP, Python, C, C++, C#, Ruby, VB, etc. That's because these languages invariably use the POSIX C routines native to the platform.

As Evan said above, most use cases call for partial matches, not whole lines. Adding the extra .* at the beginning and end will actually make most Java regex's less efficient, as a regex of 'abc' will usually be more efficient than the resulting '^.*abc.*$' in Java.

Sorry, but I think the Java Developers just plain got this wrong. Where was the consultation with the Java community?