Using capture groups and Escape Characters


Regular Expressions

A regular expression is a special sequence of characters that helps in matching or finding other strings or sets of strings, using a specialized syntax held in a pattern. Java has support for regular expression usage through the java.util.regex package. This topic is to introduce and help developers understand more with examples on how Regular Expressions must be used in Java.

Using capture groups

If you need to extract a part of string from the input string, we can use capture groups of regex. For this example, we'll start with a simple phone number regex:

\d{3}-\d{3}-\d{4}

If parentheses are added to the regex, each set of parentheses is considered a capturing group. In this case, we are using what are called numbered capture groups:

(\d{3})-(\d{3})-(\d{4})
^-----^ ^-----^ ^-----^
Group 1 Group 2 Group 3

Before we can use it in Java, we must not forget to follow the rules of Strings, escaping the backslashes, resulting in the following pattern:

"(\\d{3})-(\\d{3})-(\\d{4})"

We first need to compile the regex pattern to make a Pattern and then we need a Matcher to match our input string with the pattern:

Pattern phonePattern = Pattern.compile("(\\d{3})-(\\d{3})-(\\d{4})");
Matcher phoneMatcher = phonePattern.matcher("abcd800-555-1234wxyz");

Next, the Matcher needs to find the first subsequence that matches the regex:

phoneMatcher.find();

Now, using the group method, we can extract the data from the string:

String number = phoneMatcher.group(0); //"800-555-1234" (Group 0 is everything the regex matched)
String aCode = phoneMatcher.group(1); //"800"
String threeDigit = phoneMatcher.group(2); //"555"
String fourDigit = phoneMatcher.group(3); //"1234"

Note : Matcher.group() can be used in place of Matcher.group(0).

Version ≥ Java SE 7

Java 7 introduced named capture groups. Named capture groups function the same as numbered capture groups (but with a name instead of a number), although there are slight syntax changes. Using named capture groups improves readability. We can alter the above code to use named groups:

(?\d{3})-(\d{3})-(\d{4})
^----------------^ ^-----^ ^-----^
AreaCode           Group 2 Group 3

To get the contents of "AreaCode", we can instead use:

String aCode = phoneMatcher.group("AreaCode"); //"800"

Using regex with custom behaviour by compiling the Pattern with flags

A Pattern can be compiled with flags, if the regex is used as a literal String, use inline modifiers:

Pattern pattern = Pattern.compile("foo.", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
pattern.matcher("FOO\n").matches(); // Is true.
 
/* Had the regex not been compiled case insensitively and singlelined,
 * it would fail because FOO does not match /foo/ and \n (newline)
 * does not match /./.
 */
 
 
Pattern anotherPattern = Pattern.compile("(?si)foo");
anotherPattern.matcher("FOO\n").matches(); // Is true.
 
"foOt".replaceAll("(?si)foo", "ca"); // Returns "cat".

Escape Characters

Generally

To use regular expression specific characters (?+| etc.) in their literal meaning they need to be escaped. In common regular expression this is done by a backslash \. However, as it has a special meaning in Java Strings, you have to use a double backslash \\.

These two examples will not work:
 
"???".replaceAll ("?", "!"); //java.util.regex.PatternSyntaxException
"???".replaceAll ("\?", "!"); //Invalid escape sequence
 
This example works
 
"???".replaceAll ("\\?", "!"); //"!!!"

Splitting a Pipe Delimited String

This does not return the expected result:
 
"a|b".split ("|"); // [a, |, b]
 
This returns the expected result:
 
"a|b".split ("\\|"); // [a, b]

Escaping backslash \

Generally

To use regular expression specific characters (?+| etc.) in their literal meaning they need to be escaped. In common regular expression this is done by a backslash \. However, as it has a special meaning in Java Strings, you have to use a double backslash \\.

This will give an error:
 
"\\".matches("\\"); // PatternSyntaxException
"\\".matches("\\\"); // Syntax Error
 
This works:
 
"\\".matches("\\\\"); // true

Basic Programs