Recursion in Regular Expressions

Robin Houston (author of the original proposal for extending regular expressions to support recursion) posted an example on the TextMate mailing list of how to use this feature to match properly nested parentheses.

He also demonstrates the use of (?x:…) to comment the regular expression and split it over multiple lines, which makes it less terse (highly recommended when crafting complex regular expressions, e.g. for language grammars).

Inserting Thousand Separators

Speaking of regular expressions, I previously wrote a post about how to word-wrap text using a regular expression. To follow that, here is a regular expression which matches every group of digits that should be followed by a thousand separator:

\d{1,3}(?=(\d{3})+(?!\d))

So the replacement string would be \0, or $&, (that is, the matched text followed by a comma; which form to use depends on the format string syntax). Let’s try this with the ls command and perl:

% ls -l /mach_kernel \
    |perl -pe 's/\d{1,3}(?=(\d{3})+(?!\d))/$&,/g'
-rw-r--r-- 1 root  wheel  10,235,416 19 Sep 05:50 /mach_kernel

Of course in this case you’d be better off with the -h option, which gives the “human readable size”, i.e.:

% ls -lh /mach_kernel
-rw-r--r-- 1 root  wheel   9.8M Sep 19 05:50 /mach_kernel
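
The same substitution carries over to other regex engines. Below is a minimal sketch in Python, assuming the standard re module; the function name add_thousand_separators and its sep parameter are made up for illustration and are not from the original post. It uses re.VERBOSE (Python's counterpart of the x flag) so that the pattern can be commented, and a non-capturing group, since the capture is never used.

```python
import re

# Same pattern as above, spelled out in verbose mode so each part can be commented.
THOUSANDS = re.compile(r"""
    \d{1,3}          # one to three digits that open a group
    (?=              # ... but only when followed by
        (?:\d{3})+   #     one or more complete groups of three digits
        (?!\d)       #     and then no further digit
    )
""", re.VERBOSE)


def add_thousand_separators(text, sep=","):
    """Insert sep after every digit group matched by THOUSANDS (hypothetical helper)."""
    return THOUSANDS.sub(lambda m: m.group(0) + sep, text)


line = "-rw-r--r-- 1 root  wheel  10235416 19 Sep 05:50 /mach_kernel"
print(add_thousand_separators(line))
```

One limitation worth knowing: the pattern has no notion of a decimal point, so a long run of digits after one is grouped as well (3.14159 would come out as 3.14,159).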

Figure Space

Staying in the domain of pretty-printing numbers: I use GeekTool to dump a lot of information on my desktop, and some of it consists of columns of numbers. Let’s imagine I am tracking the value of the dollar and euro per week (the leftmost column is the week number, and the figures are unlikely to reflect actual values):

 8   $5.45   €7.44
 9   $5.42   €7.45
10   $5.37   €7.44

Displaying such data is best done with a monospace font, so that things align. But the € or $ character in these fonts might not be entirely to our liking (when we blow up the text size to 24 points), so can we use a proportional width font? Generally, proportional width fonts make all the digits the exact same width, so the digits themselves do align nicely. The regular space, however, normally does not have the width of a digit, so we have a problem: 10 ends up wider than a 9 preceded by a regular space.

Fortunately, somebody already thought of that. If you check out Wikipedia’s table of space characters you will find the Figure Space (U+2007), which is a space made especially for aligning numbers, i.e. it has the width of a digit. So by using leading figure spaces for our week numbers, we can make the above data align nicely even when printed with a proportional width font.
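
As a rough sketch of how a script feeding GeekTool might do the padding (Python here; the helper name pad_number and the rows list are invented for illustration and simply reuse the sample figures above), right-justify each week number with U+2007 instead of a regular space:

```python
FIGURE_SPACE = "\u2007"  # U+2007 FIGURE SPACE, as wide as a digit in most fonts


def pad_number(n, width):
    """Right-align n in a field of width characters, padding with figure spaces."""
    return str(n).rjust(width, FIGURE_SPACE)


# Sample rows from the table above: (week number, dollar, euro).
rows = [(8, "$5.45", "€7.44"), (9, "$5.42", "€7.45"), (10, "$5.37", "€7.44")]

for week, usd, eur in rows:
    print(f"{pad_number(week, 2)}   {usd}   {eur}")
```

The output has to reach the display as UTF-8 for the figure space to survive.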

If you do want to use this (or any other Unicode character) with GeekTool output, you need to patch the source to interpret command output as UTF-8.

3 Comments

  1. 19 Oct 2007 | # Thousand separator regex wrote…

    [...] (via) [...]

  2. 15 Nov 2007 | # Alexey wrote…

    Hi! Allan, can you give me a compiled version of UTF-enabled version of GeekTool?

  3. 29 Feb 2008 | # Bob wrote…

    OK, so we've gone from regular expressions describing regular languages, to describing non-regular (but only slightly more expressive than regular) languages. Now this joker's decided to make the thing essentially describe (in a truly awful way) a context-free language.

    Needless to say, I don't see the point.
