TextMate News

Anything vaguely related to TextMate and macOS.

Wrapping Text With Regular Expressions

I have been asked “how to wrap text” a handful of times in the last year or so, and I have needed to do it myself a couple of times as well, so here is a Ruby function which does the job:

def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end
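As a quick check of what the function produces, here is a small usage sketch. The sample strings and the narrow column widths are made up for illustration; the second call shows the forced break for a run of characters without spaces.

```ruby
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end

# Sample input is illustrative only.
wrapped = wrap_text("The quick brown fox jumps over the lazy dog", 10)
puts wrapped
# Prints:
#   The quick
#   brown fox
#   jumps over
#   the lazy
#   dog

# A run without spaces is cut at the column.
forced = wrap_text("abcdefghijklmno", 5)
puts forced
```

Note that every output line, including the last, ends with a newline.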

The Pattern in Detail

This pattern is parameterized on the wrap column, so let us inject that parameter before further analysis:

(.{1,80})( +|$\n?)|(.{1,80})

Now let us step through that:

  • (.{1,80})

    The parentheses make this a capture group, meaning we can refer to the matched text in the replacement string (as \1, since this is the first capture group). The period (.) matches any character except a newline, and {1,80} says that we want to repeat that 1 to 80 times. It is greedy, so it will do as many repeats as possible.

    So basically this matches between 1 and 80 non-newline characters, with the matched text stored in capture group one.

  • ( +|$\n?)

    Here we also use parentheses, though this time not to capture the text matched, but because we want an either/or, i.e. we want to match one of two things. Either we match one or more spaces, expressed with a literal space followed by + to denote one or more matches, or (expressed with the | operator) we match the end of the line (indicated by $) optionally followed by a newline, indicated by \n and a question mark to make it optional.

    The reason why we make the newline optional is that we could be faced with the last line of the text, and while there is an end of that line, there is not necessarily a newline.

    So this part will match a consecutive run of spaces, or end of line. The magic happens when we combine that with the previous pattern, which matched 1-80 non-newline characters, because this additional match requires that the 1-80 non-newline characters are followed either by the consecutive run of spaces, or end of the line.

  • |

    The previous two patterns combined will match at most 80 non-newline characters followed by spaces or end of line. A problem exists if there are more than 80 non-newline characters not containing any spaces.

    For this reason we add special handling for that case, using the alternation (|) operator. Alternatives are tried left to right and the first one which matches is used. So only if the left side of this operator did not match will we advance to the next pattern.

  • (.{1,80})

    This is the same as the first pattern but since we do not follow it with the “space or end of line” clause, it will match up to 80 characters of the line (greedy) regardless of what follows the “break.”

    Like the first pattern, the result is put in a capture group, but since this is the third pair of parentheses, it is capture group three and can thus be referenced as \3 in the replacement string.
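To see which capture group ends up holding the text, the pattern can be run through scan, which returns the groups for every match. This is only a sketch: the column is lowered to 5 and the sample string is invented so that both sides of the alternation get exercised.

```ruby
pattern = /(.{1,5})( +|$\n?)|(.{1,5})/

# Each entry is [group 1, group 2, group 3] for one match.
matches = "aaaaaaa bb cc".scan(pattern)
matches.each { |groups| p groups }
# Prints:
#   [nil, nil, "aaaaa"]   (no break point found, right side used)
#   ["aa bb", " ", nil]   (left side, followed by a space)
#   ["cc", "", nil]       (left side, followed by end of line)
```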

Replacement String

Since we get the text we need either in capture group one or three, the replacement string needs to insert one of these, and a newline.

As only one of them will match, I am simply inserting both using \\1\\3\n (the double backslashes are to cater for Ruby string escaping.)
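Inserting both groups works because a group that did not take part in the match expands to an empty string. A minimal sketch of that, next to an equivalent block form which picks the group explicitly (the sample text and the column of 5 are illustrative):

```ruby
pattern = /(.{1,5})( +|$\n?)|(.{1,5})/
text = "aaaaaaa bb"

# String form: the group that did not participate expands to "".
with_string = text.gsub(pattern, "\\1\\3\n")

# Block form: $1 is nil when the right side of the alternation matched.
with_block = text.gsub(pattern) { ($1 || $3) + "\n" }

puts with_string
```

Both forms give the same result, so the string form is mostly a matter of brevity.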

Be aware that some implementations of regexp replacing use $1-n instead of \1-n for referencing a capture group.

Personally I favor the dollar notation, as it makes the reference look like a variable, thus opening up for normal variable modification syntax. It also frees up the backslash: for example, to prefix a capture group with a backslash in Ruby you easily end up with six backslashes in a row, just to insert one. So, as you may have figured, capture groups in TextMate are referenced with the dollar notation.

Notes

  1. It might make sense to make this method a member of the String class.

  2. It is possible to use a non-capturing group, written (?:…). This would make the function:

     def wrap_text(txt, col = 80)
       txt.gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/, "\\1\\2\n")
     end
    

This is technically better, but at the expense of a slightly more complex regexp.
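The first note could be sketched as below, using the non-capturing variant. Reopening String and the method name wrap_text are choices made for this illustration; in shared code a refinement or a plain helper function may be preferable.

```ruby
class String
  # Hypothetical method; wraps the receiver at the given column.
  def wrap_text(col = 80)
    gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/, "\\1\\2\n")
  end
end

result = "one two three four".wrap_text(9)
puts result
# Prints:
#   one two
#   three
#   four
```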

2006-06-30: Based on Florian’s comment, here is a version which deals with lines that end with trailing spaces:

def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/, "\\1\\3\n")
end

I also changed the right side to do exactly col repeats instead of the (unnecessary) range.
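The difference between the two versions shows up when a line ends in spaces right at the wrap column. The sketch below runs both on such an input; the names wrap_text_v1 and wrap_text_v2 and the sample text are only for this comparison.

```ruby
# Original version: the newline is only consumed after "$".
def wrap_text_v1(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end

# Updated version: the newline is also consumed after a run of spaces.
def wrap_text_v2(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/, "\\1\\3\n")
end

text = "Testtest  \nTest\n" # first line ends with two spaces

old_result = wrap_text_v1(text, 8)
new_result = wrap_text_v2(text, 8)

p old_result # => "Testtest\n\nTest\n" (spurious empty line)
p new_result # => "Testtest\nTest\n"
```

In the first version the spaces are matched by the space branch, so the newline that follows is left in the text and becomes an extra empty line.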

categories Tips & Tricks

25 Comments

28 June 2006

by Benoit Gagnon

This method looks a lot like the one that comes with Rails :)

http://api.rubyonrails.org/classes/ActionView/Helpers/TextHelper.html

I hadn’t seen that one, but I wouldn’t say that they look a lot like each other — the Rails method makes use of two gsub’s and one strip where I settle with just the one gsub.

In addition the behavior is a little different. The Rails method eats empty lines, e.g.:

print word_wrap <<-TEXT
Test

Test
TEXT

Will output (removing the paragraph break):

Test
Test

And it doesn’t force a break for lines too long, e.g. providing 2 above as the line-width still gives the same result.

Not sure if this is by-design, personally though, I wouldn’t want this behavior ;)

28 June 2006

by Jay Soffian

Can you provide a little more context about why you’re doing this with a regex instead of, say Reformat Paragraph or /usr/bin/fmt?

If folks are looking for language specific solutions, in Perl there is the Text::Wrap module available via CPAN. In Python you can make use of TextWrapper in the textwrap module (included with Python).

j.

Jay: The entry was inspired by a friend of mine who asked over IM specifically for a regexp to word wrap text at column 80 (and force a break if a line was too long) — I do not know what environment he was in, other than he is a .NET whore, so neither TextMate, fmt, perl, or similar was of any help to him ;)

Since coincidentally he was not the first to ask me about this, I decided to write it up on this blog, also for the hopefully educational value in deconstructing the regexp (as many TM users are still not fully comfortable with regexps.)

If you watch my customization screencast I actually do call fmt myself from Ruby :) though the last time I needed to do something like word wrap myself was for emails sent out — here the text needed to be both wrapped and indented, resulting in the function below (which is what I based this entry on):

def blockquote(txt)
  txt.gsub(/(.{1,60})( +|$\n?)|(.{1,60})/,
    " \\1\\3\n") 
end

This is btw for the ticket system. When I release a new build of TM the change log is scanned for ticket IDs and an email is sent to people involved with that ticket ID quoting the relevant part of the release notes, but nicely wrapped and indented, using the function above.

Anything else? :)

This is a little funny, I just noticed that the TextMate Ruby bundle has a snippet to insert a word wrapping gsub :)

Haha, I know who the .net whore is! ;)

29 June 2006

by Jeremy Dunck

Come come now, .Net developers are sharecroppers, not whores. :)

Your solution ignores tabs and converts trailing whitespace into spurious paragraph breaks:

$ perl -e '$a = "Testtest\tTesttest\tTesttest\n"; $a =~ s/(.{1,10})( +|$\n)|(.{1,10})/$1$3\n/gm; print $a'
Testtest T
esttest Te
sttest


$ perl -e '$a = "Testtest \nTest\n"; $a =~ s/(.{1,10})( +|$)|(.{1,10})/$1$3\n/gm; print $a'
Testtest  

Test

(Where is the preview button so I can check that markdown did in fact interpret my code example correctly?)

See, it didn’t. Here are some of the missing backslashes, please insert as appropriate:

\\\\\\\\\\\\\\

Florian: not handling tabs was deliberate since you can’t wrap text which contain tab characters to a given column without knowing the tab size which will be used for display.

So running the text through tab expansion would generally be the practical solution, though you can change the match for space into space-or-tab if you’re more interested in a character wrap, e.g. for wrapping email/mime text where there is a maximum line width (make sure the . then matches a byte and not a potentially multi-byte character).

Adding a match for an optional newline after the one-or-more spaces should handle the excessive use of spaces in your example.
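Tab expansion before wrapping could look like the sketch below. The helper expand_tabs and the tab size of 4 are assumptions for the example; the real tab size depends on how the text will be displayed.

```ruby
# Hypothetical helper: replace each tab with spaces up to the next tab stop.
def expand_tabs(text, tab_size = 8)
  # $1 is the text since the line start or the previous tab stop.
  text.gsub(/([^\t\n]*)\t/) { $1 + " " * (tab_size - $1.length % tab_size) }
end

def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end

expanded = expand_tabs("ab\tcd\tef", 4) # "ab  cd  ef"
wrapped = wrap_text(expanded, 6)
puts wrapped
```

Once the tabs are spaces, the wrap column refers to the same width the reader will see.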

As for backslashes, this is unfortunately a display bug in WordPress, if/when I fix it, they should show up for your comment :)

For anyone interested, the backslash swallow problem has been fixed :)

The markdown plug-in which I use does disable the problematic WP comment text filter (wpautop) but it didn’t provide the priority under which it was added (30) which made remove_filter a no-op.

This’ll even hyphenate ;)

#!/usr/bin/env ruby

require 'rubygems'
require 'text/reform'

r = Text::Reform.new
while line=gets
    puts r.format('['*40, line)
end

You should

gem install text-reform

before running this inside TextMate.

greetings

I was looking for a similar solution for ColdFusion. This works, but then I found that you can use the ColdFusion function “wrap” like so: wrap(str,5). Got to love CF.

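This is a version for when you need lines cut strictly at the column:

# Wraps text by cutting every line strictly at len characters
def wrap_text(text, len = 80)
  # String#to_a was removed in Ruby 1.9; String#lines is the replacement
  text.lines.collect do |line|
    body = line.chomp # keep the newline out of the length calculation
    t = []
    1.upto((Float(body.length) / len).ceil) do |i|
      # Arithmetic progression: chunk i starts at offset (i - 1) * len
      t.push body[((i - 1) * len)..((i - 1) * len + len - 1)]
    end
    t.join("\n") + line[body.length..-1] # re-append the line ending, if any
  end.join
end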

03 December 2006

by Kike Lahuerta

This version is for VB.NET:

Public Sub CreaDescripWW(ByVal dv As DataView, ByVal tamLinea As Integer)
    Dim r As Regex = New Regex("(.{1," & tamLinea & "})( +|$\n?)|(.{1," & tamLinea & "})")

    For Each drv As DataRowView In dv
        Dim descrip As String = drv("Descrip")
        Dim descripWW As String = ""
        Dim m As Match = r.Match(descrip)
        While m.Success
            Dim str As String = m.Groups(1).Value

            If str.Length = 0 Then str = descrip.Substring(0, tamLinea)

            descripWW &= str & vbCrLf
            descrip = descrip.Remove(0, str.Length)

            m = r.Match(descrip)
        End While
        drv("DescripWW") = descripWW
    Next
End Sub

Can you tell me if this is a good tutorial?

I want to learn regexps, but I don’t know where I can find a good tutorial.

Can you help me?

I’m trying to write a little reflow command for TextMate and I was wondering if there’s a way to get the current document’s wrap column.

In other words, instead of passing the column width col = 80, is there a global TextMate variable that stores the document’s wrap column?

Tony: The TM_COLUMNS variable has the number. When there is a column selection it changes to number of columns selected (i.e. width of selection), but for your command that’s probably desired.
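A reflow command could pick the value up from the environment, roughly as sketched here. The helper name wrap_column and the fallback of 80 (for running outside TextMate) are assumptions; only the TM_COLUMNS variable itself comes from TextMate.

```ruby
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/, "\\1\\3\n")
end

# Hypothetical helper: read the wrap column from TM_COLUMNS,
# falling back to 80 when the variable is not set.
def wrap_column(env = ENV)
  Integer(env.fetch("TM_COLUMNS", "80"))
end

# In a command that receives text on standard input one might then write:
#   print wrap_text(STDIN.read, wrap_column)
puts wrap_text("one two three four", wrap_column("TM_COLUMNS" => "9"))
```

Passing the environment as a parameter keeps the helper easy to try out without TextMate running.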

03 August 2007

by Anonymous

wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww

Sorry, that was me above, testing whether it would wrap or not. I’m looking for a way to word wrap text in non-fixed-width fonts. Because the same number of w’s and i’s won’t have the same display size, is there a way to make them look “equal”?

31 August 2007

by Guillaume

Hmm, would it be possible to put a “-” before the line break when it is cutting a word (and no “-” when it is cutting at white space)?

Thanks

Guillaume.

Guillaume: In Ruby you would do something like below.

def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col-1}})-?/) do
    ($1 ? $1 : $3 + "-") + "\n"
  end
end
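To see the effect, here is the same function run on a made-up string at column 5 (renamed wrap_text_hyphenated here only to keep it apart from the earlier versions). Forced breaks take one character less so the hyphen still fits within the column.

```ruby
def wrap_text_hyphenated(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col-1}})-?/) do
    # $1 is set for a normal break; otherwise $3 holds the cut-off piece.
    ($1 ? $1 : $3 + "-") + "\n"
  end
end

hyphenated = wrap_text_hyphenated("abcdefghij kl", 5)
puts hyphenated
# Prints:
#   abcd-
#   efgh-
#   ij kl
```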

08 September 2008

by Anonymous

dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

16 September 2008

by Anonymous

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

27 November 2008

by bakaohki

Thanks for this regexp, mate.