Wrapping Text With Regular Expressions

I have been asked “how to wrap text” a handful of times in the last year or so, and I have needed to do it myself a couple of times as well, so here is a Ruby function which does the job:

def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/,
    "\\1\\3\n") 
end
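
A quick way to see the function in action is to wrap a short sentence at a narrow column. The sketch below repeats the definition so it runs on its own; the sample text and the column widths of 20 and 4 are made up for illustration.

```ruby
# Same definition as above, repeated so the snippet is self-contained.
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/,
    "\\1\\3\n")
end

sample  = "The quick brown fox jumps over the lazy dog"
wrapped = wrap_text(sample, 20)
puts wrapped
# The quick brown fox
# jumps over the lazy
# dog

# A word longer than the column is cut at the column.
chopped = wrap_text("abcdefghij", 4)
puts chopped
# abcd
# efgh
# ij
```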

The Pattern in Detail

This pattern is parameterized on the wrap column, so let us inject that parameter before further analysis:

(.{1,80})( +|$\n?)|(.{1,80})

Now let us step through that:

  • (.{1,80})

    The parentheses make this a capture group, meaning we can refer to the matched text in the replacement string (as \1, since this is the first capture group). The period (.) matches any character except a newline, and {1,80} says that we want to repeat that 1 to 80 times. The repeat is greedy, so it will match as many characters as possible.

    So basically this matches at least one and at most 80 non-newline characters, with the matched text stored in capture group one.

  • ( +|$\n?)

    Here we also use parentheses, though this time not to capture the matched text but because we want an either/or: we want to match one of two things. Either we match one or more spaces, expressed as a literal space followed by + (one or more matches), or, after the | operator, we match the end of the line (indicated by $) optionally followed by a newline (\n, with a question mark to make it optional).

    The reason why we make the newline optional is that we could be faced with the last line of the text, and while there is an end of that line, there is not necessarily a newline.

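    So this part will match a consecutive run of spaces, or the end of the line. The magic happens when we combine that with the previous pattern, which matched 1-80 non-newline characters: the combined match requires that those characters are followed either by a run of spaces or by the end of the line. Since the first group is greedy, the regexp engine starts with 80 characters and gives back one character at a time until that requirement is met, so the break lands at the last space that still fits.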

  • |

    The previous two patterns combined will match at most 80 non-newline characters followed by spaces or the end of the line. That fails when a run of more than 80 non-newline characters contains no spaces, since there is then no place to break.

    For this reason we add special handling for that case, using the alternation (|) operator. Alternatives are tried left-to-right and the first one that matches is used, so we only advance to the next pattern if the left side of this operator did not match.

  • (.{1,80})

    This is the same as the first pattern but since we do not follow it with the “space or end of line” clause, it will match up to 80 characters of the line (greedy) regardless of what follows the “break.”

    Like the first pattern, the result is put in a capture group, but since this is the third pair of parentheses, it is capture group three and can thus be referenced as \3 in the replacement string.
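
One way to check the walkthrough above is to ask Ruby which capture groups took part in each match. This is only a sketch: String#scan returns one array per match holding the three groups, and the sample text and the width of 5 are arbitrary.

```ruby
# The pattern from above with the wrap column set to 5.
pattern = /(.{1,5})( +|$\n?)|(.{1,5})/

matches = "ab cd efghijkl".scan(pattern)
matches.each { |groups| p groups }
# ["ab cd", " ", nil]    left side: text followed by a space
# [nil, nil, "efghi"]    right side: no space within reach, forced break
# ["jkl", "", nil]       left side: text followed by the end of the line
```

Group three is only ever filled when groups one and two are nil, which is what lets the replacement string mention both.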

Replacement String

Since we get the text we need either in capture group one or three, the replacement string needs to insert one of these, and a newline.

As only one of them will match, and a group that did not match is inserted as an empty string, I am simply inserting both using \\1\\3\n (the double backslashes cater for Ruby string escaping).

Be aware that some implementations of regexp replacing use $1-n instead of \1-n for referencing a capture group.

Personally I favor the dollar notation: it makes the capture look like a variable, which opens up for normal variable modification syntax, and it frees up the backslash. For example, to prefix a capture group with a literal backslash in a Ruby double-quoted string you easily end up with six backslashes in a row just to insert one. So, as you may have figured, in TextMate capture groups are referenced with the dollar notation.
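
The escaping rules are easier to see on a toy pattern. The following is a small sketch, not part of the wrap function; the pattern /(a)|(b)/ and the variable names are invented for the demonstration.

```ruby
pattern = /(a)|(b)/

# A group that took no part in the match is inserted as an empty string,
# so naming both alternatives in the replacement is safe.
both = "ab".gsub(pattern, "<\\1\\2>")

# A single-quoted string needs one backslash less per reference.
single = "ab".gsub(pattern, '<\1\2>')

# Block form: the captures are plain values ($1, $2), nil when unused.
block = "ab".gsub(pattern) { "<#{$1 || $2}>" }

# Literal backslash before a capture: six backslashes in a double-quoted string.
escaped = "a".gsub(/(a)/, "\\\\\\1")

puts both, single, block, escaped
```

The block form sidesteps the backslash counting entirely, at the cost of being specific to Ruby.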

Notes

  1. It might make sense to make this method a member of the String class.

  2. It is possible to make the second group non-capturing using (?:…). The right-hand capture then becomes group two, which would make the function:

    def wrap_text(txt, col = 80)
      txt.gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/,
        "\\1\\2\n") 
    end
    

    This is technically better, but at the expense of a slightly more complex regexp.
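
For the first note, a minimal sketch of what a String method could look like, using the non-capturing variant from the second note. Reopening String and the method name wrap_text are assumptions for illustration; the article only suggests that such a method might make sense.

```ruby
class String
  # Hypothetical String#wrap_text: wraps the receiver at col columns.
  def wrap_text(col = 80)
    gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/,
      "\\1\\2\n")
  end
end

result = "aaa bbb ccc".wrap_text(7)
print result
# aaa bbb
# ccc
```

Reopening a core class affects every string in the program, so in a larger code base a refinement or a plain helper function may be the safer choice.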

2006-06-30: Based on Florian’s comment, here is a version which deals with lines that end with trailing spaces:

def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/,
    "\\1\\3\n")
end

I also changed the right side to do exactly col repeats instead of the (unnecessary) range.
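
To compare the two versions, the sketch below runs both on the input from Florian’s second example. The names wrap_text_v1 and wrap_text_v2 are made up here so that both definitions can live side by side.

```ruby
# Original version: the optional newline is only tried after end-of-line.
def wrap_text_v1(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/,
    "\\1\\3\n")
end

# Updated version: the optional newline may also follow a run of spaces.
def wrap_text_v2(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/,
    "\\1\\3\n")
end

input = "Testtest    \nTest\n"

old_result = wrap_text_v1(input, 10)
new_result = wrap_text_v2(input, 10)

p old_result  # "Testtest  \n\nTest\n"  (spurious empty line)
p new_result  # "Testtest  \nTest\n"
```

In the first version the run of spaces ends the match, so the newline after it is left in place and shows up as an extra empty line. Both versions keep the spaces that fall inside the first capture group.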

25 Comments

  1. 28 Jun 2006 | # Benoit Gagnon wrote…

    This method looks a lot like the one that comes with Rails :)

    http://api.rubyonrails.org/classes/ActionView/Helpers/TextHelper.html

  2. 28 Jun 2006 | # Allan Odgaard wrote…

    I hadn’t seen that one, but I wouldn’t say that they look a lot like each other — the Rails method makes use of two gsub’s and one strip where I settle with just the one gsub.

    In addition the behavior is a little different. The Rails method eats empty lines, e.g.:

    print word_wrap <<-TEXT
    Test
    
    Test
    TEXT
    

    Will output (removing the paragraph break):

    Test
    Test
    

    And it doesn’t force a break for lines too long, e.g. providing 2 above as the line-width still gives the same result.

    Not sure if this is by-design, personally though, I wouldn’t want this behavior ;)

  3. 28 Jun 2006 | # Jay Soffian wrote…

    Can you provide a little more context about why you're doing this with a regex instead of, say Reformat Paragraph or /usr/bin/fmt?

    If folks are looking for language specific solutions, in Perl there is the Text::Wrap module available via CPAN. In Python you can make use of TextWrapper in the textwrap module (included with Python).

    j.

  4. 28 Jun 2006 | # Allan Odgaard wrote…

    Jay: The entry was inspired by a friend of mine who asked over IM specifically for a regexp to word wrap text at column 80 (and force a break if a line was too long) — I do not know what environment he was in, other than he is a .NET whore, so neither TextMate, fmt, perl, or similar was of any help to him ;)

    Since coincidentally he was not the first to ask me about this, I decided to write it up on this blog, also for the hopefully educational value in deconstructing the regexp (as many TM users are still not fully comfortable with regexps.)

    If you watch my customization screencast I actually do call fmt myself from Ruby :) though the last time I needed to do something like word wrap myself was for emails sent out — here the text needed to be both wrapped and indented, resulting in the function below (which is what I based this entry on):

    def blockquote(txt)
      txt.gsub(/(.{1,60})( +|$\n?)|(.{1,60})/,
        "   \\1\\3\n") 
    end
    

    This is btw for the ticket system. When I release a new build of TM the change log is scanned for ticket ID’s and an email is sent to people involved with that ticket ID quoting the relevant part of the release notes, but nicely wrapped and indented, using the function above.

    Anything else? :)

  5. 28 Jun 2006 | # Allan Odgaard wrote…

    This is a little funny, I just noticed that the TextMate Ruby bundle has a snippet to insert a word wrapping gsub :)

  6. 28 Jun 2006 | # Soryu wrote…

    Haha, I know who the .net whore is! ;)

  7. 29 Jun 2006 | # Jeremy Dunck wrote…

    Come come now, .Net developers are sharecroppers, not whores. :)

  8. 30 Jun 2006 | # Florian Hars wrote…

    Your solution ignores tabs and converts trailing whitespace into spurious paragraph breaks:

    $  perl -e '$a = "Testtest\tTesttest\tTesttest\n"; $a =~ s/(.{1,10})( +|$\n)|(.{1,10})/$1$3\n/gm; print $a'
    Testtest        T
    esttest Te
    sttest
    
    
    $ perl -e '$a = "Testtest    \nTest\n"; $a =~ s/(.{1,10})( +|$)|(.{1,10})/$1$3\n/gm; print $a'
    Testtest  
    
    Test
    

    (Where is the preview button so I can check that markdown did in fact interpret my code example correctly?)

  9. 30 Jun 2006 | # Florian Hars wrote…

    See, it didn't. Here are some of the missing backslashes, please insert as appropriate:

    \\\\\\\\\\\\\\
    
  10. 30 Jun 2006 | # Allan Odgaard wrote…

    Florian: not handling tabs was deliberate since you can’t wrap text which contain tab characters to a given column without knowing the tab size which will be used for display.

    So running the text through tab expansion would generally be the practical solution, though you can make the match for space into space-or-tab if you’re more interested in a character wrap, e.g. for wrapping email/mime text where there is a line width (make sure the . then matches a byte and not a (potentially multi-byte) character).

    Adding a match for an optional newline after the one-or-more spaces should handle the excessive use of spaces in your example.

    As for backslashes, this is unfortunately a display bug in WordPress, if/when I fix it, they should show up for your comment :)
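
    The space-or-tab variant mentioned in this reply might look like the sketch below. It is not code from the original comment: the name wrap_text_tabs is made up, and it builds on the 2006-06-30 version from the article. As the reply says, this is a character wrap, so each tab counts as a single column.

    ```ruby
    # Hypothetical variant: break at runs of spaces or tabs.
    def wrap_text_tabs(txt, col = 80)
      txt.gsub(/(.{1,#{col}})([ \t]+|$)\n?|(.{#{col}})/,
        "\\1\\3\n")
    end

    tabbed = wrap_text_tabs("Testtest\tTesttest\tTesttest\n", 10)
    print tabbed
    # Testtest
    # Testtest
    # Testtest
    ```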

  11. 02 Jul 2006 | # Allan Odgaard wrote…

    For anyone interested, the backslash swallow problem has been fixed :)

    The markdown plug-in which I use does disable the problematic WP comment text filter (wpautop) but it didn’t provide the priority under which it was added (30) which made remove_filter a no-op.

  12. 08 Aug 2006 | # Kaspar Schiess wrote…

    This'll even hyphenate ;)

    
    #!/usr/bin/env ruby
    
    require 'rubygems'
    require 'text/reform'
    
    r = Text::Reform.new
    while line=gets
        puts r.format('['*40, line)
    end

    You should

    gem install text-reform
    

    before running this inside TextMate.

    greetings

  13. 15 Aug 2006 | # Mitch wrote…

    I was looking for a similar solution for coldfusion. This works, but then I found that you can use the coldfusion function "wrap" like such: wrap(str,5). Got to love CF.

  14. 13 Oct 2006 | # Artūras Šlajus wrote…

    This is a version for when you need strictly cut lines:

     # Wraps text
     def wrap_text(text, len = 80)
       # String#to_a was removed in Ruby 1.9; String#lines splits into lines instead
       text.lines.collect do |line|
         t = []
         1.upto((Float(line.length) / len).ceil) do |i|
           # Arithmetical progression
           # a = (n - 1) * q
           t.push line[( (i - 1) * len )..( (i - 1) * len + len - 1)]
         end
         t.join("\n")
       end.join
     end
    
  15. 03 Dec 2006 | # Kike Lahuerta wrote…

    This version is for vb.net

    Public Sub CreaDescripWW(ByVal dv As DataView, ByVal tamLinea As Integer)
        Dim r As Regex = New Regex("(.{1," & tamLinea & "})( +|$\n?)|(.{1," & tamLinea & "})")
    
        For Each drv As DataRowView In dv
            Dim descrip As String = drv("Descrip")
            Dim descripWW As String = ""
            Dim m As Match = r.Match(descrip)
            While m.Success
                Dim str As String = m.Groups(1).Value
    
                If str.Length = 0 Then str = descrip.Substring(0, tamLinea)
    
                descripWW &= str & vbCrLf
                descrip = descrip.Remove(0, str.Length)
    
                m = r.Match(descrip)
            End While
            drv("DescripWW") = descripWW
        Next
    End Sub
    
  16. 06 Mar 2007 | # david wrote…

    Can you tell me if this is a good tutorial?

    I want to learn regexp but I don't know where I can find a good tutorial.

    Can you help me?

  17. 05 Jul 2007 | # Tony wrote…

    I'm trying to write a little reflow command for TextMate and I was wondering if there's a way to get the current document's wrap column.

    In other words, instead of passing the column width col = 80, is there a global TextMate variable that stores the document's wrap column?

  18. 05 Jul 2007 | # Allan Odgaard wrote…

    Tony: The TM_COLUMNS variable has the number. When there is a column selection it changes to number of columns selected (i.e. width of selection), but for your command that’s probably desired.
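
    A reflow command could pick the variable up from the environment along these lines. This is a sketch only: the helper name wrap_column and the fallback of 80 are assumptions, added so the script also works outside TextMate where TM_COLUMNS is not set.

    ```ruby
    # Read the wrap column from TM_COLUMNS; fall back to an assumed default
    # when the variable is missing or not a number.
    def wrap_column(default = 80)
      Integer(ENV.fetch("TM_COLUMNS", default.to_s))
    rescue ArgumentError
      default
    end

    # The 2006-06-30 version from the article, defaulting to the TextMate column.
    def wrap_text(txt, col = wrap_column)
      txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/,
        "\\1\\3\n")
    end

    # In a command the input would typically come from standard input,
    # e.g. print wrap_text(STDIN.read)
    ```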

  19. 03 Aug 2007 | # Anonymous wrote…

    wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww

  20. 03 Aug 2007 | # Jose wrote…

    Sorry, that was me above, testing whether it would wrap or not. I'm looking for a way to word wrap non-fixed-width fonts. Because the same number of w's and i's won't have the same display size, is there a way to make them look "equal"?

  21. 31 Aug 2007 | # Guillaume wrote…

    Hmm, would it be possible to put a "-" before the line break in the case where it is cutting a word (so no "-" if it is cutting at white space)?

    Thanks

    Guillaume.

  22. 01 Sep 2007 | # Allan Odgaard wrote…

    Guillaume: In ruby you would do something like below.

    def wrap_text(txt, col = 80)
      txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col-1}})-?/) do
        ($1 ? $1 : $3 + "-") + "\n"
      end
    end
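
    To see what this does to a word that is too long, here is the same function run on a made-up input with a column of 5. The forced break takes col-1 characters so that the added hyphen still fits within the column.

    ```ruby
    # Same function as in the reply above, repeated so the snippet runs alone.
    def wrap_text(txt, col = 80)
      txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col-1}})-?/) do
        ($1 ? $1 : $3 + "-") + "\n"
      end
    end

    hyphenated = wrap_text("abcdefghij", 5)
    print hyphenated
    # abcd-
    # efgh-
    # ij

    # Breaks at spaces get no hyphen.
    spaced = wrap_text("ab cd ef", 5)
    print spaced
    # ab cd
    # ef
    ```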
    
  23. 08 Sep 2008 | # Anonymous wrote…

    dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

  24. 16 Sep 2008 | # Anonymous wrote…

    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

  25. 27 Nov 2008 | # bakaohki wrote…

    Thanks for this regexp, mate.
