Handling encodings (UTF-8)

From time to time I'm asked to extend the number of encodings supported by TextMate. My answer is normally that the user should be using UTF-8, so here's a bit of history, reasons for using UTF-8, and tips about handling it in miscellaneous contexts.

Initially we had ASCII, which defines 128 characters (some of them control characters). Each character can be represented with 7 bits, and you can see all of them by running man ascii in your shell.

Since ASCII only contains the letters A-Z (without diacritics), several 8-bit extensions were made (e.g. CP-1252, MacRoman, iso-8859-1), but 8 bits isn't enough to also add e.g. Greek letters, so multiple variants exist (MacRoman, MacGreek, MacTurkish, …).

The different 8-bit encodings are generally not interchangeable without loss, so a new standard (Unicode) was created as a superset of the existing encodings.

Unicode has plenty of room to grow: its code space runs up to U+10FFFF, so a code point fits comfortably in 32 bits. By contrast, the default encoding for documents transferred over HTTP (iso-8859-1) does not contain the € symbol, and has no room to add it.
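The € case is easy to check for yourself. The following is only a small sketch using Python's built-in codecs; the variable names are made up for the illustration.

```python
# Illustration only: the euro sign has no slot in iso-8859-1,
# while UTF-8 encodes it as a three-byte sequence.
euro = "\u20ac"  # the € character

try:
    euro.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

utf8_bytes = euro.encode("utf-8")

print(latin1_ok)         # False
print(utf8_bytes.hex())  # e282ac
```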

So Unicode should sell itself, seeing how it's the only way to actually represent all the characters you can type both now and in the future.

But a byte is 8 bits (an octet), and a lot of software treats strings as octet streams; some of it expects to find miscellaneous tokens in these strings represented by their ASCII values (e.g. parsers, compilers and interpreters).

This is where UTF-8 enters the picture. UTF-8 represents Unicode as a sequence of 8-bit units (one to four bytes per character), and when it comes to new protocols, RFC 2277 from IETF says: Protocols MUST be able to use the UTF-8 charset.

Besides being an 8 bit encoding and being able to represent Unicode, it has a few other very nice properties:

  1. Every ASCII character is represented as an ASCII character in UTF-8.
  2. Every UTF-8 byte which looks like an ASCII character is an ASCII character.
  3. A random 15-byte sequence with bytes in the range 0x17 to 0xFF has a probability of about 0.000081 of being valid UTF-8 (the probability gets lower the longer the sequence is, and is also lower for actual text).
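As a rough sanity check of the order of magnitude in property 3, the valid byte strings can be counted rather than sampled. This sketch assumes the modern definition of UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF), so its result is in the same ballpark as the quoted figure but need not match it exactly; the exact value depends on which sequences are counted as valid. All names are hypothetical.

```python
# Bytes are drawn uniformly from 0x17..0xFF.
ALPHABET = 0xFF - 0x17 + 1  # 233 possible byte values

# How many valid UTF-8 sequences of each length use only bytes in that range.
# Assumption: modern UTF-8 rules (no overlongs, no surrogates, max U+10FFFF).
SEQUENCES_BY_LENGTH = {
    1: 0x7F - 0x17 + 1,               # ASCII from 0x17 upwards
    2: 0x7FF - 0x80 + 1,              # U+0080..U+07FF
    3: (0xFFFF - 0x800 + 1) - 0x800,  # U+0800..U+FFFF minus 2048 surrogates
    4: 0x10FFFF - 0x10000 + 1,        # U+10000..U+10FFFF
}


def count_valid(n):
    """Count byte strings of length n that are a run of valid sequences."""
    counts = [1] + [0] * n  # counts[0] == 1 stands for the empty string
    for length in range(1, n + 1):
        counts[length] = sum(
            ways * counts[length - size]
            for size, ways in SEQUENCES_BY_LENGTH.items()
            if size <= length
        )
    return counts[n]


def probability_valid(n):
    """Probability that n random bytes from the range form valid UTF-8."""
    return count_valid(n) / ALPHABET ** n


print(probability_valid(15))
print(probability_valid(30))  # longer sequences are even less likely
```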

Properties 1 and 2 are important to keep compatibility with our existing ASCII heavy software. E.g. a C compiler would generally only know about ASCII, but since strings and comments are treated as byte streams, we can use UTF-8 for our entire source and put non-ASCII characters in both our strings and comments.
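Properties 1 and 2 can be seen directly on the bytes. A minimal sketch, with a made-up source line and a hypothetical helper name:

```python
def ascii_looking_bytes(data):
    """Return the bytes below 0x80, i.e. the ones that look like ASCII."""
    return bytes(b for b in data if b < 0x80)


source = 'puts("Blåbærgrød €");'
encoded = source.encode("utf-8")

# Property 1: the ASCII characters keep their ASCII byte values.
ascii_only = "".join(c for c in source if ord(c) < 0x80)

# Property 2: no byte of a multi-byte sequence is below 0x80, so the
# ASCII-looking bytes are exactly the ASCII characters, in order.
print(ascii_looking_bytes(encoded) == ascii_only.encode("ascii"))  # True

# A byte-oriented tool can therefore still find the string delimiters.
print(encoded.count(b'"'))  # 2
```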

Property 3 turns out to be attractive because it means we can heuristically recognize UTF-8 with near 100% certainty by checking whether the file is valid. Some software thinks it's a good idea to embed a BOM (byte order mark) at the beginning of a UTF-8 file, but it is not: the file can already be recognized, and a BOM means placing three bytes at the beginning of the file which a program that uses the file may not expect (e.g. the shell interpreter looks for #! as the first two bytes of an executable).
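The validity check and the BOM problem can both be sketched in a few lines. This is one possible way to do it with Python's standard codecs module, not a description of how any particular editor does it; the function names are hypothetical.

```python
import codecs


def looks_like_utf8(data):
    """Heuristic from the text: accept the data if it decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def strip_utf8_bom(data):
    """Drop a leading UTF-8 BOM (the bytes EF BB BF) if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = "#!/bin/sh\necho Blåbærgrød\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + script

print(looks_like_utf8(script))                             # True
print(looks_like_utf8("Blåbærgrød".encode("iso-8859-1")))  # False
print(with_bom[:2] == b"#!")                               # False
print(strip_utf8_bom(with_bom)[:2] == b"#!")               # True
```

Note that pure ASCII input also passes the check, which is harmless since ASCII is a subset of UTF-8.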

Serving HTML as UTF-8

What I hear the most is that some browsers do not support UTF-8. This is not true (since at least version 4 of IE/NS), but you need to include the encoding in the HTTP response headers. If you're using Apache and the default charset is not set to UTF-8, you can add the following to your .htaccess file:

AddDefaultCharset utf-8

You can also set it for specific extensions, e.g.:

AddCharset utf-8 .txt .html
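What the browser ends up seeing is a charset parameter on the Content-Type response header. A small sketch of reading that parameter with Python's standard email.message module; the helper name and the sample header values are assumptions for the illustration.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


print(charset_of("text/html; charset=utf-8"))  # utf-8
print(charset_of("text/html"))                 # None
```

If the second case is what your server sends, the browser has to guess, which is the situation the directives above avoid.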

Receiving user data as UTF-8

If you accept data from the user via an HTML form, you should add accept-charset="utf-8" to the form element, e.g.:

<form accept-charset="utf-8" …>
    …
</form>

This will ensure that data is sent as UTF-8, and no, you cannot rely on the encoding if you do not supply this! Nor can you rely on all users limiting their use of characters to the ASCII subset which is common for the majority of encodings.

LaTeX

To make LaTeX interpret your document as UTF-8, add this to the preamble:

\usepackage[utf8]{inputenc}

Terminal

By default Terminal.app should already be set to use UTF-8 (Window Settings → Display). Since HFS+ uses UTF-8 for file names, this makes sense not only for running cat and grep on UTF-8 files, but also because ls outputs UTF-8 (it's dumping file system names).

In addition to the display preference, you should also add the following line to your profile (e.g. ~/.bash_profile for bash users):

export LC_CTYPE=en_US.UTF-8

In fact, without it Subversion will fail to work for repositories which use non-ASCII characters (when it re-encodes filenames to the local system).

Other programs also use the variable, e.g. vim will only interpret UTF-8 multi-byte sequences correctly with the variable set.
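Programs that honor the variable typically look at the codeset part of the locale name. A minimal sketch of that lookup, assuming the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); the function name is hypothetical and real programs normally ask the C library instead of parsing the string themselves.

```python
import os


def ctype_codeset(env=None):
    """Return the codeset (e.g. 'UTF-8') of the locale governing LC_CTYPE."""
    if env is None:
        env = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Locale names look like language_TERRITORY.codeset@modifier.
            _, dot, rest = value.partition(".")
            return rest.split("@")[0] if dot else None
    return None


print(ctype_codeset({"LC_CTYPE": "en_US.UTF-8"}))  # UTF-8
print(ctype_codeset({"LANG": "C"}))                # None
```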

Converting between encodings

If you need to convert between encodings, you can use iconv, e.g.:

ls|iconv -f utf-8 -t ucs-2|xxd

This converts the output of ls to UCS-2 (16-bit Unicode) and does a hex dump of it. iconv also has a transliteration feature if you need to use a lossy encoding, e.g.:

echo “that\'s nice…”|iconv -f utf-8 -t ASCII//TRANSLIT

will output

"that's nice..."

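The same two conversions can be approximated without iconv. This is only a sketch: Python has no counterpart to //TRANSLIT, so the punctuation table below is a hand-made assumption that covers just the characters in the example, and big-endian byte order is assumed for the 16-bit output.

```python
# Hand-made table (an assumption for this example), not a full transliterator.
PUNCTUATION = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2026": "...", # horizontal ellipsis
}


def to_16bit_hex(text):
    """Encode text as big-endian UTF-16 and return the bytes as hex."""
    return text.encode("utf-16-be").hex()


def to_ascii_lossy(text):
    """Replace known punctuation, then turn anything else non-ASCII into '?'."""
    table = str.maketrans(PUNCTUATION)
    return text.translate(table).encode("ascii", errors="replace").decode("ascii")


print(to_16bit_hex("a\u20ac"))                          # 006120ac
print(to_ascii_lossy("\u201cthat's nice\u2026\u201d"))  # "that's nice..."
```
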
37 Comments

  1. 18 Sep 2005 | # Kevin Ballard wrote…

    I'd just like to point out that EBCDIC existed before ASCII.

  2. 18 Sep 2005 | # Sander wrote…

    That is all nice, but I work with a lot of legacy systems that I can't (and won't) update to UTF-8 because the creator of a text editing program refuses to use other encodings.

  3. 18 Sep 2005 | # Allan Odgaard wrote…

    Sander: the point of this post was not to make people upgrade because I refuse to use other encodings, but because UTF-8

    1) is the only 8 bit encoding which can represent the entire Unicode,

    2) is compatible with ASCII,

    3) can be used as a drop-in replacement in most situations,

    4) is strongly backed by IETF,

    5) generally makes things easier (like actually allowing a way to recognize it, unlike all the other 8 bit encodings).

    So UTF-8 is the future, all other 8 bit encodings are problematic because they can't be recognized, they can't represent all the characters we use today (like the € symbol), and UTF-8 allows an easy transition by being compatible with ASCII.

    So what legacy systems are you working with, what encoding do they use, and why can't they use UTF-8?

  4. 18 Sep 2005 | # Rimantas wrote…

    Regarding HTML – according to spec it uses UCS as document character set, and UCS is equivalent to Unicode… So UTF for HTML makes even more sense (and frees us from using numeric character references – which by the way sometimes are used incorrectly).

  5. 19 Sep 2005 | # Dan Kelley wrote…

    That's neat about latex. Does anyone reading this happen to have a hint as to how I install the utf8 stuff for latex? I found a ".def" file on CTAN but I don't know where I put it. (My setup uses latex from the /sw path used by fink.)

  6. 20 Sep 2005 | # Allan Odgaard wrote…

    Dan: I'm using pdflatex from the i-installer package where utf-8 works out-of-the-box. In general I think this is a better “port” of LaTeX than at least the one I initially tried from Fink.

  7. 20 Sep 2005 | # Benoit Gagnon wrote…

    Regarding UTF-8 in TextMate, I don't think it needs to support more encoding types, but it SHOULD offer better functionality for dealing with the few charsets it supports. I keep SubEthaEdit around exclusively for its comprehensive charset features. For example, if you change the charset from the menu, a dialog pops up asking if you'd like to Reinterpret the file or Convert it. Now that's what I call fool proof. Even beats BBEdit in my opinion. It's sometimes hard to tell what TextMate is REALLY doing with my file if I select utf-8 from the menu. Please consider adding a similar feature. Also, I love how SubEthaEdit shows the current encoding/charset in the status bar. We see the current Language, why not add the current charset :)

  8. 25 Sep 2005 | # Jim Woolum wrote…

    Can someone tell me how my profile should look after adding the line: export LC_CTYPE=en_US.UTF-8 ?

    Currently it looks like this:

    #
    # DELUXE-USR-LOCAL-BIN-INSERT
    # (do not remove this comment)
    #
    echo $PATH | grep -q -s "/usr/local/bin"
    if [ $? -eq 1 ] ; then
        PATH=$PATH:/usr/local/bin
        export PATH
    fi

    Thanks a lot!

  9. 25 Sep 2005 | # Jim Woolum wrote…

    Never mind, I think I got it.

  10. 25 Sep 2005 | # Allan Odgaard wrote…

    You can verify that it's working by opening a new terminal and typing locale.

  11. 28 Sep 2005 | # Gabe da Silveira wrote…

    Allan, your stand on utf-8 is commendable. I think some additional features for dealing with encodings are in order, but I think we are all better off with you working on the new and innovative features that make TextMate unique.

    I really do feel for those who are stuck working with other encodings and can't use TextMate for that reason, but I also think that they should be making the case to their superiors about updating their systems. It's high time that all generally available open-source and commercial software packages fully support at least utf-8. PHP for example is a huge black mark for unicode support on the web.

    And for those of you who haven't done so already, please read Joel on Unicode.

  12. 28 Sep 2005 | # mathie wrote…

    Nice article, I'm convinced. :-)

    Just one wee thing: should it be s/accept-encoding/accept-charset/g for the HTML forms or is that something different?

  13. 28 Sep 2005 | # Allan Odgaard wrote…

    mathie: you are correct, I have fixed it (and made it a link to W3C), thanks!

  14. 22 Nov 2005 | # Anonymous wrote…

    How about handling more input methods? I tried using the japanese input built into OS X just now and it didn't work at all. I'd grown so used to this working in every OS X app ever so this really took me by surprise. I'm guessing the reason is that you're using custom text views, but there really aren't excuses not to support other input inputs these days IMO.

  15. 22 Nov 2005 | # Allan Odgaard wrote…

    Anonymous: Yes, TextMate doesn't support that. And yes, because it's all custom code, and I have yet to devote resources to support Hebrew, Arabic, Korean, Japanese, Chinese, a.s.o.

    I'll ignore your “no excuses” remark, assuming that you have no clue about the technical side of things, and how many hours I already spend on TextMate, trying to catch up with the code in “every OS X app”, which started its life more than ten years ago…

  16. 24 Nov 2005 | # Anonymous wrote…

    Sorry about that, 'no excuses' was rather harsh and I meant no offense.

    TextMate remains my editor of choice and is so superior to the alternatives I've tried that I don't mind doing some cutting and pasting the few times I need to insert some Japanese into my files :).

  17. 11 Dec 2005 | # Xslf wrote…

    "and I have yet to devote resources to support Hebrew, Arabic" I am looking for a good code editor (mainly for XML/XSLT/XHTML) that can handle utf/bidi/Hebrew properly. TextMate does seem nice, but it is still very lacking on the Hebrew/bidi dept. So- how high on your todo list is Hebrew/Arabic/Bidi support? Is it worth waiting for, or should I look someplace else?

    Thanks!

  18. 12 Dec 2005 | # Allan Odgaard wrote…

    Xslf: You should look elsewhere. I haven't even grasped the full complexity of this stuff, so we're talking at least a year.

  19. 30 Dec 2005 | # Koyata wrote…

    I would like to see a working Japanese input method. How about hiring people for the project, or is this too early for your business?

  20. 02 Jan 2006 | # Allan Odgaard wrote…

    Koyata: Yes, it's too early.

  21. 17 Jan 2006 | # Brendan Dowling wrote…

    Actually, UTF-8 can encode characters from the entire 32-bit Unicode space, something that the 16-bit encodings cannot do. Also, UTF-8 does not have the endian issues that 16-bit encodings can have.

  22. 07 Feb 2006 | # Taylan Pince wrote…

    Well, I was just about to be fully convinced to finally drop BBEdit and move all my production to TextMate by buying it right away. But after reading this entry and seeing that you have no intention of supporting other encodings, I don't see the point. TextMate is a good piece of software, but BBEdit still seems superior to me.

    The reason why I just cannot let go of other encodings (ISO-8859-9 in my case) is not because I am against UTF-8, but because I had to use them in my older projects, and I cannot be expected to switch back and forth between TextMate and BBEdit because some projects work with it and some don't.

    Good luck with the development, nice work, but not enough for me.

  23. 07 Feb 2006 | # Allan Odgaard wrote…

    Taylan: you could of course convert your old projects to utf-8.

    I might not have made it so clear in this post, but one (of several) reasons to push so strongly for utf-8 is that TextMate does achieve a lot through (shell) scripts, and these simply cannot have advanced heuristics to guess encoding.

    So when presenting difference files from svn, collecting TODO items from sources, running scripts with exception catching and pretty-printed output, etc. all these scripts assume that what they work with is utf-8, and it's anything but trivial to do otherwise.

  24. 07 Feb 2006 | # Taylan Pince wrote…

    Allan, I totally understand, and I think you are right to push the users a bit on this topic. I didn't make up my mind just yet, but I might still start using TextMate for all my new projects.

    As for old projects, I simply don't have the resources to go back through two fairly large e-commerce sites, convert everything to UTF-8, and more importantly, test it to make sure they work properly. They'll have to stay the way they are for a while.

  25. 29 Mar 2006 | # George wrote…

    I second that Taylan.

  26. 16 Dec 2006 | # Luc wrote…

    Thanks for the article! I needed this right now!

  27. 02 Dec 2007 | # A Nonny Mouse wrote…

    Rather than add it to a .bashrc you could create a Mac OS X properties file called ~/.MacOSX/environment.plist along these lines:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>LC_CTYPE</key>
        <string>en_US.UTF-8</string>
    </dict>
    </plist>

  28. 11 Jul 2008 | # MarkUp wrote…

    Allan,

    Honestly, I'm convinced that you could add load/save support for more encodings in TextMate in less time than it took you to create this page.

    I don't see the point in teaching people to use UTF-8 in an editor that claims to be some sort of "culmination" and oriented to "expert scripters and novice users alike/Whether you are a programmer or a designer" when we, expert scripters, programmers, designers and even novice users have often the need to deal with other encodings.

    I also don't see the point in teaching us the burden of using iconv, when you could even just create a temporary file and call iconv on it from TextMate itself if you don't feel like using the comfortable OS X API to do it the right way (another thing that could be implemented in less time than it is discussed!).

    We are not requesting TextMate to be some sort of Wintel Nero-like feature-overflowed monster app! But supporting load/save in different encodings is something that even TextEdit does!

    I myself am a programmer whose applications depart radically from standards (most even take over the OS!), and often I try to teach people to "good habits", but for this reason my apps are clearly labelled for that purposes and don't give the impression of being all-purpose culminations of anything.

    I really love TextMate, but I need an all-purpose editor that has at least the same features of TextEdit or Xcode's built-in editor.

    Cheers,

    MarkUp
    
  29. 22 Jul 2008 | # george tziralis wrote…

    I'm a textmate lover and I've written my PhD dissertation in latex using textmate. Now I've learned that I can't submit it in english (i'm studying in Athens), so I need to translate it in Greek. The only package that works seamlessly with greek in latex requires an ISO 8859-7 encoding and I do feel pretty unlucky that you have yet to add support for separate than utf-8 encodings (even if your arguments are strong indeed). Or is there any hope/hack for my beloved software at the end of the tunnel?

  30. 07 Mar 2009 | # jpol wrote…

    Really guys, lack of ISO 8859-7 encoding is what keeps me from using this app over TexShop. Too bad..

  31. 26 Mar 2009 | # ev wrote…

    I am a web developer. Nowadays our company designs new sites in UTF-8. But old sites were already done in windows-1251, and sometimes they need changes. So every day I connect to some servers through FTP (SFTP etc.) and open, change and save some files written in windows-1251.

    Is it possible to use TextMate to help me with these tasks? I suppose, iconv command can be used to change encoding "on the fly", but I don't know how to do that, because I went to Mac OS from Windows recently, so I am not familiar with terminal features.

  32. 25 Jul 2009 | # Anton Gavrilov wrote…

    UTF-8 is excellent, but the hard cold reality is that there is a lot of non-Unicode software around that you have to interoperate with, and it's not going anywhere as long the dominant OS is Windows with its mess of code pages and UTF-16 (perhaps the most cumbersome way to represent Unicode – do you know how much software which does try to support it will pretend surrogate pairs just don't exist?).

    It is such an arrogance to say "Just use UTF-8" having ensured, with Latin-1, that the problem of legacy encodings doesn't bite you personally (and perhaps your largest market, which incidentally seems to be the one most suffering from the recession right now).

    Why don't you put your money where your mouth is and remove support for the Western encodings, leaving just plain ASCII and Unicode? And when your customers start asking for a refund, tell them you don't see what their problem is – everyone should be using UTF-8 by now, right?

  33. 25 Jul 2009 | # Allan Odgaard wrote…

    Anton: The blog post is making a case for UTF-8 because many users have been using legacy encodings simply because no-one explained the advantages of UTF-8 or the disadvantages of using legacy encodings.

    The blog post further shows how you can convert from practically any encoding to any other encoding using the iconv command found on every install of OS X.

    As for going UTF-8 only, that actually is something I plan to do, but at the same time introduce import/export callbacks that can do the necessary transcoding for users that need this.

    There are also users who keep their log files gzipped, and want to open them in TextMate, or keep them on an sftp server, and want to open them in TextMate, etc. I don’t see anyone calling me arrogant because I haven’t yet addressed these needs…

  34. 25 Jul 2009 | # Anton Gavrilov wrote…

    It's one thing when you're not implementing something because it's hard and/or you haven't gotten around to it yet, and it's another thing to refuse to add easy features (possibly not requiring any code at all) because you believe you know better what the users need.

    For all my love of UTF-8, the corporate website I'm responsible for will stay in Windows-1251 for the foreseeable future to ensure it can be readily edited in any of the windows tools we might want to use. Luckily it seems I'll be able to edit pages on my mac as well should I feel like it, now that I discovered TextWrangler which automates charset detection and saving back to the original format without me having to do as much as a mouse click, let alone typing console commands!

    And I guess it'll be easier to use TextWrangler for my personal projects as well, which are of course in UTF-8.

  35. 25 Jul 2009 | # Allan Odgaard wrote…

    Anton: I wrote my own encoding stuff which does frequency analysis to estimate which legacy encoding was used, code which I have been wanting to retire rather than extend, so no, this can’t be done w/o writing code.

    Glad to hear you got your encoding needs solved by TextWrangler. Hopefully you’ll give TextMate a second chance when we release 2.0.

  36. 26 Jul 2009 | # Anton Gavrilov wrote…

    Now I like that response! I will certainly be looking forward to 2.0.

  37. 08 Oct 2009 | # David Price wrote…

    Good luck with Release 2.0 and its support for alternative character sets.

    Like other would-be users, support by the convert-to-UTF-8-on-open and convert-back-on-save method seems easy enough to me. Can I suggest that you allow user-written plugins for this purpose? People can contribute their plugins and the responsibility for understanding hundreds of character sets and thousands of minor variations will no longer be yours!

    My own interest is EBCDIC, the oldest living character set whose usage continues to grow. Used on mainframes and on certain IBM midrange computers, it is a (family of) 8-bit character set(s) that (as your very first comment above noted) is older than ASCII.

    Support for simple (user-extendable) conversion and convert-back (for any character set) would get you a lot of extra customers, and would bring you up to the level of your competition.

    If you really wanted to support multilingual multi-platform users, and be the ultimate editor, then please add support in such a way that the characters are displayed correctly but still have their native hex bit-patterns available as well. That would enable reading dumps, and other files with mixed text and binary fields, on a Mac using TextMate. But I guess if you consider simple builtin conversion to be difficult, then support for data with mixed non-native text and binary must seem impossible.

    Best of luck with the best UTF-8-specific Mac editor around!

Comments closed, you can use the mailing list for discussion.