What does "Unicode Build" Really Mean?

wxPython can be built in either "unicode mode" or "ansi mode" but it may not be clear what that really means. Hopefully this page will help clear up the confusion. If not then please ask questions below and when somebody comes along that knows the answer they can update the page.

Unicode in Python

There are a few aspects of unicode-ness to keep track of here. First, in Python there are unicode objects and there are string objects. String objects are essentially a sequence of 8-bit characters, and unicode objects are a sequence of "wide" characters (either 16-bit or 32-bit, depending on the platform and the options used when building Python). They are related to each other in that a unicode object can be encoded into a string object using a specific "codec" (a matched enCODer and DECoder pair), and a string object can be decoded into a unicode object using the decoder part of the same codec. You can think of a codec as being like the "magic decoder ring" that came in the box of cereal when you were a kid (or perhaps when your dad was a kid...).
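As a minimal sketch of that encode/decode relationship, the snippet below round-trips one unicode object through the UTF-8 codec. It is written to run unchanged on Python 2 and on Python 3 (where the same two types are called str and bytes). The variable names are only illustrative.

```python
# A text object (Python 2 `unicode`, Python 3 `str`).
text = u'caf\xe9'

# Encode: text -> 8-bit string, using the encoder half of the codec.
raw = text.encode('utf-8')

# Decode: 8-bit string -> text, using the decoder half of the same codec.
restored = raw.decode('utf-8')

print(len(text))         # number of characters
print(len(raw))          # number of bytes; the accented letter takes two
print(restored == text)  # the round trip is lossless
```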

See this link in the Python Tutorial for more info.

Unicode in wxWidgets

On the other side of the fence is wxWidgets and how it can use unicode. In the C++ code there is a class named wxString, and all string type parameters and return values in the wxWidgets library use a wxString type. The wxWidgets library has a unicode compile switch that makes wxString be either an array of 8-bit characters (the C char data type) or an array of wide characters (C's wchar_t data type.) So in other words you can have a wxWidgets build where wxStrings are unicode and another build where wxStrings are ansi strings.

There is more info here.

Unicode in wxPython

So what does all that mean for wxPython? Since Python knows about both string and unicode objects, and you can have both in the same program, the wxPython wrappers need to do something intelligent based on whether the wxWidgets build being used is a unicode build or an ansi build.

So, if wxPython is using an ansi build of wxWidgets then:

And if wxPython is using a unicode build of wxWidgets then:

You can test if your code is running in a unicode build of wxPython by checking wx.PlatformInfo, like this:

    import wx
    if "unicode" in wx.PlatformInfo:
        # do something about unicode here...
        pass

UPDATE: Starting with wxPython 2.5.4.1 the codec used to convert string objects to/from unicode objects can be set by user code using the wx.SetDefaultPyEncoding(encoding) function. Also the default codec used is the value of the locale.getdefaultlocale()[1] expression, or it will fall back to sys.getdefaultencoding() if there is a problem with the default locale settings.
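The fallback rule described in the update can be sketched in plain Python. This is only an illustration of the rule as stated above, not wxPython's actual implementation. The name guess_default_encoding is a hypothetical helper, and the extra exception handling is an assumption added so the sketch also copes with Python versions where locale.getdefaultlocale() is deprecated or missing.

```python
import codecs
import locale
import sys


def guess_default_encoding():
    """Hypothetical helper: locale encoding first, then Python's default."""
    try:
        encoding = locale.getdefaultlocale()[1]
        # Raises LookupError for an unknown codec name and
        # TypeError when the locale reported no encoding (None).
        codecs.lookup(encoding)
    except (ValueError, LookupError, TypeError, AttributeError):
        encoding = sys.getdefaultencoding()
    return encoding


print(guess_default_encoding())
```

In a wxPython version that provides it, a value like this is what you would hand to wx.SetDefaultPyEncoding(encoding) if you wanted to set the codec explicitly.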

A Very Basic Unicode Introduction

This section is designed to help morons like me figure out this Unicode stuff. Here are a number of things about working with Unicode that I would have found helpful to understand when I started working with it. (The examples below marked with a ">>>" prefix were run in IDLE on Windows. Their print statements probably won't work from command-line Python, because complex characters often can't be printed in that environment.)

The official web site of all things Unicode is http://www.unicode.org. They have many code charts that allow you to determine what the character codes are for Unicode characters.

There are many ways to represent Unicode characters in Python. If you're starting from the code charts at Unicode.org, perhaps the easiest way is:

  ch_u = u'\u4eb0'

will give you character 4EB0, the Chinese character 亰, which I picked at random.

If you call ord(ch_u), you will find out that this is character 20144. So an alternate way of creating this character would be:

  ch2_u = unichr(20144)

The Rich Text Format specification, for example, uses this form for indicating Unicode characters.
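Here is a small sketch tying the two forms together: the hex value from the code charts and the decimal value returned by ord() are the same number. It assumes nothing beyond the built-ins. Because unichr() exists only on Python 2, the sketch falls back to chr() on Python 3, and make_char is just an illustrative name.

```python
ch_u = u'\u4eb0'

code_point = ord(ch_u)
print(code_point)            # decimal form
print('%04X' % code_point)   # hex form, as listed in the code charts

# unichr() exists only on Python 2; chr() plays the same role on Python 3.
try:
    make_char = unichr
except NameError:
    make_char = chr

ch2_u = make_char(code_point)
print(ch2_u == ch_u)
```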

You can use these characters with wxPython widgets now. The simplest example would be to display them as follows:

  import wx   # Assuming you have a Unicode build as your default wxPython.
  App = wx.App()
  dlg = wx.MessageDialog(None, u'My Unicode character is %s' % ch_u)
  dlg.ShowModal()
  dlg.Destroy()

Now, if you've been reading up on Unicode, you'll know that Encodings are a big deal. But neither of the methods discussed above deals with encodings. One article I read said essentially that if you don't have an explicit encoding, you don't know what character you're dealing with, but I think that's an overstatement. The two representations above create Unicode objects that accurately represent the character. (If you are having trouble getting a particular character displayed on the screen correctly, you may need to install a character set or font that includes the character.)

Encoding is important, but it took me the longest time to understand why. Python and wxPython seem perfectly happy working with Unicode objects transparently. That is to say, they are able to work with character 20144 just like any other character. But when you have to interact with the outside world, doing something radical like saving your data to a file or in a database, you run into trouble. Why? Files (for example) are sequences of 8-bit bytes, and obviously our 亰 character isn't represented internally in 8 bits. Its character code needs at least 16 bits. Encodings are used to translate between Python's internal wide-character representation of the 亰 character and the outside world's 8-bit expectations.

There are lots of different encodings. Different ones are designed to handle different jobs. The people who were putting all of this together wanted to be able to keep things relatively simple, so they created "Latin-1" to deal with the Romance languages that descended from Latin, along with other Western European languages. They have a more-or-less common character set, and they were able to work out an encoding system with 8-bit characters. Latin-1 is easy to work with because your characters are fixed width: every character is exactly one byte.

Other languages were more difficult. They contain more characters or ideographs. The Latin-1 encoding doesn't work with our Chinese character, as it's not in that encoding's character set. "UTF-8" is an encoding that can represent a lot more characters than Latin-1, but it's more difficult to work with because characters are represented by between one and four bytes. As you can imagine, not knowing the character width can add some challenges to decoding these characters.
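To make the difference concrete, here is a short sketch (runnable on Python 2 and 3) that tries both encodings. The sample characters other than 亰 are my own picks for illustration: a plain ASCII letter, an accented Latin letter, and a musical symbol from outside the 16-bit range.

```python
ch_u = u'\u4eb0'

# Latin-1 has no slot for the Chinese character, so encoding fails.
try:
    ch_u.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# UTF-8 uses a different number of bytes depending on the character.
samples = [u'A', u'\xe9', u'\u4eb0', u'\U0001d11e']
widths = [len(s.encode('utf-8')) for s in samples]
print(widths)
```

The four samples need one, two, three and four bytes respectively, which is why a UTF-8 decoder cannot simply count bytes to find character boundaries.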

The main thing to keep in mind is that you need to use the same encoding when converting your string to and from Unicode. The choice I've made for my software, where I don't know what language the end-user will be using, is to go with UTF-8 because I hope that will provide me with support for a lot more languages than other encoding systems I looked at. There are lots of articles on the Web that explain encodings and the differences between them far better than I could, so I won't say anything more about it here.
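A quick sketch of what goes wrong when the two directions don't match: the bytes produced by the UTF-8 encoder are decoded once with UTF-8 and once with Latin-1. No error is raised in the second case, you just silently get different characters back. The variable names are illustrative only.

```python
ch_u = u'\u4eb0'
raw = ch_u.encode('utf-8')

same_codec = raw.decode('utf-8')     # matches the encoder
wrong_codec = raw.decode('latin-1')  # does not match the encoder

print(same_codec == ch_u)                 # original character recovered
print(len(wrong_codec))                   # one character became several
print(wrong_codec == u'\xe4\xba\xb0')     # three unrelated Latin-1 characters
```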

The point is that when we want to take our Unicode strings and save them (or whatever,) we need to convert them to an 8-bit representation. This is accomplished as follows:

  ch_utf8 = ch_u.encode('utf8')

While ch_u was of type 'unicode', ch_utf8 is of type 'str'. (Okay, maybe it seems obvious, but I didn't notice this subtlety for a long time.)

  >>> print ch_u, type(ch_u), ch_utf8, type(ch_utf8)

produces:

  亰 <type 'unicode'> äº° <type 'str'>

What does this tell us? It tells us that our 亰 character in Unicode is represented by the 3-character string äº° in UTF-8. To make sense of this, in IDLE, leave out the "print" and type:

  >>> ch_u, ch_utf8

which produces:

  (u'\u4eb0', '\xe4\xba\xb0')

The first representation you should recognize. The second tells us that the UTF-8 representation of Unicode character 20144 is 3 characters (bytes), presented in hex form: \xe4, which is displayed as "ä" and can also be produced with chr(228); \xba, which is displayed as "º" and can also be produced with chr(186); and \xb0, which is displayed as "°" and can also be produced with chr(176). (The visual difference between "º" and "°" is hard to see, but they are different characters.)
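If you would rather check those three values than take my word for it, this sketch lists them as numbers. It uses bytearray because that yields integers on both Python 2 and Python 3.

```python
ch_utf8 = u'\u4eb0'.encode('utf-8')

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(ch_utf8))
print(byte_values)                         # decimal, matching chr(...) above
print(['%02x' % b for b in byte_values])   # hex, matching the \x.. escapes
```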

So now we have a third way of creating our Chinese character:

  ch3_u = unicode('\xe4\xba\xb0', 'utf8')

When we're manipulating data in Python and wxPython, we simply use our Unicode objects, such as ch_u. When we need to convert our Unicode objects to 8-bit string representations, for example to save the string to a database, we encode them with a specific encoding by calling ch_u.encode(encoding). When we read that string back from the database, we need to convert it to a Unicode object using unicode(string, encoding), with the same encoding, so that it will be displayed correctly by wxPython.
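The whole cycle can be sketched with a temporary file standing in for the database. This is just one way to do it, under the assumption that UTF-8 is the chosen encoding. It uses .decode(encoding), which does the same job as unicode(string, encoding) and also works on Python 3.

```python
import os
import tempfile

ENCODING = 'utf-8'   # assumed choice; use the same value in both directions
ch_u = u'\u4eb0'

# Create an empty temporary file to play the role of the outside world.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Going out: encode the Unicode object to 8-bit data before writing.
    with open(path, 'wb') as f:
        f.write(ch_u.encode(ENCODING))

    # Coming back: read the 8-bit data and decode it with the same codec.
    with open(path, 'rb') as f:
        raw = f.read()
    restored = raw.decode(ENCODING)
finally:
    os.remove(path)

print(len(raw))           # bytes stored on disk
print(restored == ch_u)   # same character after the round trip
```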


Comments or Questions?


Simply smashing! I could not make head or tails out of unicode or utf8 etc. What a great bit of help, thanks a lot.

Donn.

UnicodeBuild (last edited 2013-01-09 18:06:08 by p54B3D853)
