Understanding unicode in python and writing text in devanagri script

This Unicode HOWTO by Python Software Foundation is a short but informative read about Unicode handling in python. Understanding the brief history of characters standardization from ASCII to Unicode is important not only when you’re working on the stuff that needs it but also since more and more services interact with each other over Web and internationalization (or localization or l10n) is omnipresent.

In python (v2.7) playing around with unicode is super easy.

A normal string in python:

$ s = "abcde"
$ type(s)
str 

A unicode string in python:

$ s = unicode("abcde")
$ type(s)
unicode 

A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12ca’. U+12ca is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

So, a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.

The rules for translating a Unicode string into a sequence of bytes are called an encoding. UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:

  • If the code point is <128, it’s represented by the corresponding byte value.
  • If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
  • Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

  • It can handle any Unicode code point.
  • A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
  • A string of ASCII text is also valid UTF-8 text.
  • UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.

That was some important and interesting theory.

Now out of curiosity I wanted to print Devanagari script characters (Hindi characters to be precise). So I searched over google for Devanagari characters unicode list and found this page from unicode.org comprising of all devanagari script unicode values.

first_name = u'\u092e' + u'\u0928' + u'\u094b' + u'\u091c'
last_name = u'\u0905' + u'\u0935' + u'\u0938' + u'\u094d' \
+ u'\u0925' + u'\u0940'

print first_name + ' ' + last_name

This prints the devanagari characters (basically my name). Output on a ipython notebook is clear and on terminal bit unclear. Looks beautiful.

मनोज अवस्थी
Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s