Tuesday, 26 July 2011

Unicode in Python

ASCII is a famous five-letter abbreviation in C. It's an encoding whose name stands for "American Standard Code for Information Interchange"; in other words, it's a text representation in which every character you see on your keyboard has a specific code (here, a number) that helps the computer recognize which key you've hit. I used it every now and then while writing code in C, especially if the code involved strings or characters. However, while I was learning Python, a new name popped up: Unicode!
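To make the "every character has a number" idea concrete, here is a small sketch in Python 3 (my choice of version, the sample string is just an illustration). It uses the built-in ord() and chr() functions to go from characters to their codes and back.

```python
# Look up the numeric code of each character, then rebuild the string.
sample = "Hi!"  # arbitrary example text

codes = [ord(ch) for ch in sample]
print(codes)  # [72, 105, 33]

# chr() is the inverse of ord(): number back to character.
restored = "".join(chr(code) for code in codes)
print(restored)  # Hi!

# Every plain ASCII character has a code below 128.
print(all(code < 128 for code in codes))  # True
```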
Even though it is often regarded as an extension of ASCII (its first 128 codes are identical to ASCII), there's more to it than meets the eye. However, the overly simplified description in the Python 2.7 tutorial didn't make any sense to an amateur programmer like myself, so I decided to look for decent references and (no surprise) I managed to find an in-depth description of Unicode in Python here.
                                                  
What makes Unicode stand apart from the crowd is that it has room for over a million codes (1,114,112 code points, to be exact), covering the writing systems of many languages apart from English! That means that whatever language you are reading off the internet/TV, its characters almost certainly have Unicode codes. A very detailed description is provided in the above-mentioned link, but if you're one of the lazy kind, let me take the liberty of mentioning some key points about Unicode.

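Unicode has no specific representation in a computer; it only assigns each character a number (a "code point"). To store or send text, those numbers are turned into bytes by an encoding such as UTF-8 (Unicode Transformation Format, 8-bit) or UTF-16 (16-bit). Older encodings such as ASCII or ISO-8859-7 can be used as well, but they cover only a small subset of Unicode.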
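Here is a minimal sketch of that point, assuming Python 3 and a three-letter Greek sample of my own choosing. The same text becomes different bytes depending on the encoding, and an encoding that doesn't cover the characters (ASCII here) simply refuses.

```python
# Same text, different byte representations.
text = "\u03b1\u03b2\u03b3"  # Greek alpha, beta, gamma (example text)

greek = text.encode("iso-8859-7")   # one byte per Greek letter
utf8 = text.encode("utf-8")         # two bytes per Greek letter
utf16 = text.encode("utf-16-le")    # two bytes per letter, no byte-order mark

print(len(greek), len(utf8), len(utf16))  # 3 6 6

# ASCII has no Greek letters, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# Decoding with the matching encoding gives the original text back.
print(utf8.decode("utf-8") == text)  # True
```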
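Text can be expressed in UTF-32 as well, but there is a catch: "narrow" builds of Python 2 store Unicode strings internally as 16-bit units, so a character beyond U+FFFF doesn't get a single 32-bit value. Instead, it is represented as a pair of 16-bit UTF-16 code units, which is called a "surrogate pair".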
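To see what a surrogate pair looks like, here is a rough sketch, again assuming Python 3. The character U+1F600 is just an example I picked from outside the 16-bit range. The code encodes it as UTF-16, unpacks the two 16-bit units, and checks them against the standard surrogate arithmetic.

```python
import struct

ch = "\U0001F600"  # example character beyond U+FFFF

# UTF-16 (big-endian, no byte-order mark) needs two 16-bit units here.
utf16 = ch.encode("utf-16-be")
high, low = struct.unpack(">HH", utf16)
print(hex(high), hex(low))  # 0xd83d 0xde00

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
manual_high = 0xD800 + (offset >> 10)    # top 10 bits
manual_low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
print((manual_high, manual_low) == (high, low))  # True

# UTF-32 stores the code point directly in four bytes.
utf32 = ch.encode("utf-32-be")
print(len(utf16), len(utf32))  # 4 4
```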
Also, Python 2 requires the prefix "u" right before the opening quote of a string (for example, u"hello") to declare it as a Unicode string, but that's not needed in Python 3, since every string is a Unicode string by default.
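A quick sketch of the Python 3 side of this, with a sample word of my own. It assumes Python 3.3 or later, where the "u" prefix is accepted again for compatibility but makes no difference.

```python
# In Python 3 a plain string literal is already a Unicode string.
s = "caf\u00e9"        # "café", example text
legacy = u"caf\u00e9"  # the u prefix is allowed (3.3+) but redundant

print(type(s).__name__)  # str
print(s == legacy)       # True

# Bytes only appear once you explicitly encode.
data = s.encode("utf-8")
print(type(data).__name__, len(s), len(data))  # bytes 4 5
```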
                   
