Why do you need Unicode? This article explains what Unicode is for and, briefly, how it is used in Python.
Let’s start from square one. As you know, everything on the machine, every character, string, and value, is ultimately made of bytes. When you use a computer, you see the world at a much higher level, and in a language like Python you normally don’t have to worry about every single byte or about translating words into the right bytes.
BUT, English isn’t good enough. More than half the world uses non-Latin characters. ASCII, the older standard, defines only 128 characters, which is enough for English and little else. Unicode came along to give every character from every language a unique number called a code point.
Unicode is the set of all characters used in the world, and its two most common encodings are UTF-8 and UTF-16. Think of Unicode as the alphabet and of encodings as translation tables: you see and use the understandable end, like the string “dog,” and behind the scenes a language like Python uses the encoding to translate it to and from bytes.
These translation tables can handle mathematical symbols and Chinese characters alike, which is why you should come to appreciate Unicode. Consider the range of characters that exist beyond the English language!
Most people use UTF-8 to encode and decode text, because it can represent every Unicode character and stays compatible with ASCII.
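To make the alphabet versus translation table idea concrete, here is a minimal sketch, assuming Python 3. The variable names are only for illustration. It shows that a character has one code point no matter what, while the bytes depend on which encoding you pick.

```python
# Each character has exactly one code point, whatever the encoding.
text = "dog"
code_points = [ord(ch) for ch in text]   # [100, 111, 103]

# The same text run through two translation tables gives different bytes.
utf8_bytes = text.encode("utf-8")        # b'dog'
utf16_bytes = text.encode("utf-16-le")   # b'd\x00o\x00g\x00'

# A Chinese character: one code point (U+4E2D), but three bytes in UTF-8
# and two bytes in UTF-16.
han = "\u4e2d"
han_utf8 = han.encode("utf-8")           # b'\xe4\xb8\xad'
han_utf16 = han.encode("utf-16-le")      # two bytes: 0x2d 0x4e

print(code_points, utf8_bytes, utf16_bytes, han_utf8, han_utf16)
```

Notice that for plain English text the UTF-8 bytes are identical to ASCII, which is part of why UTF-8 is so widely used.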
Now to Python
Two types of strings exist in Python: byte strings and Unicode strings.
A byte string is a sequence in which every element is a byte, whereas a Unicode string is a sequence in which every element is a character, identified by one of those unique numbers called code points.
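The difference in what counts as an "element" is easiest to see by counting and indexing. This is a small sketch, assuming Python 3 (where the Unicode string type is str and the byte string type is bytes); the sample word is arbitrary.

```python
text = "caf\u00e9"            # 'café' as a Unicode string
data = text.encode("utf-8")   # the same word as a byte string

char_count = len(text)        # 4: one element per character
byte_count = len(data)        # 5: 'é' needs two bytes in UTF-8

last_char = text[-1]          # 'é', a one-character string
last_byte = data[-1]          # 169 (0xa9), a single byte as an integer

print(char_count, byte_count, last_char, last_byte)
```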
Why the two?
Byte strings are used for writing to files, sending data over networks, and so on, while Unicode strings are what you manipulate inside your program, since they can hold any character that exists on the planet. What you send around is always byte strings; what you edit in the actual program is usually the Unicode string.
For a language like Python, you’re usually operating on Unicode strings, and Python encodes the output of your manipulation to whatever byte layout your terminal application is using.
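Files show the same division of labor. The sketch below, assuming Python 3, writes a Unicode string in text mode, then reopens the file in binary mode to look at what was really stored. The file name sample.txt and the temporary directory are just placeholders for the example.

```python
import os
import tempfile

text = "na\u00efve \u4e2d\u6587"  # 'naïve 中文'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Text mode: Python encodes the Unicode string to bytes on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: what sits on disk is a byte string.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: Python decodes the bytes back into a Unicode string.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(raw)
print(round_trip == text)
```

Passing encoding="utf-8" explicitly is a deliberate choice here: without it, Python falls back to a platform-dependent default.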
Great, it’s automatic?
Not quite. There are specific scenarios where Python can’t encode or decode automatically. For example, when your output goes through a pipe, you may need to encode it yourself. Another big example is across networks. You may get unfamiliar byte strings over the web if you make HTTP requests to pages in other languages, so you’ll need to decode them with the encoding the page actually uses. If that is UTF-8, a popular encoding of Unicode, decoding turns the raw byte string into a Unicode string that you can actually read and manipulate correctly.
To recap: you manipulate Unicode strings. Think of Unicode as the entirety of every character existing on the planet, each with its own unique code point. Unicode has encodings. Encodings are formats like UTF-8 that convert those code points into their proper bytes, making text sendable and storable.
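Here is roughly what manual decoding looks like. This is a sketch, assuming Python 3; the payload is a made-up stand-in for a response body, not data from a real request.

```python
# Bytes as they might arrive over the network (made-up sample, UTF-8 encoded).
payload = b"\xe4\xbd\xa0\xe5\xa5\xbd, world"

# Decoding with the right encoding gives a usable Unicode string.
text = payload.decode("utf-8")   # '你好, world'

# Decoding with the wrong encoding fails loudly.
try:
    payload.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
lossy = payload.decode("ascii", errors="replace")

print(text, ascii_ok, lossy)
```

The replace option keeps the program running, but the original characters are lost, so it is better suited to logging than to data you need to keep.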
Strings in Python can be very confusing when it comes to encoding! Even more confusing is that:
In Python 2: str is a byte string, unicode is a Unicode string
In Python 3: bytes is a byte string, str is a Unicode string
But luckily, in Python 2,
u'foo' (unicode) and
b'foo' (byte) often behave the same, because Python 2 converts between them automatically using ASCII. But not always: the automatic conversion breaks as soon as non-ASCII characters are involved, and Python 3 does not convert between the two types automatically at all. So it’s good to understand what’s actually happening and to know how to encode and decode manually.
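A quick way to see where the automatic handling stops is to put the two types side by side. This sketch assumes Python 3, where the u prefix is still accepted and simply means str; Python 2 behaves differently for these same lines.

```python
text = u"foo"   # Unicode string (str in Python 3)
data = b"foo"   # byte string (bytes in Python 3)

# Python 3 never treats the two as equal, even for pure ASCII content.
same = (text == data)            # False

# Mixing them in one operation raises instead of guessing an encoding.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode (or encode) is what makes them comparable.
explicit_match = (data.decode("ascii") == text)   # True

print(same, mixed_ok, explicit_match)
```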