Character Encoding

To the computer, all data is a bunch of binary numbers, including strings, pictures and videos. So how do you get from a bunch of numbers to a string? There is an agreed-upon way to convert the numbers into letters, and it is called a character encoding.

One of the oldest character encodings still in use is called ASCII, short for American Standard Code for Information Interchange. ASCII uses seven bits to represent 128 different characters. Why so many? To the computer, upper case letters are different from lower case letters, and there are a whole bunch of special characters. Special characters control how a string looks or is printed.
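To see the number behind a character, Python's built-in ord function can be used, with chr going the other way. This is only a small sketch, and the variable names are made up for the illustration.

```python
# ord() gives the number behind a character; chr() goes the other way.
upper_a = ord('A')
lower_a = ord('a')
print(upper_a, lower_a)         # 65 97

# Upper and lower case letters really are different numbers.
print(upper_a == lower_a)       # False

# Every ASCII value fits in seven bits.
print(format(upper_a, '07b'))   # 1000001
print(chr(65))                  # A
```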

There are a few special characters that are important to know.

Binary    Hex   Decimal  String Notation  Meaning
--------  ----  -------  ---------------  -------
000 0000  0x00  0        \0               Null character. Marks the end of a string in languages like C; Python strings do not need one.
000 1001  0x09  9        \t               Tab character. Adds an adjustable amount of whitespace.
000 1010  0x0a  10       \n               Line feed. Starts a new line of text on UNIX.
000 1101  0x0d  13       \r               Carriage return. On Windows, \r\n together start a new line of text.
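The values in the table can be checked from Python. The sketch below looks up each special character with ord; the dictionary name specials is just an illustrative choice.

```python
# Each escape sequence stands for exactly one character.
specials = {'\\0': '\0', '\\t': '\t', '\\n': '\n', '\\r': '\r'}

for notation, char in specials.items():
    # Show the string notation next to its decimal, hex, and 7-bit binary values.
    code = ord(char)
    print(notation, code, hex(code), format(code, '07b'))

# A Windows line ending is two characters long.
windows_newline = '\r\n'
print(len(windows_newline))   # 2
```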

Many of the special characters are leftovers from the days when there were no computer monitors. In those days the UNIX command line was typed out using a specially modified typewriter called a Teletype.

It’s pretty common to use \t and \n in strings:

print('Special characters:\nNo tab\n\tOne tab\n\t\tTwo tabs\n\t\t\tThree tabs')

Enter the print statement into the cell below.

[ ]:

Notice what the special characters do?

UTF Encoding

There’s a problem with ASCII. What if you want to interchange information with non-Americans? ASCII only contains the English alphabet, so you can’t write things in other languages. In the early 1960s, when ASCII was developed, few other countries had computers. Now, people all over the world have computers in their pockets. To address this problem the UTF family of encodings was created. UTF is short for Unicode Transformation Format. UTF encodings contain enough characters for every human language and emojis!

In ASCII every character is stored in one byte (8 bits, with only 7 used). In UTF-8 a character takes between one and four bytes (8 to 32 bits). The ord and chr functions convert between a single character and its number, but they do not show which bytes store that character. The encode and decode functions convert between strings and bytes.
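The varying byte count can be seen by encoding a few characters with UTF-8 and measuring the result. This is a rough sketch; the sample characters are arbitrary picks, written as escapes so they survive copying.

```python
# A Latin letter, an accented letter, an Ethiopic letter, and an emoji.
samples = ['A', '\u00e9', '\u12a0', '\U0001F638']

for char in samples:
    encoded = char.encode('utf-8')
    # len() of the encoded bytes is the number of bytes UTF-8 needs.
    print(char, len(encoded))

utf8_lengths = [len(char.encode('utf-8')) for char in samples]
print(utf8_lengths)   # [1, 2, 3, 4]
```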

Byte Strings

In order to understand what encode and decode do you need to know about a type of string that I haven’t mentioned yet. The b-string is used to express raw bytes. Raw bytes are a bunch of numbers. They can be converted to strings using an encoding.

Here’s how you specify a b-string. Each hexadecimal value in the b-string starts with \x. So \x01 is how you write hexadecimal 0x01.
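As a small illustration of the \x notation, the sketch below builds a two-byte b-string and looks inside it. The name raw is only an example.

```python
# Two bytes written in hexadecimal notation.
raw = b'\x48\x69'

# Indexing or looping over a b-string gives plain integers.
print(raw[0])       # 72
print(list(raw))    # [72, 105]

# Python displays bytes that match printable ASCII as letters.
print(raw)          # b'Hi'
print(raw == b'Hi') # True
```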

cat_bytes = b'\xF0\x9F\x98\xB8'

The decode function applies a character encoding to bytes and returns a string. See what happens when you decode the bytes from the previous example:

print(cat_bytes.decode('utf-8'))

Enter the example lines to see the decoded bytes:

[ ]:

Meow! The encode and decode functions cannot automatically detect the character encoding of your data. They use the encoding you name, or UTF-8 if you don’t name one. See what happens to the code above if you try to decode the bytes as “ascii”.
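Once you have tried it yourself, the sketch below shows one way to handle the failure. It assumes the same four bytes as the example above, and it uses the errors= argument of decode, which is not covered elsewhere in this lesson.

```python
cat_bytes = b'\xF0\x9F\x98\xB8'

try:
    cat_bytes.decode('ascii')
except UnicodeDecodeError as error:
    # ASCII only defines values 0 to 127, so the byte 0xF0 is rejected.
    print('Could not decode:', error.reason)

# errors='replace' swaps each undecodable byte for a placeholder character.
replaced = cat_bytes.decode('ascii', errors='replace')
print(len(replaced))   # 4
```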

The encode() function does the reverse. Here’s how to convert the robot face emoji to the ‘utf-32’ format.

[1]:
'🤖'.encode('utf-32')
[1]:
b'\xff\xfe\x00\x00\x16\xf9\x01\x00'
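The output above is eight bytes long even though UTF-32 uses four bytes per character. The extra four bytes at the front are a byte order mark, which records the order the bytes are stored in. The sketch below compares ‘utf-32’ with ‘utf-32-le’, a variant that fixes the byte order and leaves the mark out; the variable names are illustrative.

```python
robot = '\U0001F916'   # the robot face emoji

with_bom = robot.encode('utf-32')
without_bom = robot.encode('utf-32-le')   # little-endian, no byte order mark

print(len(with_bom), len(without_bom))    # 8 4

# Decoding reverses encoding, so the round trip gives back the original.
print(with_bom.decode('utf-32') == robot)   # True

# The character's number is stored in the four data bytes.
print(hex(int.from_bytes(without_bom, 'little')))   # 0x1f916
```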

Check out your favorite emojis at https://getemoji.com/. Try encoding them to see what they look like in bytes. After that take a look at the standard encodings that are built into Python. Try encoding your favorite emoji with different codecs (short for coder-decoder).

File Encodings

Files are made of raw bytes. When you open a file in text mode, Python 3 decodes it using a default encoding (UTF-8 on most systems, though the default depends on the platform), and the familiar functions like readline() return strings. That’s not always what you want. Placing the letter b in the file mode tells Python to return bytes instead of strings.

[1]:
with open('files/languages.txt', 'rb') as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    print(f.readline())
b'Afrikaans\n'
b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b\n'
b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0\n'
b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9\n'
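A line read in binary mode can be turned into a string by hand with decode. The sketch below copies the second line of output from above rather than reading the file, so it runs without files/languages.txt.

```python
# The second line of the output above, copied as a b-string.
raw_line = b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b\n'

text = raw_line.decode('utf-8')
print(text)

# 13 bytes hold only 5 characters: four 3-byte letters plus the newline.
print(len(raw_line), len(text))   # 13 5
```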

The b flag is important when you’re trying to read binary files like images or videos. If you’re trying to read a text file with an alternate encoding you can pass the encoding= keyword argument to open(). Here’s an example where the file encoding is explicitly set.

[3]:
with open('files/languages.txt', 'r', encoding='utf-8') as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    print(f.readline())
Afrikaans

አማርኛ

Аҧсшәа

العربية
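The encoding= argument works for writing too. As a rough sketch of an alternate encoding, the example below writes a word as Latin-1 and reads it back both ways. The temporary file and the name cafe.txt are made up for the illustration.

```python
import os
import tempfile

# A temporary file stands in for a real text file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Write the text using the Latin-1 encoding instead of UTF-8.
with open(path, 'w', encoding='latin-1') as f:
    f.write('caf\u00e9')

# Binary mode shows the raw bytes: the accented e is the single byte 0xe9.
with open(path, 'rb') as f:
    raw_bytes = f.read()
print(raw_bytes)   # b'caf\xe9'

# Text mode with the matching encoding gives back the original string.
with open(path, 'r', encoding='latin-1') as f:
    text = f.read()
print(text == 'caf\u00e9')   # True
```

Opening the same file with encoding='utf-8' would raise a UnicodeDecodeError, because the lone byte 0xe9 is not valid UTF-8.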

Exercise

Write a Python program that translates files from ASCII encoding to UTF-32 encoding.