
Chapter 8. Strings and Serialization
Before we get involved with higher-level design patterns, let's take a deep dive into one of Python's most common objects: the string. We'll see that there is a lot more to the string than meets the eye, and we'll also cover searching strings for patterns and serializing data for storage or transmission.
In particular, we'll visit:
- The complexities of strings, bytes, and byte arrays
- The ins and outs of string formatting
- A few ways to serialize data
- The mysterious regular expression
Strings
Strings are a basic primitive in Python; we've used them in nearly every example we've discussed so far. All they do is represent an immutable sequence of characters. However, though you may not have considered it before, "character" is a bit of an ambiguous word; can Python strings represent sequences of accented characters? Chinese characters? What about Greek, Cyrillic, or Farsi?
In Python 3, the answer is yes. Python strings are all represented in Unicode, a character definition standard that can represent virtually any character in any language on the planet (and some made-up languages and random characters as well). This is done seamlessly, for the most part. So, let's think of Python 3 strings as an immutable sequence of Unicode characters. So what can we do with this immutable sequence? We've touched on many of the ways strings can be manipulated in previous examples, but let's quickly cover it all in one place: a crash course in string theory!
String manipulation
As you know, strings can be created in Python by wrapping a sequence of characters in single or double quotes. Multiline strings can easily be created using three quote characters, and multiple hardcoded strings can be concatenated together by placing them side by side. Here are some examples:
a = "hello"
b = 'world'
c = '''a multiple
line string'''
d = """More
multiple"""
e = ("Three " "Strings "
     "Together")
That last string is automatically composed into a single string by the interpreter. It is also possible to concatenate strings using the + operator (as in "hello " + "world"). Of course, strings don't have to be hardcoded. They can also come from various outside sources, such as text files, user input, or bytes encoded on the network.
Tip
The automatic concatenation of adjacent strings can make for some hilarious bugs when a comma is missed. It is, however, extremely useful when a long string needs to be placed inside a function call without exceeding the 79 character line-length limit suggested by the Python style guide.
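To make the failure mode concrete, here is a small sketch (not from the original text) of the missing-comma bug; the filenames are invented for illustration:

```python
# Adjacent string literals are concatenated by the interpreter,
# so a missing comma silently merges two list items:
filenames = [
    "setup.py",
    "config.ini"   # <- the comma was forgotten here
    "readme.txt",
]
print(filenames)       # ['setup.py', 'config.inireadme.txt']
print(len(filenames))  # 2, not the 3 we intended
```

No exception is raised, so this bug can survive until something downstream chokes on the merged name.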
Like other sequences, strings can be iterated over (character by character), indexed, sliced, or concatenated. The syntax is the same as for lists.
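For instance, a quick sketch of those sequence operations on a throwaway string:

```python
s = "hello world"

for char in s[:5]:   # iteration, character by character
    print(char)      # h, e, l, l, o on separate lines

print(s[0])          # 'h'            (indexing)
print(s[6:])         # 'world'        (slicing)
print(s + "!")       # 'hello world!' (concatenation)
```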
The str class has numerous methods on it to make manipulating strings easier. The dir and help commands in the Python interpreter can tell us how to use all of them; we'll consider some of the more common ones directly.
Several Boolean convenience methods help us identify whether or not the characters in a string match a certain pattern. Most of these, such as isalpha, isupper/islower, and startswith/endswith, have obvious interpretations. The isspace method is also fairly obvious, but remember that all whitespace characters (including tab and newline) are considered, not just the space character.
The istitle method returns True if the first character of each word is capitalized and all other characters are lowercase. Note that it does not strictly follow the English grammatical definition of title formatting. For example, Leigh Hunt's poem "The Glove and the Lions" is a valid English title, even though not all words are capitalized, so istitle rejects it. Robert Service's "The Cremation of Sam McGee" is also a valid title, even though there is an uppercase letter in the middle of the last word, so it fails the check too.
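We can confirm this behavior on the two titles just mentioned:

```python
print("The Glove and the Lions".istitle())     # False: 'and' and 'the' are lowercase
print("The Cremation of Sam McGee".istitle())  # False: uppercase 'G' inside 'McGee'
print("Hello World".istitle())                 # True
```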
Be careful with the isdigit, isdecimal, and isnumeric methods, as they are more nuanced than you would expect. Many Unicode characters are considered numbers besides the ten digits we are used to. Worse, the period character that we use to construct floats from strings is not considered a decimal character, so '45.2'.isdecimal() returns False. One character that is considered decimal is the Arabic-Indic digit zero, Unicode value 0660, as in 45٠2 (or 45\u06602). Further, because these methods test individual characters rather than whole strings, they do not verify whether the strings are valid numbers; '45.2' fails isdecimal() even though float() accepts it happily, while a string of superscript digits such as '⁴⁵' passes isdigit() even though the int() constructor rejects it. We might think we should use that Arabic-Indic character instead of a period for all numeric quantities, but passing it into the float() or int() constructor simply converts it to an ordinary zero:
>>> float('45\u06602')
4502.0
Other methods useful for pattern matching do not return Booleans. The count method tells us how many times a given substring shows up in the string, while find, index, rfind, and rindex tell us the position of a given substring within the original string. The two 'r' (for 'right' or 'reverse') methods start searching from the end of the string. The find methods return -1 if the substring can't be found, while index raises a ValueError in this situation. Have a look at some of these methods in action:
>>> s = "hello world"
>>> s.count('l')
3
>>> s.find('l')
2
>>> s.rindex('m')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
Most of the remaining string methods return transformations of the string. The upper, lower, capitalize, and title methods create new strings with all alphabetic characters in the given format. The translate method can use a dictionary to map arbitrary input characters to specified output characters.

For all of these methods, note that the input string remains unmodified; a brand new str instance is returned instead. If we need to manipulate the resultant string, we should assign it to a new variable, as in new_value = value.capitalize(). Often, once we've performed the transformation, we don't need the old value anymore, so a common idiom is to assign it to the same variable, as in value = value.title().
Finally, a couple of string methods return or operate on lists. The split method accepts a substring and splits the string into a list of strings wherever that substring occurs. You can pass a number as a second parameter to limit the number of resultant strings. The rsplit method behaves identically to split if you don't limit the number of strings, but if you do supply a limit, it starts splitting from the end of the string. The partition and rpartition methods split the string at only the first or last occurrence of the substring, and return a tuple of three values: the characters before the substring, the substring itself, and the characters after the substring.
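A quick sketch contrasting these methods (the filename is invented for illustration):

```python
path = "archive.tar.gz"

print(path.split(".", 1))    # ['archive', 'tar.gz']       limit applied from the left
print(path.rsplit(".", 1))   # ['archive.tar', 'gz']       limit applied from the right
print(path.partition("."))   # ('archive', '.', 'tar.gz')  first occurrence only
print(path.rpartition("."))  # ('archive.tar', '.', 'gz')  last occurrence only
```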
As the inverse of split, the join method accepts a list of strings, and returns all of those strings combined together by placing the original string between them. The replace method accepts two arguments, and returns a string where each instance of the first argument has been replaced with the second. Here are some of these methods in action:
>>> s = "hello world, how are you"
>>> s2 = s.split(' ')
>>> s2
['hello', 'world,', 'how', 'are', 'you']
>>> '#'.join(s2)
'hello#world,#how#are#you'
>>> s.replace(' ', '**')
'hello**world,**how**are**you'
>>> s.partition(' ')
('hello', ' ', 'world, how are you')
There you have it, a whirlwind tour of the most common methods on the str class! Now, let's look at Python 3's method for composing strings and variables to create new strings.
String formatting
Python 3 has a powerful string formatting and templating mechanism that allows us to construct strings composed of hardcoded text and interspersed variables. We've used it in many previous examples, but it is much more versatile than the simple formatting specifiers we've used.
Any string can be turned into a format string by calling the format() method on it. This method returns a new string where specific characters in the input string have been replaced with values provided as arguments and keyword arguments passed into the function. The format method does not require a fixed set of arguments; internally, it uses the *args and **kwargs syntax that we discussed in Chapter 7, Python Object-oriented Shortcuts.

The special characters that are replaced in formatted strings are the opening and closing brace characters: { and }. We can insert pairs of these in a string and they will be replaced, in order, by any positional arguments passed to the str.format method:
template = "Hello {}, you are currently {}."
print(template.format('Dusty', 'writing'))
If we run these statements, it replaces the braces with variables, in order:
Hello Dusty, you are currently writing.
This basic syntax is not terribly useful if we want to reuse variables within one string or decide to use them in a different position. We can place zero-indexed integers inside the curly braces to tell the formatter which positional variable gets inserted at a given position in the string. Let's repeat the name:
template = "Hello {0}, you are {1}. Your name is {0}."
print(template.format('Dusty', 'writing'))
If we use these integer indexes, we have to use them for all the variables. We can't mix empty braces with positional indexes. For example, this code fails with a ValueError exception:
template = "Hello {}, you are {}. Your name is {0}."
print(template.format('Dusty', 'writing'))
Escaping braces
Brace characters are often useful in strings, aside from formatting. We need a way to escape them in situations where we want them to be displayed as themselves, rather than being replaced. This can be done by doubling the braces. For example, we can use Python to format a basic Java program:
template = """
public class {0} {{
    public static void main(String[] args) {{
        System.out.println("{1}");
    }}
}}"""
print(template.format("MyClass", "print('hello world')"))
Wherever we see the {{ or }} sequence in the template (that is, the braces enclosing the Java class and method definition), we know the format method will replace them with single braces, rather than some argument passed into the format method. Here's the output:
public class MyClass {
    public static void main(String[] args) {
        System.out.println("print('hello world')");
    }
}
The class name and contents of the output have been replaced with two parameters, while the double braces have been replaced with single braces, giving us a valid Java file. Turns out, this is about the simplest possible Python program to print the simplest possible Java program that can print the simplest possible Python program!
Keyword arguments
If we're formatting complex strings, it can become tedious to remember the order of the arguments or to update the template if we choose to insert a new argument. The format method therefore allows us to specify names inside the braces instead of numbers. The named variables are then passed to the format method as keyword arguments:
template = """
From: <{from_email}>
To: <{to_email}>
Subject: {subject}
{message}"""
print(template.format(
    from_email="a@example.com",
    to_email="b@example.com",
    message="Here's some mail for you. "
            "Hope you enjoy the message!",
    subject="You have mail!"
))
We can also mix index and keyword arguments (as with all Python function calls, the keyword arguments must follow the positional ones). We can even mix unlabeled positional braces with keyword arguments:
print("{} {label} {}".format("x", "y", label="z"))
As expected, this code outputs:
x z y
Container lookups
We aren't restricted to passing simple string variables into the format method. Any primitive, such as an integer or float, can be printed. More interestingly, complex objects, including lists, tuples, dictionaries, and arbitrary objects can be used, and we can access indexes and variables (but not methods) on those objects from within the format string.
For example, if our e-mail message had grouped the from and to e-mail addresses into a tuple, and placed the subject and message in a dictionary, for some reason (perhaps because that's the input required by an existing send_mail function we want to use), we can format it like this:
emails = ("a@example.com", "b@example.com")
message = {
    'subject': "You Have Mail!",
    'message': "Here's some mail for you!"
}
template = """
From: <{0[0]}>
To: <{0[1]}>
Subject: {message[subject]}
{message[message]}"""
print(template.format(emails, message=message))
The variables inside the braces in the template string look a little weird, so let's look at what they're doing. We have passed one argument as a position-based parameter and one as a keyword argument. The two e-mail addresses are looked up by 0[x], where x is either 0 or 1. The initial zero represents, as with other position-based arguments, the first positional argument passed to format (the emails tuple, in this case).

The square brackets with a number inside are the same kind of index lookup we see in regular Python code, so 0[0] maps to emails[0], in the emails tuple. The indexing syntax works with any indexable object, so we see similar behavior when we access message[subject], except this time we are looking up a string key in a dictionary. Notice that unlike in Python code, we do not need to put quotes around the string in the dictionary lookup.
We can even do multiple levels of lookup if we have nested data structures. I would recommend against doing this often, as template strings rapidly become difficult to understand. If we have a dictionary that contains a tuple, we can do this:
emails = ("a@example.com", "b@example.com")
message = {
    'emails': emails,
    'subject': "You Have Mail!",
    'message': "Here's some mail for you!"
}
template = """
From: <{0[emails][0]}>
To: <{0[emails][1]}>
Subject: {0[subject]}
{0[message]}"""
print(template.format(message))
Object lookups
Indexing makes format lookup powerful, but we're not done yet! We can also pass arbitrary objects as parameters, and use dot notation to look up attributes on those objects. Let's change our e-mail message data once again, this time to a class:
class EMail:
    def __init__(self, from_addr, to_addr, subject, message):
        self.from_addr = from_addr
        self.to_addr = to_addr
        self.subject = subject
        self.message = message

email = EMail("a@example.com", "b@example.com",
              "You Have Mail!",
              "Here's some mail for you!")

template = """
From: <{0.from_addr}>
To: <{0.to_addr}>
Subject: {0.subject}
{0.message}"""
print(template.format(email))
The template in this example may be more readable than the previous examples, but the overhead of creating an e-mail class adds complexity to the Python code. It would be foolish to create a class for the express purpose of including the object in a template. Typically, we'd use this sort of lookup if the object we are trying to format already exists. This is true of all the examples; if we have a tuple, list, or dictionary, we'll pass it into the template directly. Otherwise, we'd just create a simple set of positional and keyword arguments.
Making it look right
It's nice to be able to include variables in template strings, but sometimes the variables need a bit of coercion to make them look right in the output. For example, if we are doing calculations with currency, we may end up with a long decimal that we don't want to show up in our template:
subtotal = 12.32
tax = subtotal * 0.07
total = subtotal + tax
print("Sub: ${0} Tax: ${1} Total: ${total}".format(
    subtotal, tax, total=total))
If we run this formatting code, the output doesn't quite look like proper currency:
Sub: $12.32 Tax: $0.8624 Total: $13.182400000000001
Note
Technically, we should never use floating-point numbers in currency calculations like this; we should construct decimal.Decimal() objects instead. Floats are dangerous because their calculations are inherently inaccurate beyond a specific level of precision. But we're looking at strings, not floats, and currency is a great example for formatting!
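For completeness, here is what the same calculation might look like with Decimal; note that we construct the values from strings so that no float inaccuracy creeps in:

```python
from decimal import Decimal

# Constructing from strings keeps the values exact
subtotal = Decimal("12.32")
tax = subtotal * Decimal("0.07")
total = subtotal + tax
print(total)   # 13.1824, exact, with no floating-point noise
```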
To fix the preceding format string, we can include some additional information inside the curly braces to adjust the formatting of the parameters. There are tons of things we can customize, but the basic syntax inside the braces is the same; first, we use whichever of the earlier layouts (positional, keyword, index, attribute access) is suitable to specify the variable that we want to place in the template string. We follow this with a colon, and then the specific syntax for the formatting. Here's an improved version:
print("Sub: ${0:0.2f} Tax: ${1:0.2f} "
      "Total: ${total:0.2f}".format(
          subtotal, tax, total=total))
The 0.2f format specifier after the colons basically says, from left to right: for values lower than one, make sure a zero is displayed on the left side of the decimal point; show two places after the decimal; format the input value as a float.
We can also specify that each number should take up a particular number of characters on the screen by placing a value before the period in the precision. This can be useful for outputting tabular data, for example:
orders = [('burger', 2, 5),
          ('fries', 3.5, 1),
          ('cola', 1.75, 3)]

print("PRODUCT    QUANTITY    PRICE    SUBTOTAL")
for product, price, quantity in orders:
    subtotal = price * quantity
    print("{0:10s}{1: ^9d}    ${2: <8.2f}${3: >7.2f}".format(
        product, quantity, price, subtotal))
Ok, that's a pretty scary looking format string, so let's see how it works before we break it down into understandable parts:
PRODUCT    QUANTITY    PRICE    SUBTOTAL
burger        5        $2.00    $  10.00
fries         1        $3.50    $   3.50
cola          3        $1.75    $   5.25
Nifty! So, how is this actually happening? We have four variables we are formatting in each line of the for loop. The first variable is a string and is formatted with {0:10s}. The s means it is a string variable, and the 10 means it should take up ten characters. By default, with strings, if the string is shorter than the specified number of characters, it appends spaces to the right side of the string to make it long enough (beware, however: if the original string is too long, it won't be truncated!). We can change this behavior (to fill with other characters or change the alignment in the format string), as we do for the next value, quantity.
The formatter for the quantity value is {1: ^9d}. The d represents an integer value, and the 9 tells us the value should take up nine characters. By default, integers are right-aligned within that space, padded with spaces on the left, which would look odd next to the left-aligned product names. So we explicitly specify a space (immediately after the colon) as the fill character, and the caret character ^ tells us that the number should be aligned in the center of this available padding; this makes the column look a bit more professional. The specifiers have to be in the right order, although all are optional: fill first, then align, then the size, and finally, the type.
We do similar things with the specifiers for price and subtotal. For price, we use {2: <8.2f} and for subtotal, {3: >7.2f}. In both cases, we're specifying a space as the fill character, but we use the < and > symbols, respectively, to represent that the numbers should be aligned to the left or right within the minimum space of eight or seven characters. Further, each float should be formatted to two decimal places.
The "type" character for different types can affect formatting output as well. We've seen the s, d, and f types, for strings, integers, and floats. Most of the other format specifiers are alternative versions of these; for example, o represents octal format and X represents hexadecimal for integers. The n type specifier can be useful for formatting integer separators in the current locale's format. For floating-point numbers, the % type will multiply by 100 and format a float as a percentage.
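A few of these type characters in action (the n specifier is omitted here because its output depends on the current locale):

```python
value = 255
print("{0:o}".format(value))   # '377'          octal
print("{0:X}".format(value))   # 'FF'           uppercase hexadecimal
print("{0:%}".format(0.175))   # '17.500000%'   float as a percentage
```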
While these standard formatters apply to most built-in objects, it is also possible for other objects to define nonstandard specifiers. For example, if we pass a datetime object into format, we can use the specifiers understood by the datetime.strftime function, as follows:
import datetime
print("{0:%Y-%m-%d %I:%M%p }".format(
    datetime.datetime.now()))
It is even possible to write custom formatters for objects we create ourselves, but that is beyond the scope of this book. Look into overriding the __format__ special method if you need to do this in your code. The most comprehensive instructions can be found in PEP 3101 at http://www.python.org/dev/peps/pep-3101/, although the details are a bit dry. You can find more digestible tutorials using a web search.
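To give a taste of what this looks like, here is a minimal sketch; the Distance class and its 'km' specifier are invented purely for illustration:

```python
class Distance:
    """A hypothetical value object with its own format specifier."""
    def __init__(self, meters):
        self.meters = meters

    def __format__(self, spec):
        # Treat 'km' as our custom specifier; anything else is
        # delegated to the ordinary numeric formatting machinery.
        if spec == "km":
            return "{0:.1f} km".format(self.meters / 1000)
        return format(self.meters, spec)

print("{0:km}".format(Distance(1500)))   # 1.5 km
print("{0:.0f}".format(Distance(1500)))  # 1500
```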
The Python formatting syntax is quite flexible but it is a difficult mini-language to remember. I use it every day and still occasionally have to look up forgotten concepts in the documentation. It also isn't powerful enough for serious templating needs, such as generating web pages. There are several third-party templating libraries you can look into if you need to do more than basic formatting of a few strings.
Strings are Unicode
At the beginning of this section, we defined strings as collections of immutable Unicode characters. This actually makes things very complicated at times, because Unicode isn't really a storage format. If you get a string of bytes from a file or a socket, for example, they won't be in Unicode. They will, in fact, be the built-in type bytes. Bytes are immutable sequences of... well, bytes. Bytes are the lowest-level storage format in computing. They represent 8 bits, usually described as an integer between 0 and 255, or a hexadecimal equivalent between 0 and FF. Bytes don't represent anything specific; a sequence of bytes may store characters of an encoded string, or pixels in an image.
If we print a bytes object, any bytes that map to ASCII representations will be printed as their original character, while non-ASCII bytes (whether they are binary data or other characters) are printed as hex codes escaped by the \x escape sequence. You may find it odd that a byte, represented as an integer, can map to an ASCII character. But ASCII is really just a code where each letter is represented by a different byte pattern, and therefore, a different integer. The character "a" is represented by the same byte as the integer 97, which is the hexadecimal number 0x61. Specifically, all of these are an interpretation of the binary pattern 01100001.
Many I/O operations only know how to deal with bytes, even if the bytes object refers to textual data. It is therefore vital to know how to convert between bytes and Unicode.
The problem is that there are many ways to map bytes to Unicode text. Bytes are machine-readable values, while text is a human-readable format. Sitting in between is an encoding that maps a given sequence of bytes to a given sequence of text characters.
However, there are multiple such encodings (ASCII is only one of them). The same sequence of bytes represents completely different text characters when mapped using different encodings! So, bytes must be decoded using the same character set with which they were encoded. It's not possible to get text from bytes without knowing how the bytes should be decoded. If we receive unknown bytes without a specified encoding, the best we can do is guess what format they are encoded in, and we may be wrong.
Converting bytes to text
If we have an array of bytes from somewhere, we can convert it to Unicode using the .decode method on the bytes class. This method accepts a string for the name of the character encoding. There are many such names; common ones for Western languages include ASCII, UTF-8, and latin-1.
The sequence of bytes (in hex), 63 6c 69 63 68 e9, actually represents the characters of the word cliché in the latin-1 encoding. The following example defines this sequence of bytes and decodes it to a Unicode string using the latin-1 encoding:
characters = b'\x63\x6c\x69\x63\x68\xe9'
print(characters)
print(characters.decode("latin-1"))
The first line creates a bytes object; the b character immediately before the opening quote tells us that we are defining a bytes object instead of a normal Unicode string. Within the string, each byte is specified using, in this case, a hexadecimal number. The \x escape sequence within the byte string says, "the next two characters represent a byte using hexadecimal digits."
Provided we are using a shell that understands the latin-1 encoding, the two print calls will output the following strings:
b'clich\xe9'
cliché
The first print statement renders the bytes for ASCII characters as themselves. The unknown (unknown to ASCII, that is) character stays in its escaped hex format. The output includes a b character at the beginning of the line to remind us that it is a bytes representation, not a string.
The next call decodes the string using the latin-1 encoding. The decode method returns a normal (Unicode) string with the correct characters. However, if we had decoded this same string using the Cyrillic "iso8859-5" encoding, we'd have ended up with the string 'clichщ'! This is because the \xe9 byte maps to different characters in the two encodings.
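We can see this for ourselves by decoding the same byte sequence twice:

```python
characters = b'\x63\x6c\x69\x63\x68\xe9'
print(characters.decode("latin-1"))    # cliché
print(characters.decode("iso8859-5"))  # clichщ -- same bytes, wrong alphabet
```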
Converting text to bytes
If we need to convert incoming bytes into Unicode, clearly we're also going to have situations where we convert outgoing Unicode into byte sequences. This is done with the encode method on the str class, which, like the decode method, requires a character set. The following code creates a Unicode string and encodes it in different character sets:
characters = "cliché"
print(characters.encode("UTF-8"))
print(characters.encode("latin-1"))
print(characters.encode("CP437"))
print(characters.encode("ascii"))
The first three encodings create a different set of bytes for the accented character. The fourth one can't even handle that byte:
b'clich\xc3\xa9'
b'clich\xe9'
b'clich\x82'
Traceback (most recent call last):
  File "1261_10_16_decode_unicode.py", line 5, in <module>
    print(characters.encode("ascii"))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 5: ordinal not in range(128)
Do you understand the importance of encoding now? The accented character is represented as a different byte for each encoding; if we use the wrong one when we are decoding bytes to text, we get the wrong character.
The exception in the last case is not always the desired behavior; there may be cases where we want the unknown characters to be handled in a different way. The encode method takes an optional string argument named errors that can define how such characters should be handled. This string can be one of the following:
- strict
- replace
- ignore
- xmlcharrefreplace
The strict strategy is the default we just saw: when a character is encountered that has no valid representation in the requested encoding, an exception is raised. When the replace strategy is used, the character is replaced with a different character; in ASCII, it is a question mark; other encodings may use different symbols, such as an empty box. The ignore strategy simply discards any characters it doesn't understand, while the xmlcharrefreplace strategy creates an xml entity representing the Unicode character. This can be useful when converting unknown strings for use in an XML document. Here's how each of the strategies affects our sample word:
strict: raises UnicodeEncodeError
replace: b'clich?'
ignore: b'clich'
xmlcharrefreplace: b'clich&#233;'
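These results can be reproduced with a short loop; this sketch encodes our sample word to ASCII with each of the non-strict strategies, then shows strict raising its exception:

```python
characters = "cliché"

# Each non-strict strategy handles the accented character differently
for strategy in ["replace", "ignore", "xmlcharrefreplace"]:
    print(strategy, characters.encode("ascii", errors=strategy))

# The default strict strategy raises an exception instead
try:
    characters.encode("ascii", errors="strict")
except UnicodeEncodeError as error:
    print("strict raised UnicodeEncodeError:", error.reason)
```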
It is possible to call the str.encode and bytes.decode methods without passing an encoding string. The encoding will be set to the default encoding for the current platform. This depends on the operating system and locale or regional settings; you can look it up using the sys.getdefaultencoding() function. It is usually a good idea to specify the encoding explicitly, though, since the default encoding for a platform may change, or the program may one day be extended to work on text from a wider variety of sources.
If you are encoding text and don't know which encoding to use, it is best to use the UTF-8 encoding. UTF-8 is able to represent any Unicode character. In modern software, it is a de facto standard encoding to ensure documents in any language—or even multiple languages—can be exchanged. The various other possible encodings are useful for legacy documents or in regions that still use different character sets by default.
The UTF-8 encoding uses one byte to represent ASCII and other common characters, and up to four bytes for more complex characters. UTF-8 is special because it is backwards-compatible with ASCII; any ASCII document encoded using UTF-8 will be identical to the original ASCII document.
Tip
I can never remember whether to use encode or decode to convert from binary bytes to Unicode. I always wished these methods were named "to_binary" and "from_binary" instead. If you have the same problem, try mentally replacing the word "code" with "binary"; "enbinary" and "debinary" are pretty close to "to_binary" and "from_binary". I have saved a lot of time by not looking up the method help files since devising this mnemonic.
Mutable byte strings
The bytes type, like str, is immutable. We can use index and slice notation on a bytes object and search for a particular sequence of bytes, but we can't extend or modify them. This can be very inconvenient when dealing with I/O, as it is often necessary to buffer incoming or outgoing bytes until they are ready to be sent. For example, if we are receiving data from a socket, it may take several recv calls before we have received an entire message.
This is where the bytearray built-in comes in. This type behaves something like a list, except it only holds bytes. The constructor for the class can accept a bytes object to initialize it. The extend method can be used to append another bytes object to the existing array (for example, when more data comes from a socket or other I/O channel).
Slice notation can be used on a bytearray to modify items inline. For example, this code constructs a bytearray from a bytes object and then replaces two bytes:
b = bytearray(b"abcdefgh")
b[4:6] = b"\x15\xa3"
print(b)
The output looks like this:
bytearray(b'abcd\x15\xa3gh')
Be careful; if we want to manipulate a single element in the bytearray, we must pass an integer between 0 and 255 inclusive as the value. This integer represents a specific byte pattern. If we try to pass a character or bytes object, it will raise an exception.
A single byte character can be converted to an integer using the ord (short for ordinal) function. This function returns the integer representation of a single character:
b = bytearray(b'abcdef')
b[3] = ord(b'g')
b[4] = 68
print(b)
The output looks like this:
bytearray(b'abcgDf')
After constructing the array, we replace the character at index 3 (the fourth character, as indexing starts at 0, as with lists) with byte 103. This integer was returned by the ord function and is the ASCII code for the lowercase g. For illustration, we also replaced the next character up with the byte number 68, which maps to the ASCII character for the uppercase D.
The bytearray type has methods that allow it to behave like a list (we can append integer bytes to it, for example), but also like a bytes object; we can use methods like count and find the same way they would behave on a bytes or str object. The difference is that bytearray is a mutable type, which can be useful for building up complex sequences of bytes from a specific input source.
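For instance, a quick sketch mixing list-style and bytes-style methods on one bytearray:

```python
b = bytearray(b"hello world")

b.append(ord(b"!"))      # list-like: append a single integer byte
print(b.count(b"l"))     # 3    bytes-like searching still works
print(b.find(b"world"))  # 6
print(b)                 # bytearray(b'hello world!')
```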
Regular expressions
You know what's really hard to do using object-oriented principles? Parsing strings to match arbitrary patterns, that's what. There have been a fair number of academic papers written in which object-oriented design is used to set up string parsing, but the result is always very verbose and hard to read, and they are not widely used in practice.
In the real world, string parsing in most programming languages is handled by regular expressions. These are not verbose, but, boy, are they ever hard to read, at least until you learn the syntax. Even though regular expressions are not object oriented, the Python regular expression library provides a few classes and objects that you can use to construct and run regular expressions.
Regular expressions are used to solve a common problem: Given a string, determine whether that string matches a given pattern and, optionally, collect substrings that contain relevant information. They can be used to answer questions like:
- Is this string a valid URL?
- What is the date and time of all warning messages in a log file?
- Which users in /etc/passwd are in a given group?
- What username and document were requested by the URL a visitor typed?
There are many similar scenarios where regular expressions are the correct answer. Many programmers have made the mistake of implementing complicated and fragile string parsing libraries because they didn't know or wouldn't learn regular expressions. In this section, we'll gain enough knowledge of regular expressions to not make such mistakes!
Matching patterns
Regular expressions are a complicated mini-language. They rely on special characters to match unknown strings, but let's start with literal characters, such as letters, numbers, and the space character, which always match themselves. Let's see a basic example:
import re

search_string = "hello world"
pattern = "hello world"
match = re.match(pattern, search_string)
if match:
    print("regex matches")
The Python Standard Library module for regular expressions is called re
. We import it and set up a search string and pattern to search for; in this case, they are the same string. Since the search string matches the given pattern, the conditional passes and the print
statement executes.
Bear in mind that the match
function matches the pattern to the beginning of the string. Thus, if the pattern were "ello world"
, no match would be found. With confusing asymmetry, the parser stops searching as soon as it finds a match, so the pattern "hello wo"
matches successfully. Let's build a small example program to demonstrate these differences and help us learn other regular expression syntax:
import sys
import re
pattern = sys.argv[1]
search_string = sys.argv[2]
match = re.match(pattern, search_string)
if match:
template = "'{}' matches pattern '{}'"
else:
template = "'{}' does not match pattern '{}'"
print(template.format(search_string, pattern))
This is just a generic version of the earlier example that accepts the pattern and search string from the command line. In the following command-line interaction, we can see that the start of the pattern must match, but that a value is returned as soon as a match is found:
$ python regex_generic.py "hello worl" "hello world"
'hello world' matches pattern 'hello worl'
$ python regex_generic.py "ello world" "hello world"
'hello world' does not match pattern 'ello world'
We'll be using this script throughout the next few sections. While the script is always invoked with the command line python regex_generic.py "<pattern>" "<string>"
, we'll only see the output in the following examples, to conserve space.
If you need control over whether items happen at the beginning or end of a line (or if there are no newlines in the string, at the beginning and end of the string), you can use the ^
and $
characters to represent the start and end of the string respectively. If you want a pattern to match an entire string, it's a good idea to include both of these:
'hello world' matches pattern '^hello world$'
'hello worl' does not match pattern '^hello world$'
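The same anchoring can be seen from a Python session directly; re.fullmatch (available since Python 3.4) is a shortcut for requiring the pattern to cover the whole string:

```python
import re

# match only anchors the start of the string
print(bool(re.match("hello", "hello world")))          # True
# ^ and $ together require the whole string to match
print(bool(re.match("^hello world$", "hello worl")))   # False
# fullmatch is equivalent to wrapping the pattern in ^...$
print(bool(re.fullmatch("hello world", "hello world")))  # True
```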
Matching a selection of characters
Let's start with matching an arbitrary character. The period character, when used in a regular expression pattern, can match any single character. Using a period in the string means you don't care what the character is, just that there is a character there. For example:
'hello world' matches pattern 'hel.o world'
'helpo world' matches pattern 'hel.o world'
'hel o world' matches pattern 'hel.o world'
'helo world' does not match pattern 'hel.o world'
Notice how the last example does not match because there is no character at the period's position in the pattern.
That's all well and good, but what if we only want a few specific characters to match? We can put a set of characters inside square brackets to match any one of those characters. So if we encounter the string [abc]
in a regular expression pattern, we know that those five (including the two square brackets) characters will only match one character in the string being searched, and further, that this one character will be either an a
, a b
, or a c
. See a few examples:
'hello world' matches pattern 'hel[lp]o world'
'helpo world' matches pattern 'hel[lp]o world'
'helPo world' does not match pattern 'hel[lp]o world'
These square bracket sets should be named character sets, but they are more often referred to as character classes. Often, we want to include a large range of characters inside these sets, and typing them all out can be monotonous and error-prone. Fortunately, the regular expression designers thought of this and gave us a shortcut. The dash character, in a character set, will create a range. This is especially useful if you want to match "all lower case letters", "all letters", or "all numbers" as follows:
'hello world' does not match pattern 'hello [a-z] world'
'hello b world' matches pattern 'hello [a-z] world'
'hello B world' matches pattern 'hello [a-zA-Z] world'
'hello 2 world' matches pattern 'hello [a-zA-Z0-9] world'
There are other ways to match or exclude individual characters, but you'll need to find a more comprehensive tutorial via a web search if you want to find out what they are!
Escaping characters
If putting a period character in a pattern matches any arbitrary character, how do we match just a period in a string? One way might be to put the period inside square brackets to make a character class, but a more generic method is to use backslashes to escape it. Here's a regular expression to match two digit decimal numbers between 0.00 and 0.99:
'0.05' matches pattern '0\.[0-9][0-9]'
'005' does not match pattern '0\.[0-9][0-9]'
'0,05' does not match pattern '0\.[0-9][0-9]'
For this pattern, the two characters \.
match the single .
character. If the period character is missing or is a different character, it does not match.
This backslash escape sequence is used for a variety of special characters in regular expressions. You can use \[
to insert a square bracket without starting a character class, and \(
to insert a parenthesis, which we'll later see is also a special character.
More interestingly, we can also use the escape symbol followed by a character to represent special characters such as newlines (\n
), and tabs (\t
). Further, some character classes can be represented more succinctly using escape strings; \s
represents whitespace characters, \w
represents letters, numbers, and underscore, and \d
represents a digit:
'(abc]' matches pattern '\(abc\]'
' 1a' matches pattern '\s\d\w'
'\t5n' does not match pattern '\s\d\w'
'5n' does not match pattern '\s\d\w'
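One practical wrinkle worth noting here: Python string literals have their own backslash escapes, so patterns containing backslashes are conventionally written as raw strings. A small sketch of the difference:

```python
import re

# In an ordinary string, every backslash meant for the regular
# expression engine must itself be escaped.
assert re.match("\\d\\d", "42")

# A raw string passes backslashes through untouched, so the pattern
# reads the way it is meant.
assert re.match(r"\d\d", "42")
```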
Matching multiple characters
With this information, we can match most strings of a known length, but most of the time we don't know how many characters to match inside a pattern. Regular expressions can take care of this, too. We can modify a pattern by appending one of several hard-to-remember punctuation symbols to match multiple characters.
The asterisk (*
) character says that the previous pattern can be matched zero or more times. This probably sounds silly, but it's one of the most useful repetition characters. Before we explore why, consider some silly examples to make sure we understand what it does:
'hello' matches pattern 'hel*o'
'heo' matches pattern 'hel*o'
'helllllo' matches pattern 'hel*o'
So, the *
character in the pattern says that the previous pattern (the l
character) is optional, and if present, can be repeated as many times as possible to match the pattern. The rest of the characters (h
, e
, and o
) have to appear exactly once.
It's pretty rare to want to match a single letter multiple times, but it gets more interesting if we combine the asterisk with patterns that match multiple characters. .*
, for example, will match any string, whereas [a-z]*
matches any sequence of lowercase letters, including the empty string.
For example:
'A string.' matches pattern '[A-Z][a-z]* [a-z]*\.'
'No .' matches pattern '[A-Z][a-z]* [a-z]*\.'
'' matches pattern '[a-z]*.*'
The plus (+
) sign in a pattern behaves similarly to an asterisk; it states that the previous pattern can be repeated one or more times, but, unlike the asterisk, it is not optional. The question mark (?) ensures a pattern shows up zero or one time, but not more. Let's explore some of these by playing with numbers (remember that \d
matches the same character class as [0-9]
):
'0.4' matches pattern '\d+\.\d+'
'1.002' matches pattern '\d+\.\d+'
'1.' does not match pattern '\d+\.\d+'
'1%' matches pattern '\d?\d%'
'99%' matches pattern '\d?\d%'
'999%' does not match pattern '\d?\d%'
Grouping patterns together
So far we've seen how we can repeat a pattern multiple times, but we are restricted in what patterns we can repeat. If we want to repeat individual characters, we're covered, but what if we want a repeating sequence of characters? Enclosing any set of patterns in parentheses allows them to be treated as a single pattern when applying repetition operations. Compare these patterns:
'abccc' matches pattern 'abc{3}'
'abccc' does not match pattern '(abc){3}'
'abcabcabc' matches pattern '(abc){3}'
Combined with complex patterns, this grouping feature greatly expands our pattern-matching repertoire. Here's a regular expression that matches simple English sentences:
'Eat.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
'Eat more good food.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
'A good meal.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
The first word starts with a capital, followed by zero or more lowercase letters. Then, we enter a parenthetical that matches a single space followed by a word of one or more lowercase letters. This entire parenthetical is repeated zero or more times, and the pattern is terminated with a period. There cannot be any other characters after the period, as indicated by the $
matching the end of string.
We've seen many of the most basic patterns, but the regular expression language supports many more. I spent my first few years using regular expressions looking up the syntax every time I needed to do something. It is worth bookmarking Python's documentation for the re
module and reviewing it frequently. There are very few things that regular expressions cannot match, and they should be the first tool you reach for when parsing strings.
Getting information from regular expressions
Let's now focus on the Python side of things. The regular expression syntax is the furthest thing from object-oriented programming. However, Python's re
module provides an object-oriented interface to enter the regular expression engine.
We've been checking whether the re.match
function returns a valid object or not. If a pattern does not match, that function returns None
. If it does match, however, it returns a useful object that we can introspect for information about the pattern.
So far, our regular expressions have answered questions such as "Does this string match this pattern?" Matching patterns is useful, but in many cases, a more interesting question is, "If this string matches this pattern, what is the value of a relevant substring?" If you use groups to identify parts of the pattern that you want to reference later, you can get them out of the match return value as illustrated in the next example:
import re

pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"
search_string = "some.user@example.com"
match = re.match(pattern, search_string)

if match:
    domain = match.groups()[0]
    print(domain)
The specification describing valid e-mail addresses is extremely complicated, and the regular expression that accurately matches all possibilities is obscenely long. So we cheated and made a simple regular expression that matches some common e-mail addresses; the point is that we want to access the domain name (after the @
sign) so we can connect to that address. This is done easily by wrapping that part of the pattern in parentheses and calling the groups()
method on the object returned by match.
The groups
method returns a tuple of all the groups matched inside the pattern, which you can index to access a specific value. The groups are ordered from left to right. However, bear in mind that groups can be nested, meaning you can have one or more groups inside another group. In this case, the groups are returned in the order of their opening parentheses, so the outermost group will be returned before its inner matching groups.
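A small illustration of this ordering, using a made-up pattern with two groups nested inside an outer one:

```python
import re

# The outer group opens first, so it comes first in groups(),
# followed by the two inner groups from left to right.
match = re.match(r"((\d+)-(\d+))", "12-34")
if match:
    print(match.groups())  # ('12-34', '12', '34')
```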
In addition to the match function, the re
module provides a couple other useful functions, search
, and findall
. The search
function finds the first instance of a matching pattern, relaxing the restriction that the pattern start at the first letter of the string. Note that you can get a similar effect by using match and putting a ^.*
characters at the front of the pattern to match any characters between the start of the string and the pattern you are looking for.
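The difference between the two functions can be sketched in a few lines (the strings here are arbitrary):

```python
import re

# match anchors at the start of the string, so this fails
print(re.match("world", "hello world"))  # None

# search scans forward until it finds the pattern
found = re.search("world", "hello world")
print(found.group())             # world
print(found.start(), found.end())  # 6 11
```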
The findall
function behaves similarly to search, except that it finds all non-overlapping instances of the matching pattern, not just the first one. Basically, it finds the first match, then it resets the search to the end of that matching string and finds the next one.
Instead of returning a list of match objects, as you would expect, it returns a list of matching strings. Or tuples. Sometimes it's strings, sometimes it's tuples. It's not a very good API at all! As with all bad APIs, you'll have to memorize the differences and not rely on intuition. The type of the return value depends on the number of bracketed groups inside the regular expression:
- If there are no groups in the pattern,
re.findall
will return a list of strings, where each value is a complete substring from the source string that matches the pattern - If there is exactly one group in the pattern,
re.findall
will return a list of strings where each value is the contents of that group - If there are multiple groups in the pattern, then
re.findall
will return a list of tuples where each tuple contains a value from a matching group, in order
Note
When you are designing function calls in your own Python libraries, try to make the function always return a consistent data structure. It is often good to design functions that can take arbitrary inputs and process them, but the return value should not switch from single value to a list, or a list of values to a list of tuples depending on the input. Let re.findall
be a lesson!
The examples in the following interactive session will hopefully clarify the differences:
>>> import re
>>> re.findall('a.', 'abacadefagah')
['ab', 'ac', 'ad', 'ag', 'ah']
>>> re.findall('a(.)', 'abacadefagah')
['b', 'c', 'd', 'g', 'h']
>>> re.findall('(a)(.)', 'abacadefagah')
[('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'g'), ('a', 'h')]
>>> re.findall('((a)(.))', 'abacadefagah')
[('ab', 'a', 'b'), ('ac', 'a', 'c'), ('ad', 'a', 'd'), ('ag', 'a', 'g'), ('ah', 'a', 'h')]
Making repeated regular expressions efficient
Whenever you call one of the regular expression methods, the engine has to convert the pattern string into an internal structure that makes searching strings fast. This conversion takes a non-trivial amount of time. If a regular expression pattern is going to be reused multiple times (for example, inside a for
or while
loop), it would be better if this conversion step could be done only once.
This is possible with the re.compile
method. It returns an object-oriented version of the regular expression that has been compiled down and has the methods we've explored (match
, search
, findall
) already, among others. We'll see examples of this in the case study.
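As a quick sketch of the idea (the loop data here is made up), the pattern is compiled once and its methods are reused on each iteration:

```python
import re

# Compile once, outside the loop, then reuse the pattern object
digits = re.compile(r"\d+")
for line in ["item 42", "no numbers here", "7 and 9"]:
    print(digits.findall(line))
# ['42']
# []
# ['7', '9']
```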
This has definitely been a condensed introduction to regular expressions. At this point, we have a good feel for the basics and will recognize when we need to do further research. If we have a string pattern matching problem, regular expressions will almost certainly be able to solve it for us. However, we may need to look up new syntax in a more comprehensive treatment of the topic. But now we know what to look for! Let's move on to a completely different topic: serializing data for storage.
Serializing objects
Nowadays, we take the ability to write data to a file and retrieve it at an arbitrary later date for granted. As convenient as this is (imagine the state of computing if we couldn't store anything!), we often find ourselves converting data we have stored in a nice object or design pattern in memory into some kind of clunky text or binary format for storage, transfer over the network, or remote invocation on a distant server.
The Python pickle
module is an object-oriented way to store objects directly in a special storage format. It essentially converts an object (and all the objects it holds as attributes) into a sequence of bytes that can be stored or transported however we see fit.
For basic work, the pickle
module has an extremely simple interface. It consists of four basic functions for storing and loading data: two for manipulating file-like objects, and two for manipulating bytes
objects (the latter are just shortcuts to the file-like interface, so we don't have to create a BytesIO
file-like object ourselves).
The dump
method accepts an object to be written and a file-like object to write the serialized bytes to. This object must have a write
method (or it wouldn't be file-like), and that method must know how to handle a bytes
argument (so a file opened for text output wouldn't work).
The load
method does exactly the opposite; it reads a serialized object from a file-like object. This object must have the proper file-like read
and readline
methods, each of which must, of course, return bytes
. The pickle
module will load the object from these bytes and the load
method will return the fully reconstructed object. Here's an example that stores and then loads some data in a list object:
import pickle

some_data = ["a list", "containing", 5,
             "values including another list", ["inner", "list"]]

with open("pickled_list", 'wb') as file:
    pickle.dump(some_data, file)

with open("pickled_list", 'rb') as file:
    loaded_data = pickle.load(file)

print(loaded_data)
assert loaded_data == some_data
This code works as advertised: the objects are stored in the file and then loaded from the same file. In each case, we open the file using a with
statement so that it is automatically closed. The file is first opened for writing and then a second time for reading, depending on whether we are storing or loading data.
The assert
statement at the end would raise an error if the newly loaded object is not equal to the original object. Equality does not imply that they are the same object. Indeed, if we print the id()
of both objects, we would discover they are different. However, because they are both lists whose contents are equal, the two lists are also considered equal.
The dumps
and loads
functions behave much like their file-like counterparts, except they return or accept bytes
instead of file-like objects. The dumps
function requires only one argument, the object to be stored, and it returns a serialized bytes
object. The loads
function requires a bytes
object and returns the restored object. The 's'
character in the method names is short for string; it's a legacy name from ancient versions of Python, where str
objects were used instead of bytes
.
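A few lines tie these points together; a round trip through dumps and loads produces an equal, but distinct, object:

```python
import pickle

# Serialize to bytes and restore, all in memory
original = ["a", "list", 42]
serialized = pickle.dumps(original)   # a bytes object
restored = pickle.loads(serialized)

print(restored == original)  # True: equal contents
print(restored is original)  # False: a distinct object
```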
Both dump
methods accept an optional protocol
argument. If we are saving and loading pickled objects that are only going to be used in Python 3 programs, we don't need to supply this argument. Unfortunately, if we are storing objects that may be loaded by older versions of Python, we have to use an older and less efficient protocol. This should not normally be an issue. Usually, the only program that would load a pickled object would be the same one that stored it. Pickle is an unsafe format, so we don't want to be sending it unsecured over the Internet to unknown interpreters.
The argument supplied is an integer version number. The default version is number 3, representing the current highly efficient storage system used by Python 3 pickling. The number 2 is the older version, which will store an object that can be loaded on all interpreters back to Python 2.3. As 2.6 is the oldest version of Python still widely used in the wild, version 2 pickling is normally sufficient. Versions 0 and 1 are supported on older interpreters; 0 is an ASCII format, while 1 is a binary format. There is also an optimized version 4 that may one day become the default.
As a rule of thumb, then, if you know that the objects you are pickling will only be loaded by a Python 3 program (for example, only your program will be loading them), use the default pickling protocol. If they may be loaded by unknown interpreters, pass a protocol value of 2, unless you really believe they may need to be loaded by an archaic version of Python.
If we do pass a protocol to dump
or dumps
, we should use a keyword argument to specify it: pickle.dumps(my_object, protocol=2)
. This is not strictly necessary, as the method only accepts two arguments, but typing out the full keyword argument reminds readers of our code what the purpose of the number is. Having a random integer in the method call would be hard to read. Two what? Store two copies of the object, maybe? Remember, code should always be readable. In Python, less code is often more readable than longer code, but not always. Be explicit.
It is possible to call dump
or load
on a single open file more than once. Each call to dump
will store a single object (plus any objects it is composed of or contains), while a call to load
will load and return just one object. So for a single file, each separate call to dump
when storing the object should have an associated call to load
when restoring at a later date.
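A short sketch of this pairing, using an in-memory BytesIO stream in place of a real file:

```python
import io
import pickle

# Each dump writes one object; each load reads back exactly one,
# in the same order they were written.
buffer = io.BytesIO()
pickle.dump("first", buffer)
pickle.dump({"second": 2}, buffer)

buffer.seek(0)  # rewind, as we would by reopening a file
print(pickle.load(buffer))  # first
print(pickle.load(buffer))  # {'second': 2}
```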
Customizing pickles
With most common Python objects, pickling "just works". Basic primitives such as integers, floats, and strings can be pickled, as can any container object, such as lists or dictionaries, provided the contents of those containers are also picklable. Further, and importantly, any object can be pickled, so long as all of its attributes are also picklable.
So what makes an attribute unpicklable? Usually, it has something to do with time-sensitive attributes that it would not make sense to load in the future. For example, if we have an open network socket, open file, running thread, or database connection stored as an attribute on an object, it would not make sense to pickle these objects; a lot of operating system state would simply be gone when we attempted to reload them later. We can't just pretend a thread or socket connection exists and make it appear! No, we need to somehow customize how such transient data is stored and restored.
Here's a class that loads the contents of a web page every hour to ensure that they stay up to date. It uses the threading.Timer
class to schedule the next update:
from threading import Timer
import datetime
from urllib.request import urlopen


class UpdatedURL:
    def __init__(self, url):
        self.url = url
        self.contents = ''
        self.last_updated = None
        self.update()

    def update(self):
        self.contents = urlopen(self.url).read()
        self.last_updated = datetime.datetime.now()
        self.schedule()

    def schedule(self):
        self.timer = Timer(3600, self.update)
        self.timer.setDaemon(True)
        self.timer.start()
The url
, contents
, and last_updated
are all pickleable, but if we try to pickle an instance of this class, things go a little nutty on the self.timer
instance:
>>> u = UpdatedURL("http://news.yahoo.com/")
>>> import pickle
>>> serialized = pickle.dumps(u)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    serialized = pickle.dumps(u)
_pickle.PicklingError: Can't pickle <class '_thread.lock'>: attribute lookup lock on _thread failed
That's not a very useful error, but it looks like we're trying to pickle something we shouldn't be. That would be the Timer
instance; we're storing a reference to self.timer
in the schedule method, and that attribute cannot be serialized.
When pickle
tries to serialize an object, it simply tries to store the object's __dict__
attribute; __dict__
is a dictionary mapping all the attribute names on the object to their values. Luckily, before checking __dict__
, pickle
checks to see whether a __getstate__
method exists. If it does, it will store the return value of that method instead of the __dict__
.
Let's add a __getstate__
method to our UpdatedURL
class that simply returns a copy of the __dict__
without a timer:
def __getstate__(self):
    new_state = self.__dict__.copy()
    if 'timer' in new_state:
        del new_state['timer']
    return new_state
If we pickle the object now, it will no longer fail. And we can even successfully restore that object using loads
. However, the restored object doesn't have a timer attribute, so it will not be refreshing the content like it is designed to do. We need to somehow create a new timer (to replace the missing one) when the object is unpickled.
As we might expect, there is a complementary __setstate__
method that can be implemented to customize unpickling. This method accepts a single argument, which is the object returned by __getstate__
. If we implement both methods, __getstate__
is not required to return a dictionary, since __setstate__
will know what to do with whatever object __getstate__
chooses to return. In our case, we simply want to restore the __dict__
, and then create a new timer:
def __setstate__(self, data):
    self.__dict__ = data
    self.schedule()
The pickle
module is very flexible and provides other tools to further customize the pickling process if you need them. However, these are beyond the scope of this book. The tools we've covered are sufficient for many basic pickling tasks. Objects to be pickled are normally relatively simple data objects; we would not likely pickle an entire running program or complicated design pattern, for example.
Serializing web objects
It is not a good idea to load a pickled object from an unknown or untrusted source. It is possible to inject arbitrary code into a pickled file to maliciously attack a computer via the pickle. Another disadvantage of pickles is that they can only be loaded by other Python programs, and cannot be easily shared with services written in other languages.
There are many formats that have been used for this purpose over the years. XML (Extensible Markup Language) used to be very popular, especially with Java developers. YAML (YAML Ain't Markup Language) is another format that you may see referenced occasionally. Tabular data is frequently exchanged in the CSV (Comma Separated Value) format. Many of these are fading into obscurity and there are many more that you will encounter over time. Python has solid standard or third-party libraries for all of them.
Before using such libraries on untrusted data, make sure to investigate security concerns with each of them. XML and YAML, for example, both have obscure features that, used maliciously, can allow arbitrary commands to be executed on the host machine. These features may not be turned off by default. Do your research.
JavaScript Object Notation (JSON) is a human readable format for exchanging primitive data. JSON is a standard format that can be interpreted by a wide array of heterogeneous client systems. Hence, JSON is extremely useful for transmitting data between completely decoupled systems. Further, JSON does not have any support for executable code, only data can be serialized; thus, it is more difficult to inject malicious statements into it.
Because JSON can be easily interpreted by JavaScript engines, it is often used for transmitting data from a web server to a JavaScript-capable web browser. If the web application serving the data is written in Python, it needs a way to convert internal data into the JSON format.
There is a module to do this, predictably named json
. This module provides a similar interface to the pickle
module, with dump
, load
, dumps
, and loads
functions. The default calls to these functions are nearly identical to those in pickle
, so let us not repeat the details. There are a couple differences; obviously, the output of these calls is valid JSON notation, rather than a pickled object. In addition, the json
functions operate on str
objects, rather than bytes
. Therefore, when dumping to or loading from a file, we need to create text files rather than binary ones.
The JSON serializer is not as robust as the pickle
module; it can only serialize basic types such as integers, floats, and strings, and simple containers such as dictionaries and lists. Each of these has a direct mapping to a JSON representation, but JSON is unable to represent classes, methods, or functions. It is not possible to transmit complete objects in this format. Because the receiver of an object we have dumped to JSON format is normally not a Python object, it would not be able to understand classes or methods in the same way that Python does, anyway. In spite of the O for Object in its name, JSON is a data notation; objects, as you recall, are composed of both data and behavior.
If we do have objects for which we want to serialize only the data, we can always serialize the object's __dict__
attribute. Or we can semiautomate this task by supplying custom code to create or parse a JSON serializable dictionary from certain types of objects.
In the json
module, both the object storing and loading functions accept optional arguments to customize the behavior. The dump
and dumps
methods accept a poorly named cls
(short for class, which is a reserved keyword) keyword argument. If passed, this should be a subclass of the JSONEncoder
class, with the default
method overridden. This method accepts an arbitrary object and converts it to a dictionary that json
can digest. If it doesn't know how to process the object, we should call the super()
method, so that it can take care of serializing basic types in the normal way.
The load
and loads
methods also accept such a cls
argument that can be a subclass of the inverse class, JSONDecoder
. However, it is normally sufficient to pass a function into these methods using the object_hook
keyword argument. This function accepts a dictionary and returns an object; if it doesn't know what to do with the input dictionary, it can return it unmodified.
Let's look at an example. Imagine we have the following simple contact class that we want to serialize:
class Contact:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    @property
    def full_name(self):
        return "{} {}".format(self.first, self.last)
We could just serialize the __dict__
attribute:
>>> c = Contact("John", "Smith")
>>> json.dumps(c.__dict__)
'{"last": "Smith", "first": "John"}'
But accessing special (double-underscore) attributes in this fashion is kind of crude. Also, what if the receiving code (perhaps some JavaScript on a web page) wanted that full_name
property to be supplied? Of course, we could construct the dictionary by hand, but let's create a custom encoder instead:
import json


class ContactEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Contact):
            return {'is_contact': True,
                    'first': obj.first,
                    'last': obj.last,
                    'full': obj.full_name}
        return super().default(obj)
The default
method basically checks to see what kind of object we're trying to serialize; if it's a contact, we convert it to a dictionary manually; otherwise, we let the parent class handle serialization (by assuming that it is a basic type, which json
knows how to handle). Notice that we pass an extra attribute to identify this object as a contact, since there would be no way to tell upon loading it. This is just a convention; for a more generic serialization mechanism, it might make more sense to store a string type in the dictionary, or possibly even the full class name, including package and module. Remember that the format of the dictionary depends on the code at the receiving end; there has to be an agreement as to how the data is going to be specified.
We can use this class to encode a contact by passing the class (not an instantiated object) to the dump or dumps function:
>>> c = Contact("John", "Smith")
>>> json.dumps(c, cls=ContactEncoder)
'{"is_contact": true, "last": "Smith", "full": "John Smith", "first": "John"}'
For decoding, we can write a function that accepts a dictionary and checks for the existence of the is_contact key to decide whether to convert it to a contact:
def decode_contact(dic):
    if dic.get('is_contact'):
        return Contact(dic['first'], dic['last'])
    else:
        return dic
We can pass this function to the load or loads function using the object_hook keyword argument:
>>> data = ('{"is_contact": true, "last": "smith",'
...         '"full": "john smith", "first": "john"}')
>>> c = json.loads(data, object_hook=decode_contact)
>>> c
<__main__.Contact object at 0xa02918c>
>>> c.full_name
'john smith'
Case study
Let's build a basic regular expression-powered templating engine in Python. This engine will parse a text file (such as an HTML page) and replace certain directives with text calculated from the input to those directives. This is about the most complicated task we would want to do with regular expressions; indeed, a full-fledged version of this would likely utilize a proper language parsing mechanism.
Consider the following input file:
/** include header.html **/
<h1>This is the title of the front page</h1>
/** include menu.html **/
<p>My name is /** variable name **/.
This is the content of my front page. It goes below the menu.</p>
<table>
<tr><th>Favourite Books</th></tr>
/** loopover book_list **/
<tr><td>/** loopvar **/</td></tr>
/** endloop **/
</table>
/** include footer.html **/
Copyright © Today
This file contains "tags" of the form /** <directive> <data> **/, where the data is an optional single word and the directives are:
- include: Copy the contents of another file here
- variable: Insert the contents of a variable here
- loopover: Repeat the contents of the loop for a variable that is a list
- endloop: Signal the end of looped text
- loopvar: Insert a single value from the list being looped over
This template will render a different page depending on which variables are passed into it. These variables will be passed in from a so-called context file. This will be encoded as a json object with keys representing the variables in question. My context file might look like this, but you would derive your own:
{
    "name": "Dusty",
    "book_list": [
        "Thief Of Time",
        "The Thief",
        "Snow Crash",
        "Lathe Of Heaven"
    ]
}
Before we get into the actual string processing, let's throw together some object-oriented boilerplate code for processing files and grabbing data from the command line:
import re
import sys
import json
from pathlib import Path

DIRECTIVE_RE = re.compile(
    r'/\*\*\s*(include|variable|loopover|endloop|loopvar)'
    r'\s*([^ *]*)\s*\*\*/')

class TemplateEngine:
    def __init__(self, infilename, outfilename, contextfilename):
        self.template = open(infilename).read()
        self.working_dir = Path(infilename).absolute().parent
        self.pos = 0
        self.outfile = open(outfilename, 'w')
        with open(contextfilename) as contextfile:
            self.context = json.load(contextfile)

    def process(self):
        print("PROCESSING...")

if __name__ == '__main__':
    infilename, outfilename, contextfilename = sys.argv[1:]
    engine = TemplateEngine(infilename, outfilename, contextfilename)
    engine.process()
This is all pretty basic; we create a class and initialize it with some variables passed in on the command line.
Notice how we try to make the regular expression a little more readable by breaking it across two lines? We use raw strings (the r prefix) so we don't have to double-escape all our backslashes. This is common in regular expressions, but it's still a mess. (Regular expressions always are, but they're often worth it.)
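If the mess gets out of hand, the re.VERBOSE flag offers another option: whitespace in the pattern is ignored (except inside character classes), and # starts a comment. A sketch of the same directive pattern in verbose style:

```python
import re

# The same directive pattern, rewritten with re.VERBOSE so each piece
# can carry a comment; whitespace in the pattern itself is ignored.
DIRECTIVE_RE = re.compile(r"""
    /\*\*\s*                                     # opening /** delimiter
    (include|variable|loopover|endloop|loopvar)  # the directive name
    \s*([^ *]*)\s*                               # an optional argument
    \*\*/                                        # closing **/ delimiter
    """, re.VERBOSE)

match = DIRECTIVE_RE.search("text /** variable name **/ more text")
print(match.groups())  # ('variable', 'name')
```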
The pos attribute indicates the current character in the content that we are processing; we'll see a lot more of it in a moment.
Now "all that's left" is to implement that process method. There are a few ways to do this. Let's do it in a fairly explicit way.
The process method has to find each directive that matches the regular expression and do the appropriate work with it. However, it also has to take care of outputting the normal text before, after, and between each directive to the output file, unmodified.
One good feature of the compiled version of regular expressions is that we can tell the search method to start searching at a specific position by passing the pos keyword argument. If we temporarily define doing the appropriate work with a directive as "ignore the directive and delete it from the output file", our process loop looks quite simple:
def process(self):
    match = DIRECTIVE_RE.search(self.template, pos=self.pos)
    while match:
        self.outfile.write(self.template[self.pos:match.start()])
        self.pos = match.end()
        match = DIRECTIVE_RE.search(self.template, pos=self.pos)
    self.outfile.write(self.template[self.pos:])
In English, this function finds the first string in the text that matches the regular expression, outputs everything from the current position to the start of that match, and then advances the position to the end of aforesaid match. Once it's out of matches, it outputs everything since the last position.
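The pos argument is easy to demonstrate in isolation; a quick sketch of resuming a search from the end of the previous match:

```python
import re

WORD = re.compile(r"\w+")
text = "one two three"

# The first search starts at position 0 and finds "one".
m = WORD.search(text, pos=0)
print(m.group(), m.start(), m.end())   # one 0 3

# Resuming from the previous match's end finds the next word.
m = WORD.search(text, pos=m.end())
print(m.group(), m.start(), m.end())   # two 4 7
```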
Of course, ignoring the directive is pretty useless in a templating engine, so let's replace that position-advancing line with code that delegates to a different method on the class, depending on the directive:
def process(self):
    match = DIRECTIVE_RE.search(self.template, pos=self.pos)
    while match:
        self.outfile.write(self.template[self.pos:match.start()])
        directive, argument = match.groups()
        method_name = 'process_{}'.format(directive)
        getattr(self, method_name)(match, argument)
        match = DIRECTIVE_RE.search(self.template, pos=self.pos)
    self.outfile.write(self.template[self.pos:])
So we grab the directive and the single argument from the regular expression. The directive becomes a method name, and we dynamically look up that method name on the self object (a little error processing here, in case the template writer provides an invalid directive, would be better). We pass the match object and argument into that method and assume that method will deal with everything appropriately, including moving the pos pointer.
Now that we've got our object-oriented architecture this far, it's actually pretty simple to implement the methods that are delegated to. The include and variable directives are totally straightforward:
def process_include(self, match, argument):
    with (self.working_dir / argument).open() as includefile:
        self.outfile.write(includefile.read())
    self.pos = match.end()

def process_variable(self, match, argument):
    self.outfile.write(self.context.get(argument, ''))
    self.pos = match.end()
The first simply looks up the included file and inserts the file contents, while the second looks up the variable name in the context dictionary (which was loaded from json in the __init__ method), defaulting to an empty string if it doesn't exist.
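If silently inserting an empty string seems too forgiving, a stricter lookup could fail loudly instead. The names here (TemplateError, lookup_variable) are hypothetical, not part of the engine above:

```python
class TemplateError(Exception):
    """Hypothetical error raised for an unknown template variable."""

def lookup_variable(context, name):
    # Stricter than context.get(name, ''): a missing name raises
    # instead of quietly producing empty output.
    try:
        return context[name]
    except KeyError:
        raise TemplateError("unknown variable: {}".format(name)) from None

print(lookup_variable({"name": "Dusty"}, "name"))  # Dusty
```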
The three methods that deal with looping are a bit more intense, as they have to share state between the three of them. For simplicity (I'm sure you're eager to see the end of this long chapter; we're almost there!), we'll handle this as instance variables on the class itself. As an exercise, you might want to consider better ways to architect this, especially after reading the next three chapters.
def process_loopover(self, match, argument):
    self.loop_index = 0
    self.loop_list = self.context.get(argument, [])
    self.pos = self.loop_pos = match.end()

def process_loopvar(self, match, argument):
    self.outfile.write(self.loop_list[self.loop_index])
    self.pos = match.end()

def process_endloop(self, match, argument):
    self.loop_index += 1
    if self.loop_index >= len(self.loop_list):
        self.pos = match.end()
        del self.loop_index
        del self.loop_list
        del self.loop_pos
    else:
        self.pos = self.loop_pos
When we encounter the loopover directive, we don't have to output anything, but we do have to set the initial state on three variables. The loop_list variable is assumed to be a list pulled from the context dictionary. The loop_index variable indicates what position in that list should be output in this iteration of the loop, while loop_pos is stored so we know where to jump back to when we get to the end of the loop.
The loopvar directive outputs the value at the current position in the loop_list variable and skips to the end of the directive. Note that it doesn't increment the loop index, because the loopvar directive could be called multiple times inside a loop.
The endloop directive is more complicated. It determines whether there are more elements in the loop_list; if there are, it just jumps back to the start of the loop, incrementing the index. Otherwise, it resets all the variables that were being used to process the loop and jumps to the end of the directive so the engine can carry on with the next match.
Note that this particular looping mechanism is very fragile; if a template designer were to try nesting loops or forget an endloop call, it would go poorly for them. We would need a lot more error checking, and we'd probably want to store more loop state, to make this production-quality. But I promised that the end of the chapter was nigh, so let's just head to the exercises, after seeing how our sample template is rendered with its context:
<html>
<body>
<h1>This is the title of the front page</h1>
<a href="link1.html">First Link</a>
<a href="link2.html">Second Link</a>
<p>My name is Dusty.
This is the content of my front page. It goes below the menu.</p>
<table>
<tr><th>Favourite Books</th></tr>
<tr><td>Thief Of Time</td></tr>
<tr><td>The Thief</td></tr>
<tr><td>Snow Crash</td></tr>
<tr><td>Lathe Of Heaven</td></tr>
</table>
</body>
</html>
Copyright © Today
There are some weird newline effects due to the way we planned our template, but it works as expected.
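As an aside, the nesting fragility mentioned earlier can be fixed by keeping loop state on a stack of frames instead of flat instance variables: each loopover pushes a frame, and endloop pops it when its list is exhausted. The following is a self-contained sketch (a simplified, hypothetical engine handling only the three loop directives), not the chapter's implementation:

```python
import io
import re

# Simplified pattern: only the loop directives, for demonstration.
LOOP_RE = re.compile(r'/\*\*\s*(loopover|endloop|loopvar)\s*([^ *]*)\s*\*\*/')

class StackLoopEngine:
    """Hypothetical variant where each active loop owns a stack frame."""
    def __init__(self, template, context):
        self.template = template
        self.context = context
        self.outfile = io.StringIO()
        self.pos = 0
        self.loop_stack = []

    def process(self):
        match = LOOP_RE.search(self.template, pos=self.pos)
        while match:
            self.outfile.write(self.template[self.pos:match.start()])
            directive, argument = match.groups()
            getattr(self, 'process_' + directive)(match, argument)
            match = LOOP_RE.search(self.template, pos=self.pos)
        self.outfile.write(self.template[self.pos:])
        return self.outfile.getvalue()

    def process_loopover(self, match, argument):
        self.loop_stack.append({          # push a fresh frame
            'index': 0,
            'list': self.context.get(argument, []),
            'pos': match.end(),
        })
        self.pos = match.end()

    def process_loopvar(self, match, argument):
        frame = self.loop_stack[-1]       # the innermost loop wins
        self.outfile.write(frame['list'][frame['index']])
        self.pos = match.end()

    def process_endloop(self, match, argument):
        frame = self.loop_stack[-1]
        frame['index'] += 1
        if frame['index'] >= len(frame['list']):
            self.loop_stack.pop()         # this loop is finished
            self.pos = match.end()
        else:
            self.pos = frame['pos']       # jump back for another pass

engine = StackLoopEngine(
    "/** loopover letters **/[/** loopvar **/]/** endloop **/",
    {"letters": ["a", "b"]})
print(engine.process())  # [a][b]
```

Because the inner loopover pushes a new frame on every pass of the outer loop, nested loops fall out of this design for free.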
Exercises
We've covered a wide variety of topics in this chapter, from strings to regular expressions, to object serialization, and back again. Now it's time to consider how these ideas can be applied to your own code.
Python strings are very flexible, and Python is an extremely powerful tool for string-based manipulations. If you don't do a lot of string processing in your daily work, try designing a tool that is exclusively intended for manipulating strings. Try to come up with something innovative, but if you're stuck, consider writing a web log analyzer (how many requests per hour? How many people visit more than five pages?) or a template tool that replaces certain variable names with the contents of other files.
Spend a lot of time toying with the string formatting operators until you've got the syntax memorized. Write a bunch of template strings and objects to pass into the format function, and see what kind of output you get. Try the exotic formatting operators, such as percentage or hexadecimal notation. Try out the fill and alignment operators, and see how they behave differently for integers, strings, and floats. Consider writing a class of your own that has a __format__ method; we didn't discuss this in detail, but explore just how much you can customize formatting.
Make sure you understand the difference between bytes and str objects. The distinction is very complicated in older versions of Python (there was no bytes, and str acted like both bytes and str, unless we needed non-ASCII characters, in which case there was a separate unicode object, which was similar to Python 3's str class; it's even more confusing than it sounds!). It's clearer nowadays: bytes is for binary data, and str is for character data. The only tricky part is knowing how and when to convert between the two. For practice, try writing text data to a file opened for writing bytes (you'll have to encode the text yourself), and then reading from the same file.
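That practice exercise might start with a round trip like this sketch (the filename is arbitrary; a temp directory is used to keep it throwaway):

```python
import os
import tempfile

text = "façade: héllo wörld"
path = os.path.join(tempfile.gettempdir(), "bytes_demo.txt")

with open(path, "wb") as f:        # a file opened for writing bytes
    f.write(text.encode("utf-8"))  # we must encode the str ourselves

with open(path, "rb") as f:
    restored = f.read().decode("utf-8")  # and decode on the way back in

print(restored == text)  # True
```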
Do some experimenting with bytearray; see how it can act both like a bytes object and a list or container object at the same time. Try writing to a buffer that holds data in the bytes array until it is a certain length before returning it. You can simulate the code that puts data into the buffer by using time.sleep calls to ensure data doesn't arrive too quickly.
Study regular expressions online. Study them some more. Especially learn about named groups, greedy versus lazy matching, and regex flags: three features that we didn't cover in this chapter. Make conscious decisions about when not to use them. Many people have very strong opinions about regular expressions and either overuse them or refuse to use them at all. Try to convince yourself to use them only when appropriate, and figure out when that is.
If you've ever written an adapter to load small amounts of data from a file or database and convert it to an object, consider using a pickle instead. Pickles are not efficient for storing massive amounts of data, but they can be useful for loading configuration or other simple objects. Try coding it multiple ways: using a pickle, a text file, or a small database. Which do you find easiest to work with?
Try experimenting with pickling data, then modifying the class that holds the data, and loading the pickle into the new class. What works? What doesn't? Is there a way to make drastic changes to a class, such as renaming an attribute or splitting it into two new attributes, and still get the data out of an older pickle? (Hint: try placing a private pickle version number on each object and update it each time you change the class; you can then put a migration path in __setstate__.)
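That hint might look something like the following sketch. The class, attribute names, and version scheme are all hypothetical; old_state simulates the dictionary an old pickle would hand to __setstate__:

```python
import pickle

class VersionedContact:
    """Hypothetical class whose single 'name' attribute was split into
    'first' and 'last' between pickle versions 1 and 2."""
    def __init__(self, first, last):
        self.first = first
        self.last = last
        self._version = 2

    def __setstate__(self, state):
        if state.get("_version", 1) == 1:
            # Migration path: split the old single 'name' field.
            first, _, last = state.pop("name").partition(" ")
            state["first"], state["last"] = first, last
            state["_version"] = 2
        self.__dict__.update(state)

# Simulate loading a version-1 pickle of the old class layout.
old_state = {"name": "John Smith", "_version": 1}
c = VersionedContact.__new__(VersionedContact)
c.__setstate__(old_state)
print(c.first, c.last)  # John Smith

# Current pickles round-trip cleanly too; no migration is triggered.
c2 = pickle.loads(pickle.dumps(c))
```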
If you do any web development at all, do some experimenting with the JSON serializer. Personally, I prefer to serialize only standard JSON serializable objects, rather than writing custom encoders or object_hooks, but the desired effect really depends on the interaction between the frontend (JavaScript, typically) and backend code.
Create some new directives in the templating engine that take more than one or an arbitrary number of arguments. You might need to modify the regular expression or add new ones. Have a look at the Django project's online documentation, and see if there are any other template tags you'd like to work with. Try mimicking their filter syntax instead of using the variable tag. Revisit this chapter when you've studied iteration and coroutines and see if you can come up with a more compact way of representing the state between related directives, such as the loop.
Summary
We've covered string manipulation, regular expressions, and object serialization in this chapter. Hardcoded strings and program variables can be combined into outputtable strings using the powerful string formatting system. It is important to distinguish between binary and textual data; bytes and str have specific purposes that must be understood. Both are immutable, but the bytearray type can be used when manipulating bytes.
Regular expressions are a complex topic, but we scratched the surface. There are many ways to serialize Python data; pickles and JSON are two of the most popular.
In the next chapter, we'll look at a design pattern that is so fundamental to Python programming that it has been given special syntax support: the iterator pattern.