[36] Serialization and Unserialization
(Part of C++ FAQ Lite, Copyright © 1991-2006, Marshall Cline, cline@parashift.com)


FAQs in section [36]:


[36.1] What's this "serialization" thing all about?

It lets you take an object or group of objects, put them on a disk or send them through a wire or wireless transport mechanism, then later, perhaps on another computer, reverse the process: resurrect the original object(s). The basic mechanisms are to flatten object(s) into a one-dimensional stream of bits, and to turn that stream of bits back into the original object(s).

Like the Transporter on Star Trek, it's all about taking something complicated and turning it into a flat sequence of 1s and 0s, then taking that sequence of 1s and 0s (possibly at another place, possibly at another time) and reconstructing the original complicated "something."

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.2] How do I select the best serialization technique?

There are lots and lots (and lots) of if's, and's and but's, and in reality there are a whole continuum of techniques with lots of dimensions. Because I have a finite amount of time (translation: I don't get paid for any of this), I've simplified it to a decision between using human-readable ("text") or non-human-readable ("binary") format, followed by a list of five techniques arranged more-or-less in increasing order of sophistication.

You are, of course, not limited to those five techniques. You will probably end up mixing ideas from several techniques. And certainly you can always use a more sophisticated (higher numbered) technique than is actually needed. In fact it might be wise to use a more sophisticated technique than is minimally needed if you believe future changes will require the greater sophistication. So think of this list merely as a good starting point.

There's a lot here, so get ready!

  1. Decide between human-readable ("text") and non-human-readable ("binary") formats. The tradeoffs are non-trivial. Later FAQs show how to write simple types in text format and how to write simple types in binary format.
  2. Use the least sophisticated solution when the objects to be serialized aren't part of an inheritance hierarchy (that is, when they're all of the same class) and when they don't contain pointers to other objects.
  3. Use the second level of sophistication when the objects to be serialized are part of an inheritance hierarchy, but when they don't contain pointers to other objects.
  4. Use the third level of sophistication when the objects to be serialized contain pointers to other objects, but when those pointers form a tree with no cycles and no joins.
  5. Use the fourth level of sophistication when the objects to be serialized contain pointers to other objects, and when those pointers form a graph that with no cycles, and with joins at the leaves only.
  6. Use the most sophisticated solution when the objects to be serialized contain pointers to other objects, and when those pointers form a graph that might have cycles or joins.

Here's that same information arranged like an algorithm:

  1. The first step is to make an eyes-open decision between text- and binary-formats.
  2. If your objects aren't part of an inheritance hierarchy and don't contain pointers, use solution #1.
  3. Else if your objects don't contain pointers to other objects, use solution #2.
  4. Else if the graph of pointers within your objects contain neither cycles nor joins, use solution #3.
  5. Else if the graph of pointers within your objects don't contain cycles and if the only joins are to terminal (leaf) nodes, use solution #4.
  6. Else use solution #5.

Remember: feel free to mix and match, to add to the above list, and, if you can justify the added expense, to use a more sophisticated technique than is minimally required.

One more thing: the issues of inheritance and of pointers within the objects are logically unrelated, so there's no theoretical reason for #2 to be any less sophisticated than #3-5. However in practice it often (not always) works out that way. So please do not think of these categories as somehow sacred — they're somewhat arbitrary, and you are expected to mix and match the solutions to fit your situation. This whole area of serialization has far more variants and shades of gray than can be covered in a few questions/answers.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.3] How do I decide whether to serialize to human-readable ("text") or non-human-readable ("binary") format?

Carefully.

There is no "right" answer to this question; it really depends on your goals. Here are a few of the pros/cons of human-readable ("text") format vs. non-human-readable ("binary") format:

You might think of others to add as well... The important thing to remember is that one size does not fit all — make a careful decision here.

One more thing: no matter which you choose, you might want to start each file / stream with a "magic" tag and a version number. The version number would indicate the format rules. That way if you decide to make a radical change in the format, you hopefully will still be able to read the output produced by the old software.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.4] How serialize/unserialize simple types like numbers, characters, strings, etc.?

The answer depends on your decision regarding human-readable ("text") format vs. non-human-readable ("binary") format:

The primitives discussed in those FAQs will be needed for most of the other FAQs in this section.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.5] How exactly do I read/write simple types in human-readable ("text") format?

Before you read this, make sure to evaluate all the tradeoffs between human-readable and non-human-readable formats. The tradeoffs are non-trivial, so you should resist a knee-jerk reaction to do it the way you did it on the last project — one size does not fit all.

After you have made an eyes-open decision to use human-readable ("text") format, you should remember these keys:

Please remember that these are primitives that you will need to use in the other FAQs in this section.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.6] How exactly do I read/write simple types in non-human-readable ("binary") format?

Before you read this, make sure to evaluate all the tradeoffs between human-readable and non-human-readable formats. The tradeoffs are non-trivial, so you should resist a knee-jerk reaction to do it the way you did it on the last project — one size does not fit all.

After you have made an eyes-open decision to use non-human-readable ("binary") format, you should remember these keys:

Please remember that these are primitives that you will need to use in the other FAQs in this section.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.7] How do I serialize objects that aren't part of an inheritance hierarchy and that don't contain pointers to other objects?

This is the least sophisticated problem, and not surprisingly, it is also the least sophisticated solution:

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.8] How do I serialize objects that are part of an inheritance hierarchy and that don't contain pointers to other objects?

Suppose you want to serialize a "shape" object, where Shape is an abstract class with derived classes Rectangle, Ellipse, Line, Text, etc. You would declare a pure virtual function serialize(std::ostream&) const within class Shape, and make sure the first thing done by each override is to write out the class's identity. For example, Ellipse::serialize(std::ostream&) const would write out the identifier Ellipse (perhaps as a simple string, but there are several alternatives discussed below).

Things get a little trickier when unserializing the object. You typically start with a static member function in the base class such as Shape::unserialize(std::istream& istr). This is declared to return a Shape* or perhaps a smart pointer such as Shape::Ptr. It reads the class-name identifier, then uses some sort of creational pattern to create the object. For example, you might have a table that maps from the class name to an object of the class, then use the Virtual Constructor Idiom to create the object.

Here's a concrete example: Add a pure virtual method create(std::istream&) const within base class Shape, and define each override to a one-liner that uses new to allocate an object of the appropriate derived class. E.g., Ellipse::create(std::istream& istr) const would be { return new Ellipse(istr); }. Add a static std::map<std::string,Shape*> object that maps from the class name to a representative (AKA prototype) object of the appropriate class; e.g., "Ellipse" would map to a new Ellipse(). Function Shape::unserialize(std::istream& istr) would read the class-name, throw an exception if it's not in the map (if (theMap.count(className) == 0) throw ...something...), then look up the associated Shape* and call its create() method: return theMap[className]->create(istr).

The map is typically populated during static initialization. For example, if file Ellipse.cpp contains the code for derived class Ellipse, it would also contain a static object whose ctor adds that class to the map: theMap["Ellipse"] = new Ellipse().

Notes and caveats:

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.9] How do I serialize objects that contain pointers to other objects, but those pointers form a tree with no cycles and no joins?

Before we even start, you must understand that the word "tree" does not mean that the objects are stored in some sort of tree-like data structure. It simply means that your objects point to each other, and the "with no cycles" part means if you keep following pointers from one object to the next, you never return to an earlier object. Your objects aren't "inside" a tree; they are a tree. If you don't understand that, you really should read the lingo FAQ before continuing with this one.

Second, don't use this technique if the graph might someday contain cycles or joins.

Graphs with neither cycles nor joins are very common, even with "recursive composition" design patterns like Composite or Decorator. For example, the objects representing an XML document or an HTML document can be represented as a graph without joins or cycles.

The key to serializing these graphs is to ignore a node's identity and instead to focus only on its contents. A (typically recursive) algorithm dives through the tree and writes the contents as it goes. For example, if the current node happens to have an integer a, a pointer b, a float c, and another pointer d, then you first write the integer a, then recursively dive into the child pointed to by b, then write the float c, and finally recursively dive into the child pointed to by d. (You don't have to write/read them in the declaration order; the only essential rule is that the reader's order is consistent with the writer's order.)

When unserializing, you need a constructor that takes a std::istream&. The constructor for the above object would read an integer and store the result in a, then would allocate an object to be stored in pointer b (and will pass the std::istream to the constructor so it too can read the stream's contents), read a float into c, and finally will allocate an object to be stored in pointer d. Be sure to use smart pointers within your objects, otherwise you will get a leak if an exception is thrown while reading anything but the first pointed-to object.

It is often convenient to use the Named Constructor Idiom when allocating these objects. This has the advantage that you can enforce the use of smart pointers. To do this in a class Foo, write a static method such as FooPtr Foo::create(std::istream& istr) { return new Foo(istr); } (where FooPtr is a smart pointer to a Foo). The alert reader will note how consistent this is with the technique discussed in the previous FAQ — the two techniques are completely compatible.

If an object can contain a variable number of children, e.g., a std::vector of pointers, then the usual approach is to write the number of children just before recursively diving into the first child. When unserializing, just read the number of children, then use a loop to allocate the appropriate number of child objects.

If a child-pointer might be NULL, be sure to handle that in both the writing and reading. This shouldn't be a problem if your objects use inheritance; see that solution for details. Otherwise, if the first serialized character in an object has a known range, use something outside that range. E.g., if the first character of a serialized object is always a digit, use a non-digit like 'N' to mean a NULL pointer. Unseralization can use std::istream::peek() to check for the 'N' tag. If the first character doesn't have a known range, force one; e.g., write 'x' before each object, then use something else like 'y' to mean NULL.

If an object physically contains another object inside itself, as opposed to containing a pointer to the other object, then nothing changes: you still recursively dive into the child-node as if it were via a pointer.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.10] How do I serialize objects that contain pointers to other objects, but those pointers form a tree with no cycles and only "trivial" joins?

As before, the word "tree" does not mean that the objects are stored in some sort of tree-like data structure. It simply means your objects have pointers to each other, and the "no cycles" part means you can follow the pointers from one object to the next and never return to an earlier object. The objects are not "inside" a tree; they are a tree. If that's doesn't make sense, you really should read the lingo FAQ before continuing with this one.

Use this solution if the graph contains joins at the leaf nodes, but those joins can be easily reconstructed via a simple look-up table. For example, the parse-tree of an arithmetic expression like (3*(a+b) - 1/a) might have joins since a variable-name (like a) can show up more than once. If you want the graph to use the same exact node-object to represent both occurrences of that variable, then you could use this solution.

Although the above constraints don't fit with those of the solution without any joins, it's so close that you can squeeze things into that solution. Here are the differences:

Caveat: this assumes that all occurrences of variable a should map to the same node object; if it's more complicated than this, that is, if some occurrences of a should map to one object and some to another, you might need to use a more sophisticated solution.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.11] How do I serialize objects that contain pointers to other objects, and those pointers form a graph that might have cycles or non-trivial joins?

Caveat: the word "graph" does not mean that the objects are stored in some sort of data structure. Your objects form a graph because they point to each other. They're not "inside" a graph; they are a graph. If that doesn't make sense, you really should read the lingo FAQ before continuing with this one.

Use this solution if the graph can contain cycles, or if it can contain more complicated kinds of joins than are allowed by the solution for trivial joins. This solution handles two core issues: it avoids infinite recursion and it writes/reads each node's identity in addition its contents.

A node's identity doesn't normally have to be consistent between different output streams. For example, if a particular file uses the number 3 to represent node x, a different file can use the number 3 to represent a different node, y.

There are some clever ways to serialize the graph, but the simplest to describe is a two-pass algorithm that uses an object-ID map, e.g., std::map<Node*,unsigned> oidMap. The first pass populates our oidMap, that is, it builds a mapping from object pointer to the integer that represents the object's identity. It does this by recursively diving through the graph, at each node checking if the node is already in oidMap, and if not, adding the node and a unique integer to oidMap and recursively diving into the new node's children. The unique integer is often just the initial oidMap.size(), e.g., unsigned n = oidMap.size(); oidMap[nodePtr] = n. (Yes, we did that in two statements. You must also. Do not shorten it to a single statement. You have been warned.)

The second pass iterates through the nodes of oidMap, and at each, writes the node's identity (the associated integer) followed by its contents. When writing the contents of a node that contains pointers to other nodes, instead of diving into those "child" objects, it simply writes the identity (the associated integer) of the pointer to those nodes. For example, when your node contains Node* child, simply write the integer oidMap[child]. After the second pass, the oidMap can be discarded. In other words, the mapping from Node* to unsigned should not normally survive beyond the end of the serialization of any given graph.

There are also some clever ways to unserialize the graph, but here again the simplest to describe is a two-pass algorithm. The first pass populates a std::vector<Node*> v with objects of the right class, but any child pointers within those objects are all NULL. This means v[3] will point to the object whose oid is 3, but any child pointers inside that object will be NULL. The second pass populates the child pointers inside the objects, e.g., if v[3] has a child pointer called child that is supposed to point to the object whose oid is 5, the second pass changes changes v[3].child from NULL to v[5] (obviously encapsulation might prevent it from directly accessing v[3].child, but ultimately v[3].child gets changed to v[5]). After unserializing a given stream, the vector v can normally be discarded. In other words, the oids (3, 5, etc.) mean nothing when serializing or unserializing a different stream — those numbers are only meaningful within a given stream.

Note: if your objects contain polymorphic pointers, that is, base class pointers that might point at derived class objects, then use the technique described earlier. You'll also want to read some of the earlier techniques for handling NULL, writing version numbers, etc.

Note: you should seriously consider the Visitor pattern when recursively diving through the graph, since serialization is probably just one of many different reasons to make that recursive dive, and they'll all need to avoid infinite recursion.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.12] Are there any caveats when serializing / unserializing objects?

One of the things you don't want to do, except in unusual circumstances, is to make any changes to the node's data during the traversal. For example, some people feel they can map from Node* to integer by simply adding an integer as a data member in the Node class. They also sometimes add a bool haveVisited flag as another data member within Node objects.

But this causes lots of multi-threading and/or performance problems. Your serialize() method can no longer be const, so even though it is logically just a reader of the node data, and even though it logically makes sense to have multiple threads reading the nodes at the same time, the actual algorithm writes into the nodes. If you fully understand threading and reader/writer conflicts, the best you can hope for is to make your code more complicated and slower (you'll have to block all reader threads whenever any thread is doing anything with the graph). If you (or those who will maintain the code after you!) don't fully understand reader/writer conflicts, this technique can create very serious and very subtle errors. Reader/writer and writer/writer conflicts are so subtle they often slip through test into the field. Bad news. Trust me. If you don't trust me, talk to someone you do trust. But don't cut corners.

There's lots more I could say, such as several simplifications and special cases, but I've already spent too much time on this. If you want more info, spend some money.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


[36.13] What's all this about graphs, trees, nodes, cycles, joins, and joins at the leaves vs. internal nodes?

When your objects contain pointers to other objects, you end up with something computer scientists call a graph. Not that your objects are stored inside a tree-like data structure; they are a tree-like structure.

Your objects correspond to the graph's nodes AKA vertices, and the pointers within your objects correspond to the graph's edges. The graph is of a special variety called a rooted, directed graph. The root object to be serialized corresponds to the graph's root node, and the pointers correspond to directed edges.

If object x has a pointer to object y, we say that x is a parent of y and/or that y is a child of x.

A path through a graph corresponds to starting with an object, following a pointer to another object, etc., etc. to an arbitrary depth. If there is a path from x to z we say that x is an ancestor of z and/or that z is a descendent of x.

A join in a graph means there are two or more distinct paths to the same object. For example, if z is a child of both x and y, then the graph has a join, and z is a join node.

A cycle in a graph means there is a path from an object back to itself: if x has a pointer to itself, or to y which points to x, or to y which points to z which points to x, etc. A graph is cyclic if it has one or more cycles; otherwise it is acyclic.

An internal node is a node with children. A leaf node is a node without children.

As used in this section, the word tree means a rooted, directed, acyclic graph. Note that each node within a tree is also a tree.

TopBottomPrevious sectionNext sectionSearch the FAQ ]


E-Mail E-mail the author
C++ FAQ LiteTable of contentsSubject indexAbout the author©Download your own copy ]
Revised Mar 1, 2006