Edited 27-01-2017 to add RapidJSON results.
Python has a number of libraries available for serializing data and objects, generally for passing them from one process or node to another, or for saving program state to disk. Serialization for a weakly-typed language such as Python brings with it some challenges, which typically result in modules having limitations in what they can serialize. My interest is mostly with regards to packaging metadata with microscopy binary data, where one might have a few megabytes of metadata alongside gigabytes of image stacks.
The built-in modules are pickle, marshal, and json. I will also look at two third-party modules, ujson and msgpack-python. All of them produce either text or binary representations, and all are uncompressed. I thought I
would test an implementation wrapping them with the
blosc meta-compressor library to compress their outputs before
writing to disk, to see what sort of space savings and potential performance enhancements could be had. The code
presented herein is available at:
It’s not intended as a production-ready tool.
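The basic idea can be sketched in a few lines. This is my own minimal illustration, not the actual tool's API, and zlib stands in for blosc here so the snippet runs without third-party dependencies; in practice you would swap in blosc.compress / blosc.decompress:

```python
import pickle
import zlib  # stand-in for blosc; substitute blosc.compress/decompress in practice


def dump_compressed(obj, path, level=1):
    """Serialize obj to bytes with pickle, compress, then write to disk."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'wb') as f:
        f.write(zlib.compress(raw, level))


def load_compressed(path):
    """Read the compressed bytes back, decompress, and unpickle."""
    with open(path, 'rb') as f:
        raw = zlib.decompress(f.read())
    return pickle.loads(raw)
```

With blosc the compression call would look like `blosc.compress(raw, typesize=8, cname='zstd', clevel=1)`, but the wrap/unwrap structure is the same.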
Pickle: is Python’s most robust serialization tool. It can manage custom classes and objects, and circular references.
It does not duplicate objects found to have multiple references. It outputs binary. It is not compatible with other languages. pickle received a major speed upgrade with Python 3, which also came with a new file protocol. It is used, for example, in the multiprocessing module to exchange data between processes.
pickle is often said to be a potential security hazard, as its files can carry malicious code that executes on loading; this is the downside of its versatility in serializing objects.
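For instance, custom classes and circular references round-trip through pickle with no helper code at all (a toy class of my own for illustration):

```python
import pickle


class Node:
    """Toy class for illustration; pickle handles it without helper code."""
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.child = None


root = Node('root')
root.child = Node('child')
root.child.parent = root          # circular reference: root -> child -> root

blob = pickle.dumps(root, protocol=pickle.HIGHEST_PROTOCOL)  # binary output
restored = pickle.loads(blob)
assert restored.child.parent is restored  # the cycle survives the round trip
```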
Marshal: is Python's internal serialization module, used internally for things like .pyc files. It can serialize only Python base types (essentially dicts, lists, and other built-ins), and will crash if fed a circular reference. It is not even compatible across different versions of Python. It
is supposed to be the fastest of the tested standard Python modules.
JSON: only deals with two constructs: lists and dicts. As such, it requires helper functions to be implemented in order to serialize objects. It's not binary, so it can be human-edited (with some difficulty; it's picky about commas and similar formatting errors). As with pickle,
json received a major performance upgrade in Python 3, such that many
external implementations of JSON were obsoleted, with a couple of exceptions such as…
UltraJSON: developed by an Electronic Arts studio (but released under a BSD license), UltraJSON is just like JSON, but
faster. One drawback of
ujson versus the default library is that it can fail silently.
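Since the JSON family only understands lists and dicts, round-tripping a custom class means writing the helper functions yourself. A minimal sketch of the usual pattern with the standard library's `default=` and `object_hook=` parameters (the Point class and hook names are mine):

```python
import json


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


def encode_point(obj):
    """Helper passed to json.dumps via default=; turns objects into dicts."""
    if isinstance(obj, Point):
        return {'__point__': True, 'x': obj.x, 'y': obj.y}
    raise TypeError('Cannot serialize %r' % type(obj))


def decode_point(d):
    """Helper passed to json.loads via object_hook=; rebuilds tagged dicts."""
    if d.get('__point__'):
        return Point(d['x'], d['y'])
    return d


text = json.dumps({'origin': Point(0, 0)}, default=encode_point)
restored = json.loads(text, object_hook=decode_point)
assert isinstance(restored['origin'], Point)
```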
RapidJSON: Another fast JSON parser, built on top of a C++ library. Here I use Hideo Hattori's wrapper. I have since switched to using python-rapidjson.
Message Pack: is billed as a binary-equivalent of JSON. On first glance it was very intriguing, as it offers a
significant encoding-rate and encoded-size advantage over JSON, and it has implementations for basically every programming language in use today. However, I ran into problems in testing. By default, it converts Unicode strings to byte strings, which can cause a loss of information. When encoding as Unicode it loses its speed advantage over the faster JSONs. Furthermore, while msgpack will serialize objects, it doesn't serialize all their attributes, and so it failed my round-trip tests.
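The Unicode behaviour can be controlled explicitly with msgpack-python's `use_bin_type`/`raw` flags (available in recent versions; the defaults have changed between releases), though as noted this costs some of the speed advantage. The import is guarded here so the snippet runs even without msgpack installed:

```python
try:
    import msgpack
except ImportError:          # msgpack-python not installed
    msgpack = None

if msgpack is not None:
    # use_bin_type=True distinguishes str from bytes when packing;
    # raw=False makes the unpacker decode raw strings back to str.
    packed = msgpack.packb('héllo wörld', use_bin_type=True)
    assert msgpack.unpackb(packed, raw=False) == 'héllo wörld'
```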
Blosc: isn’t a serializer, it’s a compressor, or more properly a meta-compressor.
blosc wraps a variety of compression
codecs with a thread pool, to provide very high performance. The two best compressors in
blosc in my experience are
lz4, which is ultra-fast but middling compression (and the new standard codec for the ZFS file system), and Zstandard
which achieves better compression ratios than zlib/gzip and is still very fast. Zstandard is new as of 2015 and
essentially offers something for nothing compared to older compression algorithms. It’s usually within spitting
distance (a few %) of BZIP2 for compression ratio and far faster, being heavily optimized for parallel computing. In
testing on Xeon machines I’ve achieved compression rates of about 12 GB/s with lz4 and 5 GB/s with zstd. That is
gigabytes per second.
blosc also has a filter stage, which at present is byte- or bit-shuffling. I’ve found
bit-shuffling to be effective when compressing floating-point or otherwise dynamic-range limited data. It would
probably be extremely helpful for DNA or protein sequences, for example. Here I did not find shuffling to be effective.
Throughout this post I use compression level 1 for zstd and compression level 9 for lz4. lz4 does not really slow down at all with increasing compression level, and zstd saturates much earlier than zlib (there's rarely an advantage to going past level 4).
All tests were performed with Python 3.5 on an Intel Core i5-3570K (3.4 GHz), running 4 threads for
blosc, and a Western
Digital 3 TB ‘Red’ drive formatted for NTFS. Ideally one would perform this test on a Linux system with disk cache
flushing between each test. I would expect some additional performance from Python 3.6, in particular because we are
using dicts here, but I use a lot of libraries so it will be some time before I move to it myself.
High-entropy Benchmark: 256k UUID4 keys
I tested the serialization tools on 256k UUID4 keys with singleton values. This is a fairly challenging data set for compression because there’s quite a high degree of randomness inherent in what are supposed to be unique identifiers.
Figure 1: Write times for 256k UUID4 keys.
Figure 2: Read times for 256k UUID4 keys.
Figure 3: Size of the data on disk. The difficulty of this data set is evident in that lz4 compression achieved little. However, zstd shines here, cutting the data in half.
pickle using zstd compression yields about a 25 % write time penalty, but this is nearly negated by a
corresponding reduction in read time. Since the data is small, I expect ‘writing’ is just depositing the file in the
hard drive cache.
The increased performance of
blosc-JSON compared to pure JSON is somewhat paradoxical: it is due not to the JSON serialization itself, but to Python's poor performance in reading and writing Unicode streams to disk. If you encode the JSON output as UTF-8 and write it as bytes, it's much faster. I left the benchmarks as-is, because it's something to keep in mind. Similarly, marshal seems to be faster at reading when it is passed a bytes object instead of a stream.
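In other words, the fast path is to do the encoding yourself and write bytes; a sketch of the two approaches (timings will vary by platform, so none are asserted here):

```python
import json
import os
import tempfile

data = {'id%d' % i: i for i in range(1000)}
tmpdir = tempfile.mkdtemp()

# Text mode: Python encodes the Unicode stream while writing (the slow path)
with open(os.path.join(tmpdir, 'out_text.json'), 'w', encoding='utf-8') as f:
    json.dump(data, f)

# Bytes mode: encode to UTF-8 up front and write raw bytes (noticeably faster)
with open(os.path.join(tmpdir, 'out_bytes.json'), 'wb') as f:
    f.write(json.dumps(data).encode('utf-8'))
```

Both files contain identical JSON; only the point at which the encoding happens differs.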
Message Pack looks on the surface to offer impressive performance, but as mentioned above the Python implementation often omits important information from objects. If I worked on an enterprise-level project, I might dig more deeply into why and when it fails, but I don’t, so I won’t.
Low-entropy Benchmark: JSON Generator
Here I generated 10,000 pseudo-phonebook entries with the handy JSON Generator, corresponding to about 25 MB of JSON data. This data has many more repeated elements, in particular a lot of non-unique keys, which should improve the relative performance of the compression tools.
Figure 4: Write times for ten-thousand procedurally generated phonebook-like JSON entries.
Figure 5: Read times for ten-thousand procedurally generated phonebook-like JSON entries.
Figure 6: Size of the data on disk. Here the data is significantly less random, so the compression tools, and especially lz4, perform better than with the high-entropy data set. The blocksize was 64 kB.
Overall lz4 reduces the disk usage by about 40-50 % and Zstandard shaves another 10 % off of that. If you are consistently dealing with larger data chunks, the blocksize could be increased. Typically blosc is fastest when the block fits into L2 cache, but the compression ratio usually keeps improving up to about 1 MB blocks before saturating.
Here both UltraJSON and Message Pack silently failed the read/write assert test. The ujson error appears to be related to minimum precision when reading floats, and for Message Pack the problem was its conversion of Unicode to bytes.
Overall, on fairly difficult data
blosc reduces file size by about 50 % in return for a 25 % write speed penalty.
However, the read time is accelerated, such that the net time spent on file I/O is more or less a push. On more highly
compressible data (e.g. DNA base pairs, protein sequences), and in particular data large enough to swamp the hard disk's cache (typically 0.5-1.0 GB), one would expect to see
blosc + serialization be faster than just pure serialization.
pickle offers out-of-the-box functionality for serializing objects. If you want to serialize with JSON, to
maintain cross-language compatibility, then you’ll need to implement helper methods yourself. UltraJSON looks great on
the surface but I did manage to break it, so I wouldn’t consider it an out-of-the-box robust solution. Still, it beats
pickle in speed. This could be as simple as Python's boolean True mapping to 'True' on disk and coming back as a string when read. Another potential JSON library with a Python wrapper to examine is RapidJSON, which has two implementations, python-rapidjson and pyrapidjson. I've found
python-rapidjson to be the less buggy of the two but it only
supports Python 3.
One aspect I wanted to look at was monkey-patching the multiprocessing module to use bloscpickle instead of pickle, but the pickler is not exposed, so one would have to patch the reduction.py file in the module.
One disadvantage of
blosc at present is that it does not have a streaming interface, i.e. it deals with bytes objects.
This means it will store an extra (compressed) copy of the data in memory, relative to vanilla pickling. It also used
to hold onto the GIL, although that has been patched out and should go live with the next release.