Python binary packages (and origin story of récitale)
or "One headache, two headaches, three headaches, four..." Tue 08 February 2022Don't care about your origin story, gimme the good stuff
I used Prosopopée for a while for my photo blog and contributed some fixes back then. This software lets you define albums (aka galleries) with photos (and videos, audio files, text, HTML, iframes, ...) and then creates thumbnails for those photos so that your website does not make your user load 500MiB of data for your 20-photo album.
However, with 25+ albums and 1000+ images, it was not as small a photo blog as it used to be in the beginning. After a hiccup on the server hosting my blog, I had to reinstall everything from scratch, and with my slow upload link, a few GiB of pictures and thumbnails would be just too much. So instead, I uploaded the originals and "compiled" the blog on the server (like I had been doing for years already, though making use of the previous build cache). The thing is... it took FIVE AND A HALF HOURS to build this from scratch.
And this is where all this madness began.
I discovered that the whole Python project was single-threaded, but that it was using the subprocess Python module to call GraphicsMagick, which happens to be multi-threaded. This was sub-optimal: GraphicsMagick could run faster by converting all thumbnails of a single picture in one command instead of reloading the base image for each thumbnail. I also discovered that there is a nice image manipulation library in Python called Pillow. I created a small proof of concept and it was slower than GraphicsMagick... because Pillow is single-threaded. So, time to go for multithreading. This is where I learned about the Python Global Interpreter Lock (aka GIL), which means that threads in Python are only really useful for IO-intensive tasks, when the CPU is waiting on some peripheral. So, threads... no go, since I want all cores to be running at 100% at the same time. Then I discovered the multiprocessing Python module, which side-steps the GIL. One proof of concept later and, as hoped, multiple processes, each creating the thumbnails for its own image, were much faster than GraphicsMagick. My homemade benchmarks ran on my Asus C101-PA Chromebook (Rockchip RK3399, 6-core ARM64 SoC) and I could get between 5 and 8 times faster with multiprocessing + Pillow compared to GraphicsMagick.
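Here is a minimal sketch of that approach, assuming hypothetical thumbnail sizes and an `albums/` source directory (this is not récitale's actual code): each worker process loads an original once and derives every thumbnail from it, and because workers are separate processes, the GIL does not serialise them.

```python
# A minimal sketch of the idea (not récitale's actual code): one worker
# process per image, each producing every thumbnail size from a single load.
from multiprocessing import Pool
from pathlib import Path

from PIL import Image

SIZES = [(1600, 1600), (800, 800), (200, 200)]  # hypothetical thumbnail sizes

def make_thumbnails(path: Path) -> None:
    # Load the original once, then derive every thumbnail from that copy.
    with Image.open(path) as original:
        for width, height in SIZES:
            thumb = original.copy()
            thumb.thumbnail((width, height))
            thumb.save(path.with_name(f"{path.stem}-{width}x{height}{path.suffix}"))

if __name__ == "__main__":
    images = sorted(Path("albums").rglob("*.jpg"))  # hypothetical source tree
    # Each worker is a separate process, so all CPU cores can run at 100%.
    with Pool() as pool:
        pool.map(make_thumbnails, images)
```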
I figured that reworking the original project (prosopopée) would be too much work compared to the likelihood of the full rework being accepted by the original maintainer. I thus started a "fork" from scratch, just reusing the templates. One proof of concept later, I shared it with the maintainer of prosopopée and explained all the challenges I had faced along the way (who would have known that the GIL was SO hard to side-step while keeping one's own sanity?). They were interested, so I started to work on a proper fork with proper commits so that I could send a pull request to the original project and see where it'd go from there. Many months of work and lots of headaches later (supporting multiple versions of Pillow turned out to be painful, with a handful of quirks and hacks to implement), the fork was in good enough shape to be contributed back to the original project... but wait... with so many (big and drastic) changes, one needs benchmarks to highlight how much the situation improves!
While setting up multiple computers of different architectures (AArch64, x86_64) and different levels of power (absurdly slow to pretty fast), I was surprised to see that on some computers the build speed only improved by about 50%. I discovered that for x86_64 SoCs with SIMD support, a fork of Pillow exists: Pillow-SIMD, which claims to be much faster than the original Pillow. Tested and confirmed: much faster on computers that support the SIMD extensions of Intel's x86_64 instruction set. However, Pillow-SIMD is not compatible with the AArch64 instruction set (well... any instruction set other than Intel's x86_64, actually). And that is unfortunately not something I want to support officially, since I want to be able to use prosopopée on my Chromebook and have people use it on a Raspberry Pi (a friend generates and hosts his photo blog on one, so I "had to" support it officially :) ). But it's good for benchmarking nonetheless! I also stumbled upon an explanation as to why it's hard to contribute the SIMD work back to Pillow, which helped me understand the packaging world of computer distributions a bit better.
Back to my benchmarking :) Since I don't like having multiple sources for software packages, I usually install them from the distribution package manager (dnf on Fedora). However, since I wanted to support multiple versions of Pillow (and more recent ones than those available in the distribution's official package repositories), I had to use pip. After benchmarking for a while, I discovered that my numbers for pip-installed versions were worse (by a non-negligible factor) than for the version that came with my distribution. Then I did the unthinkable: I tested the exact same version of Pillow, one from pip, one from my distribution. And the one from my distribution was almost twice as fast as pip's. For. The. Same. Version. After some digging, I saw that Pillow was not using my distribution's JPEG library - libjpeg-turbo - but its own - the original, and slower, libjpeg. Stay til the end for the explanation :) Leave a like and subs... ah no, not YouTube.
I also discovered that one can have pip build Python packages from source by using python3 -m pip install --no-binary :all: pillow (after making sure the Pillow package was entirely removed from my system). And with that, my distribution's JPEG library (libjpeg-turbo) was used by Pillow and the perfs were similar. Phew.
Time for some (year-old) benchmarks. For 31 galleries and ~1400 photos (durations in [h:]mm:ss, lower is better):
| Computer | GraphicsMagick | Pillow 8.1.0 | Pillow 8.1.0 (built from source) | Pillow-SIMD 7.0.0.post3 |
|---|---|---|---|---|
| Intel Q6600 (4c/4t @2.4GHz), 4GB RAM (Fedora Desktop 33) | 1:37:13.06 | 26:57.71 | 17:43.66 | N/A |
| Intel Atom N2800 (2c/4t @1.86GHz), 2GB RAM (Fedora Server 33) | 5:35:32.57 | 1:44:21.93 | 1:16:32.42 | N/A |
| Intel Celeron G1610T (2c/2t @2.3GHz), 4GB RAM (Fedora Server 33) | 1:42:33.79 | 46:10.00 | 26:10.30 | 17:30.49 |
| Intel Core i7-8700 (6c/12t @3.2GHz), 32GB RAM (Ubuntu Desktop 20.04.2) | 33:01.63 | 6:00.16 | 3:40.09 | 2:16.03 |
| Raspberry Pi 4, 4GB RAM (Ubuntu Server 20.10) | 3:44:57.00 | 44:43.67 | 33:29.86 | N/A |
Seems like the months of hard work proved to be useful after all!
I sent the Pull Request and called it a day.
Fast forward a few months, the maintainer had merged some other pull requests of mine but didn't take the time to review this (big) pull request. So after some careful thinking, I decided to start my own fork, récitale.
And the second round of madness started. I now have a pip package on PyPI and wanted to create a container image for the project. Ever since I started using container images, I've always tried to use Alpine-based ones as they are more lightweight than others and apparently also offer some decent security practices. I shall therefore create an Alpine-based container image for my project. I tried for hours and hours and pip's Pillow would always get compiled from source instead of using the prebuilt version (aka wheels). Some evenings spent in the matrix and here's my summary of why that is:
Probably most of us have only ever developed pure Python scripts: ones that only need a Python interpreter to run, and that would be it. Another kind of Python software exists, though: Python extension modules. Those are actually coded in C (or C++) against the Python API and can be imported and used as Python modules in your Python scripts. Since C is compiled and not interpreted like Python, Python extension modules need to be compiled in order to be usable. Pillow actually mostly contains and makes use of Python extension modules, therefore it needs to be compiled. The reason for such Python extension modules is that some code is much faster when written in a low-level language like C or C++ than in Python. (As a side note, while there are multiple Python interpreters written in different languages, CPython is the most widely used and is written in C, as its name suggests.)
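You can see the difference from a Python shell: a compiled extension module is a regular module from the importer's point of view, but its file on disk is a shared library rather than a .py file. The paths in the comments below are only examples and depend on your interpreter and platform.

```python
# A pure-Python module resolves to a .py file, while an extension module
# resolves to a compiled shared library (a .so file on Linux).
import json          # pure-Python package from the standard library
import PIL._imaging  # Pillow's C extension module

print(json.__file__)          # e.g. /usr/lib/python3.10/json/__init__.py
print(PIL._imaging.__file__)  # e.g. .../site-packages/PIL/_imaging.cpython-310-x86_64-linux-gnu.so
```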
Since having users compile source code before being able to use software is not the best adoption strategy, there needs to be a way to share prebuilt Python extension modules. Prebuilt Python extension modules are compiled and shared as shared libraries (commonly .so files on UNIX systems). The compilation and packaging into wheels is handled for us, so we don't have to worry about that. However, shared libraries almost always depend on (link against) other shared libraries, at the very least the standard C library (aka libc). So, the maintainer of a Python extension module will compile it into a shared library, package it as a wheel, and publish it on some Python package index such as PyPI. Here comes the first problem: the shared libraries against which the Python extension module was linked may not be the same as the ones installed on the users' computers. This could result in the inability to run the Python extension module anywhere else than on the maintainer's computer, which kind of defeats the purpose of being able to share it.
Instead, the Python community decided to write a contract that each Python extension module maintainer should fulfil in order to share prebuilt modules publicly. This contract is defined in PEP-0513. Wheel packages for Linux systems with the manylinux1 tag expect a given set of system libraries to be present on the user's computer, each with a specific major version, and guarantee that they work in that environment. This is great since there's no need to ship those system libraries alongside the Python extension module on PyPI: they just have to be present on the system, and the Python package manager (e.g. pip) can fetch the prebuilt version of the modules. The not-so-nice thing is that this environment will inevitably get outdated over time, since the source code of those system libraries evolves too and will eventually introduce some backwards incompatibilities. Meaning prebuilt modules would only be available for rather old systems. That's where PEP-0599 comes into play, with the manylinux2014 tag and an updated contract for the set of system libraries installed on user computers. (Additionally, this PEP brings support for non-Intel architectures, such as ARM or PowerPC.) So now maintainers need to compile two different wheel packages, one fulfilling the manylinux1 contract and another fulfilling manylinux2014's. And multiply this by the number of architectures they want to support. This is also where another issue arises: the PEP "contracts" need to be constantly updated to match what computer distributions actually ship. This is tedious for the Python community, so they came up with yet another contract: PEP-0600. This PEP defines new tags, each targeting a specific GNU libc (aka glibc) version (major and minor are used to discriminate versions) and a given CPU architecture. Therefore, any shared library that is not part of the glibc is not part of the new manylinux "contract".
All that being said, the contract only ever mentions a very small set of system libraries, and it is very likely that some Python extension modules will link against other shared libraries. Such is the case for Pillow. After installing Pillow with pip, one can find the Python extension module shared libraries in the ~/.local/lib/python3.7/site-packages/PIL/ directory. One can discover which shared libraries they are linked against by running the following command:
```
$ ldd ~/.local/lib/python3.7/site-packages/PIL/_imaging.cpython-37m-aarch64-linux-gnu.so
    linux-vdso.so.1 (0x000000766a429000)
    libjpeg-35e8c64c.so.62.3.0 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libjpeg-35e8c64c.so.62.3.0 (0x000000766a277000)
    libopenjp2-ae40752c.so.2.4.0 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libopenjp2-ae40752c.so.2.4.0 (0x000000766a1c4000)
    libz-21b81fdb.so.1.2.11 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libz-21b81fdb.so.1.2.11 (0x000000766a183000)
    libtiff-e22335e6.so.5.7.0 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libtiff-e22335e6.so.5.7.0 (0x000000766a081000)
    libxcb-be71eb15.so.1.1.0 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libxcb-be71eb15.so.1.1.0 (0x000000766a00c000)
    libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007669fc9000)
    libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007669e57000)
    libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007669d9a000)
    liblzma-4da4ab69.so.5.2.5 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/liblzma-4da4ab69.so.5.2.5 (0x0000007669d39000)
    libXau-21870672.so.6.0.0 => /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libXau-21870672.so.6.0.0 (0x0000007669d08000)
    /lib/ld-linux-aarch64.so.1 (0x000000766a3fb000)
```
Here you can see that this _imaging.so library links against libjpeg.so.62.3.0 from /home/qsdevices/.local/lib/python3.7/site-packages/PIL/../Pillow.libs/libjpeg-35e8c64c.so.62.3.0 and not against my system's. And this is where I discovered that the shared libraries that aren't part of the manylinux contract are actually shipped and installed with the wheel package. This seems to be handled automatically by auditwheel repair. So now you understand why, with the exact same Pillow version installed from pip or from my distribution's package manager, the benchmarks were so different: prebuilt Pillow ships with the original libjpeg while the one from my distribution links against libjpeg-turbo, and is thus much faster. This also explains why recompiling Pillow from source with pip instead of taking the wheel package made it perform so similarly to my distribution's: being built locally, Pillow would find libjpeg-turbo on my system instead of libjpeg and use the former. As a note, Pillow 9.0.0 and later wheels are now built against libjpeg-turbo. I did a very quick test on my Intel Q6600-based system from above: Pillow 8.4.0 takes around 29 minutes for the current state of the blog against 17 minutes for Pillow 9.0.1. No need to compile Pillow from source anymore to get better perfs! (Though Pillow-SIMD is likely to still be more performant on computers that support it.)
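If you want to check which JPEG backend your installed Pillow actually uses, recent versions can report it themselves. Note that the libjpeg_turbo feature flag only exists from Pillow 9.0.0 onwards; on older versions the check below simply warns about an unknown feature and returns False, so treat this as an illustrative sanity check rather than a universal probe.

```python
# Inspect what the installed Pillow was built against.
from PIL import features

features.pilinfo()  # prints Python/Pillow versions and which optional features are enabled
print("built against libjpeg-turbo:", features.check("libjpeg_turbo"))  # flag added in Pillow 9.0.0
```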
If you paid attention earlier, I stated that the manylinux tag is a contract for packages linking against the glibc. It happens that Alpine Linux does not use the glibc but rather musl. Therefore, when pip tried to find a wheel package that could fulfil a contract with the musl libc, it couldn't find any (because none existed). This means that pip could only fetch the source version of the Python extension module and had to compile it. This is why using Alpine containers for Python packages was so discouraged. However, that is now history, because there's a new PEP-0656 which introduces a contract for musl-based systems with the musllinux tag. Pillow still does not support it though, but one can only dream it'll be supported soon enough :)
I'm happy to have decided to work on my recitale fork, for it taught me a lot about Python and packaging :)
Now let's see how long I keep torturing myself with multiprocessing in Python instead of reimplementing it in a more suitable language (which would probably be much faster too).