TL;DR -> take a look at the Summary at the end.
I thought it would be a good idea to make my C/C++ library (Edlib) for calculating edit distance also usable from Python. In this blog post I give an overview of how I did it, including all the tricks and traps that I encountered, in the hope that it will help others and also serve as a reminder for me.
What does it mean to have our C/C++ library usable from Python?
My goal was to have Edlib installable from pip, using pip install edlib.
For that, we have to create a Python extension module that uses our C/C++ library and then package it in such a way that it can be published to PyPI.
- Writing Python extension module
- Building and distributing Python extension module
- Publishing to PyPI
Writing Python extension module
The first step is to create a Python extension module, which is basically a piece of code that interfaces with our C/C++ library and creates a new interface around it that can be used from Python.
There are multiple ways to do this. Some of the most interesting ones that I found are: Python API, Cython, cffi and ctypes. Unfortunately, I did not have enough time to try out all of them, but I did do some investigation and eventually picked Cython.
The Python API approach means writing code in C that uses the Python API defined in Python.h. It demands a lot of additional work regarding portability, and you also need to learn how to use the API. It just seemed too complex and too "low-level" for what I needed.
Reading about ctypes, I got the feeling that it is more suitable for interfacing with other C/C++ libraries from your Python code than for wrapping your own C/C++ library for Python. Also, I read in multiple places that Cython and cffi are more advanced than ctypes, so my final choice was between cffi and Cython.
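To give a feeling for the ctypes style, here is a small sketch of mine (not from Edlib): it wraps strlen from the C standard library, since wrapping an arbitrary shared object would look the same except for the library path.

```python
import ctypes

# On POSIX, loading None exposes symbols already linked into the process,
# which includes the C standard library. Wrapping your own library would be
# ctypes.CDLL("path/to/libedlib.so") instead (hypothetical path).
libc = ctypes.CDLL(None)

# Declare the C signature so ctypes converts arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"telephone"))  # prints 9
```

Notice there is no compilation step at all: ctypes does everything at runtime, which is exactly why it feels better suited to consuming existing libraries than to shipping your own.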
Why did I choose Cython over cffi? For no hard reason - I read a few posts, had a quick look at the docs for both, and got a better feeling about Cython. I hope to also try cffi in the future and learn more about it.
Doing it with Cython
I will not go into too many details about Cython, since it has good documentation on wrapping C libraries, but the main idea is that you write in a superset of Python that looks like a mix of Python and C, which enables you to easily mix both.
My C/C++ library is fairly simple: it has only two files, edlib.cpp and edlib.h, without external dependencies.
First, I created a cedlib.pxd file, which is basically just a description of the parts of edlib.h that we want to interface with in our module (I somewhat simplified the Edlib interface for this blog post):
```cython
cdef extern from "edlib.h":
    int edlibAlign(const char* query, int queryLength,
                   const char* target, int targetLength)
```
It basically comes down to copying declarations from your .h file to the .pxd file, while adding a few keywords like ctypedef for structs and enums and doing some formatting. This is a somewhat boring part, but it's relatively easy to do.
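As an illustration of that ctypedef mapping, a C struct from the header turns into a ctypedef block in the .pxd - a hypothetical sketch (AlignResult and alignFull are invented names, not Edlib's real interface):

```cython
cdef extern from "edlib.h":
    ctypedef struct AlignResult:   # hypothetical struct, for illustration
        int editDistance
        int alignmentLength

    AlignResult alignFull(const char* query, int queryLength,
                          const char* target, int targetLength)
```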
Next, I created an edlib.pyx file (be careful not to name it the same as the .pxd file, because that can cause problems), which contains the actual Cython code of our extension module. In my case, edlib.pyx contains a call to my C/C++ library and some type conversions from Python to C:
```cython
cimport cedlib

def align(query, target):
    # Transform Python strings into C strings.
    cdef bytes query_bytes = query.encode()
    cdef char* cquery = query_bytes
    cdef bytes target_bytes = target.encode()
    cdef char* ctarget = target_bytes
    # Run calculation - central method of Edlib.
    editDistance = cedlib.edlibAlign(cquery, len(query), ctarget, len(target))
    return editDistance
```
After compilation, this should give us a simple Python module, which is used like this:
```python
import edlib
print(edlib.align("telephone", "elephant"))  # Prints "3"
```
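To make it concrete what number align() returns, here is a plain-Python edit distance of my own - a simple O(n*m) dynamic program, not the much faster algorithm Edlib actually implements:

```python
def edit_distance(a, b):
    # Classic two-row Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("telephone", "elephant"))  # prints 3
```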
The next step is to compile the .pyx and .pxd files with Cython (e.g. cython edlib.pyx on the command line), which gives us a .c file that we can further compile into a Python extension module (.so). This .c file is independent of Cython, so we do not have to think about Cython anymore once it is generated.
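That further compilation step needs Python's C headers, since the generated .c file includes Python.h; a quick way to see where they live (setuptools passes this include path to the compiler automatically):

```python
import sysconfig

# Directory containing Python.h - a compiler building the generated .c file
# needs this directory on its include path.
print(sysconfig.get_paths()["include"])
```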
But how do we actually make all this (Cython compiling, compiling extension module) work nicely? This is intertwined with packaging and distributing the module, so let’s talk more about that.
Building and distributing Python extension module
Similar to package.json for npm, Python has setup.py - the main file for describing the Python module distribution.
Not only does setup.py provide module information (author name, version, description, ...), but it also provides information on how to distribute, build and install the module. setup.py is actually just a file that calls the setup function with named arguments.
There are two main ways to get the setup function: either from the setuptools module or from the distutils module. At first glance they look very similar; however, setuptools is newer and was created to replace distutils, so that is what you should be using. When reading examples online, be careful to check whether they are for setuptools or for distutils.
Here is a simplified example of setup.py for Edlib:
```python
from setuptools import setup, Extension

setup(
    # Information
    name = "edlib",
    version = "1.0.0",
    url = "https://github.com/Martinsos/edlib",
    author = "Martin Sosic",
    license = "MIT",
    keywords = "edit distance levenshtein align sequence bioinformatics",
    # Build instructions
    ext_modules = [Extension("edlib",
                             ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                             include_dirs=["edlib/include"])]
)
```
edlib.bycython.c is generated by compiling edlib.pyx with Cython, while edlib/src/edlib.cpp is the source file of the Edlib C/C++ library. I will talk about them in more detail later.
The part of setup.py that is most interesting to us is the ext_modules argument of the setup function - it defines the extension modules to be built and the source files they need.
Source distribution (sdist)
After we have created setup.py, all it takes to build a module is running python setup.py build_ext -i, which will compile the source files and generate a .so file that is our extension module. We can run the Python interpreter from the directory in which the .so file is, and import edlib will work. This is useful for testing, but how can we make our module into something that can be distributed to other platforms?
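The import works because the built file's name ends with one of the suffixes Python's import system accepts for extension modules; you can list them:

```python
import importlib.machinery

# Suffixes the importer accepts for compiled extension modules; the file
# produced by `python setup.py build_ext -i` ends with one of these
# (e.g. ".cpython-312-x86_64-linux-gnu.so" on CPython/Linux).
print(importlib.machinery.EXTENSION_SUFFIXES)
```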
Generating a source distribution is the key to distributing our module.
To do that, we run python setup.py sdist to generate a source distribution tarball, which can then either be published to PyPI or installed directly using pip install <path_to_tarball>. This will not compile the source files and generate a .so file like python setup.py build_ext -i does. Instead, it will copy the needed source files into the tarball so they can be shipped to the user, who will trigger the compilation/building when running pip install <path_to_tarball>. This way, the user builds our module for his specific platform, based on the source files and build instructions that we've supplied to him via the source distribution.
The source distribution can also be supplemented with wheels (platform-specific binaries), which is cool because no compilation is needed on the user's side. I did not create any wheels for Edlib: wheel support for Linux was not complete, I did not have easy access to the needed platforms to build the binaries, and Edlib is small and compiles so fast that having binaries would not bring much benefit.
Structure of my project
Before continuing, let’s take a quick look at how I structured my project:
```
...
- edlib/            # C/C++ library.
  - src/
    - edlib.cpp
  - include/
    - edlib.h
- bindings/
  - python/         # Python module.
    - edlib.pyx     # Cython source file.
    - cedlib.pxd    # Cython header file.
    - setup.py      # Main module distribution file.
    - README.rst    # Module readme.
    - MANIFEST.in   # Describes additional files to copy to sdist tarball.
    - Makefile      # I use it to automate certain steps when developing/publishing the module.
...
```
As you can see, I keep both the C/C++ library and the Python module in the same git repository. This enables me to easily make changes that span both projects and to keep their versions in sync.
Compiling Cython and defining extension source files
Now, let's take a closer look at the ext_modules argument in setup.py and how I defined the extension source files:
```python
ext_modules = [Extension("edlib",
                         ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"])]
```
First of all, how can I write edlib/src/edlib.cpp when that is not the correct path? The Edlib C/C++ library directory is two levels above setup.py, so I should be using ../../edlib/src/edlib.cpp, right? Well, I tried that and just couldn't get it to work - simply building the extension with python setup.py build_ext -i works, but it all gets messed up when I try to build from the created source distribution.
I solved this with my custom Makefile: when running make sdist (instead of python setup.py sdist), it will first copy the edlib/ directory to bindings/python/ and then run python setup.py sdist - that is the reason why paths like edlib/src/edlib.cpp and edlib/include work in setup.py.
Next, if you take a look at Cython examples, you will see that most of them use cythonize() in setup.py, together with .pyx source files. On the other hand, I have none of that in my setup.py - why?
Well, at first I did try to use it, and my ext_modules looked something like this:
```python
...
from Cython.Build import cythonize
...
ext_modules = cythonize([Extension("edlib",
                                   ["edlib.pyx", "edlib/src/edlib.cpp"],
                                   include_dirs=["edlib/include"])])
...
```
This actually worked fine on my machine, and I liked how simple it was. However, I soon realized that for this to work, the user has to have Cython installed on his machine! I didn't like that, since I wanted the install to be as simple as pip install edlib, and I didn't want the user to have to explicitly install anything else, especially not a module whose purpose he/she may not even know (Cython). On top of that, there is no way for me to control which version of Cython the user uses, which may give unexpected results during compilation.
I did some research and identified two solutions for this:
- Having the correct version of Cython installed automatically for the user, as a dependency of Edlib, so that the user does not have to worry about that.
- Pre-compiling Cython source files and distributing only C/C++ files to the user.
Compiling Cython - first approach
At first, I decided to go with the first approach. I found out about the setup_requires argument of setup(), where we can specify dependencies needed for our module to install, and that seemed like a good fit. Of course, that did not work: when setup.py is executed in order to process setup_requires, Cython is not installed yet, so cythonize is not defined and execution fails. There were some hacks for solving this, but they just seemed too dirty to me.
Luckily, I found out that since version 18.0, setuptools has special support for Cython. All we have to do is specify Cython in setup_requires and remove the call to cythonize(). Setuptools will download the correct version of Cython as a local egg (which is great because it does not mess with the global Cython, if the user has it pre-installed), then it will recognize that there are .pyx source files in ext_modules and compile them using Cython. Since there are now no references to Cython in setup.py, execution no longer fails because of missing Cython before reaching setup_requires. My code then looked like this:
```python
ext_modules = [Extension("edlib",
                         ["edlib.pyx", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"])],
setup_requires = ["setuptools>=18.0", "cython>=0.25"]
```
This seemed like a great solution, until I realized that installation was taking a really long time (half a minute to a minute) - setuptools has to download Cython and compile it! I really did not like this, so I decided to also try the second approach.
Compiling Cython - second approach
I added a command to my Makefile that compiles the Cython file(s) to .c file(s): cython edlib.pyx -o edlib.bycython.c.
Next, I modified setup.py to the final version, which I have already shown above:
```python
ext_modules = [Extension("edlib",
                         ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"])]
```
Since these are now just C/C++ files (edlib.bycython.c is generated by Cython from edlib.pyx), the user doesn't need Cython anymore to build the module from the source distribution. This worked well and resulted in a very fast installation: instead of taking almost a minute, as was the case with the first approach, it took only a second or two.
Therefore, I decided to stick with the second approach, since it is both simple and fast, and makes it easy for the user to install Edlib.
Including all needed files into source distribution
Another problem that I encountered was ensuring that all needed source files are copied to the source distribution tarball.
The thing is, setuptools uses ext_modules to figure out which source files to copy to the sdist tarball, and I had a problem: edlib.h wouldn't get copied. I also wanted to copy the edlib.pyx file, because including Cython source files in the source distribution is mentioned as good practice, even when they are already pre-compiled.
That is why I added the include_dirs argument to Extension; however, that didn't help either. I also tried adding edlib.h directly to the list of source files, next to edlib/src/edlib.cpp, without any luck.
In the end, I found out that such files should be specified in the MANIFEST.in file. All files listed in MANIFEST.in, in addition to some files that setuptools figures out on its own, will be included in the source distribution.
There is also the data_files argument of setup(), which is used for a similar purpose. I still haven't completely figured out its relationship with MANIFEST.in and when it should be used; however, I am pretty sure data_files is also meant to copy files on install, while MANIFEST.in only copies files to the source distribution, which is what I needed. Hence, I decided to stick with MANIFEST.in. This is what my MANIFEST.in looked like:
```
recursive-include edlib/include *
include *.pxd
include *.pyx
```
And that is it! Running make sdist from my Makefile would run the following commands and build the source distribution tarball:
```shell
cp -R ../../edlib .
cython edlib.pyx -o edlib.bycython.c
python setup.py sdist
```
Publishing to PyPI
Now that we have a way to generate a source distribution tarball, the only thing that remains is to publish it to PyPI, so that everybody can download and install it using pip install edlib.
There are a few ways to set everything up on PyPI. I ended up doing it all manually (registering, creating the initial entry for our package).
After that, I use twine to upload new versions of the package: twine upload <path_to_sdist_tarball>.
Summary
For building and distributing the module, I use setuptools.
I pre-compile the Cython source files using the Cython command-line tool, invoked from a custom Makefile. This made my setup.py simpler, and the user doesn't have to install Cython when installing my module. Moreover, the installation is very fast.
For easier maintenance, I keep both the Python extension module and the C/C++ library in the same repository. As a consequence, I had a problem including the source files of my C/C++ library in the source distribution, since they live outside the module directory. I solved this by automatically copying them (via Makefile) into the module directory when creating the source distribution.
This is what my ext_modules argument of setup() looks like (edlib.bycython.c is generated by Cython):
```python
ext_modules = [Extension("edlib",
                         ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"],
                         depends=["edlib/include/edlib.h"])],
```
I use MANIFEST.in to include the .h, .pxd and .pyx files in the source distribution.
Finally, I publish my module as a source distribution (no wheels) to PyPI using twine.
Illustration of my build process:
For more details, check my Edlib Python extension module on Github!