TL;DR -> take a look at the summary.

I thought it would be a good idea to make my C/C++ library (Edlib) for calculating edit distance also usable from Python. In this blog post I give an overview of how I did it, including all the tricks and traps that I encountered, in hope that it will help others and also serve as a reminder for me.

What does it mean to have our C/C++ library usable from Python? My goal was to have Edlib installable from pip, using pip install edlib. For that, we have to create a Python extension module that uses our C/C++ library and then package it in such a way that it can be published to PyPI.

Writing Python extension module

The first step is to create a Python extension module: a piece of code that interfaces with our C/C++ library and wraps it in a new interface that can be used from Python.

Options

There are multiple ways to do this. Some of the most interesting that I found are: Python API, Cython, cffi, ctypes. Unfortunately, I did not have enough time to try out all of them, but I did do some investigation and eventually picked Cython.

The Python API approach means writing C code against the API defined in Python.h. It demands a lot of additional work on portability, and you also need to learn the API itself. It seemed too complex and too low-level for what I needed.

Reading about ctypes, I got the feeling that it is better suited to calling other C/C++ libraries from your Python code than to wrapping your own C/C++ library for Python. I also read in multiple places that Cython and cffi are more advanced than ctypes, so my final choice was between cffi and Cython.
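To give a feel for what the ctypes style looks like, here is a minimal sketch that calls strlen from the C standard library. It is not related to Edlib; it assumes a POSIX system where the C library can be located, and the helper name c_strlen is made up for this illustration.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library("c") locates the C standard
# library; if it returns None, CDLL(None) falls back to the symbols that
# are already loaded into the current process.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature by hand: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Return the length in bytes of `text`, as measured by C's strlen."""
    return libc.strlen(text.encode())


print(c_strlen("telephone"))
```

Note that the function signature has to be declared by hand at runtime, and nothing gets compiled: ctypes loads a library that already exists in binary form.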

Why did I choose Cython over cffi? For no hard reason: I read a few posts, had a quick look at the docs for both, and got a better feeling about Cython. I hope to also try cffi in the future and learn more about it.

Doing it with Cython

I will not go into too many details about Cython, since it has good documentation on wrapping C libraries, but the main idea is that you write in a superset of Python that looks like a mix of Python and C, which enables you to easily mix both.

My C/C++ library is fairly simple and it has only two files, edlib.cpp and edlib.h, without external dependencies.

First, I created a cedlib.pxd file, which is basically just a description of parts of edlib.h that we want to interface with in our module (I somewhat simplified Edlib interface for this blog post):

cdef extern from "edlib.h":
    int edlibAlign(const char* query, int queryLength,
                   const char* target, int targetLength)

It basically comes down to copying declarations from your .h file to the .pxd file, while adding a few keywords like ctypedef for structs and enums and doing some formatting. This is a somewhat boring part, but it’s relatively easy to do.

Next, I created an edlib.pyx file (be careful not to name it the same as the .pxd file because that can cause problems), which contains the actual Cython code of our extension module. In my case, edlib.pyx contains a call to my C/C++ library and some type conversions from Python to C:

cimport cedlib

def align(query, target):
    # Transform Python strings into C strings (UTF-8 bytes by default).
    # The bytes objects must stay alive while the char* pointers are in use.
    cdef bytes query_bytes = query.encode()
    cdef char* cquery = query_bytes
    cdef bytes target_bytes = target.encode()
    cdef char* ctarget = target_bytes

    # Run calculation - central method of Edlib.
    # Pass byte lengths, which match the C buffers even for non-ASCII input.
    editDistance = cedlib.edlibAlign(cquery, len(query_bytes),
                                     ctarget, len(target_bytes))

    return editDistance
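One detail worth keeping in mind in a wrapper like this: the C side sees bytes, not Python characters, so the lengths handed to edlibAlign should describe the encoded buffers. The plain-Python sketch below shows how the two counts can differ, assuming UTF-8 (the default of str.encode()); the helper name c_string_length is made up for this illustration.

```python
def c_string_length(text: str, encoding: str = "utf-8") -> int:
    """Number of bytes a C function would see for `text` in `encoding`."""
    return len(text.encode(encoding))


ascii_word = "elephant"
accented_word = "\u00e9l\u00e9phant"  # "elephant" in French, with two accented letters

# ASCII only: 8 characters and 8 bytes.
print(len(ascii_word), c_string_length(ascii_word))

# Each accented letter takes 2 bytes in UTF-8: 8 characters but 10 bytes.
print(len(accented_word), c_string_length(accented_word))
```

For ASCII-only input, which is the common case for biological sequences, the two lengths are the same.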

After compilation, this should give us a simple Python module, which is used like this:

import edlib
print(edlib.align("telephone", "elephant"))  # Prints "3"
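When testing a wrapper like this, it can help to have a slow but obviously correct reference to compare against. Below is a sketch of the textbook dynamic-programming edit distance in plain Python. It is not how Edlib computes the result, and it assumes that the value returned by align is the Levenshtein distance (insertions, deletions and substitutions all costing 1); the function name levenshtein is my own.

```python
def levenshtein(query: str, target: str) -> int:
    """Textbook edit distance: insert, delete and substitute all cost 1."""
    # previous[j] holds the distance between the query prefix processed so
    # far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, query_char in enumerate(query, start=1):
        current = [i]  # Turning i query characters into "" takes i deletions.
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if query_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete query_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein("telephone", "elephant"))  # Prints "3"
```

A test for the extension module could then compare edlib.align(query, target) with levenshtein(query, target) on a handful of short strings.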

To get there, the next step is to compile the .pyx and .pxd files with Cython (e.g. cython edlib.pyx on the command line), which gives us a .c file that we can further compile into a Python extension module (.so). This .c file is independent of Cython, so we do not have to think about Cython anymore once it is generated.

But how do we actually make all this (Cython compiling, compiling extension module) work nicely? This is intertwined with packaging and distributing the module, so let’s talk more about that.

Building and distributing Python extension module

Setup.py

Similar to package.json for npm, Python has setup.py, the main file for describing the Python module distribution. Not only does setup.py provide module information (author name, version, description, …), it also provides information on how to distribute, build and install the module.

setup.py is actually just a file that calls setup function with named arguments.
There are two main ways to get the setup function: either from the setuptools module or from the distutils module. At first glance they look very similar, but setuptools is newer and was created to replace distutils, so that is what you should be using. When reading examples online, be careful to check whether they are for setuptools or for distutils.

Here is a simplified example of setup.py for Edlib:

from setuptools import setup, Extension

setup(
    # Information
    name = "edlib",
    version = "1.0.0",
    url = "https://github.com/Martinsos/edlib",
    author = "Martin Sosic",
    license = "MIT",
    keywords = "edit distance levenshtein align sequence bioinformatics",
    # Build instructions
    ext_modules = [Extension("edlib",
                             ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                             include_dirs=["edlib/include"])]
)

edlib.bycython.c is generated by compiling edlib.pyx with Cython, while edlib/src/edlib.cpp is the source file of the Edlib C/C++ library. I will talk about them in more detail later.

Part of setup.py that is most interesting to us is the ext_modules argument of the setup function - it defines extension modules to be built and needed source files.

Source distribution (sdist)

After we have created setup.py, all it takes to build the module is running python setup.py build_ext -i, which compiles the source files and generates a .so file that is our extension module. If we run the Python interpreter from the directory that contains the .so file, import edlib will work. This is useful for testing, but how can we turn our module into something that can be distributed to other platforms?

Generating source distribution is the key to distributing our module.
To do that, we run python setup.py sdist to generate a source distribution tarball, which can then either be published to PyPI or installed directly using pip install <path_to_tarball>. Unlike python setup.py build_ext -i, this does not compile the source files or generate a .so file. Instead, it copies the needed source files into the tarball so they can be shipped to the user, who triggers the compilation/building when running pip install <path_to_tarball>. This way, the user builds our module for their specific platform, based on the source files and build instructions that we supply via the source distribution.

A source distribution can also be supplemented with wheels (platform-specific binaries), which is cool because no compilation is needed on the user’s side. I did not create any wheels for Edlib for three reasons: support for them on Linux is not complete, I did not have easy access to the platforms needed to build these binaries, and Edlib is small and compiles so fast that having binaries would not bring many benefits.

Structure of my project

Before continuing, let’s take a quick look at how I structured my project:

...
- edlib/  # C/C++ library.
  - src/
    - edlib.cpp
  - include/
    - edlib.h
- bindings/
  - python/  # Python module.
    - edlib.pyx  # Cython source file.
    - cedlib.pxd  # Cython header file.
    - setup.py  # Main module distribution file.
    - README.rst  # Module readme.
    - MANIFEST.in  # Describes additional files to copy to sdist tarball.
    - Makefile  # I use it to automate certain steps when developing/publishing module.
...

As you can see, I keep both the C/C++ library and the Python module in the same git repository. This enables me to easily make changes that span both projects and to keep their versions in sync.

Compiling Cython and getting ext_modules right

Now, let’s take a closer look at that ext_modules argument in setup.py and how I defined the extension source files:

ext_modules = [Extension("edlib",
                         ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"])]

First of all, how can I write edlib/src/edlib.cpp when that is not the correct path? The Edlib C/C++ library directory is two levels above setup.py, so I should be using ../../edlib/src/edlib.cpp, right? Well, I tried that and just couldn’t get it to work: simply building the extension with python setup.py build_ext -i works, but it all gets messed up when I try to build from the created source distribution.
I solved this with my custom Makefile, so for example when running make sdist (instead of python setup.py sdist) it will first copy the edlib/ directory to bindings/python/ and then run python setup.py sdist - that is the reason why paths edlib/src/edlib.cpp and edlib/include work in ext_modules.

Next, if you take a look at Cython examples, you will see that most of them use cythonize() in setup.py together with .pyx source files. On the other hand, I have none of that in my setup.py - why? Well, at first I did try to use it, and my ext_modules was looking something like this:

...
from Cython.Build import cythonize
...
ext_modules = cythonize([Extension("edlib",
                                   ["edlib.pyx", "edlib/src/edlib.cpp"],
                                   include_dirs=["edlib/include"])])
...

This actually worked fine on my machine and I liked how simple it was. However, I soon realized that for this to work, the user has to have Cython installed on their machine! I didn’t like that, since I wanted the install to be as simple as pip install edlib, and I didn’t want the user to have to explicitly install anything else, especially not a module (Cython) whose purpose they may not even know. On top of that, I would have no way to control which version of Cython the user has, which may give unexpected results during compilation.

I did some research and identified two solutions for this:

  1. Having the correct version of Cython installed automatically for the user, as dependency of Edlib, so that the user does not have to worry about that.
  2. Pre-compiling Cython source files and distributing only C/C++ files to the user.

Compiling Cython - first approach

At first, I decided to go with the first approach. I found out about the setup_requires argument of setup(), where we can specify dependencies needed for our module to install, and that seemed like a good fit. Of course, that did not work: setup.py has to be executed in order to process the setup_requires argument, and at that point cythonize is not yet defined, so the execution fails. There were some hacks for solving this, but they just seemed too dirty to me.

Luckily, I found out that since version 18.0 setuptools has special support for Cython. All we have to do is specify Cython in setup_requires and remove cythonize(). Setuptools will download the correct version of Cython as a local egg (which is great because it does not mess with the global Cython, if the user has it pre-installed), then it will recognize that there are .pyx source files in ext_modules and compile them using Cython. Since there are now no references to Cython in setup.py, we no longer have the problem of execution failing because of the missing Cython before reaching setup_requires. My code then looked like this:

ext_modules = [Extension("edlib",
                         ["edlib.pyx", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"])],
setup_requires = ["setuptools>=18.0", "cython>=0.25"]

This seemed like a great solution, until I realized that installation was taking a really long time (half a minute to a minute). Setuptools has to download Cython and compile it! I really did not like this, so I decided to also try the second approach.

Compiling Cython - second approach

I added a command to my Makefile that compiles Cython file(s) to .c file(s): cython edlib.pyx -o edlib.bycython.c. Next, I modified setup.py to the final version, which I have already shown above:

ext_modules = [Extension("edlib",
                         ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"])]

Since these are now just C/C++ files (edlib.bycython.c is generated by Cython from edlib.pyx), the user doesn’t need Cython anymore to build the module from the source distribution. This worked well and resulted in a very fast installation. Instead of taking almost a minute, as was the case with the first approach, it took only a second or two. Therefore, I decided to stick with the second approach, since it is both simple and fast, and makes it easy for the user to install Edlib.

Including all needed files into source distribution

Another problem that I encountered was ensuring that all needed source files are copied to the source distribution tarball. The thing is, setuptools uses ext_modules for figuring out which source files to copy to sdist tarball, and I had a problem that edlib.h wouldn’t get copied. I also wanted to copy cedlib.pxd and edlib.pyx files because it was mentioned as a good practice to include Cython source files in source distribution, even when already pre-compiled.

That is why I added include_dirs argument to Extension, however that also didn’t help. I tried adding edlib.h directly to the list of source files, next to edlib.bycython.c and edlib/src/edlib.cpp, without any luck.

In the end, I found out that such files should be listed in the MANIFEST.in file. All files matched by MANIFEST.in, together with the files that setuptools figures out on its own, are included in the source distribution.
There are also the package_data and data_files arguments of setup(), which serve a similar purpose. I still haven’t completely figured out how they relate to MANIFEST.in and when they should be used, but I am fairly sure they are also meant to copy files on install, while MANIFEST.in only copies files to the source distribution, which is what I needed. Hence, I decided to stick with MANIFEST.in. This is what my MANIFEST.in looked like:

recursive-include edlib/include *
include *.pxd
include *.pyx
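Since a missing file only shows up when somebody tries to install from the tarball, it can be handy to check the contents of the tarball right after building it. The following is a small sketch of such a check using only the standard library. The function name missing_from_sdist and the tarball path in the usage comment are hypothetical, and it assumes the usual sdist layout in which everything sits under a single top-level "<name>-<version>/" directory.

```python
import tarfile


def missing_from_sdist(tarball_path, required_files):
    """Return the required files that are absent from the sdist tarball."""
    with tarfile.open(tarball_path) as tarball:
        # Assumption: every entry looks like "<name>-<version>/<path>",
        # so drop the first path component before comparing.
        present = {
            name.split("/", 1)[1]
            for name in tarball.getnames()
            if "/" in name
        }
    return sorted(set(required_files) - present)


# Usage (hypothetical path to the tarball produced by setup.py sdist):
#
#   missing = missing_from_sdist(
#       "dist/edlib-1.0.0.tar.gz",
#       ["edlib.pyx", "cedlib.pxd", "edlib.bycython.c",
#        "edlib/src/edlib.cpp", "edlib/include/edlib.h"])
#   if missing:
#       raise SystemExit("Missing from sdist: " + ", ".join(missing))
```

An empty list means that all the listed files made it into the source distribution.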

It works!

And that is it! Running make sdist from my Makefile runs the following commands and builds the source distribution tarball:

cp -R ../../edlib .
cython edlib.pyx -o edlib.bycython.c
python setup.py sdist

Publishing to PyPI

Now that we have a way to generate source distribution tarball, the only thing that remains is to publish it to PyPI, so that everybody can download and install it using pip install edlib.

There are a few ways to set everything up on PyPI. I ended up doing it all manually (registering, creating the initial entry for the package). After that, I use twine to publish new versions of the package: twine upload <path_to_sdist_tarball>.

Summary

I used Cython to create a Python extension module that wraps my C/C++ library.

For building and distributing a module, I use setuptools.

I pre-compile the Cython source files using the cython command-line tool via a custom Makefile. This made my setup.py simpler, and the user doesn’t have to install Cython when installing my module. Moreover, the installation is very fast.

For easier maintenance, I keep both the Python extension module and the C/C++ library in the same repository. As a consequence, I had a problem including the source files of my C/C++ library in the source distribution, since they are positioned outside of the module directory. I solved this by automatically (via Makefile) copying them to the module directory when creating the source distribution.

This is what my ext_modules argument of setup() looks like (edlib.bycython.c is generated by Cython):

ext_modules = [Extension("edlib", ["edlib.bycython.c", "edlib/src/edlib.cpp"],
                         include_dirs=["edlib/include"],
                         depends=["edlib/include/edlib.h"])],

I use MANIFEST.in to include .h, .pxd and .pyx files in source distribution.

Finally, I publish my module as a source distribution (no wheels) to PyPI using twine.

[Figure: illustration of my build process]

For more details, check my Edlib Python extension module on GitHub!