Automating The Dark Arts

Packaging and Distributing cppyy-generated Python Bindings for C++ Projects with CMake and Setuptools

TL;DR

I rewrote the cppyy CMake modules to be much more user friendly and to work using only Anaconda/PyPI packages, and to generate more feature-complete and customizable Python packages using CMake’s configure_file, while also supporting distribution of cppyy pythonization functions. I then rewrote the existing k-nearest-neighbors example project to use my new system, and wrote bindings generation for bbhash as an example with a real library. Finally, I wrote a recipe for cookiecutter to generate project templates for CMake-built cppyy bindings.

A Bit About Bindings

I’ve been writing Python bindings for C++ code for a number of years now. My first experiences were with raw CPython for our lab’s khmer/oxli project; if you’ve ever done this before, you’re aware that it’s laborious, filled with boilerplate and incantations which are required but easily overlooked. After a while, we were able to convince Titus that the maintenance burden and code bloat would be much lessened by switching to Cython for bindings generation. After a major refactor, most of the old raw CPython was excised.

Cython, however, while powerful, is much better suited to interacting with C than C++. Its support for templates is limited (even C++11 features like variadic templates are only marginally supported, let alone anything newer); it often fails with more than a few overloads; and there’s no built-in mechanism for generating Python classes from templated code. Furthermore, one has to redefine the C/C++ interface in cdef extern blocks in Cython .pxd files, and then deal with the many issues of converting types and handling encodings. In short, Cython obviates the need for a lot of the boilerplate required with raw CPython, while requiring its own boilerplate and keeping three distinct layers of code synchronized.

My own research codebase makes considerably more use of C++11-and-beyond template features than the khmer/oxli codebase, but I still wrap it in Python for easier high-level scripting and, in particular, testing. I’ve been working around Cython’s template limitations with a jury-rigged solution: Cython files defined with Jinja2 templates and a hackneyed template type substitution mechanism, all hooked together with my own build system written in pydoit with its own pile of half-assed code for deducing Cython compilation dependencies, a feature which setuptools for Cython currently lacks. This system has worked surprisingly well, regardless of my labmates’ looks of horror when I describe it, but it’s obviously brittle. Luckily, there is a better way…

Automatic Bindings Generation

While discussing this eldritch horror of a build system, my labmate Luiz mentioned a project he’d heard of in passing called cppyy. I began looking into it, and then tested it out on my project. I quickly came to the conclusion that it’s the darkest bit of code art I’ve ever laid eyes on, and I can only assume that Wim Lavrijsen and coauthors spent many a long night around a demon-summoning circle to bring it into existence. Regardless, when it comes to projects that so nicely solve my problems, I can work around Azazel being a core dev.

cppyy is built around the cling interactive C++ interpreter, which itself grew out of the ROOT project at CERN. It uses Clang/LLVM to generate introspection data for C++ code, and then generates and JITs bindings for use in Python. A demo and some more explanation is available on the Python Wiki, though I believe the cppyy docs and their notebook tutorial are more up to date.

cppyy does, to put it mildly, some damn cool shit. A few examples are:

  • Python metaclasses for C++ template classes: the C++ templates become first-class objects in Python, and use the bracket operator for type selection.
  • C++ namespaces as Python submodules.
  • C++17 features. It handles templates so well that it can even wrap Boost.
  • Ability to pass Python functions into C++ by converting them to std::function!
  • Cross-language inheritance. You can even override C++ pure virtual methods, or overload C++ functions, in derived classes, on the Python side.
  • Everything is generated lazily. This adds some startup time the first time classes are requested, but it’s all JIT’d after that.

And remember that this happens automatically. Awesome!
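To make that concrete, here’s a minimal sketch of a couple of those features in action (the inline C++ below is purely illustrative, not from any real project):

import cppyy
from cppyy.gbl import std              # C++ namespaces show up as Python submodules

v = std.vector[int]()                  # the template is a first-class object; brackets pick the type
v.push_back(42)

cppyy.cppdef('''
#include <functional>
double apply_twice(const std::function<double(double)>& f, double x) {
    return f(f(x));
}
''')
print(cppyy.gbl.apply_twice(lambda x: x + 1.0, 1.0))   # a Python lambda becomes a std::function; prints 3.0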

Meta

This post isn’t meant to be a complete tutorial on cppyy. For that, you should look through the documentation and tutorial. Rather, I’m aiming to head off problems that I ran into along the way, and then provide some solutions. So, read on!

Getting it Working

Now, with much love to our colleagues at CERN and elsewhere, my experience interacting with code written by physicists, or even touched by physicists, has been checkered at best. cppyy suffers from some of that familiar lack of tutorials and documentation, but is greatly served by an extremely responsive developer who also happens to seem like a nice person. wlav was very helpful during my experimentation, and I thank them for that.

With that said, I’m going to go through some of the problems I ran into. Ultimately, I decided that if I was going to solve those problems for me, I ought to solve them for everybody, hence this post and the associated software.

To start testing, I immediately began trying to run the associated tools on my own code to see if I could get a few classes working. First, I would need to install cppyy, which turns out to be quite simple. Unfortunately, there is not a recent package on conda-forge, but PyPI is up-to-date, and you can install with pip install cppyy. This will build cppyy’s modified libcling.

Dependencies

The documentation suggested running the code through genreflex, which would require an interface file #includeing my headers and explicitly declaring any template specializations I would need. genreflex runs rootcling with a bunch of preconfigured options, which ends up calling into clang and hence needs properly configured includes and library paths. It’s likely it will fail at first, due to being unable to find libclang.so or some variant; this can be solved by a sudo apt install clang-7 libclang-7-dev, or whatever the equivalent for your distribution. You can then go on to run genreflex and then compile with your system compilers as described in the docs.

This is fine for mucking about on your own computer, but with modern scientific software, it’s desirable to get this working in a conda environment (and for my purposes, bioconda). This required a lot of trial and error, but in the end, the necessary minimal invocation for the rest of this post is:

conda create -n cppyy-example python=3 cxx-compiler c-compiler clangdev libcxx libstdcxx-ng libgcc-ng cmake
conda activate cppyy-example 
pip install cppyy clang

The cxx-compiler and c-compiler packages bring in the conda-configured gcc and g++ binaries, and libcxx, libstdcxx-ng, and libgcc-ng bring in the standard libraries. Finally, clangdev brings in libclang.so, which is needed down the line, as is the Python clang package. Then we install cppyy from PyPI with pip, which will build its own cling and whatnot.

At this point, you should be able to generate and build bindings, and then load the resulting dictionaries, as described.
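Loading the result looks roughly like this (the dictionary and class names below are hypothetical placeholders):

import cppyy

# load the reflection dictionary built from the genreflex/compiler step
cppyy.load_reflection_info('libMyProjectDict.so')

# bound names live under cppyy.gbl and are materialized lazily on first access
from cppyy.gbl import MyClass
obj = MyClass()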

Build Systems

Now that I was able to get things working, I wanted to fit it into a proper build system. I had already decided to convert my project to CMake, and cppyy happens to include its own CMake modules for automating the process. Unfortunately, I rather quickly ran into issues here:

  • As opposed to the documentation, which uses genreflex for introspection generation, followed by a call to the compiler to create a shared library, the cppyy_add_bindings function provided by FindCppyy.cmake calls rootcling directly. This means you can’t use a selection XML to select and unselect C++ names to bind, and I was completely unable to get rootcling’s LinkDef mechanism working. This resulted in invalid code being generated, and I was ultimately unable to get it to compile.
  • cppyy includes a script called cppyy-generator which parses your headers and provides a mapping so that a provided initializer routine can inject all the C++ binding names into the Python module’s namespace; this allows you to call dir() on a namespace and see what’s available before the names are requested and lazily compiled. This script, however, uses the Python clang bindings (pip install clang), which need to find a libclang.so. You can do a conda install clangdev to get the dynamically linked library in a conda environment, but this will still fail, because it also needs the clangdev headers. The conda package tucks these away in $CONDA_PREFIX/lib/clang/<CLANG_VERSION>/include, which is not in any default include path, and so the provided FindLibClang.cmake fails. Minor modifications are not enough: if you add this directory to the provided INCLUDE_DIRS argument to cppyy_add_bindings, it will be passed to the rootcling and g++ invocations as well, which will fail with all sorts of compilation and linking errors because you’ve just mixed up several standard library versions.
  • If you get it all working, the resulting bindings library will fail to find symbols, as described here. This is because CMake needs the LINK_WHAT_YOU_USE directive to instruct ld not to drop the symbols for your C++ shared library.
  • The provided setup.py generators provide no ability to customize for your own target. They dump a string directly to a file with only a few basic package and author options.
  • There is no ability to package pythonization routines. These are sorely needed to make some of the directly-generated C++ interfaces more Pythonic. The autogenerated package also lacks things like a MANIFEST.in.

Results

A First Pass: bbhash

So, I went about fixing these things. I took a step away from my rather more complex project, and aimed at wrapping a smaller library (which happens to be one of my dependencies), bbhash, which provides minimal perfect hash functions. This also gave me the chance to learn a lot more about CMake. The results can be found on my github. Essentially, this implementation solves the problems listed above: it uses genreflex and a selection XML; it properly finds and utilizes all the conda compilers and libraries; it allows for packaging pythonizations with a discovery mechanism similar to pytest and other such packages; and it uses templates for the generation of the necessary Python package files. The end result also provides installation targets for both the underlying C++ library and the resulting bindings. Finally, it’s portable enough that I even have it running in continuous integration.
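As for the pythonizations themselves: a pythonization is just a callable that cppyy applies to each class as it gets bound in a given namespace. A hedged sketch (the class, method, and namespace names here are made up, not the actual bbhash bindings):

import cppyy

def pythonize_mphf(klass, name):
    # hypothetical example: let a perfect-hash class support h[key] syntax
    if name == 'PerfectHasher':
        klass.__getitem__ = klass.lookup

# apply to everything bound under the (hypothetical) C++ namespace 'mphf'
cppyy.py.add_pythonization(pythonize_mphf, 'mphf')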

The Second Pass: knn-nearest-neighbors-example

The bbhash example above, while more complex than the existing toy example, does not use a dynamically linked library: rather, the underlying header-only library is stuck into a static library and bundled directly into the bindings’ shared library. I figured I should also apply my work to the existing example, and at the same time fix the previously mentioned issue, so I went ahead and used the same project structure from cppyy-bbhash for a new knn-example. This time, a shared library is created for the underlying C++, and dynamically linked to the bindings library.

The Third Pass: a cookiecutter template

Now that I’d worked out most of the kinks, I figured I ought to make usage a bit easier. So, I created a cookiecutter recipe that will sketch out a basic project structure with my CMake modules and packaging templates. This is a work in progress, but is sufficient to reproduce the previous two repositories.

fin. Well, not really.

I plan to continue work on improving the cookiecutter template and ironing out any more kinks. And of course, I now have to finish applying this work to my own project, as I set out to do in the first place! Finally, I plan on working up a conda recipe demonstrating distribution, so that hopefully, one will soon be able to do a simple conda install mybindings. Look for a future post on that front :)

If you got this far, thanks for your patience and happy hacking!

Function-level fixture parametrization (or, some pytest magic)

In my more recent work with de Bruijn graphs, I’ve been making heavy use of py.test fixtures and parametrization to generate sequence graphs of a particular structure. Fixtures are a wonderful bit of magic for performing test setup and dependency injection, and their cascading nature (fixtures using fixtures!) means a few can be recombined in myriad ways. This post will assume you’ve already bowed to the wonder of fixtures and have some close familiarity with them; if not, it will appear to you as cosmic horror – which maybe it is, but cosmic horror never felt so good.

The Problem

I’ve got a bunch of fixtures, heavily parametrized, which are all composed. For example, I have one for generating varying flavors of our de Bruijn Graph (dBG) objects:

@pytest.fixture(params=[khmer.Nodegraph, khmer.Countgraph],
                ids=['(Type=Nodegraph)', '(Type=Countgraph)'])
def graph(request, ksize):
    ''' Instantiate a dBG with the parametrized
    underlying dBG type.
    ''' 
    num_kmers = 50000
    target_fp = 0.00001
    args = optimal_fp(num_kmers, target_fp)

    return request.param(ksize, args.htable_size,
                         args.num_htables) 

For those not familiar with the dBG, for our purposes it is a graph where the nodes are sequences of length k over some alphabet (in our case, the DNA alphabet {A, C, G, T}). We draw an edge u → v if the length k-1 suffix of u matches the length k-1 prefix of v. This turns out to be highly useful when we want to take a pile of highly redundant short random samples of an underlying sequence and try to extract something close to the underlying sequence. A more in-depth discussion of dBGs is, uh, left as an exercise to the reader – what we really care about here is k. It seems to be showing up often: as an argument to our dBG objects, as a way to prevent loops and overlaps in our randomly generated sequences, and, as it turns out, all over our tests for various indexing operations.
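In code terms, the edge rule is just a suffix/prefix check (a toy illustration, not something from the actual test suite):

def is_dbg_edge(u, v, k):
    # u and v are k-mers: there is an edge u -> v when the (k-1)-suffix of u
    # equals the (k-1)-prefix of v
    assert len(u) == k and len(v) == k
    return u[1:] == v[:-1]

assert is_dbg_edge('GATTAC', 'ATTACA', k=6)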

And so, there’s also a fixture for generating random nucleotide sequences that don’t overlap in a dBG of order k:

@pytest.fixture(params=list(range(500, 1600, 500)),
                ids=lambda val: '(L={0})'.format(val))
def random_sequence(request, ksize):
    global_seen = set()
    def get(exclude=None):
        # note that the definition of get random sequence
        # is also left as an exercise to the reader
        sequence = get_random_sequence(request.param,
                                       ksize,
                                       exclude=exclude,
                                       seen=global_seen)         
        for i in range(len(sequence)-ksize):      
            global_seen.add(sequence[i:i+ksize-1])
            global_seen.add(revcomp(sequence[i:i+ksize-1]))
        return sequence

    return get

…and naturally, one to compose them:

@pytest.fixture
def linear_structure(request, graph, ksize, random_sequence):
    '''Sets up a simple linear path graph structure.

    sequence
    [0]→o→o~~o→o→[-K]
    '''
    sequence = random_sequence()
    graph.consume(sequence)

    # Check for false positive neighbors in our graph:
    # this linear path should have no "high degree nodes"
    # Mark as an expected failure if any are found
    if hdn_counts(sequence, graph):
        request.applymarker(pytest.mark.xfail)

    return graph, sequence

You might notice a few things about these fixtures:

  1. OMG you’re testing with random data. Yes I am! But, the space for that data is highly constrained, it seems to be doing the job well, and I can always introspect unexpected failures.
  2. The random_sequence fixture returns a function! This is a trick to share some state at function scope: we keep track of the global set of seen k-mers, and the resulting function can generate many sequences.
  3. There’s an undefined parameter or fixture: ksize.

The last bit is the interesting part.

So, of course, it seems we should just write a fixture for ksize! The simplest approach might be:

@pytest.fixture
def ksize():
    return 21

This sets one default for each fixture and test using it. This kinda sucks though: we should be testing different sizes! So…

@pytest.fixture(params=[21, 25, 29])
def ksize(request):
    return request.param

Slightly better! We test three values of k instead of one. Unfortunately, it still doesn’t quite cut it: for some tests we want a more trivial dBG (say, with a very small k), or we might not want or need to have three instances of every single test. We need individual tests to be able to set their own k, and importantly, it still needs to trickle down to all the fixtures the test depends on. I’d also like this to be somewhat clear and concise: it turns out that what I’m about to show you can more or less be achieved with indirect parametrization, but I find that interface clunky (and not very well documented), and besides, this taught me a bit more about pytest.
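For comparison, the indirect route would look roughly like this (same ksize fixture name assumed; this isn’t the approach I ended up taking):

import pytest

@pytest.fixture
def ksize(request):
    # with indirect=True, the parametrized value arrives via request.param
    return getattr(request, 'param', 21)

@pytest.mark.parametrize('ksize', [5, 21], indirect=True)
def test_with_indirect_ksize(ksize):
    assert ksize in (5, 21)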

The Solution

My first thought was that it’d be nice to just set a variable within a test function and reach through the request object to pull it out with getattr, which would produce tests something like:

def test_something(graph, linear_structure):
    ksize = 5
    # stuff
    assert condition == True

Turns out this doesn’t work properly with test collection and Python’s scoping rules, and just feels icky to boot. We need a way to pass some information to the fixture, while also making it clear that it’s a property of the test itself and not some detail of the test’s implementation. Then I realized: decorators!

So, I came up with this:

def using_ksize(K):
    '''A decorator to set the _ksize 
    attribute for individual tests.
    '''
    def wrap(func):                                      
        setattr(func, '_ksize', K)
        return func                                      
    return wrap

Pretty straightforward: all it does is add an attribute called _ksize to the test function. However, we need to tell pytest and our fixtures about it. Turns out that the pytest API already has a hook for more granular control over parametrization, called pytest_generate_tests. This lets us grab the fixtures being used by whatever function pytest is currently setting up and poke at their generation in various ways. For example, in my case…

# goes in conftest.py

def pytest_generate_tests(metafunc):
    if 'ksize' in metafunc.fixturenames:
        ksize = getattr(metafunc.function, '_ksize', None)
        if ksize is None:
            ksize = [21]
        if isinstance(ksize, int):
            ksize = [ksize]
        metafunc.parametrize('ksize', ksize,  
                             ids=lambda k: 'K={0}'.format(k))

So what is this nonsense? We look at the metafunc, which contains the requesting context, and into its list of fixture names. If we find one called ksize, we check the calling function in metafunc.function for the _ksize attribute; if we don’t find it, we set a default value, and if we do, we just use it.

Now, we can write a couple different sorts of tests:

def test_something_default_ksize(ksize, random_sequence):
    # do stuff...
    assert ksize == 21

@using_ksize(25)
def test_something_override_ksize(ksize, random_sequence):
    # random sequence also gets the
    # override in its ksize fixture
    assert ksize == 25

@using_ksize([21,31])
def test_something_ksize_parametrize(ksize, random_sequence):
    # this one gets properly parametrized for 21 and 31
    assert condition

I rather like this approach: it’s quite clear and retains all the pytest fixture goodness, while also giving more granular control. This is a simple parametrization case which admittedly could be accomplished with indirect parametrization, but one could imagine scenarios where the indirect method would be insufficient. Curiously, you don’t even have to write an actual fixture function with this approach, as it’s implied by the function argument lists.

In my case, I eventually get tests like…

tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Nodegraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Nodegraph)-K=31] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Countgraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Countgraph)-K=31] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Nodegraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Nodegraph)-K=31] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Countgraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Countgraph)-K=31] PASSED

In the end, only mild cosmic horror.

SciPy 2016 Retrospective

Last week I attended my first SciPy conference in Austin. I’ve been to the past three PyCons in Montreal and Portland, and aside from my excitement to learn more about the great scientific Python community, I was curious to see how it compared to the general conference I’ve come to know and love.

SciPy, by my account, is a curious microcosm of the academic open source community as a whole. It is filled with great people doing amazing work, releasing incredible tools, and pushing the frontiers of features and accessibility in scientific software. It is also marked by some of the same problems as the larger community: a stark lack of gender (and other) diversity and a surprising (or not) lack of consciousness of the problem. I’ll start by going over some of the cool projects I learned about and then move on to some thoughts on the gender issue.

Cool Stuff

nbflow

Several new projects were announced, and several existing projects were given some needed visibility. The first I’ll talk about is nbflow. This is Jessica Hamrick’s system for “one-button reproducible workflows with the Jupyter notebook and scons.” In short, you can link up notebooks in a build system via two special variables in the first cells of a collection of notebooks – __depends__ and __dest__ – which contain lists of source and target filenames and are parsed out of the JSON to automagically generate build tasks. Jessica’s implementation is clean and can be pretty easily grokked with only a few minutes of reading the code, and it’s intuitive and relatively well tested. She delivered a great presentation with excellent slides and nice demos (which all worked ;)).
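So the first code cell of a notebook in an nbflow pipeline just declares its inputs and outputs, something like this (the filenames are hypothetical):

# first code cell of a hypothetical analysis.ipynb
__depends__ = ['results/tidy_data.csv']
__dest__ = ['results/summary_stats.csv']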

The only downside is that it uses scons, which isn’t Python 3 compatible and isn’t what I use, which must mean it’s bad or something. However, this turned out to be a non-issue due to the earlier point about the clean codebase: I was able to quickly build a pydoit module with her extractor, and she’s been responsive to the PR (thanks!). It would be pretty easy to build modules for any number of build systems – it only requires about 50 lines of code. I’m definitely looking forward to using nbflow in future projects.

JupyterLab

The Jupyter folks made a big splash with JupyterLab, which is currently in alpha. They’ve built an awesome extension API that makes adding new functionality dead simple, and it appears they’ve removed many of the warts from the current Jupyter client. State is seamlessly and quickly shipped around the system, making all the components fully live and interactive. They’re calling it an IDE – an Interactive Development Environment – and it will likely improve greatly upon the current Python data exploration workflow. It’s reminiscent of Rstudio in a lot of ways, which I think is a Good Thing; intuitive and simple interfaces are important to getting new users up and running with the language, and particularly helpful in the classroom. They’re shooting to have a 1.0 release out by next year’s SciPy, emphasizing that they’ll require a 1.0 to be squeaky clean. I’ll be anxiously awaiting its arrival!

Binder

Binder might be oldish news to many people at this point, but it was great to see it represented. For those not in the know, it allows you to spin up Jupyter notebooks on-demand from a github repo, specifying dependencies with Docker, PyPI, and Conda. This is a great boon for reproducibility, executable papers, classrooms, and the like.

Altair

The first keynote of the conference was yet another plotting library, Altair. I must admit that I was somewhat skeptical going in. The lament and motivation behind Altair was how users have too many plotting libraries to choose from and too much complexity, and solving this problem by introducing a new library invokes the obligatory xkcd. However, in the end, I think the move here is needed.

Altair is a Python interface to vega-lite; the API is a straight-forward plotting interface which spits out a vega-lite spec to be rendered by whatever vega-compatible graphics frontend the user might like. This is a massive improvement over the traditional way of using vega-lite, which is “simply write raw JSON(!)” It looks to have sane defaults and produce nice looking plots with the default frontend. More important, however, is the paradigm shift they are trying to initiate: that plotting should be driven by a declarative grammar, with the implementation details left up to the individual libraries. This shifts much of the programming burden off the users (and on to the library developers), and would be a major step toward improving the state of Python data visualization.
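A rough sketch of what that looks like (toy data, and the API details may well have shifted since the conference):

import pandas as pd
from altair import Chart

df = pd.DataFrame({'length': [500, 1000, 1500],
                   'n_nodes': [480, 950, 1400]})

# columns are mapped declaratively onto visual channels; the chart object
# is really just a wrapper around the generated vega-lite spec
chart = Chart(df).mark_point().encode(x='length', y='n_nodes')
spec = chart.to_dict()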

Imperative (hah!) to this shift is the library developers all agreeing to use the same grammar. Several of the major libraries (bokeh and plot.ly?) already use bespoke internal grammars and, according to the talk, are looking to adopt vega. Altair has taken the aggressive approach: the tactic seems to be to firmly plant the graphics grammar flag and force the existing tools to adopt before they have a chance to pollute the waters with competing standards. Somebody needed to do it, and I think it’s better that vega does.

There are certainly deficiencies though. vega-lite is relatively spartan at this point – as one questioner in the audience highlighted, it can’t even put error bars on plots. This sort of obvious feature vacuum will need to be rapidly addressed if the authors expect the spec to be adopted wholeheartedly by the scientific python community. Given the chops behind it, I fully expect these issues to be addressed.

Gender Stuff

I’ve focused on the cool stuff at the conference so far, but not everything was so rosy. Let’s talk about diversity – of the gender sort, but the complaint applies to race, ability, and so forth.

There’s no way to state this other than frankly: it was abysmal. I immediately noticed the sea of male faces, and a friend of mine had at least one conversation with a fellow conference attendee while he had a conversation with her boobs. The Code of Conduct was not clearly stated at the beginning of the conference, which makes a CoC almost entirely useless: it shows potential violators that the organizers don’t really prioritize the CoC and probably won’t enforce it, and it signals the same to the minority groups that the conference ostensibly wants to engage with. As an example, while Chris Calloway gave a great lightning talk about how PyData North Carolina is working through the aftermath of HB2, several older men directly behind me giggled amongst themselves at the mention of gender neutral bathrooms. They probably didn’t consider that there was a trans person sitting right in front of them, and they certainly didn’t consider the CoC, given that it was hardly mentioned. This sort of shit gives all the wrong signals for folks like myself. At PyCon the previous two years, I felt comfortable enough to create a #QueerTransPycon BoF, which was well attended; although the more focused nature of SciPy makes such an event less appropriate, I would not have felt comfortable trying regardless.

The stats are equally bad: 12 out of 124 speakers, 8 out of 52 poster presenters, and 4 out of 37 tutorial presenters were women, and the stats are much worse for people of color. The lack of consciousness of the problem was highlighted by some presenters noting the great diversity of the conference (maybe they were talking about topics?), and in one case, by the words of an otherwise well-meaning man whom I had a conversation with; when the 9% speaker rate for women was pointed out to him, he pondered the number and said that it “sounded pretty good.” It isn’t! He further pressed as to whether we would be satisfied once it hit 50%; somehow the “when is enough enough?” question always comes up. What’s clear is that “enough” is a lot more than 9%. This state of things isn’t new – several folks have written about it in regards to previous years.

There are some steps that can be taken here – organizers could look toward the PSF’s successful efforts to improve the gender situation at PyCon, where funding was sought for a paid chair (as opposed to SciPy’s unpaid position). The Code of Conduct should be clearly highlighted and emphasized at the beginning of the conference. For my part, I plan to submit a tutorial and a talk for next year.

I don’t want to only focus on the bad; the diversity luncheon was well attended, there was a diversity panel, and a group has been actively discussing the issues in a dedicated channel on the conference Slack team. These things signal that there is some will to address this. I also don’t want to give any indication that things are okay – they aren’t, and there’s a ton of work to be done.

Closing

I’m grateful to my adviser Titus for paying for the trip, and generally supporting my attending events like this and rabble rousing. I’m also grateful to the conference organizers for putting together an all-in-all good conference, and to all the funders present who make all this scientific Python software that much more viable and robust. For anyone reading this and thinking, “I’m doing thing X to combat the gender problem, why don’t you help out?” feel free to contact me on twitter.

some notes on py.test, travis-ci, and SciPy 2016

I’ve been in Austin since Tuesday for SciPy 2016, and after a couple weeks in Brazil and some time off the grid in the Sierras, I can now say that I’ve been officially bludgeoned back into my science and my Python. Aside from attending talks and meeting new people, I’ve been working on getting a little package of mine up to scratch with tests and continuous integration, with the eventual goal of submitting it to the Journal of Open Source Software. I had never used travis-ci before, nor had I used py.test in an actual project, and as expected, there were some hiccups – learn from mine to avoid your own :)

Note: this blog post is not beginner friendly. For a simple intro to continuous integration, check out our pycon tutorial, travis ci’s intro docs, or do further googling. Otherwise, to quote Worf: ramming speed!

travis

Having used drone.io in the past, I had a good idea of where to start here. travis is much more feature rich than drone though, and as such, requires a bit more configuration. My package, shmlast, is not large, but it has some external dependencies which need to be installed and relies on the numpy-scipy-pandas stack. drone’s limited configuration options and short maximum run time quickly make it intractable for projects with non-trivial dependencies, and this was where travis stepped in.

getting your scientific python packages

The first stumbling block here was deciding on a python distribution. Using virtualenv and PyPI is burdensome with numpy, scipy, and pandas – they almost always want to compile, which takes much too long. Being an impatient page-refreshing fiend, I simply could not abide the wait. The alternative is to use anaconda, which does us the favor of compiling them ahead of time (while also being a little smarter about managing dependencies). The default distribution is quite large though, so instead, I suggest using the stripped-down miniconda and installing the packages you need explicitly. Detailed instructions are available here, and I’ll run through my setup.

The miniconda setup goes under the install directive in your .travis.yml:

install:
  - sudo apt-get update
  - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
      wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh;
    else
      wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
    fi
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - hash -r
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  - conda info -a
  - conda create -q -n test python=$TRAVIS_PYTHON_VERSION numpy scipy pandas=0.17.0 matplotlib pytest pytest-cov coverage sphinx nose
  - source activate test
  - pip install -U codecov
  - python setup.py install

Woah! Let’s break it down. Firstly, there’s a check of travis’s python environment variable to grab the correct miniconda distribution. Then we install it, add it to PATH, and configure it to work without interaction. The conda info -a is just a convenience for debugging. Finally, we go ahead and create the environment. I do specify a version for Pandas; if I were more organized, I might write out a conda environment.yml and use that instead. After creating the environment and installing a non-conda dependency with pip, I install the package. This gets us ready for testing.

After a lot of fiddling around, I believe this is the fastest way to get your Python environment up and running with numpy, scipy, and pandas. You can probably safely use virtualenv and pip if you don’t need to compile massive libraries. The downside is that this essentially locks your users into the conda ecosystem, unless they’re willing to risk going it alone re: platform testing.

non-python stuff

Bioinformatics software (or more accurately, its users…) often has to grind through the Nine Circles (or perhaps orders of magnitude) of Dependency Hell to get installed, and if you want CI for your project, you’ll have to automate this devilish journey. Luckily, travis has extensive support for this. For example, I was easily able to install the LAST aligner from source by adding some commands under before_script:

before_script:
  - curl -LO http://last.cbrc.jp/last-658.zip
  - unzip last-658.zip
  - pushd last-658 && make && sudo make install && popd

The source is first downloaded and unpacked. We need to avoid mucking up our current location when compiling, so we use pushd to save our directory and move to the folder, then make and install before using popd to jump back out.

Software from Ubuntu repos is even simpler. We can add these commands to before_install:

before_install:
  - sudo apt-get -qq update
  - sudo apt-get install -y emboss parallel

This grabbed emboss (which includes transeq, for 6-frame DNA translation) and gnu-parallel. These commands could probably just as easily go in the install section, but the travis docs recommended they go here and I didn’t feel like arguing.

py.test

and the import file mismatch

I’ve used nose in my past projects, but I’m told the cool kids (and the less-cool kids who just don’t like deprecated software) are using py.test these days. Getting some basic tests up and running was easy enough, as the patterns are similar to nose, but getting everything integrated was more difficult. Pretty soon, after running a python setup.py test or even a simple py.test, I was running into a nice collection of these errors:

import file mismatch:
imported module 'shmlast.tests.test_script' has this __file__ attribute:
  /work/shmlast/shmlast/tests/test_script.py
which is not the same as the test file we want to collect:
  /work/shmlast/build/lib/shmlast/tests/test_script.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

All the google results for this pointed to threads with devs and other benevolent folks patiently explaining that you need to have unique basenames for your test modules (I mean it’s right there in the error duh), or that I needed to delete __pycache__. My basenames were unique and my caches clean, so something else was afoot. An astute reader might have noticed that one of the paths given is under the build/ directory, while the other is in the root of the repo. Sure enough, deleting the build/ directory fixes the problem. This seemed terribly inelegant though, and quite silly for such a common use-case.

Well, it turns out that this problem is indirectly addressed in the docs. Unfortunately, it’s 1) under the obligatory “good practices” section (and who goes there?), and 2) it doesn’t warn that this error can result (instead there’s a somewhat confusing warning telling you not to use an __init__.py in your tests subdirectory, but also that you need to use one if you want to inline your tests and distribute them with your package). The problem is that py.test happily slurps up the tests in the build directory as well as the repo, which triggers the expected unique basename error. The solution is to be a bit more explicit about where to find tests.

Instead of running a plain old py.test, you run py.test --pyargs <pkg>, which in clear and totally obvious language in the help is said to make py.test “try to interpret all arguments as python packages.” Clarity aside, it fixes it! To be extra double clear, you can also add a pytest.ini to your root directory with a line telling where the tests are:

[pytest]
testpaths = path/to/tests

organizing test data

Other than documentation gripes, py.test is a solid library. Particularly nifty are fixtures, which make it easy to abstract away more boilerplate. For example, in the past I’ve used the structure of our lab’s khmer project for grabbing test data and copying it into temp directories, but it involves a fair amount of code and bookkeeping. With a fixture, I can easily access test data in any test, while cleaning up the garbage:

import os
from distutils import dir_util

from pytest import fixture


@fixture
def datadir(tmpdir, request):
    '''
    Fixture responsible for locating the test data directory and copying it
    into a temporary directory.
    '''
    filename = request.module.__file__
    test_dir = os.path.dirname(filename)
    data_dir = os.path.join(test_dir, 'data') 
    dir_util.copy_tree(data_dir, str(tmpdir))

    def getter(filename, as_str=True):
        filepath = tmpdir.join(filename)
        if as_str:
            return str(filepath)
        return filepath

    return getter

Deep in my heart of hearts I must be a functional programmer, because I’m really pleased with this. Here, we get the path to the tests directory, and then the data directory which it contains. The test data is then all copied to a temp directory, and by the awesome raw power of closures, we return a function which will join the temp dir with a requested filename. A better version would handle a nonexistent file, but I said raw power, not refined and domesticated power. Best of all, this fixture uses another fixture, the builtin tmpdir, which makes sure the files get blown away when you’re done with them.

Use it as a fixture in a test in the canonical way:

def test_test(datadir):
    test_filename = datadir('test.csv')
    assert True # professional coder on a closed course

Next up: thoughts on SciPy 2016!

pydoit half-day workshop debrief

About

Earlier today I taught a half-day workshop introducing students to doit for automating their workflows and building applications. This was an intermediate-level python workshop, in that it expected students to have operational python knowledge. The materials are freely available, and the workshop was live-streamed on YouTube, where it is still available.

This workshop was part of a series being put on by our lab over the next few quarters. A longer list of the workshops is at the dib training site, and Titus has written on them before.

Outcome

Overall, I was happy with the results. Between the on-site participants and live-stream viewers, our attendance was okay (about ten people total), and all students communicated that they enjoyed the workshop and found it informative. Most of my materials (which I mostly wrote from scratch) seemed to parse well, with the exception of a few minor bugs which I caught during the lesson and was able to fix. As per usual, our training coordinator Jessica did a great job handling the logistics, and we were able to make use of the brand new Data Science Initiative space in the Shields Library on the UC Davis campus.

Thoughts for the Future

We did have a number of no-shows, which was disappointing. My intuition is that this was caused by a mixture of it being the beginning of the quarter here, with many students, postdocs, staff, and faculty just returning, and the more advanced nature of the material, which tends to scare folks away. It might be another piece of data to support the idea of charging five bucks or so for tickets to require a small amount of activation energy and thus filter out likely no-shows, but we’ve had good luck so far with attendance, and it’d be best to make such a decision after we run a few more similar workshops (perhaps it would only need to be done for the intermediate or advanced ones, for example).

We also had several students with installation issues, a recurring problem for these sorts of events. I’m leaning toward trying out browser-based approaches in the future, which would allow me to set up configurations ahead of time (likely via docker files) and short-circuit the usual cross-platform, python distribution, and software installation issues.

I really enjoyed the experience, as this was the first workshop I’ve run where I created all the materials myself. I’m looking forward to doing more in the future.

–camille

ps. this has been my first post in a long time, and I’m hoping to keep them flowing.