SciPy 2016 Retrospective

Last week I attended my first SciPy conference in Austin. I’ve been to the past three PyCons in Montreal and Portland, and aside from my excitement to learn more about the great scientific Python community, I was curious to see how it compared to the general conference I’ve come to know and love.

SciPy, by my account, is a curious microcosm of the academic open source community as a whole. It is filled with great people doing amazing work, releasing incredible tools, and pushing the frontiers of features and accessibility in scientific software. It is also marked by some of the same problems as the larger community: a stark lack of gender (and other) diversity and a surprising (or not) lack of consciousness of the problem. I’ll start by going over some of the cool projects I learned about and then move on to some thoughts on the gender issue.

Cool Stuff


Several new projects were announced, and several existing projects were given some needed visibility. The first I’ll talk about is nbflow. This is Jessica Hamrick’s system for “one-button reproducible workflows with the Jupyter notebook and scons.” In short, you can link up notebooks in a build system via two special variables in the first cells of a collection of notebooks – __depends__ and __dest__ – which contain lists of source and target filenames and are parsed out of the JSON to automagically generate build tasks. Jessica’s implementation is clean and can be pretty easily grokked with only a few minutes of reading the code, and it’s intuitive and relatively well tested. She delivered a great presentation with excellent slides and nice demos (which all worked ;)).
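To make the idea concrete, here’s a rough sketch (filenames entirely hypothetical) of what the first cell of a notebook in an nbflow pipeline looks like – just two module-level variables that nbflow pulls out of the notebook JSON to wire up the build graph:

# hypothetical first cell of analysis.ipynb
__depends__ = ['data/raw_counts.csv']
__dest__ = ['results/normalized_counts.csv']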

The only downside is that it uses scons, which isn’t Python 3 compatible and isn’t what I use, which must mean it’s bad or something. However, this turned out to be a non-issue due to the earlier point about the clean codebase: I was able to quickly build a pydoit module with her extractor, and she’s been responsive to the PR (thanks!). It would be pretty easy to build modules for any number of build systems – it only requires about 50 lines of code. I’m definitely looking forward to using nbflow in future projects.
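To give a flavor of what such a module involves (this is an illustrative sketch, not the actual code from my PR), the core job is just mapping each notebook’s extracted lists onto a doit task dictionary:

# illustrative sketch: turn a notebook plus its extracted __depends__ and
# __dest__ lists into a pydoit task dictionary
def notebook_task(notebook, depends, dest):
    return {'name': notebook,
            'file_dep': depends + [notebook],
            'targets': dest,
            'actions': ['jupyter nbconvert --to notebook --execute '
                        '--inplace {0}'.format(notebook)]}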


The Jupyter folks made a big splash with JupyterLab, which is currently in alpha. They’ve built an awesome extension API that makes adding new functionality dead simple, and it appears they’ve removed many of the warts from the current Jupyter client. State is seamlessly and quickly shipped around the system, making all the components fully live and interactive. They’re calling it an IDE – an Interactive Development Environment – and it will likely improve greatly upon the current Python data exploration workflow. It’s reminiscent of RStudio in a lot of ways, which I think is a Good Thing; intuitive and simple interfaces are important to getting new users up and running with the language, and particularly helpful in the classroom. They’re shooting to have a 1.0 release out by next year’s SciPy, emphasizing that they’ll require a 1.0 to be squeaky clean. I’ll be anxiously awaiting its arrival!


Binder might be oldish news to many people at this point, but it was great to see it represented. For those not in the know, it allows you to spin up Jupyter notebooks on-demand from a github repo, specifying dependencies with Docker, PyPI, and Conda. This is a great boon for reproducibility, executable papers, classrooms, and the like.


The first keynote of the conference introduced yet another plotting library, Altair. I must admit that I was somewhat skeptical going in. The lament motivating Altair is that users have too many plotting libraries to choose from and too much complexity, and solving this problem by introducing a new library invokes the obligatory xkcd. However, in the end, I think the move here is needed.

Altair is a python interface to vega-lite; the API is a straight-forward plotting interface which spits out a vega-lite spec to be rendered by whatever vega-compatible graphics frontend the user might like. This is a massive improvement over the traditional way of using vega-lite, which is “simply write raw JSON(!)” It looks to have sane defaults and produces nice looking plots with the default frontend. More important, however, is the paradigm shift they are trying to initiate: that plotting should be driven by a declarative grammar, with the implementation details left up to the individual libraries. This shifts much of the programming burden off the users (and on to the library developers), and would be a major step toward improving the state of Python data visualization.
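For a sense of what the declarative style looks like, here’s a minimal sketch (toy data, and the exact import style has shifted a bit between Altair releases):

import pandas as pd
import altair as alt

# toy data purely for illustration
df = pd.DataFrame({'year': [2014, 2015, 2016], 'count': [10, 20, 15]})

# declare the encoding; Altair emits a vega-lite spec, and rendering is
# left to whatever vega-compatible frontend you have on hand
chart = alt.Chart(df).mark_bar().encode(x='year:O', y='count:Q')
print(chart.to_dict())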

Imperative (hah!) to this shift is the library developers all agreeing to use the same grammar. Several of the major libraries (bokeh among them) already use bespoke internal grammars and, according to the talk, are looking to adopt vega. Altair has taken the aggressive approach: the tactic seems to be to firmly plant the graphics grammar flag and force the existing tools to adopt before they have a chance to pollute the waters with competing standards. Somebody needed to do it, and I think it’s better that vega does.

There are certainly deficiencies though. vega-lite is relatively spartan at this point – as one questioner in the audience highlighted, it can’t even put error bars on plots. This sort of obvious feature vacuum will need to be rapidly addressed if the authors expect the spec to be adopted wholeheartedly by the scientific python community. Given the chops behind it, I fully expect these issues to be addressed.

Gender Stuff

I’ve focused on the cool stuff at the conference so far, but not everything was so rosy. Let’s talk about diversity – of the gender sort, but the complaint applies to race, ability, and so forth.

There’s no way to state this other than frankly: it was abysmal. I immediately noticed the sea of male faces, and a friend of mine had at least one conversation with a fellow conference attendee while he had a conversation with her boobs. The Code of Conduct was not clearly stated at the beginning of the conference, which makes a CoC almost entirely useless: it shows potential violators that the organizers don’t really prioritize the CoC and probably won’t enforce it, and it signals the same to the minority groups that the conference ostensibly wants to engage with. As an example, while Chris Calloway gave a great lightning talk about how PyData North Carolina is working through the aftermath of HB2, several older men directly behind me giggled amongst themselves at the mention of gender neutral bathrooms. They probably didn’t consider that there was a trans person sitting right in front of them, and they certainly didn’t consider the CoC, given that it was hardly mentioned. This sort of shit gives all the wrong signals for folks like myself. At PyCon the previous two years, I felt comfortable enough to create a #QueerTransPycon BoF, which was well attended; although the more focused nature of SciPy makes such an event less appropriate, I would not have felt comfortable trying regardless.

The stats are equally bad: 12 out of 124 speakers, 8 out of 52 poster presenters, and 4 out of 37 tutorial presenters were women, and the stats are much worse for people of color. The lack of consciousness of the problem was highlighted by some presenters noting the great diversity of the conference (maybe they were talking about topics?), and in one case, by the words of an otherwise well-meaning man whom I had a conversation with; when the 9% speaker rate for women was pointed out to him, he pondered the number and said that it “sounded pretty good.” It isn’t! He further pressed as to whether we would be satisfied once it hit 50%; somehow the “when is enough enough?” question always comes up. What’s clear is that “enough” is a lot more than 9%. This state of things isn’t new – several folks have written about it in regards to previous years.

There are some steps that can be taken here – organizers could look toward the PSF’s successful efforts to improve the gender situation at PyCon, where funding was sought for a paid chair (as opposed to SciPy’s unpaid position). The Code of Conduct should be clearly highlighted and emphasized at the beginning of the conference. For my part, I plan to submit a tutorial and a talk for next year.

I don’t want to only focus on the bad; the diversity luncheon was well attended, there was a diversity panel, and a group has been actively discussing the issues in a dedicated channel on the conference Slack team. These things signal that there is some will to address this. I also don’t want to give any indication that things are okay – they aren’t, and there’s a ton of work to be done.


I’m grateful to my adviser Titus for paying for the trip, and generally supporting my attending events like this and rabble rousing. I’m also grateful to the conference organizers for putting together an all-in-all good conference, and to all the funders present who make all this scientific Python software that much more viable and robust. For anyone reading this and thinking, “I’m doing thing X to combat the gender problem, why don’t you help out?” feel free to contact me on twitter.

some notes on py.test, travis-ci, and SciPy 2016

I’ve been in Austin since Tuesday for SciPy 2016, and after a couple weeks in Brazil and some time off the grid in the Sierras, I can now say that I’ve been officially bludgeoned back into my science and my Python. Aside from attending talks and meeting new people, I’ve been working on getting a little package of mine up to scratch with tests and continuous integration, with the eventual goal of submitting it to the Journal of Open Source Software. I had never used travis-ci before, nor had I used py.test in an actual project, and as expected, there were some hiccups – learn from mine to avoid your own :)

Note: this blog post is not beginner friendly. For a simple intro to continuous integration, check out our pycon tutorial, travis ci’s intro docs, or do further googling. Otherwise, to quote Worf: ramming speed!


Having used drone in the past, I had a good idea of where to start here. travis is much more feature rich than drone though, and as such, requires a bit more configuration. My package, shmlast, is not large, but it has some external dependencies which need to be installed and relies on the numpy-scipy-pandas stack. drone’s limited configuration options and short maximum run time quickly make it intractable for projects with non-trivial dependencies, and this was where travis stepped in.

getting your scientific python packages

The first stumbling block here was deciding on a python distribution. Using virtualenv and PyPI is burdensome with numpy, scipy, and pandas – they almost always want to compile, which takes much too long. Being an impatient page-refreshing fiend, I simply could not abide the wait. The alternative is to use anaconda, which does us the favor of compiling them ahead of time (while also being a little smarter about managing dependencies). The default distribution is quite large though, so instead, I suggest using the stripped-down miniconda and installing the packages you need explicitly. Detailed instructions are available here, and I’ll run through my setup.

The miniconda setup goes under the install directive in your .travis.yml:

  - sudo apt-get update
  # grab the miniconda installer matching the python version under test
  # (installer URLs are the standard miniconda download locations)
  - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
      wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh;
    else
      wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
    fi
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - hash -r
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  - conda info -a
  - conda create -q -n test python=$TRAVIS_PYTHON_VERSION numpy scipy pandas=0.17.0 matplotlib pytest pytest-cov coverage sphinx nose
  - source activate test
  - pip install -U codecov
  - python setup.py install

Woah! Let’s break it down. Firstly, there’s a check of travis’s python environment variable to grab the correct miniconda distribution. Then we install it, add it to PATH, and configure it to work without interaction. The conda info -a is just a convenience for debugging. Finally, we go ahead and create the environment. I do specify a version for Pandas; if I were more organized, I might write out a conda environment.yml and use that instead. After creating the environment and installing a non-conda dependency with pip, I install the package. This gets us ready for testing.

After a lot of fiddling around, I believe this is the fastest way to get your Python environment up and running with numpy, scipy, and pandas. You can probably safely use virtualenv and pip if you don’t need to compile massive libraries. The downside is that this essentially locks your users into the conda ecosystem, unless they’re willing to risk going it alone re: platform testing.

non-python stuff

Bioinformatics software (or more accurately, its users…) often has to grind through the Nine Circles (or perhaps orders of magnitude) of Dependency Hell to get installed, and if you want CI for your project, you’ll have to automate this devilish journey. Luckily, travis has extensive support for this. For example, I was easily able to install the LAST aligner from source by adding some commands under before_script:

  # note: archive URL reconstructed from memory; LAST was distributed from last.cbrc.jp at the time
  - curl -LO http://last.cbrc.jp/last-658.zip
  - unzip last-658.zip
  - pushd last-658 && make && sudo make install && popd

The source is first downloaded and unpacked. We need to avoid mucking up our current location when compiling, so we use pushd to save our directory and move to the folder, then make and install before using popd to jump back out.

Software from Ubuntu repos is even simpler. We can add these commands to before_install:

  - sudo apt-get -qq update
  - sudo apt-get install -y emboss parallel

This grabbed emboss (which includes transeq, for 6-frame DNA translation) and gnu-parallel. These commands could probably just as easily go in the install section, but the travis docs recommended they go here and I didn’t feel like arguing.


py.test and the import file mismatch

I’ve used nose in my past projects, but I’m told the cool kids (and the less-cool kids who just don’t like deprecated software) are using py.test these days. Getting some basic tests up and running was easy enough, as the patterns are similar to nose, but getting everything integrated was more difficult. Pretty soon, after running a python setup.py test or even a simple py.test, I was running into a nice collection of these errors:

import file mismatch:
imported module 'shmlast.tests.test_script' has this __file__ attribute:
which is not the same as the test file we want to collect:
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

All the google results for this led to threads with devs and other benevolent folks patiently explaining that you need to have unique basenames for your test modules (I mean it’s right there in the error duh), or that I needed to delete __pycache__. My basenames were unique and my caches clean, so something else was afoot. An astute reader might have noticed that one of the offending paths is under the build/ directory, while the other is in the root of the repo. Sure enough, deleting the build/ directory fixes the problem. This seemed terribly inelegant though, and quite silly for such a common use-case.

Well, it turns out that this problem is indirectly addressed in the docs. Unfortunately, it’s 1) under the obligatory “good practices” section, and who goes there? and 2) doesn’t warn that this error can result (instead there’s a somewhat confusing warning telling you not to use an __init__.py in your tests subdirectory, but also that you need to use one if you want to inline your tests and distribute them with your package). The problem is that py.test happily slurps up the tests in the build directory as well as the repo, which triggers the expected unique basename error. The solution is to be a bit more explicit about where to find tests.

Instead of running a plain old py.test, you run py.test --pyargs <pkg>, which, in the clear and totally obvious language of the help text, makes py.test “try to interpret all arguments as python packages.” Clarity aside, it fixes it! To be extra double clear, you can also add a pytest.ini to your root directory with a line telling py.test where the tests are:

[pytest]
testpaths = path/to/tests

organizing test data

Other than documentation gripes, py.test is a solid library. Particularly nifty are fixtures, which make it easy to abstract away more boilerplate. For example, in the past I’ve used the structure of our lab’s khmer project for grabbing test data and copying it into temp directories, but it involves a fair amount of code and bookkeeping. With a fixture, I can easily access test data in any test, while cleaning up the garbage:

import os
from distutils import dir_util

import pytest


@pytest.fixture
def datadir(tmpdir, request):
    '''Fixture responsible for locating the test data directory and copying it
    into a temporary directory.
    '''
    filename = request.module.__file__
    test_dir = os.path.dirname(filename)
    data_dir = os.path.join(test_dir, 'data')
    dir_util.copy_tree(data_dir, str(tmpdir))

    def getter(filename, as_str=True):
        filepath = tmpdir.join(filename)
        if as_str:
            return str(filepath)
        return filepath

    return getter

Deep in my heart of hearts I must be a functional programmer, because I’m really pleased with this. Here, we get the path to the tests directory, and then the data directory which it contains. The test data is then all copied to a temp directory, and by the awesome raw power of closures, we return a function which will join the temp dir with a requested filename. A better version would handle a nonexistent file (see the sketch below), but I said raw power, not refined and domesticated power. Best of all, this fixture uses another fixture, the builtin tmpdir, which makes sure the files get blown away when you’re done with them.
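For the record, a more defensive version of the inner getter might look something like this (a sketch only – tmpdir comes from the enclosing fixture, just as above):

    def getter(filename, as_str=True):
        filepath = tmpdir.join(filename)
        # fail loudly if the requested file never made it into the temp dir
        if not filepath.check(file=True):
            raise IOError('no such test data file: {0}'.format(filename))
        return str(filepath) if as_str else filepath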

Use it as a fixture in a test in the canonical way:

def test_test(datadir):
    test_filename = datadir('test.csv')
    assert True # professional coder on a closed course

Next up: thoughts on SciPy 2016!

pydoit half-day workshop debrief


Earlier today I taught a half-day workshop introducing students to doit for automating their workflows and building applications. This was an intermediate-level python workshop, in that it expected students to have operational python knowledge. The materials are freely available, and the workshop was live-streamed on YouTube, where it is still available.

This workshop was part of a series being put on by our lab over the next few quarters. A longer list of the workshops is at the dib training site, and Titus has written on them before.


Overall, I was happy with the results. Between the on-site participants and live-stream viewers, our attendance was okay (about ten people total), and all students communicated that they enjoyed the workshop and found it informative. Most of my materials (which I mostly wrote from scratch) seemed to parse well, with the exception of a few minor bugs which I caught during the lesson and was able to fix. As per usual, our training coordinator Jessica did a great job handling the logistics, and we were able to make use of the brand new Data Science Initiative space in the Shields Library on the UC Davis campus.

Thoughts for the Future

We did have a number of no-shows, which was disappointing. My intuition is that this was caused by a mixture of it being the beginning of the quarter here, with many students, postdocs, staff, and faculty just returning, and the more advanced nature of the material, which tends to scare folks away. It might be another piece of data to support the idea of charging five bucks or so for tickets to require a small amount of activation energy and thus filter out likely no-shows, but we’ve had good luck so far with attendance, and it’d be best to make such a decision after we run a few more similar workshops (perhaps it would only need to be done for the intermediate or advanced ones, for example).

We also had several students with installation issues, a recurring problem for these sorts of events. I’m leaning toward trying out browser-based approaches in the future, which would allow me to set up configurations ahead of time (likely via docker files) and short-circuit the usual cross-platform, python distribution, and software installation issues.

I really enjoyed the experience, as this was the first workshop I’ve run where I created all the materials myself. I’m looking forward to doing more in the future.


ps. this has been my first post in a long time, and I’m hoping to keep them flowing.

Context Managers and IPython Notebook

Recently, I’ve been making a lot of progress on the lamprey transcriptome project, and that has involved a lot of IPython notebook. While I’ll talk about lamprey in a later post, I first want to talk about a nice technical tidbit I came up with while trying to manage a large IPython notebook with lots of figures. This involved learning some more about the internals of matplotlib, as well as the usefulness of the with statement in python.

So first, some background! matplotlib is the go-to plotting package for python. It has many weaknesses, and a whole series of posts could be (and has been) written about why we should use something else, but for now, its reach is long and it is widely used in the scientific community. It’s particularly useful in concert with IPython notebook, where figures can be embedded into cells inline. However, an important feature(?) of matplotlib is that it’s built around a state machine; when it comes to deciding what figure (and other components) are currently being worked with, matplotlib keeps track of the current context globally. That allows you to just call plot() at any given time and have your figures be pushed more or less where you’d like. It also means that you need to keep track of the current context, lest you end up drawing a lot of figures onto the same plot and producing a terrible abomination from beyond space and time itself.
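As a tiny illustration of that global state (a sketch, nothing more):

import matplotlib.pyplot as plt

# both calls draw onto the *current* figure and axes
plt.plot([1, 2, 3], [1, 4, 9])
plt.plot([1, 2, 3], [2, 4, 6])

plt.figure()                    # push a new figure into the global context
plt.plot([1, 2, 3], [3, 2, 1])  # now plot() targets the new figure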

IPython has a number of ways of dealing with this. While in its inline mode, the default behavior is to simply create a new plotting context at the beginning of each cell, and close it at the cell’s completion. This is convenient because it means the user doesn’t have to open and close figures manually, saving a lot of coding time and boilerplate. It becomes a burden, however, when you have a large notebook, with lots of figures, some of which you don’t want to be automatically displayed. While we can turn off the automatic opening and closing of figures with

%config InlineBackend.close_figures = False

we’re now stuck with having to manage our own figure context. Suddenly, our notebooks aren’t nearly as clean and beautiful as they once were, being littered with ugly declarations of new figures and axes, calls to gcf(), and other such not-pretty things. I like pretty things, so I sought out a solution. As it tends to do, python delivered.

Enter context managers!

Some time ago, many a programmer was running into a similar problem with opening and closing files (well, and a lot of other use cases). To do things properly, we needed exception handling to cleanly call close() on our file pointers when something went wrong. To handle such instances, python introduced context managers and the with statement. From the docs:

A context manager is an object that defines the runtime context to be established when executing a with statement. The context manager handles the entry into, and the exit from, the desired runtime context for the execution of the block of code.

Though this completely washes out the ~awesomeness~ of context managers, it does sound about like what we want! In simple terms, context managers are just objects that implement the __enter__ and __exit__ methods. When you use the with statement on one of them, __enter__ is called, where we put our setup code; if it returns something, that value is bound to the name given by as. __exit__ is called after the with block is left, and contains the teardown code. For our purposes, we want to take care of matplotlib context. Without further ado, let’s look at an example that does what we want:

import matplotlib.pyplot as plt

class FigManager(object):

    def __init__(self, fn='', exts=['svg', 'pdf', 'png', 'eps'],
                 show=False, nrows=1, ncols=1,
                 figsize=(18,12), tight_layout=True):

        if plt.gcf():
            print 'leaving context of', repr(plt.gcf())

        self.fig, self.ax = plt.subplots(nrows=nrows, ncols=ncols,
                                         figsize=figsize,
                                         tight_layout=tight_layout)
        self.fn = fn
        self.exts = exts
        self.show = show

        assert self.fig == plt.gcf()

    def __enter__(self):
        return self.fig, self.ax

    def __exit__(self, exc_t, exc_v, trace_b):

        if exc_t is not None:
            print 'ERROR', exc_t, exc_v, trace_b

        if self.fn:
            print 'saving figure', repr(self.fig)
            for ext in self.exts:
                self.fig.savefig('{}.{}'.format(self.fn, ext))

        assert self.fig == plt.gcf()
        if self.show:
            print 'showing figure', repr(self.fig)
            plt.show()

        print 'closing figure', repr(self.fig)
        plt.close(self.fig)
        del self.ax
        del self.fig
        print 'returning context to', repr(plt.gcf())
Let’s break this down. The __init__ actually does most of our setup here; it takes some basic parameters to pass to plt.subplots, as well as some parameters for whether we want to show the plot and whether we want to save the result to file(s). The __enter__ method returns the generated figure and axes objects. Finally, __exit__ saves the figure to the file name with the given extensions (matplotlib uses the extension to infer the file format), and shows the plot if necessary. It then calls plt.close() on the figure, deletes the axes objects from the figure, and calls del on both instances just to be sure. The three expected parameters to __exit__ are for exception handling, which is discussed in greater detail in the docs.

Here’s an example of how I used it in practice:

with FigManager('genes_per_sample', figsize=tall_size) as (fig, ax):
    genes_support_df.sum().plot(kind='barh', fontsize=14, color=labels_df.color, figure=fig, ax=ax)
    ax.set_title('Represented Genes per Sample')

That’s taken directly out of the lamprey notebook where I first implemented this. I usually put a filelink in there, so that the resulting image can easily be viewed in its own tab for closer inspection.

The point is, all the normal boilerplate for handling figures is done in one line and the code is much more clear and pretty! And of course, most importantly, the original goal of not automatically displaying figures is also taken care of.

I consider this yak shaved.


Profiling C++ Extensions with Yep


For those of you who work with both the python codebase and the c++ backend, I found a pretty useful tool. Seeing as we work with performance-sensitive software, profiling is very useful; but, it can be a pain to profile our c++ code when called through python, which necessitates writing c++ wrappers to functions for basic profiling. The solution I found is called yep, which is a python module made specifically to profile c++ python extensions.


In order to install, simply run:

sudo apt-get install google-perftools
sudo apt-get install python-dbg
pip install yep

For khmer, you should also be sure to turn on debugging at compile time:

cd /path/to/khmer
make debug

Of the packages installed above, yep is the python module implementing the profiler, and google-perftools provides google-pprof, the tool for analyzing the resulting profile information.


There are a couple ways to use it. You can call it directly from the command line with:

python -m yep [options] -- /path/to/script [args ... ]

The -- is necessary: it marks the end of the option list, so the remaining arguments are passed on to the script being profiled rather than being parsed (and choked on) by the profiler itself. Thanks for this trick, @mr-c. Also make sure to use the absolute path to the script to be profiled.

You can also use the module directly in your code, with:

import yep

yep.start('output.prof')   # profile filename here is just an example
# <code to profile...>
yep.stop()

The resulting file is then visualized using google-pprof, with:

google-pprof --gif <python executable> <profile> > prof.gif

In order to get python debugging symbols, you need to use the debugging executable. So, while you may run the script in your virtualenv if using one, you give google-pprof the debug executable so it can properly construct callgraphs:

python -m yep -- /w/khmer/scripts/ \
    -k 25 -x 1e9 -o test_sweep -i /w/tests/test-sweep-contigs.fp

google-pprof --gif /usr/bin/python2.7-dbg > prof.gif


Here is some example output:


In this call graph, the python debugging symbols were not properly included; this is resolved by using the debugging executable.

The call graph is in standard form, where the first percentage is the time in that particular function alone, and where the second percentage is the time in all functions called by that function. See the description for more details.

And that’s it. Happy profiling!