Last week I attended my first SciPy conference in Austin. I’ve been to the past three PyCons in Montreal and Portland,
and aside from my excitement to learn more about the great scientific Python community, I was curious to see how it
compared to the general conference I’ve come to know and love.
SciPy, by my account, is a curious microcosm of the academic open source community as a whole. It is filled with great
people doing amazing work, releasing incredible tools, and pushing the frontiers of features and accessibility in scientific
software. It is also marked by some of the same problems as the larger community: a stark lack of gender (and other) diversity and a
surprising (or not) lack of consciousness of the problem. I’ll start by going over some of the cool projects I learned about
and then move on to some thoughts on the gender issue.
Several new projects were announced, and several existing projects were given some needed visibility. The first I’ll talk about is
nbflow. This is Jessica Hamrick’s system for “one-button reproducible workflows with the
Jupyter notebook and scons.” In short, you can link up notebooks in a build system via two special variables in the first cells of a
collection of notebooks – __depends__ and __dest__ – which contain lists of source and target filenames and
are parsed out of the JSON to automagically generate build tasks. Jessica’s implementation is clean and can be pretty easily grokked with only
a few minutes of reading the code, and it’s intuitive and relatively well tested. She delivered a great presentation with excellent slides and
nice demos (which all worked ;)).
The only downside is that it uses scons, which isn’t Python 3 compatible and isn’t what I use, which must mean
it’s bad or something. However, this turned out to be a non-issue due to the earlier point about the clean codebase: I was able to quickly
build a pydoit module with her extractor, and she’s been responsive to the PR (thanks!).
It would be pretty easy to build modules for any number of build systems – it only requires about 50 lines of code. I’m definitely looking forward
to using nbflow in future projects.
The Jupyter folks made a big splash with JupyterLab, which is currently in alpha.
They’ve built an awesome extension API
that makes adding new functionality dead simple, and it appears they’ve removed many of the warts from the current Jupyter client. State is seamlessly
and quickly shipped around the system, making all the components fully live and interactive. They’re calling it an IDE – an Interactive Development
Environment – and it will likely improve greatly upon the current Python data exploration workflow. It’s reminiscent of Rstudio in a lot of ways,
which I think is a Good Thing; intuitive and simple interfaces are important to getting new users up and running with the language, and particularly
helpful in the classroom. They’re shooting to have a 1.0 release out by next year’s SciPy, emphasizing that they’ll require a 1.0 to be squeaky clean.
I’ll be anxiously awaiting its arrival!
Binder might be oldish news to many people at this point, but it was great to see it represented. For those not in the know,
it allows you to spin up Jupyter notebooks on-demand from a github repo, specifying dependencies with Docker, PyPI, and Conda. This is a great boon
for reproducibility, executable papers, classrooms, and the like.
The first keynote of the conference was yet another plotting library, Altair. I must admit that I was somewhat skeptical going in. The lament
and motivation behind Altair was that users have too many plotting libraries to choose from and too much complexity, and solving this problem
by introducing a new library invokes the obligatory xkcd. However, in the end, I think the move here is needed.
Altair is a python interface to vega-lite; the API is a straightforward plotting interface which spits out a vega-lite
spec to be rendered by whatever vega-compatible graphics frontend the user might like. This is a massive improvement over
the traditional way of using vega-lite, which is “simply write raw JSON(!)”. It looks to have sane defaults and produce nice-looking
plots with the default frontend. More important, however, is the paradigm shift they are trying to initiate: that plotting should be
driven by a declarative grammar, with the implementation details left up to the individual libraries. This shifts much of the
programming burden off the users (and on to the library developers), and would be a major step toward improving the state of Python plotting.
Imperative (hah!) to this shift is the library developers all agreeing to use the same grammar. Several of the major libraries (bokeh and plot.ly?)
already use bespoke internal grammars and, according to the talk, are looking to adopt vega. Altair has taken the aggressive approach: the tactic seems to be
to firmly plant the graphics grammar flag and force the existing tools to adopt before they have a chance to pollute the waters with competing standards.
Somebody needed to do it, and I think it’s better that vega does.
There are certainly deficiencies though. vega-lite is relatively spartan at this point – as one questioner in the audience highlighted, it can’t
even put error bars on plots. This sort of obvious feature vacuum will need to be rapidly addressed if the authors expect the spec to be adopted wholeheartedly
by the scientific python community. Given the chops behind it, I fully expect these issues to be addressed.
I’ve focused on the cool stuff at the conference so far, but not everything was so rosy. Let’s talk about diversity – of the gender sort, but the complaint
applies to race, ability, and so forth.
There’s no way to state this other than frankly: it was abysmal. I immediately noticed the sea of male faces, and
a friend of mine had at least one conversation with a fellow conference attendee while he had a conversation with her boobs. The Code of Conduct was
not clearly stated at the beginning of the conference, which makes a CoC almost entirely useless: it shows potential violators that the organizers don’t
really prioritize the CoC and probably won’t enforce it, and it signals the same to the minority groups that the conference ostensibly wants to engage
with. As an example, while Chris Calloway gave a great lightning talk about how PyData North Carolina is working through the aftermath of HB2, several older men
directly behind me giggled amongst themselves at the mention of gender neutral bathrooms. They probably didn’t consider that there was a trans person sitting
right in front of them, and they certainly didn’t consider the CoC, given that it was hardly mentioned. This sort of shit gives all the wrong signals
for folks like myself. At PyCon the previous two years, I felt comfortable enough to create a #QueerTransPycon
BoF, which was well attended; although the more focused nature of SciPy makes such an event less appropriate, I would not have felt comfortable trying here regardless.
The stats are equally bad: 12 out of 124 speakers, 8 out of 52 poster presenters,
and 4 out of 37 tutorial presenters were women, and the stats are much worse for people of color. The lack of consciousness of the problem was highlighted
by some presenters noting the great diversity of the conference (maybe they were talking about topics?), and in one case by an otherwise well-meaning man I had a conversation
with: when the 9% speaker rate for women was pointed out to him, he pondered the number and said that it “sounded pretty good.” It isn’t! He further pressed
as to whether we would be satisfied once it hit 50%; somehow the “when is enough enough?” question always comes up. What’s clear is
that “enough” is a lot more than 9%. This state of things isn’t new – several folks have written about it in regards to previous years.
There are some steps that can be taken here – organizers could look toward the PSF’s successful efforts to improve the gender situation at PyCon, where funding was sought
for a paid chair (as opposed to SciPy’s unpaid position). The Code of Conduct should be clearly highlighted and emphasized at the beginning of the conference.
For my part, I plan to submit a tutorial and a talk for next year.
I don’t want to only focus on the bad; the diversity luncheon was well attended, there was a diversity panel, and a group has been actively discussing the issues in a dedicated channel on the
conference Slack team. These things signal that there is some will to address this. I also don’t want to give any indication that things are okay – they aren’t,
and there’s a ton of work to be done.
I’m grateful to my adviser Titus for paying for the trip, and generally supporting my attending events like this and rabble rousing. I’m
also grateful to the conference organizers for putting together an all-in-all good conference, and to all the funders present who make all this scientific Python software
that much more viable and robust.
For anyone reading this and thinking, “I’m doing thing X to combat the gender problem, why don’t you help out?” feel free to contact me on twitter.
I’ve been in Austin since Tuesday for SciPy 2016, and
after a couple weeks in Brazil and some time off the grid in the Sierras, I can now say that I’ve been officially bludgeoned back into
my science and my Python. Aside from attending talks and meeting new people, I’ve been working on getting a little package of mine
up to scratch with tests and continuous integration, with the eventual goal of submitting it to the
Journal of Open Source Software. I had never used travis-ci before, nor had I used
py.test in an actual project, and as expected, there were some hiccups –
learn from mine to avoid your own :)
Note: this blog post is not beginner friendly. For a simple intro to continuous integration, check out our pycon tutorial,
travis ci’s intro docs, or do further googling. Otherwise, to quote Worf: ramming speed!
Having used drone.io in the past, I had a good idea of where to start here. travis is much more feature rich than drone though, and as such,
requires a bit more configuration. My package, shmlast, is not large, but it has some external dependencies
which need to be installed and relies on the numpy-scipy-pandas stack. drone’s limited configuration options and short maximum run time quickly make it intractable
for projects with non-trivial dependencies, and this was where travis stepped in.
getting your scientific python packages
The first stumbling block here was deciding on a python distribution. Using virtualenv and PyPI is burdensome with numpy, scipy, and pandas – they almost always
want to compile, which takes much too long. Being an impatient page-refreshing fiend, I simply could not abide the wait.
The alternative is to use anaconda,
which does us the favor of compiling them ahead of time (while also being a little smarter about managing dependencies). The default distribution is quite large though,
so instead, I suggest using the stripped-down miniconda and installing the packages you need explicitly. Detailed instructions are available here,
and I’ll run through my setup.
The miniconda setup goes under the install directive in your .travis.yml:
- sudo apt-get update
- if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
    wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh;
  else
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
  fi
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- hash -r
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
- conda info -a
- conda create -q -n test python=$TRAVIS_PYTHON_VERSION numpy scipy pandas=0.17.0 matplotlib pytest pytest-cov coverage sphinx nose
- source activate test
- pip install -U codecov
- python setup.py install
Woah! Let’s break it down. Firstly, there’s a check of travis’s python environment variable to grab the correct miniconda distribution. Then we install it, add it to PATH,
and configure it to work without interaction. The conda info -a is just a convenience for debugging. Finally, we go ahead and create the environment. I do specify a version
for Pandas; if I were more organized, I might write out a conda environment.yml and use that instead. After creating the environment and installing a non-conda dependency
with pip, I install the package. This gets us ready for testing.
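For reference, an environment.yml equivalent to that conda create line might look something like the following sketch. I haven't run this one on travis, and note that the python version ends up pinned in the file rather than read from $TRAVIS_PYTHON_VERSION:

```yaml
name: test
dependencies:
  - python=3.5
  - numpy
  - scipy
  - pandas=0.17.0
  - matplotlib
  - pytest
  - pytest-cov
  - coverage
  - sphinx
  - nose
```

The environment would then be built with `conda env create -q -f environment.yml` in place of the `conda create` line.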
After a lot of fiddling around, I believe this is the fastest way to get your Python environment up and running with numpy, scipy, and pandas. You can probably safely use
virtualenv and pip if you don’t need to compile massive libraries. The downside is that this essentially locks your users into the conda ecosystem, unless they’re
willing to risk going it alone re: platform testing.
Bioinformatics software (or more accurately, users…) often have to grind their way through the Nine Circles (or perhaps orders of magnitude) of Dependency Hell to
get software installed, and if you want CI for your project, you’ll have to automate this devilish journey. Luckily, travis has extensive support for this. For example,
I was easily able to install the LAST aligner from source by adding some commands under before_script:
- curl -LO http://last.cbrc.jp/last-658.zip
- unzip last-658.zip
- pushd last-658 && make && sudo make install && popd
The source is first downloaded and unpacked. We need to avoid mucking up our current location when compiling, so we use pushd to save our directory and
move to the folder, then make and install before using popd to jump back out.
Software from Ubuntu repos is even simpler. We can add these commands to before_install:
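The exact lines aren't shown here; for the packages named below, they would look something like this in the .travis.yml (package names are the standard Ubuntu ones):

```yaml
before_install:
  - sudo apt-get -qq update
  - sudo apt-get install -y emboss parallel
```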
This grabbed emboss (which includes transeq, for 6-frame DNA translation) and gnu-parallel. These commands could probably just as easily go in the install section,
but the travis docs recommended they go here and I didn’t feel like arguing.
py.test and the import file mismatch
I’ve used nose in my past projects, but I’m told the cool kids (and the less-cool kids who just don’t like deprecated software) are using py.test these days. Getting
some basic tests up and running was easy enough, as the patterns are similar to nose, but getting everything integrated was more difficult. Pretty soon, after
running a python setup.py test or even a simple py.test, I was running into a nice collection of these errors:
import file mismatch:
imported module 'shmlast.tests.test_script' has this __file__ attribute:
which is not the same as the test file we want to collect:
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
All the google results for this led to threads with devs and other benevolent folks patiently explaining that you need to have unique basenames for your
test modules (I mean it’s right there in the error duh), or that I needed to delete __pycache__. My basenames were unique and my caches clean, so something
else was afoot. An astute reader might have noticed that one of these paths given is under the build/ directory, while the other is in the root of the repo.
Sure enough, deleting the build/ directory fixes the problem. This seemed terribly inelegant though, and quite silly for such a common use-case.
Well, it turns out that this problem is indirectly addressed in the docs. Unfortunately, it’s 1) under the
obligatory “good practices” section, and who goes there? and 2) doesn’t warn that this error can result (instead there’s a somewhat confusing warning
telling you not to use an __init__.py in your tests subdirectory, but also that you need to use one if you want to inline your tests and distribute them
with your package). The problem is that py.test happily slurps up the tests
in the build directory as well as the repo, which triggers the expected unique basename error. The solution is to be a bit more explicit about where to find tests.
Instead of running a plain old py.test, you run py.test --pyargs <pkg>, which, in the clear and totally obvious language of the help, will
make py.test “try to interpret all arguments as python packages.” Clarity aside, it fixes it! To be extra double clear, you can also add a pytest.ini to your
root directory with a line telling where the tests are:
[pytest]
testpaths = path/to/tests
organizing test data
Other than documentation gripes, py.test is a solid library. Particularly nifty are fixtures, which make it easy to abstract away more boilerplate. For example,
in the past I’ve used the structure of our lab’s khmer project for grabbing test data and copying it into temp directories,
but it involves a fair amount of code and bookkeeping. With a fixture, I can easily access test data in any test, while cleaning up the garbage:
Deep in my heart of hearts I must be a functional programmer, because I’m really pleased with this. Here, we get the path to the tests directory,
and then the data directory which it contains. The test data is then all copied to a temp directory, and by the awesome raw power of closures,
we return a function which will join the temp dir with a requested filename. A better version would handle a nonexistent file, but I said raw power,
not refined and domesticated power. Best of all, this fixture uses another fixture, the builtin tmpdir, which makes sure the files get blown away
when you’re done with them.
Use it as a fixture in a test in the canonical way:
Earlier today I taught a half-day workshop introducing students to
doit for automating their workflows and building applications.
This was an intermediate-level python workshop, in that it expected students to have
operational python knowledge. The materials
are freely available, and the workshop was live-streamed on YouTube, where it is archived.
This workshop was part of a series being put on by our lab over the next few quarters. A
longer list of the workshops is at the dib training site, and Titus has written on them before.
Overall, I was happy with the results. Between the on-site participants and live-stream
viewers, our attendance was okay (about ten people total), and all students communicated
that they enjoyed the workshop and found it informative. Most of my materials (which I
mostly wrote from scratch) seemed to parse well, with the exception of a few minor bugs
which I caught during the lesson and was able to fix. As per usual, our training
coordinator Jessica did a great job handling the
logistics, and we were able to make use of the brand new Data Science Initiative space in
the Shields Library on the UC Davis campus.
Thoughts for the Future
We did have a number of no-shows, which was disappointing. My intuition is that this
was caused by a mixture of it being the beginning of the quarter here, with many
students, postdocs, staff, and faculty just returning, and the more advanced nature
of the material, which tends to scare folks away. It might be another piece of data
to support the idea of charging five bucks or so for tickets to require a small amount
of activation energy and thus filter out likely no-shows, but we’ve had good luck so
far with attendance, and it’d be best to make such a decision after we run a few more
similar workshops (perhaps it would only need to be done for the intermediate or
advanced ones, for example).
We also had several students with installation issues, a recurring problem for these
sorts of events. I’m leaning toward trying out browser-based approaches in the future,
which would allow me to set up configurations ahead of time (likely via docker files)
and short-circuit the usual cross-platform, python distribution, and software installation issues.
I really enjoyed the experience, as this was the first workshop I’ve run where
I created all the materials myself. I’m looking forward to doing more in the future.
ps. this has been my first post in a long time, and I’m hoping to keep them flowing.
Recently, I’ve been making a lot of progress on the lamprey
transcriptome project, and that has involved a lot of IPython notebook.
While I’ll talk about lamprey in a later post, I first want to talk
about a nice technical tidbit I came up with while trying to manage a
large IPython notebook with lots of figures. This involved learning some
more about the internals of matplotlib, as well as the usefulness of the
with statement in python.
So first, some background!
matplotlib is the go-to plotting
package for python. It has many weaknesses, and a whole series of posts
could be (and has been) written about why we should use something else,
but for now, its reach is long and it is widely used in the scientific
community. It’s particularly useful in concert with IPython notebook,
where figures can be embedded into cells inline. However, an important
feature(?) of matplotlib is that it’s built around a state machine; when
it comes to deciding what figure (and other components) are currently
being worked with, matplotlib keeps track of the current context
globally. That allows you to just call plot() at any given time and
have your figures be pushed more or less where you’d like. It also
means that you need to keep track of the current context, lest you end
up drawing a lot of figures onto the same plot and producing a terrible
abomination from beyond space and time itself.
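That global-current-figure behavior is easy to see in a few lines (a standalone sketch, using the Agg backend so it runs outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

plt.plot([1, 2, 3])   # drawn on the implicit "current" figure
fig = plt.gcf()       # matplotlib tracks that figure globally
plt.plot([3, 2, 1])   # no figure mentioned -- same axes again!
assert plt.gcf() is fig and len(fig.axes) == 1
plt.close(fig)        # forget this and plots pile up on each other
```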
IPython has a number of ways of dealing with this. While in its inline
mode, the default behavior is to simply create a new plotting context at
the beginning of each cell, and close it at the cell’s completion. This
is convenient because it means the user doesn’t have to open and close
figures manually, saving a lot of coding time and boilerplate. It
becomes a burden, however, when you have a large notebook, with lots of
figures, some of which you don’t want to be automatically displayed.
While we can turn off the automatic opening and closing of figures with the %config InlineBackend.close_figures = False magic,
we’re now stuck with having to manage our own figure context. Suddenly,
our notebooks aren’t nearly as clean and beautiful as they once were,
being littered with ugly declarations of new figures and axes, calls to
gcf() and plt.show(), and other such not-pretty things. I like
pretty things, so I sought out a solution. As it tends to do, python had an answer.
Enter context managers!
Some time ago, many’s a programmer was running into a similar problem
with opening and closing files (well, and a lot of other use cases). To
do things properly, we needed to do exception handling to properly and
cleanly call close() on our file pointers when something went wrong.
To handle such instances, python introduced context managers and the with statement.
From the docs:
A context manager is an object that defines the runtime context to be
established when executing a with statement. The context manager
handles the entry into, and the exit from, the desired runtime context
for the execution of the block of code.
Though this completely washes out the ~awesomeness~ of context
managers, it does sound about like what we want! In simple terms,
context managers are just objects that implement the __enter__ and
__exit__ methods. When you use the with statement on one of them,
__enter__ is called, where we put our setup code; if it returns
something, it takes the name given it by as. __exit__ is called
after the with block is left, and contains the teardown code. For our
purposes, we want to take care of matplotlib context. Without further
ado, let’s look at an example that does what we want:
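The original listing isn't reproduced here, so this is a reconstruction following the description below; the class name FigManager and the exact signature are my choices, not the original code:

```python
import matplotlib.pyplot as plt


class FigManager(object):
    """Manage a figure's full lifecycle inside a `with` block."""

    def __init__(self, fn='', exts=('pdf', 'png'), show=False, **kwargs):
        # most of the setup happens here: build the figure and axes
        # via plt.subplots, and stash the save/show parameters
        self.fig, self.ax = plt.subplots(**kwargs)
        self.fn = fn
        self.exts = exts
        self.show = show

    def __enter__(self):
        # hand the figure and axes to the block via `as (fig, ax)`
        return self.fig, self.ax

    def __exit__(self, exc_type, exc_value, tb):
        # save under each extension (matplotlib infers the format from
        # it), show if requested, then tear everything down
        if self.fn:
            for ext in self.exts:
                self.fig.savefig('{0}.{1}'.format(self.fn, ext))
        if self.show:
            plt.show()
        plt.close(self.fig)
        for ax in list(self.fig.axes):
            self.fig.delaxes(ax)
        del self.ax
        del self.fig
```

Usage then collapses to something like `with FigManager('my-fig', exts=('png',)) as (fig, ax): ax.plot(...)`, with saving, showing, and cleanup all handled on exit.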
Let’s break this down. The __init__ actually does most of our setup
here; it takes some basic parameters to pass to plt.subplots, as well
as some parameters for whether we want to show the plot and whether we
want to save the result to file(s). The __enter__ method returns the
generated figure and axes objects. Finally, __exit__ saves the
figure to the file name with the given extensions (matplotlib uses the
extension to infer the file format), and shows the plot if necessary. It
then calls plt.close() on the figure, deletes the axes objects from
the figure, and calls del on both instances just to be sure. The three
expected parameters to __exit__ are for exception handling, which is
discussed in greater detail in the docs.
Here’s an example of how I used it in practice:
That’s taken directly out of the lamprey notebook
where I first implemented this. I usually put a filelink in there, so
that the resulting image can easily be viewed in its own tab for closer
inspection.
The point is, all the normal boilerplate for handling figures is done in
one line and the code is much more clear and pretty! And of course, most
importantly, the original goal of not automatically displaying figures
is also taken care of.
For those of you who work with both the python codebase and the c++
backend, I found a pretty useful tool. Seeing as we work with
performance-sensitive software, profiling is very useful; but, it can be
a pain to profile our c++ code when called through python, which
necessitates writing c++ wrappers to functions for basic profiling.
The solution I found is called
which is a python module made specifically to profile c++ python extensions.
In order to install, simply run:
For khmer, you should also be sure to turn on debugging at compile time:
The first is the python module implementing the profiler; the second is
the tool for analyzing the resulting profile information.
There are a couple ways to use it. You can call it directly from the
command line with:
The -- is necessary, as it tells the option parser to stop interpreting the remaining
arguments as flags, which allows the profiler to pass them on
to the script being profiled instead of choking on them itself. Thanks
for this trick, @mr-c. Also make sure to use the absolute path to the
script to be profiled.
You can also use the module directly in your code, with:
The resulting file is then visualized using google-pprof, with:
In order to get python debugging symbols, you need to use the debugging
executable. So, while you may run the script in your virtualenv if using
one, you give google-pprof the debug executable so it can properly resolve the symbols.
Here is some example output:
In this call graph, the python debugging symbols were not properly
included; this is resolved by using the debugging executable.
The call graph is in standard form, where the first percentage is the
time in that particular function alone, and where the second percentage
is the time in all functions called by that function. See the docs
for more details.