Dependencies and Task Status

Learning Objectives

  • Introduce task uptodate
  • Introduce file dependencies

If you run doit again, you’ll notice that you get the same output as before: the dot and task name, indicating that the task was run again. Ideally, a task that has already run should not run again (the file might be large, for example). pydoit handles this in two ways: through file dependencies and through the uptodate entry. File dependencies are quite intuitive. The task names the files it depends on, and if any of those change, the task is rerun. Our download command adds a bit of complexity, though, because it does not depend on any files. This is where uptodate comes in.

uptodate is another entry in the task dictionary. It is a list of booleans, callables (that is, functions), or strings specifying shell commands. If any of the uptodate items is (or evaluates to) False, the task is considered out of date and is run.
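As a minimal sketch of the callable form, the task below uses a hand-written check instead of a literal boolean. The function name targets_exist is hypothetical and not part of doit; the sketch assumes doit calls each uptodate callable with the task object and a dictionary of previously saved values, and that the task object exposes its targets list.

```python
import os


def targets_exist(task, values):
    """Hypothetical check: up to date only if every target file exists."""
    # `task.targets` is the list given in the task dictionary below.
    return all(os.path.exists(path) for path in task.targets)


def task_download_data():
    return {
        'actions': ['curl -OL https://s3.amazonaws.com/pydoit-intermediate/Melee_data.csv.gz'],
        'targets': ['Melee_data.csv.gz'],
        # A boolean or a shell-command string could sit in the same list.
        'uptodate': [targets_exist],
    }
```

A string item would instead be run as a shell command, with a zero exit status counting as up to date.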

For our download task, we are going to use a function which is included in the doit library, run_once. As one might expect, this makes sure the task is run only once: after its first successful execution, it is considered up to date. Let’s add it to our dodo.py.

from doit.tools import run_once

def task_download_data():
    return {'actions': ['curl -OL https://s3.amazonaws.com/pydoit-intermediate/Melee_data.csv.gz'],
            'targets': ['Melee_data.csv.gz'],
            'uptodate': [run_once]}

Now if we run doit twice more, our output will change. The first run still executes the task and records that it has run; the second run reports it as up to date.

-- download_data

The dash-dash indicates that the task was determined to be up to date, and was not executed.

By default, the task name will be taken from the function defining it. We can also define our own task name with the basename entry in the task dictionary (the name entry is reserved for sub-tasks).

doit’s system for determining whether a task is up to date is actually somewhat complex. The documentation provides a more detailed list of conditions where a task is not up to date:

  • An uptodate item is (or evaluates to) False
  • A file is added to or removed from file_dep
  • A file_dep changed since last successful execution
  • A target path does not exist
  • A task has neither a file_dep nor an uptodate item equal to True

The last of these explains why we need to include the uptodate entry in our download task; it has no file dependencies, and so will always be considered out of date, even if the target exists.
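The conditions above can be sketched as a small decision function. This is a simplified model for illustration only, not doit’s actual implementation; the function name and all of its parameters are hypothetical.

```python
def is_up_to_date(uptodate_results, file_dep, previous_file_dep,
                  changed_deps, missing_targets):
    """Simplified model of the listed conditions (not doit's real code)."""
    # An uptodate item is (or evaluates to) False.
    if any(result is False for result in uptodate_results):
        return False
    # A file was added to or removed from file_dep.
    if set(file_dep) != set(previous_file_dep):
        return False
    # A file_dep changed since the last successful execution.
    if changed_deps:
        return False
    # A target path does not exist.
    if missing_targets:
        return False
    # No file_dep and no uptodate item equal to True.
    if not file_dep and not any(result is True for result in uptodate_results):
        return False
    return True


# A task with no file_dep and no uptodate entry is always out of date,
# even when its target exists.
print(is_up_to_date([], [], [], [], []))  # False
```

In this model, the download task only becomes up to date once one of its uptodate items evaluates to True, which is what run_once provides after the first successful run.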

Experimenting with uptodate

What would happen if we changed uptodate to [True]? How about [False]?

Now that we know more about how tasks are considered up to date, let’s introduce a file dependency. The file we downloaded was a gzip archive, so we’ll write a task to extract it. The command we would run in the shell might be:

$ gunzip Melee_data.csv.gz

This would produce a file called Melee_data.csv. We can see then that we have a target (Melee_data.csv), an action (running gunzip), and a file dependency (Melee_data.csv.gz). Let’s add the task to our dodo.py.

def task_gunzip_data():
    return {'actions': ['gunzip -c %(dependencies)s > %(targets)s'],
            'targets': ['Melee_data.csv'],
            'file_dep': ['Melee_data.csv.gz']}

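On top of the file dependency, this task also introduces automatic variables. These are the %(dependencies)s and %(targets)s placeholders in the action string; before running the command, doit replaces them with the task’s file_dep and targets entries. This removes redundancy and saves us some code. Note also that the action uses gunzip -c, which writes the extracted data to standard output and leaves the archive in place, so the target of the download task still exists afterwards.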
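To see what the placeholders turn into, here is a minimal sketch of the substitution using ordinary Python string formatting. It assumes the file names are joined with spaces; the variable names are illustrative and not part of doit.

```python
action = 'gunzip -c %(dependencies)s > %(targets)s'

# Values taken from the task dictionary's file_dep and targets entries.
file_dep = ['Melee_data.csv.gz']
targets = ['Melee_data.csv']

substitutions = {
    'dependencies': ' '.join(file_dep),
    'targets': ' '.join(targets),
}

command = action % substitutions
print(command)  # gunzip -c Melee_data.csv.gz > Melee_data.csv
```

Because the names are joined with spaces, a task with several dependencies or targets would need an action that makes sense with multiple file names in each position.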

When we run doit, we get output showing that only the gunzip task was executed.

-- download_data
.  gunzip_data