doit
====

[doit](https://pydoit.org) on [GitHub](https://github.com/pydoit/doit)

~~~sh
pip3 install doit
apt install python3-doit
~~~

What is it?
-----------

`doit` is a Python-based workflow tool that works on the file level: you specify *actions* that connect *dependencies* to *targets*, e.g.

~~~py
# file dodo.py
def task_paper():
    return {"targets" : ["paper.pdf"],
            "file_dep": ["paper.tex", "img.png"],
            "actions" : ["pdflatex paper.tex"]}

def task_plot():
    return {"targets" : ["img.png"],
            "file_dep": ["plot.py", "simulation.py"],
            "actions" : ["python3 simulation.py --output plot_data.vtu",
                         "python3 plot.py --input plot_data.vtu"]}
~~~

You can define your workflows as chains of independent or connected tasks, and calling `doit` on the command line will automatically run them in the correct sequence. So `doit paper` will first run `task_plot` to create `img.png` and call `pdflatex` afterwards. Moreover, `doit` only reruns *actions* if their *dependencies* change: running `doit paper` a second time results in no *action*, as everything is up to date. Modifying only `paper.tex` causes `doit` to run `pdflatex` again, while modifying `plot.py` requires all tasks to be run again.

Why should I use it?
--------------------

A single `doit` command triggers the execution of the whole chain of actions needed to produce a desired output (like a paper). Benefits:

* It documents all steps from source files to the output in an objective way.
* It allows you to easily reproduce a publication after a long time.
* Changes to input files automatically propagate to the figures/PDF.
* It proves to you/colleagues/reviewers that no manual manipulations occurred.

Working with multiple systems
-----------------------------

Complex workflows may require several machines to run the actions. Reasons include:

- several scientists want to work on the same workflow
- compute or postprocessing tools are not available on every machine
- licensing
- *just reviewing* the text of a document on a laptop
- some actions have to run on an HPC machine

`doit` builds a database (in the folder of the `dodo.py`) that tracks, via hashes of the *dependencies*, which *actions* need to be rerun. That means each task has to be executed at least once on the current machine, even if its *targets* already exist and are up to date. Problems arise if, due to some of the reasons above, the current machine is unable to do so. One remedy is to store a *target* in your version control system and to decouple the whole workflow into multiple `dodo.py` files, each in a separate folder.

### Proposed setup:

~~~sh
repo/data/dodo.py        # generates input data
repo/simulation/dodo.py  # turns input data into results
repo/images/dodo.py      # turns results into images
repo/paper/dodo.py       # turns images into a paper
repo/dodo.py             # optionally connects all tasks
~~~

This breaks the chain of tasks: `paper/dodo.py` simply does not know that the *dependencies* of its tasks are *targets* in `images/dodo.py`. If you want to build the `paper.pdf` without rebuilding the files in `repo/images/`, you can simply run `doit` in the `repo/paper/` directory. If you want to execute the full workflow, call `doit` in `repo` instead.
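In shell commands (using the layout above), the two cases could look like this:

~~~sh
# build only the paper, reusing the images stored in version control
cd repo/paper
doit

# run the complete workflow via the root dodo.py
cd repo
doit
~~~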
If you work with absolute paths in your `dodo` files, such as ...

~~~py
# repo/simulation/dodo.py
from pathlib import Path

def task_simulation_1():
    simulation_dir = Path(__file__).absolute().parent
    script = str(simulation_dir / "simulation_1.py")
    parameters = str(simulation_dir.parent / "data" / "input_1.dat")
    image = str(simulation_dir.parent / "images" / "graph_1.pdf")
    return {
        "file_dep": [script, parameters],
        "targets": [image],
        "actions": [f"python3 {script} {parameters}"]
    }
~~~

... and ...

~~~py
# repo/paper/dodo.py
from pathlib import Path

def task_paper():
    paper_dir = Path(__file__).absolute().parent
    tex = str(paper_dir / "paper.tex")
    pdf = str(paper_dir / "paper.pdf")
    image = str(paper_dir.parent / "images" / "graph_1.pdf")
    return {
        "file_dep": [tex, image],
        "targets": [pdf],
        "actions": [f"pdflatex {tex}"]
    }
~~~

... it becomes easy to connect the `image` *target* of the simulation task with the identical `image` *dependency* of the paper task by just importing both into the same root file:

~~~py
# repo/dodo.py
from simulation.dodo import *
from paper.dodo import *
~~~

> Note that this is the opposite of why you want to use `doit` in the first place.
> It is now your responsibility to run the whole workflow of `repo/dodo.py` from time to time, either manually or via a *nightly build* on some server.

Skipping huge simulations
-------------------------

If you just want to skip tasks (e.g. because they are too time consuming), you can define a flag and skip the task definition if that flag is set. The simplest way would be to set the flag manually in the `dodo.py` file:

~~~py
import os

def skip():
    return True  # set this to False manually

def task_huge_simulation():
    if skip():
        return
    return {...}  # targets, file_dep, actions
~~~

The `dodo.py` itself is not a dependency, so changing this file does not cause a rerun of up-to-date tasks. As the file is part of your version control, toggling this flag will, however, cause a rather annoying diff. So it can be advantageous to change the flag without modifying the file, e.g. via an environment variable:

~~~py
import os

def skip():
    """Return False if "DOIT_SKIP" is empty or does not exist."""
    try:
        return os.environ["DOIT_SKIP"] != ""
    except KeyError:
        return False
~~~

This allows you to keep `DOIT_SKIP=True` on your local machine, but run with `DOIT_SKIP=` from time to time on a server to ensure full reproducibility.
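A minimal sketch of how this could look on the command line (POSIX shell assumed):

~~~sh
# local machine: skip the huge simulation
export DOIT_SKIP=True
doit

# server: empty flag, run everything for full reproducibility
DOIT_SKIP= doit
~~~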