Have you ever asked yourself: "Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week."
If so, this blog post might be of interest to you. If not, it might still be of interest -- what do I know about your interests? ;) Below I outline my process of running my analyses on the cluster with minimal recoding/adaptation.
Problems with the naive (common) approach
My main problems with moving my computations to the cluster have always been:
Tedious recoding of my scripts to parallelize them: Most common cluster systems like SLURM or TORQUE require you to write a script that starts each parallel job in a single line. I saw many researchers then recode their analysis scripts to parse command line arguments and have hundreds of lines of various parameter values they want to run.
Difficult debugging: I spent too many hours coding my clusterized code, copying them over, submitting a job, waiting for it to schedule, to only then find out that I had an obvious bug. The cycle of seeing the job to start is simply too long.
Non-pickleable objects: I'm sure everyone experimenting with parallel programming in Python stumbled over the problem that everything you want to distribute to your workers has to be pickleable. Unfortunately, most things in Python are not pickleable (no functions, no classes, no objects etc).
dillhelps with that but in the end is also limited (see here for a blog post discussing this issue).
Desiderata of setup
If this sounds terrible, it's because it is! My ideal setup thus has the following properties:
- I can make sure everything worked correctly on my local setup first before moving it to the cluster.
- Since I spend my time analysing data exclusively in the IPython Notebook these days I don't want to switch to a script-based setting just so I can run the analyses on the cluster.
- Finally, I don't want to be limited by what is pickable, so bringing my workers into the correct state should come with minimal extra-work.
Setting up the IPython Notebook
We are going to use IPython parallel inside the IPython Notebook (IPyNB). The IPyNB allows me to launch multiple client workers (called engines) from the dashboard on my local machine (i.e. my laptop I do most of my work on). I first launched the engines on my laptop via the dashboard; I mainly need this for testing so I'm not concerned that I only have few cores available. Then, inside the IPython notebook, I can connect to the worker engines via:
from IPython import parallel c = parallel.Client() view = c.load_balanced_view()
This sets up a
view object (see the IPython parallel docs for more info) that allows us to distribute our jobs.
The next issue we solve is that of bringing the workers into the correct state without relying on
pickle. IPython has the nifty
%%px magic operator for that. What it does is that it executes the cell on all connected clients. If we add
--local it also executes it on the main node.
%%px --local import numpy as np import pandas as pd import pickle class MyLongAssComputation(object): def compute(self, x): return x**2 def compute_func(x): c = MyLongAssComputation() return c.compute(x)
This makes sure everyone has access to the necessary modules and any classes or functions I need to execute a job. Obviously this code makes no sense but it's supposed to demonstrate that you can basically set up anything you want on the nodes and refer to them. This fulfills desiderata #3.
Finally, we want to load any data and distribute it to the workers. This we execute only from the master node (so no magic prefix).
x = np.arange(1, 1000) squared = view.map_sync(compute_func, x) pickle.dump(squared, open('output.pickle', 'w'))
In this local setup I can easily identify and debug my code. When I want to launch
pdb I can easily call the parallel function directly on the master to have full control. I can also easily edit the above code, re-execute the cell to update the workers and retry to distribute the jobs, fulfilling desiderata #1.
Executing the IPython Notebook on the cluster
Once I made sure everything functions correctly on my local machine and perhaps a smaller set of data I want to run the full, long job on the cluster going massively parallel. For that I first copy (
scp) the notebook to the cluster and then use
ssh to connect to the cluster. From hereon out, everything will be happening on the cluster.
You have to write a script with some special comments that instruct the scheduler on how many cores you want (
-n), how long the job is supposed to run (
#!/bin/sh #SBATCH -J ipython #SBATCH -n 64 #SBATCH --time=48:00:00
Next in our script we want to launch the IPython parallel cluster. First, we launch the IPython controller which distributes the jobs. We also specify that it should accept connection from external IPs (
--ip='*') instead of being restricted to
echo "Launching controller" ipcontroller --ip='*' & sleep 10
We next need to launch our engines. Most schedulers have a command to distribute a shell command on all your reserved nodes.
srun which will execute the command 64 times (as specified above). You could also use
srun ipengine & sleep 25
Finally, we want to run our IPyNB on the cluster. For this we'd like to just execute an IPyNB like a Python script. Luckily, there are some scripts that do exactly that. For our purposes,
checkipnb.py does the job. You'll have to download it and save it to the directory.
echo "Launching job" python checkipnb.py $1 echo "Done!"
$1 refers to the command line argument of the cluster script so that I can decide which IPyNB to run when I launch the job. That's the script, I now only need to copy the data and the IPyNB to the cluster and submit my job via:
sbatch submit_ipython_notebook.sh MyIPyNB.ipynb
(Other schedulers have different commands to submit jobs like
What happens next is that
SLURM will schedule my job in the queue. Once it gets started with access to 64 cores, the shell script is executed and launches
ipcontroller and 64 instances of
ipengine which automatically know how to connect and register to the controller.
checkipnb.py then executes each cell in
MyIPyNB.ipynb consecutively. The
%%px cells will bring all the engines into the correct state while the
map_sync() call distributes the work. Once the computation is done, the results are saved to the pickle output file which I can then copy back (using
scp) to my local setup to analyse.
What is nice is that I never have to touch the cluster script again. I simply write an IPyNB that computes my analyses in parallel and if the computations become too big I can just copy them to the cluster and launch them there with minimal overhead.
The full script is as follows (note that I'm running a local Anaconda install; this really helps as you normally don't have admin access on the cluster and don't want to have to keep asking them to install custom packages):
#!/bin/sh #SBATCH -J ipython #SBATCH -n 64 #SBATCH --time=48:00:00 echo "Launching controller" $HOME/anaconda/bin/ipcontroller --ip='*' & sleep 10 echo "Launching engines" srun $HOME/anaconda/bin/ipengine & sleep 25 echo "Launching job" $HOME/anaconda/bin/python checkipnb.py $1 echo "Done!"
In this post I demonstrated and explained my cluster setup. The overhead of moving my computations to the cluster is greatly reduced which is what I care most about. I also don't have to route around the problem of pickling and I never have to leave the IPython Notebook.
Other interesting projects that make parallelization easier:
Thanks to Chris Chatham for proof-reading and valuable input.