Debugging MPI programs


Debugging MPI programs is similar to debugging any other program and generally involves a long iterative process of editing, compiling and running.

Adding print statements to your code should not be underestimated as a debugging technique. They can be used to monitor variables or simply to flag where in your program execution has got to. A bisection method can be employed, where one uses print statements to gradually narrow down exactly where the program is crashing. For debugging it is generally easiest to run with the minimum number of MPI processes that you can. This will cut down on the amount of output you get, which can prove confusing. By printing out the value of MPI_Wtime at various points you can get an idea of where the time is being spent in your program. It may also be easier if you only print from one of the processes. Print statements can be used in conjunction with any of the other techniques I describe later.

It depends on what sort of bug you think you are looking for, but in general it speeds up compiling and makes the hunt easier if you turn off optimisation as a first step. You can do this with -O0 (that is a capital letter O followed by the digit zero). Obviously, if the bug only shows up with optimisation on, you will have to investigate at what level of optimisation it appears. If the bug still shows up with no optimisation, try compiling with the environment variable HPCF_VERBOSE=all, which sets -xcheck=stkovf -fpover -u -ansi. Once you have fixed any problems this shows up, compile with the -C option, which checks for most array subscripts going beyond their declared size. If this shows up no problems either, compile with the -g option and use a debugger.
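The search for the optimisation level at which a bug first appears can be scripted. What follows is only a minimal sketch under stated assumptions, not a tool provided on the system: the function name first_failing_level and the variable OPT_LEVELS are made up for illustration, and the default list of levels (-O0 to -O3) is an assumption that should be changed to whatever your compiler accepts. The command you pass in is expected to compile with the given flag, run the program, and return a non-zero exit status when the bug appears.

```shell
#!/bin/sh
# Sketch only: all names here are illustrative, not part of any site tool.
#
# first_failing_level CMD [ARGS...]
#   Runs "CMD ARGS... FLAG" once for each FLAG in $OPT_LEVELS, in order.
#   CMD should build with FLAG, run the program, and return non-zero
#   if the bug shows up.
#   Prints the first flag for which CMD fails and returns 0;
#   prints nothing and returns 1 if CMD succeeds at every level.
first_failing_level() {
    # Default list of levels is an assumption; override with OPT_LEVELS.
    for flag in ${OPT_LEVELS:--O0 -O1 -O2 -O3}; do
        if ! "$@" "$flag" >/dev/null 2>&1; then
            echo "$flag"
            return 0
        fi
    done
    return 1
}
```

For example, if build_and_run.sh is a (hypothetical) script of your own that compiles with the flag given as its first argument and then runs the program, then

first_failing_level ./build_and_run.sh

prints the lowest level at which the run fails.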

dbx is the debugger most users are familiar with. It can be used interactively:

mprun -np ntasks -o dbx a.out

but a control file might make it easier and allows one to submit the job to a queue. This may be needed if the time and memory limits on the login machines make it difficult to debug your program interactively. It does of course have the drawback that you have to repeat the whole run if there is another command you want to run in dbx. A control file might look like this:

$ cat dbx.in
catch FPE
catch SIGSEGV
catch SIGBUS
run inputfile
where
dump
quit


This can then be run with

mprun -np ntasks -o dbx a.out < dbx.in

This gives you the output from all the MPI processes, which is pretty confusing, and a lot of it may prove redundant. It might be easier to look at just a few of the MPI processes:

$ cat rundebug.ksh
#!/bin/ksh
# mechanism to restrict debugging to a subset of MPI processes ...
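# numeric test (-lt) so that only ranks 0 and 1 run under dbx;
# inside [[ ]] the < operator compares strings, so rank 10 would also match
if [[ $MP_RANK -lt 2 ]]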
then
dbx a.out < dbx.in > dbx.out.t$MP_RANK
mpkill -9 $MP_JOBID
else
a.out
fi

The script needs to be executable, so change its permissions with
chmod +x rundebug.ksh
Then run it with
mprun -np ntasks rundebug.ksh
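If the processes you want to look at are not simply the lowest-numbered ones, the test in the wrapper script can be replaced by a lookup in a list of ranks. The following is a minimal sketch; the function name rank_selected is made up for illustration and is not part of mprun or dbx. It uses a numeric comparison, so rank 1 does not match an entry such as 10.

```shell
#!/bin/sh
# Sketch only: rank_selected is an illustrative helper, not a system command.
#
# rank_selected RANK "LIST"
#   Returns 0 if RANK is numerically equal to one of the
#   space-separated entries in LIST, otherwise returns 1.
rank_selected() {
    rank=$1
    for r in $2; do
        # -eq compares as integers, not as strings
        [ "$r" -eq "$rank" ] && return 0
    done
    return 1
}
```

With this function defined near the top of rundebug.ksh, the condition could read

if rank_selected "$MP_RANK" "0 5 12"

so that only ranks 0, 5 and 12 are run under dbx and the rest run a.out directly.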

The alternative to dbx is prism. This is a graphical tool that provides most of the functionality of dbx in a friendlier way.
To run your job with prism:

mprun -np ntasks prism a.out

Once prism has started up you need to hit run. Programs take longer to run under prism than under dbx, but some people find the interface more intuitive.