Debugging MPI programs

Debugging MPI programs is similar to debugging any other program and generally involves a long, iterative process of editing, compiling and running.

Adding print statements to your code should not be underestimated as a debugging technique. They can be used to monitor variables or simply to flag how far execution has got. A bisection approach works well: use print statements to narrow down, step by step, exactly where the program is crashing. By printing the value of MPI_Wtime at several points you can also get an idea of where the time is being spent in your program.

For debugging it is generally easiest to run with the smallest number of processes that you can. This cuts down the amount of output, which can otherwise prove confusing. It may also be easier to print from only one of the processes. Print statements can be used in conjunction with any of the other techniques described later.
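The print-statement approach above can be sketched in C as follows. The helper names format_checkpoint and checkpoint are hypothetical, not part of MPI or of any library on the system. The rank and the elapsed time are passed in as plain arguments (in a real code they would come from MPI_Comm_rank and from the difference of two MPI_Wtime calls), so the helpers themselves do not depend on MPI.

```c
#include <stdio.h>

/* Hypothetical helper: format one checkpoint line into buf.
 * In an MPI code, rank would come from MPI_Comm_rank() and elapsed from
 * MPI_Wtime() - t_start; they are plain arguments here so that the
 * helper compiles without MPI. Returns the snprintf() result. */
int format_checkpoint(char *buf, size_t len, int rank,
                      double elapsed, const char *label)
{
    return snprintf(buf, len, "[rank %d] t=%.3f s: %s", rank, elapsed, label);
}

/* Print the checkpoint from a single rank only and flush immediately,
 * so the line is not left sitting in a buffer if the program crashes.
 * Returns 1 if a line was printed, 0 if this rank stayed silent. */
int checkpoint(int rank, int print_rank, double elapsed, const char *label)
{
    char line[256];

    if (rank != print_rank)
        return 0;
    format_checkpoint(line, sizeof line, rank, elapsed, label);
    fprintf(stderr, "%s\n", line);
    fflush(stderr);
    return 1;
}

/* Intended use inside an MPI program (sketch, not compiled here):
 *
 *     double t_start = MPI_Wtime();
 *     int rank;
 *     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *     ...
 *     checkpoint(rank, 0, MPI_Wtime() - t_start, "after halo exchange");
 */
```

To bisect a crash, place one checkpoint roughly halfway through the suspect region. If its label appears in the output, the crash lies after it, otherwise before it. Repeat on the remaining half. The rank prefix also makes it easier to separate lines when several processes do print.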

It depends on what sort of bug you are looking for, but in general turning off optimisation as a first step speeds up compiling and makes the hunt easier. You can do this with -O0 (a capital letter O followed by the number zero). Obviously, if the bug only shows up with optimisation on, you will have to investigate at what level of optimisation it appears. Try compiling with the -C option, which checks for most array subscripts going beyond their declared size, and with -trapuv, which helps to find uninitialised variables.

We also have the NAG Fortran 95 compiler, which is able to perform far more checking of your code. The NAG compiler should not be used for production work but is a very good tool to use during development, testing and debugging.
Use nagf95 for serial programs and mpnagf95 for MPI programs (it can also compile code written in C, but it uses gcc to do that).
The NAG compiler has been wrapped so that if HPCF_MODE is set (as it is by default) you get

/opt/NAG/bin/f95 -C=all -abi=64 -gline -I/opt/acml2.5.1/include <arguments> -L/opt/NAG/fll6a21d9l/acml -lacml

These options perform a large number of compile-time checks, enable traceback for run-time errors and link in the maths libraries for you. Further information about these options and all the others can be found on the nagf95 man page.

Debuggers

If any of these techniques have shown up problems (or even if they haven't) we can compile with the -g option and use a debugger to get further information.

The PathScale debugger pathdb has finally reached a stage where I would recommend it over gdb. It behaves in a very similar way but has a better understanding of Fortran structures and arrays. A control file of debugger commands allows the job to be submitted to a queue rather than run interactively. A control file might look like

$ cat db.in
cont
where

quit

The blank line between where and quit is there because where pages its output and waits for a carriage return before displaying the second page. Without it you would lose the rest of the output if the problem is in a subroutine more than a few layers down.
The job can then be run with

mpirun -np ntasks -dbg=pathdb a.out < db.in

If your program requires extra command line arguments, these can still be included as they were before.
If you have problems with pathdb (for instance, if you need to debug more than 4 processes at once) then gdb can be used instead.

mpirun -np ntasks -dbg=gdb a.out < db.in

Both of these runs treat all processes equally and give you all of the output in one file.

Floating point traps

On Linux, floating point errors are not normally trapped, so your program could have a division by zero, a floating point underflow or an overflow and would still run. Generally this sort of behaviour suggests a problem with the code, and you are probably not getting correct results. I could find no easy way to turn these traps on in Fortran, but the following section of C does.

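#define _GNU_SOURCE 1   /* feenableexcept is a GNU extension */
#include <fenv.h>

/* The constructor attribute makes this run before main(). */
static void __attribute__ ((constructor))
trapfpe (void)
{
    /* Unmask every floating point exception so that each one raises SIGFPE. */
    feenableexcept (FE_ALL_EXCEPT);
}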

For convenience I have compiled this as a library, so linking with -ltrapfpe enables all floating point traps.
Alternatively, PathScale have now added options to control these masks:
-TENV:simd_imask=OFF unmasks SIMD floating-point invalid-operation exceptions
-TENV:simd_dmask=OFF unmasks SIMD floating-point denormalized-operand exceptions
-TENV:simd_zmask=OFF unmasks SIMD floating-point zero-divide exceptions
-TENV:simd_omask=OFF unmasks SIMD floating-point overflow exceptions
-TENV:simd_umask=OFF unmasks SIMD floating-point underflow exceptions
-TENV:simd_pmask=OFF unmasks SIMD floating-point precision exceptions