Cambridge-Cranfield HPCF: History Turing Programming Guide

History of High Performance Computing in Cambridge

Turing was decommissioned at the end of February 1999. The pages on Turing are not being maintained, but are being left accessible for historical interest. They may well have bad links and similar errors.

Examples of Program Development on the Hitachi S-3600 (turing.hpcf)

A Simple Example

The simplest way of compiling and running small programs applies when the program has already been debugged (usually on another system) and the compilation time is fairly small. It is not suitable for compiling large applications, and is not advised for initial debugging on turing.hpcf.

Assume the executable file run_job contains the following:


        #!/bin/sh
        set -e
        if [ "$1" = '' ]
        then
            echo Queue name not set
            exit 1
        fi
        if [ "$2" = '' ]
        then
            echo Job script not set
            exit 1
        fi
        qsub -q $1 <<input
        #!/bin/sh
        #@\$-me
        $2 -$1
        input

Assume the executable file job_script contains the following:


        #!/bin/sh
        set -e
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1))' -o crunch crunch.f
        # or:  f90 -i,E -W0,'hap,opt(o(s),uinline(1))' -o crunch crunch.f
        crunch
        rm crunch

Typing 'run_job v10 job_script' on the 3050 workstation will submit an NQS job to execute the script job_script in queue v10 on turing.hpcf. When it has finished, it will send a mail message to you (on the workstation you submitted it from) and leave the standard output and standard error in files with names like STDIN.o1234 and STDIN.e1234. See the NQS User's Guide for more information.

The above scripts are only examples, and you can modify them or use different ones to taste. For example, the f77 command could be replaced by make, if you work that way, or by a call to your own F77 command which sets up your preferred options. Similarly, f90 or even cc could be used instead of f77. But the general method of using scripts rather than retyping complex commands is recommended.

Note that the uinline(1) has no effect unless inlining directives are included in your Fortran source. A better method for portable code is to keep them in a separate file, and there is an example of how to do this.

Compiling Larger Applications

If you need to compile a large application, please use queue s05 (i.e. scalar execution in 50 MB). It is a good idea to compile each module separately using the -c option, and link them in a separate command; this makes it easier to recompile a single module when making changes. make does this automatically, and has facilities to recompile when included files are changed, which is why it is usually recommended for large applications.

Assume that the files compile_script and run_script contain the following:


        #!/bin/sh
        set -e
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' -c IO.f
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' -c init.f
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' -c restart.f
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' -c compute.f
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' -c main.f
        f77 -i,E,U -o crunch IO.o init.o restart.o compute.o main.o
        # or the Fortran 90 equivalent commands.
and:
        #!/bin/sh
        set -e
        crunch

Then typing 'run_job s05 compile_script' will compile and link the program crunch. You would then execute the program in queue v10 by typing 'run_job v10 run_script'.

Note that the uinline(1,EXT(incon)) takes inlining directives from file incon, and there are examples of this.
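
For readers who do not use make, the idea of recompiling only what has changed can be sketched in the shell itself. This is an illustrative sketch, not a site-supplied script: the function name build_crunch and the F77 override variable are assumptions made for this example, and it assumes a POSIX shell (for the ${src%.f} expansion). The compiler options are those of compile_script above.

```shell
#!/bin/sh
# Sketch: recompile only out-of-date modules, then relink crunch.
# build_crunch and F77 are hypothetical names; set F77=echo to print
# the commands instead of running the compiler.
build_crunch() {
    objs=
    for src in "$@"
    do
        obj=${src%.f}.o
        # Recompile if the object is missing or older than its source.
        if [ ! -f "$obj" ] || [ -n "$(find "$src" -newer "$obj")" ]
        then
            ${F77:-f77} -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' \
                -c "$src" || return 1
        fi
        objs="$objs $obj"
    done
    # Link all the object files, whether or not they were just rebuilt.
    ${F77:-f77} -i,E,U -o crunch $objs
}
```

Calling 'build_crunch IO.f init.f restart.f compute.f main.f' from compile_script would then recompile only the modules whose source is newer than the object file before relinking. Unlike make, this does not notice changes to included files.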

Extended Storage for Direct-Access I/O

The following code allocates 16384 records of 4096 bytes each, for use as a direct-access scratch file on unit 10:


        open(10,access='direct',form='unformatted',recl=4096,
     *    maxrec=16384,type='es')

Other than the above, the file can be used like any other Fortran direct-access file, except that it will disappear when the command finishes, and that access will be much faster than to disk. The OPEN statement will fail if there is not enough space available. It is also possible to specify the size in a configuration file, as well as initialise and save the contents - for details, see the Fortran 77 or 90 User's Guide.

Using Standard Libraries

To call the NAG library, you need only the option -lnag, but be careful to use the right compiler options and precision. For example:


        f77 -i,E,U -W0,'hap,opt(o(s))' crunch.f -lnag

If your program uses the level 0 or level 1 BLAS, you should request that they are inlined, and there is a file of inlining directives available in the file /usr/include/NAG_BLAS. For example:


        f77 -i,E,U -c \
            -W0,'hap,opt(o(s),uinline(1,EXT(/usr/include/NAG_BLAS)))' \
            IO.f init.f restart.f compute.f main.f
        f77 -i,E,U -o \
            crunch IO.o init.o restart.o compute.o main.o -lnag

There is also a file /usr/include/NAG_functions that contains inlining directives for selected NAG functions (mostly from the S and X chapters).

To call the Matrix/HAP library, you use the option -lmathe80, which selects the version that is tuned for the S-3600 model 180. The manuals describe other possibilities, but it is unlikely that you will want to use them. For example:


        f77 -i,E,U -W0,'hap,opt(o(s))' crunch.f -lmathe80

The -i,E,U is not critical in this case, but it is recommended. To call the MSL2 library, just use -lmsl2 instead of or in addition to -lnag or -lmathe80.

Calling C from Fortran

The Fortran 77 and Fortran 90 User's Guides describe two ways of calling C from Fortran: by the use of interface functions and by the use of compiler options. The second method is recommended at Cambridge, and is described here.

Assume the files crinkle.f, epsilon.c and job_script contain the following:


                ...
                double precision x, epsilon
                x = epsilon()
                ...
 and
        #include <float.h>
        double EPSILON (void) {return DBL_EPSILON;}
 and
        #!/bin/sh
        set -e
        cc -c -O epsilon.c
        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,EXT(incon)))' -c crinkle.f
        f77 -i,E,U -o crinkle crinkle.o epsilon.o
        # or:  f90 -i,E -W0,'hap,opt(o(s))' -c crinkle.f
        # and: f90 -i,E -o crinkle crinkle.o epsilon.o
        crinkle
        rm crinkle

then typing 'run_job v10 job_script' will compile and run the mixed language program. You can merge the different compilation steps (i.e. compile and link in one command) if you prefer. For larger programs, you should compile and link separately in queue s05 before running the program.

Note that the Fortran function name must be upper-cased in the C code. If you need to call C functions with names in lower or mixed case (including most standard C and UNIX functions), you should use the interface routines documented in the Fortran 77 and Fortran 90 User's Guides (i.e. the other method referred to above). You are strongly advised not to use different -i,... options, because of the difficulty in mixing code with different casing conventions.

Using Vectorisation Directives

In general, the compiler will vectorise clean, simple code efficiently without any specific directives. If this doesn't work, you may need to add some directives, but take great care that you don't confuse the compiler into generating incorrect code. The Fortran 77 and Fortran 90 User's Guides give details.

All these directives are of the form *VOPTION suboption and must immediately precede the DO loop that they are intended to affect; as with most other vector systems, loops caused by GOTO statements are not recognised. The following are examples of important directives; in all cases, a list of arrays (e.g. (A,B,C)) can be specified where the example shows just (A).

When tuning the NAG library (mainly the BLAS), the directives used were mostly INDEP, some OVLAP(S,) and a couple of OVLAP(ES,) and OVLAP(L,).

The INDEP option states that the named arrays (A and C in the examples below) are `independent' of the loop index I, in the sense that the elements of the arrays can be operated on in any order without affecting the result. For example:


        *voption indep(a)
                DO 10 I = 1,N,K
           10   A(I) = A(I)+B(I)
        *voption indep(c)
                DO 20 I = 1,N
        C   L is assumed to be non-zero
                J = L*I
           20   C(J) = C(J)+B(I)

In the case of array A (but not C), the compiler could have deduced this because K is forbidden to be zero by the Fortran standard, but the current version uses the same logic for the two cases and so needs to be told that the subscript is `safe'.

The S-3600 handles indirect vector addressing efficiently, but the compiler needs some help, especially when an indirect reference occurs on the left hand side of an assignment. If the elements of the index vector are distinct, then the INDEP(A) option should be used, as for unknown step sizes. For example:


        *voption indep(a)
                DO 10 I = 1,N
           10   A(L(I)) = A(L(I))+B(I)

If the elements of L are NOT distinct, then using the INDEP(A) option could give wrong answers. The OVLAP(S,(A)) option states that fetches from A may overlap a previous store, but all references to A in the same loop iteration are for the same value. For example:


        *voption ovlap(s,(a))
                DO 10 I = 1,N
                A(L(I)) = B(I)
           10   B(I) = A(L(I))+1.0

The OVLAP(ES,(A)) option is similar, but states that the previous store always uses an index corresponding to elements that will be fetched at a later iteration (i.e. L(I) >= M(I)). OVLAP(L,(A)) is the converse (i.e. with L(I) < M(I)).

The NOVEC option unconditionally prevents vectorisation, for the rare cases in which vectorising a loop is a bad idea or would produce incorrect results; it is unlikely to give much performance gain. The VEC option forces vectorisation under most circumstances, but is definitely a sledgehammer approach. You should avoid it if possible, because of the risk of generating incorrect code.

Using Inlining Directives

Unlike on the Cray YMP etc., there is currently no option to tell the compiler to inline routines automatically; you have to tell it which routines to inline. This is done by inserting directives into your code (usually at the start of a module) or into a separate file. For example:


        *uinline utility.f(flip,flop,flap)
        *uinline muckle.f(flugga)

The standard form of inlining directive will inline only the simplest and smallest routines (up to 30 lines), but is fairly safe. You can inline routines with some forms of COMMON block, DATA statement etc., but you should read the warnings in the User's Guide before doing this. And you should check that inlining larger routines will not 'bloat' your code beyond all reason. To relax the restrictions, use the following form of inlining directives:


        *uinline utility.f(flip,flop,flap),extended
        *uinline muckle.f(flugga),extended

These directives will insert the code of the routines FLIP, FLOP and FLAP from file utility.f and FLUGGA from file muckle.f (note that the filename must be in the correct case) everywhere there is a call to them. If these directives are at the start of a source file (e.g. crunch.f), then they will apply to the whole of that source file. Unfortunately, there is no shorthand for specifying a routine from the current source file.

It is much better to use a separate file of inlining directives than to modify your main source, and this is the recommended method, especially when inlining library code such as the BLAS. But do remember that the search path is interpreted when the compiler is run, and not relative to the file of inlining directives. Assume that the file inline.control contains the following directives:


        *uinline /users/hd1/fred/includes/inlining(ddot,dxscal)
        *uinline my_lib/utility.f(flip,flop,flap)

Then using the following command in your compilation script will cause the routines DDOT, DXSCAL, FLIP, FLOP and FLAP to be taken from the specified files and inlined:


        f77 -i,E,U -W0,'hap,opt(o(s),uinline(1,ext(inline.control)))' \
            -c crinkle.f

Note that you may get confused if you have a routine in a source file and you specify it in a *uinline directive with a different file name; the inlining directive will USUALLY take precedence. Watch out for this trap!

You must also compile and link the inlined routines, because the compiler produces an external reference even for routines that are used only in inlined form. And remember that this can cause confusion when you put diagnostics into a routine that is eligible for inlining.

There is also an example of how to inline the BLAS from the NAG library.

Getting Compilation Messages

The previous jobs select most optimisations, but do not check how well your code was vectorised or inlined. In order to do this, it is easiest to work on one module at a time. Assume that the file compile_script contains the following:


        #!/bin/sh
        set -e
        f77 -i,E,U -W0,'hap(diag(2)),opt(o(s),uinline(2)),list(e(0))' \
            -c compute.f
        # or:  f90 -i,E -W0,'hap(diag(2)),opt(o(s)),list(e(0))' \
        #          -c compute.f

Then typing 'run_job s05 compile_script' will produce a large number of compilation messages in the output file with a name like STDIN.e1234. These will be somewhat cryptic, but will say whether a particular loop was vectorised or a routine call inlined. If these actions are not happening, and you think that they should, check the User's Guides for possible explanations and appropriate directives. There are examples of using vectorisation directives.

If you are using a separate file for inlining directives (which is recommended), you should replace uinline(2) by uinline(2,EXT(inline.control)), where inline.control is the file containing inlining directives.

Instruction Profiling

Fairly standard instruction profiling is available, though it has the usual problems. In particular, its overheads are quite large, and its use distorts the values it prints. Instruction profiling is mainly useful for investigating important routines in detail, and is rarely worth while for a whole program. Assume that the file job_script contains the following:


        #!/bin/sh
        set -e
        f77 -i,E,U -W0,'analyze(c),hap,opt(o(s),uinline(1))' -c compute.f
        # or:  f90 -i,E -W0,'analyze(c),hap,opt(o(s))' -c compute.f
        f77 -i,E,U -o crunch IO.o init.o restart.o compute.o main.o
        crunch
        rm crunch

Then typing 'run_job v10 job_script' will submit a job to compile module compute.f for profiling, link it with other modules and run it. When it runs to normal completion (i.e. it does not crash), it will produce a file ft.count. This will contain execution counts down to the statement block level (i.e. consecutive statements without a branch or loop).

The commands f77ts and f77tv seem to be just front ends to f77 with the analyze option. There is also an analyze option to estimate relative CPU times. These variations are not generally recommended.

As a possibly-extreme, but certainly real-world, example of this distortion, a test code gave the following VPU and CPU times with different profiling / analysis:


Option        CPU     VPU
None         2.837   1.467
prof -p      2.837   1.467
prof -g      3.015   1.460
analyze(c)   3.494   1.470
analyze(r)  10.085   1.468

Admittedly this short code did involve around 100,000 function calls, so may be slightly atypical, but it does show what can occur.
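
The size of the distortion is easier to see as a percentage of the unprofiled CPU time. The following shell function is a small sketch of that arithmetic; the name overhead is invented for this example, and it assumes an awk that supports the -v option.

```shell
#!/bin/sh
# Percentage increase in CPU time caused by profiling.
# $1 = CPU time without profiling, $2 = CPU time with profiling.
# (overhead is a hypothetical helper name, not a system command.)
overhead() {
    awk -v base="$1" -v prof="$2" \
        'BEGIN { printf "%.1f\n", 100 * (prof - base) / base }'
}
```

With the figures in the table, 'overhead 2.837 3.494' gives about 23 per cent for analyze(c) and 'overhead 2.837 10.085' gives about 255 per cent for analyze(r), while the VPU times are essentially unchanged.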

Incidentally, this can also be used as a debugging tool. When you are testing code, you can use profiling to check that code that you think is being executed has actually been executed! This is especially useful when checking error and exceptional case handling.

Measuring the Scalar and Vector CPU Times

The xclock and vclock subroutines measure the scalar and vector CPU time, respectively - a perfectly vectorised program will have a vector CPU time that is about the same as its scalar CPU time. These can be called without arguments to start the clocks, or with a second integer argument of 5 to read the clock into the first argument. In the latter case, they return double precision results, measured in seconds. For example:

            DOUBLE PRECISION X1, X2, V1, V2
            CALL XCLOCK
            CALL VCLOCK
            ...
            CALL XCLOCK(X1,5)
            CALL VCLOCK(V1,5)
            CALL CALC
            CALL XCLOCK(X2,5)
            CALL VCLOCK(V2,5)
            WRITE (*,'(A,F8.2,A)') ' Scalar time for CALC:',X2-X1,' seconds'
            WRITE (*,'(A,F8.2,A)') ' Vector time for CALC:',V2-V1,' seconds'

Because of their overheads, timing functions should not be inserted between individual statements. Their results may be misleading for sections of code that take less than about 100 microseconds to run, and better results will be obtained if the interval is more like 0.01 seconds.

For production work, you may find it more convenient to look at the accounting figures in subdirectories of /usr/local/acct/turing on either of the workstations or turing.hpcf itself. Subdirectory usage contains monthly summaries of major users and commands, giving not just CPU time but vectorisation efficiencies and much else. Subdirectory recent contains the last 3 months' daily summaries, including statistics on the largest jobs that finished during that day.

Inserting Run-time Checks

Because of their mainframe heritage, Hitachi's S-3600 Fortran compilers have much better debugging facilities than most other UNIX ones. In particular, they can insert checks to detect most argument and subscript errors. But please remember that checking code may run up to a THOUSAND times slower than fully vectorised, unchecked code.

To insert these checks, you need to compile all relevant routines with a different set of options. Assume job_script contains the following:

        #!/bin/sh
        set -e
        f77 -i,E,U -W0,'testmode(e(1),g,a(2),s)' -o crunch crunch.f
        # or:  f90 -i,E -W0,'testmode(e(1),g,a(2),s)' -o crunch crunch.f
        crunch
        rm crunch

Then typing 'run_job s05 job_script' will compile, link and run the program crunch with all run-time checks enabled.
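
Since checked and optimised builds differ only in the -W0 options, one way to avoid keeping two nearly identical scripts is a small wrapper that selects the option set by name. The sketch below is illustrative only: the function name f77_mode, the mode names fast and check, and the F77 override variable are all assumptions made for this example; the option strings themselves are copied from the scripts on this page.

```shell
#!/bin/sh
# Sketch: choose between optimised and checked compilation by name.
# f77_mode, "fast", "check" and F77 are hypothetical names; set F77=echo
# to print the command instead of running the compiler.
f77_mode() {
    mode=$1
    shift
    case $mode in
        fast)  opts='hap,opt(o(s),uinline(1))' ;;   # vectorised, optimised
        check) opts='testmode(e(1),g,a(2),s)' ;;    # all run-time checks
        *)     echo "Unknown mode: $mode" >&2; return 1 ;;
    esac
    ${F77:-f77} -i,E,U "-W0,$opts" "$@"
}
```

In a job script, 'f77_mode check -o crunch crunch.f' would then build the checked version and 'f77_mode fast -o crunch crunch.f' the optimised one. Remember that all relevant routines need to be compiled in the same mode.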

Note that Fortran 77 on the Hitachi 3050s, NAG Fortran 90 on all systems that have it and some other compilers have options to include some or all of these checks. Please use these on workstations in preference to turing.hpcf if you need more than occasional debugging.

Interactive Debugging

Currently, the interactive debugger does not support Fortran 90 programs. Experienced hackers may be able to debug Fortran 90 using a C-level debugger, but most users are advised not to bother. You should add diagnostic checking code and WRITE statements instead.

If your program has crashed and created a core file, but did not print a traceback, you can usually use the debugger to get one. You need to log in to turing.hpcf by typing the following command from one of lovelace.hpcf or hooke.hpcf:

        rlogin turing

After you have logged in, you need to type:

        sdb crunch core
        t
        q

If this is enough information, you should then delete the core file and log out again. For more detailed interactive debugging, you usually need to recompile all relevant routines (to create a symbol table), but you do not need to recompile your whole program. Assume the file compile_script contains the following:

        #!/bin/sh
        set -e
        f77 -i,E,U -g -c compute.f
        f77 -i,E,U -o crunch IO.o init.o restart.o compute.o main.o

Then typing 'run_job s05 compile_script' will compile and link program crunch with symbol tables for module compute.f. Note that it is impossible to use the interactive debugger on programs that have been vectorised (i.e. compiled with the -W0,hap... option), because the two options are incompatible, so you have to use automatic run-time checks if the problem occurs only in vectorised code. You should then log in to turing.hpcf as described above and type:

        sdb crunch

The use of sdb is documented in the HI-OSF/1-MJ OSCBASE Application Programmer's Guide. It is a C debugger adapted for use with Fortran 77, and so is not as user friendly as it might be. Note that the program file crunch and any core file must not be changed before running the sdb command. You can debug code that has no symbol table, but it falls into the category of advanced hacking and is best avoided.

Interactive debugging is a very inefficient way of using turing.hpcf, so please use a workstation in preference whenever possible. If too much interactive debugging causes performance problems, it may have to be locked out.

Transferring Binary Data to the SR2201

The floating-point format used on the S-3600 is not compatible with that used on the SR2201, but there are run-time options that allow binary data (i.e. data accessed using Fortran unformatted I/O statements) to be read or written in the latter format. You compile and link your program in the usual way, and use a run-time option.

For example, to write SR2201 format data to units 10, 11 and 16, you could run it using a command like the following:

        crunch -F'runst(cvout(3050r(10,11,16)))'

To read SR2201 format data, use cvin rather than cvout. At present, it is not possible to read and write SR2201 format data in the same program.
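
The nesting of the run-time option is easy to mistype, so it may help to build it from its parts. The following sketch shows the structure; the function name runst_opt is invented for this example, and it only reproduces the option form shown above, so anything beyond cvin or cvout with a list of unit numbers is outside its scope.

```shell
#!/bin/sh
# Build the run-time conversion option from a direction and a unit list.
# $1 = cvin (read) or cvout (write); remaining arguments = unit numbers.
# (runst_opt is a hypothetical helper name, not a system command.)
runst_opt() {
    case $1 in
        cvin|cvout) dir=$1 ;;
        *) echo "usage: runst_opt cvin|cvout unit..." >&2; return 1 ;;
    esac
    shift
    [ $# -gt 0 ] || { echo "runst_opt: no units given" >&2; return 1; }
    units=$(IFS=,; echo "$*")      # join the unit numbers with commas
    printf '%s\n' "-Frunst($dir(3050r($units)))"
}
```

The command shown earlier could then be written as 'crunch "$(runst_opt cvout 10 11 16)"'.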

The format used on the SR2201 is big-endian IEEE, which is the same as used on the Motorola 68K (i.e. old Sun, old HP etc.), Sun SPARC, HP PA-RISC, IBM RS/6000 (and SP-2) and PowerPC running AIX (in all cases, except possibly for extended precision floating-point). However, SR2201 Fortran unformatted files cannot be transferred directly to or from those systems, because of differences in the implementation of Fortran unformatted records. A conversion utility should be fairly easy to write - please ask Nick Maclaren for details.

The format is NOT the same as used on the DEC VAX, DEC Alpha (including the Cray T3D), MIPS (i.e. SGI and old DEC), Intel, Cray (i.e. YMP) or PowerPC running Windows NT; data must be converted to human-readable format (i.e. using Fortran formatted I/O statements) for import from and export to these systems. A general conversion utility is impossible.