History of High Performance Computing in Cambridge


Turing was decommissioned at the end of February 1999. The pages on Turing are not being maintained, but are being left accessible for historical interest. They may well have bad links and similar errors.


Turing is the single-processor Hitachi S3600 vector supercomputer. This note discusses programming issues specific to Turing.

Contents

Need for Vectorisable Code
Arithmetic Issues (Important!)
Libraries
Using FORTRAN
FORTRAN directives
Using C
Using the Extended Storage
Measuring Performance
Debugging
Accounting Statistics


Need for Vectorisable Code

Turing has both a scalar and vector processor. The vector unit can sustain speeds of over 1 GFLOPS, whereas the scalar unit struggles to get above 10 MFLOPS. Code which cannot be vectorised will therefore run faster on a standard departmental workstation. Some simple examples of the effect of vectorisation and vector length are given below:

                 Some Fortran Timings (MFlops)

    Length     +     *     /  SQRT EXP  LOG   SIN ATAN
         1     3     3     2    0    0    0     0    0
        10    34    33    25    5    5    3     6    4
       100   291   287   177   43   16   31    65   36
     1,000   715   715   371   63   18   46   111   54
    10,000   827   826   379   63   18   46   111   54
   100,000   850   849   386   63   18   46   111   54
 1,000,000   857   870   389   63   18   46   111   54

Memory access speeds also depend very much on the type of access. Random (scalar) operations can access only about 2,500,000 values a second, whereas vector ones can access 2,000,000,000. The effect of vectorisation on the LINPACK benchmark is shown on the benchmark page.

Arithmetic Issues

Turing is rather unusual in that it does not use IEEE arithmetic, but rather IBM 370 floating-point arithmetic. This is unlike any common workstation (including DEC, HP, Intel, SGI, Sun etc.), or even babbage, but the same as Phoenix.

The precision of turing's arithmetic is similar to IEEE, but the range and rounding are both very different. The range of the exponent in double and extended precision is about -78 to 75 (decimal), unlike IEEE double precision, which has an exponent range of about -308 to 308. Although overflow is trapped by default, underflow is not. Underflow trapping may be requested with the run-time option -F'runst(uflow)', but this traps all underflows, whereas in some cases treating them as zero (the default) would have been the correct response.

More dramatically, rounding occurs by truncation, rather than true rounding. This means that there is a significant systematic error every time that rounding occurs (essentially every operation). As the error is systematic, it is unlikely to cancel readily. The errors build up very rapidly, and can lead to total loss of accuracy or program failure.

The general advice is:

In practice, most double precision codes work fine on turing, although if your code displays unexpected numerical instability, this may be the cause. Few people have good experience of dealing with numerical algorithms on non-IEEE machines, but Nick Maclaren does and will be able to offer advice in this area should it be required.

Integer and character formats on turing are as expected (32-bit integers, ASCII character set). 64-bit integers are not supported.

Libraries

The use of supplied libraries on turing is encouraged. These tend to be reasonably debugged and optimised.

NAG Mark 17 is available, complete with online documentation (nag_help and nag_info). NAG has good diagnostics and will probably warn you if it encounters problems due to turing's unusual floating-point arithmetic. The NAG library is entirely double precision, and the compiler options -i,E or -i,E,U must be used when compiling your code.

BLAS (all three levels) and much of LAPACK are included in NAG. Levels 0 and 1 of BLAS are available for inlining, and there are examples of this.

Hitachi also provide some of their own libraries. These offer similar performance to NAG, but with slightly less checking. The main vectorised libraries, and corresponding manuals, are:

Hitachi's unvectorised libraries have not been investigated in detail. There is GKS (the ISO standard Graphics Kernel System), but please contact us (support@hpcf.cam.ac.uk) before using it, because it might have quite serious performance implications. The other libraries are documented in the manuals:

Using FORTRAN

Both FORTRAN77 and FORTRAN90 are available (f77 and f90). These compilers are both native compilers, that is, they run on turing only. We prefer compilation jobs to be submitted via NQS (queue s05 is intended for this) and not run interactively.

The FORTRAN compilers follow the standard very closely, with few extensions. In particular, when compiling FORTRAN77:

There are paper manuals on both the language structure (Reference) and the compiler options (User's Guide). Users will probably want occasional access to these manuals. The CS User Library and several departments have copies. There are also a few notes on writing standard FORTRAN77.

The recommended compiler options are:

The -i options allow FORTRAN77 names to exceed 8 characters, and allow linking with subroutines written in C and the NAG libraries.

Do not compile part of a program with -i,... and part without, or with incompatible settings of the -i,... option.

The option for converting all real*4 objects to real*8 is langlvl(precexp(4),excnvf) and would be added to the above option string to form '-W0,hap,opt(o(s)),langlvl(precexp(4),excnvf)'.


FORTRAN Directives

Users familiar with the Cray Y-MP or J90, and other similar vector computers, will be familiar with vectorisation directives. These inform the compiler that an optimisation which is not in general safe is, in fact, safe for the specific loop in question. The compiler will trust your judgement, so do not use vectorisation directives unless you understand what you are doing - wrong answers will result.

This documentation does not describe the issues in detail, but, for users familiar with the concepts, the important directives are:

*voption indep states that all array references in the loop immediately following are independent. (Equivalent to Cray's CDIR$ IVDEP.)
*voption indep(a,b) ditto, but for arrays a and b only.
*voption novec do not vectorise the following loop.

The *voption must occur in the first column. If free-format is used, the option becomes 'voption.

Many more options are available, and can deal with cases of data references overlapping (e.g. DO I=1,N-1; A(I)=A(I+1); END DO) and other things. (Do not attempt to use *voption indep(a) to vectorise this example - the result is unpredictable, and may appear to work for small values of N.) See the paper "User's Guide" for full details, or these further examples, which also cover the inlining directives which can significantly improve turing's performance.


Using C

Firstly, do not use C for CPU-intensive code. It does not vectorise well, and typically runs twenty times slower than the corresponding FORTRAN. This is due to intrinsic problems in the C language definition, rather than any serious deficiency in the compiler, and as such it affects all vector computers.

C is fully supported for the following purposes:

The interfaces between Fortran and C are described in section 8 of the Fortran 77 and Fortran 90 User's Guides, but here is a very brief summary for calling fairly simple functions:

There is an example of calling C from Fortran.


Using the Extended Storage

Turing has 2.5 GB of fast extended storage available to users. This device provides fast direct-access I/O. Only one user may access it at once (via the v25es, v50es or v90es queues), and no data is retained between jobs.

To use it from FORTRAN, it is simply necessary to add type='es' to your open statements. However, please note that each direct-access record will be rounded up to a multiple of 4K, and the number of records must be specified so that the system knows how much space to allocate. There is an example of its use.


Measuring Performance

The facilities on turing for measuring the performance of code are rather basic.

There are two high-precision timers, xclock and vclock, which are available in FORTRAN. Their resolution is about 20 microseconds. The first measures elapsed CPU time, the second elapsed VPU time. The ratio of VPU to CPU time should be as high as possible. Beware: if your code is suffering from memory bank conflicts, the VPU will be stalled, and the VPU/CPU ratio artificially high.

Examples of the use of xclock and vclock are given. In C there are the functions _xclock and _vclock - see turing's man pages for more details.

There are also fairly standard and basic profiling and analysis options. Examples are again given, but unfortunately these utilities can distort the data they are measuring.

Finally there is the vtime command, an extension of the standard UNIX time command which additionally reports VPU time.

Unfortunately there is no way of measuring the achieved MFLOPS directly - one must either know the operation count for one's code, or have a time and MFLOPS figure from a run on another computer (Cray J90, DEC Alpha, etc.) which can then be scaled (differences in compiler optimisations notwithstanding).


Debugging

Users are strongly recommended to debug their code elsewhere, and most UNIX workstations are more suited to the task than turing. However, should you need to do debugging on turing, it is worth remembering the following points:

There is a dbx-like debugger called sdb. Further advice is given on the examples page.


Accounting Statistics

Accounting statistics for the previous days in the current month are available in /usr/local/acct/turing/recent/, with monthly summaries in /usr/local/acct/turing/usage/. These contain the following fields for a selection of jobs:

Some of the more important figures are also summarised in two other ways. The percentage of resources used by one person or command is displayed; if you start to use an unusually high proportion of any resource, you should check with an expert. Two efficiency indicators are also displayed; if they are not at least about 75% (and preferably above 90%), you should check the efficiency of your program or contact an expert.

The first two fields can also be obtained on a per-process basis by use of the vtime command.

The statistics can also be used for debugging. If your job fails unexpectedly, you can often see the resources it used and (most importantly) how it terminated. Most NQS jobs that end with a KILL signal will have exceeded some NQS limit, but this can also be caused by the system shutting down for maintenance or whatever.