History of High Performance Computing in Cambridge


Both Turing and Babbage were decommissioned by 2001. The pages are not being maintained, but are being left accessible for historical interest. They may well have bad links and similar errors.


Quoting babbage's peak performance as 300 MFLOPS per processor certainly shows it to be a reasonably serious computer, but it does not appear phenomenally fast when compared with a DEC Alpha 21164 processor running at 600 MHz, which has a peak performance of 1.2 GFLOPS. Indeed, it would seem that hodgkin, with a peak performance of 600 MFLOPS per processor, should be faster. But the peak figures are deceptive. The pseudo-vector architecture of the SR2201 means that the achieved performance of tuned programs can be very much higher than that of workstations such as Alphas. Similarly, the advantages of hodgkin are not quite what they appear.

The comparisons which follow are intended to be representative only, and the data on non-HPCF machines are given in good faith but may be inaccurate. Corrections are welcome (support@hpcf.cam.ac.uk).


Stream - memory bandwidth

Stream (by Dr John McCalpin) is a benchmark intended to measure memory bandwidth. Some figures for various machines are given below (in MB/sec) for the four tests in the Stream suite. The results for babbage are for a single node, and those for mott and hodgkin for a single CPU. As babbage is a distributed memory machine, not a shared memory machine, it would arguably show perfect scaling if this benchmark were parallelised, and achieve around 100 GB/sec if all its nodes were used. The situation for hodgkin is less simple.

Machine                   copy   scale     add   triad
Cray Y/MP (one proc)      2430    2430    3450    3400
Cray J90 (one proc)       1440    1420    1310    1340
babbage, vectorised        777     736     831     824
SGI Octane, 175MHz         280     280     310     300
babbage, unvectorised      255     248     212     212
mott (~= hodgkin)          241     215     233     278
DEC PW433AU                207     206     226     228
DEC 250/5/266              141     140     151     136

(a conversion into MFLOPS can be attempted by dividing the "triad" column by 12)
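For reference, the four Stream kernels are no more than the following simple loops. This is a minimal C sketch, not the official benchmark source: the array length N is an arbitrary assumption chosen to defeat the caches, and the real code adds timing and result checking.

    #include <stddef.h>

    #define N 2000000              /* assumed array length, large enough to defeat caches */

    static double a[N], b[N], c[N];

    void stream_kernels(double q)
    {
        int i;
        for (i = 0; i < N; i++) c[i] = a[i];               /* copy:  16 bytes/iteration          */
        for (i = 0; i < N; i++) b[i] = q * c[i];           /* scale: 16 bytes/iteration          */
        for (i = 0; i < N; i++) c[i] = a[i] + b[i];        /* add:   24 bytes/iteration          */
        for (i = 0; i < N; i++) a[i] = b[i] + q * c[i];    /* triad: 24 bytes and 2 flops/iter   */
    }

The triad moves 24 bytes and performs 2 floating-point operations per iteration, which is where the factor of 12 above comes from.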

It would be possible to give the theoretical peak bandwidths but, as will be mentioned later, these are particularly deceptive for hodgkin.


Linpack 1000x1000 Fortran

The Fortran Linpack benchmark consists of solving a 1,000 x 1,000 system of simultaneous equations - some 2,500 x 2,500 results have been added to show scalability. For these tests unmodified Fortran source was used - vendors usually quote results from highly modified algorithms, occasionally with the kernels re-written in assembler. The exceptions are the entries marked NAG, which solve the problem using the NAG library and the vendors' BLAS. Again, the results are for a single node for babbage and for a single CPU for hodgkin.

Machine                  Theoretical    Linpack
                            (MFLOPS)   (MFLOPS)
hodgkin, NAG                     600        427
babbage, NAG                     300        165
hodgkin                          600        143
hodgkin, C                       600        135
babbage, inlined                 300         57
babbage, 2500^2                  300         57
babbage                          300         55
babbage, CV                      300         46
DEC PW433AU                      866         45
hodgkin, 2500^2                  600         44
DEC PW433AU, C                   866         42
babbage, C                       300         31
mott                             500         31
mott, C                          500         30
mott, 2500^2                     500         29
babbage, unvectorised            300         21
Pentium Pro 200                  200         21

The Linpack code is vaguely representative of the sort of codes used in scientific applications, and is written using the level 2 BLAS, which are shipped with the code. The above results show a considerable number of interesting effects.

The first is that using the NAG library to solve the equations is by far the fastest method - this is because it uses the vendors' level 3 BLAS and blocked algorithms, which give the best performance on cache-based systems. It shows what can be achieved, rather than what most users will achieve. It also shows why it is worth using NAG, when possible.
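As an illustration of the "let the library do the work" route, the sketch below calls LAPACK's dgesv from C. It stands in for the NAG routine actually used on the HPCF machines (whose name and interface are not given here); the matrix contents and the Fortran calling convention shown are assumptions for the sake of a self-contained example, not a recipe for either system.

    #include <stdio.h>
    #include <stdlib.h>

    /* Fortran LAPACK driver: solves A*x = b, overwriting b with the solution.
       Internally it is blocked and spends most of its time in the level 3 BLAS,
       which is why library solvers run so much faster on cache-based machines.
       Link against an LAPACK and BLAS implementation, e.g. -llapack -lblas.    */
    extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

    int main(void)
    {
        int n = 1000, nrhs = 1, info, i, j;
        double *a = malloc((size_t)n * n * sizeof *a);   /* column-major, as Fortran expects */
        double *b = malloc((size_t)n * sizeof *b);
        int *ipiv = malloc((size_t)n * sizeof *ipiv);

        /* Fill A and b with something solvable: diagonally dominant A, b = 1. */
        for (j = 0; j < n; j++) {
            for (i = 0; i < n; i++)
                a[i + j * n] = (i == j) ? (double)n : 1.0 / (1.0 + i + j);
            b[j] = 1.0;
        }

        dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
        printf("info = %d, x[0] = %g\n", info, b[0]);

        free(a); free(b); free(ipiv);
        return 0;
    }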

There is a vendor-optimised parallel version of Linpack for the SR2201, and babbage has achieved over 58 GFLOPS on this benchmark using 256 processors and working on a 69,120 x 69,120 matrix. This represents over 75% of the theoretical peak performance of 76.8 GFLOPS. There is probably a similar result for the Origin 2000, but we have not checked.

[The vendor-optimised LINPACK runs tend to use very different algorithms to solve the problem, and thus have a different (lesser) impact on the memory subsystem than the Fortran or C "standard" versions. Using the NAG library is the practical way of getting access to such optimisations.]

The second is that hodgkin is very much faster than babbage on a 1,000 x 1,000 problem, but mott is much slower (despite being nominally 70% of the speed of hodgkin). However, babbage maintains its speed on a 2,500 x 2,500 problem, whereas hodgkin's performance drops by a factor of three. This is simply because the first problem fits into hodgkin's cache (but not mott's) and the second does not - a 1,000 x 1,000 double-precision matrix occupies 8 MB, whereas a 2,500 x 2,500 one occupies 50 MB - and because the pseudo-vectorising feature of babbage enables it to maintain its speed on realistic sizes of problem, starting with simple Fortran code of the sort written by most scientists. In this, babbage is like vector supercomputers and unlike typical departmental workstations or hodgkin.

And, lastly, most workstations (and hodgkin) give C and Fortran performance separated by no more than 10%. This is not the case for babbage, as C is almost impossible to vectorise automatically. The C results quoted above were obtained with full vectorisation requested but no directives given in the code, and babbage partially vectorised only some of the loops in the timed section. The results of placing directives in the C are given as "CV".
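The underlying difficulty is easy to see in a fragment like the one below (a minimal sketch, not taken from the Linpack source). The compiler cannot prove that the two pointers refer to distinct arrays, so it must assume that a store through y might change values later read through x, which rules out automatic vectorisation; the Fortran equivalent carries no such ambiguity, and on babbage a directive is needed to give the C compiler the same guarantee (the exact directive syntax is not reproduced here).

    /* daxpy-like update, as it might appear in C.  The compiler has to allow
       for the possibility that x and y overlap (aliasing), so it cannot safely
       reorder or vectorise the loop without help from a directive.            */
    void update(int n, double alpha, double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }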

This simple benchmark shows up the most critical differences between the systems. To get good performance out of either system, it is essential to tune for that class of system. To a first approximation, babbage should be treated like a vector supercomputer, and hodgkin like a RISC workstation. Of course, the simplest and best solution is to call the NAG library and let that do the hard work!


Internode communication

Representative latencies and bandwidths for the various inter-node communications protocols possible on babbage and hodgkin are as follows:

Protocol                latency   bandwidth
                         (usec)    (MB/sec)
babbage, DMA                  5         280
babbage, MPI                 31         280
babbage, Express             26         104
babbage, PVM                 77          89
hodgkin, SHMEM              2-6         170
hodgkin, MPI                 15         170

For obvious reasons, we recommend the use of MPI on babbage, and have installed neither PVM nor Express on hodgkin. If a program's time is dominated by passing short messages (less than 16 KB on babbage, or 4 KB on hodgkin), then it can be tuned by using the underlying message passing mechanisms. All new message passing programs should be written using MPI, and converted only if necessary.
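Figures like those above come from simple ping-pong tests; a minimal MPI version is sketched below. The message size and repeat count are arbitrary choices, and a serious measurement would sweep a range of sizes and take the best of many repetitions.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1: rank 0 sends, rank 1 echoes the message
       back.  Half the round-trip time approximates the one-way latency for
       small messages and gives the bandwidth for large ones.                  */
    int main(int argc, char **argv)
    {
        int rank, i, nbytes = 1024, reps = 1000;   /* assumed message size and repeat count */
        char *buf;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = calloc(nbytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d bytes: %.1f usec one-way, %.1f MB/sec\n", nbytes,
                   (t1 - t0) / (2.0 * reps) * 1e6,
                   2.0 * reps * nbytes / (t1 - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }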

A discussion of the Origin 2000's hardware is needed here. Two processors share a system bus, and their inter-node access is through a single interconnect, but let us ignore that complication for now. The hardware interconnect is nominally 800 MB/sec, but hardware and software overheads reduce that to about 550 MB/sec. Unfortunately, there is a cache coherency problem which means that passed messages need to travel over the bus three times, and this accounts for the poor figures for hodgkin above.

Furthermore, as hodgkin is currently configured, the association between processors and the location of their memory is rather loose. This can cause the above performance to degrade if either a process or its data shares a node with an unrelated memory-bound process. On the other hand, the hardware is such that such fragmentation can also improve performance. It depends.

The available 'SHMEM' facilities will be described elsewhere, but there are several of them (often inappropriately named) and there is little information on how best to use them. SGI's recommended strategy for future programs is to use OpenMP, which is not a message passing interface. Because of the different way in which they are written, OpenMP programs can often obtain better performance than MPI ones, but it is virtually impossible to quote simple benchmarks for such use.
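To give a flavour of the OpenMP style, here is a minimal C sketch (most HPCF codes are Fortran, where the directives take the form of comments rather than pragmas); the loop bodies are arbitrary, and the point is simply that the work-sharing directives replace explicit message passing.

    #include <stdio.h>

    #define N 1000000

    /* Shared-memory parallelism in the OpenMP style: each "parallel for"
       directive splits its loop across the available threads, and no explicit
       message passing or data distribution appears in the source.             */
    int main(void)
    {
        static double a[N], b[N], c[N];
        double sum = 0.0;
        int i;

        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            b[i] = i;
            c[i] = 2.0 * i;
            a[i] = b[i] + c[i];
        }

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %g\n", sum);
        return 0;
    }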