History of High Performance Computing in Cambridge


Both Turing and Babbage were decommissioned by 2001. The pages are not being maintained, but are being left accessible for historical interest. They may well have bad links and similar errors.


Quoting babbage's peak performance as 300 MFLOPS per processor certainly shows it to be a reasonably serious computer, but it does not appear phenomenally fast when compared with a DEC Alpha 21164 processor running at 600 MHz, which has a peak performance of 1.2 GFLOPS. Indeed, it would seem that hodgkin, with a peak performance of 600 MFLOPS per processor, should be faster. But the peak figures are deceptive. The pseudo-vector architecture of the SR2201 means that the achieved performance of tuned programs can be very much higher than that of workstations such as Alphas. Similarly, the advantages of hodgkin are not quite what they appear.

The comparisons which follow are intended to be representative only, and the data on non-HPCF machines are given in good faith but may be inaccurate. Corrections are welcome (support@hpcf.cam.ac.uk).


Stream - memory bandwidth

Stream (by Dr John McCalpin) is a benchmark intended to measure memory bandwidth. Some figures for various machines are given below (in MB/sec) for the four tests in the Stream suite. The results for babbage are for a single node, and those for mott and hodgkin for a single CPU. As babbage is a distributed memory machine, not a shared memory machine, it would arguably show perfect scaling if this benchmark were parallelised, and achieve around 100 GB/sec if all its nodes were used. The situation for hodgkin is less simple.

Machine                   copy   scale     add   triad
Cray Y/MP (one proc)      2430    2430    3450    3400
Cray J90 (one proc)       1440    1420    1310    1340
babbage, vectorised        777     736     831     824
SGI Octane, 175MHz         280     280     310     300
babbage, unvectorised      255     248     212     212
mott (~= hodgkin)          241     215     233     278
DEC PW433AU                207     206     226     228
DEC 250/5/266              141     140     151     136

(a conversion into MFLOPS can be attempted by dividing the "triad" column by 12)
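For reference, the four Stream kernels are no more than the following simple loops. This is a minimal C sketch, not the official benchmark source: the array length N is an arbitrary assumption chosen to defeat the caches, and the real code adds timing and result checking.

    #include <stddef.h>

    #define N 2000000              /* assumed array length, large enough to defeat caches */

    static double a[N], b[N], c[N];

    void stream_kernels(double q)
    {
        int i;
        for (i = 0; i < N; i++) c[i] = a[i];               /* copy:  16 bytes/iteration          */
        for (i = 0; i < N; i++) b[i] = q * c[i];           /* scale: 16 bytes/iteration          */
        for (i = 0; i < N; i++) c[i] = a[i] + b[i];        /* add:   24 bytes/iteration          */
        for (i = 0; i < N; i++) a[i] = b[i] + q * c[i];    /* triad: 24 bytes and 2 flops/iter   */
    }

The triad moves 24 bytes and performs 2 floating-point operations per iteration, which is where the factor of 12 above comes from.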

It would be possible to give the theoretical peak bandwidths but, as will be mentioned later, these are particularly deceptive for hodgkin.


Linpack 1000x1000 Fortran

The Fortran Linpack benchmark consists of solving a 1,000 x 1,000 system of simultaneous equations - some 2,500 x 2,500 results have been added to show scalability. For these tests unmodified Fortran source was used - vendors usually quote results from highly modified algorithms, occasionally with the kernels re-written in assembler. The exceptions are the entries marked NAG, which solve the problem using the NAG library and the vendors' BLAS. Again, the results are for a single node for babbage and for a single CPU for hodgkin.

Machine                  Theoretical    Linpack
                            (MFLOPS)   (MFLOPS)
hodgkin, NAG                     600        427
babbage, NAG                     300        165
hodgkin                          600        143
hodgkin, C                       600        135
babbage, inlined                 300         57
babbage, 2500^2                  300         57
babbage                          300         55
babbage, CV                      300         46
DEC PW433AU                      866         45
hodgkin, 2500^2                  600         44
DEC PW433AU, C                   866         42
babbage, C                       300         31
mott                             500         31
mott, C                          500         30
mott, 2500^2                     500         29
babbage, unvectorised            300         21
Pentium Pro 200                  200         21

The Linpack code is vaguely representative of the sort of codes used in scientific applications, and is written using the level 2 BLAS, which are shipped with the code. The above results show a considerable number of interesting effects.

The first is that using the NAG library to solve the equations is by far the fastest method - this is because it uses the vendors' level 3 BLAS and blocked algorithms, which give the best performance on cache-based systems. It shows what can be achieved, rather than what most users will achieve. It also shows why it is worth using NAG, when possible.
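As an illustration of the "let the library do the work" route, the sketch below calls LAPACK's dgesv from C. It stands in for the NAG routine actually used on the HPCF machines (whose name and interface are not given here); the matrix contents and the Fortran calling convention shown are assumptions for the sake of a self-contained example, not a recipe for either system.

    #include <stdio.h>
    #include <stdlib.h>

    /* Fortran LAPACK driver: solves A*x = b, overwriting b with the solution.
       Internally it is blocked and spends most of its time in the level 3 BLAS,
       which is why library solvers run so much faster on cache-based machines.
       Link against an LAPACK and BLAS implementation, e.g. -llapack -lblas.    */
    extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

    int main(void)
    {
        int n = 1000, nrhs = 1, info, i, j;
        double *a = malloc((size_t)n * n * sizeof *a);   /* column-major, as Fortran expects */
        double *b = malloc((size_t)n * sizeof *b);
        int *ipiv = malloc((size_t)n * sizeof *ipiv);

        /* Fill A and b with something solvable: diagonally dominant A, b = 1. */
        for (j = 0; j < n; j++) {
            for (i = 0; i < n; i++)
                a[i + j * n] = (i == j) ? (double)n : 1.0 / (1.0 + i + j);
            b[j] = 1.0;
        }

        dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
        printf("info = %d, x[0] = %g\n", info, b[0]);

        free(a); free(b); free(ipiv);
        return 0;
    }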

There is a vendor-optimised parallel version of Linpack for the SR2201, and babbage has achieved over 58 GFLOPS on this benchmark using 256 processors and working on a 69,120 x 69,120 matrix. This represents over 75% of the theoretical peak performance of 76.8 GFLOPS. There is probably a similar result for the Origin 2000, but we have not checked.

[The vendor-optimised LINPACK runs tend to use very different algorithms to solve the problem, and thus have a different (lesser) impact on the memory subsystem than the Fortran or C "standard" versions. Using the NAG library is the practical way of getting access to such optimisations.]

The second is that hodgkin is very much faster than babbage on a 1,000 x 1,000 problem, but mott is much slower (despite being nominally 70% of the speed of hodgkin). However, babbage maintains its speed on a 2,500 x 2,500 problem, whereas hodgkin's performance drops by a factor of three. This is simply because the first problem fits into hodgkin's cache (but not mott's) and the second does not - a 1,000 x 1,000 double-precision matrix occupies 8 MB, whereas a 2,500 x 2,500 one occupies 50 MB - and because the pseudo-vectorising feature of babbage enables it to maintain its speed on realistic sizes of problem, starting with simple Fortran code of the sort written by most scientists. In this, babbage is like vector supercomputers and unlike typical departmental workstations or hodgkin.

And, lastly, most workstations (and hodgkin) give C and Fortran performance separated by no more than 10%. This is not the case for babbage, as C is almost impossible to vectorise automatically. The C results quoted above were obtained with full vectorisation requested but no directives given in the code, and babbage partially vectorised only some of the loops in the timed section. The results of placing directives in the C are given as "CV".
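The underlying difficulty is easy to see in a fragment like the one below (a minimal sketch, not taken from the Linpack source). The compiler cannot prove that the two pointers refer to distinct arrays, so it must assume that a store through y might change values later read through x, which rules out automatic vectorisation; the Fortran equivalent carries no such ambiguity, and on babbage a directive is needed to give the C compiler the same guarantee (the exact directive syntax is not reproduced here).

    /* daxpy-like update, as it might appear in C.  The compiler has to allow
       for the possibility that x and y overlap (aliasing), so it cannot safely
       reorder or vectorise the loop without help from a directive.            */
    void update(int n, double alpha, double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }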

This simple benchmark shows up the most critical differences between the systems. To get good performance out of either system, it is essential to tune for that class of system. To a first approximation, babbage should be treated like a vector supercomputer, and hodgkin like a RISC workstation. Of course, the simplest and best solution is to call the NAG library and let that do the hard work!


Internode communication

Representative latencies and bandwidths for the various inter-node communications protocols possible on babbage and hodgkin are as follows:

Protocol                latency   bandwidth
                         (usec)    (MB/sec)
babbage, DMA                  5         280
babbage, MPI                 31         280
babbage, Express             26         104
babbage, PVM                 77          89
hodgkin, SHMEM              2-6         170
hodgkin, MPI                 15         170

For obvious reasons, we recommend the use of MPI on babbage, and have installed neither PVM nor Express on hodgkin. If a program's time is dominated by passing short messages (less than 16 KB on babbage, or 4 KB on hodgkin), then it can be tuned by using the underlying message passing mechanisms. All new message passing programs should be written using MPI, and converted only if necessary.
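Figures like those above come from simple ping-pong tests; a minimal MPI version is sketched below. The message size and repeat count are arbitrary choices, and a serious measurement would sweep a range of sizes and take the best of many repetitions.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1: rank 0 sends, rank 1 echoes the message
       back.  Half the round-trip time approximates the one-way latency for
       small messages and gives the bandwidth for large ones.                  */
    int main(int argc, char **argv)
    {
        int rank, i, nbytes = 1024, reps = 1000;   /* assumed message size and repeat count */
        char *buf;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = calloc(nbytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d bytes: %.1f usec one-way, %.1f MB/sec\n", nbytes,
                   (t1 - t0) / (2.0 * reps) * 1e6,
                   2.0 * reps * nbytes / (t1 - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }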

A discussion of the Origin 2000's hardware is needed here. Two processors share a system bus, and their inter-node access is through a single interconnect, but let us ignore that complication for now. The hardware interconnect is nominally 800 MB/sec, but hardware and software overheads reduce that to about 550 MB/sec. Unfortunately, there is a cache coherency problem which means that passed messages need to travel over the bus three times, and this accounts for the poor figures for hodgkin above.

Furthermore, as hodgkin is currently configured, the association between processors and the location of their memory is rather loose. This can cause the above performance to degrade if either a process or its data shares a node with an unrelated memory-bound process. On the other hand, the hardware is such that such fragmentation can also improve performance. It depends.

The available 'SHMEM' facilities will be described elsewhere, but there are several of them (often inappropriately named) and there is little information on how best to use them. SGI's recommended strategy for future programs is to use OpenMP, which is not a message passing interface. Because of the different way in which they are written, OpenMP programs can often obtain better performance than MPI ones, but it is virtually impossible to quote simple benchmarks for such use.
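To give a flavour of the OpenMP style, here is a minimal C sketch (most HPCF codes are Fortran, where the directives take the form of comments rather than pragmas); the loop bodies are arbitrary, and the point is simply that the work-sharing directives replace explicit message passing.

    #include <stdio.h>

    #define N 1000000

    /* Shared-memory parallelism in the OpenMP style: each "parallel for"
       directive splits its loop across the available threads, and no explicit
       message passing or data distribution appears in the source.             */
    int main(void)
    {
        static double a[N], b[N], c[N];
        double sum = 0.0;
        int i;

        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            b[i] = i;
            c[i] = 2.0 * i;
            a[i] = b[i] + c[i];
        }

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %g\n", sum);
        return 0;
    }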