Hiperism Consulting, LLC: HCTR-2011-1

1.0 BANDWIDTH BENCHMARK

1.1 The b_eff code

HiPERiSM has used the b_eff bandwidth benchmark over many years. The source code and its description is available at https://fs.hlrs.de/projects/par/mpi/b_eff/. The same site lists many measurements across various HPC platforms and is useful as a stable metric for use with old and new architectures. It main purpose is to test interconnects on clusters, but here it is used to compare the developments in on-node bandwidth for several nodes at HiPERiSM.

1.2 Hardware test beds

The hardware platforms for this bandwidth testing exercise are those installed at HiPERiSM as listed in Table 1.1. Of interest here is to compare some of the latest multi-core nodes with the legacy NumaLink® interconnect of the SGI Altix® and with each other in view of the claims of peak theoretical bandwidth in the newer CPU technology from Intel and Advanced Micro Devices (AMD). The Itanium platform is a four node cluster with two single core CPUs per node, whereas all other nodes are CPUs on a single mother board sharing a bus architecture.

Table 1.1. Configuration and specification information for the Intel Itanium2®, first/second generation quad-core processor, and AMD 12-core processor platforms.

Node name	node100	node17	node18	node19
Platform	Itanium	quad-core 1	quad-core 2	12 core
Operating system	SuSE Linux 10.17	SuSE Linux 11.1	SuSE Linux 11.1	SuSE Linux 11.3
Processor	Intel™ IA64 (107W)	Intel™ IA32 (X5450)	Intel™ IA32 (W5590)	AMD™ (6176SE)
Processor count	8	2	2	4
Cores per processor	1	4	4	12
Core count	8	8	8	48
Clock	1.5GHz	3.0GHz	3.33GHz	2.3GHz
Bandwidth⁽¹⁾	6.4GB/sec	10.6GB/sec	64.0 GB/sec	42.7 GB/sec
Bus speed	400 MHz	1333 MHz	1333 MHz⁽²⁾	1333 MHz
L1 cache	32KB	64 KB	64 KB	64 KB
L2 cache⁽³⁾	1 MB	12MB⁽⁴⁾	256MB	512K⁽⁵⁾
L3 cache⁽⁶⁾	4MB	NA	8MB	12MB
(1) Theoretical maximum per CPU. (2) Value for one DDR3 DIMM per each of three channels per processor (This value drops with more DIMMs per channel). (3) For each of Data and Instruction cache. (4) Intel's first generation of Quadcore CPUs shared L2 cache between cores. (5) Per core. (6) Per socket

2.0 COMPILING THE BENCHMARK

To compile the b_eff.c code, the Portland PGCC compiler was used on the x86_64 nodes and the gcc compiler was used on the Itanium® platform. Various values of the MEMORY_PER_PROCESSOR parameter were tried, but most results reported here were for a value of 3072 MBytes. The number of processors varied up to the maximum core count for the respective nodes. The mpirun command was run with the -all-local switch to contain executions on-node.

3.0 RESULTS OF BENCHMARKS

3.1 Four platforms up to 8 cores each

Preliminary results for up to 8 MPI processes using each of four nodes are summarized in Table 3.1 and Fig. 3.1. The vendor CPU numbers are as shown for the corresponding nodes in Table 1.1.

Table 3.1. Bandwidth in MBytes per second for four separate cluster nodes for the number of processors and cores listed in Table 1.1.

Number of MPI processes	Intel Itanium2	Intel X5450	Intel W5590	AMD 6176SE
2	489	564	1038	513
4	791	715	1211	898
6		877	1688	1103
8	1191	1027	2165	1380

Fig 3.1: Effective bandwidth scaling with MPI process count with four separate cluster nodes for the number of processors and cores listed in Table 1.1.

It is notable that while the first generation Intel Quad core CPU (X5450) closely tracks the Itanium2 results with increasing MPI process count, the measured bandwidth of the second generation Intel Nehalem CPU (W5590) approximately doubles the effective bandwidth in this range. The AMD 12-core CPU (6176SE) falls between the two Intel processor results. In all cases the bandwidth rises with MPI process count, but is steepest for the Intel Nehalem platform between 4 and 8 MPI processes.

3.2 AMD versus Intel platforms

Table 3.2 and Fig. 3.2 compare the Intel Nehalem quad core and the AMD 12 core nodes. Although the Intel Nehalem has impressive bandwidth scaling this is matched by the AMD platform with a little less than double the number of processes.

Table 3.2. Bandwidth in MBytes per second for Intel Nehalem quad core and AMD 12 core platform nodes.

Process count	2	4	6	8	10	12	14	16	24	32	40	48
Intel W5590	1038	1211	1688	2165
AMD 6176SE	513	898	1103	1380	1595	1890	2170	2386	3409	4139	4579	4919

Fig 3.2: Effective bandwidth scaling with MPI process count on quad core Intel and 12 core AMD cluster nodes listed in Table 1.1.

The notable features are that above 4 MPI processes (a) the Nehalem platform shows a distinct increase in slope and (b) the AMD platform shows linear scaling up to half the total core count (24), followed by a gentle bending over above this point.

3.3 Ping-pong Latency for AMD versus Intel platforms

The b_eff benchmark also reports the ping-pong latency and selected values are reported here for the quad core Intel Nehalem and 12 core AMD platforms listed in Table 1.1. Fig. 3.3 shows results up to 12 MPI processes. In both cases latency climbs significantly above 4 MPI processes and this result is probably due to dependency on which cores and CPUs are occupied by them. The other notable feature is the lower latency for the AMD platform for 6 or fewer MPI processes. However, since most real-world applications would use more than 4 MPI processes the differences between the two CPU types will depend on the nature of the application.

Fig 3.3: Ping-pong latency in microseconds on quad core Intel and 12 core AMD cluster nodes listed in Table 1.1.

6.0 CONCLUSIONS

Exploratory benchmark measurements confirm the impressive effective bandwidth and latency results that are now available for commodity cluster nodes. Not so long ago such performance was only possible on selected proprietary HPC architectures. Now it appears that exceptional performance is available in commodity environments. Actual performance of these commodity solutions in real-world applications will vary and results for specific bench marks and Air Quality Models (AQM) are the subject of subsequent reports.