HCTR-2011-1: BANDWIDTH BENCHMARKS FOR Intel® AND AMD® PROCESSORS
George Delic, HiPERiSM Consulting, LLC.
1.0 BANDWIDTH BENCHMARK

1.1 The b_eff code

HiPERiSM has used the b_eff bandwidth benchmark for many years. The source code and its description are available at https://fs.hlrs.de/projects/par/mpi/b_eff/. The same site lists measurements across many HPC platforms, which makes b_eff useful as a stable metric for comparing old and new architectures. Its main purpose is to test interconnects on clusters, but here it is used to compare developments in on-node bandwidth for several nodes at HiPERiSM (a minimal sketch of this kind of measurement follows Table 1.1).

1.2 Hardware test beds

The hardware platforms for this bandwidth testing exercise are those installed at HiPERiSM, as listed in Table 1.1. The goal is to compare some of the latest multi-core nodes with the legacy NumaLink® interconnect of the SGI Altix®, and with each other, in view of the claims of peak theoretical bandwidth for the newer CPU technology from Intel and Advanced Micro Devices (AMD). The Itanium platform is a four-node cluster with two single-core CPUs per node, whereas each of the other platforms is a single node with all CPUs on one motherboard sharing a bus architecture.

Table 1.1. Configuration and specification information for the Intel Itanium2®, first- and second-generation Intel quad-core processor, and AMD 12-core processor platforms.
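To illustrate the kind of measurement involved, the following minimal MPI ping-pong sketch times repeated message exchanges between two ranks and converts the elapsed time to a transfer rate. It is not the b_eff code itself: the full benchmark averages over several communication patterns and a range of message sizes, whereas the message size and repetition count here are arbitrary values chosen for illustration.

/* Minimal MPI ping-pong bandwidth sketch (illustration only; not b_eff). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;   /* 1 MByte message (arbitrary choice) */
    const int reps   = 100;       /* repetitions (arbitrary choice) */
    int rank, size;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Abort(MPI_COMM_WORLD, 1); }

    buf = malloc(nbytes);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        /* Each repetition moves nbytes in each direction (two messages). */
        double mbytes_per_s = (2.0 * reps * nbytes) / (t1 - t0) / 1.0e6;
        printf("ping-pong bandwidth: %.1f MBytes/s\n", mbytes_per_s);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}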
2.0 COMPILING THE BENCHMARK

To compile the b_eff.c code, the Portland PGCC compiler was used on the x86_64 nodes and the gcc compiler was used on the Itanium® platform. Various values of the MEMORY_PER_PROCESSOR parameter were tried, but most results reported here are for a value of 3072 MBytes. The number of MPI processes was varied up to the maximum core count of the respective node. The mpirun command was run with the -all-local switch to confine execution to a single node.
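For concreteness, the build and launch steps described above might look like the following sketch. It assumes an mpicc wrapper configured to use pgcc (or gcc on the Itanium node), that MEMORY_PER_PROCESSOR is supplied as a preprocessor define in MBytes, and an MPICH-style mpirun; the exact mechanism and units should be checked against the b_eff source and the installed MPI library.

# Build b_eff (assumed: mpicc wrapper over pgcc; MEMORY_PER_PROCESSOR in MBytes)
mpicc -O2 -DMEMORY_PER_PROCESSOR=3072 -o b_eff b_eff.c -lm

# Launch on the local node only, here with 8 MPI processes
mpirun -all-local -np 8 ./b_eff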
3.0 RESULTS OF BENCHMARKS

3.1 Four platforms up to 8 cores each

Preliminary results for up to 8 MPI processes on each of four nodes are summarized in Table 3.1 and Fig. 3.1. The vendor CPU numbers are those shown for the corresponding nodes in Table 1.1.

Table 3.1. Bandwidth in MBytes per second for four separate cluster nodes with the numbers of processors and cores listed in Table 1.1.

Fig 3.1: Effective bandwidth scaling with MPI process count for four separate cluster nodes with the numbers of processors and cores listed in Table 1.1.

It is notable that while the first-generation Intel quad-core CPU (X5450) closely tracks the Itanium2 results with increasing MPI process count, the second-generation Intel Nehalem CPU (W5590) approximately doubles the effective bandwidth over this range. The AMD 12-core CPU (6176SE) falls between the two Intel results. In all cases the bandwidth rises with MPI process count, but the rise is steepest for the Intel Nehalem platform between 4 and 8 MPI processes.

3.2 AMD versus Intel platforms

Table 3.2 and Fig. 3.2 compare the Intel Nehalem quad-core and AMD 12-core nodes. Although the Intel Nehalem shows impressive bandwidth scaling, the AMD platform matches it with a little less than double the number of MPI processes.

Table 3.2. Bandwidth in MBytes per second for the Intel Nehalem quad-core and AMD 12-core platform nodes.
Fig 3.2: Effective bandwidth scaling with MPI process count on the quad-core Intel and 12-core AMD cluster nodes listed in Table 1.1.

The notable features are that, above 4 MPI processes, (a) the Nehalem platform shows a distinct increase in slope and (b) the AMD platform shows linear scaling up to half its total core count of 24, followed by a gentle roll-off above this point.

3.3 Ping-pong latency for AMD versus Intel platforms

The b_eff benchmark also reports the ping-pong latency, and selected values are reported here for the quad-core Intel Nehalem and 12-core AMD platforms listed in Table 1.1. Fig. 3.3 shows results for up to 12 MPI processes. In both cases latency climbs significantly above 4 MPI processes, a result that probably depends on which cores and CPUs the processes occupy. The other notable feature is the lower latency of the AMD platform for 6 or fewer MPI processes. However, since most real-world applications use more than 4 MPI processes, the difference between the two CPU types will depend on the nature of the application.
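One way to test this conjecture about core placement is to have each MPI rank report the core it is currently running on, for example with the Linux-specific sched_getcpu() call, as in the sketch below (not part of b_eff).

/* Report which core each MPI rank runs on (Linux-specific sketch; not part of b_eff). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    /* sched_getcpu() returns the core the calling thread is currently on;
       without explicit pinning this can change during a run. */
    printf("rank %d on host %s, core %d\n", rank, host, sched_getcpu());
    MPI_Finalize();
    return 0;
}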
Fig 3.3: Ping-pong latency in microseconds on the quad-core Intel and 12-core AMD cluster nodes listed in Table 1.1.

6.0 CONCLUSIONS

Exploratory benchmark measurements confirm the impressive effective bandwidth and latency results now available on commodity cluster nodes. Not so long ago such performance was possible only on selected proprietary HPC architectures; it now appears that exceptional performance is available in commodity environments. Actual performance of these commodity solutions in real-world applications will vary, and results for specific benchmarks and Air Quality Models (AQM) are the subject of subsequent reports.