HiPERiSM Consulting, LLC: HCTR-2004-5

1.0 INTRODUCTION

This is part of a series of reports on a project to evaluate industry-standard Fortran 90/95 compilers for IA-32 Linux™ commodity platforms. This report shows results of the STREAM benchmark, in a side-by-side comparison for each compiler, on the Intel™ Pentium 3 (P3) and Pentium 4 Xeon (P4) processors. The purpose of this benchmark in this application is to measure memory bandwidth on these commodity platforms. The results reported here are for the (tuned) parallel OpenMP version of STREAM. This report continues the previous report on the serial version, where more detail on the motivation for this study may be found.

2.0 CHOICE OF HARDWARE AND OPERATING SYSTEM

Results for the memory bandwidth in MB/second are compared for benchmarks compiled using three different Fortran compilers under the Linux™ operating system. For this report benchmarks were executed in parallel OpenMP mode on a dual processor Intel™ Pentium III (256KB L2 cache, Supermicro 370DL3 motherboard) and a dual processor Pentium 4 Xeon 3.06GHz (1MB L3 cache, Supermicro X5DPA-TGM motherboard). While the STREAM benchmark is designed to avoid compiler optimization differences, such differences were observed and are reported. For this reason results are presented both without and with compiler optimizations.

3.0 CHOICE OF COMPILERS

The choice of compilers for Linux™ IA-32 platforms now includes several vendor-supported products. The importance of this category is that vendor products have technical support and undergo continuous development with ports to new architectures as they arrive in the marketplace. The three compilers chosen in this survey support OpenMP directives and are described separately in the following sections.

3.1 Intel

The Intel Fortran Compiler version 8.0 targets both Intel IA-32 and IA-64 (Itanium) architectures, but only the former has been used in this project so far. Details on the compiler features are available at HiPERiSM Consulting, LLC’s URL. In this report this compiler is used with the Pentium 4 Xeon processor.

3.2 Lahey

The Lahey/Fujitsu Fortran 95 compiler (hereafter Lahey) for Linux™ is available from Lahey Computer Systems, Inc., (http://www.lahey.com). Release 6.2 for Linux was used with the Pentium 4 Xeon processor.

3.3 Portland

The pgf90™ Fortran compiler (Linux™ distribution) from the Portland Group (http://www.pgroup.com) was used in the CDK 4.0 release on HiPERiSM’s IA-32 Linux™ Pentium 3 cluster. Note that the CDK 5.1 release (not used here) may offer additional performance enhancement on the Pentium 4 Xeon processor.

4.0 CHOICE OF BENCHMARKS

4.1 Introduction

The STREAM (Sustainable Memory Bandwidth) benchmark is fully described at, and obtainable from, http://www.cs.virginia.edu/stream. It was developed and is maintained by John D. McCalpin, and this URL contains detailed descriptions, technical papers, and numerous results. In what follows only some salient features are outlined.

4.2 The STREAM benchmark and timing

The STREAM benchmark consists of multiple repetitions of the four kernels in Table 4.1, and the best result of (typically) ten trials is chosen for each kernel. Only the memory bandwidth is reported here (in units of MB/second); for the definition of how memory bandwidth is measured in STREAM, visit the above URL.

Table 4.1 Compute kernels of the STREAM benchmark (referenced by number in what follows).
No.	Name	Kernel	Bytes / iterate	Flops / iterate
1	COPY	a(i)=b(i)	16	0
2	SCALE	a(i)=q*b(i)	16	1
3	SUM	a(i)=b(i)+c(i)	24	1
4	TRIAD	a(i)=b(i)+q*c(i)	24	2
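As an illustration of Table 4.1, the four kernels might be written as OpenMP loops along the following lines. This is a minimal sketch and not the stream_tuned.f source: the module name and routine names are hypothetical, and the arrays are assumed to be 8-byte reals (consistent with the 16 and 24 bytes per iterate listed in the table). Kernel 3 is labeled SUM in Table 4.1 and Add in the results tables.

```fortran
! Illustrative sketch only: the four kernels of Table 4.1 as OpenMP loops.
! Module and routine names are hypothetical, not taken from stream_tuned.f.
module stream_kernels
  implicit none
  integer, parameter :: dp = selected_real_kind(14)  ! 8-byte real
contains

  subroutine copy_kernel(a, b)           ! kernel 1: a(i) = b(i)
    real(dp), intent(out) :: a(:)
    real(dp), intent(in)  :: b(:)
    integer :: i
    !$omp parallel do
    do i = 1, size(a)
      a(i) = b(i)
    end do
    !$omp end parallel do
  end subroutine copy_kernel

  subroutine scale_kernel(a, b, q)       ! kernel 2: a(i) = q*b(i)
    real(dp), intent(out) :: a(:)
    real(dp), intent(in)  :: b(:), q
    integer :: i
    !$omp parallel do
    do i = 1, size(a)
      a(i) = q*b(i)
    end do
    !$omp end parallel do
  end subroutine scale_kernel

  subroutine sum_kernel(a, b, c)         ! kernel 3: a(i) = b(i) + c(i)
    real(dp), intent(out) :: a(:)
    real(dp), intent(in)  :: b(:), c(:)
    integer :: i
    !$omp parallel do
    do i = 1, size(a)
      a(i) = b(i) + c(i)
    end do
    !$omp end parallel do
  end subroutine sum_kernel

  subroutine triad_kernel(a, b, c, q)    ! kernel 4: a(i) = b(i) + q*c(i)
    real(dp), intent(out) :: a(:)
    real(dp), intent(in)  :: b(:), c(:), q
    integer :: i
    !$omp parallel do
    do i = 1, size(a)
      a(i) = b(i) + q*c(i)
    end do
    !$omp end parallel do
  end subroutine triad_kernel

end module stream_kernels
```

Without an OpenMP switch the directives are treated as comments and the loops run serially, so the same source serves for the P=1 reference case.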

For this report the loop iterates from 1 to 20×10⁶ data points with unit stride, which should ensure that the data range exceeds the cache capacity. Also, the version of STREAM used here is a variant of the original stream_tuned.f code. A portable timing routine was added to record time intervals computed with calls to the Fortran 90/95 system_clock routine, as follows:

      function mysecond()
      ! Wall-clock time in seconds from the system_clock intrinsic
      integer, parameter :: b8 = selected_real_kind(14)
      real(b8) mysecond
      integer count, count_rate, count_max
      call system_clock(count, count_rate, count_max)
      mysecond = real(count, b8) / real(count_rate, b8)
      end function mysecond

Negligible differences resulted from use of this procedure when compared with the non-portable Fortran timing routines delivered with STREAM.
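To make the link between the timer and the reported MB/second figures concrete, the following sketch times the TRIAD kernel and converts the best time into a bandwidth. It is a hedged illustration and not the benchmark code used for this report: the names stream_timing, wall_seconds, and triad_mbps are hypothetical, and it assumes 24 bytes per iterate (Table 4.1) and 1 MB = 10⁶ bytes; the authoritative definition is the one given at the STREAM URL.

```fortran
! Illustrative sketch only; names are hypothetical.
module stream_timing
  implicit none
  integer, parameter :: dp = selected_real_kind(14)
contains

  ! Same idea as mysecond(): wall-clock seconds from system_clock.
  function wall_seconds() result(t)
    real(dp) :: t
    integer :: count, count_rate, count_max
    call system_clock(count, count_rate, count_max)
    t = real(count, dp) / real(count_rate, dp)
  end function wall_seconds

  ! Best TRIAD bandwidth (MB/second) over ntrials repetitions on n points.
  ! Returns 0 if the clock is too coarse to resolve any single trial.
  function triad_mbps(n, ntrials) result(best)
    integer, intent(in) :: n, ntrials
    real(dp) :: best
    real(dp), allocatable :: a(:), b(:), c(:)
    real(dp) :: q, t0, dt, tmin
    integer :: i, k

    allocate(a(n), b(n), c(n))
    a = 0.0_dp
    b = 1.0_dp
    c = 2.0_dp
    q = 3.0_dp
    tmin = huge(1.0_dp)

    do k = 1, ntrials
      t0 = wall_seconds()
      !$omp parallel do
      do i = 1, n
        a(i) = b(i) + q*c(i)
      end do
      !$omp end parallel do
      dt = wall_seconds() - t0
      if (dt > 0.0_dp) tmin = min(tmin, dt)   ! keep the shortest trial
    end do

    best = 0.0_dp
    ! Assumption: 24 bytes per iterate and 1 MB = 10**6 bytes.
    ! The check on a(n) also keeps the timed loop from being discarded.
    if (tmin < huge(1.0_dp) .and. a(n) == 7.0_dp) then
      best = 24.0_dp * real(n, dp) / tmin / 1.0e6_dp
    end if
    deallocate(a, b, c)
  end function triad_mbps

end module stream_timing
```

A call such as triad_mbps(20000000, 10) would correspond to the loop range and trial count quoted above. When the code is built with OpenMP enabled, the thread count P would normally be set through the standard OMP_NUM_THREADS environment variable.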

The compiler commands and the corresponding choice of command line switches are shown in Table 4.2 where switches without (column 2) and with (column 3) optimizations are distinguished.

Table 4.2 Compiler command and switches for the STREAM benchmark on the P3 and P4 processors.
Compiler and version	Compiler command and switches without optimization	Compiler command and switches with optimization
Intel 8.0 (P4)	ifort -tpp7 -O0 -b0 -unroll0 -r8 -FI -fpp -auto -openmp	ifc -tpp7 -xW -O3 -Ob2 -ipo -prefetch- -r8 -FI -fpp -auto -openmp
Lahey 6.2 (P4)	lf95 --O0 --tp4 --dbl --openmp --fix	lf95 --O2 --tp4 --sse2 --unroll --dbl --openmp --fix
Portland 4.0 (P3)	pgf90 -O0 -mp -r8	pgf90 -fast -Mvect=sse -mp -r8

The next section presents the results of the STREAM benchmark using the respective compilers without optimization and Section 6 presents results with optimization.

5.0 COMPARING BANDWIDTH RATES WITHOUT COMPILER OPTIMIZATION

5.1 STREAM results with no optimizations

This section reports results with three compilers for the STREAM benchmarks with no compiler optimizations, using the compiler switches shown in the second column of Table 4.2. Tables 5.1 (Pentium 3), 5.2, and 5.3 (Pentium 4) show the numerical values, and Figures 1 (Pentium 3), 2, and 3 (Pentium 4) show these memory rates as bar charts.

Table 5.1 Memory bandwidth (MB/second) for the STREAM benchmarks with the Portland compiler on the Pentium III (933 MHz) without optimization for the number of OpenMP threads (P) shown.
Kernel	P=1	P=2
Copy	253.8	377.8
Scale	250.8	356.4
Add	357.1	405.4
Triad	347.6	410.3

Table 5.2 Memory bandwidth (MB/second) for the STREAM benchmarks with the Intel compiler on the Pentium 4 Xeon (3.06 GHz, 1MB L3 cache) without optimization for the number of OpenMP threads (P) shown.
Kernel	P=1	P=2	P=4
Copy	1410.9	1438.2	1102.7
Scale	1384.1	1428.6	1149.0
Add	1701.5	1624.4	1214.3
Triad	1683.6	1624.9	1329.3

Table 5.3 Memory bandwidth (MB/second) for the STREAM benchmarks with the Lahey compiler on the Pentium 4 Xeon (3.06 GHz, 1MB L3 cache) without optimization for the number of OpenMP threads (P) shown.
Kernel	P=1	P=2	P=4
Copy	1280.0	1264.8	1032.3
Scale	1300.8	1245.1	1009.5
Add	1643.8	1543.4	1403.5
Triad	1638.2	1538.5	1176.5

Surprisingly, for the P3 the P=2 thread result shows an increased memory bandwidth with no compiler optimization, whereas with optimization enabled it shows the expected decrease (Figure 1). For the P4 there are only small changes when P=1 and P=2 results are compared. However, the P=4 thread case on the P4 shows a distinct reduction in memory bandwidth relative to P=1: as much as 29% for kernel 3 with the Intel compiler, and 28% for kernel 4 with the Lahey compiler. To show this relative reduction more clearly, Figure 4 shows the P=2 and P=4 values normalized to the P=1 result. Note that the dual Xeon processors have hyper-threading technology, so each can support two OpenMP threads, but at the cost of some lost memory bandwidth.
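The quoted percentages follow directly from Tables 5.2 and 5.3. As a small sketch of the arithmetic (the module and function names are hypothetical), the normalized ratio and the corresponding percentage loss relative to the P=1 value can be computed as follows.

```fortran
! Illustrative sketch only; names are hypothetical.
module bandwidth_ratio
  implicit none
  integer, parameter :: dp = selected_real_kind(14)
contains

  ! Bandwidth for P threads normalized to the P=1 value.
  elemental function ratio_to_serial(bw_p, bw_1) result(r)
    real(dp), intent(in) :: bw_p, bw_1
    real(dp) :: r
    r = bw_p / bw_1
  end function ratio_to_serial

  ! Percentage of the P=1 bandwidth lost when running with P threads.
  elemental function percent_loss(bw_p, bw_1) result(loss)
    real(dp), intent(in) :: bw_p, bw_1
    real(dp) :: loss
    loss = 100.0_dp * (1.0_dp - bw_p / bw_1)
  end function percent_loss

end module bandwidth_ratio
```

For example, percent_loss(1214.3_dp, 1701.5_dp) evaluates to about 28.6 (Intel, kernel 3, P=4 against P=1 in Table 5.2), and percent_loss(1176.5_dp, 1638.2_dp) to about 28.2 (Lahey, kernel 4, Table 5.3), which round to the 29% and 28% quoted above.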

Fig. 1 Memory bandwidth for the Portland compiler with the STREAM benchmark (without and with optimization) on the Pentium 3.

Fig. 2 Memory bandwidth of the Intel compiler for the STREAM benchmarks (without optimization) on the Pentium 4 Xeon.

Fig. 3 Memory bandwidth of the Lahey compiler for the STREAM benchmarks (without optimization) on the Pentium 4 Xeon.

Fig. 4 Ratio of memory bandwidth for P>1 threads to those for P=1 with Intel and Lahey compilers on the Pentium 4 Xeon for the STREAM benchmark (without optimization).

5.2 STREAM statistics for no optimizations

Figures 5 and 6 show the basic statistics (averages and dispersion) over P=1, 2, and 4 OpenMP threads. The mean for the Lahey compiler tends to fall below that for the Intel compiler, and the dispersion varies from kernel to kernel for both. However, the standard deviation is seen to be small for all four kernels with both compilers.
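The report does not state how the dispersion was computed. As one possible reading, the sketch below takes the mean and the sample standard deviation (divisor n-1) of one table row over the three thread counts; both the choice of estimator and the names used are assumptions made for illustration.

```fortran
! Illustrative sketch only; names and the n-1 divisor are assumptions.
module thread_statistics
  implicit none
  integer, parameter :: dp = selected_real_kind(14)
contains

  ! Mean and sample standard deviation of the values in x.
  subroutine mean_and_std(x, mean, std)
    real(dp), intent(in)  :: x(:)
    real(dp), intent(out) :: mean, std
    integer :: n
    n = size(x)
    mean = sum(x) / real(n, dp)
    if (n > 1) then
      std = sqrt(sum((x - mean)**2) / real(n - 1, dp))
    else
      std = 0.0_dp          ! a single value has no spread
    end if
  end subroutine mean_and_std

end module thread_statistics
```

Applied to the Copy row of Table 5.2 (1410.9, 1438.2, 1102.7), this gives a mean of about 1317.3 MB/second and, under the sample-estimator assumption, a standard deviation of about 186 MB/second.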

Fig. 5 Memory bandwidth statistics for the Intel compiler with P=1, 2, and 4 threads using the STREAM benchmark (without optimization) on the Pentium 4 Xeon.

Fig. 6 Memory bandwidth statistics for the Lahey compiler with P=1, 2, and 4 threads using the STREAM benchmark (without optimization) on the Pentium 4 Xeon.

6.0 COMPARING BANDWIDTH RATES WITH COMPILER OPTIMIZATION

6.1 STREAM results with optimizations

This section reports results with three compilers for the STREAM benchmarks with compiler optimizations, using the compiler switches shown in the third column of Table 4.2. Tables 6.1 (Pentium 3), 6.2, and 6.3 (Pentium 4) show the numerical values, and Figures 1 (Pentium 3), 7, and 8 (Pentium 4) show these memory rates as bar charts.

Table 6.1 Memory bandwidth (MB/second) for the STREAM benchmarks with the Portland compiler on the Pentium III (933 MHz) with optimization for the number of OpenMP threads (P) shown.
Kernel	P=1	P=2
Copy	425.5	403.0
Scale	417.8	381.9
Add	511.2	433.6
Triad	509.6	434.0

Table 6.2 Memory bandwidth (MB/second) for the STREAM benchmarks with the Intel compiler on the Pentium 4 Xeon (3.06 GHz, 1MB L3 cache) with optimization for the number of OpenMP threads (P) shown.
Kernel	P=1	P=2	P=4
Copy	1313.6	1258.9	1144.1
Scale	1327.3	1237.0	1066.7
Add	1682.4	1507.1	1331.1
Triad	1685.3	1534.5	1196.4

Table 6.3 Memory bandwidth (MB/second) for the STREAM benchmarks with the Lahey compiler on the Pentium 4 Xeon (3.06 GHz, 1MB L3 cache) with optimization for the number of OpenMP threads (P) shown.
Kernel	P=1	P=2	P=4
Copy	1355.9	1264.8	1032.3
Scale	1367.5	1240.3	1015.9
Add	1696.1	1514.2	1188.1
Triad	1684.2	1538.5	1191.1

As mentioned above, the P=2 result for the P3 shows the expected decrease in memory bandwidth with optimization (Figure 1). For the P4 there are significant reductions when P=1 and P=2 results are compared (particularly for kernels 3 and 4). The P=4 thread case shows a loss of bandwidth in parallel mode of as much as 29% for kernel 4 with both compilers. To show this relative reduction more clearly, Figure 9 shows the P=2 and P=4 values normalized to the P=1 result. Since code is generally compiled with optimizations, a progressively increasing loss of bandwidth is to be expected as the thread count increases. That this is also true of the P3 is demonstrated in Figure 10.

There is one interesting observation to be made on the behavior of the Intel compiler with optimizations enabled. If Figure 7 here is compared with the Intel compiler results shown in Figure 8 of the previous report for the serial STREAM benchmark, the exceptionally large memory bandwidth reported there is not evident here. The only difference is the addition of the -auto and -openmp compiler switches, and it can only be assumed that when these are enabled the Intel compiler allocates hardware resources differently from the serial case, with a consequent reduction in bandwidth.

Fig. 7 Memory bandwidth of the Intel compiler for the STREAM benchmarks (with optimization) on the Pentium 4 Xeon.

Fig. 8 Memory bandwidth of the Lahey compiler for the STREAM benchmarks (with optimization) on the Pentium 4 Xeon.

Fig. 9 Ratio of memory bandwidth for P>1 threads to those for P=1 with Intel and Lahey compilers on the Pentium 4 Xeon for the STREAM benchmark (with optimization).

Fig. 10 Ratio of memory bandwidth of the Portland compiler on the Pentium 3 for results without and with compiler optimization for the STREAM benchmark.

6.2 STREAM statistics for optimizations

Figures 11 and 12 show the basic statistics for averages and dispersion over P=1, 2, and 4 OpenMP threads. The mean for both compilers is similar but variability is larger for the Lahey compiler with kernels 1-3.

Fig. 11 Memory bandwidth statistics for the Intel compiler with P=1, 2, and 4 threads using the STREAM benchmark (with optimization) on the Pentium 4 Xeon.

Fig. 12 Memory bandwidth statistics for the Lahey compiler with P=1, 2, and 4 threads using the STREAM benchmark (with optimization) on the Pentium 4 Xeon.

7.0 CONCLUSIONS

This report presented performance results of three Fortran compilers in the IA-32 environment for the parallel STREAM memory bandwidth metric. The measurements were performed without and with compiler optimizations on Intel Pentium 3 and Pentium 4 Xeon processors used in parallel mode on dual processor motherboards.

As observed in the previous report, the P4 Xeon architecture can deliver a four-fold improvement in bandwidth compared to the P3. However, when more than one process (or thread) executes, the overall conclusion is that memory bandwidth is, in general, significantly decreased. The observed reductions are as large as 29% in some cases, and increase progressively as threads are added and compiler optimizations are enabled.

The parallel STREAM benchmark was valuable as a quantitative metric in addressing the memory bandwidth issues raised in the introduction of the previous report. However, as a caution, we note that for “real-world” multiprocessor codes, memory bandwidth contention has been observed to have more severe consequences than the results reported here would suggest.