HiPERiSM Consulting, LLC: HCTR-2004-3

1.0 INTRODUCTION

This is part of a series of reports on a project to evaluate industry-standard Fortran 90/95 compilers for IA-32 Linux™ commodity platforms. This report presents results for the Princeton Ocean Model on the Intel™ Pentium 3 (P3) and Pentium 4 Xeon (P4) processors, in a side-by-side comparison for each compiler.

2.0 CHOICE OF HARDWARE AND OPERATING SYSTEM

Wall clock times are compared for benchmarks compiled with four different Fortran compilers under the Linux™ operating system, one of which was also run under Windows 2000 (because the Linux™ version was not yet installed). For this project benchmarks were executed in serial mode on a dual-processor Intel™ Pentium III (256KB L2 cache) and a dual-processor Pentium 4 Xeon 3.06GHz (1MB L3 cache). These architectures offer Streaming Single-Instruction-Multiple-Data Extensions (SSE), with version 2 (SSE2) on the Xeon. SSE enables vectorization of loops, so that a single instruction operates on multiple elements of a data set. Where a compiler provides a specific switch to enable SSE/SSE2, that switch has been tested.

3.0 CHOICE OF COMPILERS

The choice of compilers for Linux™ IA-32 platforms now includes several vendor-supported products. The importance of this category is that vendor products have technical support and undergo continuous development with ports to new architectures as they arrive in the marketplace. The four compilers chosen in this survey are described separately in the following sections and compiler switches used in the benchmarks are also discussed. However, it is noted here that while all compilers offer a switch to target the Pentium 4, only three (Intel, Lahey, and Portland) offer a specific SSE/SSE2 option (see also notes below).

3.1 Absoft

Absoft f77 and f90/f95 are the Fortran compilers included in the Absoft Pro Fortran™ 8.0 package for Linux™ offered by the Absoft Corporation (http://www.absoft.com). The f90/f95 version has a Cray front-end and resulted from a five-year collaboration with Cray Research. With this compiler, the -O3 switch enables automatic architecture detection and selection of the Pentium 3 or Pentium 4 instruction set.

3.2 Intel

The Intel Fortran Compiler version 8.0 targets both Intel IA-32 and IA-64 (Itanium) architectures, but only the former has been used in this project so far. Details on the compiler features are available at HiPERiSM Consulting, LLC’s URL. Code for target architectures is generated with either the -tpp6 (Pentium 3) or -tpp7 (Pentium 4) switch.

3.3 Lahey

The Lahey/Fujitsu Fortran 95 compiler (hereafter Lahey) for Linux™ is available from Lahey Computer Systems, Inc. (http://www.lahey.com). The Express version 5.6 for Microsoft Windows 2000™ was used on the Pentium 3 because it was available from another project for the same hardware. With this compiler, the -tpp switch enables automatic architecture detection for the P3 only. However, releases v7.1 (for Windows) and v6.2 (for Linux) support the compiler switches -tp4 and -sse2 (written --tp4 and --sse2 on Linux, as in Table 5.1) to target the Pentium 4 Xeon and the SSE2 instruction set. The v6.2 release and the new switches are studied in this report.

3.4 Portland

The pgf90™ Fortran compiler (Linux™ distribution) from the Portland Group (http://www.pgroup.com) was used in the CDK 4.0 release, where it supports OpenMP, MPI and OpenMP+MPI parallel applications on HiPERiSM’s IA-32 Linux™ cluster. With this compiler, the -fast switch enables automatic architecture detection. Note that the CDK 5.1 release (not used here) may offer additional performance enhancement on the Pentium 4 Xeon processor through the use of SSE2 options.

3.5 Portability and migration issues

Portability issues come up when legacy Fortran code needs to be compiled. In this respect a compiler that allows extensions to the f90/f95 standard can save time and effort. The two compilers that offer the widest scope in portability are those from Absoft and Portland. Compilers from Lahey and Intel are less forgiving of such extensions.

Here we also mention some migration issues that came up with compiler and architecture changes. The change in architecture from P3 to P4 Xeon also involves changes in library versions. As a result, two of the compilers had to either be upgraded or have patches applied. Installing the Absoft 8.0 compiler for the Xeon processor and the newer Linux kernel requires downloading and applying two patch files to resolve glibc version issues (these patch files are available from the Absoft URL given in Section 3.1). Likewise, an attempt was made to install the 7.1 release of the Intel Fortran compiler on the P4 Xeon, but version skew with glibc again made installing the 8.0 release the simpler option. Whenever the version of a compiler is changed, performance is also expected to change. This is especially true of the Intel compiler, since major performance improvements are announced with the 8.0 release. Therefore, the changes in performance reported here for the Intel compiler are due to improvements in the compiler technology as well as the change in architecture.

4.0 CHOICE OF BENCHMARKS

4.1 Introduction

The benchmark used here is the Princeton Ocean Model (POM) algorithm, which has been executed on a wide variety of platforms. The serial version is used to study how a compiler and architecture interact for a real-world model that was optimized for performance on vector register machines. A fuller discussion of the POM (in an MPI version) is available at http://www.hiperism.com/hc_6_10v30.htm. What follows introduces only the essentials of the cases studied here.

4.2 Princeton Ocean Model Algorithm

The Princeton Ocean Model (POM) is a legacy Fortran 77 code with compute kernels consisting of over five hundred vectorizable loops. Typically these are triple-nested loops (i,j,k) that perform operations over a three-dimensional finite difference grid. The vertical zones over the k range form the outermost loop in the nest. The number of iterations varies with the choice of data set as shown in Table 4.1. For the choices shown here, the k range is nearly constant (15 or 16) while the ranges of the two inner loops grow substantially; the overall problem size scaling is shown in the last column. The inner loop structure is conventional and this code should present compilers with good prospects for vectorization. Two important features of the POM should be noted: (a) the algorithm is unstable in single precision arithmetic and therefore double precision is used for all compilers, and (b) long integers are required, and this option must be specifically requested with the Lahey compiler, which otherwise produces run-time errors.
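The loop ordering described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not taken from the POM source; the function name centered_difference_x and the centered-difference stencil are hypothetical stand-ins for a typical finite difference kernel. It shows only the nest structure, with k outermost and i innermost.

```python
def centered_difference_x(field, i_max, j_max, k_max):
    """Centered difference along i on interior points.

    Hypothetical kernel for illustration only; field is indexed as
    field[k][j][i] so that the innermost index i is contiguous.
    """
    out = [[[0.0] * i_max for _ in range(j_max)] for _ in range(k_max)]
    for k in range(k_max):                 # vertical zones: outermost loop
        for j in range(j_max):             # middle loop
            for i in range(1, i_max - 1):  # innermost loop, vectorization candidate
                out[k][j][i] = 0.5 * (field[k][j][i + 1] - field[k][j][i - 1])
    return out


# Small usage example: a field that grows linearly in i.
i_max, j_max, k_max = 6, 3, 2
field = [[[float(i) for i in range(i_max)] for _ in range(j_max)]
         for _ in range(k_max)]
result = centered_difference_x(field, i_max, j_max, k_max)
```

In a Fortran code the same nest would normally run over an array a(i,j,k), where the first index is the contiguous one, which is why the i loop is the natural target for SSE/SSE2 vectorization.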

Table 4.1 Problem sizes and scaling for the POM algorithm.
GRID	i_max	j_max	k_max	Scaling
1	100	40	15	1
2	128	128	16	4.37
3	256	256	16	17.47
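As a cross-check on the last column, the following sketch assumes that "Scaling" is the ratio of total grid points (i_max × j_max × k_max) to that of GRID 1. This interpretation is an assumption, since the report does not define the column explicitly; the names GRIDS, grid_points and scaling are illustrative.

```python
# Grid dimensions (i_max, j_max, k_max) copied from Table 4.1.
GRIDS = {1: (100, 40, 15), 2: (128, 128, 16), 3: (256, 256, 16)}


def grid_points(grid):
    """Total number of grid points for the given GRID number."""
    i_max, j_max, k_max = GRIDS[grid]
    return i_max * j_max * k_max


def scaling(grid, reference=1):
    """Problem size relative to the reference grid (assumed definition)."""
    return grid_points(grid) / grid_points(reference)


for g in sorted(GRIDS):
    print(g, grid_points(g), f"{scaling(g):.3f}")
```

Under this assumption the ratios are approximately 4.369 and 17.476, which agree with the tabulated 4.37 and 17.47 to within the last printed digit.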

5.0 COMPARING EXECUTION TIMES

The following sections summarize execution time with four compilers for the POM algorithm with the three data sets of Table 4.1 (GRID 1 to 3).

5.1 Timing performance

Whole-code execution time was measured with calls to the Fortran 90/95 system_clock routine for all compilers, as this was deemed to be the most portable and accurate timing method.
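The timing pattern can be sketched as follows. This is a Python analogue, not the Fortran code used in the benchmarks: the helper system_clock below is a hypothetical wrapper around Python's time.perf_counter_ns that loosely mirrors the (count, count_rate) pair returned by the Fortran intrinsic, and dummy_workload is a placeholder for the model run.

```python
import time


def system_clock():
    """Return (count, count_rate): ticks and ticks per second.

    Hypothetical stand-in for the Fortran intrinsic of the same name.
    """
    return time.perf_counter_ns(), 1_000_000_000


def timed(workload):
    """Run workload() once and return (result, elapsed wall clock seconds)."""
    start, rate = system_clock()
    result = workload()
    stop, _ = system_clock()
    return result, (stop - start) / rate


def dummy_workload(n=100_000):
    """Placeholder for the whole-code run being timed."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


result, seconds = timed(dummy_workload)
print(f"elapsed wall clock time: {seconds:.6f} s")
```

The elapsed time is the difference of two counter readings divided by the counter rate. In Fortran the counter can wrap around at count_max, a case this sketch does not need to handle.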

5.2 Princeton Ocean Model results

For the POM algorithm the choice of compiler switches is summarized in Table 5.1. Note the use of the target architecture switches (often these are implicit in the optimization level). Timing results (without SSE enabled) are shown in Tables 5.2 (Pentium 3) and 5.3 (Pentium 4). Figures 1 and 2, for Pentium 3 and Pentium 4 respectively, show these times as bar charts. For the largest problem size the Lahey compiler is noticeably less efficient than the others and this is due to the requirement of the --long option for large integers.

Table 5.1 Compiler command and switches for the POM algorithm on the P3 and P4 Xeon processors.

Compiler and version	Compiler command and selected switches	Effect of switches
Absoft 8.0 (P3)	f90 -s -cpu:p6 -O3 -N113 -ffixed	Optimize for P3 or P4 Xeon target
Absoft 8.0 (P4)	f90 -s -cpu:p7 -O3 -N113 -ffixed	Optimize for P3 or P4 Xeon target
Intel 7.1 (P3)	ifc -O3 -r8 -tpp6 -FI	Optimize for P3
Intel 7.1 (P3)	ifc -O3 -r8 -xK -tpp6 -FI	Vectorize and enable SSE.
Intel 8.0 (P4)	ifort -fast -r8 -tpp7 -FI	Optimize for P4 Xeon target.
Intel 8.0 (P4)	ifort -fast -r8 -xW -tpp7 -FI	Vectorize and enable SSE2.
Lahey 5.6 (P3)	lf95 -long -tpp -fix -dbl	Optimize for P3 target.
Lahey 6.2 (P4)	lf95 --long --O2 --tp4 --fix --dbl	Optimize for P4 target.
Lahey 6.2 (P4)	lf95 --long --O2 --tp4 --sse2 --fix --dbl	Enable SSE2.
Portland 4.0 (P3 and P4)	pgf90 -fast -Mvect -r8	Vectorize
Portland 4.0 (P3 and P4)	pgf90 -fast -Mvect=sse -r8	Enable SSE

Table 5.2 Execution times (seconds) for the POM algorithm with four compilers on the Pentium III (933 MHz) without SSE enabled.

GRID	Absoft	Intel	Lahey	Portland