10          Case studies III: The Princeton Ocean Model

10.1  Introduction

10.1.1 The history of the Princeton Ocean Model (POM)

 

The Princeton Ocean Model (POM) is a numerical ocean model originally developed by George L. Mellor and Alan Blumberg around 1977 and coded in Fortran. Subsequently, others contributed to development and use of the model and a list of research papers that describe results and the numerical model may be found at the POM URL http://www.aos.princeton.edu/WWWPUBLIC/htdocs.pom, where a User Guide is also available. The POM is used for modeling of estuaries, coastal regions and oceans.  The world-wide POM user community passed 340 in number by mid-1998 and active use continues to this day, even though POM has been superceded by newer models described at the NOAA Geophysical Fluid Dynamics Laboratory URL http://www.gfdl.noaa.gov.

 

10.1.2 Performance of the POM

The Fortran source code for the serial version of the POM is available at the POM URL. This site also lists numerous benchmark results on a wide variety of platforms from High Performance Computers (HPC) to workstations and personal computers. In the past the code was successfully implemented on vector architectures. An analysis by HiPERiSM Consulting, LLC, in Course HC2, finds over 500 vectorizable loops, and describes conversion of the POM to an OpenMP version. Performance on HiPERiSM Consulting, LLC’s IA-32 cluster for several problem sizes is discussed in Section 10.4.

 

10.1.3 MPI parallel version of the POM (MP-POM)

 

The source code used here for this MPI course was made available to HiPERiSM Consulting, LLC, for educational purposes, by Dr. Steve Piacsek, Naval Research Laboratory, Stennis Space Center, MS. This willingness to share the code is gratefully acknowledged. The MPI version of POM (hereafter MP-POM) was developed by W.D. Oberpriller, A. Sawdey, M.T. O’Keefe, and S. Gao, at the Parallel Computer Systems Laboratory, Department of Electrical and Computer Engineering, University of Minnesota. A manuscript describing the MPI conversion process and the current version is available at the MP-POM URL: http://www.borg.umn.edu/topaz/mppom/. The MP-POM code has been used here to study scalability as a function of both the problem size and the number of MPI processes.  To provide insight into this behavior Section 9.5 presents an instrumented performance analysis.

 

 

Back to Intel Trace Analyzer™

 

10.2  POM Overview

 

10.2.1  Brief description of the POM

 

The POM contains an embedded moment turbulence closure sub-model (subroutines PROFQ and ADVQ) to provide vertical mixing coefficients and implements complete thermodynamics. It solves a numerical model in a discrete three dimensional grid with the following characteristics:

 

·         It is a sigma coordinate model in that the vertical coordinate (z) is scaled on the water column depth and vertical time differencing is implicit.

·         The horizontal grid (x,y) uses curvilinear orthogonal coordinates in an “Arakawa C” differencing scheme that is explicit.

·         The model has a free surface and a split time step:

1.      The external mode portion of the model is two-dimensional and uses a short time step based on the CFL condition and the external wave speed.

2.      The internal mode is three dimensional and uses a long time step based on the CFL condition and internal wave speed.

 

In POM advection, horizontal diffusion, and (in the case of velocity) the pressure gradient and Coriolis terms are evaluated in subroutines ADVT, ADVQ, ADVCT, ADVU, ADVV and ADVAVE. The vertical diffusion is handled in subroutines PROFT, PROFQ, PROFU and PROFV.

 

The Fortran symbols that define the POM problem sizes under study here are defined in Table 10.1. The serial version uses upper case for Fortran symbols throughout, whereas MP-POM uses mixed case for symbols. For this discussion mixed case is used where it is not confusing.

 

 

Table 10.1 : Key grid and time step Fortran symbols in the Princeton Ocean Model (POM).

 

Symbol

Description

i , j

imax , jmax

Horizontal grid indexes and

Upper range limits

k

kmax

Vertical grid index (k=1 is top) and

Upper range limit (at bottom)

iint

DO loop index for internal mode time step

DTI

Internal mode time step (sec)

iend

Upper range limit for iint

iext

DO loop index for external mode time step

DTE

External mode time step (sec)

isplit = DTI / DTE

Upper range limit for iext

 

The POM User Guide contains a detailed description of the model equations, time integration method, and Fortran symbols. The following sections describe only the essentials needed to perform the cases under study.

 

10.2.2  Conversion of POM to MP-POM

 

The conversion of POM into the MPI version, MP-POM, used techniques and a specialized tool (TOPAZ) that are described in the paper at the MP-POM URL (see Section 10.1.3). In the following only a few points of the methodology are described.

 

Domain decomposition is applied to divide the computational work over MPI processes, but this requires overlap regions (also known as ghost cells, or halo regions) between domains. One key idea in the MP-POM conversion process is to add overlap regions that are larger than the minimum of one vector, as used, for example, in the SOM case study of Chapters 8 and 9. A domain decomposition algorithm requires communication between domains to exchange data at synchronization points. The idea in MP-POM is that if the halo region is made broader the overlap elements can be used for computation, and, only when all have been used up in computation, is communication required. The communication approach in MP-POM is guided by a motivation for efficient parallel code. Efficiency considerations require a minimal frequency of synchronization and communication because:

·         Synchronization is an overhead cost

·         Communication has a latency overhead

·         Communication delays occur if load imbalance between processes results in idle time (as some wait for others to complete computation before communication begins).

 

Thus a fundamental issue is the choice of an optimal overlap region width. Different choices in overlap region width have a different balance between computation and communication. To address this issue Table 10.2 lists the criteria and their trends for the two possible extremes in overlap region size.

 

 

Table 10.2 : Criteria and consequences of choice in overlap region size for MP-POM.

 

Criterion

Large overlap region

Small overlap region

Computation

Increase

Decrease

Synchronization

Decrease

Increase

Communication & Frequency

Longer and less frequent messages

Shorter and more frequent messages

Redundancy

More in computation on halo

Less in computation on halo

 

However, increasing the size of the overlap region increases the amount of redundant computation because overlap elements on one processor correspond to non-overlap elements on a neighbor where the same computation is performed. Despite this redundancy there is a potential performance gain when the cost of a local element computation is less than the cost of retrieving the value of the same (non-overlap) element value from a neighboring processor. This is because the communication factors listed above contribute to the cost of retrieval. 

 

Figure 10.1 shows the three parallel regions and synchronization points in MP-POM relative to the DO loops for internal and external modes.

 

Figure 10.1: Steps in the MPI version MP-POM showing the three parallel regions and synchronization points. For the latter the number in square parentheses is the number of arrays that are synchronized between MPI processes.

-----------       -----------                   ------------

STEP              Rank 0                        Rank > 0       

-----------       -----------                   ------------

1.                setup                         (idle)

2. Internal mode  DO 9000 IINT=1,IEND      

                  Synchronization Point #1 <<<<<< [16] <<<<<

                  Parallel Region #1 >>>>>>>>>>>>>>>>>>>>>>>

                        ¯                       ¯

3. External mode        DO 8000 IEXT=1,ISPLIT    

                        Synchronization Point #2 << [29] <<<

                        Parallel Region #2 >>>>>>>>>>>>>>>>>

                              ¯                 ¯

                        8000 CONTINUE

                  Synchronization Point #3 <<<<<< [62] <<<<<

                  Parallel Region #3 >>>>>>>>>>>>>>>>>>>>>>>

                        ¯                       ¯

                  9000 CONTINUE

3. I/O and STOP   PRINT OUTPUT

                        ¯                       ¯

                        call MPI_Finalize             call MPI_Finalize

-----------       -----------                   -----------

 

 

10.2.3  Problem sizes under study

 

Table 10.3 defines the four grids under study chosen such that problem size scaling is in the ratio 1:3:12:49 for the respective girds. GRID1 is used as a test problem whereas the remaining three grids are chosen for the even problem size scaling in the ratio 1:4:16.

 

 

 

 

 

 

 

Table 10.3: Problem sizes for the grid in the Princeton Ocean Model (POM).

 

Case

Description

GRID1

imax = 64, jmax= 64, kmax = 21

GRID2

imax = 128, jmax= 128, kmax = 16

GRID3

imax = 256, jmax= 256, kmax = 16

GRID4

imax = 512, jmax= 512, kmax = 16

 

 

Back to Intel Trace Analyzer™

10.3  MP-POM on a Linux IA-32 cluster

 

10.3.1  Location of MP-POM code

 

The following examples were performed under Red Hat Linux with the Portland Group pgf90 compiler on a 8-node, dual processor (per node), Pentium III IA-32 cluster. Therefore the scripts shown in the directories indicated below are for this specific case and need to be modified for the local implementation.

 

The MPI version of the POM code, MP-POM, is located in the subdirectory ~/hc6/mpi_f_cases/mppom. This has the source and makefiles in the folder:

 

                ~/build

 

For the four problem sizes that have been selected for this case study results (for the above implementation) are in the respective folders:

 

                ~/pom_grid1_sse

       ~/pom_grid2_sse

                ~/pom_grid3_sse

       ~/pom_grid4_sse

 

The results shown in these directories are those for the implementation described above and to prepare for execution on the local platform new directories should be created as follows:

 

                ~/pom_grid1

       ~/pom_grid2

                ~/pom_grid3

       ~/pom_grid4

 

and all run scripts copied for each grid as in this example for the first problem size:

 

                cp ~/pom_grid1_sse/run* ~/pom_grid1/.

 

Each run script has a distinctive name to denote the grid chosen and the number of MPI processes used as in the following example for run_grid1_8

 

/usr/bin/time mpirun -np 8 pom > pom_grid1_8.out

date >> pom_grid1_8.out

               

Note that the execution of each run script requires the correct identification on the Linux time and date commands, availability of the mpirun command, and the executable pom. The method of producing the executable pom is the subject of the next section.

 

10.3.2  Compiling and linking MP-POM

 

The MP-POM source and makefiles are located in the folder:

 

                ~/build

 

which has several different makefiles for various platforms, but only the makefile Makefile.linux has been tested and is used in this case study. Before the MP-POM executable pom can be produced with this makefile it must first be adapted for the current platform.  A README.txt file is available in this directory and should be studied carefully before the makefile is used. Since this makefile uses the Portland Group pgf90 and GNU gcc compilers some important implementation-dependent issues are repeated here. Edit the makefile Makefile.linux and observe the following, making changes (as needed), in the respective parts of the makefile.

 

Edit the makefile for the following information in PART I:

 

NPROCS=8

 

The parameter NPROCS is the number of processes/threads/processors to use. Whenever NPROCS is changed also delete the file dimensions.h before executing make, but do not erase the file dimensions.H since this is used by the preprocessor to generate dimensions.h.

 

Edit the Makefile for the following information in PART II:

 

Change the following to select the location of source and build files in PART IV (this is only required if PART IV is used – which is not the case here)

 

SRCDIR=~/hc6/mpi_f_cases/mppom/build

 

The target in this implementation is the 32-bit Linux operating system with the Portland Group and GNU gcc compilers so the following needs to be set for the local implementation:

 

CC = cc

FC = pgf90

CPP = /usr/bin/cpp

LOADER = pgf90

FFLAGS = -c -Mnosecond_underscore -fast -Mvect=sse -r8 -Minfo

CFLAGS =

CPPFLAGS = -P -traditional

LDFLAGS=

LIBS= -lfmpichlmpich

 

The following are some notes on these changes to ensure that they will be correct for the local Linux implementation.

 

#    CC = cc               # this is set for gcc

#    FC = pgf90            # this is the Portland Group pgf90 compiler

#    CPP = /usr/bin/cpp    # location of cpp preprocessor

#    LOADER = pgf90        # Portland Group pgf90 loader

#    FFLAGS = -c -Mnosecond_underscore -fast -Mvect=sse -r8 -Minfo

#                          # Portland Group pgf90 flags

#    CFLAGS =

#    CPPFLAGS = -P -traditional

#

#    For cpp use the following flags:

#    /usr/bin/cpp -P -traditional .... etc

#    The reason is that by default (on Linux) cpp will affect the label

#    field in processing fortran files. The -traditional flag corrects this

 

Once the above changes to Makefile.linux have been made for the local implementation an executable is generated by the following commands in the ~/build folder:

 

                                cp Makefile.linux Makefile

              make &> compile_grid1_8.out

 

After completion of the make execution scroll through the file compile_grid1_8.out and at the end confirm that the executable pom appears.

 

On first installation of the above directories the files have been set for the test problem of GIRD1 in Table 10.3 with 8 MPI processes.  However, this case study allows for a total of 16 combinations of MPI processor count and problem size with the following options:

 

                NPROCS=1, 4, 8, or 16 (or any power of 2)

       File grid replaced by any one of grid1, grid2, grid3, or grid4             

 

To execute any other problem size with any number of MPI processes changes have to be made in the ~/build directory as described in the following example.

 

Example for a different problem size and MPI process count

 

The default setup of the MP-POM installation is for the problem size GRID1 with 8 MPI processes. The steps that need to be followed when either the number of processes, or the problem size, is changed are described here for the example of 16 processes and GRID2.

 

To change the number of MPI processes follow these steps in folder ~/build:

1.      Edit the makefile and change the value NPROCS=16 ( the value used should be a power of 2).

2.      rm dimensions.h.

3.      rm *.o.

 

To change the problem size perform this next step before execution of the make:

4.      cp grid2 grid

5.      make &> compile_grid2_16.out.

 

If the problem size is not changed, then the output file of the make (in this example) is compile_grid1_16.out

 

After completion of the make execution scroll through the file *.out and at the end confirm that the executable pom appears. Once a combination of processor count and problem size is selected, and corresponding executable pom created, follow the directions for execution of the MP-POM code in the next section.

 

10.3.3  Executing MP-POM

 

Once the executable pom has been created (as described in the previous section) it needs to be moved to the appropriate run directory:

 

                ~/pom_grid1

       ~/pom_grid2

                ~/pom_grid3

       ~/pom_grid4

 

This example is for the default combination of 8 MPI processes and GRID1 and assumes that the run scripts have been copied as described in Section 10.3.1.

 

Example of MP-POM execution

 

The results shown in these directories are those for the default implementation described above and assumes that the executable pom has been created in directory ~/build.

 

To execute the default case follow these steps in folder ~/build:

1.      cd pom_grid1

2.      cp ../pom .

3.      ./run_grid1_8

 

Alternatively, for the example of the previous section the steps would be:

4.      cd pom_grid2

5.      cp ../pom .

6.      ./run_grid2_16

 

10.3.4 MP-POM output files

 

Once the executable run_grid1_8 has completed (as described in the previous section) in the run directory:

 

                ~/pom_grid1

 

the terminal returns the prompt with the results of the Linux time command and (for the default combination) an example has been saved in the file ~/pom_grid1_sse/run_grid1_8.time. In addition the script run_grid1_8 produces the output file run_grid1_8.time.out that has the MP-POM output. This should be viewed with an editor to see that at the end it shows the correct choice of problem size and number of processors. In addition some internal timing information is also presented.

 

Example of output files

 

The results in *.out and *.time files for each of 16 possible combinations of the number of MPI processes and problem sizes are shown in the respective default directories described in Section 10.3.1.  In addition, results of the compilation that produced the executable are shown in the compile_*.out files. Results generated on the local implementation should be saved with a similar naming convention in the respective local directories:

 

                ~/pom_grid1

       ~/pom_grid2

                ~/pom_grid3

       ~/pom_grid4

 

10.3.5   Exercises with MP-POM

 

Exercise 1. Perform the above case studies on your implementation for problem sizes in GRID2, GRID3, and GRID4, for MPI process counts in the range P=1, 4, 8, and 16 and compare performance with that discussed in the next Section.

 

Back to Intel Trace Analyzer™

10.4  MP-POM performance results on an IA-32 cluster

 

10.4.1   Wall clock time for MP-POM

 

The MP-POM code was executed for the four problem sizes of Table 10.3 and MPI process count P in the range 1 to 16. The results are for the Portland Group compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see Section 10.3). As was pointed out in Section 1.7, when two CPUs are in use on a dual processor node they compete for memory bandwidth whereas if only one CPU is in use it has the full memory bandwidth to itself. Hence the apparent loss in performance on this cluster between P=8 and p=16 MPI processes is a result of contention for memory bandwidth resources on the Intel dual processor node.

 

This section reports the results for wall clock time summarized in Table 10.4. Figure 10.2 shows two views of these results with the upper panel showing a 2-dimensional histogram chart for all four problem sizes and P=1, 4, and 8 MPI processes. The second (lower) panel shows a 3-dimensional histogram plot for P=1, 4, 8, and 16 MPI processes. In the 3-dimensional chart the lack of significant time reduction between P=8 and P=16 is evident for all problem sizes. However, the upper panel of Figure 10.2 shows that the reduction in wall clock time with increasing processor count P, increases as the problem size increases. The performance analysis in Section 10.5 gives more insight into the reasons for this but suffice it to say that increasing problem size corresponds to increasing coarseness in parallel granularity. 

 

 

 

Table 10.4 : MP-POM wall clock time for the four problem sizes (Grid) defined in Table 10.2 with the MPI process count values (P) shown.

 

Grid

P=1

P=4

P=8

P=16

GRID1

263.6

230.6

217.5

270.2

GRID2

856.7

420.9

318.0

355.9

GRID3

3425.9

1199.3

745.1

681.1

GRID4

13573.0

3841.0

2153.4

1836.3

 

 

10.4.2   Parallel speed-up for MP-POM

 

The MP-POM code was executed for the four problem sizes of Table 10.3 and MPI process count P in the range 1 to 16. The results are for the Portland Group compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see Section 10.3). As was pointed out in Section 1.7, when two CPUs are in use on a dual processor node they compete for memory bandwidth whereas if only one CPU is in use it has the full memory bandwidth to itself. Hence the apparent loss in performance on this cluster between P=8 and p=16 MPI processes is a result of contention for memory bandwidth resources on the Intel dual processor node.

 

This section reports the results for parallel speedup summarized in Table 10.5. Figure 10.3 shows two views of these results with the upper panel showing a 2-dimensional histogram chart for all four problem sizes and P=1, 4, and 8 MPI processes. The second (lower) panel shows a 3-dimensional histogram plot for P=1, 4, 8, and 16 MPI processes. In the 3-dimensional chart the lack of significant increase between P=8 and P=16 is evident for all problem sizes. However, the upper panel of Figure 10.3 shows that the gain in parallel speed up with increasing processor count (P) increases as the problem size increases because of the convergence to the ideal (linear speed up) curve shown as the solid line with the largest problem size showing the best parallel speed up result. The performance analysis in Section 10.5 gives more insight into the reasons for this but suffice it to say that increasing problem size corresponds to increasing coarseness in parallel granularity. 

 

 

Table 10.5 : MP-POM parallel speed-up for the four problem sizes (Grid) defined in Table 10.2 with the MPI process count values (P) shown.

 

Grid

P=1

P=4

P=8

P=16

GRID1

1

1.1

1.2

1.0

GRID2

1

2.0

2.7

2.4

GRID3

1

2.9

4.6

5.0

GRID4

1

3.5

6.3

7.4

 

 

 

 

Figure 10.2:  Time to solution for the MP-POM algorithm for the four problem sizes (grid number in Table 10.3) and MPI process count P=1, 4, 8 (upper frame), and including 16 (lower frame).

 

 

Figure 10.3:  Parallel speedup for the MP-POM algorithm for the four problem sizes (grid number in Table 10.3) and MPI process count P=1, 4, 8 (upper frame), and including 16 (lower frame).

 

10.4.3   Parallel efficiency for MP-POM

 

The MP-POM code was executed for the four problem sizes of Table 10.3 and MPI process count P in the range 1 to 16. The results are for the Portland Group compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see Section 9.3). As was pointed out in Section 1.7, when two CPUs are in use on a dual processor node they compete for memory bandwidth whereas if only one CPU is in use it has the full memory bandwidth to itself. Hence the apparent loss in performance on this cluster between P=8 and p=16 MPI processes is a result of contention for memory bandwidth resources on the Intel dual processor node.

 

This section reports the results for parallel efficiency summarized in Table 10.6. Figure 10.4 shows two views of these results with the upper panel showing a 2-dimensional histogram chart for all four problem sizes and P=1, 2, and 8 MPI processes. The second (lower) panel shows a 3-dimensional histogram plot for P=1, 4, 8, and 16 MPI processes. In the 3-dimensional chart the significant decrease between P=8 and P=16 is evident for all problem sizes. However, the upper panel of Figure 10.4 shows that the gain in parallel efficiency with increasing processor count (P) increases as the problem size increases because of the convergence to the ideal value of 1 with the largest problem size showing the best parallel efficiency result. The performance analysis in Section 10.5 gives more insight into the reasons for this but suffice it to say that increasing problem size corresponds to increasing coarseness in parallel granularity. 

 

 

 

Table 10.6 : MP-POM parallel efficiency for the four problem sizes (Grid) defined in Table 10.2 with the MPI process count values (P) shown.

 

Grid

P=1

P=4

P=8

P=16

GRID1

1

0.29

0.15

0.06

GRID2

1

0.51

0.34

0.15

GRID3

1

0.71

0.57

0.31

GRID4

1

0.88

0.79

0.46

 

 

 

 

Figure 10.4: Parallel efficiency for the MP-POM algorithm for the four problem sizes (grid number in Table 10.3) and MPI process count P=1, 4, 8 (upper frame), and including 16 (lower frame).

 

10.4.4   Conclusions on scalability for MP-POM

 

The MP-POM code was executed for the four problem sizes of Table 10.3 and MPI process count P in the range 1 to 16. The results are for the Portland Group compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see Section 10.3). As was pointed out in Section 1.7, when two CPUs are in use on a dual processor node they compete for memory bandwidth whereas if only one CPU is in use it has the full memory bandwidth to itself. Hence the apparent loss in performance on this cluster between P=8 and p=16 MPI processes is a result of contention for memory bandwidth resources on the Intel dual processor node.

 

This section reports the results for parallel efficiency summarized in Table 10.6 as a function of problem size for fixed process count P. Figure 10.5 shows these results with a 2-dimensional chart for all four problem sizes and P=4, 8 and 16 MPI processes. This shows that parallel efficiency increases as the problem size increases for each choice of processor count (P). In general, parallel efficiency values of 0.5 (or larger) are considered acceptable, while values 0.75 (and above) are in the good (to excellent) range. Thus the largest problem size shows the best parallel efficiency results. The performance analysis in Section 10.5 gives more insight into the reasons for this but suffice it to say that increasing problem size corresponds to increasing coarseness in parallel granularity. 

 

 

Figure 10.5: Scaling of parallel efficiency for the MP-POM algorithm with problem size (grid number in Table 10.3) and fixed MPI process count P=4, 8, and 16.

 

Back to Intel Trace Analyzer™

10.5  Performance analysis on an IA-32 cluster

 

10.5.1   Comparing tracefiles in Vampir for MP-POM

 

It is often stated that enhanced parallel performance comes from enhanced simultaneous activities. Usually this devolves into the separation of communication performance and application performance. The latter is a function of the choice of algorithm, data structure/partition, etc., whereas the former is a function of the communication costs for the chosen MPI procedures. Vampir (see Section 1.9) can give insights into both categories of performance issues because, through the many displays of a trace file that it offers, it can reveal subtle performance issues. It supports the programmer both in the discovery of problems and in the quantitative analysis of performance improvements between different versions of a program after code changes have been made.

 

The following sections show how Vampir is used to investigate performance costs in Section 10.6, by categories such as statistical overview, runtime analysis, and MPI statistics. Vampir also gives insight into load balance problems in Section 10.7 where statistics for activity, messages, and process profile are discussed. Section 10.8 concludes with a brief summary of the performance analysis.

 

10.5.2   Parallel granularity for MP-POM

 

Performance of the Princeton Ocean Model is evaluated by comparing Vampir trace files for the MP-POM code when executed for the problem sizes GRID2, GRID3, and GRID4, of Table 10.3 and a MPI process count P=4. This choice of ascending grid size produces problem size scaling in the ratio 1:4:16 for the respective girds which are identified by the tracefile names: pom_grid2_4.bvt, pom_grid3_4.bvt, and pom_grid4_4.bvt. Clearly, performance will depend on several factors related to parallel granularity, interconnect bandwidth, message size and frequency. Comparing performance results for these three grids is expected to show the effects of increasing parallel granularity.

 

Back to Intel Trace Analyzer™

10.6  Identifying performance costs

 

10.6.1   Statistical overview for MP-POM

 

Vampir opens with the Summary Chart display by default. This shows in Figure 10.6 to Figure 10.11 the sum of the time consumed by all activities over all selected processes for the respective trace files listed above. This is a very condensed view of the entire program run that shows a comparison of time spent in MPI procedures and user code. Each pair of figures corresponds to each of the three grids. The first figure of each pair shows the absolute time for the whole execution while the second shows the relative time in percent. For the smallest grid size in Figure 10.6 and Figure 10.7 the ratio of times for user application to MPI processes is 1, whereas for GRID3 and GRID4 it is 3:1 and 1:7.5, respectively. This Summary Chart display is the first clear quantitative indication of scalability in computation-to-communication ratio with increasing problem size. Later sections (and other displays) explore more detail for this observation. 

 

 

 

Figure 10.6:  Summary chart display with absolute times for pom_grid2_4.bvt.

 

 

Figure 10.7:  Summary chart display with the relative times for pom_grid2_4.bvt.

 

Figure 10.8:  Summary chart display with absolute times for pom_grid3_4.bvt.

 

 

Figure 10.9:  Summary chart display with the relative times for pom_grid3_4.bvt

 

Figure 10.10:  Summary chart display with absolute times for pom_grid4_4.bvt.

 

 

Figure 10.11:  Summary chart display with the relative times for pom_grid4_4.bvt.

 

 

10.6.2   Runtime analysis for MP-POM

 

An in-depth-analysis of the runtime behavior addresses the issue as to whether user application or MPI procedures dominate. Because Vampir has access to a trace that exactly describes when an event occurred in the application, it can render these changes over time in so-called Timeline Displays.

 

For chronologically ordered information the Global Timeline display for the respective problem sizes of the three grids is shown in the upper frame of Figure 10.12, Figure 10.13, and Figure 10.14. The result shown is that obtained after several applications of the Vampir zooming feature used for magnification of part of the timeline. These views have selected closely similar time slices of approximately 19.6 seconds. This display shows a color map that uses red for MPI (communication) and green for the application (computation). In these examples there are many MPI procedure calls that merge into black bands at this resolution on the horizontal scale. These black banded regions are the synchronization points of the MP-POM algorithm in Figure 10.1. These alternate computation and communication regions recur with regular frequency and each region involves participation of all MPI processes simultaneously. The comparison of the three Global Timeline displays shows an increasing sparseness of the MPI communication regions (black bands) compared to computation (green bands) as problem size increases.

 

Another timeline display is the Summary Timeline, which shows all analyzed state changes for each process (over the chosen time interval) in one display as a histogram over time bins. Within each time bin the number of processors in either an MPI (red) or Application (green) state is shown as a stacked column bar chart. Because a histogram is used the Summary Timeline display works well even for long trace times. An example of the summary timeline for GRID2, GRID3, and GRID4 is shown in the lower frame of Figure 10.12, Figure 10.13, and Figure 10.14. respectively, again just for the selected timeslice. The Summary Timeline display shows that with increasing problem size, the duration of application (computation) regions increases dramatically, while MPI (communication) regions develop a sparse structure.

 

 

 

Figure 10.12: Global timeline for pom_grid2_4.bvt after zooming.

 

 

Figure 10.13 Global timeline for pom_grid3_4.bvt after zooming.

 

 

Figure 10.14: Global timeline for pom_grid4_4.bvt after zooming.

 

 

 

10.6.3   Runtime MPI statistics for MP-POM

 

Runtime statistics for MPI processes are available through several displays in Vampir and some of these are shown here for the three problem sizes under study in the POM. One example of MPI statistics is seen in Figure 10.15, Figure 10.16, and Figure 10.17 for the three problem sizes. In each case the MPI process that accounts for the largest proportion of the time is MPI_Wait with respective proportions of 38.5%, 20.1% and 8.75% of the total. The MPI_Send process is the next largest time consumer with proportions of 4.0%, 2.5% and 2.0% of the total time in the respective problem sizes. It is evident that the relative time for MPI_Wait reduces substantially as the problem sizes scales upward and less so for the MPI_Send process.

 

 

 

Figure 10.15: Summary chart display for MPI only showing the relative times of pom_grid2_4.bvt.

 

 

Figure 10.16:  Summary chart display for MPI only showing the relative times of pom_grid3_4.bvt.

 

Figure 10.17: Summary chart display for MPI only showing the relative times of pom_grid4_4.bvt.

Back to Intel Trace Analyzer™

10.7  Identifying load balance problems

 

10.7.1   Activity statistics for MP-POM

 

Another statistical display in Vampir is the Activity Chart. It is especially useful in looking at distribution of time between MPI and user code for all processes at-a-glance. This display shows a pie chart for each MPI process with proportionate division by application, MPI, and Vampirtrace (VT_API) based on the total time utilization. VT_API is listed in addition to MPI and Application if there is one (or more) call(s) to a Vampirtrace API function. However, such calls typically take negligible time and thus the display helps to understand the uniformity of the load distribution by process at-a-glance.

 

The corresponding Activity Chart displays for the three problem problem sizes in MP-POM with four MPI processes are shown in Figure 10.18, Figure 10.19, and Figure 10.20, respectively. It is clear that by proportion, for all four processes, the amount of time spent in MPI procedures decreases sharply as the problem sizes scales upward. It is also worthwhile noting that these Activity Chart displays show proportions for MPI or Application activity that are relatively homogeneous over MPI processes for the largest problem size.

 

 

 

Figure 10.18: Global activity chart display for pom_grid2_4.bvt.

 

 

Figure 10.19: Global activity chart display for pom_grid3_4.bvt.

 

 

Figure 10.20: Global activity chart display for pom_grid4_4.bvt.

 

10.7.2   Message statistics for MP-POM

 

 

Another important indicator of possible MPI performance problems is in the details of messages passed between processes. In addition to the statistics discussed in Section 10.6.1, Vampir provides a feature-rich Message Statistics display for message counts, for the minimum, maximum, or average of message length, rate, or duration.

 

The default is to display the sum of all message lengths (not shown here) but some detailed comparison is available by viewing the Message Length statistics displays shown in Figure 10.21, Figure 10.22, and Figure 10.22. In this display the number of messages in each length bin are shown. The feature to note is that the bin size doubles for each increment in problem size as does the vertical scale and this indicates doubling in message length.

 

 

 

Figure 10.21: Message length statistics for communications between MPI processes with pom_grid2_4.bvt.

 

 

Figure 10.22: Message length statistics for communications between MPI processes with pom_grid3_4.bvt.

 

 

Figure 10.23: Message length statistics for communications between MPI processes with pom_grid4_4.bvt.

 

 

10.7.3   Process profile statistics for MP-POM

 

 

Another source for message statistics and information on possible MPI performance problems is in the details of the process profile for specific messages passed between processes. This is shown in the Process Profile displays which are discussed in this section.  These displays give timings on a per process basis for specific states. Only the results for the MPI_Wait state are shown here in view of the foregoing discussion on which MPI process dominates.

 

The Sum of time results for the MPI_Wait state are shown Figure 10.24, Figure 10.26, and Figure 10.28 for the respective problem sizes in the POM with four MPI processes. A companion set of results for the Average Duration is shown in Figure 10.25, Figure 10.27, and Figure 10.29. These displays show that the average time consumed by the MPI_Wait state changes only mildly with increasing problem size whereas the sum of time decreases drastically relative to the total time. Consequently the total cost of synchronization in MP-POM decreases substantially as the problem size increases.

 

 

 

Figure 10.24: Process profile for MPI activity and the MPI_Wait state showing sum of time for pom_grid2_4.bvt.

 

 

Figure 10.25: Process profile for MPI activity and the MPI_Wait state showing average duration  for  pom_grid2_4.bvt.

 

Figure 10.26:  Process profile for MPI activity and the MPI_Wait state showing sum of time for pom_grid3_4.bvt.

 

 

Figure 10.27: Process profile for MPI activity and the MPI_Wait state showing average duration  for  pom_grid3_4.bvt.

 

Figure 10.28:  Process profile for MPI activity and the MPI_Wait state showing sum of time for pom_grid4_4.bvt.

 

Figure 10.29:  Process profile for MPI activity and the MPI_Wait state showing average duration  for  pom_grid4_4.bvt.

 

Back to Intel Trace Analyzer™

10.8  Summary of performance analysis for MP-POM

 

Only by using Vampir to compare the three different problem sizes of GRID2, GRID3, and GRID4 of Table 10.3 for MP-POM is a quantitative, in-depth, understanding of the specifics of performance differences possible.

 

Section 10.6 presented results for performance costs with increasing problem size in the following categories

 

1.      Statistical overview  to show the first clear quantitative indication of scalability in computation-to-communication ratio.

2.      Runtime analysis to show that timelines have increasing computation regions and diminishing communication intervals.

3.      Runtime MPI statistics to show that MPI process time is dominated by one process and its relative time decreases by approximately a factor of two for each increase in grid size.

 

Section 10.7 presented results for load balance statistics with increasing problem size in the following categories

 

1.      Activity  to show that all MPI processes have a sharp decrease in proportion to computation with improved homogeneity across processes for the largest grid.

2.      Messages to show a doubling of message length

3.      Process profile  to show that the total cost of synchronization decreases substantially as problem sizes increases.

 

Consequently, it is possible to conclude that with increasing problem size

·         the computation-to-communication ratio increases

·         synchronization costs decrease, and

·         message length increases.

 

This performance analysis explains the scaling with problem size in parallel efficiency as shown in Figure 10.5.

 

Back to Intel Trace Analyzer™

 

10       Case studies III: The Princeton Ocean Model......................................................................................... 207

10.1     Introduction............................................................................................................................................................ 207

10.1.1 The history of the Princeton Ocean Model (POM)............................................................................................. 207

10.1.2 Performance of the POM.......................................................................................................................................... 207

10.1.3 MPI parallel version of the POM (MP-POM)...................................................................................................... 207

10.2     POM Overview........................................................................................................................................................... 208

10.2.1 Brief description of the POM................................................................................................................................... 208

10.2.2 Conversion of POM to MP-POM............................................................................................................................ 209

10.2.3 Problem sizes under study....................................................................................................................................... 210

10.3     MP-POM on a Linux IA-32 cluster................................................................................................................... 211

10.3.1 Location of MP-POM code...................................................................................................................................... 211

10.3.2 Compiling and linking MP-POM........................................................................................................................... 212

10.3.3 Executing MP-POM.................................................................................................................................................. 214

10.3.4 MP-POM output files................................................................................................................................................ 214

10.3.5 Exercises with MP-POM.......................................................................................................................................... 215

10.4     MP-POM performance results on an IA-32 cluster........................................................................... 215

10.4.1 Wall clock time for MP-POM.................................................................................................................................. 215

10.4.2 Parallel speed-up for MP-POM.............................................................................................................................. 216

10.4.3 Parallel efficiency for MP-POM............................................................................................................................. 219

10.4.4 Conclusions on scalability for MP-POM.............................................................................................................. 221

10.5     Performance analysis on an IA-32 cluster.......................................................................................... 222

10.5.1 Comparing tracefiles in Vampir for MP-POM..................................................................................................... 222

10.5.2 Parallel granularity for MP-POM......................................................................................................................... 222

10.6     Identifying performance costs.................................................................................................................... 222

10.6.1 Statistical overview for MP-POM.......................................................................................................................... 222

10.6.2 Runtime analysis for MP-POM............................................................................................................................... 224

10.6.3 Runtime MPI statistics for MP-POM...................................................................................................................... 226

10.7     Identifying load balance problems......................................................................................................... 227

10.7.1 Activity statistics for MP-POM............................................................................................................................... 227

10.7.2 Message statistics for MP-POM.............................................................................................................................. 229

10.7.3 Process profile statistics for MP-POM.................................................................................................................. 231

10.8     Summary of performance analysis for MP-POM.............................................................................. 233

 

Back to Intel Trace Analyzer™