HiPERiSM Consulting, LLC: HCTR-2010-3

5.0 OpenMP PARALLEL RESULTS

The section numbering continues from the previous report.

5.1 Intel™ compiler results for CMAQ 4.6.1

Runtime results for the thread parallel version of CMAQ 4.6.1 with the fastest Intel™ compiler switch group (ifc4) are presented here for the parameter choices BLKSIZE=480 and NCMAX=60. For a discussion of these parameters see the previous report in this series (Delic, 2009). Table 5.1 repeats the results from the previous year’s presentation with a correction in terminology. In this table two performance metrics are introduced to assess thread parallel performance:

Speedup is the ratio of the standard U.S. EPA runtime to the ROS3-HC runtime at a given thread count.
Scaling is the ratio of the single-thread runtime of the ROS3-HC modified code to its runtime at a given thread count.
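As an illustration of the two metrics, the following minimal Python sketch recomputes the speedup and scaling rows from the wall clock times listed in Table 5.1. The function and variable names are hypothetical and chosen only for this example; they are not part of CMAQ or of the ROS3-HC code.

```python
# Wall clock times in hours, copied from Table 5.1 (CMAQ 4.6.1, QC-1).
EPA_HOURS = 25.3
ROS3_HC_HOURS = {1: 29.4, 2: 20.8, 4: 19.5, 8: 17.5}  # keyed by thread count


def speedup(epa_hours, hc_hours):
    """Ratio of the U.S. EPA runtime to the ROS3-HC runtime."""
    return epa_hours / hc_hours


def scaling(hc_single_thread_hours, hc_hours):
    """Ratio of the single-thread ROS3-HC runtime to the ROS3-HC runtime."""
    return hc_single_thread_hours / hc_hours


def metrics_table(epa_hours, hc_hours_by_threads):
    """Return {threads: (speedup, scaling)}, each rounded to two decimals."""
    base = hc_hours_by_threads[1]
    return {
        n: (round(speedup(epa_hours, t), 2), round(scaling(base, t), 2))
        for n, t in hc_hours_by_threads.items()
    }


if __name__ == "__main__":
    for n, (s, c) in metrics_table(EPA_HOURS, ROS3_HC_HOURS).items():
        print(f"threads={n}: speedup={s:.2f} scaling={c:.2f}")
```

A speedup below 1 for a single thread means that the modified code is slower than the U.S. EPA version when run serially, while scaling is 1 by definition at one thread.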

Table 5.1. Wall clock time (in hours), speedup, and scaling of the modified version (ROS3-HC) of CMAQ 4.6.1 as a function of increasing thread count on the QC-1 platform, for the 20060814 episode.

CMAQ version for Rosenbrock solver	Number of threads
	1	2	4	8
U.S. EPA (hours)	25.3	NA	NA	NA
ROS3-HC (hours)	29.4	20.8	19.5	17.5
ROS3-HC (speedup)	0.86	1.22	1.30	1.45
ROS3-HC (scaling)	1.00	1.41	1.51	1.68

5.2 Intel™ compiler results for CMAQ 4.7.1

Runtime results for a pre-release version of CMAQ 4.7.1 with the fastest Intel™ compiler switch group (ifc4) are presented here for the parameter choices BLKSIZE=480 and NCMAX=60. These are shown in Tables 5.2 and 5.3, for QC-1 and QC-2, respectively, for the thread parallel version.

Table 5.2. Wall clock time (in hours), speedup, and scaling of the modified version (ROS3-HC) of CMAQ 4.7.1 as a function of increasing thread count on the QC-1 platform, for the 20060809 episode.

CMAQ version for Rosenbrock solver	Number of threads
	1	2	4	8
U.S. EPA (hours)	33.0	NA	NA	NA
ROS3-HC (hours)	36.0	29.7	26.1	23.9
ROS3-HC (speedup)	0.92	1.11	1.26	1.38
ROS3-HC (scaling)	1.00	1.21	1.38	1.51

Table 5.3. Wall clock time (in hours), speedup, and scaling of the modified version (ROS3-HC) of CMAQ 4.7.1 as a function of increasing thread count on the QC-2 platform, for the 20060809 episode.

CMAQ version for Rosenbrock solver	Number of threads
	1	2	4	8
U.S. EPA (hours)	23.06	NA	NA	NA
ROS3-HC (hours)	26.06	21.7	18.6	17.3
ROS3-HC (speedup)	0.88	1.06	1.24	1.34
ROS3-HC (scaling)	1.00	1.20	1.40	1.51

5.3 Analysis of OpenMP results

The principal results of comparisons for the above tables are as follows.

The modified version of the ROS3-HC solver for CMAQ showed typical speedup with 8 parallel threads in the range 1.3 to 1.5 over the U.S. EPA version.
The speedup metric shows that CMAQ 4.7.1 gains less performance with increasing thread count than CMAQ 4.6.1 does.
CMAQ 4.7.1 requires considerably longer runtimes compared to CMAQ 4.6.1.
The gain in moving the U.S. EPA version from QC-1 to QC-2 is 1.43 (33.0 hours versus 23.06 hours in Tables 5.2 and 5.3).
The gain in moving the ROS3-HC version from QC-1 to QC-2 is 1.37 to 1.40 (depending on the number of threads).

The last two results are a consequence of CMAQ 4.7.1 shifting the balance of arithmetic operations further toward scalar work (i.e. less vector-capable work) compared to CMAQ 4.6.1. In other words, less time is spent in the chemistry solver part relative to the rest of the model.
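The platform gains in the list above can be recomputed from Tables 5.2 and 5.3 as the ratio of the QC-1 wall clock time to the QC-2 wall clock time for the same code version and thread count. The following Python sketch shows that arithmetic; the names used are hypothetical and serve only this example.

```python
# Wall clock times in hours for CMAQ 4.7.1, copied from Table 5.2 (QC-1)
# and Table 5.3 (QC-2). ROS3-HC times are keyed by thread count.
EPA_QC1, EPA_QC2 = 33.0, 23.06
HC_QC1 = {1: 36.0, 2: 29.7, 4: 26.1, 8: 23.9}
HC_QC2 = {1: 26.06, 2: 21.7, 4: 18.6, 8: 17.3}


def platform_gain(qc1_hours, qc2_hours):
    """Ratio of QC-1 runtime to QC-2 runtime (values above 1 favor QC-2)."""
    return qc1_hours / qc2_hours


EPA_GAIN = platform_gain(EPA_QC1, EPA_QC2)
HC_GAINS = {n: platform_gain(HC_QC1[n], HC_QC2[n]) for n in HC_QC1}

if __name__ == "__main__":
    print(f"U.S. EPA: gain={EPA_GAIN:.2f}")
    for n, g in HC_GAINS.items():
        print(f"ROS3-HC, threads={n}: gain={g:.2f}")
    print(f"ROS3-HC range: {min(HC_GAINS.values()):.2f} "
          f"to {max(HC_GAINS.values()):.2f}")
```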

6.0 HYBRID OpenMP+MPI RESULTS

6.1 CMAQ 4.6.1 runtime

Runtime results with the Portland compiler switch group pgf3 are presented here for the parameter choices BLKSIZE=240 and NCMAX=30. Table 6.1 summarizes the CMAQ 4.6.1 results for episode 20060814 with the Portland 10.3 compiler.

Table 6.1. Wall clock times (in hours) in the hybrid MPI+OpenMP version of the CMAQ 4.6.1 ROS3-HC solver on the HiPERiSM QC Cluster platform for the Portland compiler group pgf3.

Col x Row = NP	ROS3-EPA (hours)	ROS3-HC
		Time in hours by thread count
		1	2	4	8
1 x 1 = 1	29.0	34.8			24.1
1 x 2 = 2	15.1		15.2	13.3	12.6
2 x 2 = 4	8.2		8.0	7.5
2 x 4 = 8	5.1		5.0

Execution times of the standard EPA release are in the column labeled ROS3-EPA. Columns under the label ROS3-HC show results of the hybrid MPI+OpenMP modified CMAQ version with the Rosenbrock solver. Each row corresponds to an MPI process count (NP), and the thread count is the number heading each column under the ROS3-HC label. Blank cells indicate combinations for which results are not yet available, or which are excluded by the limit of 8 cores per node.

6.2 CMAQ 4.6.1 speedup

For the hybrid MPI+OpenMP modified CMAQ version with the Rosenbrock solver, Table 6.2 shows the speedup metric corresponding to the runtimes in Table 6.1.

Table 6.2. Speedup in the hybrid MPI+OpenMP version of the CMAQ 4.6.1 ROS3-HC solver on the HiPERiSM QC Cluster platform for the Portland compiler group pgf3.

Col x Row = NP	ROS3-HC vs ROS3-EPA
	Speedup by thread count
	1	2	4	8
1 x 1 = 1	0.83			1.20
1 x 2 = 2		1.00	1.14	1.20
2 x 2 = 4		1.03	1.10
2 x 4 = 8		1.00
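The speedup entries can be recomputed from Table 6.1 by dividing the ROS3-EPA time in each row by the ROS3-HC time for each thread count, leaving blank cells blank. The Python sketch below is one way to do this; its names are hypothetical. Because it starts from the one-decimal times printed in Table 6.1, a few recomputed ratios can differ from Table 6.2 by one or two hundredths, presumably because the tabulated speedups were derived from unrounded times (an assumption, not something stated in this report).

```python
THREADS = (1, 2, 4, 8)

# Table 6.1: NP -> (ROS3-EPA hours, ROS3-HC hours per thread count).
# None marks a blank cell (result not available).
TABLE_6_1 = {
    1: (29.0, (34.8, None, None, 24.1)),
    2: (15.1, (None, 15.2, 13.3, 12.6)),
    4: (8.2, (None, 8.0, 7.5, None)),
    8: (5.1, (None, 5.0, None, None)),
}


def hybrid_speedup(table):
    """Return NP -> tuple of ROS3-EPA/ROS3-HC ratios, None where blank."""
    result = {}
    for np_count, (epa_hours, hc_row) in table.items():
        result[np_count] = tuple(
            None if hc_hours is None else round(epa_hours / hc_hours, 2)
            for hc_hours in hc_row
        )
    return result


if __name__ == "__main__":
    for np_count, row in hybrid_speedup(TABLE_6_1).items():
        cells = ["" if s is None else f"{s:.2f}" for s in row]
        print(f"NP={np_count}\t" + "\t".join(cells))
```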

ACKNOWLEDGEMENTS

Part of this work was performed by HiPERiSM Consulting, LLC, as a subcontractor to Computer Sciences Corporation, under U.S. EPA SES3 Contract GS-35F-4381G BPA 0775, Task Order 1522.