HiPERiSM's Technical Reports
HiPERiSM - High Performance Algorism Consulting
HCTR-2010-3: CMAQ Performance 3
CMAQ OpenMP AND OpenMP+MPI HYBRID RESULTS
George Delic
HiPERiSM Consulting, LLC.
5.0 OpenMP PARALLEL RESULTS
The section numbering continues from the previous report.
5.1 Intel™ compiler results for CMAQ 4.6.1
Runtime results for the thread parallel version of CMAQ 4.6.1 with the fastest Intel™ compiler switch group (ifc4) are presented here for the parameter choices BLKSIZE=480 and NCMAX=60. For a discussion of these parameters see the previous report in this series (Delic, 2009). Table 5.1 repeats the results from the previous year’s presentation with a correction in terminology. In this table two performance metrics are introduced to assess thread parallel performance:
Table 5.1. For CMAQ 4.6.1, with the 20060814 episode, this shows the wall clock time (in hours), speedup, and scaling of the modified version (ROS3-HC) as a function of increasing thread count on the QC-1 platform.
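As a minimal sketch of how two such metrics can be derived from wall clock times, the Python fragment below assumes that speedup is measured against a baseline runtime (for example, an unmodified release) and that scaling is measured against the single-thread runtime of the modified version. Both definitions, the function name `thread_metrics`, and all numbers are illustrative assumptions and are not taken from Table 5.1.

```python
from typing import Dict


def thread_metrics(baseline_hours: float,
                   hours_by_threads: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Return speedup and scaling for each thread count.

    Assumed definitions (not confirmed by the report text):
      speedup = baseline wall clock time / wall clock time with N threads
      scaling = single-thread wall clock time / wall clock time with N threads
    """
    if 1 not in hours_by_threads:
        raise ValueError("a single-thread runtime is needed to compute scaling")
    if baseline_hours <= 0 or any(t <= 0 for t in hours_by_threads.values()):
        raise ValueError("wall clock times must be positive")

    single_thread_hours = hours_by_threads[1]
    return {
        threads: {
            "speedup": baseline_hours / hours,
            "scaling": single_thread_hours / hours,
        }
        for threads, hours in sorted(hours_by_threads.items())
    }


if __name__ == "__main__":
    # Hypothetical wall clock times in hours, for illustration only.
    metrics = thread_metrics(10.0, {1: 8.0, 2: 5.0, 4: 4.0})
    for threads, row in metrics.items():
        print(f"threads={threads} speedup={row['speedup']:.2f} scaling={row['scaling']:.2f}")
```

Under these assumed definitions the two numbers answer different questions: speedup compares the modified code with the baseline at each thread count, while scaling isolates the gain from adding threads to the modified code.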
5.2 Intel™ compiler results for CMAQ 4.7.1
Runtime results for a pre-release version of CMAQ 4.7.1 with the fastest Intel™ compiler switch group (ifc4) are presented here for the parameter choices BLKSIZE=480 and NCMAX=60. These are shown in Tables 5.2 and 5.3, for QC-1 and QC-2, respectively, for the thread parallel version.
Table 5.2. For CMAQ 4.7.1, with the 20060809 episode, this shows the wall clock time (in hours), speedup, and scaling of the modified version (ROS3-HC) as a function of increasing thread count on the QC-1 platform.
Table 5.3. For CMAQ 4.7.1, with the 20060809 episode, this shows the wall clock time (in hours), speedup, and scaling of the modified version (ROS3-HC) as a function of increasing thread count on the QC-2 platform.
5.3 Analysis of OpenMP results
The principal results of comparisons for the above tables are as follows.
The last two results are a consequence of CMAQ 4.7.1 shifting the balance of arithmetic operations further toward scalar work (i.e. less vector-capable work) compared to CMAQ 4.6.1. In other words, less time is spent in the chemistry solver part relative to the rest of the model.
6.0 HYBRID OpenMP+MPI RESULTS
6.1 CMAQ 4.6.1 runtime
Runtime results with the Portland compiler switch group pgf3 are presented here for the parameter choices BLKSIZE=240 and NCMAX=30. Table 6.1 summarizes the CMAQ 4.6.1 results for episode 20060814 with the Portland 10.3 compiler.
Table 6.1. Wall clock times (in hours) in the hybrid MPI+OpenMP version of the CMAQ 4.6.1 ROS3-HC solver on the HiPERiSM QC Cluster platform for the Portland compiler group pgf3.
Execution times of the standard EPA release are in the column labeled ROS3-EPA. Columns under the label ROS3-HC show results of the hybrid MPI+OpenMP modified CMAQ version with the Rosenbrock solver. The rows correspond to the MPI process count (NP), and the thread count is the number appearing under the ROS3-HC label. Blank cells indicate results that are not yet available, or combinations that are limited by the 8 cores per node.
6.2 CMAQ 4.6.1 speedup
For the hybrid MPI+OpenMP modified CMAQ version with the Rosenbrock solver, Table 6.2 shows the speedup metric corresponding to the runtimes in Table 6.1.
Table 6.2. Speedup in the hybrid MPI+OpenMP version of the CMAQ 4.6.1 ROS3-HC solver on the HiPERiSM QC Cluster platform for the Portland compiler group pgf3.
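The step from a runtime grid to a speedup grid can be sketched as follows. This is an illustrative Python fragment only: it assumes that each hybrid speedup is the ROS3-EPA wall clock time at the same MPI process count divided by the ROS3-HC wall clock time, which the text here does not state explicitly. The function name `hybrid_speedup` and every number are hypothetical, and blank cells are represented as `None`.

```python
from typing import Dict, Optional, Tuple

# Key: (MPI process count NP, OpenMP thread count); value: hours, or None for a blank cell.
RuntimeGrid = Dict[Tuple[int, int], Optional[float]]


def hybrid_speedup(baseline_hours_by_np: Dict[int, float],
                   hybrid_hours: RuntimeGrid) -> RuntimeGrid:
    """Convert a grid of hybrid wall clock times into a grid of speedups.

    Assumption: speedup = baseline time at the same NP / hybrid time.
    A cell stays blank (None) when either time is missing.
    """
    speedups: RuntimeGrid = {}
    for (np_count, threads), hours in hybrid_hours.items():
        baseline = baseline_hours_by_np.get(np_count)
        if hours is None or baseline is None:
            speedups[(np_count, threads)] = None
        else:
            speedups[(np_count, threads)] = baseline / hours
    return speedups


if __name__ == "__main__":
    # Hypothetical wall clock times in hours, for illustration only.
    baseline = {1: 12.0, 2: 6.0}
    hybrid = {(1, 1): 8.0, (1, 4): 4.0, (2, 4): None}
    for (np_count, threads), value in sorted(hybrid_speedup(baseline, hybrid).items()):
        cell = "blank" if value is None else f"{value:.2f}"
        print(f"NP={np_count} threads={threads} speedup={cell}")
```

Keeping blank cells as `None` rather than zero avoids presenting an unavailable measurement as a poor result.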
ACKNOWLEDGEMENTS
Part of this work was performed by HiPERiSM Consulting, LLC, as subcontractor to Computer Sciences Corporation, under U.S. EPA SES3 Contract GS-35F-4381G BPA 0775, Task Order 1522.
HiPERiSM Consulting, LLC, (919) 484-9803 (Voice) (919) 806-2813 (Facsimile)