HiPERiSM's Technical Reports
HiPERiSM - High Performance Algorism Consulting
HCTR-2010-5: CMAQ on a GPGPU
CMAQ GPGPU LOOP NEST BENCHMARKS
HiPERiSM Consulting, LLC.
1.0 EXPLORING CMAQ ON A GPGPU
1.1 Thread-safe version of CMAQ
HiPERiSM's thread parallel version of the CMAQ Rosenbrock solver (ROS3-HC) is used as the starting point to explore a port of CMAQ to a General Purpose Graphical Processing Units (GPGPU) device. The host CPU architecture for this exploration was a platform with dual Intel™ quadcore (X5450) CPUs and one Nvidia™ GPGPU (C1060) device in a PCIe x16 slot. While this is not state-of-the-art at this time, it is sufficient to establish feasibility and identification of the issues involved in porting CMAQ to a GPGPU device. For more details on this configuration see Table 2.1 of Technical report HCTR2010-1. The following discussion refers to two user selectable parameters, BLKSIZE and NCMAX. These parameters are fully described in a previous report in this series (HCTR2009-1). They were included in the design of ROS3-HC to enable selection of chunk size (BLKSIZE) for OpenMP threads and vector length (NCMAX) for SSE enabled loops in CMAQ's ROS3-HC thread parallel solver.
1.2 Porting CMAQ to a GPGPU
A successful port of CMAQ ROS3-HC to a GPGPU device requires either multilayered loop nests, or very long (single) loops to reap the benefits of the many threads such devices employ. This suggests that if the NCMAX “vector” length could be made very large there could be throughput benefits for a CMAQ parallel version adapted for such devices. If at the same time, the number (BLKSIZE) of cells in a block could be increased, then the number of such blocks passed to the solver would be reduced. Consequently, the number of calls to the CMAQ chemistry solver algorithm is fewer, and loops spanning the entire block would be off-loaded to the many-core device.
1.3 Benchmark of CMAQ loop nests
In these benchmarks the vector length for loops over domain cells was set equal to the block size (NCMAX= BLKSIZE) with values of BLKSIZE incremented in a wide range. Table 1.1 shows the vector length increments chosen for this analysis on both the host CPU (X5450) and the GPGPU device (C1060). The domain size of 2,276,640 cells corresponds to the episodes described in Section 3 in the report HCTR-2010-1.
Table 1.1. Domain size and choices for vector length.
For timing benchmarks several loop nests from the CMAQ ROS3-HC thread parallel version were selected. Table 1.2 lists the legend for the loop names and details of each source code construct are described in the next section. These are not the only types of loops encountered in the ROS3-HC solver, but they are a starting point for identifying porting issues and GPGPU versus CPU performance results.
Table 1.2. Loop names for loop nest benchmarks.
2.0 COMPILING THE LOOP NESTS
The Portland 10.6 Fortran compiler was used with the pgf3 compiler group. For the host CPU the loops were compiled as single threaded loops. In the GPGPU case, the Portland Accelerator™ compiler generated kernels and data movement for the target device by recognition of accelerator directives (!ACC) that encapsulated loop nests in the benchmarks. In conducting these benchmarks no additional GPGPU optimizations were performed.
3.0 CODE FOR LOOP NESTS
3.1 Loop L478
Loop L478 is one example of a construct that recurs in CMAQ's Rosenbrock solver: the outer loop has an indirect memory reference to extract the index NRK that is used in the inner loop. Therefore the outer loop cannot be parallelized. The inner loop over the index NCIN is a parallel candidate over the vector length NUMCELLS (=NCMAX).
3.2 Loop L522-37
Loop L522-37 is another example of a construct that recurs in CMAQ's Rosenbrock solver:the DO 111 loop has an indirect memory reference to extract the index NRK that is used in contained loop nests. There are three contained loop nests and each has an outer loop that again has an indirect reference used in the inner (long) vector loop. These inner vector loops are all candidates for parallel execution while the outer loops execute sequentially.
3.3 Loop L1240
Loop L1240 computes the final solution of the Rosenrock iteration. It differs from the previous examples in two respects: the loop nest has depth 2, and the first inner loop has significant arithmetic operation count (5 adds and 6 multiplies) which bodes well for a GPGPU kernel. The second inner loop is a memory copy which may be a performance deficit. Note that reversing loop order in the first loop pair would invoke non-unit stride for multiple arrays on the right hand side. All loops in L1240 are candidates for parallel execution.
3.4 Loop L1284
For the current Rosenbrock iterate, Loop L1284 computes the mean square error (MSE), ERR, for each cell index, NCELL, as a sum over a chemical species index N. This creates a loop carried dependence on ERR that prevents parallel execution of the species loop with index N. The last loop is a reduction loop searching for the maximum of the root MSE. The three loops that are candidates for parallel execution are as indicated.
4.0 RUNTIME RESULTS
For the ROS3-HC version of CMAQ timing benchmarks were performed for each loop nest described above. Results for execution on both the host CPU and the GPGPU device are presented in the previous report (HCTR2010-4). The most promising results are for the L1240 case because it is also the most computationally intensive case.
Fig 4.1: Scaling of time spent in kernels on the GPGPU device versus the vector length for three CMAQ benchmarks in Table 1.2.
Exploratory benchmarks with selected loops on a many-core GPGPU device suggest that opportunities exist for CMAQ on such devices, but further work is needed to improve performance.
HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)
(919) 806-2813 (Facsimile)