Jianqi Lai, Hang Yu, Zhengyu Tian, Hua Li, "Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters", Scientific Programming, vol. 2020, Article ID 8862123, 15 pages, 2020. https://doi.org/10.1155/2020/8862123
Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters
Abstract
Graphics processing units (GPUs) have a strong floating-point capability and a high memory bandwidth in data parallelism and have been widely used in high-performance computing (HPC). The compute unified device architecture (CUDA) is used as a parallel computing platform and programming model for the GPU to reduce the complexity of programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm of the message passing interface and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+-up upwind scheme and the three-step Runge–Kutta method are used for spatial discretization and time discretization, respectively. The turbulent solution is solved by the SST two-equation model. The CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a non-blocking communication method that fully overlaps GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, the one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm with the compressible turbulent flow over a flat plate. The performance of a single GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.
1. Introduction
The developments of computer technology and numerical schemes over the past few decades have made computational fluid dynamics (CFD) an important tool in the optimal design of aircraft and the analysis of complex flow mechanisms [1, 2]. A large number of CFD applications can reduce development costs and provide technical support for research on aircraft. The scope and complexity of flow problems in CFD simulation are constantly expanding, and the grid size required for simulation is increasing. The rapid increase in grid size poses a challenge for processing these huge data sets in engineering activities and scientific research. Traditionally, multi-CPU parallelization has been used to accelerate computation, but its low degree of parallelism and power inefficiency may limit the parallel performance of a cluster. Furthermore, the computing time depends largely on CPU upgrades, and in recent years, the development of the CPU has become a bottleneck because of limitations in power consumption and heat dissipation [3, 4].
In CFD applications, a large amount of computing resources is required for complex flow problems, such as turbulent flow, reactive flow, and multiphase flow. High-performance computing (HPC) accelerators, such as the graphics processing unit (GPU), the many integrated core (MIC) architecture, and the field programmable gate array (FPGA), are more efficient in parallel data processing than the CPU [5–8]. Faster and better numerical solutions can be obtained by executing CFD codes on these heterogeneous accelerators. In this paper, we discuss GPU parallelization in CFD applications.
The GPU has a strong floating-point capability and a high memory bandwidth in data parallelism. The Volta architecture Tesla V100 GPU delivers single-precision and double-precision floating-point performance of up to 14 and 7 TFLOP/s, respectively, which is much higher than the computing performance of the CPU. In 2006, NVIDIA introduced the compute unified device architecture (CUDA), which reduces the complexity of programming [9]. The programmable GPU has evolved into a highly parallel, multithreaded, many-core processor. Therefore, GPU acceleration is becoming popular in general-purpose computing areas such as molecular dynamics (MD), direct simulation Monte Carlo (DSMC), CFD, artificial intelligence (AI), and deep learning (DL) [10–13].
In this work, we focus on the design and optimization of a hybrid parallel algorithm of the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The compressible Navier–Stokes equations are discretized with the cell-centered finite volume method. The AUSM+-up upwind scheme and the three-step Runge–Kutta method are used for spatial discretization and time discretization, respectively. Moreover, the turbulent solution is solved by the shear stress transport (SST) two-equation model. In some previous work, the CPU was designed to perform data processing together with the GPU [14, 15]. However, for the latest Pascal or Volta architecture GPUs, the computing ability of the GPU far exceeds that of the CPU. In our design of the hybrid parallel algorithm, the CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Performance optimization involves three basic strategies: maximizing parallel execution to achieve maximum utilization, optimizing memory usage to achieve maximum memory throughput, and optimizing instruction usage to achieve maximum instruction throughput. In this study, parallel execution and memory access optimizations are investigated.
In a multi-GPU HPC cluster, ghost and singularity data are exchanged between GPUs. Some scholars use CUDA-aware MPI technology to accelerate the data exchange [13, 16–18]. This technology not only makes it easier to work with a CUDA + MPI application but also allows acceleration technologies such as GPUDirect to be utilized by the MPI library. However, the hardware and software configurations used in this paper do not support this technology. Moreover, the use of the SST two-equation turbulence model increases the complexity of multi-GPU parallel programming. In our design of the hybrid parallel algorithm, we need to stage GPU buffers through host memory. Two kinds of communication approaches are considered: the blocking and non-blocking communication methods. The first approach uses the blocking functions MPI_Bsend and MPI_Recv without overlapping communication and computation. The second approach uses the non-blocking functions MPI_Isend and MPI_Irecv to fully overlap GPU computation, CPU-CPU communication, and CPU-GPU data transfer. The non-blocking communication method can improve the computational efficiency to some extent.
Multi-GPU parallelization achieves the maximum performance by balancing the workload among GPUs based on domain decomposition [3, 19]. One-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) domain decomposition is commonly used for GPU implementations. In this study, we design a 1D domain decomposition algorithm based on the idea of dichotomy so that each GPU is loaded with approximately the same number of grid cells. Although the 1D method needs to transfer more data, the 2D and 3D methods cannot achieve coalesced memory access in the global memory, which results in considerable performance loss when performing CFD applications on multi-GPU HPC clusters.
In this paper, the design and optimizations of the parallel algorithm are closely related to the hardware configurations, numerical schemes, and computational grids to obtain the optimal parallel performance on multi-GPU HPC clusters. The main contributions of this work are summarized as follows:
(i) A hybrid parallel algorithm of MPI and CUDA for CFD applications on multi-GPU HPC clusters is proposed, and optimization methods are adopted to improve the computational efficiency.
(ii) Considering the CFD numerical schemes, a non-blocking communication mode is proposed to fully overlap GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams.
(iii) A 1D domain decomposition method based on the idea of dichotomy is used to distribute the problem among GPUs to balance the workload.
(iv) The proposed algorithm is evaluated with the flat plate flow application, and the parallel performance is analyzed in detail.
The remainder of this paper is organized as follows. Section 2 discusses the related work on GPU-based parallelization and optimizations. Section 3 introduces the governing equations and numerical schemes. Section 4 describes the hybrid parallel algorithm and the optimizations in detail. Section 5 presents the performance evaluation results with the compressible turbulent flow over a flat plate. Section 6 provides the conclusion of this work and a plan for future work.
2. Related Work
In the field of CFD, GPU parallelization for CFD applications has achieved numerous remarkable results. Brandvik and Pullan [20, 21] developed 2D and 3D GPU solvers for the compressible, inviscid Euler equations; this was the first CFD application to use CUDA for 2D and 3D solutions. Ren et al. [22] and Tran et al. [23] proposed GPU-accelerated solvers for turbulent flow based on the lattice Boltzmann method, which achieve a good acceleration performance. Khajeh-Saeed and Blair Perot [24] and Salvadore et al. [25] accomplished direct numerical simulation (DNS) of turbulence using GPU-accelerated supercomputers, demonstrating that scientific problems can benefit significantly from advanced hardware. Ma et al. [26] and Zhang et al. [27] performed GPU computing of compressible flow problems with a meshless method. Xu et al. [14] and Cao et al. [28] described hybrid OpenMP + MPI + CUDA parallel computing of CFD codes; the results showed that the GPU-accelerated algorithm sustainably improved efficiency and scalability. Liu et al. [29] proposed a hybrid solution method for CFD applications on CPU + GPU platforms, with a domain decomposition method based on the functional performance model used to guarantee a balanced workload.
The optimization techniques are used to enhance the performance of GPU acceleration. Memory access and communication are the most critical parts of performance optimization. A review of optimization techniques and the specific improvement factors for each technique is shown in [4].
In a multi-GPU HPC cluster, GPUs cannot communicate directly for data exchange. Meanwhile, the blocking communication mode is simple to implement but has low efficiency. Several researchers have designed non-blocking communication modes for GPU parallelization to overlap computation and communication. Mininni et al. [30] used the non-blocking communication method to overlap GPU computation and CPU-CPU communication. Thibault and Senocak [31], Jacobsen et al. [32], Castonguay et al. [33], and Ma et al. [34] implemented the non-blocking communication method with CUDA streams to fully overlap GPU computation, CPU-CPU communication, and CPU-GPU data transfer. Their results showed that fully overlapping computation and communication is the most efficient approach.
Domain decomposition is a commonly used method in parallel CFD simulations to balance the workload across processors. Jacobsen et al. [32] used 1D domain decomposition to decompose 3D structured meshes into 1D layers. Wang et al. [35] studied the HPC of atmospheric general circulation models (AGCMs) in Earth science research on multiple CPU cores. They indicated that an AGCM with 1D domain decomposition can only run on dozens of CPU cores; therefore, they proposed a 2D domain decomposition parallel algorithm for this large-scale problem. Baghapour et al. [36] executed CFD codes on heterogeneous platforms with 16 Tesla C2075 GPUs, where the solver runs up to 190 times faster than a single core of a Xeon E5645 processor. They pointed out that 3D domain decomposition performs best in bandwidth-bound communication but not in latency-bound communication, in which 1D domain decomposition is preferred. Given that the GPU can execute many threads simultaneously and that communication between the CPU and the GPU becomes a source of high latency for highly noncontiguous data transfers, the 1D domain decomposition method is the most suitable for balancing the workload on GPUs.
3. Governing Equations and Numerical Schemes
The simulation of a compressible turbulent flow is considered. All volume sources due to body forces and volumetric heating are ignored, and the integral form of the 3D Navier–Stokes equations for a compressible, viscous, heat-conducting gas can be expressed as follows [37]:

$$\frac{\partial}{\partial t}\int_{\Omega} \mathbf{W}\,d\Omega + \oint_{\partial\Omega} \left(\mathbf{F}_c - \mathbf{F}_v\right) dS = 0, \tag{1}$$

where $\mathbf{W}$ is the vector of conservative variables, $\mathbf{F}_c$ is the vector of convective fluxes, and $\mathbf{F}_v$ is the vector of viscous fluxes:

$$\mathbf{W} = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ \rho E \end{bmatrix},\quad
\mathbf{F}_c = \begin{bmatrix} \rho V \\ \rho u V + n_x p \\ \rho v V + n_y p \\ \rho w V + n_z p \\ \rho H V \end{bmatrix},\quad
\mathbf{F}_v = \begin{bmatrix} 0 \\ n_x \tau_{xx} + n_y \tau_{xy} + n_z \tau_{xz} \\ n_x \tau_{yx} + n_y \tau_{yy} + n_z \tau_{yz} \\ n_x \tau_{zx} + n_y \tau_{zy} + n_z \tau_{zz} \\ n_x \Theta_x + n_y \Theta_y + n_z \Theta_z \end{bmatrix}, \tag{2}$$

where $\rho$ is the density, $u$, $v$, $w$ are the local Cartesian velocity components, $p$ is the static pressure, $E$ is the total energy, $H$ is the total enthalpy, $n_x$, $n_y$, $n_z$ are the components of the unit normal vector of the cell surface, $V = n_x u + n_y v + n_z w$ is the velocity normal to the surface element $dS$, and $\Theta_x$, $\Theta_y$, $\Theta_z$ stand for the work of the viscous stresses and of the heat conduction, respectively.
The shear stress transport (SST) turbulence model [38] merges the Wilcox $k$–$\omega$ model [39] with a high-Reynolds-number $k$–$\varepsilon$ model [40]. The SST two-equation turbulence model can be written in integral form as follows:

$$\frac{\partial}{\partial t}\int_{\Omega} \mathbf{W}_T\,d\Omega + \oint_{\partial\Omega} \left(\mathbf{F}_{c,T} - \mathbf{F}_{v,T}\right) dS = \int_{\Omega} \mathbf{Q}_T\,d\Omega, \tag{3}$$

where $\mathbf{W}_T$ is the vector of conservative variables, $\mathbf{F}_{c,T}$ and $\mathbf{F}_{v,T}$ represent the convective fluxes and viscous fluxes, respectively, and $\mathbf{Q}_T$ is the source term:

$$\mathbf{W}_T = \begin{bmatrix} \rho k \\ \rho \omega \end{bmatrix}, \tag{4}$$

where $k$ is the turbulent kinetic energy and $\omega$ is the specific dissipation rate. The components of the source term and the model constants are introduced in [37].
The spatial discretization of equation (1) on structured meshes is based on the cell-centered finite volume method. The upwind AUSM+-up scheme [41], which has a high resolution and computational efficiency at all speeds, is used to compute the convective fluxes. Second-order accuracy is achieved through the monotone upstream-centered schemes for conservation laws (MUSCL) [42] with the Van Albada limiter function [43]. The viscous fluxes are solved by the central scheme, and turbulence is modeled with the SST two-equation model.
The solution of equation (1) employs separate discretizations in space and in time, so that space and time integration can be treated separately. For the control volume $\Omega_I$, equation (1) is written in the following semi-discrete form:

$$\frac{d\mathbf{W}_I}{dt} = -\frac{1}{\Omega_I}\mathbf{R}_I. \tag{5}$$
The three-step Runge–Kutta method [44] with third-order accuracy has good data parallelism and low storage overhead, and it is used for the temporal discretization of equation (5):

$$\mathbf{W}_I^{(0)} = \mathbf{W}_I^{n},\qquad
\mathbf{W}_I^{(k)} = \mathbf{W}_I^{(0)} - \alpha_k \frac{\Delta t_I}{\Omega_I}\mathbf{R}_I^{(k-1)},\quad k = 1, 2, 3,\qquad
\mathbf{W}_I^{n+1} = \mathbf{W}_I^{(3)}, \tag{6}$$

where $\Delta t_I$ is the time step of control volume $\Omega_I$ and $\mathbf{R}_I$ stands for the residual.
4. Parallel Algorithm
4.1. Algorithm Description
The GPU implementation uses CUDA. CUDA is a general-purpose parallel computing platform and programming model for GPUs. CUDA provides a programming environment for high-level languages such as C/C++, Fortran, and Python. For NVIDIA GPUs, CUDA has wider universality than other general programming models such as OpenCL and OpenACC [9, 45–47]. In this study, we choose CUDA as the heterogeneous model to design GPU-accelerated parallel CFD codes for the GTX 1070 and Tesla V100 GPUs. A complete GPU code consists of seven parts: getting the device, allocating memory, transferring data from the host to the device, executing kernels, transferring data from the device to the host, freeing memory, and resetting the device. In the CUDA programming framework, the execution of a GPU code is divided into host code and device code, which are executed on the CPU and the GPU, respectively. The device-side code calls kernel functions that execute on the GPU. A kernel corresponds to a thread grid, which consists of several thread blocks. One thread block contains multiple threads, and a thread is the smallest execution unit of a kernel. Threads within a block can cooperate by sharing data through shared memory. Although the threads execute the same instructions, the processed data are different; this execution mode is called single instruction, multiple threads (SIMT).
The GPU is rich in computing power but limited in memory capacity, whereas the CPU is the opposite. To run CFD applications on a GPU, we must fully utilize both the CPU and the GPU. Thus, the GPU is responsible for executing kernel functions, and the CPU only manages the execution of the GPU and communication. In a GPU parallel computing program, a thread is the smallest unit of kernel execution. Threads are executed concurrently in the streaming multiprocessor (SM) as warps, and a warp contains 32 threads. Therefore, the optimal number of threads in each thread block is an integer multiple of 32. For current devices, the maximum number of threads in a thread block is 1024. The more threads per thread block, the fewer thread blocks an SM can schedule, which diminishes the GPU's advantage of using multithreading to hide the latencies of memory access and instruction execution because of thread synchronization. Too few threads in the thread block result in idle threads and thereby insufficient device utilization. In this study, a thread block size of 256 is used. The number of thread blocks is determined by the scale of the workload to ensure that each thread is loaded with the computation of one grid cell.
MPI and OpenMP are two application programming interfaces that are widely used for running parallel codes on a multi-GPU platform. OpenMP can be utilized within a node for fine-grained parallelization using shared memory, whereas MPI works on both shared and distributed memory and is widely used for massive parallel computations. In general, MPI exhibits some performance loss compared with OpenMP but is relatively easy to run on different types of hardware and has good scalability [48, 49]. In this work, MPI-based communication for shared or distributed memory is hybridized with CUDA to implement large-scale computation on multi-GPU HPC clusters. The parallel algorithm for CFD on multi-GPU clusters based on MPI and CUDA is shown in Algorithm 1. Note that "Block_size" and "Thread_size" stand for the number of thread blocks and the thread block size, respectively. At the beginning of the calculation, the MPI_Init function is called to enter the MPI environment. The Device_Query function is called to select the GPU. Then, data are transferred from the CPU to the GPU using the cudaMemcpyHostToDevice function, and a series of kernel functions, including boundary processing, local time step computation, calculation of the gradients of the primitive variables, flux calculation, and update of the primitive variables, are executed on the GPU. Among these kernel functions, the calculation of gradients and fluxes consumes the largest amount of time. For multi-GPU parallel computing, data exchange is required between GPUs, and the Primitive_Variables_Exchange and Grad_Primitive_Variables_Exchange functions are used to exchange the primitive variables and their gradients, respectively. After the kernel iteration ends, the cudaMemcpyDeviceToHost function is called to transfer the data from the GPU to the CPU for postprocessing. Finally, the MPI_Finalize function is called to exit the MPI environment.

The data transfer is essential in multi-GPU parallel computing. In this paper, data exchange between GPUs is staged through the CPUs controlling the GPUs. The data transfer process between GPUs is shown in Figure 1. First, we call the cudaMemcpyDeviceToHost function to transfer data from the device to the relevant host through the PCIe bus. Then, the data are transferred between CPUs with MPI. Finally, we call the cudaMemcpyHostToDevice function to transfer the data from the host to the target device through the PCIe bus.
4.2. Algorithm Optimizations
The performance of the GPU parallel program can be optimized using two main methods: parallel execution optimization for the highest degree of parallelism utilization and memory access optimization for maximum memory throughput.
4.2.1. Parallel Execution Optimization
CUDA programs have two execution modes: synchronous and asynchronous. In synchronous mode, control does not return to the host until the current kernel function has finished executing. In asynchronous mode, control returns to the host immediately once the kernel function is launched; therefore, the host can launch new kernel functions and perform data exchange simultaneously. A stream is a sequence of commands that is managed by the CUDA program to control device-level parallelism. Commands in one stream are executed in order, but commands from different streams can be executed in parallel. Thus, the concurrent execution of multiple kernel functions, called asynchronous concurrent execution, can be implemented via streams. The massive parallel computing of CFD on a GPU requires a series of kernel functions to be executed, and their concurrent execution can be implemented via streams. CUDA creates and destroys streams through the cudaStreamCreate and cudaStreamDestroy functions, and synchronization between the host and the device can be achieved through the cudaDeviceSynchronize function. The implementation of the asynchronous concurrent execution algorithm of CFD on the GPU is shown in Algorithm 2. The initialization of the gradients and residuals, the boundary condition processing, and the time step calculation are executed concurrently by creating streams. In addition, the calculation of the inviscid fluxes can be executed concurrently with the calculation of the gradients of the primitive variables.

4.2.2. Memory Access Optimization
Data transfer between the host and the device is realized via the PCIe bus, whose transmission speed is considerably lower than the GPU memory bandwidth. Therefore, the application should reduce data exchange between the host and the device as much as possible. In this work, the kernel iteration is performed entirely on the GPU, and data transmission occurs only at the beginning and end of the kernel iteration. Intermediate variables can be created in device memory and released after the calculation is completed. However, for multi-GPU parallel computing, data transfer between the host and the device is inevitable because of the way GPUs communicate.
The GPU provides several levels of memory: global memory, texture memory, shared memory, and registers. This hierarchy allows the GPU to reduce the number of accesses to global memory. The global memory resides in the video memory and has a large access latency; therefore, the maximum bandwidth can only be obtained by employing coalesced memory access. Texture memory is also part of the video memory. Compared with global memory, texture memory can use its cache to improve the data access speed and obtain a high bandwidth without strictly observing the conditions of coalesced memory access. Shared memory has a bandwidth much higher than that of global and texture memory. Data sharing among threads in an SM can be realized by storing frequently used data in shared memory. A register is the exclusive storage of a thread and stores the variables declared in the kernel to accelerate data processing. The GPU computing efficiency can be improved by properly using texture memory, shared memory, and registers and by reducing the number of accesses to global memory.
4.3. Nonblocking Communication Mode
When performing the parallel computing of CFD on multi-GPU HPC clusters, the kernel iteration process needs to exchange data on the boundary, including the primitive variables and their gradients. The primitive variables are chosen so that as little data as possible are exchanged. In this process, both CPU-CPU communication and CPU-GPU data transfer exist. Optimizing the communication between GPUs significantly affects the performance of multi-GPU parallel systems.
The traditional method is the blocking communication mode, as shown in Figure 2. Algorithm 3 shows the algorithm of the blocking communication mode, which calls the MPI_Bsend and MPI_Recv functions for data transmission and reception, respectively. In the blocking communication mode, GPU computing, CPU-CPU communication, and CPU-GPU data transfer are completely separated. The communication time in this mode is pure overhead, which seriously reduces the efficiency of the parallel system.

The non-blocking communication mode hides the communication time behind the computing time by overlapping computation and communication, as shown in Figure 3. Algorithm 4 shows the algorithm of the non-blocking communication mode. The overlaps among GPU computing, CPU-CPU communication, and CPU-GPU data transfer are achieved by creating two CUDA streams. When the primitive variables are exchanged, stream 0 is used for CPU-GPU data transfer, and stream 1 is used for boundary condition processing and time step calculation. When the gradients of the primitive variables are exchanged, stream 1 is used to calculate the inviscid fluxes. This is possible because the computation of the inviscid fluxes does not depend on the values being transferred among GPUs. The transfer of the gradients to the host can start as soon as the data are packaged, using stream 0. Then, the data transmission between CPUs is implemented with MPI. At the same time, the Convective_Flux_GPU function is executed using stream 1. Finally, the target device receives the data from the host with stream 0. Therefore, the overlaps among GPU computing, CPU-CPU communication, and CPU-GPU data transfer are realized. In this work, the non-blocking communication mode based on multistream computing is used to optimize the communication of GPU parallel programs. It calls the MPI_Isend and MPI_Irecv functions for data transmission and reception, respectively, and the MPI_Waitall function to wait for communication completion and query the completion status. The cudaMemcpyAsync function is used for asynchronous data transmission.

4.4. Domain Decomposition and Load Balancing
In multi-GPU parallel computing, the computational grid needs to be partitioned. Considering the load balancing problem, this work uses the 1D domain decomposition method so that each GPU is loaded with approximately the same number of grid cells. The 1D domain decomposition is shown in Figure 4. This method adopts the concept of bisection, which is simple to implement and facilitates load balancing. The dichotomy algorithm is shown in Algorithm 5.

Meanwhile, coalesced memory access is easy to implement because the boundary data are aligned, which effectively improves the efficiency of boundary-data communication. When the non-blocking communication mode is used, the data in the GPU are divided into three parts (top, middle, and bottom). The top and bottom parts need to be exchanged with other devices, and their transfer can occur simultaneously with the computation of the middle part.
5. Results and Discussion
5.1. Test Case: Flat Plate Flow
The supersonic flow over a flat plate is a well-known benchmark problem for compressible turbulent flow in CFD applications [50, 51]. This test case has been studied by many researchers and is widely used to verify and validate CFD codes.
The free stream Mach number is 4.5, the Reynolds number based on the unit length is , the static temperature is 61.1 K, and the angle of attack is 0°. A no-slip boundary condition is applied at the stationary flat plate surface, which is also assumed to be adiabatic.
The supersonic flat plate boundary layer problem is solved on meshes of various sizes, namely, mesh 1 (0.72 million cells), mesh 2 (1.44 million), mesh 3 (2.88 million), mesh 4 (5.76 million), mesh 5 (11.52 million), mesh 6 (23.04 million), mesh 7 (46.08 million), and mesh 8 (92.16 million).
5.2. Hardware Environment and Benchmarking
In this work, two types of devices, the GTX 1070 GPU and the Tesla V100 GPU, both from NVIDIA Corporation, are used. Table 1 shows the main performance parameters of the GPUs. For both devices, the single-precision floating-point performance far exceeds the double-precision one; therefore, single-precision data are used for GPU parallelization. The performance of the Volta architecture Tesla V100 GPU is greatly improved compared with previous-architecture GPUs.

In this study, we use CUDA version 10.1, Visual Studio 2017 for the C code, and MPICH2 1.4.1 for MPI communication. In addition, each node contains one Intel Xeon E5-2670 CPU at 2.6 GHz with eight cores and four GPUs.
5.3. Performance Analysis
Speedup (SP) and parallel efficiency (PE) are important parameters for measuring the performance of a hybrid parallel algorithm. Speedup is defined as follows [9, 52]:

$$SP = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}, \tag{7}$$

where $T_{\mathrm{CPU}}$ is the runtime of one iteration step for one CPU with eight cores and $T_{\mathrm{GPU}}$ is the runtime of one iteration step for GPUs. Let $W_1$ and $W_2$ be the problem sizes, $n_1$ and $n_2$ the numbers of GPUs, and $t_1$ and $t_2$ the computation times, where subscript 1 denotes a single GPU and subscript 2 stands for the multi-GPU HPC cluster. If $W_1 = W_2$, the problem size remains constant when the number of GPUs is increased, and the parallel efficiency of strong scalability is

$$PE_{\mathrm{strong}} = \frac{n_1 t_1}{n_2 t_2}. \tag{8}$$

If $W_1/n_1 = W_2/n_2$, the problem size grows proportionally with the number of GPUs, so the problem size per GPU remains constant, and the parallel efficiency of weak scalability is

$$PE_{\mathrm{weak}} = \frac{t_1}{t_2}. \tag{9}$$

The runtime of one iteration step is obtained by averaging the execution time over ten thousand time steps.
5.3.1. Single GPU Implementation
For a single GPU implementation, the time required for one iteration step is provided in Table 2 (time is given in milliseconds). Figure 5 shows that the speedup of a single GPU increases with the grid size. For the Tesla V100 GPU, the speedup reaches 135.39 for mesh 1 and 182.11 for mesh 8; thus, the Tesla V100 GPU achieves a considerably greater speedup than the GTX 1070 GPU. GPU parallelization can greatly improve the computational efficiency compared with CPU-based parallel computing. For meshes 7 and 8, the GTX 1070 GPU cannot handle such a large amount of computation because of the limitation of device memory; for our GPU codes, 1 GB of device memory can hold approximately 3 million grid cells. As the grid size increases, the speedup of the GPU code increases gradually, because the proportion of kernel execution time increases relative to that of data arrangement and communication. Meanwhile, the growth rate of the speedup gradually decreases because of the limited numbers of SMs and CUDA cores.

5.3.2. Scalability
In this section, the blocking communication mode is used to study the scalability of the GPU codes. The performance of the GTX 1070 and Tesla V100 multi-GPU HPC clusters is discussed.
Strong scaling tests are performed for meshes 1 to 8 on multi-GPU HPC clusters. Tables 3 and 4 provide the time required for one iteration step for the multi-GPU implementation (time is given in milliseconds). Figures 6 and 7 show the strong speedup and parallel efficiency, respectively. In Figure 6, the strong speedup is shown for different grid sizes; evidently, a large grid size reaches a high speedup. For the GTX 1070 and Tesla V100 multi-GPU clusters, the speedups of four GPUs reach 172.59 and 576.49, respectively. These values are far greater than the speedups achieved by a single GPU; thus, a high degree of strong scaling performance is maintained. Multi-GPU parallelization can considerably improve the computational efficiency as the grid size increases. However, multi-GPU parallel computing does not show obvious advantages when the grid size is small, because the relative weight of data exchange is inversely proportional to the grid size, and the GPU is specialized for compute-intensive, highly parallel computation. In Figure 7, the strong parallel efficiency is shown for different grid sizes. This result is consistent with the trend of the speedup; that is, the parallel efficiency increases with the grid size, and the strong parallel efficiency of the GTX 1070 multi-GPU clusters is slightly better than that of the Tesla V100 GPUs. For mesh 8, the strong parallel efficiency of four Tesla V100 GPUs is close to 80%. In addition, a larger number of GPUs indicates a lower parallel efficiency because of the increase in the amount of data transfer. Figure 8 shows the amount of memory communication for parallel computing with four GPUs; as the grid size increases, the amount of memory communication increases proportionally.
The weak scaling tests for parallel efficiency are shown in Figure 9, in which the grid size loaded on each GPU remains constant. As expected, the parallel efficiency decreases as the number of GPUs increases because the amount of data exchange grows with the number of GPUs. With the grid size of mesh 6 loaded on each GTX 1070 GPU and each Tesla V100 GPU, the weak parallel efficiency of four GPUs reaches 88.05% and 79.68%, respectively. Thus, a high degree of weak scaling performance is maintained.
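In a weak scaling test the workload per GPU is fixed, so the ideal run time is flat as GPUs are added; weak efficiency is then simply the one-GPU time divided by the n-GPU time. A minimal sketch, with illustrative timings that are placeholders rather than the paper's measurements:

```python
def weak_efficiency(t_one_gpu, t_n_gpus):
    """Weak parallel efficiency: per-GPU workload fixed, ideal time is flat."""
    return t_one_gpu / t_n_gpus

# Hypothetical per-iteration times in milliseconds (placeholders)
print(weak_efficiency(50.0, 56.8))  # ~0.88, i.e., about 88% weak efficiency
```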
5.3.3. Performance of Non-Blocking Mode between GPUs
In this section, the non-blocking communication mode is used to study the performance of the GTX 1070 and Tesla V100 multi-GPU HPC clusters. As an example, mesh 6 is investigated on one node with four GPUs to compare the performance of the two communication modes.
Figures 10 and 11 show the strong scalability of the non-blocking communication mode compared with the blocking communication mode, and Figure 12 shows the weak parallel efficiency of the two methods. The results show that the non-blocking communication mode can hide the communication time behind the computing time by overlapping communication and computation. For four GPUs, the strong speedups of the GTX 1070 and Tesla V100 multi-GPU clusters increase by 15.15% and 19.63%, respectively. In addition, the weak parallel efficiency remains above 87% with the non-blocking communication mode. Moreover, the GPU-based parallelization with Tesla V100 GPUs exhibits a larger performance improvement than the GTX 1070 multi-GPU clusters. This is because the relative weight of the communication time is higher for the Tesla V100 GPUs.
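A simple cost model illustrates why the overlap pays off. In blocking mode, the per-iteration cost is roughly the sum of kernel execution and ghost-cell exchange; in non-blocking mode, one common arrangement (assumed here as a sketch, not the paper's exact schedule) computes the boundary cells first and then overlaps their exchange on one CUDA stream with the interior-cell kernels on the other, so communication is hidden up to the interior compute time. The timings below are illustrative placeholders.

```python
def blocking_time(t_interior, t_boundary, t_comm):
    # Blocking mode: compute everything, then exchange ghost cells; costs add.
    return t_interior + t_boundary + t_comm

def nonblocking_time(t_interior, t_boundary, t_comm):
    # Non-blocking mode: boundary cells are computed first, then their
    # exchange (stream 2) overlaps with the interior kernels (stream 1).
    return t_boundary + max(t_interior, t_comm)

# Illustrative per-iteration costs in ms (placeholders, not measured data)
t_int, t_bnd, t_comm = 20.0, 2.0, 5.0
print(blocking_time(t_int, t_bnd, t_comm))     # -> 27.0
print(nonblocking_time(t_int, t_bnd, t_comm))  # -> 22.0, comm fully hidden
```

Under this model, the benefit is capped by whichever of interior computation and communication is longer, which is consistent with the Tesla V100 clusters (whose communication weight is relatively larger) gaining more from the overlap.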
6. Conclusions
In this work, a hybrid parallel algorithm of MPI and CUDA for CFD applications on multi-GPU HPC clusters has been proposed. Optimization methods have been adopted to achieve the highest degree of parallelism and maximum memory throughput. A thread block size of 256 is used, and the number of thread blocks is determined by the scale of the workload so that each thread handles the computation of one grid cell. For a single GPU implementation, two types of devices have been discussed. We obtain an acceleration ratio of more than 36 times, which indicates that GPU parallelization can greatly improve computational efficiency compared with CPU-based parallel computing. Meanwhile, a large grid size can reach a high speedup owing to the increased proportion of kernel execution relative to data arrangement and communication. The speedups of four GPUs reach 172.59 and 576.49 for the GTX 1070 and Tesla V100 multi-GPU HPC clusters, respectively. The strong and weak parallel efficiency are maintained at a high level when the grid size is large; thus, the parallel algorithm has good strong and weak scalability. A non-blocking communication mode has been proposed to fully overlap GPU computing, CPU-CPU communication, and CPU-GPU data transfer. For four GPUs, the strong speedups of the GTX 1070 and Tesla V100 multi-GPU clusters increase by 15.15% and 19.63%, respectively. In addition, the weak parallel efficiency remains above 87% with the non-blocking communication mode.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation (NNSF of China) project (no. 11472004). Also, the first author would like to thank Dr. Feng Liu.
Copyright
Copyright © 2020 Jianqi Lai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.