Zero to GPU hero with OpenACC
Overview
Jeff Larkin of Nvidia produced a lecture, Zero to GPU hero with OpenACC.
He notes that there are basically three approaches to writing parallelized code:
Parallel syntax in a standard language
Directives in OpenACC
CUDA
Standard languages
C++ and Fortran (2008) already include parallel features that compilers can accelerate on GPUs. Here is an example from Fortran:
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
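For comparison, here is a hedged sketch (not from the talk) of the C++ counterpart: a standard parallel algorithm that a compiler such as nvc++ with -stdpar can offload to the GPU.

#include <algorithm>
#include <execution>
#include <vector>

// saxpy expressed as a standard C++ parallel algorithm; with
// nvc++ -stdpar the lambda can run on the GPU.
void saxpy(float a, const std::vector<float> &x, std::vector<float> &y) {
  std::transform(std::execution::par_unseq, x.begin(), x.end(),
                 y.begin(), y.begin(),
                 [a](float xi, float yi) { return yi + a * xi; });
}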
Directives
Compiler directives are a middle ground between the standard features of a language and CUDA. The OpenACC directives annotate your existing code, giving a specialized compiler additional information (in comments) that can enable parallelization.
Such compiler directives enhance the language and enable the code to run well on GPUs when it is compiled on specific compilers, such as the Nvidia HPC compilers.
!$acc data copy(x,y)
...
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
...
!$acc end data
CUDA
Maximum level of performance, but at the cost of portability. This code runs well on GPUs but isn't available anywhere else; it can't run on CPUs, for example.
! CUDA Fortran saxpy
module saxpy_mod
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    integer, value :: n
    real, value :: a
    real :: x(n), y(n)
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module saxpy_mod

program main
  use saxpy_mod
  integer, parameter :: N = 1024
  real :: a = 2.0
  real :: x(N), y(N)
  real, device :: d_x(N), d_y(N)
  x = 1.0; y = 0.0
  d_x = x   ! host-to-device copy
  d_y = y
  call saxpy<<<(N+255)/256, 256>>>(N, a, d_x, d_y)
  y = d_y   ! device-to-host copy
end program main
Amdahl’s Law
The performance gains made available by parallelization are constrained by the serial steps, such as data movement (both I/O and transfers to and from the GPU) and atomic operations.
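Written as a formula (the talk states the law only informally): if a fraction p of the runtime is parallelizable and that part is sped up by a factor s, the overall speedup S is bounded by

S = \frac{1}{(1 - p) + p/s}

so the serial fraction (1 - p) sets a ceiling on performance no matter how fast the GPU is.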
Types of directives
Initializing parallel execution
Managing data movement (if the CPU and GPU have distinct memory)
Optimization, such as loop mapping
Directive syntax
!$acc directive clauses
Think of the directive like a function call and the clauses as arguments passed to the function.
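For instance, in this hypothetical illustration (not from the talk), parallel loop is the directive and vector_length(128) is a clause passed to it:

// "parallel loop" is the directive; vector_length(128) is a clause
// tuning how the loop is mapped onto the GPU.
void scale(int n, float *a, float s) {
  #pragma acc parallel loop vector_length(128)
  for (int i = 0; i < n; i++)
    a[i] *= s;
}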
CUDA managed memory (aka CUDA unified virtual memory)
Fundamentally, the CPU has access to a large block of relatively slow system memory, while the GPU has access to a smaller block of faster GPU memory.
Passing data between these two memory devices is a serial process that is bandwidth limited. The link between the two memories is PCI-Express or NVLink.
With CUDA managed memory, the programmer does not need to decide where the data resides; placement is handled by the operating system and the GPU driver.
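A minimal sketch (assuming compilation with nvc++ -acc -gpu=managed): heap allocations such as a std::vector's storage become managed memory, so an OpenACC loop can touch the data without explicit transfers.

#include <vector>

void scale_managed(std::vector<float> &v, float s) {
  float *p = v.data();
  int n = static_cast<int>(v.size());
  // With -gpu=managed the driver migrates these pages to the GPU on
  // first touch and back to the CPU when accessed there.
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    p[i] *= s;
}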
miniWeather
This tutorial uses the miniWeather fluid dynamics application.
General steps
First step
Gather the application’s initial performance profile. There are many tools for doing this, such as NVIDIA Nsight Systems, gprof, Tau, Vampir, or HPCToolkit.
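For example, an Nsight Systems profile can be gathered from the command line with something like (invocation assumed):

nsys profile -o baseline ./miniWeather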
Adding directives
By adding an “acc parallel loop” directive, the compiler is made aware that the programmer wants to parallelize the loop and that it is safe to do so.
This example is in C++:
#pragma acc parallel loop collapse(2) \
  private(ll,s,inds,stencil,vals,d3_vals,r,u,w,t,p)
The collapse(2) clause tells the compiler to fuse the next two loops into a single parallel iteration space. GPUs excel when there are more opportunities for parallelization, so it's a good idea to use the collapse clause aggressively to expose as much parallelism as possible to the compiler.

The private clause tells the compiler that each iteration of the parallel computation needs its own copy of the listed variables in order to prevent a race condition.
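Put together, a hedged sketch of the pattern (loop bounds and variable names are placeholders, not miniWeather's):

void update(int nz, int nx, float *state) {
  float tmp;
  // collapse(2) fuses the k and i loops into one parallel iteration
  // space; private(tmp) gives each iteration its own copy of tmp.
  #pragma acc parallel loop collapse(2) private(tmp)
  for (int k = 0; k < nz; k++) {
    for (int i = 0; i < nx; i++) {
      tmp = 0.5f * state[k*nx + i];
      state[k*nx + i] = tmp;
    }
  }
}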
Keep in mind that parallel and loop are two distinct directives. The parallel directive creates "gangs" (which map to thread blocks on Nvidia GPUs), and the loop directive specifies which loops to parallelize. Most often, however, the directives are written together as parallel loop.
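A sketch of the distinction (assumed example, not from the talk):

// The directives written separately...
void saxpy_split(int n, float a, const float *x, float *y) {
  #pragma acc parallel       // creates the gangs
  {
    #pragma acc loop         // distributes this loop across them
    for (int i = 0; i < n; i++)
      y[i] += a * x[i];
  }
}

// ...and the common combined form.
void saxpy_combined(int n, float a, const float *x, float *y) {
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}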
Data optimization
Managed memory (unified virtual memory) suffices much of the time for optimizing the flow of data between the CPU and GPU. However, there are data directives for explicit data management. This is an advanced topic that is beyond the scope of Larkin's talk.
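Still, for a taste of what explicit management looks like, here is an illustrative sketch (not from the talk): copyin moves x to the GPU at region entry, while copy moves y in at entry and back out at exit.

void saxpy_explicit(int n, float a, const float *x, float *y) {
  #pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
      y[i] += a * x[i];
  }
}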
Reduction directive
The last addition is the reduction clause. It contributes very little to performance by itself, but it is necessary in order to move the entire time step onto the GPU. When every iteration of a loop performs an operation on the same variable, such as a tendency variable, the reduction clause is needed to prevent a data race.

The reduction clause is told which operation is being performed (such as addition) and which variables the operation is being performed upon:
#pragma acc parallel loop collapse(2) reduction(+:mass_loc, te_loc)
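Fleshed out into a hedged sketch (array names and indexing assumed, simplified from miniWeather): every iteration accumulates into mass_loc and te_loc, so the reduction clause prevents a data race on those sums.

void totals(int nz, int nx, const double *rho, const double *energy,
            double *mass, double *te) {
  double mass_loc = 0.0, te_loc = 0.0;
  #pragma acc parallel loop collapse(2) reduction(+:mass_loc, te_loc)
  for (int k = 0; k < nz; k++)
    for (int i = 0; i < nx; i++) {
      mass_loc += rho[k*nx + i];     // accumulate total mass
      te_loc   += energy[k*nx + i];  // accumulate total energy
    }
  *mass = mass_loc;
  *te   = te_loc;
}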
The compiler itself is good at detecting reductions. If the programmer doesn't add the reduction clause, the compiler might print a -Minfo message notifying the user that it added an implicit reduction clause to a loop. When -Minfo does this, go back and add the clause explicitly.
Compiling
Nvidia's HPC software development toolkit, which contains compilers among other tools, is known as NVHPC. The command to compile Fortran is nvfortran. The compilers support CPUs as well as GPUs, from x86 to ARM to Power.
Compiler options:
-acc enables OpenACC support.
-gpu provides GPU code-generation options; in particular, -gpu=managed makes data visible on both the CPU and GPU.
-Minfo=accel prints compiler feedback on how your code was accelerated (very useful).
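A representative invocation (file name assumed) might be:

nvfortran -acc -gpu=managed -Minfo=accel miniWeather.F90 -o miniWeather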
The -Minfo reporting also tells you when the compiler couldn't accelerate the code, giving you an opportunity to go back and fix any problems.
Sometimes performance actually decreases because of the serial movement of data between the CPU and the GPU. Jeff Larkin says, “Don’t worry, it will go away.” When we use a performance profiler, we’ll see when data movement occurs, and we can start optimizing the code to minimize data movement. Our goal is to make the computations run on the GPU for the entire time step.
Nvidia Nsight Systems
This program provides a GUI that shows exactly what is happening with the GPU kernels and the movement of data between the CPU and GPU.
Note
The NVTX section of the Nvidia Nsight window shows the application's annotated function calls. If there are gaps between function calls, you should determine why the gaps exist and fill in the time with processing on the GPU.
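Those NVTX entries come from annotations in the source. A minimal sketch (assuming the NVTX v3 C API; the function name is illustrative):

#include <nvtx3/nvToolsExt.h>

void perform_timestep() {
  nvtxRangePushA("perform_timestep"); // opens a named range in the timeline
  // ... the work being timed ...
  nvtxRangePop();                     // closes the range
}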
Look at which function calls are preceded and succeeded by gaps and examine the source code. If there is another function that does computation on the CPU then the code might be triggering page faults. Ensure that the data stays on the GPU and doesn’t get moved back and forth between the CPU and the GPU.
In the case of compute_tendencies_z, there is a loop after the function call that adds the tendencies to the fluid state:

else if (dir == DIR_Z) {
  compute_tendencies_z(state_forcing, flux, tend);
}
...
state_out[inds] = state_init[inds] + dt * tend[indt];

This reference to tend[indt] is causing the page fault.
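One hedged fix, consistent with the talk's advice (simplified indexing; names assumed): run the update loop on the GPU as well, so tend stays resident on the device instead of migrating back to the CPU.

void apply_tendencies(int n, double dt, const double *state_init,
                      const double *tend, double *state_out) {
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    state_out[i] = state_init[i] + dt * tend[i];
}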
Final profiling
If done correctly, the GPU acceleration will dramatically speed up the serial code, to the point where it no longer makes sense to compare the serial code to the GPU-enhanced code. Instead, the full-socket MPI performance (all cores of a CPU socket in use) should be compared against the GPU-enhanced performance.
If an application is threaded, then the threaded versus GPU-enhanced performance would be the fair comparison.