I am computing a nested-loop operation using OpenCL (Open Computing Language). My main question: given the code outlined below, how might I improve its speed and efficiency on the GPU, for instance by using local memory instead of global memory?
Given some input vector \$V=(v_x, v_y, v_z)\$ (in math terms, the input vector is a point in reciprocal space where we evaluate the Fourier transform), the desired computation iterates over a collection of atoms in space (of different species, like hydrogen, oxygen, etc.) and uses the input vector to compute a function. Each atom is described by a 4-tuple, consisting of 3 spatial coordinates, \$A=(a_x, a_y, a_z)\$, and 1 species identifier, `species_id`, which ranges from 0 to the number of unique atom species.
The iterations look like this (in pseudocode):

```
for V in input_vectors:
    ft_magnitude = 0
    for (A, species_id) in atoms:
        ft_magnitude += get_scale_factor(V, species_id) * exp(-i * dot(A, V))
```
where `i` in the exponential is the complex number \$i=\sqrt{-1}\$, the term `dot(A,V)` is the dot product of the two vectors, \$A \cdot V = a_x \cdot v_x + a_y \cdot v_y + a_z \cdot v_z\$, and `get_scale_factor(V, species_id)` is a lookup operation which returns the correct, pre-computed scale factor, which changes with each vector `V` and `species_id`.
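To make the target computation concrete, here is a small NumPy sketch of the pseudocode above, checked against a vectorized equivalent. The sizes, random data, and variable names here are toy placeholders, not values from the real problem:

```python
import numpy as np

# Toy sizes; the real problem uses ~1e6 vectors and ~1e5 atoms.
rng = np.random.default_rng(0)
n_vectors, n_atoms, n_species = 4, 6, 2

input_vectors = rng.standard_normal((n_vectors, 3))   # one V per row
atom_coords = rng.standard_normal((n_atoms, 3))       # one A per row
species_id = rng.integers(0, n_species, n_atoms)      # species per atom
scale_factors = rng.random((n_species, n_vectors))    # per species, per V

# Direct translation of the nested loops.
ft = np.zeros(n_vectors, dtype=np.complex128)
for i_v, V in enumerate(input_vectors):
    for A, s in zip(atom_coords, species_id):
        ft[i_v] += scale_factors[s, i_v] * np.exp(-1j * A.dot(V))

# Vectorized equivalent: phases has shape (n_atoms, n_vectors).
phases = np.exp(-1j * atom_coords @ input_vectors.T)
ft_vec = np.einsum('av,av->v', scale_factors[species_id, :], phases)

assert np.allclose(ft, ft_vec)
```

The vectorized form is only a correctness reference; at the real problem sizes the `(n_atoms, n_vectors)` intermediate would not fit in memory, which is why the GPU kernel loops over atoms per work-item.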
The idea of using a GPU for this problem is to let each GPU work-item compute the output for one vector `V`. Following several examples I saw online, I wrote the OpenCL kernel posted below.
The inputs to the kernel are the following:

- `input_vectors`, which has length `3*n_vectors` (3 per input vector)
- `atoms`, which has length `4*n_atoms` (4 per atom)
- `scale_factors`, which has length `n_species * n_vectors` (1 per atom species per input vector)
- `outputs`, which has length `2*n_vectors` (1 complex number per input vector, representing the Fourier transform magnitude)
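Since all of these are flat 1D buffers, the indexing conventions matter. This short sketch (toy sizes, hypothetical variable names) shows that the `species * n_vectors + i_v` index used in the kernel corresponds to a species-major 2D table of shape `(n_species, n_vectors)`:

```python
import numpy as np

n_vectors, n_species = 5, 3

# The scale-factor table as a 2D array: one row per species.
as_2d = np.arange(n_species * n_vectors, dtype=np.float32).reshape(
    n_species, n_vectors)

# The flat 1D buffer actually passed to the kernel (C row-major order).
flat = as_2d.ravel()

# The kernel's lookup: scale_factors[species * n_vectors + i_v]
species, i_v = 2, 4
assert flat[species * n_vectors + i_v] == as_2d[species, i_v]
```

The same pattern applies to `input_vectors` (stride 3) and `atoms` (stride 4).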
Here is the kernel (note the use of cosine and sine to evaluate the complex exponential using Euler's formula):

```c
__kernel void compute(
    __global float *input_vectors,
    __global float *atoms,
    __global float *scale_factors,
    __global float2 *outputs,
    const int n_vectors,
    const int n_atoms)
{
    int i_v = get_global_id(0);
    if (i_v < n_vectors) {
        // The input vector handled by this work-item.
        float vx = input_vectors[i_v*3];
        float vy = input_vectors[i_v*3 + 1];
        float vz = input_vectors[i_v*3 + 2];
        for (int i_a = 0; i_a < n_atoms; i_a++) {
            float ax = atoms[i_a*4];
            float ay = atoms[i_a*4 + 1];
            float az = atoms[i_a*4 + 2];
            int species = (int) atoms[i_a*4 + 3];
            float factor = scale_factors[species * n_vectors + i_v];
            float dot = vx*ax + vy*ay + vz*az;
            // exp(-i*dot) = cos(dot) - i*sin(dot), via Euler's formula.
            outputs[i_v].x += factor * native_cos(-dot);
            outputs[i_v].y += factor * native_sin(-dot);
        }
    }
}
```
I am using PyOpenCL to wrap the kernel, but could be convinced to switch to something else if it would get speedups. To give a sense of scale, `n_vectors` is typically on the order of 1,000,000 and `n_atoms` on the order of 100,000, with about 10 different atom species.
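At these sizes the total work is dominated by the inner loop; a quick back-of-envelope count (my own arithmetic, using the orders of magnitude quoted above):

```python
n_vectors = 1_000_000
n_atoms = 100_000

# One complex exponential (plus a lookup and a multiply-add)
# per (vector, atom) pair.
inner_terms = n_vectors * n_atoms
assert inner_terms == 10**11  # ~1e11 evaluations per full pass
```

This is why per-iteration costs inside the atom loop (such as the repeated global-memory read-modify-write of `outputs[i_v]`) matter so much here.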
Here is the GPU information for the machine I am using:
```
81:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
        Subsystem: NVIDIA Corporation 12GB Computational Accelerator
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, IRQ 64
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 27800000000 (64-bit, prefetchable) [size=16G]
        Memory at 27c00000000 (64-bit, prefetchable) [size=32M]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidia, nouveau, nvidiafb
```