I am computing a nested-loop operation using OpenCL (Open Computing Language). My main question: given the code outlined below, how might I improve its speed and efficiency on the GPU, for instance by using local memory instead of global memory?
Given some input vector \$V=(v_x, v_y, v_z)\$ (in math terms, the input vector is a point in reciprocal space where we evaluate the Fourier transform), the desired computation iterates over a collection of atoms in space (of different species, like hydrogen, oxygen, etc.) and uses the input vector to compute a function. Each atom is described by a 4-tuple, consisting of 3 spatial coordinates, \$A=(a_x, a_y, a_z)\$, and 1 species identifier, `species_id`, which ranges from 0 to the number of unique atom species.
The iterations look like this (in pseudocode):

```
for V in input_vectors:
    ft_magnitude = 0
    for (A, species_id) in atoms:
        ft_magnitude += get_scale_factor(V, species_id) * exp(-i * dot(A, V))
```
where `i` in the exponential is the complex number \$i=\sqrt{-1}\$, the term `dot(A,V)` is the dot product of the two vectors, \$A \cdot V = a_x \cdot v_x + a_y \cdot v_y + a_z \cdot v_z\$, and `get_scale_factor(V, species_id)` is a lookup operation which returns the correct, pre-computed scale factor, which changes with each vector `V` and `species_id`.
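To make the target computation concrete, here is a small NumPy sketch of the pseudocode above, checked against a vectorized equivalent. The sizes, random data, and variable names here are toy placeholders, not values from the real problem:

```python
import numpy as np

# Toy sizes; the real problem uses ~1e6 vectors and ~1e5 atoms.
rng = np.random.default_rng(0)
n_vectors, n_atoms, n_species = 4, 6, 2

input_vectors = rng.standard_normal((n_vectors, 3))   # one V per row
atom_coords = rng.standard_normal((n_atoms, 3))       # one A per row
species_id = rng.integers(0, n_species, n_atoms)      # species per atom
scale_factors = rng.random((n_species, n_vectors))    # per species, per V

# Direct translation of the nested loops.
ft = np.zeros(n_vectors, dtype=np.complex128)
for i_v, V in enumerate(input_vectors):
    for A, s in zip(atom_coords, species_id):
        ft[i_v] += scale_factors[s, i_v] * np.exp(-1j * A.dot(V))

# Vectorized equivalent: phases has shape (n_atoms, n_vectors).
phases = np.exp(-1j * atom_coords @ input_vectors.T)
ft_vec = np.einsum('av,av->v', scale_factors[species_id, :], phases)

assert np.allclose(ft, ft_vec)
```

The vectorized form is only a correctness reference; at the real problem sizes the `(n_atoms, n_vectors)` intermediate would not fit in memory, which is why the GPU kernel loops over atoms per work-item.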
The idea of using a GPU for this problem is to let each GPU work-item compute the output for one vector `V`. Following several examples I saw online, I wrote the OpenCL kernel posted below.
The inputs to the kernel are the following:

- `input_vectors`, which has length `3*n_vectors` (3 per input vector)
- `atoms`, which has length `4*n_atoms` (4 per atom)
- `scale_factors`, which has length `n_species * n_vectors` (1 per atom species per input vector)
- `outputs`, which has length `2*n_vectors` (1 complex number per input vector, representing the Fourier transform magnitude)
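Since all of these are flat 1D buffers, the indexing conventions matter. This short sketch (toy sizes, hypothetical variable names) shows that the `species * n_vectors + i_v` index used in the kernel corresponds to a species-major 2D table of shape `(n_species, n_vectors)`:

```python
import numpy as np

n_vectors, n_species = 5, 3

# The scale-factor table as a 2D array: one row per species.
as_2d = np.arange(n_species * n_vectors, dtype=np.float32).reshape(
    n_species, n_vectors)

# The flat 1D buffer actually passed to the kernel (C row-major order).
flat = as_2d.ravel()

# The kernel's lookup: scale_factors[species * n_vectors + i_v]
species, i_v = 2, 4
assert flat[species * n_vectors + i_v] == as_2d[species, i_v]
```

The same pattern applies to `input_vectors` (stride 3) and `atoms` (stride 4).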
Here is the kernel (note the use of cosine and sine to evaluate the complex exponential using Euler's formula):

```c
__kernel void compute(
    __global float *input_vectors,
    __global float *atoms,
    __global float *scale_factors,
    __global float2 *outputs,
    const int n_vectors,
    const int n_atoms)
{
    int i_v = get_global_id(0);
    if (i_v < n_vectors) {
        // The input vector handled by this work-item.
        float vx = input_vectors[i_v*3];
        float vy = input_vectors[i_v*3 + 1];
        float vz = input_vectors[i_v*3 + 2];
        for (int i_a = 0; i_a < n_atoms; i_a++) {
            float ax = atoms[i_a*4];
            float ay = atoms[i_a*4 + 1];
            float az = atoms[i_a*4 + 2];
            int species = (int) atoms[i_a*4 + 3];
            float factor = scale_factors[species * n_vectors + i_v];
            float dot = vx*ax + vy*ay + vz*az;
            // exp(-i*dot) = cos(dot) - i*sin(dot), via Euler's formula.
            outputs[i_v].x += factor * native_cos(-dot);
            outputs[i_v].y += factor * native_sin(-dot);
        }
    }
}
```
I am using PyOpenCL to wrap the kernel, but could be convinced to switch to something else if it would get speedups. To give a sense of scale, `n_vectors` is typically on the order of 1,000,000 and `n_atoms` on the order of 100,000, with about 10 different atom species.
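At these sizes the total work is dominated by the inner loop; a quick back-of-envelope count (my own arithmetic, using the orders of magnitude quoted above):

```python
n_vectors = 1_000_000
n_atoms = 100_000

# One complex exponential (plus a lookup and a multiply-add)
# per (vector, atom) pair.
inner_terms = n_vectors * n_atoms
assert inner_terms == 10**11  # ~1e11 evaluations per full pass
```

This is why per-iteration costs inside the atom loop (such as the repeated global-memory read-modify-write of `outputs[i_v]`) matter so much here.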
Here is the GPU information for the machine I am using:
```
81:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
        Subsystem: NVIDIA Corporation 12GB Computational Accelerator
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, IRQ 64
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 27800000000 (64-bit, prefetchable) [size=16G]
        Memory at 27c00000000 (64-bit, prefetchable) [size=32M]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidia, nouveau, nvidiafb
```