Continuum open-sourced their Python CUDA bindings this summer, which were previously part of their paid
Anaconda Accelerate product. Notably, the package provides bindings for linear algebra (BLAS), Fourier transforms,
sparse matrices, random number generation, and sorting algorithms. This is yet another nice contribution to the
open-source scientific programming community by Continuum, and I’m happy to get to play with it. Here I’d like to
compare `numba` and its companion `pyculib` for doing some computation on a graphics processing unit against a
conventional CPU approach. Actually that’s not quite true: I’m comparing CUDA’s FFT to the Intel FFT found in
Intel’s Python distribution, another thing Continuum has a hand in.

I always like to test with real-world applications within my field of interest, because I usually learn something new about the problems I work with. Here I’m going to compute the phase correlation for a stack of 89 images, each 2048 x 2048 pixels, against a template (in this case, the first image in the stack). The basic goal is to correct for image drift across all the frames in a movie. Modern smartphones do this with their cameras more-or-less all the time: when you take a picture, the camera actually records a short movie, then uses correlations to remove the jitter from your unsteady hands and shows you the summed result.
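As a minimal NumPy sketch of the phase-correlation idea (a hypothetical `phase_correlate` helper, exercised on random test data with a known synthetic drift rather than a real image stack):

```python
import numpy as np

def phase_correlate(template, frame):
    """Return the (row, col) shift that best aligns `frame` to `template`."""
    cross = np.fft.fft2(template).conj() * np.fft.fft2(frame)
    # Normalize to unit magnitude: only the phase (i.e. the translation) remains
    xcorr = np.fft.ifft2(cross / (np.abs(cross) + 1e-12))
    peak = np.unravel_index(np.argmax(np.abs(xcorr)), xcorr.shape)
    # Shifts past the array midpoint wrap around to negative values
    return tuple(p - s if p > s // 2 else p for p, s in zip(peak, xcorr.shape))

rng = np.random.default_rng(42)
template = rng.standard_normal((64, 64))
frame = np.roll(template, (5, -3), axis=(0, 1))  # fake a known drift
print(phase_correlate(template, frame))  # -> (5, -3)
```

The benchmark scripts below do the same thing, minus the peak search, on the full 2048 x 2048 frames.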

## Configuration

Here is some information about my test computer:

**Intel Core i7-7820X Skylake-X CPU**

- Physical cores: 8
- Virtual cores: 16
- Clock Frequency: 3.6 GHz

This processor has access to the AVX-512 instruction set, an important factor when trying to maximize parallel
performance, regardless of the number of cores. With the Skylake-X architecture, each core has not one but two
multiply units, and each multiply unit can do 16 single-precision floating-point operations per cycle, so in total
you get 32 `float32` flops/cycle (for multiply and fused multiply-add). Thus the parallelism from the number of
cores (8) is dwarfed by the parallelism from the SIMD vectorized instructions. If you are truly concerned about
performance in CPU computing, you need a program that is both multi-threaded (or multi-processed) and SIMD-aware.
To compare, I have an Nvidia graphics card:

**Nvidia GeForce GTX 1080 GPU**

- CUDA 9.0
- VRAM: 8.0 GB
- CUDA single-precision cores: 2560
- Clock Frequency: 1683 MHz

It’s fairly normal to use such commercial gaming cards in cluster computing. In Switzerland I had the use of a cluster with 6 Titan Xp cards per node. Titans are only marginally faster than 1080s in practical terms, but they have 50 % more VRAM, which matters more in many tasks. The only real reason to buy Tesla cards, IMHO, is if you need double precision, but they usually aren’t cost-effective compared to CPUs.

Roughly speaking, these two computing units are in a similar price range. With the SIMD instructions, we should get about 256 flops/cycle from the CPU (8 cores x 32 flops/cycle). The CPU runs at about twice the clock rate of the GPU, so we expect (very roughly) the GPU to be 5.0x faster than the CPU.
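The back-of-envelope arithmetic behind that 5.0x figure, as a sketch (assuming one `float32` flop per CUDA core per cycle, and the peak per-cycle rates quoted above, which real code rarely reaches):

```python
# Peak float32 throughput: flops/cycle x clock (GHz) gives Gflop/s on each device.
cpu_gflops = 256 * 3.6     # 8 cores x 32 flops/cycle, at 3.6 GHz -> ~920 Gflop/s
gpu_gflops = 2560 * 1.683  # 2560 CUDA cores x 1 flop/cycle, at 1.683 GHz -> ~4300 Gflop/s
print(gpu_gflops / cpu_gflops)  # -> ~4.7, i.e. the "roughly 5x" estimate
```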

Here I’m going to use the Intel Python distribution in a virtual `conda` environment, for the basic reason that
I’m on Windows, and on Windows `pyFFTW` is slow compared to the Intel MKL-FFT (by a factor of three on my machine).
FFTW is basically written for `gcc`, and the newest `gcc` on Windows is, to the best of my knowledge, v5.1 with
TDM-GCC. Furthermore, FFTW hasn’t seen any updates for a couple of processor-architecture generations. The Intel
MKL-FFT simply takes better advantage of the modern AVX-512 SIMD instructions my processor has available.

For matrix operations we’re going to use the NumExpr-3.0 development branch. While Intel provides a NumExpr 2.6.2
build in their conda channel, we really want the auto-vectorization and the `complex64` support that the 3.0 branch
brings. Moreover, for a fair test we have to use `complex64`, because my Nvidia GTX 1080 doesn’t perform well with
double-precision numbers. For the tests, I made a new `conda` environment as follows:

```
conda config --add channels intel
conda create --name intel --channel intel --override-channels intelpython3_core
(source) activate intel
conda install numba pyculib
```

For more information see the Intel Conda install guide.

Currently NumExpr3 is built from source, as follows:

```
git clone https://github.com/pydata/numexpr.git
cd numexpr
git checkout numexpr-3.0
python setup.py install
```

For whatever reason, if you build NumExpr3 with Visual Studio on Windows (but not GCC on Linux), it leaks memory with
the Intel Python distribution, whereas with a conventional Anaconda build it doesn’t. My motherboard doesn’t support
Ubuntu yet, so while I have a VirtualBox install, the performance inside VirtualBox is far worse than on a native
OS. I also couldn’t get CUDA 9.0 working with Ubuntu inside VirtualBox, as the appropriate ver. 384 drivers
aren’t available yet through `apt`. Ideally I’d like to try compiling NumExpr3 with ICC instead and see if that fixes
the problem, but I don’t have a copy.

## The benchmark

I wrote some benchmark scripts, one `cpu.py`, using MKL-FFT and NumExpr-3:

```
import numpy as np
N_threads = 8
import numexpr3 as ne3; ne3.set_nthreads(N_threads)
from time import perf_counter
import sys, os, os.path, pickle

N = int(sys.argv[1])
tries = int(sys.argv[2])
timings = np.zeros(tries)
# Insert your own code to load images here:
import mrcz
frames = mrcz.readDM4('Series_10_HAADF.dm4').im[1].imageData.astype('complex64')[:N,:,:]
imageShape = frames.shape[1:]

for I in np.arange(tries):
    t0 = perf_counter()
    fft_template = np.empty( imageShape, dtype='complex64' )
    fft_frame = np.empty_like(fft_template)
    xcorr2 = ne3.NumExpr( 'fft_frame=fft_frame*fft_template/abs(fft_frame*fft_template)' )
    fft_template = np.fft.fft2(frames[0,:,:])
    fft_template = fft_template.conj()
    for K in np.arange(1,N):
        fft_frame = np.fft.fft2(frames[K,:,:])
        # There's a memory leak with NE3 with the Intel Python distro.
        xcorr2( fft_frame=fft_frame, fft_template=fft_template )
        output = np.fft.ifft2( fft_frame )
    t1 = perf_counter()
    print( f'Computed {N} phase correlations on CPU in {t1-t0} s' )
    timings[I] = t1 - t0

if os.path.isfile( 'cpu_times.pkl' ):
    with open( 'cpu_times.pkl', 'rb' ) as fh:
        cpu_times = pickle.load(fh)
    if N not in cpu_times:
        cpu_times[N] = []
    cpu_times[N].append( np.min(timings) )
else:
    cpu_times = {N: [np.min(timings)]}
with open( 'cpu_times.pkl', 'wb' ) as fh:
    pickle.dump(cpu_times, fh)
```

and another `gpu.py`, using `numba.cuda` and `pyculib.fft`:

```
import numpy as np
import numba.cuda as cuda
import pyculib.fft as cfft
from time import perf_counter
import sys, os, os.path, pickle

N = int(sys.argv[1])
tries = int(sys.argv[2])
timings = np.zeros(tries)
# Insert your own code to load images here:
import mrcz
frames = mrcz.readDM4('Series_10_HAADF.dm4').im[1].imageData.astype('complex64')[:N,:,:]
imageShape = frames.shape[1:]

# 1080 GTX seems to return errors above 20 compute units and 64 threads per compute unit,
# even though it is supposed to have 128 threads per.
Gdim = 20  # Thread-blocks per grid
Bdim = 64  # Threads per block
for I in range(tries):
    t0 = perf_counter()  # Don't count file load time in benchmark

    @cuda.jit('void(complex64[:,:], complex64[:,:])')
    def xcorr2(frame, conj_template):
        I, J = cuda.grid(2)
        frame[I,J] *= conj_template[I,J] / abs( frame[I,J] * conj_template[I,J] )

    @cuda.jit('void(complex64[:,:])')
    def conj_inplace(frame):
        I, J = cuda.grid(2)
        frame[I,J] = frame[I,J].conjugate()

    stream = cuda.stream()
    fftPlan = cfft.FFTPlan( shape=imageShape, itype='complex64', otype='complex64', stream=stream )
    with stream.auto_synchronize():
        fft_template = cuda.to_device(frames[0,:,:].astype('complex64'), stream=stream)
        fftPlan.forward(fft_template, fft_template)
        conj_inplace[Gdim,Bdim]( fft_template )
        # Run template match
        for K in np.arange(1,N):
            fft_frame = cuda.to_device(frames[K,:,:], stream=stream)
            fftPlan.forward(fft_frame, fft_frame)
            xcorr2[Gdim,Bdim]( fft_frame, fft_template )
            fftPlan.inverse(fft_frame, fft_frame)
            frames[K,:,:] = fft_frame.copy_to_host()
    t1 = perf_counter()
    print( f'Computed {N} phase correlations on GPU in {t1-t0} s' )
    timings[I] = t1 - t0

if os.path.isfile( 'gpu_times.pkl' ):
    with open( 'gpu_times.pkl', 'rb' ) as fh:
        gpu_times = pickle.load(fh)
    if N not in gpu_times:
        gpu_times[N] = []
    gpu_times[N].append( np.min(timings) )
else:
    gpu_times = {N: [np.min(timings)]}
with open( 'gpu_times.pkl', 'wb' ) as fh:
    pickle.dump(gpu_times, fh)
```

And finally a simple `runner.py` to execute the subprocesses:

```
import subprocess as sub
import os

try: os.remove('gpu_times.pkl')
except OSError: pass
try: os.remove('cpu_times.pkl')
except OSError: pass

unique_tries = 5
repeat_tries = 1
for I in range(2,90):
    for J in range(unique_tries):
        sub.call( f'python cpu.py {I} {repeat_tries}' )
        sub.call( f'python gpu.py {I} {repeat_tries}' )
```

We have to run our tests in subprocesses to get a realistic view of the time taken to start up the components. For
the FFTs there’s planning involved (although since these are power-of-2 shaped images, it should be minimal), and
for the GPU solution we also pay the time for Numba to transform the Python functions into a CUDA kernel. There are
ways around this cost if you are running the logic more than once (say via `dask` over a cluster), typically via
serialization with `pickle`. Both NumExpr-3 and Numba support pickling of their output functions. Historically,
FFTW offered a means of saving a plan to a file, known as wisdom, but I don’t see an option to do that for either
the CUDA FFT or the Intel MKL-FFT bindings featured here.

## Results

Here are the results for executing 1 to 88 cross-correlations, all on 2k x 2k images. I did 5 tries per step and took the minimum time:

It takes about 60 frames (or ~1 GB of data) for the GPU solution to catch the CPU. This is largely due to the longer
setup time of the GPU solution: Numba has to actually compile those functions, whereas NumExpr has its own internal
virtual machine and can parse Python code very quickly. The initialization time for the CPU solution is about
0.2 ms, whereas for the GPU solution it’s about 600 ms, a factor of 3000x difference. About half of the GPU setup
time goes to the FFT plan, whereas the Intel FFT doesn’t seem to have a setup time; likely it has some cached
wisdom for the given image shape. Note that these benchmarks are really quite fast for both solutions. I tried the
same process with `pyfftw` and it takes about 10.5 s for the full 88 cross-correlations. Using `scipy.fftpack` is
far slower, clocking in at 68 s.

Sometimes one wants to calculate something once, but often we want to repeat the same calculation on different data
sets. If `cpu.py` and `gpu.py` are run through their `repeat_tries` loop, the initial run is slow, but the later
runs are fast.

Here the GPU manages to pass the CPU around frame 20, a much better showing. I’m not the world’s expert in either
GPU programming or the use of Numba, so a more skilled hand could likely squeeze some more performance out of the
GPU. Monitoring with `nvidia-smi` shows a maximum utilization on the GPU of about 30 %. In my experience it’s
difficult to get 100 % utilization out of a GPU, but certainly 50-60 % isn’t unreasonable. There are optimizations
that could be done. For example, we are returning the full array from the GPU to main memory via the
`fft_frame.copy_to_host()` call, yet practically we probably don’t want the entire cross-correlation, as we are
usually only concerned with the neighbourhood around the maximum. So we could return a small sub-array, or possibly
even up-interpolate that sub-array to get sub-pixel precision in our image shifts.
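A sketch of that sub-array idea (a hypothetical `peak_window` helper; shown here on the CPU with random data, but on the GPU only this small window would be copied back to host memory):

```python
import numpy as np

def peak_window(xcorr, half=8):
    """Return the peak location and a (2*half+1) x (2*half+1) window centred on it."""
    peak = np.unravel_index(np.argmax(np.abs(xcorr)), xcorr.shape)
    # Roll the peak to the window centre, so edge peaks wrap instead of clipping
    centred = np.roll(xcorr, (half - peak[0], half - peak[1]), axis=(0, 1))
    return peak, centred[:2 * half + 1, :2 * half + 1]

xcorr = np.random.default_rng(7).random((2048, 2048))  # stand-in for a real correlogram
peak, window = peak_window(xcorr)
print(window.shape)  # -> (17, 17)
```

Transferring 17 x 17 values instead of 2048 x 2048 would cut the device-to-host traffic per frame by several orders of magnitude.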

## Conclusion

If the GPU really were five times faster than the CPU, then the GPU slope should be 5x flatter than the CPU slope.
In reality, the marginal speed-up per cross-correlation is only 23 % for the GPU over the CPU. The GPU solution in
this case ends up below expectations, for a more intricate and time-consuming coding task. In my experience, where
GPUs really shine is iterative algorithms where the data stays on the GPU for some time. By comparison, the
cross-correlation problem is not particularly well-suited to GPU processing: it involves too little calculation on
the GPU and too large a volume of arrays to transfer between main memory and the GPU. Ideally, for pure GPU
solutions we want to repeatedly crunch numbers on the board, which means the problem has to fit within the GPU’s
buffer, in this case 8 GB. I’m thankful for the introduction of 4K computer monitors, as before that the memory
available to GPUs was much smaller. Overall, the most notable result here to me was just how fast the Intel MKL-FFT
is, and it’s also extremely simple to use, as it patches `numpy.fft` directly. I really did not expect it to be so
competitive. Python has gotten much faster in 2017 for scientific calculation, and that’s pleasing to see.

## Comments @ reddit

Benchmarking NumExpr3/Intel MKL-FFT against Numba/cuFFT from Python