I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionality. Since performance is an issue, I did some benchmarking and would like to know whether the approach I took is legitimate.
I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:
- Numpy, making use only of the functionality of dot.
- Python, calling the BLAS functionalities through a shared object.
- C++, calling the BLAS functionalities through a shared object.
Scenario
I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5, and the matrices m1 and m2 are set up like this:
m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)
1. Numpy
The code used looks like this:
tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
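For reference, the NumPy measurement above can be written as one self-contained loop. This is only a sketch: the helper name bench_dot is hypothetical, a lambda replaces the setup string (which adds a small, constant function-call overhead per run), and reporting the minimum of the 20 repeats is an assumption about how the timings are reduced, not something stated above.

```python
import timeit

import numpy as np


def bench_dot(sizes, repeats=20):
    """Time numpy.dot on square float32 matrices; return (size, best time in s)."""
    results = []
    for i in sizes:
        m1 = np.random.rand(i, i).astype(np.float32)
        m2 = np.random.rand(i, i).astype(np.float32)
        # Each repeat runs the statement exactly once, as in tNumpy.repeat(20, 1).
        timer = timeit.Timer(lambda: np.dot(m1, m2))
        times = timer.repeat(repeats, 1)
        # Assumption: keep the fastest run as the least disturbed measurement.
        results.append((i, min(times)))
    return results


if __name__ == "__main__":
    # Same dimensions as in the scenario: 5, 10, ..., 500.
    for size, best in bench_dot(range(5, 505, 5)):
        print(size, best)
```

Taking the minimum rather than the mean reduces the influence of other processes on the timings, but either choice should be applied identically to all three contestants.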
2. Python, calling BLAS through a shared object
With the function
_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):
    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)
    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
                    byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
                    m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero),
                    r.ctypes.data_as(ctypes.c_void_p), byref(n))
the test code looks like this:
r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
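One detail worth checking in the direct call is memory order: the Fortran sgemm_ routine reads its buffers column-major, while the arrays created above are C-ordered (row-major). The sketch below does not call BLAS at all. It emulates a column-major GEMM with plain NumPy (the helper colmajor_gemm is hypothetical) to show that passing C-ordered buffers unchanged yields m2·m1 rather than m1·m2 when the result is read back as a C-ordered array. The timing is unaffected, but the values are.

```python
import numpy as np


def colmajor_gemm(a_buf, b_buf, n):
    """Emulate a column-major n x n GEMM acting on flat buffers (no BLAS call)."""
    A = a_buf.reshape((n, n), order="F")  # interpret buffer as column-major
    B = b_buf.reshape((n, n), order="F")
    return (A @ B).ravel(order="F")       # write the result column-major


n = 4
rng = np.random.default_rng(0)
m1 = rng.random((n, n), dtype=np.float32)
m2 = rng.random((n, n), dtype=np.float32)

# C-ordered buffers passed in the order (m1, m2), result read back as C order:
naive = colmajor_gemm(m1.ravel(), m2.ravel(), n).reshape(n, n)

# Swapping the operands recovers m1 . m2 without copying or transposing:
swapped = colmajor_gemm(m2.ravel(), m1.ravel(), n).reshape(n, n)

print(np.allclose(naive, m2 @ m1, atol=1e-5))
print(np.allclose(swapped, m1 @ m2, atol=1e-5))
```

For square random matrices this goes unnoticed in a benchmark, so comparing r against numpy.dot(m1, m2) once is a cheap sanity check.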
3. C++, calling BLAS through a shared object
The C++ code is naturally a little longer, so I reduce the information to a minimum.
I load the function with
void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");
I measure the time with gettimeofday like this:
gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);
where j is the index of a loop running 20 times. I calculate the time passed with
double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}
Results
The result is shown in the plot below:
Questions
- Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
- Would you expect the result to show such a huge discrepancy between the C++ and Python approaches? Both use shared objects for their calculations.
- Since I would rather use Python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Download
The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)
… r matrix is unfair. I am resolving the "issue" right now and will post the new results. – Woltan Sep 29 '11 at 11:35
… np.ascontiguousarray() (consider C vs. Fortran order). 2. make sure that np.dot() uses the same libblas.so. – J.F. Sebastian Sep 29 '11 at 12:44
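The second point in the last comment can be checked roughly as follows. This is a sketch, not a definitive test: ctypes.util.find_library reports what the loader finds for the generic name "blas", which may differ from what LoadLibrary("libblas.so") resolves on a given system, and numpy.show_config() only prints the build configuration of the installed NumPy.

```python
import ctypes.util

import numpy as np

# Library the dynamic loader associates with the name "blas"
# (None if nothing is found on this system).
blas_name = ctypes.util.find_library("blas")
print("ctypes would look for:", blas_name)

# BLAS/LAPACK libraries this NumPy build was configured with.
np.show_config()
```

If the two refer to different implementations (for example a reference BLAS versus an optimized one), the Numpy and ctypes contestants are not benchmarking the same sgemm.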