
Re: Invoke GPU Kernel in MATLAB for Pointwise Multiplication
Posted:
Nov 6, 2012 5:37 AM


"Jerome " <the_rome@hotmail.com> writes:
> I am having difficulty multiplying complex numbers on the GPU. The values are > incorrect from the C=A.*B format. > > I am just using one thread to determine if the first element is correct. > My example code is the following: > func.cu > __global__ void matrix_mult(float * A_real, float*A_imag, float*B_real, float *C_real, float *C_imag) > { > int x = blockIdx.x * blockDim.x + threadIdx.x; > > C_real =A_real[x]*B_real[x]  A_imag[x] * B_imag[x]; > C_imag = A_imag[x]*B_real + A_real[x] * B_imag[x]; > > }
Hi Jerome, there are a few things to address here. Firstly, your kernel is missing an input  B_imag. Also, when trying to implement an elementwise function, it's very useful to send in the number of elements because you'll need to round up the number of threads. You've missed some indexing out too. Finally, CUDAKernel assumes that nonconst pointers are output arguments  in your case, you want only the 'C' portions to be outputs. Putting that all together, you end up with:
__global__ void matrix_mult( const int numel, const float * A_real, const float * A_imag, const float * B_real, const float * B_imag, float * C_real, float * C_imag) { // Calculate thread index int x = blockIdx.x * blockDim.x + threadIdx.x;
// Only continue if we're in range of the number of elements if ( x < numel ) { C_real[x] = A_real[x]*B_real[x]  A_imag[x] * B_imag[x]; C_imag[x] = A_imag[x]*B_real[x] + A_real[x] * B_imag[x]; } }
There are also some problems with your MATLAB driving code. Firstly, you have this:
> main_function.m > kernel = parallel.gpu.CUDAKernel('func.ptx',func.cu); > kernel.ThreadBlockSize = [1,1,1] > kernel.GridSize = [1,1]
This will run only a single thread  you need to run with enough threads to cover the arrays. You need at least "numel(A)" threads.
> [x,y]=size(data) > result_real = parallel.gpu.GPUArray.zeros(x,y, 'single'); > result_imag = parallel.gpu.GPUArray.zeros(x,y, 'single');
> [C_real, C_imag] = feval(kernel, real(A), imag(A), imag(B), real(B), result_real, result_imag)
You've got imag(B) and real(B) transposed here.
Putting it all together, you need something more like this:
% Arbitrarily choose size of data. n = 135; m = 247;
% Generate random complex input data A = complex( parallel.gpu.GPUArray.rand(n, m, 'single'), ... parallel.gpu.GPUArray.rand(n, m, 'single') ); B = complex( parallel.gpu.GPUArray.rand(n, m, 'single'), ... parallel.gpu.GPUArray.rand(n, m, 'single') );
% Get the kernel kernel = parallel.gpu.CUDAKernel('kern.ptx','kern.cu');
% Choose grid and block sizes numThreads = numel(A); blockSize = min( 256, kernel.MaxThreadsPerBlock ); gridSize = ceil( numThreads / blockSize );
% Set up the kernel kernel.ThreadBlockSize = blockSize; kernel.GridSize = gridSize;
% Preallocate the results C_real = parallel.gpu.GPUArray.zeros(n, m, 'single'); C_imag = parallel.gpu.GPUArray.zeros(n, m, 'single'); [C_real, C_imag] = feval(kernel, numel(A), real(A), imag(A), real(B), imag(B), C_real, C_imag); C = complex(C_real,C_imag);
% Check the results C  A .* B
Also note that CUDAKernel can handle passing complex data in as 'float2' (or 'double2'), so you could simplify your code by using those.
Cheers,
Edric.

