Topic: Invoke GPU Kernel in MATLAB for Pointwise Multiplication
Replies: 1   Last Post: Nov 6, 2012 5:37 AM

Edric Ellis
Re: Invoke GPU Kernel in MATLAB for Pointwise Multiplication
Posted: Nov 6, 2012 5:37 AM

"Jerome " <the_rome@hotmail.com> writes:

> I am having difficulty multiplying complex numbers on the GPU. The values are
> incorrect from the C=A.*B format.
>
> I am just using one thread to determine if the first element is correct.
> My example code is the following:
> func.cu
> __global__ void matrix_mult(float * A_real, float*A_imag, float*B_real, float *C_real, float *C_imag)
> {
> int x = blockIdx.x * blockDim.x + threadIdx.x;
>
> C_real =A_real[x]*B_real[x] - A_imag[x] * B_imag[x];
> C_imag = A_imag[x]*B_real + A_real[x] * B_imag[x];
>
> }

Hi Jerome, there are a few things to address here. Firstly, your kernel
is missing an input: B_imag. Also, when implementing an elementwise
function, it's very useful to pass in the number of elements, because
you'll need to round the number of threads up and guard against
out-of-range indices. You've also missed out the array indexing on the
outputs - C_real and C_imag are assigned as pointers rather than
elements. Finally, CUDAKernel assumes that non-const pointers are output
arguments - in your case, you want only the 'C' arguments to be outputs,
so the inputs should be const. Putting that all together, you end up
with:

__global__ void matrix_mult( const int numel,
                             const float * A_real, const float * A_imag,
                             const float * B_real, const float * B_imag,
                             float * C_real, float * C_imag )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    // Only continue if we're in range of the number of elements
    if ( x < numel ) {
        C_real[x] = A_real[x] * B_real[x] - A_imag[x] * B_imag[x];
        C_imag[x] = A_imag[x] * B_real[x] + A_real[x] * B_imag[x];
    }
}

There are also some problems with your MATLAB driving code. Firstly, you
have this:

> main_function.m
> kernel = parallel.gpu.CUDAKernel('func.ptx',func.cu);
> kernel.GridSize = [1,1]

This will run only a single thread - you need to run with enough threads
to cover the arrays. You need at least "numel(A)" threads.

> [x,y]=size(data)
> result_real = parallel.gpu.GPUArray.zeros(x,y, 'single');
> result_imag = parallel.gpu.GPUArray.zeros(x,y, 'single');

> [C_real, C_imag] = feval(kernel, real(A), imag(A), imag(B), real(B), result_real, result_imag)

You've got imag(B) and real(B) swapped here - the kernel expects the real
part of B before the imaginary part.

Putting it all together, you need something more like this:

% Arbitrarily choose size of data.
n = 135; m = 247;

% Generate random complex input data
A = complex( parallel.gpu.GPUArray.rand(n, m, 'single'), ...
             parallel.gpu.GPUArray.rand(n, m, 'single') );
B = complex( parallel.gpu.GPUArray.rand(n, m, 'single'), ...
             parallel.gpu.GPUArray.rand(n, m, 'single') );

% Get the kernel
kernel = parallel.gpu.CUDAKernel('kern.ptx', 'kern.cu');

% Choose grid and block sizes: one thread per element, rounded up
numThreads = numel(A);
blockSize  = min( 256, kernel.MaxThreadsPerBlock );
gridSize   = ceil( numThreads / blockSize );

% Set up the kernel
kernel.ThreadBlockSize = blockSize;
kernel.GridSize        = gridSize;

% Pre-allocate the results
C_real = parallel.gpu.GPUArray.zeros(n, m, 'single');
C_imag = parallel.gpu.GPUArray.zeros(n, m, 'single');
[C_real, C_imag] = feval(kernel, numel(A), real(A), imag(A), ...
                         real(B), imag(B), C_real, C_imag);
C = complex(C_real, C_imag);

% Check the results - this should be (close to) all zeros
C - A .* B

Also note that CUDAKernel can handle passing complex data in as 'float2'
(or 'double2'), so you could simplify your code by using those.
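For example, a float2 version of the kernel might look something like the
following (an untested sketch along the same lines as the kernel above -
CUDA's float2 stores the real part in .x and the imaginary part in .y):

```cuda
// Sketch: the same elementwise complex product, with each complex array
// passed as a single float2 pointer rather than separate real/imag parts.
__global__ void matrix_mult_c2( const int numel,
                                const float2 * A, const float2 * B,
                                float2 * C )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if ( x < numel ) {
        C[x].x = A[x].x * B[x].x - A[x].y * B[x].y; // real part
        C[x].y = A[x].y * B[x].x + A[x].x * B[x].y; // imaginary part
    }
}
```

On the MATLAB side, you'd then pass the complex arrays A and B straight to
feval with a single complex output, rather than splitting everything with
real() and imag().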

Cheers,

Edric.
