The Math Forum


Math Forum » Discussions » Software » comp.soft-sys.matlab


Topic: Invoke GPU Kernel in MATLAB for Pointwise Multiplication
Replies: 1   Last Post: Nov 6, 2012 5:37 AM

Edric Ellis

Posts: 721
Registered: 12/7/04
Re: Invoke GPU Kernel in MATLAB for Pointwise Multiplication
Posted: Nov 6, 2012 5:37 AM

"Jerome " <> writes:

> I am having difficulty multiplying complex numbers on the GPU. The values are
> incorrect from the C=A.*B format.
> I am just using one thread to determine if the first element is correct.
> My example code is the following:
> __global__ void matrix_mult(float * A_real, float*A_imag, float*B_real, float *C_real, float *C_imag)
> {
> int x = blockIdx.x * blockDim.x + threadIdx.x;
> C_real =A_real[x]*B_real[x] - A_imag[x] * B_imag[x];
> C_imag = A_imag[x]*B_real + A_real[x] * B_imag[x];
> }

Hi Jerome, there are a few things to address here. Firstly, your kernel
is missing an input - B_imag. Also, when implementing an elementwise
function, it's very useful to pass in the number of elements, because
you'll need to round the number of threads up and guard against
out-of-range indices. You've also omitted the [x] indexing on several of
the pointer accesses (C_real, C_imag, and one use of B_real). Finally,
CUDAKernel assumes that non-const pointers are output arguments - in
your case, you want only the 'C' arrays to be outputs, so the inputs
should be declared const. Putting that all together, you end up with:

__global__ void matrix_mult( const int numel,
                             const float * A_real, const float * A_imag,
                             const float * B_real, const float * B_imag,
                             float * C_real, float * C_imag )
{
    // Calculate thread index
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    // Only continue if we're in range of the number of elements
    if ( x < numel ) {
        C_real[x] = A_real[x] * B_real[x] - A_imag[x] * B_imag[x];
        C_imag[x] = A_imag[x] * B_real[x] + A_real[x] * B_imag[x];
    }
}

There are also some problems with your MATLAB driving code. Firstly, you
have this:

> main_function.m
> kernel = parallel.gpu.CUDAKernel('func.ptx',;
> kernel.ThreadBlockSize = [1,1,1]
> kernel.GridSize = [1,1]

This will run only a single thread - you need to run with enough threads
to cover the arrays. You need at least "numel(A)" threads.

> [x,y]=size(data)
> result_real = parallel.gpu.GPUArray.zeros(x,y, 'single');
> result_imag = parallel.gpu.GPUArray.zeros(x,y, 'single');

> [C_real, C_imag] = feval(kernel, real(A), imag(A), imag(B), real(B), result_real, result_imag)

You've got imag(B) and real(B) swapped here - the kernel expects the
real part first.

Putting it all together, you need something more like this:

% Arbitrarily choose size of data.
n = 135; m = 247;

% Generate random complex input data
A = complex( parallel.gpu.GPUArray.rand(n, m, 'single'), ...
parallel.gpu.GPUArray.rand(n, m, 'single') );
B = complex( parallel.gpu.GPUArray.rand(n, m, 'single'), ...
parallel.gpu.GPUArray.rand(n, m, 'single') );

% Get the kernel
kernel = parallel.gpu.CUDAKernel('kern.ptx','');

% Choose grid and block sizes
numThreads = numel(A);
blockSize = min( 256, kernel.MaxThreadsPerBlock );
gridSize = ceil( numThreads / blockSize );

% Set up the kernel
kernel.ThreadBlockSize = blockSize;
kernel.GridSize = gridSize;

% Pre-allocate the results
C_real = parallel.gpu.GPUArray.zeros(n, m, 'single');
C_imag = parallel.gpu.GPUArray.zeros(n, m, 'single');
[C_real, C_imag] = feval(kernel, numel(A), real(A), imag(A), real(B), imag(B), C_real, C_imag);
C = complex(C_real,C_imag);

% Check the results - this should be (close to) zero
C - A .* B

Also note that CUDAKernel can handle passing complex data in as 'float2'
(or 'double2'), so you could simplify your code by using those.
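For instance, a float2 version of the kernel might look like the following sketch (the kernel name is illustrative; float2's .x and .y fields hold the real and imaginary parts respectively):

```cuda
// Same elementwise complex multiply, but taking interleaved complex
// data as float2 - CUDAKernel maps complex single arrays to float2
// when the prototype uses float2 pointers.
__global__ void matrix_mult2( const int numel,
                              const float2 * A,
                              const float2 * B,
                              float2 * C )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if ( x < numel ) {
        C[x].x = A[x].x * B[x].x - A[x].y * B[x].y;  // real part
        C[x].y = A[x].y * B[x].x + A[x].x * B[x].y;  // imag part
    }
}
```

You would then pass the complex arrays directly, without splitting them into real and imaginary parts on the MATLAB side.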




© The Math Forum at NCTM 1994-2018. All Rights Reserved.