Date: Nov 9, 2012 7:07 AM
Author: Jerome
Subject: Re: Performance Difference in CPU and GPU in MATLAB
Thank you for your reply!

I also tried larger matrices (621 x 1176), and the GPU version (0.00834 s) is still slower than the CPU version (0.001513 s).

The GPU version is configured and timed as follows:

kernel.ThreadBlockSize = [1024,1,1];

kernel.GridSize = [713,1];

tic

C = feval(kernel,A,B,C);

wait(gpuDevice(1));

C=gather(C)

time = toc
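
A side note on the launch configuration above: 713 blocks of 1024 threads give 730,112 threads, while a 621 x 1176 array has 730,296 elements, so the last 184 elements would not be touched by the kernel. The arithmetic can be sketched as below. The variable names are illustrative only. Rounding the block count up would also need a bounds check inside the kernel (for example an extra length argument, which the posted kernel does not have), otherwise the surplus threads would write past the end of C.

```matlab
% Sketch: how many 1024-thread blocks are needed to cover a 621 x 1176 array.
% All names here are illustrative; only the sizes come from the post.
numElements     = 621 * 1176;                   % 730296 elements
threadsPerBlock = 1024;                         % kernel.ThreadBlockSize(1)
blocksPosted    = 713;                          % kernel.GridSize(1) as posted

covered   = blocksPosted * threadsPerBlock;     % threads actually launched
uncovered = numElements - covered;              % elements with no thread

% Round up so that every element gets a thread.
blocksNeeded = ceil(numElements / threadsPerBlock);

fprintf('covered %d of %d elements, %d uncovered, %d blocks needed\n', ...
        covered, numElements, uncovered, blocksNeeded);
```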

My CPU version is:

A=rand(621,1176);

B=rand(621,1176);

C=rand(621,1176);

tic

C=A.*B

toc
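
A note on comparing the two timings: the GPU measurement above also includes gather (the copy back to host memory), and any statement without a trailing semicolon prints its result inside the timed region. A minimal sketch of timing the computation and the transfer separately is given below. It assumes Parallel Computing Toolbox and a supported GPU, and it uses the built-in elementwise product on gpuArray data as a stand-in for the custom kernel, so the numbers it gives are not those of matrix_mult2.

```matlab
% Sketch: time GPU compute and device-to-host transfer separately.
% Assumption: the built-in A .* B replaces the custom CUDA kernel.
dev = gpuDevice;                    % currently selected GPU
A = gpuArray(rand(621, 1176));      % copy inputs to the GPU up front
B = gpuArray(rand(621, 1176));

C = A .* B;                         % warm-up run, not timed
wait(dev);                          % make sure the warm-up has finished

tic;
C = A .* B;                         % launched asynchronously
wait(dev);                          % block until the GPU is done
tCompute = toc;

tic;
Chost = gather(C);                  % copy the result back to the host
tTransfer = toc;

fprintf('compute: %.6f s, transfer: %.6f s\n', tCompute, tTransfer);
```

Running the timed region several times and taking the median is usually more stable than a single tic/toc pair.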

Thanks in advance

Edric M Ellis <eellis@mathworks.com> wrote in message <ytw7gpv6sjl.fsf@uk-eellis0l.dhcp.mathworks.com>...

> "Jerome " <the_rome@hotmail.com> writes:

>

> > I have invoked a cuda kernel from my MATLAB implementation; however my

> > CPU results are faster than my gpu implementation.

> >

> > The results are:

> >

> > CPU: 0.000006

> > GPU: 0.00134

> > My kernel and MATLAB code are below:

> >

> > Thanks in Advance!

> >

> > matrix.cu

> >

> > __global__ void matrix_mult2(double *A, double *B, double * C) {

> > int x = blockIdx.x * blockDim.x + threadIdx.x;

> >

> > C[x] = A[x] * B[x];

> >

> >

> > }

> >

> >

> >

> > main.m

> > kernel = parallel.gpu.CUDAKernel( 'matrix_mult2.ptx', ...

> > 'matrix_mult2.cu' );

> >

> >

> > kernel.ThreadBlockSize = [25,1,1];

> > kernel.GridSize = [1,1];

> >

> >

> > A = parallel.gpu.GPUArray.rand(5,5,'double');

> > B = parallel.gpu.GPUArray.rand(5,5,'double');

> > C = parallel.gpu.GPUArray.zeros(5,5);

> >

> > C = feval(kernel,A,B,C);

>

> Firstly, to get accurate timing information when running stuff on the

> GPU, you need to add "wait(gpuDevice)" to ensure that everything has

> finished running there.

>

> Secondly, there is a fixed overhead to getting through to launching a

> kernel on the GPU, which explains why things don't speed up until you

> get to relatively large data sizes.

>

> To evaluate GPU performance for a kernel as simple as this one, you

> should compare your measured throughput (i.e. achieved bandwidth) with

> the theoretical maximum for your device. For a kernel as simple as this,

> you should get close to the peak achievable bandwidth for your device,

> probably when numel(A) is around 1e5 or thereabouts.

>

> Cheers,

>

> Edric.
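
To make the bandwidth comparison suggested above concrete, here is a rough sketch of the arithmetic. It assumes the kernel reads A and B once and writes C once, all in double precision, and it plugs in the 0.00834 s figure from the post, which also includes the gather, so it understates what the kernel alone achieves. With those inputs the result is roughly 2.1 GB/s. The theoretical peak to compare against is device-specific and has to come from the GPU's own specification.

```matlab
% Sketch: achieved memory bandwidth for C = A .* B on double data.
% Names are illustrative; the size and time are the figures from the post.
numElements   = 621 * 1176;                        % elements per array
bytesPerValue = 8;                                 % double precision
bytesMoved    = 3 * numElements * bytesPerValue;   % read A, read B, write C

elapsed       = 0.00834;                           % seconds (includes gather)
achievedGBps  = bytesMoved / elapsed / 1e9;        % decimal gigabytes per second

fprintf('moved %d bytes in %.5f s: %.2f GB/s\n', ...
        bytesMoved, elapsed, achievedGBps);
```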