Date: Nov 9, 2012 7:07 AM
Author: Jerome
Subject: Re: Performance Difference in CPU and GPU in MATLAB

Thank you for your reply!

I also tried larger matrices, 621 x 1176, and the GPU (0.00834 s) is still slower than the CPU (0.001513 s).
The kernel was configured with:
kernel.ThreadBlockSize = [1024,1,1];
kernel.GridSize = [713,1];

tic
C = feval(kernel,A,B,C);
wait(gpuDevice(1));
C=gather(C);
time = toc
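
As a quick sanity check on the launch configuration above, here is a minimal sketch of the arithmetic. It assumes the kernel handles one element per thread, as the quoted matrix_mult2 does; the variable names are only illustrative.

```matlab
% Elements in a 621 x 1176 matrix versus threads launched.
numElements   = 621 * 1176;          % 730296 elements
threadsPerBlk = 1024;                % kernel.ThreadBlockSize(1)
numBlocks     = 713;                 % kernel.GridSize(1)

threadsLaunched = threadsPerBlk * numBlocks;        % 730112 threads
uncovered       = numElements - threadsLaunched;    % elements no thread touches
blocksNeeded    = ceil(numElements / threadsPerBlk);

fprintf('launched %d threads for %d elements (%d uncovered)\n', ...
        threadsLaunched, numElements, uncovered);
fprintf('blocks needed for full coverage: %d\n', blocksNeeded);
```

With these numbers the last 184 elements are never computed. Rounding the grid up to 714 blocks would cover them, but only if the kernel also guards against indices past the end of the array, which the quoted kernel does not do.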

My CPU version is:
A=rand(621,1176);
B=rand(621,1176);
C=rand(621,1176);

tic
C=A.*B;
toc

Thanks in advance
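
To see how much of the GPU figure is the multiplication itself and how much is the copy back to the host, the two can be timed separately. This is only a sketch: it assumes Parallel Computing Toolbox and a supported GPU, and it uses the built-in element-wise multiply on gpuArray data as a stand-in for the custom kernel.

```matlab
% Sketch: time the GPU multiply and the gather separately.
% Assumes Parallel Computing Toolbox and a supported GPU.
gpu = gpuDevice;                     % currently selected device

A = gpuArray(rand(621,1176));        % data already resident on the GPU
B = gpuArray(rand(621,1176));

C = A.*B;                            % warm-up run, excluded from timing
wait(gpu);

tic
C = A.*B;                            % stand-in for feval(kernel,A,B,C)
wait(gpu);                           % block until the GPU has finished
tCompute = toc;

tic
Chost = gather(C);                   % device-to-host copy
tGather = toc;

fprintf('compute: %g s, gather: %g s\n', tCompute, tGather);
```

If tGather dominates, the comparison with the CPU is mostly measuring the transfer rather than the kernel.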

Edric M Ellis <eellis@mathworks.com> wrote in message <ytw7gpv6sjl.fsf@uk-eellis0l.dhcp.mathworks.com>...
> "Jerome " <the_rome@hotmail.com> writes:
>

> > I have invoked a cuda kernel from my MATLAB implementation; however my
> > CPU results are faster than my gpu implementation.
> >
> > The results are:
> >
> > CPU: 0.000006
> > GPU: 0.00134
> > My kernel and MATLAB code are below:
> >
> > Thanks in Advance!
> >
> > matrix.cu
> >
> > __global__ void matrix_mult2(double *A, double *B, double * C) {
> > int x = blockIdx.x * blockDim.x + threadIdx.x;
> >
> > C[x] = A[x] * B[x];
> >
> >
> > }
> >
> >
> >
> > main.m
> > kernel = parallel.gpu.CUDAKernel( 'matrix_mult2.ptx', ...
> > 'matrix_mult2.cu' );
> >
> >
> > kernel.ThreadBlockSize = [25,1,1];
> > kernel.GridSize = [1,1];
> >
> >
> > A = parallel.gpu.GPUArray.rand(5,5,'double');
> > B = parallel.gpu.GPUArray.rand(5,5,'double');
> > C = parallel.gpu.GPUArray.zeros(5,5);
> >
> > C = feval(kernel,A,B,C);

>
> Firstly, to get accurate timing information when running stuff on the
> GPU, you need to add "wait(gpuDevice)" to ensure that everything has
> finished running there.
>
> Secondly, there is a fixed overhead to getting through to launching a
> kernel on the GPU, which explains why things don't speed up until you
> get to relatively large data sizes.
>
> To evaluate GPU performance for a kernel as simple as this one, you
> should compare your measured throughput (i.e. achieved bandwidth) with
> the theoretical maximum for your device. For a kernel as simple as this,
> you should get close to the peak achievable bandwidth for your device,
> probably when numel(A) is around 1e5 or thereabouts.
>
> Cheers,
>
> Edric.
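
The bandwidth comparison suggested above can be sketched as follows, using the 621 x 1176 figures from this thread. It assumes the kernel reads A and B once and writes C once, all as doubles. Note that the 0.00834 s measurement also includes the gather, so the result understates what the kernel alone achieves.

```matlab
% Achieved bandwidth for C = A .* B on doubles.
numElements  = 621 * 1176;       % elements per matrix
bytesPerElem = 8;                % double precision
elapsed      = 0.00834;          % seconds, GPU time reported in this thread

bytesMoved  = 3 * numElements * bytesPerElem;   % read A, read B, write C
achievedGBs = bytesMoved / elapsed / 1e9;

fprintf('achieved bandwidth: %.2f GB/s\n', achievedGBs);
```

This comes to roughly 2.1 GB/s, which is the number to compare against the peak memory bandwidth in the device's specification.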