Math Forum » Discussions » Software » comp.soft-sys.matlab

Topic: Performance Difference in CPU and GPU in MATLAB
Replies: 2   Last Post: Nov 9, 2012 7:07 AM

Jerome

Posts: 48
Registered: 12/9/11
Re: Performance Difference in CPU and GPU in MATLAB
Posted: Nov 9, 2012 7:07 AM

Thank you for your reply!

I also tried larger matrices (621 x 1176), and the GPU (0.00834 s) is still slower than the CPU (0.001513 s), with:
kernel.ThreadBlockSize = [1024,1,1];
kernel.GridSize = [714,1];   % ceil(621*1176 / 1024); with [713,1] the last 184 elements are never computed (the kernel then also needs an if (x < n) bounds check)

tic
C = feval(kernel, A, B, C);
wait(gpuDevice(1));
C = gather(C);
time = toc
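Note that `gather` copies the result back to host memory, so the timed region above includes a device-to-host transfer as well as the kernel launch. A minimal sketch that times the two separately (same variable names as above; purely illustrative):

```matlab
% Time only the kernel execution
tic
C = feval(kernel, A, B, C);
wait(gpuDevice(1));       % block until the kernel has actually finished
kernelTime = toc;

% Time the device-to-host transfer on its own
tic
Chost = gather(C);
transferTime = toc;
```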

My CPU version is:
A = rand(621,1176);
B = rand(621,1176);
C = rand(621,1176);

tic
C = A.*B;   % semicolon needed: printing the full matrix would dominate the timing
toc
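For a fair CPU number it may also help to average over many runs, since a single element-wise multiply of this size finishes in well under a millisecond, where tic/toc resolution and first-call overhead can dominate. A rough sketch (the repetition count is arbitrary):

```matlab
A = rand(621,1176);
B = rand(621,1176);

nReps = 100;               % arbitrary; enough to average out timer noise
tic
for k = 1:nReps
    C = A.*B;              % semicolon suppresses printing inside the loop
end
cpuTime = toc / nReps;     % average seconds per multiply
```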

Thanks in advance!

Edric M Ellis <eellis@mathworks.com> wrote in message <ytw7gpv6sjl.fsf@uk-eellis0l.dhcp.mathworks.com>...
> "Jerome " <the_rome@hotmail.com> writes:
>

> > I have invoked a CUDA kernel from my MATLAB implementation; however my
> > CPU results are faster than my GPU implementation.
> >
> > The results are:
> >
> > CPU: 0.000006
> > GPU: 0.00134
> > My kernel and MATLAB code is below:
> >
> > Thanks in Advance!
> >
> > matrix.cu
> >
> > __global__ void matrix_mult2(double *A, double *B, double *C) {
> >     int x = blockIdx.x * blockDim.x + threadIdx.x;
> >     C[x] = A[x] * B[x];
> > }
> >
> >
> >
> > main.m
> > kernel = parallel.gpu.CUDAKernel( 'matrix_mult2.ptx', ...
> > 'matrix_mult2.cu' );
> >
> >
> > kernel.ThreadBlockSize = [25,1,1];
> > kernel.GridSize = [1,1];
> >
> >
> > A = parallel.gpu.GPUArray.rand(5,5,'double');
> > B = parallel.gpu.GPUArray.rand(5,5,'double');
> > C = parallel.gpu.GPUArray.zeros(5,5);
> >
> > C = feval(kernel,A,B,C);

>
> Firstly, to get accurate timing information when running stuff on the
> GPU, you need to add "wait(gpuDevice)" to ensure that everything has
> finished running there.
>
> Secondly, there is a fixed overhead to getting through to launching a
> kernel on the GPU, which explains why things don't speed up until you
> get to relatively large data sizes.
>
> To evaluate GPU performance for a kernel as simple as this one, you
> should compare your measured throughput (i.e. achieved bandwidth) with
> the theoretical maximum for your device. For a kernel as simple as this,
> you should get close to the peak achievable bandwidth for your device,
> probably when numel(A) is around 1e5 or thereabouts.
>
> Cheers,
>
> Edric.
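For reference, Edric's bandwidth check can be sketched like this: the kernel reads A and B and writes C, so it moves three doubles (24 bytes) per element, and dividing by the measured time gives the achieved bandwidth to compare against the card's spec sheet. The numbers here assume the 621 x 1176 case from this thread:

```matlab
n        = 621 * 1176;         % number of elements
bytes    = 3 * 8 * n;          % read A, read B, write C; 8 bytes per double
t        = 0.00834;            % measured GPU time in seconds (from above)
achieved = bytes / t / 1e9;    % achieved bandwidth in GB/s (~2.1 here)
% Compare 'achieved' against the theoretical peak for your device;
% ~2 GB/s is far below any modern GPU's peak, suggesting the run is
% dominated by launch/transfer overhead rather than the kernel itself.
```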



