Topic: Performance Difference in CPU and GPU in MATLAB
Replies: 2
Last Post: Nov 9, 2012 7:07 AM

Jerome
Posts: 48
Registered: 12/9/11

Re: Performance Difference in CPU and GPU in MATLAB
Posted: Nov 9, 2012 7:07 AM
Thank you for your reply!
I also tried larger matrices (621 x 1176), and the GPU (0.00834 s) is still slower than the CPU (0.001513 s), where:

kernel.ThreadBlockSize = [1024,1,1];
kernel.GridSize = [713,1];

tic
C = feval(kernel,A,B,C);
wait(gpuDevice(1));
C = gather(C)
time = toc

My CPU version is:

A = rand(621,1176);
B = rand(621,1176);
C = rand(621,1176);

tic
C = A.*B
toc
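For what it's worth, a fairer GPU timing along the lines suggested in the reply below would warm the kernel up once and keep the gather (the device-to-host copy) outside the timed region. A minimal sketch, assuming the same kernel object and GPUArray inputs A, B, C as above:

% Warm-up launch so one-off kernel setup cost is not included in the timing.
C = feval(kernel, A, B, C);
wait(gpuDevice);

tic
C = feval(kernel, A, B, C);   % element-wise multiply on the GPU
wait(gpuDevice);              % make sure the kernel has actually finished
gpuTime = toc

Chost = gather(C);            % copy back to the host outside the timed region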
Thanks in advance!

Edric M Ellis <eellis@mathworks.com> wrote in message <ytw7gpv6sjl.fsf@uk-eellis0l.dhcp.mathworks.com>...
> "Jerome " <the_rome@hotmail.com> writes:
>
> > I have invoked a CUDA kernel from my MATLAB implementation; however my
> > CPU results are faster than my GPU implementation.
> >
> > The results are:
> >
> > CPU: 0.000006
> > GPU: 0.00134
> >
> > My kernel and MATLAB code is below:
> >
> > Thanks in Advance!
> >
> > matrix.cu
> >
> > __global__ void matrix_mult2(double *A, double *B, double *C) {
> >     int x = blockIdx.x * blockDim.x + threadIdx.x;
> >     C[x] = A[x] * B[x];
> > }
> >
> > main.m
> >
> > kernel = parallel.gpu.CUDAKernel( 'matrix_mult2.ptx', ...
> >                                   'matrix_mult2.cu' );
> >
> > kernel.ThreadBlockSize = [25,1,1];
> > kernel.GridSize = [1,1];
> >
> > A = parallel.gpu.GPUArray.rand(5,5,'double');
> > B = parallel.gpu.GPUArray.rand(5,5,'double');
> > C = parallel.gpu.GPUArray.zeros(5,5);
> >
> > C = feval(kernel,A,B,C);
>
> Firstly, to get accurate timing information when running stuff on the
> GPU, you need to add "wait(gpuDevice)" to ensure that everything has
> finished running there.
>
> Secondly, there is a fixed overhead to getting through to launching a
> kernel on the GPU, which explains why things don't speed up until you
> get to relatively large data sizes.
>
> To evaluate GPU performance for a kernel as simple as this one, you
> should compare your measured throughput (i.e. achieved bandwidth) with
> the theoretical maximum for your device. For a kernel as simple as this,
> you should get close to the peak achievable bandwidth for your device,
> probably when numel(A) is around 1e5 or thereabouts.
>
> Cheers,
>
> Edric.
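Following the bandwidth suggestion in the quoted reply: this kernel reads A and B and writes C, i.e. 24 bytes per element in double precision, so the achieved bandwidth can be estimated roughly as below. This is only a sketch; the 0.00834 s figure is the GPU time measured above (which also includes the gather), and the theoretical peak has to be taken from the spec sheet of the particular card in use.

nElems       = 621 * 1176;          % elements in each of A, B and C
bytesMoved   = 3 * nElems * 8;      % read A, read B, write C; 8 bytes per double
gpuTime      = 0.00834;             % measured GPU time in seconds (from above)
achievedGBps = bytesMoved / gpuTime / 1e9

% Compare achievedGBps with the device's theoretical peak memory bandwidth.
% At this size it comes out to only a couple of GB/s, far below a typical
% card's peak, which is consistent with fixed overhead (kernel launch and the
% gather inside the timed region) dominating the measured time.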