I have three arrays, for example:

    x_pts = rand(3,1000);
    x_pts_sig = rand(3,1000);
    s_pts_all = rand(3,10000);

I want to speed up the following code on a GPU, but I have found that running for loops on the GPU gives no performance gain compared with, say, a parfor loop on the CPU over the entire x_pts array. I suspect I should unroll (vectorize) the for loop, but I don't know how to go about it, especially for the line that uses bsxfun.

    % make all variables GPU arrays
    x_pts = gpuArray(single(x_pts));
    x_pts_sig = gpuArray(single(x_pts_sig));
    s_pts_all = gpuArray(single(s_pts_all));

    % calculate partial likelihood values for each pose on the GPU
    posterior_sum_ix = gpuArray.zeros(1,size(s_pts_all,2));
    for ix_pts = 1:size(x_pts,2)
        sig = [x_pts_sig(1,ix_pts) 0 0;0 x_pts_sig(2,ix_pts) 0;0 0 x_pts_sig(3,ix_pts)];
        fconst = 1/(2*pi^(3/2)*sqrt((det(sig))));
        dist_ix_s = bsxfun(@minus,x_pts(:,ix_pts),s_pts_all);
        dist_sq = dist_ix_s.^2;
        dist_norm = sum(dist_sq'/sig,2)';
        posterior_all = fconst .* exp(-.5*dist_norm);
        posterior_sum_ix = posterior_sum_ix + posterior_all;
    end
    posterior_sum_ix = gather(posterior_sum_ix);

I know I could move the 'sig = ...' and 'fconst = ...' lines out of the for loop, but the profiler indicates the speedup would be negligible. I also know I could save GPU memory by combining some of the lines inside the loop. Any suggestions would be helpful!
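
To make the "unroll the loop" idea concrete, here is a minimal sketch of one possible loop-free formulation. It relies on sig always being diagonal, as it is in the loop above, so that det(sig) is the product of the three entries and the normalized squared distance can be expanded into three terms computed with matrix products. The names inv_sig, xx, xs, ss, posterior_vec, posterior_ref and max_rel_err are illustrative, the array sizes are shrunk, and everything runs on the CPU so the comparison against the original loop can be checked without a GPU. Nothing here has been benchmarked.

```matlab
% Sketch only: assumes sig is diagonal, as in the loop above.
% Small CPU inputs; wrap them in gpuArray(single(...)) to try it on a GPU.
x_pts     = rand(3,50);
x_pts_sig = rand(3,50) + 0.5;   % offset keeps the demo variances away from zero
s_pts_all = rand(3,200);

% Per-point constants, computed once for all N columns of x_pts.
% det of a diagonal matrix is the product of its diagonal entries.
fconst  = 1 ./ (2*pi^(3/2) * sqrt(prod(x_pts_sig,1)));  % 1-by-N, same expression as the loop
inv_sig = 1 ./ x_pts_sig;                               % 3-by-N

% Expand sum_k (x_k - s_k)^2 / sig_k = xx - 2*xs + ss.
xx = sum(x_pts.^2 .* inv_sig, 1)';         % N-by-1
xs = (x_pts .* inv_sig)' * s_pts_all;      % N-by-M
ss = inv_sig' * (s_pts_all.^2);            % N-by-M
dist_norm = bsxfun(@plus, xx, ss - 2*xs);  % N-by-M

% The weighted sum over the N points is a row vector times a matrix.
posterior_vec = fconst * exp(-0.5 * dist_norm);  % 1-by-M

% Reference: the original loop on the same data, for comparison.
posterior_ref = zeros(1,size(s_pts_all,2));
for ix_pts = 1:size(x_pts,2)
    sig = diag(x_pts_sig(:,ix_pts));
    fc  = 1/(2*pi^(3/2)*sqrt(det(sig)));
    d   = bsxfun(@minus, x_pts(:,ix_pts), s_pts_all).^2;
    posterior_ref = posterior_ref + fc .* exp(-.5 * sum(d'/sig,2)');
end
max_rel_err = max(abs(posterior_vec - posterior_ref) ./ posterior_ref);
```

Two caveats with this approach: the intermediate N-by-M matrices (1000-by-10000 for the sizes in the question) have to fit in GPU memory, and the expanded form can lose some accuracy in single precision compared with subtracting first and squaring afterwards, so max_rel_err is worth re-checking after casting the inputs to single.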