I am trying to write (or find code for) a very high performance fft routine (I am specifically interested in the N=256 case).
After benchmarking MATLAB's fft routine I noticed that it runs slower than I would have expected. Using the methodology of fftw webpage (http://www.fftw.org/speed/), and counting flops as 5*N*log2(N) for a complex fft of length power of 2, I benchmarked matlab's fft vs matlab's matrix multiplication as follows:
For N>=256, the fft is about 3x slower than matrix multiplication. Does anyone know what package MATLAB uses for fft? does it use fftw, intel mkl or an internal package? Has anyone gotten noticable (factor of 2 or more) improvements using other fft packages within a mex file? - it seems like this should be possible given the above experiment (my cpu is 2.25ghz with support for sse instructions (2 flops in parallel) and 2fpu ports; I calculate the theoretical optimum performance at 2.25*2*2 = 9GFlops).
% code for generating benchmark of fft vs matrix multiplication Ns=2.^(6:11); GFlopsMM = nan(1,numel(Ns)); GFlopsFFT = nan(1,numel(Ns)); for s=1:numel(Ns) N = Ns(s); A=rand(N,N); B=rand(N,N);
NIter = Ns(end)/N; % run more iterations for smaller N for more accurate timing tic; for j=1:NIter, A'*B; end; t = toc;
A = A+sqrt(-1)*B; % benchmark complex fft tic; for j=1:NIter fft(A); end; t2=toc;
GFlopsMM(s) = NIter*2*size(A,1)^3/1e9/t; GFlopsFFT(s) = NIter*5*N*log2(N)*N/1e9/t2; % cost of N ffts of length N fprintf('N=%d, GFlopsMM=%3.3g, GFlopsFFT=%3.3g\n',N,GFlopsMM(s),GFlopsFFT(s)); end