I am taking the fft of each colomn of the NxN matrix rather than a single Nx1 vector to get more accurate timings.
Calling Nx1 fft N times would be slow in matlab (calling fft with an NxN matrix does this internally). Furthermore, if there are optimizations that can be done because we are computing the same length fft N times (pre-computing the twiddle factors for example), passing in the NxN matrix allows fftw to take advantage of that if it chooses.
I am taking this factor of N into account in my computations of the number of flops (see code).
I wrote my own C++ code to do the fft on powers of 2 (highly optimized using sse instructions and efficient memory access patterns) and noticed about a 1.5-1.7x speedup over MATLAB's fft (for N in 128 to 1024 range), achieving about 4.5GFlops on my cpu (compare to above experiment where matlabs fft achieves 3GFlops at peak). I am working on an assembly implementation for x64 windows right now. It seems that fftw isn't as optimized as it could be (unless the interfacing with MATLAB is innefficient). I also noticed that fftw was written in c++ while most good matrix multiplication routines are written in assembly and achieve much closer to the theoretical optimum flop count.
Does anyone have experience with another fft library (like intel MKL) that shows significant improvements over fftw?
dpb <email@example.com> wrote in message <firstname.lastname@example.org>... > On 3/29/2013 12:40 PM, dpb wrote: > ... > > > Your above A array on which you compute the fft is a 2D array of NxN > > which is an N/2-length column-wise complex by N columns. > > > > The ML vectorized FFT routine computes a separate FFT for each column > > when passed an array. So, you have computed N complex FFTs each of > > length N/2 complex values. > ... > > Take it back...wasn't thinking clearly. After the sum it _is_ N complex > in length but is still N wide. You want only > > A=complex(A(:,1),B(:,1)); > > to be of the proper size for FFT to be consistent w/ the link timings. > > Although I didn't look at the actual source code of the benchmark, their > verbiage indicates they actually compute on a zero-ed out array every > time, though. > > --