"Matt J" wrote in message <email@example.com>...
> "Bruno Luong" <firstname.lastname@example.org> wrote in message
>
> That doesn't explain why the 3rd version was the slowest. The 3rd version uses a 10+10 tensorial operation so since 10+10 << 10*10, you would expect the 3rd version to be faster (or comparable to) the others.
This is an independent question, and a little OT. But here are a few elements of explanation:
I imagine that, on the implementation side, the first and second input arguments of CONV2 are not treated symmetrically. There must be an outer loop over one argument and an inner loop over the other (1st/2nd, or the opposite). Also, the sum is carried out over a direct or flipped memory arrangement of the arguments. That can make a huge difference, especially since the computer's cache system is not symmetric with respect to the order in which memory is read.
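To make the asymmetry concrete, here is a rough sketch of what such a loop nest could look like. It is only an illustration, not the actual CONV2 source, which I have not seen. The name naive_conv2 is made up, and putting the second argument in the outer loops is an assumption.

```matlab
function C = naive_conv2(A, B)
% NAIVE_CONV2  Illustrative full 2-D convolution (hypothetical helper).
% NOT the real CONV2 implementation; the loop order is an assumption.
[ma, na] = size(A);
[mb, nb] = size(B);
C = zeros(ma + mb - 1, na + nb - 1);
for q = 1:nb                      % outer loops: elements of the 2nd argument
    for p = 1:mb
        % inner work: add a scaled copy of the 1st argument, shifted by
        % (p-1, q-1), into the output
        rows = p:(p + ma - 1);
        cols = q:(q + na - 1);
        C(rows, cols) = C(rows, cols) + B(p, q) * A;
    end
end
end
```

In this sketch, a large A with a small B gives a few iterations that each sweep long contiguous column runs. Swapping the arguments gives many iterations that each touch only a tiny block. The result is the same either way, but the memory access pattern is very different.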
The lesson here is that one should put the large array as the first argument of conv/conv2, which is probably what happens in the majority of cases in practice.
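A quick way to check this on your own machine is something like the script below. The array sizes are arbitrary choices for illustration, and the timings will depend on the MATLAB version and the hardware, so treat it as a sketch rather than a benchmark.

```matlab
% Compare the two argument orders of conv2 (sizes chosen arbitrarily).
A = rand(2000);    % "large" array
K = rand(10);      % "small" kernel

tic; C1 = conv2(A, K); t1 = toc;   % large array as first argument
tic; C2 = conv2(K, A); t2 = toc;   % small kernel as first argument

fprintf('large first: %.3f s, small first: %.3f s\n', t1, t2);

% Convolution is commutative, so the two results should agree
% up to rounding error.
fprintf('max difference: %g\n', max(abs(C1(:) - C2(:))));
```

Running each call a few times and taking the best time gives more stable numbers than a single tic/toc pair.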