On Wed, 2008-05-07 at 01:45 +0300, Jussi Laako wrote:
Fons Adriaensen wrote:
> Which will determine performance for every algorithm that
>
> - is working on a data set that is larger than the cache,
> - does not produce multiple results from the same inputs.
-<snip>-
There are several use cases where the data set is rather small and is
used in several subsequent loops, so the cache can help.
After profiling, I've identified a number of algorithms which
significantly benefit from handwritten vectorized asm.
One thing that I wonder is what the pattern of addition in Fons's
application really looks like. I assume the fftA * fftB is some windowed
precalculated impulse response and a signal? The addition/accumulate
suggests that the output fftD has been touched before, implying that
there could be more variables to work on at once or that the vectors
would still be in the cache if the order of addition was changed.
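
To make the reordering idea concrete, here is a minimal sketch in C. It is
only my guess at the pattern, not Fons's actual code: I assume several
products a[p] * b[p] are accumulated into one output d, and the names
cmac, cmac_blocked, parts and block are made up for illustration. Instead
of sweeping the whole of d once per product, it walks d in blocks and adds
every product into a block while that block is still in the cache.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: d[i] += a[i] * b[i] for i in [0, n). */
static void cmac(float complex *d, const float complex *a,
                 const float complex *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] += a[i] * b[i];
}

/*
 * Accumulate 'parts' products into d, one block of d at a time.
 * Assumes block > 0 and that every a[p] and b[p] holds n elements.
 * The result matches calling cmac() once per product over the full
 * length; only the order in which memory is touched changes.
 */
static void cmac_blocked(float complex *d,
                         const float complex *const *a,
                         const float complex *const *b,
                         size_t parts, size_t n, size_t block)
{
    for (size_t start = 0; start < n; start += block) {
        size_t len = (n - start < block) ? n - start : block;
        for (size_t p = 0; p < parts; p++)
            cmac(d + start, a[p] + start, b[p] + start, len);
    }
}
```

Whether this helps depends on the block size relative to the cache and on
how many products share the same output, so it would need measuring.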
/j
PS: Your fastest calculation is when the data floods the cache:
N=(1024*1024), n=1000, gcc, clock: 8410 ms (_Complex). Is that a typo?