Jens M Andreasen wrote:
> PS: Your fastest calculation is when the data floods
> the cache:
> N=(1024*1024), n=1000, gcc, clock: 8410 ms (_Complex).
> Is that a typo?
Nope, that's the actual result. I just verified the settings, recompiled,
and re-ran, and it's still:
clock: 8390 ms (_Complex)
clock: 9310 ms (cvec_t)
clock: 8480 ms (original float array[N][2])
clock: 10550 ms (asm on float array)
Fast memory bus + prefetch is a really good thing...
I also have a vectorized float array copy, and it's significantly faster
than memcpy(). While memcpy() stays under 1 GB/s, the vectorized version
can reach around 90% of the theoretical memory bandwidth for large copies.
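
For illustration, here is a minimal sketch of what such a vectorized copy
could look like. It is not the routine behind the numbers above: it assumes
gcc on x86 with SSE, the name copy_floats_vec() is made up for this example,
and it falls back to a plain scalar loop when SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper (name is an assumption, not from the original code):
 * copy n floats from src to dst, 16 floats (64 bytes) per iteration.
 * The buffers must not overlap. */
static void copy_floats_vec(float *dst, const float *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads/stores, so no alignment is required of the caller */
        __m128 a = _mm_loadu_ps(src + i);
        __m128 b = _mm_loadu_ps(src + i + 4);
        __m128 c = _mm_loadu_ps(src + i + 8);
        __m128 d = _mm_loadu_ps(src + i + 12);
        _mm_storeu_ps(dst + i,      a);
        _mm_storeu_ps(dst + i + 4,  b);
        _mm_storeu_ps(dst + i + 8,  c);
        _mm_storeu_ps(dst + i + 12, d);
    }
#endif
    /* Scalar tail (and the whole copy when SSE is not compiled in) */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Calling copy_floats_vec(dst, src, N) copies N floats. For copies much
larger than the cache, a variant with 16-byte aligned buffers and
non-temporal stores (_mm_stream_ps) may do better, since it avoids pulling
the destination into the cache, but that is a guess to be measured, not a
result from the runs above.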
- Jussi