Fons Adriaensen wrote:
> Which will determine performance for every algorithm that
> - is working on a data set that is larger than the cache,
> - does not produce multiple results from the same inputs.
Here are some results with empty() included...
N=1024, n=1000000, gcc:
clock: 16500 ms (_Complex)
clock: 26760 ms (cvec_t)
clock: 15820 ms (original float array[N][2])
clock: 13700 ms (asm on float array)
N=(1024*1024), n=1000, gcc:
clock: 8410 ms (_Complex)
clock: 9360 ms (cvec_t)
clock: 8500 ms (original float array[N][2])
clock: 10540 ms (asm on float array)
And if I remove "-fprefetch-loop-arrays", it degrades to:
clock: 12800 ms (_Complex)
clock: 10010 ms (cvec_t)
clock: 13800 ms (original float array[N][2])
clock: 10510 ms (asm on float array)
And the non-vectorized version ("normal x86-64 code"):
clock: 12840 ms (_Complex)
clock: 22830 ms (cvec_t)
clock: 12880 ms (original float array[N][2])
clock: 10470 ms (asm on float array)
The asm code I used doesn't include prefetch instructions, because the
data sets I work on at once are smaller. Vectorization improves the
cvec_t layout case significantly.
> It is safe now, but with such a small data size the code is still not
> representative of real life use of a very simple operation such as a
> MAC loop. In practice you also have to generate the data and use the
There are several use cases where the data set is rather small and is
reused in several subsequent loops, so the cache can help.
After profiling, I've identified a number of algorithms that
significantly benefit from handwritten vectorized asm.
BR,
- Jussi