[LAD] vectorization

Jussi Laako jussi at sonarnerd.net
Tue May 6 22:45:26 UTC 2008


Fons Adriaensen wrote:
> Which will determine performance for every algorithm that
> 
> - is working on a data set that is larger than the cache,
> - does not produce multiple results from the same inputs. 

Here are some results with empty() included...

N=1024, n=1000000, gcc:
 > clock: 16500 ms (_Complex)
 > clock: 26760 ms (cvec_t)
 > clock: 15820 ms (original float array[N][2])
 > clock: 13700 ms (asm on float array)

N=(1024*1024), n=1000, gcc:
 > clock: 8410 ms (_Complex)
 > clock: 9360 ms (cvec_t)
 > clock: 8500 ms (original float array[N][2])
 > clock: 10540 ms (asm on float array)

And if I remove "-fprefetch-loop-arrays", it degrades to:
 > clock: 12800 ms (_Complex)
 > clock: 10010 ms (cvec_t)
 > clock: 13800 ms (original float array[N][2])
 > clock: 10510 ms (asm on float array)

And non-vectorized version ("normal x86-64 code"):
 > clock: 12840 ms (_Complex)
 > clock: 22830 ms (cvec_t)
 > clock: 12880 ms (original float array[N][2])
 > clock: 10470 ms (asm on float array)

The asm code I used doesn't include prefetch instructions, because the 
data sets I work on at any one time are smaller. Vectorization improves 
the cvec_t layout case significantly (22830 ms non-vectorized versus 
9360 ms vectorized).
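
For readers without the benchmark source at hand, the kind of loop being 
timed can be sketched roughly as below. This is only an illustration, not 
the code that produced the numbers above: the function names 
(mac_interleaved, mac_complex, elapsed_ms) are made up for this sketch, 
the timing helper merely assumes the "clock:" figures come from clock(), 
and cvec_t is left out because its definition does not appear in this 
message.

```c
#include <complex.h>
#include <stddef.h>
#include <time.h>

/* Complex multiply-accumulate over interleaved data, i.e. the
 * "float array[N][2]" layout: x[i][0] is the real part and
 * x[i][1] the imaginary part. (Hypothetical name.) */
static void mac_interleaved(float (*a)[2], float (*b)[2],
                            size_t n, float acc[2])
{
    float re = 0.0f, im = 0.0f;
    for (size_t i = 0; i < n; i++) {
        re += a[i][0] * b[i][0] - a[i][1] * b[i][1];
        im += a[i][0] * b[i][1] + a[i][1] * b[i][0];
    }
    acc[0] = re;
    acc[1] = im;
}

/* The same operation on C99 _Complex data. (Hypothetical name.) */
static float complex mac_complex(const float complex *a,
                                 const float complex *b, size_t n)
{
    float complex acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Convert two clock() readings to milliseconds of CPU time.
 * (Assumption: the benchmark measured with clock().) */
static double elapsed_ms(clock_t start, clock_t end)
{
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

A run would take one clock() reading before and one after calling a 
MAC function n times on arrays of length N, then pass both readings to 
elapsed_ms(). Whether gcc vectorizes either loop depends on the compiler 
version and flags, so the sketch says nothing about which layout is faster.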

> It is safe now, but with such a small data size the code is still not
> representative of real life use of a very simple operation such as a
> MAC loop. In practice you also have to generate the data and use the

There are several use cases where the data set is rather small and is 
used in several subsequent loops, so the cache can help.

After profiling, I've identified a number of algorithms which 
benefit significantly from handwritten vectorized asm.


BR,

	- Jussi



