On Sat, 2008-04-19 at 00:30 +0300, Jussi Laako wrote:
For simple operations, compilers are rather good at vectorization.
Though I don't know if there's any support for multi-arch targets in
gcc, so that an SSE2/SSE3-optimized binary would also run on hardware
without SSE (dynamic code selection)? I haven't had time to follow the
latest gcc developments.
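On the dynamic code selection question: one way to do it by hand is a runtime CPU check that picks a function pointer once at startup. Below is a minimal sketch, assuming a GCC recent enough to provide the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins (4.3 does not have them). The names `scale_generic`, `scale_sse2` and `pick_scale` are made up for illustration, and in a real build the SSE2 variant would sit in its own file compiled with `-msse2`.

```c
#include <stddef.h>

/* Hypothetical kernels. Both are plain C here; in a real build the second
   one would be compiled separately with -msse2 so only it uses SSE2. */
static void scale_generic(float *x, size_t n, float g)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= g;
}

static void scale_sse2(float *x, size_t n, float g)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= g;
}

typedef void (*scale_fn)(float *, size_t, float);

/* Pick an implementation at run time. Falls back to the generic version
   when the builtins are unavailable (older gcc, non-x86 targets). */
static scale_fn pick_scale(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__)) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return scale_sse2;
#endif
    return scale_generic;
}
```

The caller would do `scale_fn scale = pick_scale();` once and then call `scale(buf, n, gain)` in the audio path, so the check is not repeated per block.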
I tried slightly rewriting a "moog filter" to calculate 4 voices in
parallel instead of one, by declaring all scalars as 4-element arrays
and then looping over them, as in:
    /* float r1,r2,r3,r4; */
    float r1[4],r2[4],r3[4],r4[4]; // tmp
    ...
    for(int i = 0; i < 4; ++i)
    {
        ...
        r1[i] = b[1][i];
        b[1][i] = r3[i] = p[i] * (r4[i] + r2[i]) - r1[i] * f[i];
        ...
    }
This strategy fails to auto-vectorize with gcc 4.3 but works with icc
10.1, where it almost quadruples throughput. Breaking the filter up into
separate, smaller functions helped get rid of the confusion about which
loops should be the inner and outer ones. The functions are inlined
anyway.
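To make the loop structure concrete, here is a minimal, self-contained sketch of the same layout. It is not the moog filter itself but a plain one-pole low-pass standing in for it, and the names `lp4_state`, `lp4_step` and `lp4_process` are hypothetical. The point is only the shape: a small inlined function whose loop over the 4 voices is the inner loop, and a separate outer loop over time.

```c
#include <stddef.h>

#define NVOICES 4

/* Hypothetical state: one-pole low-pass, 4 voices kept side by side. */
typedef struct {
    float z[NVOICES];   /* filter memory, one entry per voice */
    float a[NVOICES];   /* smoothing coefficient in [0,1], one per voice */
} lp4_state;

/* One sample step for all voices. The loop over i is the inner loop,
   the candidate for becoming a single 4-wide vector operation. */
static inline void lp4_step(lp4_state *s, const float in[NVOICES],
                            float out[NVOICES])
{
    for (int i = 0; i < NVOICES; ++i) {
        s->z[i] += s->a[i] * (in[i] - s->z[i]);
        out[i] = s->z[i];
    }
}

/* Outer loop over time. Samples are interleaved: in[n * NVOICES + voice]. */
static void lp4_process(lp4_state *s, const float *in, float *out,
                        size_t nframes)
{
    for (size_t n = 0; n < nframes; ++n)
        lp4_step(s, in + n * NVOICES, out + n * NVOICES);
}
```

Whether the inner loop actually gets vectorized depends on the compiler and flags, so it is worth checking the generated code rather than assuming it.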
For applications that look like a bunch of identical channel strips,
this should be pretty useful. "Buy one and get three for free!" :-D
So it is not all science fiction, but gcc is not quite there yet.