On 07/26/2011 03:15 AM, Maurizio De Cecco wrote:
So are you now
considering using some #ifdef to select float/4 instead of
double/8 vectors in jMax, or just changing all of them?
Well, at the moment on gcc the performance with vector types is the same
as without vector types, so I'll leave the Linux version without vector
types (the code is #ifdef'ed).
When I was playing around with this last night... the best performance
came from your non-optimized, non-vectorized code.
Why?
Because GCC translated it to optimized, vectorized code.
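For reference, the kind of plain loop that GCC's auto-vectorizer tends to handle well looks roughly like the sketch below. This is not the actual jMax code: the name `add3_scalar` is made up for illustration, and it assumes you build with `-O3` (or `-O2 -ftree-vectorize`) and that the three buffers never overlap, which is what `restrict` promises.

```c
#include <stddef.h>

/* Hypothetical scalar version: out[i] = a[i] + b[i].
 * No vector types are used; any SIMD comes from the compiler.
 * restrict tells the compiler the buffers do not overlap, so it can
 * vectorize the loop without emitting runtime overlap checks. */
void add3_scalar(const float *restrict a, const float *restrict b,
                 float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the loop really gets vectorized depends on the compiler version and flags; recent GCC versions can report it with `-fopt-info-vec`.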
By the way, I forgot to mention that all my tests
were at 64 bits;
I'll try later on a 32-bit Ubuntu.
I was on 32 bit Ubuntu. Also, with GCC the 64-bit optimizer is known to
be better at optimising SIMD code.
Because I'm a sucker for these kinds of diversions, I came up with a
scheme that shaved about 1 second off your test (on my machine). It
assumes that `vecsize` is a power of two (and at least 16, since each
pass handles 16 floats). The idea is to hold the values in processor
registers and walk each buffer one cache line at a time (a cache line
is 64 bytes on x86... 16 floats).
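/* Assumed definition of v4sf (GCC vector extension): 4 packed floats. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* arg2[i] = arg0[i] + arg1[i]; vecsize is the number of floats. */
static inline void add3_vec(float * restrict arg0, float * restrict arg1,
                            float * restrict arg2, unsigned int vecsize)
{
    v4sf *v0, *v1, *v2;
    v4sf c0, c1, c2, c3, c4, c5, c6, c7;
    /* vectors per cache line: 4 vectors * 16 bytes = 64 bytes */
    const unsigned cache_size = 4;

    v0 = (v4sf*)arg0;
    v1 = (v4sf*)arg1;
    v2 = (v4sf*)arg2;

    /* number of 16-float blocks; any remainder is not processed */
    vecsize /= 4*cache_size;

    while(vecsize--) {
        /* load one cache line from each input into registers */
        c0 = *v0++;
        c1 = *v0++;
        c2 = *v0++;
        c3 = *v0++;
        c4 = *v1++;
        c5 = *v1++;
        c6 = *v1++;
        c7 = *v1++;
        /* add and store one cache line of output */
        *v2++ = c0 + c4;
        *v2++ = c1 + c5;
        *v2++ = c2 + c6;
        *v2++ = c3 + c7;
    }
}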
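
The block-at-a-time idea leans on two preconditions: the buffers must be 16-byte aligned (a `v4sf` dereference assumes that), and the length must be a multiple of 16 floats. A more defensive variant might look like the sketch below. The name `add3_blocks`, the alignment asserts, the `may_alias` attribute and the scalar tail loop are all additions for illustration, not part of the routine above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang vector extension: four packed floats (16 bytes).
 * may_alias makes it explicit that float buffers may be accessed
 * through this type. */
typedef float v4sf __attribute__((vector_size(16), may_alias));

/* Illustrative variant: out[i] = a[i] + b[i] for n floats, using
 * 16-float (64-byte) blocks where possible and a scalar tail. */
void add3_blocks(const float *restrict a, const float *restrict b,
                 float *restrict out, size_t n)
{
    /* Dereferencing a v4sf pointer assumes 16-byte alignment. */
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    const v4sf *va = (const v4sf *)a;
    const v4sf *vb = (const v4sf *)b;
    v4sf *vo = (v4sf *)out;
    size_t blocks = n / 16;   /* whole 16-float blocks */

    for (size_t k = 0; k < blocks; k++) {
        /* load one 64-byte block from each input */
        v4sf c0 = va[0], c1 = va[1], c2 = va[2], c3 = va[3];
        v4sf c4 = vb[0], c5 = vb[1], c6 = vb[2], c7 = vb[3];
        /* add and store one 64-byte block of output */
        vo[0] = c0 + c4;
        vo[1] = c1 + c5;
        vo[2] = c2 + c6;
        vo[3] = c3 + c7;
        va += 4;
        vb += 4;
        vo += 4;
    }

    /* scalar loop for the floats that do not fill a whole block */
    for (size_t i = blocks * 16; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Buffers declared with `__attribute__((aligned(16)))` or obtained from C11 `aligned_alloc(16, ...)` satisfy the alignment checks. Whether this beats the compiler's own vectorization of the plain loop is something to measure, not assume.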
-gabriel