On Thu, 2008-02-07 at 18:51 +0100, Malte Steiner wrote:
> Hello,
> I try to squeeze as much performance as possible out of my upcoming
> Linux synthesizer and try manual vectorization with the following construct
> in C, mainly to vectorize away multiplications:
Have you checked the SIMD code in Ardour? We have SIMD code for crucial
DSP. The functions we use are:
(pure ASM, defined in libs/ardour/sse_functions.s or _64bit.s)
mix_buffers_with_gain(float *dst, float *src, long nframes, float gain);
mix_buffers_no_gain(float *dst, float *src, long nframes);
apply_gain_to_buffer(float *buf, long nframes, float gain);
float compute_peak(float *buf, long nframes, float current);
(xmmintrin, defined in libs/ardour/sse_functions_xmm.cc)
find_peaks(float *buf, nframes_t nframes, float *min, float *max);
When I wrote the code, I was unable to get better results from the GCC
vectorizer. From what I've heard, it's supposed to be getting better.
Until it does, we are using the above code.
Note especially the xmmintrin syntax. It's a brilliant way of doing
pseudo-assembler. It gives you the power of direct XMM (SIMD) register
access and direct SIMD calls.
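
To give an idea of what the xmmintrin style looks like, here is a minimal sketch in C of a min/max scan in the spirit of find_peaks(). It is not the Ardour source: the name find_peaks_sketch, the use of size_t, and the choice of unaligned loads (_mm_loadu_ps) to keep the example short are all assumptions made for illustration.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_min_ps, ... */

/* Hypothetical sketch, not Ardour's find_peaks(): widens *min and *max
 * so that they also cover every sample in buf. */
static void find_peaks_sketch(const float *buf, size_t nframes,
                              float *min, float *max)
{
    __m128 vmin = _mm_set1_ps(*min);  /* four copies of the running minimum */
    __m128 vmax = _mm_set1_ps(*max);  /* four copies of the running maximum */
    size_t i = 0;

    /* Main loop: four samples per iteration, unaligned loads for simplicity. */
    for (; i + 4 <= nframes; i += 4) {
        __m128 v = _mm_loadu_ps(buf + i);
        vmin = _mm_min_ps(vmin, v);
        vmax = _mm_max_ps(vmax, v);
    }

    /* Reduce the four lanes to one scalar each. */
    float lo[4], hi[4];
    _mm_storeu_ps(lo, vmin);
    _mm_storeu_ps(hi, vmax);
    float fmin = lo[0], fmax = hi[0];
    for (int k = 1; k < 4; ++k) {
        if (lo[k] < fmin) fmin = lo[k];
        if (hi[k] > fmax) fmax = hi[k];
    }

    /* Scalar tail for the last nframes % 4 samples. */
    for (; i < nframes; ++i) {
        if (buf[i] < fmin) fmin = buf[i];
        if (buf[i] > fmax) fmax = buf[i];
    }

    *min = fmin;
    *max = fmax;
}
```

Each intrinsic maps more or less directly onto one SSE instruction (minps, maxps, movups), which is why this style reads like assembler while still leaving register allocation to the compiler.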
compute_peak() returns the largest absolute sample value in buf, or
current if that is larger (i.e. return max(max(abs(buf)), current)). The
function we have is orders of magnitude faster than anything GCC can come
up with from generic C code. This is partly because we are using 16-byte
aligned buffers and mostly because we can cheat and not run a true ABS
function, but a bit masking operation, which works for audio data as there
are no infinities or NaNs in it.
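
The bit masking trick can be sketched with intrinsics as follows. This is an illustration only, not the assembler from sse_functions.s: the name compute_peak_sketch is made up, the sign bit is cleared with _mm_andnot_ps against -0.0f (one common way to do it, assumed here), and unaligned loads are used to keep the example short.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical sketch: returns max(max(|buf[i]|), current). */
static float compute_peak_sketch(const float *buf, size_t nframes,
                                 float current)
{
    /* -0.0f has only the sign bit set, so (~signbit & v) clears the
     * sign bit of every lane, which is |v| for ordinary sample values. */
    const __m128 signbit = _mm_set1_ps(-0.0f);
    __m128 vpeak = _mm_set1_ps(current);
    size_t i = 0;

    /* Four samples per iteration: mask off the sign, keep the maximum. */
    for (; i + 4 <= nframes; i += 4) {
        __m128 v = _mm_loadu_ps(buf + i);
        v = _mm_andnot_ps(signbit, v);
        vpeak = _mm_max_ps(vpeak, v);
    }

    /* Reduce the four lanes to a single peak value. */
    float lanes[4];
    _mm_storeu_ps(lanes, vpeak);
    float peak = lanes[0];
    for (int k = 1; k < 4; ++k) {
        if (lanes[k] > peak) peak = lanes[k];
    }

    /* Scalar tail for the last nframes % 4 samples. */
    for (; i < nframes; ++i) {
        float a = buf[i] < 0.0f ? -buf[i] : buf[i];
        if (a > peak) peak = a;
    }
    return peak;
}
```

The mask replaces a compare-and-negate per sample with a single bitwise operation on four samples at once.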
All functions work with aligned and non-aligned data. With non-aligned
data, they run one sample at a time until they reach 16-byte alignment and
then continue four samples at a time.
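
The alignment handling described above might look roughly like this for a single-buffer gain function. Again a sketch under assumptions, not the Ardour code: the name apply_gain_sketch and the size_t parameter are hypothetical, and the scalar tail loop for leftover samples is my addition.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical sketch: multiplies every sample in buf by gain. */
static void apply_gain_sketch(float *buf, size_t nframes, float gain)
{
    size_t i = 0;

    /* Scalar head: one sample at a time until buf + i is 16-byte aligned. */
    while (i < nframes && ((uintptr_t)(buf + i) & 15u) != 0) {
        buf[i] *= gain;
        ++i;
    }

    /* Aligned body: four samples at a time with aligned loads and stores. */
    const __m128 vgain = _mm_set1_ps(gain);
    for (; i + 4 <= nframes; i += 4) {
        __m128 v = _mm_load_ps(buf + i);
        _mm_store_ps(buf + i, _mm_mul_ps(v, vgain));
    }

    /* Scalar tail: whatever is left when fewer than four samples remain. */
    for (; i < nframes; ++i) {
        buf[i] *= gain;
    }
}
```

With two buffers (as in the mix functions) both pointers have to reach alignment at the same time for the aligned path to apply, so a real implementation needs an extra check or an unaligned fallback.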
Sampo