On Mon, 2007-04-09 at 16:55 +0200, Tim Blechmann wrote:
Hand-written assembler is still many times faster than what gcc is
capable of doing. In Ardour peak computation (for both metering and
waveform displaying) is written in SSE (the first part in pure assembly,
the second in a C-level abstraction which is almost 1:1 assembly). Both
functions are more than 20x faster in raw performance than what gcc 4.1
can do.
btw, is there a reason, why ardour is using assembler code instead of
compiler intrinsics?
The first batch of SSE was written in pure assembly for two reasons: a)
i had fun learning it, and b) i failed to get xmmintrin.h working but
had asm working.
The second batch (find_peaks, for displaying waveforms) is done with
xmmintrin.h as i finally figured out how to use it :)
beside that, if ardour is using a fixed block size, using compile-time
loop unrolling would be another point where one could gain speed (iirc,
the micro-benchmarks i did for pnpd/nova indicated an additional
performance boost around 40%) ...
Ardour is not using a fixed block size, it uses the block size from
jackd. Second, Ardour does sample-accurate event handling, which means
that buffers might be divided up in any possible way so we must have
code which will work for non-aligned buffers and numbers of frames which
are not divisible by 4. (Alignment here means the 16-byte alignment which
aligned x86 SSE loads and stores require; at 4 bytes per sample, 16 bytes
= 4 samples.)
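To make the alignment arithmetic concrete, here is a minimal sketch of how the number of leading "unaligned" samples could be computed. The helper name samples_until_aligned is made up for illustration and is not taken from Ardour; it assumes 4-byte floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from Ardour): how many float samples must be
 * processed one at a time before buf reaches a 16-byte boundary.
 * Assumes sizeof(float) == 4, so the result is always 0..3. */
static size_t samples_until_aligned(const float *buf)
{
    uintptr_t misalign = (uintptr_t)buf & 15u; /* bytes past the last boundary */

    /* A valid float pointer is at least 4-byte aligned. */
    assert(misalign % sizeof(float) == 0);

    return misalign ? (size_t)((16u - misalign) / sizeof(float)) : 0;
}
```

A buffer that starts 4 bytes past a boundary needs 3 scalar steps, one that starts 12 bytes past needs 1, and an already aligned buffer needs none.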
The find_peaks algo works like this:
1) run one sample at a time until we reach alignment
2) run buffer in quads of quads (64 bytes or 16 samples in one loop)
while there are >= 16 samples left
3) run buffer in quads (16 bytes or 4 samples in one loop)
while there are >= 4 samples left
4) run one sample at a time until we run out of samples
So we have "conservative dynamic unrolling" :)
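The four steps above can be sketched with xmmintrin.h roughly like this. It is not the actual Ardour code: the name find_peaks_sketch, its signature and the way the four lanes are reduced at the end are assumptions made for illustration, and prefetching is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics, x86 only */

/* Hypothetical signature: widens *min / *max so that they cover
 * buf[0 .. nframes-1]. The caller initialises *min and *max. */
static void find_peaks_sketch(const float *buf, size_t nframes,
                              float *min, float *max)
{
    assert(buf != NULL || nframes == 0);

    float lo = *min, hi = *max;
    float lanes[4];
    int i;

    /* 1) one sample at a time until buf is 16-byte aligned */
    while (nframes > 0 && ((uintptr_t)buf & 15u) != 0) {
        if (*buf < lo) lo = *buf;
        if (*buf > hi) hi = *buf;
        ++buf; --nframes;
    }

    __m128 vlo = _mm_set1_ps(lo);
    __m128 vhi = _mm_set1_ps(hi);

    /* 2) quads of quads: 64 bytes = 16 samples per iteration */
    while (nframes >= 16) {
        __m128 a = _mm_load_ps(buf);
        __m128 b = _mm_load_ps(buf + 4);
        __m128 c = _mm_load_ps(buf + 8);
        __m128 d = _mm_load_ps(buf + 12);
        vlo = _mm_min_ps(vlo, _mm_min_ps(_mm_min_ps(a, b), _mm_min_ps(c, d)));
        vhi = _mm_max_ps(vhi, _mm_max_ps(_mm_max_ps(a, b), _mm_max_ps(c, d)));
        buf += 16; nframes -= 16;
    }

    /* 3) single quads: 16 bytes = 4 samples per iteration */
    while (nframes >= 4) {
        __m128 a = _mm_load_ps(buf);
        vlo = _mm_min_ps(vlo, a);
        vhi = _mm_max_ps(vhi, a);
        buf += 4; nframes -= 4;
    }

    /* fold the four lanes of each vector back into one scalar */
    _mm_storeu_ps(lanes, vlo);
    for (i = 0; i < 4; ++i) if (lanes[i] < lo) lo = lanes[i];
    _mm_storeu_ps(lanes, vhi);
    for (i = 0; i < 4; ++i) if (lanes[i] > hi) hi = lanes[i];

    /* 4) one sample at a time until we run out of samples */
    while (nframes > 0) {
        if (*buf < lo) lo = *buf;
        if (*buf > hi) hi = *buf;
        ++buf; --nframes;
    }

    *min = lo;
    *max = hi;
}
```

Because step 1 guarantees alignment before any _mm_load_ps is issued, the same function can be called on any sub-range of a buffer, which is what the sample-accurate splitting needs.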
But, the benefits here are quite small and very architecture dependent.
The AMD 64 bit processors benefit a lot more from unrolling and memory
prefetching than my Core 2 Duo (in 32-bit mode) does.
I don't have the numbers here, and any numbers i would give you would be
from a testbench which can only measure raw performance, not real world
use.
Sampo