On Wed, 2004-02-18 at 11:43, Steve Harris wrote:
> Ideally DSP software would be written so that it never generates them,
> but, erm, well, developers are lazy, y'know :)
Writing it that way would sometimes make it even slower.
As for the speed difference between Intel and AMD: when denormals occur,
Intel is 10-100x slower than AMD, depending on the routine. (The other
problem is the performance hit when memory is accessed at an address that
is not 16-byte aligned, which is about 2x slower than aligned access.)
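
To illustrate the alignment point, here is a minimal sketch of getting a
16-byte aligned float buffer on a POSIX system. The helper names
alloc_aligned16 and is_aligned16 are made up for this example, and
posix_memalign is only one of several ways to do it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* needed for posix_memalign in strict modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocates count floats on a 16-byte boundary; release with free().
 * Returns NULL on failure. */
float *alloc_aligned16(size_t count)
{
    void *p = NULL;

    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}
```

Note that stepping one float into such a buffer (buf + 1) is already
unaligned, so the alignment of the start of the buffer is not enough by
itself if a routine works on arbitrary offsets.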
However, there is a way to work around it.
> On the P4 it's possible that you can set some flags to use SSE
> instructions instead of 387 and tell the SSE unit to never produce
> denormals, but last time I tried it, gcc 3.something generated bad
> code (illegal instructions).
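
For reference, a rough sketch of what "tell the SSE unit to never produce
denormals" can look like with compiler intrinsics instead of hand-written
asm. It sets the flush-to-zero and denormals-are-zero bits in MXCSR. The
function name disable_denormals is made up for this example; the macros
come from the standard xmmintrin.h and pmmintrin.h headers. This assumes
an x86 CPU that supports the DAZ bit, and it only affects code that the
compiler actually emits as SSE (not 387) instructions.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Hypothetical helper: call once per thread before running DSP code.
 * MXCSR is per-thread state, so each audio thread needs its own call. */
void disable_denormals(void)
{
    /* FTZ: results that would be denormal are flushed to zero. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    /* DAZ: denormal inputs are treated as zero. */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

On 32-bit x86 with gcc, scalar float math only goes through the SSE unit
when built with something like -msse2 -mfpmath=sse; on x86-64 that is the
default.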
The best way is to code everything for SSE2, which has no denormal
problems (it can treat denormals as zero). Writing code this way (and
possibly using E3DNow) will also make it run fast on AMD
Opteron/Athlon64. The x86-64 architecture _really_ needs asm to get a
good kick out of it...
- Use SSE2
- Use prefetch
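
As a sketch of the two points above, here is a single-precision dot
product (the inner loop of one cross correlation lag) using SSE
intrinsics plus a prefetch hint. This is not the inline asm behind the
numbers in this mail; it uses intrinsics instead, and the function name
dot_sse and the prefetch distance of 64 floats are assumptions that
would need tuning on real hardware.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 header; also provides the SSE float intrinsics */

#define PREFETCH_AHEAD 64  /* in floats; a guess, tune per CPU */

/* Dot product of a[0..n-1] and b[0..n-1]. Inputs may be unaligned. */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    float sum;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Only prefetch addresses that are still inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            _mm_prefetch((const char *)(a + i + PREFETCH_AHEAD), _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i + PREFETCH_AHEAD), _MM_HINT_T0);
        }
        /* Four multiply-adds per iteration. */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    }

    /* Horizontal sum of the four partial sums. */
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

With 16-byte aligned buffers the unaligned loads (_mm_loadu_ps) could be
swapped for _mm_load_ps. Issuing a prefetch every four floats is more
often than a cache line needs, so a tuned version would unroll the loop
and prefetch once per line.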
Here are some numbers for my 2.4 GHz Celeron.
C++ (Intel C++ 8.x):
10831.111 us / cross correlation on 44100 samples (single precision)
1060.100 ms / 181 tap FIR filter of 65536 samples (single precision)
6.390 ms / 5 biquad IIR filter of 65536 samples (single precision)
SSE2 (inline asm):
597.222 us / cross correlation on 44100 samples (single precision)
23.800 ms / 181 tap FIR filter of 65536 samples (single precision)
2.840 ms / 5 biquad IIR filter of 65536 samples (single precision)
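
To put the two sets of numbers side by side, the speedup is just the C++
time divided by the SSE2 time for the same routine, taking care that both
are in the same unit (the cross correlation figures are in us, the rest
in ms). A trivial sketch, with speedup as a made-up helper name:

```c
/* Ratio of baseline time to optimized time; both must use the same unit. */
double speedup(double baseline_time, double optimized_time)
{
    return baseline_time / optimized_time;
}
```

Applied to the figures above, this gives about 18.1x for the cross
correlation, about 44.5x for the FIR filter and 2.25x for the biquad IIR
filter.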
--
Jussi Laako <jussi.laako(a)pp.inet.fi>