[linux-audio-dev] Traps in floating point code

Benno Senoner sbenno at gardena.net
Thu Jul 1 16:18:41 UTC 2004


Jens M Andreasen wrote:

>
>Why not just use modf?
>
>  double fullindex, increment, integer, fraction;
>  // int i;
>
>  fullindex += increment;
>  fraction   = modf(fullindex, &integer);
>  // i = integer;
>
>C99 have float and long double versions as well.
>  
>
The problem with modf is that it is slow (it generates "call modf", which
involves a subroutine call, even with -ffast-math), and the integer index
is still a double which needs to be converted to an int, so you need to
perform either a fistl or an lrint()/lrintf() as well.

I benchmarked my code against the modf() code and mine is 3 times faster.
As said, in my code the fract part might become 1.0 in some cases, or
even fall a bit below 0; for interpolation it still works perfectly
because of the continuity of polynomial interpolation.
In LinuxSampler all we want is fast interpolation, and the current code
we use is efficient.
We will explore the possibility of using fixed point indexes (int/fract
part, eg 16/16 bit); perhaps we will squeeze out a bit more. On the other
hand, it has the problem that if you need LFOs and other kinds of pitch
modulation you have to convert indexes from float/double to fixed point,
which might be a bit time-consuming, especially because we have sample
accurate modulation and envelopes. We will see what can be done; for now
the goal is to get a perfectly working sampler engine.
A 5-10% performance increase is not that important for now, especially
if fixed point indexes turn out to be a big PITA.
(I cannot say yet whether this is true or not; one has to implement and
benchmark the stuff within the whole sampler engine to get real numbers.
Synthetic benchmarks are not always enough because in a complex audio app
you do lots of other stuff besides resampling.)
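
For reference, a 16/16 fixed point index could look roughly like the
sketch below. All names (pitch_to_fixed, resample_fixed, FRAC_BITS) are
invented for this example, samples are assumed to be float, and it says
nothing about whether this is actually faster inside the real engine.
Note that pitch_to_fixed() is the float -> fixed conversion that would
have to run every time the pitch is modulated.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAC_BITS 16
#define FRAC_ONE  (1u << FRAC_BITS)
#define FRAC_MASK (FRAC_ONE - 1u)

/* Convert a floating point pitch to a 16.16 step (rounded to nearest).
 * Assumes 0 <= pitch < 65536. */
static uint32_t pitch_to_fixed(double pitch)
{
    return (uint32_t)(pitch * (double)FRAC_ONE + 0.5);
}

/* Linear interpolation driven by a 16.16 index. Returns the index after
 * the last frame. The caller must make sure in[] holds at least
 * (index >> 16) + 2 samples for every index that is reached. */
static uint32_t resample_fixed(const float *in, float *out, size_t nframes,
                               uint32_t index, uint32_t step)
{
    const float scale = 1.0f / (float)FRAC_ONE;
    size_t n;

    for (n = 0; n < nframes; n++) {
        uint32_t i = index >> FRAC_BITS;                  /* integer part */
        float fract = (float)(index & FRAC_MASK) * scale; /* 0.0 .. <1.0  */
        out[n] = in[i] + fract * (in[i + 1] - in[i]);
        index += step;
    }
    return index;
}
```

Here the integer/fract split is just a shift and a mask, but with only 16
integer bits the index wraps after 65536 samples, so a real sampler would
need a wider integer part or a per-block base offset.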


regarding SSE/SSE2:
I performed various benchmarks (pure C/C++) with gcc 3.3/3.4 and the
latest intel compiler with SSE vectorization optimizations:
the fact is, as Eric said, that SSE/SSE2 is slower than the regular FPU
in some cases.
Resampling (polynomial, eg linear, cubic) seems to be such a case. Not
much can be parallelized (icc is not able to vectorize anything in the
code below), so you either use an alternative algorithm handcrafted for
SSE or you are better off staying with the regular FPU.
I'm not sure whether it is possible to achieve decent speed increases
with SSE; perhaps it would help to keep track of 4 indexes,
eg.

double fullindex1, fullindex2, fullindex3, fullindex4;
double fract1, fract2, fract3, fract4;
int intindex1, intindex2, intindex3, intindex4;
double pitch;  // eg 1.0 plays the audio at normal speed

for(...) {
// fullindex1 is the official index (incremented by pitch at each
// iteration), fullindex2,3,4 are the indexes needed for the next 3 samples

// can the following 3 lines be parallelized ?

  fullindex2 = fullindex1 + pitch;
  fullindex3 = fullindex1 + 2.0 * pitch;
  fullindex4 = fullindex1 + 3.0 * pitch;

// AFAIK SSE2 can do 2 double_to_int with one instruction
// (see CVTPD2PI, http://folk.uio.no/botnen/intel/vt/reference/vc50.htm),
// so at least a 2x speedup would be achieved (only on P4+ CPUs).
// Athlon XP does not support SSE2, and SSE can only do 2 float to int
// (see http://folk.uio.no/botnen/intel/vt/reference/vc57.htm)

  intindex1 = double_to_int(fullindex1); 
  intindex2 = double_to_int(fullindex2);
  intindex3 = double_to_int(fullindex3);
  intindex4 = double_to_int(fullindex4);

// can be parallelized using SSE  (doing 4 ops per instruction)
  fract1 = fullindex1 - intindex1;
  fract2 = fullindex2 - intindex2;
  fract3 = fullindex3 - intindex3;
  fract4 = fullindex4 - intindex4;

// can be parallelized using SSE
  outputsamplebuf[0] = samplebuf[intindex1] + fract1 *
      (samplebuf[intindex1 + 1] - samplebuf[intindex1]);
  outputsamplebuf[1] = samplebuf[intindex2] + fract2 *
      (samplebuf[intindex2 + 1] - samplebuf[intindex2]);
  outputsamplebuf[2] = samplebuf[intindex3] + fract3 *
      (samplebuf[intindex3 + 1] - samplebuf[intindex3]);
  outputsamplebuf[3] = samplebuf[intindex4] + fract4 *
      (samplebuf[intindex4 + 1] - samplebuf[intindex4]);

  // increase fullindex1 by 4 times pitch because we processed 4 samples
  fullindex1 += 4.0 * pitch;

}

The disadvantage is that you must keep pitch constant for 4 samples, but
this is not a big problem.
(In theory we could add a different pitch value to each of the
fullindex1-4 variables, so it would not be hard to lift that
restriction.)
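
As a rough illustration of the double_to_int and fract steps, here is how
two indexes at a time might be handled with SSE2 intrinsics. This is only
a sketch under assumptions: interp_pair is a made-up name, samplebuf is
assumed to be float, and it uses the truncating CVTTPD2DQ
(_mm_cvttpd_epi32) instead of CVTPD2PI so that it stays out of the MMX
registers. It falls back to plain C when SSE2 is not available, expects
non-negative indexes that fit in an int, and makes no claim about speed.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: linearly interpolate two consecutive output
 * samples at fullindex and fullindex + pitch. samplebuf[i + 1] must be
 * valid for both integer indexes. */
static void interp_pair(const float *samplebuf, double fullindex,
                        double pitch, float out[2])
{
    int i0, i1;
    double f0, f1;
#if defined(__SSE2__)
    double f[2];
    /* lane 0 = fullindex, lane 1 = fullindex + pitch */
    __m128d idx = _mm_set_pd(fullindex + pitch, fullindex);
    __m128i ii  = _mm_cvttpd_epi32(idx);            /* 2 double -> 2 int  */
    __m128d fr  = _mm_sub_pd(idx, _mm_cvtepi32_pd(ii)); /* 2 fract parts  */

    _mm_storeu_pd(f, fr);
    i0 = _mm_cvtsi128_si32(ii);                     /* int from lane 0    */
    i1 = _mm_cvtsi128_si32(_mm_srli_si128(ii, 4));  /* int from lane 1    */
    f0 = f[0];
    f1 = f[1];
#else
    /* Scalar fallback with the same truncating behaviour. */
    i0 = (int)fullindex;
    i1 = (int)(fullindex + pitch);
    f0 = fullindex - i0;
    f1 = (fullindex + pitch) - i1;
#endif
    out[0] = (float)(samplebuf[i0] + f0 * (samplebuf[i0 + 1] - samplebuf[i0]));
    out[1] = (float)(samplebuf[i1] + f1 * (samplebuf[i1 + 1] - samplebuf[i1]));
}
```

The sample fetches are still scalar because the two integer indexes are
not guaranteed to be adjacent in memory, which is probably a large part
of why a compiler has trouble vectorizing this kind of loop.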

Eric, what do you think? Can something like that be coded efficiently
using SSE/SSE2?
(I'd prefer SSE because the Athlon XP supports those instructions too,
while SSE2 is only supported by P4+ or AMD64 CPUs.)

 
regarding pure SSE math, try to compile this benchmark with gcc 3.3/3.4
using -mfpmath=sse (together with -msse or -msse2):
you will see that speed will suck compared to using the regular FPU.
The intel icc will do less damage, but it will still be slower than pure
FPU code.

http://www.linuxdj.com/benno/rspeed4.tgz

Using SSE is not always a panacea for audio apps.

cheers,
Benno
http://www.linuxsampler.org



