[linux-audio-dev] Traps in floating point code

Erik de Castro Lopo erikd-lad at mega-nerd.com
Thu Jul 1 22:41:34 UTC 2004


On Thu, 1 Jul 2004 15:15:22 -0700 (PDT)
Dan Hollis <goemon at sasami.anime.net> wrote:

> On Fri, 2 Jul 2004, Erik de Castro Lopo wrote:
> > On Thu, 01 Jul 2004 18:18:41 +0200
> > Benno Senoner <sbenno at gardena.net> wrote:
> > > Eric what do you think ? can something like that be coded efficiently 
> > > using SSE/SSE2 ?
> > Probably not. There are some algorithms which simply can't be vectorized.
> 
> Would an opteron be any better at it?

No, its not an issue of the CPU. The problem is that SSE and SSE2 on
the x86 extended family and Altivec on the PowerPC all require that
the algorithm can be refactored to allow blocks of data to be
processed at time. However, the refactoring can in some cases slow
the algorithm down to an extent that it ends up slower on the vector
processor than on the standard FPU.

Here's an example  (where a, b anc c are arrays of floats) :

     for (k = 0 ; k < 128 ; k++)
        a [k] = b [k] * c [k] ;

which can be written as:

     for (k = 0 ; k < 128 / 4 ; k+ = 4)
     {  a [k] = b [k] * c [k] ;
        a [k + 1] = b [k + 1] * c [k + 1] ;
        a [k + 2] = b [k + 2] * c [k + 2] ;
        a [k + 3] = b [k + 3] * c [k + 3] ;
        }

This can then be vectorized so that four entries in the b array can
be added to four entries in the c array in a single instruction and
then assigned to four entries in the a array in another single
instruction. like:

     for (k = 0 ; k < 128 / 4 ; k+ = 4)
     {  load_four_b_values () ;             /* single instruction. */
        load_four_c_values () ;             /* single instruction. */
        add_four_b_to_four_c_vals () ;      /* single instruction. */
        store_4_a_values () ;               /* single instruction. */
        }

The above should run at close to four times the speed of the original.

Now look at a new example:

     a [0] = b [0] * c [0];
     for (k = 1 ; k < 128 ; k++)
        a [k] = a [k - 1] + b [k] * c [k] ;

The problem here is that each a [k] depends on the value of a [k - 1].
Refactoring this to be vectorized is likely to require so much
shuffling of data that it ends up slower than the non-vectorized
version.


Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam at mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"This is like creating laws against blasphemy and then complaining that
unbelievers can't come up with any logical argument against the existence
of God"  -- www.infoanarchy.org on the Digital Millenium Copyright Act



More information about the Linux-audio-dev mailing list