[LAU] Performance tuning audio apps

Chris Cannam cannam at all-day-breakfast.com
Mon Aug 17 06:29:08 EDT 2009


On Mon, Aug 17, 2009 at 5:21 AM, Ken Restivo <ken at restivo.org> wrote:
> I'm trying to squeeze the last little bit of juice out of my EEE.
>
> The CPU I have is this:
> http://restivo.org/projects/eee/cpu.txt
>
> This nifty script at http://www.pixelbeat.org/scripts/gcccpuopt , says I should use "-march=core2 -mtune=pentium -mfpmath=sse"
>
> However, the Gentoo people (who I take to be an -funrollloops authority on performance tuning), say I should "-march=core2 -mtune=generic -fomit-frame-pointer -pipe".
>
> And then there is -march=native which many say is just easier and faster. And others recommend putting "-msse2" and other such things.
>
> What say you-all?

If you want the fastest possible floating point code, then you
probably want something like:

  -march=core2 -msse -msse2 -mfpmath=sse -ffast-math -fomit-frame-pointer -O3

... but with caveats.

Discussion:

Supplying -ffast-math causes the use of non-IEEE-compliant math
functions.  Among other things, this screws up any code that
explicitly deals with infinity or NaN values or signed zeroes, and
makes assumptions about properties like associativity for the purposes
of optimisation which may not be true in the floating-point world.  In
other words, it can give you the wrong results.  In _most_ cases,
audio applications are fine with it, but you need to be aware that it
can be problematic.
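
To make the associativity point concrete, here is a minimal sketch in C.
The function names are made up for illustration.  Compiled without
-ffast-math, the two groupings give different answers for suitably
chosen inputs; with -ffast-math the compiler is allowed to treat them as
interchangeable, so you can no longer rely on getting one or the other.

```c
#include <stdio.h>

/* Sum three floats grouped as (a + b) + c. */
float sum_left(float a, float b, float c) {
    float t = a + b;
    return t + c;
}

/* The same three values grouped as a + (b + c). */
float sum_right(float a, float b, float c) {
    float t = b + c;
    return a + t;
}

/* Print both groupings side by side for one set of inputs. */
void print_groupings(float a, float b, float c) {
    printf("(a+b)+c = %g, a+(b+c) = %g\n",
           sum_left(a, b, c), sum_right(a, b, c));
}
```

Calling print_groupings(1e20f, -1e20f, 1.0f) in a strict IEEE build
shows the difference: the small value survives in the first grouping
and is swallowed by the large one in the second.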

However, -ffast-math in combination with -mfpmath=sse has the very
nice quality that it enables denormal flush to zero throughout, thus
avoiding denormal slowdowns in filters and the like.  It's also much
faster for some of the apparently simple operations like floor() that
are surprisingly slow in IEEE compliant mode.
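
If you want the denormal behaviour without the rest of -ffast-math, you
can set the relevant SSE control bits yourself.  The following is a
sketch only, assuming an x86 target with SSE and code that does its
float math in SSE registers; the function names are my own.  As far as I
know it does by hand roughly what gcc's startup code does when you link
with -ffast-math.

```c
#include <float.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

#define MXCSR_FTZ 0x8000u  /* flush-to-zero: denormal results become 0 */
#define MXCSR_DAZ 0x0040u  /* denormals-are-zero: denormal inputs read as 0 */

/* Set both bits in MXCSR.  The register is per-thread, so an audio
   processing thread has to call this itself. */
void enable_denormal_flush(void) {
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
}

/* Halve the smallest normal float; the exact result is a denormal.
   volatile keeps the compiler from folding this at compile time. */
float half_of_smallest_normal(void) {
    volatile float x = FLT_MIN;
    volatile float half = 0.5f;
    return x * half;
}
```

After enable_denormal_flush() has run in a thread,
half_of_smallest_normal() comes back as zero in that thread instead of
a denormal.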

It might be interesting to know what the authors of the programs
you're trying to optimise thought about the use of -ffast-math...
Perhaps you could compile them both ways and compare the output.
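
For the comparison itself, something along these lines would do once
you have the two renderings loaded as float buffers.  This is a sketch
with a hypothetical helper name; reading the audio files is left out.

```c
#include <stddef.h>

/* Largest absolute sample difference between two buffers of length n.
   A NaN difference never compares greater than anything, so NaNs are
   skipped here and would need a separate check. */
float max_abs_diff(const float *a, const float *b, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

A result of exactly zero means the two builds agree sample for sample;
a small non-zero value is what you would expect from reordered
arithmetic, and anything large deserves a closer look.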

-fomit-frame-pointer is pretty much guaranteed to make things
marginally faster but harder to debug.  It won't break anything and it
won't make any huge improvements.

-O3 rather than -O2 because it enables -ftree-vectorize, which does
some limited auto-vectorization of loops, turning things like
floating-point copies into SSE operations.  This doesn't always do
anything (it depends on the code, obviously) but sometimes it makes a
significant difference; for example, it helps when compiling my Rubber
Band library.  I've never yet seen any problems with the results, but
of course there's always an increased risk of running into
optimisation bugs the more optimisation you do.  You can get
interesting (?) debug output about vectorization successes and
failures (mostly failures) with e.g. -ftree-vectorizer-verbose=2.
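
As an illustration of the kind of loop the vectorizer tends to cope
with, here is a simple gain stage.  It is a made-up example, not code
from Rubber Band, and whether a given gcc version actually vectorizes
it is something to check in the verbose output rather than assume.

```c
#include <stddef.h>

/* Multiply every sample in `in` by `gain` and store it in `out`.
   `restrict` promises the compiler that the buffers do not overlap,
   which removes one common obstacle to vectorization. */
void apply_gain(float *restrict out, const float *restrict in,
                float gain, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}
```

Loops with a plain counted index, no function calls in the body and no
dependence between iterations are the ones most likely to succeed.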

I would be slightly suspicious of anyone who recommends -pipe as an
optimisation -- it makes no difference to the resulting code, it just
makes compiling faster.

If you're using a 64-bit distro, then you can omit the options with
SSE in them (they're all enabled by default in 64-bit gcc).


Chris


