On Mon, Aug 17, 2009 at 11:44:32AM +0100, Dan S wrote:
2009/8/17 Chris Cannam <cannam(a)all-day-breakfast.com>:
> On Mon, Aug 17, 2009 at 5:21 AM, Ken Restivo<ken(a)restivo.org> wrote:
>> I'm trying to squeeze the last little bit of juice out of my EEE.
>>
>> The CPU I have is this:
>>
>> http://restivo.org/projects/eee/cpu.txt
>>
>> This nifty script at http://www.pixelbeat.org/scripts/gcccpuopt , says I should
>> use "-march=core2 -mtune=pentium -mfpmath=sse"
>>
>> However, the Gentoo people (who I take to be an -funrollloops authority on
>> performance tuning), say I should "-march=core2 -mtune=generic -fomit-frame-pointer
>> -pipe".
>>
>> And then there is -march=native which many say is just easier and faster. And
>> others recommend putting "-msse2" and other such things.
>>
>> What say you-all?
>
> If you want the fastest possible floating point code, then you
> probably want something like:
>
>   -march=core2 -msse -msse2 -mfpmath=sse -ffast-math -fomit-frame-pointer -O3
>
> ... but with caveats.
>
> Discussion:
>
> Supplying -ffast-math causes the use of non-IEEE-compliant math
> functions. Among other things, this screws up any code that
> explicitly deals with infinity or NaN values or signed zeroes, and
> makes assumptions about properties like associativity for the purposes
> of optimisation which may not be true in the floating-point world. In
> other words, it can give you the wrong results. In _most_ cases,
> audio applications are fine with it, but you need to be aware that it
> can be problematic.
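To make the associativity caveat concrete, here is a minimal sketch in Python (whose floats are IEEE doubles). It only illustrates the arithmetic; it does not show what gcc actually emits for any particular program. The values are chosen purely for illustration.

```python
# Floating-point addition is not associative: regrouping changes the result.
# -ffast-math allows a C/C++ compiler to regroup sums in this way.
a, b, c = 1e20, -1e20, 1.0

left_to_right = (a + b) + c   # a and b cancel first, so c survives
regrouped = a + (b + c)       # c is absorbed by the huge b, then a cancels it

print(left_to_right, regrouped)  # 1.0 0.0
```

In audio code the operands are rarely this extreme, which is presumably why most applications get away with it, but long accumulations and feedback paths can still drift.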
Hmm. I did notice what appears to be a problem like that. With the
"-O2 -march=core2 -fomit-frame-pointer -pipe -march=core2 -mtune=pentium
-mfpmath=sse -ffast-math" options, all the audio apps were using a LOT less CPU via
top, but I ran into glitches in fluidsynth, when holding down a lot of notes and the
sustain pedal. But then, I haven't stress-tested this systematically. I am going to
have to build several different packages of these apps, each with several different flags,
and then try them very methodically, one by one, and see what happens.
In general, by far the biggest hogs are the LADSPA plugins. I use Caps VTS+Tonestack, and
of all of them it is the biggest CPU user, with autowah a close second.
>
> However, -ffast-math in combination with -mfpmath=sse has the very
> nice quality that it enables denormal flush to zero throughout, thus
> avoiding denormal slowdowns in filters and the like. It's also much
> faster for some of the apparently simple operations like floor() that
> are surprisingly slow in IEEE compliant mode.
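For anyone unsure what "denormal" means here, the following is a rough sketch in Python of a decaying value, like the tail of a filter, passing through the denormal range on its way to zero. It uses doubles, whereas most audio code uses 32-bit floats, and the function name and the decay factor are made up for the illustration. It shows where the denormal values come from, not the slowdown itself.

```python
import sys

# Smallest positive *normal* double; any nonzero value below it is a denormal.
SMALLEST_NORMAL = sys.float_info.min

def count_denormal_steps(start=1.0, decay=0.5):
    """Decay a value towards zero and count the steps spent in the denormal range."""
    y = start
    steps = 0
    while y != 0.0:
        y *= decay
        if 0.0 < abs(y) < SMALLEST_NORMAL:
            steps += 1
    return steps

print(count_denormal_steps())  # 52 with IEEE-754 doubles
```

With flush to zero enabled, the value would be replaced by 0 at the first of those steps instead of lingering in the range where many CPUs fall back to a much slower path.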
>
> It might be interesting to know what the authors of the programs
> you're trying to optimise thought about the use of -ffast-math...
> Perhaps you could compile them both ways and compare the output.
I will do that. How would I compare the output though? Right now my "stress
test" is running a MIDI file that just holds down lots of notes in Fluidsynth, and
counting how many seconds it can tolerate that before I get that dreaded
"CD-skipping" stuttering sound.
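One possible way to compare the output, sketched in Python: render the same input to a WAV file with each build, then measure how far the two files differ sample by sample. This assumes each app can write its output to a 16-bit PCM WAV file and that the rendering is deterministic; the file names in the comment are hypothetical, and `read_pcm16` and `compare` are names invented for this sketch.

```python
import array
import math
import sys
import wave

def read_pcm16(path):
    """Return the samples of a 16-bit PCM WAV file as a list of ints."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples.tolist()

def compare(samples_a, samples_b):
    """Return (peak difference, RMS difference) over the common length."""
    n = min(len(samples_a), len(samples_b))
    if n == 0:
        return 0, 0.0
    diffs = [samples_a[i] - samples_b[i] for i in range(n)]
    peak = max(abs(d) for d in diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / n)
    return peak, rms

# Hypothetical usage, one file per build:
# peak, rms = compare(read_pcm16("build_o2.wav"), read_pcm16("build_fastmath.wav"))
```

A peak difference of a few units out of 32768 is likely just rounding; large differences, or a difference that grows over time, would be worth investigating. Apps that use randomness or depend on real-time scheduling will not produce comparable files this way.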
On the SuperCollider dev list we're just having a conversation about
exactly this. NaNs are used in some cases for signalling, and since
compiling with -ffast-math implies -ffinite-math-only, that trashes
the NaN signalling. This combination seems OK though: "-ffast-math
-fno-finite-math-only". The moral of the story is probably that it
depends strongly on the app. Who knows if your chosen software makes
use of NaNs and infinities? Hard to tell.
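As an illustration of the kind of NaN signalling meant here, this is a minimal Python sketch of the idiom. It is not SuperCollider's code; `NO_VALUE` and `apply_control` are hypothetical names. The point is that the logic depends entirely on the NaN test working, and -ffinite-math-only permits a C compiler to assume no value is ever NaN, so the equivalent test in C may be optimised away.

```python
import math

NO_VALUE = float("nan")  # hypothetical in-band "no new value" marker

def apply_control(current, incoming):
    """Keep the current value when the incoming control value is the NaN marker."""
    if math.isnan(incoming):
        return current
    return incoming

print(apply_control(0.5, NO_VALUE))  # 0.5
print(apply_control(0.5, 0.25))      # 0.25
```

Searching an app's source for isnan, isinf, NAN or INFINITY is one quick, if incomplete, way to guess whether it relies on this.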
In case anyone has any specific suggestions (or the authors are on this list), here are
the apps I am most interested in optimizing:
azr3-jack 1.0.2-1.1
caps 0.4.2-1
fluidsynth 1.0.8-1.1
jackd 0.116.1-4
libfluidsynth1 1.0.8-1.1
libgig6 3.3.0-1
libjack0 0.116.1-4
liblinuxsampler 1.0.0-1
linuxsampler 1.0.0-1
tap-plugins 0.7.0-2
terminatorx 3.82-7.2
Since CAPS Amp is one of the bigger CPU users, I'm remembering some years ago that
mtaht4 on IRC was working on re-doing all the CAPS or TAP code to optimize it. This was
before he moved to the beach of Costa Rica, IIRC. Did he ever release that code
though? Maybe I should try it.
> -fomit-frame-pointer is pretty much guaranteed to make things
> marginally faster but harder to debug. It won't break anything and it
> won't make any huge improvements.
>
> -O3 rather than -O2 because it enables -ftree-vectorize, which does
> some limited auto-vectorization of loops for things like
> floating-point copy into SSE operations. This doesn't always do
> anything (depends on the code, obviously) but sometimes it makes a
> significant difference, for example it helps when compiling my Rubber
> Band library. I've never yet seen any problems with the results, but
> of course there's always an increased risk of running into
> optimisation bugs the more optimisation you do. You can get
> interesting (?) debug output about vectorization successes and
> failures (mostly failures) with e.g. -ftree-vectorizer-verbose=2.
>
> I would be slightly suspicious of anyone who recommends -pipe as an
> optimisation -- it makes no difference to the resulting code, it just
> makes compiling faster.
You have every reason to be suspicious; that "-pipe" was from Gentoo :-).
Obviously intended to mitigate the 3-day wait when typing "emerge @world". I
know it doesn't optimize, but this whole compiling process does go faster on my little
Atom when I use it, and it does no harm. So far LinuxSampler has the longest compile time of
any of these apps, second only to Ardour.
>
> If you're using a 64-bit distro, then you can omit the options with
> SSE in them (they're all enabled by default in 64-bit gcc).
>
>
This is a 32-bit Atom. I haven't tried optimizing the Dual Core2 64-bit, because it
hasn't been necessary.
Thanks for the detailed advice.
-ken