Minor nit, because I see that very often. :) From the gcc man page:

    -march=cpu-type
        Generate instructions for the machine type cpu-type.
        The choices for cpu-type are the same as for -mcpu.
        Moreover, specifying -march=cpu-type implies -mcpu=cpu-type.

The extra -mcpu is unnecessary. :)

Good to know! I've always been paranoid about those two, so I covered them
twice in my builds.
And there's a -mmmx switch, but I don't know if it is that useful for DSP,
since the MMX instructions are integer only.

One cost of MMX which doesn't exist for XMM (SSE) is the expensive "FP mode
switch" that comes into play: the MMX registers alias the x87 floating-point
registers, so the FPU state has to be reset before floating-point code can
run again. There is apparently an option to have GCC generate both x87 and
SSE code, but the documentation doesn't inspire confidence.
Good thing you didn't recommend -O3. I've seen instances where it was much
slower than -O2.

Yeah, it's definitely a riskier option. I've seen bloated binaries which
don't execute as quickly, too. With some code, I've seen -O1 be faster than
-O2 [Dan Bernstein's FFT library, to name just one].
The Intel compiler is a lot better in this regard. In fact, I've yet to find
code where the Intel compiler doesn't humiliate the binaries GCC generates.
And yes, even if you have AMD hardware, the code's faster. The table below
isn't technically 100% accurate, as the pipelining and instruction
scheduling of the Pentium III are a bit of an oddball compared to both its
successor and predecessor. Regardless, I've found that code generally runs
at its best on these AMD processors when tuned for the matching Intel CPUs.

AMD Athlon XP (Stepping 6 and later)       ~== Pentium III (SSE)
AMD Athlon (Stepping 4 and earlier)        ~== Pentium II (MMX)
AMD Duron Applebred (Stepping 8+)          ~== Pentium III (SSE)
AMD Duron Morgan or older (Stepping <= 6)  ~== Pentium II (MMX)

Only Opterons have SSE2. No AMD CPUs support SSE3 currently, but those are
petty improvements by comparison. SSE2 is a big deal, offering both FP and
INT vector operations. I've used SSE2 to dramatically speed up everything
from cryptographic code to pattern-matching stuff for fuzzy-spamsign
detection.
19 times out of 20, when someone gives me a binary that runs faster on an
Athlon XP than on an equivalently rated Pentium 4, and I recompile it with
optimal tuning for each CPU (both Intel and AMD), the Pentium 4
blows it away. No questions asked, go make me a sammich, bitch.
Hard data? How does a 20-30% performance improvement in the OpenSSL crypto
cores sound? And that's WITHOUT any hand-rolled SSE2 assembler.
The Athlon XP seems to handle really bad code, like stuff compiled for a
386/486 with lots of unaligned accesses, etc., much more gracefully than the
Pentium 4. I've seen a commercial game server published by Interplay whose
code was so bad that it ran slower on a 3.2GHz P4 than on an XP 2400+
[running the exact same OS + configuration].
As always, YMMV; I could be a Dark Agent of Sauron trying to lure you away
from more expensive CPUs.
=MB=
--
A focus on Quality.