I imagine you'd have a blast somewhere like
this:
http://forums.gentoo.org/viewtopic.php?t=5717
(there's bad and good advice in there)
Yeah... I'm not sure I see real improvements using -O3 and some of the
options they advise. In fact, I've seen some reduced performance, probably
due to over-aggressive loop-unrolling and whatnot. There's a subtle
interplay between unroll counts, L1 cache spilling, etc., which is not a
trivial problem. It's made worse by the fact that the compiler can't know
how many iterations are "normal" for a fully dynamic loop; predicting that
in general runs into the Halting Problem.
When I code in assembler, I find that unrolling to the point where the data
access/processing stages are chunking one L1 cache line per iteration is
probably ideal, since you can sprinkle successor data-cache-line
touch-prefills partway through to get the first 4-8 bytes into the L1
transit area by the time your next iteration begins. I did this on the
PowerPC 604 with stunning performance results for an md5 checksum routine.
With today's hyper-pipelined CPUs, stalls are very expensive.
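A rough sketch of the one-cache-line-per-iteration idea, in C rather than
assembler. Everything named here is an assumption for illustration:
sum_words is a made-up helper, the 64-byte line size is a guess you should
check against your target, and __builtin_prefetch (a GCC/Clang builtin)
stands in for a data-cache touch instruction such as PowerPC's dcbt. The
buffer would also need to be line-aligned for each iteration to cover
exactly one line.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte L1 data cache lines. Check your target CPU. */
#define CACHE_LINE_BYTES 64
#define WORDS_PER_LINE   (CACHE_LINE_BYTES / sizeof(uint32_t))

/* Hypothetical helper: sums n 32-bit words, consuming one cache line
 * (16 words under the assumption above) per outer-loop iteration. */
uint32_t sum_words(const uint32_t *p, size_t n)
{
    /* Four independent accumulators keep the adds from forming one
     * long dependency chain. */
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + WORDS_PER_LINE <= n; i += WORDS_PER_LINE) {
#if defined(__GNUC__)
        /* Touch the successor line early so it is on its way into L1
         * by the time the next iteration starts. Only done when that
         * line lies fully inside the buffer. */
        if (i + 2 * WORDS_PER_LINE <= n)
            __builtin_prefetch(p + i + WORDS_PER_LINE, 0, 3);
#endif
        /* Fixed trip count: easy for the compiler to flatten. */
        for (size_t j = 0; j < WORDS_PER_LINE; j += 4) {
            s0 += p[i + j];
            s1 += p[i + j + 1];
            s2 += p[i + j + 2];
            s3 += p[i + j + 3];
        }
    }

    /* Leftover words that do not fill a whole line. */
    for (; i < n; i++)
        s0 += p[i];

    return s0 + s1 + s2 + s3;
}
```

Whether the prefetch helps at all depends on the CPU and its hardware
prefetcher, so measure before keeping it.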
You also tend to be roadkill for subtle/bizarre bugs in the code optimiser
when you crank it up to maximum levels like that. I instinctively distrust
that zone, myself.
There will always be some code which runs fastest at -O1, for example. Dan
Bernstein's djfft library is one clear example. When I'm writing C code for
a compiler I know is stupid (like the pre-GCC-3 compilers, or old Metrowerks
compilers on the Macintosh), I tend to "guide" its code generation by
expressing the functionality in a way that gets the compiler to produce the
code I want.
It often looks "noisy" or "simple and inelegant", e.g., hoisting invariant
stuff into locally declared temporary vars so that stack frame usage is kept
to a minimum or, even better, so that the vars all fit nicely into
registers. But it ends up compiling to faster-running code than "elegantly
written" C does. I do not disagree with much of DJB's rants on the subject,
but there are genuine cases where it's "harder than is worth it" to fully
hint the compiler. In the case of Intel's compiler, it's bound to find a
vectorising opportunity more readily than you are, unless you have plenty of
free time on your hands.
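To make the hoisting point concrete, here is a small sketch. The struct and
both function names are invented for illustration. In the plain version a
compiler that cannot prove dst never aliases *cfg has to reload cfg->gain
and cfg->offset on every pass; the hoisted version copies them into locals
up front, which is the "noisy" style described above.

```c
#include <stddef.h>

/* Hypothetical parameter block, for illustration only. */
struct scaler {
    int gain;
    int offset;
};

/* "Elegant" version: reads the invariants through cfg on every pass. */
void scale_plain(int *dst, const int *src, size_t n,
                 const struct scaler *cfg)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * cfg->gain + cfg->offset;
}

/* Hand-hoisted version: invariants live in a few locals that fit
 * comfortably in registers, and the loop walks plain pointers. */
void scale_hoisted(int *dst, const int *src, size_t n,
                   const struct scaler *cfg)
{
    const int gain = cfg->gain;      /* loop-invariant, loaded once */
    const int offset = cfg->offset;  /* loop-invariant, loaded once */
    const int *end = src + n;

    while (src < end)
        *dst++ = *src++ * gain + offset;
}
```

A modern optimiser may well produce the same code for both; comparing the
assembler output (e.g. with -S) is the only way to know for a given
compiler.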
Compilers *ARE* improving... there was a time, though, when I couldn't rely
on them to reduce an unsigned integer multiplication or division by a
constant power of 2 to a bitshift.
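For reference, the hand-written form of that reduction looks roughly like
this (function names are made up). It is only a straight swap for unsigned
operands: signed division rounds toward zero, so a bare right shift does not
give the same answer for negative values.

```c
#include <stdint.h>

/* x * 8, written as the shift one would hope the compiler emits. */
uint32_t mul_by_8(uint32_t x)
{
    return x << 3;   /* 8 == 2^3 */
}

/* x / 16, likewise. */
uint32_t div_by_16(uint32_t x)
{
    return x >> 4;   /* 16 == 2^4 */
}
```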
These cycles do add up, people: when you're spending them millions of
times over, they amount to real seconds and minutes of your life. Naturally,
this is all academic for a program you're only going to use once or twice
and forget about.
=MB=
--
A focus on Quality.