On a related subject: how is the level one cache replaced with new data,
should one (or one's compiler) decide to use some of the prefetch
instructions available from the Intel PIII (SSE) and up? It would make
sense to fetch the next dataset while doing what has to be done "now".
On the other hand, overwriting the current dataset is somewhat
counterproductive.
Unless you're hand-coding assembly it's probably wisest to leave this
to the compiler. OTOH, I've no idea how smart gcc/g++ is in this respect.
It could be quite interesting to -S some familiar DSP code and have a
look at the result.
From what I understand, gcc 4.x will recognize a loop over float arrays
of the form x[n] = y[n] * z[n] and do SSE as well as the next person.
How far it will dive into the other goodies in SSE/AltiVec, I dunno.
Apple used to have some advice on portable gcc vector programming, but I
can't find it in the latest revision. There is a general discussion
here, well worth reading:
http://developer.apple.com/hardware/ve/simd.html
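
For anyone wanting to try the -S experiment on the array multiply
mentioned above, here is a minimal sketch. The function name
multiply_arrays is made up for illustration, and the command line I would
expect to need on a gcc 4.x is something like
"gcc -std=c99 -O2 -ftree-vectorize -msse -S"; I have not verified what any
particular release actually emits. The restrict qualifiers are there
because a vectorizer generally wants to know the arrays do not overlap.

```c
#include <stddef.h>

/* Element-wise product: x[i] = y[i] * z[i] for i in [0, n).
 * multiply_arrays is a hypothetical name, not taken from any library.
 * "restrict" (C99) promises that x, y and z do not overlap. */
void multiply_arrays(float *restrict x,
                     const float *restrict y,
                     const float *restrict z,
                     size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        x[i] = y[i] * z[i];
}
```

If the vectorizer kicked in, the .s file should show packed instructions
such as mulps in the loop body rather than only the scalar mulss.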
I am still on gcc 3.4 though. It will use the sse unit for scalar floats
(if you ask politely), more on this later.
About prefetch, I can say that this gcc will not recognize a linked list
in a doforall loop. Perhaps looping thru an array could help?
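
A minimal sketch of what looping through an array with an explicit
prefetch hint might look like, using gcc's __builtin_prefetch. The
function name sum_with_prefetch and the look-ahead distance
PREFETCH_AHEAD are assumptions picked for illustration, not tuned values.
Whether the hint helps at all on a plain sequential walk like this has to
be measured on the machine in question.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance, in elements. The useful value depends
 * on the CPU, the memory latency and the work done per element. */
#define PREFETCH_AHEAD 16

/* Sums an array while hinting at data a little further along.
 * sum_with_prefetch is an illustrative name only. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Hint only; it never changes the result.
         * Arguments: address, 0 = prefetch for reading,
         * 1 = low temporal locality. The bounds check keeps the
         * address computation inside the array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        sum += data[i];
    }
    return sum;
}
```

A linked list offers no such "i + 16" address to hand to the hint, which
is one reason an array layout is the easier case here.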
Hey, and isn't this where all these hot diodes in modern hardware are
supposed to look ahead and recognize a pointer being incremented?
Speculatively fetching instructions to fill the pipe without also
getting the data makes no sense, that would just stall execution.
Mmmm ... wait a second. Speculatively fetching a load will eventually
get you the data also. OK, so the system works :)
I am on a PIII these days, and this one has only a very small window of
opportunity where it can shuffle instructions around. Actually it feels
more like marketing bull since nothing much happens without interleaving
what you are doing right now with what you have (almost) done and what
you are about to do next.
The only place where I've seen prefetch used explicitly is in Brutefir's
sse and 3dnow routines, which I recently modified for use in one of my
own projects.
The layout of the data would influence these decisions, no?
And conversely, considerations relating to cache use (and possible sse
optimisations) may influence your choice of data formats.
I said more on scalar sse floats later?
If you think this code snippet looks funny:
if(x == 0.0) x = 0.0;
... then wait until you take it away from the code attached below, and
watch execution time more than double. (And no, it is not the missing
cast to float.)
Don't ask! I have no idea how I could dream up that expression ...
There is some unfinished work in the function:
int set_DAZ_and_FTZ(int on)
... where I do not recognize an (early) AMD K8. Anybody care to
enlighten me?
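
For discussion, a minimal sketch of one vendor-neutral way such a
function could be written. This is not the attached code: the names
set_daz_and_ftz_sketch and daz_supported, and the return convention, are
made up here. The assumption is that DAZ support can be read from the
MXCSR_MASK field that FXSAVE stores at byte offset 28 of its save area
(bit 6 set means DAZ is writable, and a mask of zero means the default
0xFFBF, i.e. no DAZ), so no CPU model table is needed. It relies on gcc
inline asm and on _mm_getcsr/_mm_setcsr from xmmintrin.h.

```c
#include <string.h>

#if defined(__SSE__) && (defined(__i386__) || defined(__x86_64__))
#include <xmmintrin.h>

#define MXCSR_DAZ 0x0040u   /* bit 6: treat denormal inputs as zero */
#define MXCSR_FTZ 0x8000u   /* bit 15: flush denormal results to zero */

/* FXSAVE wants a 512-byte save area aligned to 16 bytes. */
struct fxsave_area {
    unsigned char bytes[512];
} __attribute__((aligned(16)));

/* Nonzero if MXCSR_MASK says the DAZ bit may be set.
 * Uses a static buffer, so this sketch is not thread safe. */
static int daz_supported(void)
{
    static struct fxsave_area area;
    unsigned int mask;

    memset(&area, 0, sizeof area);
    __asm__ __volatile__("fxsave %0" : "=m"(area));
    memcpy(&mask, area.bytes + 28, sizeof mask);
    if (mask == 0)
        mask = 0xFFBFu;     /* default mask: DAZ not available */
    return (mask & MXCSR_DAZ) != 0;
}

/* Hypothetical stand-in: returns the new MXCSR value. */
int set_daz_and_ftz_sketch(int on)
{
    unsigned int csr = _mm_getcsr();
    unsigned int bits = MXCSR_FTZ;

    if (daz_supported())
        bits |= MXCSR_DAZ;  /* setting an unsupported bit would fault */
    if (on)
        csr |= bits;
    else
        csr &= ~bits;
    _mm_setcsr(csr);
    return (int)csr;
}
#else
/* No SSE at compile time: nothing to set, report -1. */
int set_daz_and_ftz_sketch(int on)
{
    (void)on;
    return -1;
}
#endif
```

Calling set_daz_and_ftz_sketch(1) once at the start of the processing
thread would be the intended use; MXCSR is per thread, so it has to be
done in every thread that crunches floats.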
mvh // Jens M Andreasen (still fighting the man.)
--