On Mon, May 05, 2008 at 07:18:39PM +0200, Jens M Andreasen wrote:
 Could you try this out with your proposed compiler options on your own
 hardware?
 ...
 #define N 1024
 ...
 int n = 1000000;
 ... 
 Looping a million times over the same small data vector
 is _not_ very realistic.
 In a real app, the data size would be much longer (there's
 no need to optimise otherwise), that data would be rewritten
 for each iteration (no need to redo the calculation otherwise),
 and the work would not be done in a single long run but be
 divided over a number of e.g. jack process callbacks.
 I've again performed some tests on zita-convolver used by
 jconv to do the York Minster config. That means around 240
 different blocks of 8192 complex values each. The differences
 between plain C++, hand vectorized, and optimised assembly
 code are absolutely marginal in that case. 
The main problem there is that you are not measuring the speed of the
CPU, but running into memory bandwidth limits. In my own experiments
it was very apparent that it pays off to restructure the code so that
memory access is kept to a minimum, even if that results in more
instructions (not suggesting that's possible for you, though).
The key point seems to be that you have to do all operations at once
instead of looping over a buffer multiple times, which is not really a
surprise, of course.
e.g.:
for(all samples) {sample = operation1(sample)}
for(all samples) {sample = operation2(sample)}
for(all samples) {sample = operation3(sample)}
can easily be an order of magnitude slower than:
for(all samples) {
  sample = operation1(sample)
  sample = operation2(sample)
  sample = operation3(sample)
}
It is essential that the data does not leave the processor's cache,
that's for sure. Beyond that, I think modern processors are very good
at optimizing the computation itself.
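
For anyone who wants to try the two loop shapes on their own machine,
here is a minimal C++ sketch of them. The three operations are made up
purely for illustration (they are not taken from zita-convolver or any
real plugin), and the function names are my own. How large the gap is
will depend on how the buffer size compares to the cache size, and a
compiler may sometimes merge simple loops like these on its own, so
treat it as a starting point for measurements, not as a result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-sample operations, chosen only for illustration.
inline float operation1(float s) { return s * 0.5f; }
inline float operation2(float s) { return s + 1.0f; }
inline float operation3(float s) { return s * s; }

// Three passes: each loop walks the whole buffer again, so a buffer
// larger than the cache is fetched from memory three times.
void process_multipass(std::vector<float>& buf) {
    for (float& s : buf) s = operation1(s);
    for (float& s : buf) s = operation2(s);
    for (float& s : buf) s = operation3(s);
}

// One pass: each sample is loaded once, fully processed, stored once.
void process_fused(std::vector<float>& buf) {
    for (float& s : buf) {
        s = operation3(operation2(operation1(s)));
    }
}

// Sanity check that both variants compute the same values.
bool variants_agree(std::size_t n) {
    assert(n > 0 && "need a non-empty buffer");
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] = static_cast<float>(i % 64);
    }
    process_multipass(a);
    process_fused(b);
    return a == b;
}
```

To compare timings, call each function on a buffer that is refilled
before every run and is clearly larger than the last-level cache, then
again on one that fits in L1. I would expect the difference to show up
mainly in the first case.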
Greets,
Pieter