On Mon, May 05, 2008 at 07:18:39PM +0200, Jens M Andreasen wrote:
Could you try this out with your proposed compiler options
on your own hardware?
...
#define N 1024
...
int n = 1000000;
...
Looping a million times over the same small data vector
is _not_ very realistic.
In a real app, the data set would be much larger (there's
no need to optimise otherwise), that data would be rewritten
for each iteration (no need to redo the calculation otherwise),
and the work would not be done in a single long run but be
divided over a number of e.g. jack process callbacks.
I've again performed some tests on zita-convolver as used by
jconv for the York Minster config. That means around 240
different blocks of 8192 complex values each. The differences
between plain C++, hand-vectorised, and optimised assembly
code are absolutely marginal in that case.
The main problem there is that you are not looking at the speed of the
CPU, but running into memory bandwidth limits. In my own experiments
it was very apparent that it pays off to restructure the code in such a
way that memory access is limited as much as possible, even if that
results in more instructions (I'm not suggesting that's possible in
your case, though).
The key point seems to be that you have to do all operations at once
instead of looping over a buffer multiple times, which is not really a
surprise of course...
e.g.:
for(all samples) {sample = operation1(sample)}
for(all samples) {sample = operation2(sample)}
for(all samples) {sample = operation3(sample)}
can easily be an order of magnitude slower than:
for(all samples) {
sample = operation1(sample)
sample = operation2(sample)
sample = operation3(sample)
}
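
To make that concrete, here is a minimal C sketch of the same idea;
the gain/clip/offset operations are made-up placeholders, not
anything taken from zita-convolver or jconv:

#include <stddef.h>

/* Three separate passes: for a buffer much larger than the cache,
   every pass streams the data from main memory again. */
void process_separate(float *buf, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= gain;
    for (size_t i = 0; i < n; i++)
        buf[i] = (buf[i] > 1.0f) ? 1.0f : (buf[i] < -1.0f) ? -1.0f : buf[i];
    for (size_t i = 0; i < n; i++)
        buf[i] += 1e-20f;
}

/* Fused version: each sample is loaded once, all operations are
   applied while it is still in a register, and it is stored once. */
void process_fused(float *buf, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++) {
        float s = buf[i] * gain;
        s = (s > 1.0f) ? 1.0f : (s < -1.0f) ? -1.0f : s;
        buf[i] = s + 1e-20f;
    }
}

On data much larger than the cache the fused version can be
dramatically faster; on a small buffer that stays cache-hot the two
measure about the same, which is exactly why a benchmark looping a
million times over the same 1024 samples hides the difference.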
It is essential that the data does not leave the processor's cache,
that's for sure. For the rest I think modern-day processors are
very good at optimizing the computation on their own.
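
When all the operations can't be merged into a single loop, a common
way to keep the data cache-resident anyway is to run all of them over
one cache-sized block before moving on to the next. A sketch, with
hypothetical op1/op2/op3 placeholders, and assuming each operation
only needs the block it is given:

#include <stddef.h>

#define BLOCK 4096  /* samples per block, small enough to stay in cache */

/* Placeholder per-block operations -- purely illustrative. */
static void op1(float *b, size_t n) { while (n--) *b++ *= 0.5f; }
static void op2(float *b, size_t n) { while (n--) *b++ += 0.1f; }
static void op3(float *b, size_t n) { while (n--) *b++ *= 1.1f; }

/* Apply every operation to one block while it is still cache-hot,
   instead of sweeping the whole buffer once per operation. */
void process_blocked(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i += BLOCK) {
        size_t len = (n - i < BLOCK) ? (n - i) : BLOCK;
        op1(buf + i, len);
        op2(buf + i, len);
        op3(buf + i, len);
    }
}

If an operation needs the whole buffer at once (a full-length FFT,
say), it has to remain a separate pass, of course.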
Greets,
Pieter