On Wed, 2009-08-12 at 12:10 +0200, Fons Adriaensen wrote:
It would be 16K FFT, and there would be a lot more
than
just that.
-<snip>
But starting out with the 16K FFT will do for now
You could loop over the data structures, and make a
list of
of all steps (FFT, MAC, IFFT) and their parameters. Then if
you know the relative cost of the three basic operations you
try to divide this in 64 equal parts.
I don't really have to have 64 parts. It's just that I can't have more
AND that each part may not exceed the available computational power for
a given time quantum which could be defined as:
One processor has 8 units @1.2 GHz == 9.6G instructions total
4 warps of 32 threads (=128) share the instructions evenly
samplerate is 96K, periodsize 128 == 750 periods
9,6G / (750 * 128) == 100000 instructions
So each part must not have more than 100K insructions in any of the 64
parts, or else we'll get X-runs
I have here an FFT implementation from 1987 by H V Sorensen.
In the inermost loop I can count almost 50 instructions
There is a 'dowhile' around that and a 'for' loop around the dowhile.
I count each time we hit the work in the innermost loop or do the sin()
and cos() in the outer loop, and then break out of the outer loop
everytime 1000 'work' (50000 instuctions) has been completed:
16384=2^13
Did 1030 'work' in 2 iteration(s)
Did 1036 'work' in 4 iteration(s)
Did 1046 'work' in 7 iteration(s)
Did 1056 'work' in 12 iteration(s)
Did 1008 'work' in 21 iteration(s)
Did 1010 'work' in 32 iteration(s)
Did 1008 'work' in 42 iteration(s)
Did 1008 'work' in 72 iteration(s)
Did 1002 'work' in 84 iteration(s)
Did 1008 'work' in 126 iteration(s)
Did 1004 'work' in 134 iteration(s)
Did 1002 'work' in 167 iteration(s)
Did 1002 'work' in 167 iteration(s)
Did 1002 'work' in 179 iteration(s)
Did 1004 'work' in 251 iteration(s)
Did 1004 'work' in 251 iteration(s)
Did 1004 'work' in 251 iteration(s)
Final 936 'work' in 234 iteration(s)
So doing the 16K FFT for a warp of 32 instances can be done in the first
18 parts (out of 64) at an estimate of 50% GPU
Similarily a 4K FFT can be split evenly over 4 parts (of 16)
4096=2^11
Did 1024 'work' in 22 iteration(s)
Did 1002 'work' in 94 iteration(s)
Did 1002 'work' in 183 iteration(s)
Final 812 'work' in 203 iteration(s)
.. which appears to hit sin() and cos() relatively much more often
You had a long laundry list, now what next?
There are some problems with this:
- The relative cost of FFT and MAC is not know with any
precision.
I'll have to use some source and count them.
- There may be less than 64 steps.
- A single 16k FFT may already be too much, and most FFT
libraries don't allow you to divide it into smaller
pieces.
But division into smaller pieces is really an imperative here .. Guess
I'll have to do some more code editing then.
Again, if you *can* do the division, there is no need
to use multiple threads. you can just merge the lists
of steps for all partition sizes and execute this as
a static schedule using just function calls. This is
how jace worked.
Ciao,