On Tue, 2009-08-11 at 19:54 +0200, Fons Adriaensen wrote:
On Tue, Aug 11, 2009 at 06:50:50PM +0200, Jens M Andreasen wrote:
That would be four warps independently working their way through the
variously sized sample blocks, each thread executing serial code that
looks very much the same as jconv itself, including the threading.
Note that the algorithm implemented by libzita-convolver (used by
jconv) when used in real-time mode relies on regular scheduling
(i.e. being called from a Jack process callback) and carefully
set thread priorities.
The priorities are always even .. and then again not necessarily.
Say warp A (or "process" A) must do four smaller workloads while warp B
is doing one bigger workload? The way to go would then be for warp B to
call __syncthreads() when 25% of its work is done, thus ensuring that
warp A will be given all of the GPU until it has caught up at the end of
its first workload and also calls __syncthreads(), which gives warp B the
green light to continue. This is under the assumption that warp A hasn't
already done its part and is waiting for B to catch up.
Repeat the procedure at 50% and 75%.
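To make the idea concrete, here is a minimal CUDA sketch of that
lock-step scheme, not a tested convolution engine. It assumes both
warps live in the same thread block, since __syncthreads() is a
block-wide barrier that every thread of the block has to reach. The
names two_warp_schedule, busy_work and run_barrier_demo, as well as
the iteration counts, are made up for illustration; busy_work merely
stands in for the real per-block FFT work.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kStages   = 4;              // barriers at 25%, 50%, 75%, 100%
constexpr int kThreads  = 2 * kWarpSize;  // warp 0 = "A", warp 1 = "B"

// Hypothetical stand-in for real work (e.g. one partition of the convolution).
__device__ float busy_work(int iterations, float seed)
{
    float x = seed;
    for (int i = 0; i < iterations; ++i)
        x = x * 0.999f + 1.0f;
    return x;
}

__global__ void two_warp_schedule(int *seen, float *sink)
{
    __shared__ int done[2][kStages];      // done[warp][stage] set once per stage
    const int warp = threadIdx.x / kWarpSize;
    const int lane = threadIdx.x % kWarpSize;

    if (threadIdx.x < 2 * kStages)        // clear the flags before use
        done[threadIdx.x / kStages][threadIdx.x % kStages] = 0;
    __syncthreads();

    float acc = 0.0f;
    int partner_ready = 0;
    for (int stage = 0; stage < kStages; ++stage) {
        // Warp A: one complete small workload per stage.
        // Warp B: one quarter of its single big workload per stage.
        // The iteration counts are arbitrary placeholders.
        const int iterations = (warp == 0) ? 1000 : 4000;
        acc += busy_work(iterations, (float)lane);
        if (lane == 0)
            done[warp][stage] = 1;

        // Whichever warp gets here first waits for the other one.
        // The call sits outside any divergent branch on purpose.
        __syncthreads();

        // After the barrier the partner's flag for this stage must be set.
        partner_ready += done[1 - warp][stage];
    }
    seen[threadIdx.x] = partner_ready;    // expected: kStages for every thread
    sink[threadIdx.x] = acc;              // store the result of the dummy work
}

// Host helper: returns the number of threads that saw a missing stage,
// or -1 on a CUDA error. Zero means the lock-step schedule held.
int run_barrier_demo()
{
    int   *d_seen = nullptr;
    float *d_sink = nullptr;
    if (cudaMalloc(&d_seen, kThreads * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_sink, kThreads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_seen);
        return -1;
    }

    two_warp_schedule<<<1, kThreads>>>(d_seen, d_sink);

    int h_seen[kThreads];
    cudaError_t err = cudaMemcpy(h_seen, d_seen, sizeof(h_seen),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_seen);
    cudaFree(d_sink);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int i = 0; i < kThreads; ++i)
        if (h_seen[i] != kStages)
            ++mismatches;
    return mismatches;
}
```

Calling run_barrier_demo() from main() and building with nvcc should
return 0 when every thread saw its partner finish each stage. Note that
the barrier only guarantees that neither warp runs ahead of a stage
boundary; how the hardware hands out issue slots to the warp that is
not waiting is up to the scheduler.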
How to structure a convolution engine to run on a graphics
processor would very much depend on where you want the I/O.
Locally on the card, for use by other parts of the complex, unless a
routing directive says to read from or write to those arrays that are
transferred back and forth between the GPU and host at each kernel launch.
How much jconv would something like a 300 MHz Pentium Pro buy me?
(Just to get a hunch if this would be a possibility at all)
Almost impossible to tell without trying. It also depends
in a very complex way on the configuration - the ratios
will not be the same on all machines.
I found a measurement of ~1 sec for a 128K FFT on a PPro @ 200 MHz.
Would that be helpful for a guesstimate?
The thing is also that, although the first thing one might come to think
of is a nice convolution reverb with a decay of two seconds, having
instead 32 shorter impulses - all different - opens up another universe.
You could have an increasing delay in front of each of them, giving an
illusion that they are all parts of the same (huge) impulse response, or
you could use keyboard triggers and routing to play them like an
instrument.
Still, 500 ms would be really very useful and 32 convolutions is mmm ..
perhaps a little overkill. There might be ways for two or four threads
to share one load. IIRC library routines for SSE-enabled FFT exist
which could be more or less copied verbatim across four adjacent
threads.