On Tue, 2009-08-11 at 14:10 +0100, Steve Harris wrote:
Doing little things here and there /only/ would be very difficult in
general.
Ah, I see. I'm not really that familiar with how it works.
Is it not possible to stitch multiple plugins into a single processing
unit, horizontally? i.e. can you compose the CUDA objects?
No, you'd have to do your "stitching" ..
Going really high-tech, with access to the source you could compile the
mega-plugin on demand. That might be a bit adventurous, but it would
be a clear win over the kind of things the closed-source people can do.
.. at compile-time to get wildly different things running simultaneously
on each processor. At least the code must be present, and then you could
make the final choice at kernel launch by sending some parameter to the
GPU to ponder on:
switch (someArgument[blockID]) // we know where we are
{
case STRIP: ...; break;     // do the channel strip thingie
case MOOG: ...; break;      // do classic synth
case REVERB: ...; break;
case FAIRLIGHT: ...; break;
case ...
}
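
To make the dispatch idea concrete, here is a minimal host-side sketch
in plain C++ (C++14 or later) that emulates one selector per block. It
is not device code: on the GPU the block index would come from the
hardware. The effect names follow the cases above, but the numeric tags
and the helper names processBlock and runBlock are placeholders made up
for illustration.

```cpp
// Host-side emulation of per-block dispatch. Plain C++, so it runs
// without a GPU. All names and tag values here are illustrative only.

// Effect selectors, named after the cases in the switch above.
enum Effect { STRIP, MOOG, REVERB, FAIRLIGHT };

// Stand-in for the real DSP code: each case just returns a tag value
// so that the chosen code path can be observed.
constexpr int processBlock(Effect effect)
{
    switch (effect)
    {
    case STRIP:     return 10; // placeholder for the channel strip
    case MOOG:      return 20; // placeholder for the classic synth
    case REVERB:    return 30; // placeholder for the reverb
    case FAIRLIGHT: return 40; // placeholder for the sampler
    }
    return -1; // unknown selector
}

// One selector per block, as the host would fill it in before launch.
constexpr Effect someArgument[] = { REVERB, STRIP, MOOG, FAIRLIGHT };

// What a single block would do: look up its own selector and branch.
constexpr int runBlock(int blockID)
{
    return processBlock(someArgument[blockID]);
}
```

With this table, runBlock(2) picks the MOOG path and returns its tag,
20. All the code paths are compiled in, and only the table decides which
one each block takes.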
Well, almost that simple at least. There is still the 192 thread rule
saying that you need at least 192 threads on each processor to fully
hide pipeline latencies. Each "warp" - which is a group of 32 threads -
can still take different codepaths though, as long as the programmer
takes care that all paths will eventually arrive at any synchronisation
barrier issued by some other path. You'll get hung otherwise.
Furthermore, all processor configurations will be the same. If one
processor is configured for two blocks each having 128 threads, then
that is what the world looks like for all codepaths on all processors.
Still with me? In that case you are invited to do something wonderful
for one (or more) multiprocessors, each having 256 threads divided
between two blocks (128 threads per block) that will not use more than
the available 16K registers defined by compute capability 1.2 so that both
blocks will run concurrently, hiding latencies as well as sync barriers.
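
As a rough check of those numbers, the arithmetic can be sketched in
plain C++ (C++14 or later). The constants come straight from the text
above (32 threads per warp, the 192 thread rule, 16K registers per
multiprocessor). The function names are made up for illustration, and
the sketch assumes registers are split evenly across threads, ignoring
whatever allocation granularity the hardware really uses.

```cpp
// Back-of-the-envelope budget for one multiprocessor.
// Assumption: registers are divided evenly among resident threads.

constexpr int kWarpSize             = 32;        // threads per warp
constexpr int kLatencyHidingThreads = 192;       // the "192 thread rule"
constexpr int kRegistersPerMP       = 16 * 1024; // 16K registers

// Threads resident on one multiprocessor.
constexpr int residentThreads(int blocks, int threadsPerBlock)
{
    return blocks * threadsPerBlock;
}

// Number of warps those threads form.
constexpr int residentWarps(int blocks, int threadsPerBlock)
{
    return residentThreads(blocks, threadsPerBlock) / kWarpSize;
}

// Largest per-thread register count that still lets all blocks fit.
constexpr int registersPerThread(int blocks, int threadsPerBlock)
{
    return kRegistersPerMP / residentThreads(blocks, threadsPerBlock);
}

// True if there are enough threads to satisfy the 192 thread rule.
constexpr bool hidesLatency(int blocks, int threadsPerBlock)
{
    return residentThreads(blocks, threadsPerBlock) >= kLatencyHidingThreads;
}
```

For two blocks of 128 threads this gives 8 warps per multiprocessor,
which is above the 6 warps that 192 threads amount to, and a budget of
64 registers per thread.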
Figuring out where to read and write shared in/out in an organized way
would in theory be the first minor obstacle, but nobody will notice
before the individual parts of the project are eventually stitched
together. Also, nobody knows yet what kind of processing will be
available either or why anybody would like to read any other data, so ..
Just to get going, say the kernel will be called every 128 samples at a
sample rate suitable for the processor at hand. Assume something like a
rate of 96 kHz @ 1.2 GHz.
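
To get a feel for the time budget this implies, here is a small sketch
in plain C++ (C++14 or later). It assumes the figures above mean a
96 kHz sample rate and a 1.2 GHz processor clock, which is one reading
of them. The function names are invented for illustration.

```cpp
// Time budget per kernel call, using integer arithmetic.
// Assumption: 96 kHz sample rate, 1.2 GHz clock, 128 samples per call.

constexpr long long kClockHz        = 1200000000LL; // 1.2 GHz
constexpr long long kSampleRateHz   = 96000;        // 96 kHz
constexpr long long kSamplesPerCall = 128;

// Clock cycles that pass between two kernel calls.
constexpr long long cyclesPerCall(long long clockHz,
                                  long long sampleRateHz,
                                  long long samplesPerCall)
{
    return clockHz * samplesPerCall / sampleRateHz;
}

// Clock cycles available per single sample.
constexpr long long cyclesPerSample(long long clockHz, long long sampleRateHz)
{
    return clockHz / sampleRateHz;
}

// Wall-clock time between two kernel calls, in whole microseconds
// (truncated towards zero).
constexpr long long microsecondsPerCall(long long sampleRateHz,
                                        long long samplesPerCall)
{
    return samplesPerCall * 1000000LL / sampleRateHz;
}
```

Under those assumptions each call has about 1333 microseconds, or
1,600,000 clock cycles, to finish, which works out to 12,500 cycles per
sample.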
Sure, but it doesn't sound like it's as useful as a GP CPU.
Bah!
/jma
- Steve