On 12 Oct 2010, at 20:11, Jens M Andreasen wrote:
On Tue, 2010-10-12 at 16:29 +0200, Stéphane Letz wrote:
I've done some tests using OpenCL in the context of the Faust project
(http://faust.grame.fr/). So far the results are not really good, and I
guess CUDA/OpenCL will be usable only in specific cases.
What kinds of parallelism have you been exploring?
Well, Faust is able to generate a DAG of separate loops; some of them are
data-parallelizable, others are not (recursive).
Right now I'm testing a simple strategy where the DAG is reduced to a sequence of
"group of parallel loops" slices, with sync points added between slices.
Data-parallelizable loops are not yet correctly handled, which obviously has to be done. So
the current model is basically task parallelism and is quite naive...
I have found that the multiple channel strip approach with mixdown to
subgroups is straightforward for a DAW, as well as for polysynths.
Global sync points between multiprocessors - for cascaded processing -
work up to a limit, after which the squared cost of the sync eats up the
computational value of the added multiprocessor. I am syncing the 6 MPs
on a GT220 every 16 samples at 96 kHz with a penalty in the 5% range. On
higher-end cards this approach is not very useful, though, and it is also
discouraged by Nvidia staffers ...
In theory, the granularity of the vector needs to be no higher than 32
vector elements to be efficient - that is how the hardware
multithreading works. In practice, for the given use case, you will find
yourself thrashing the instruction cache if you use too many divergent
warps. Two, perhaps three, completely different code paths on each MP
work well.
192 or 256 threads are a minimum to hide instruction latency, leading to
the conclusion that the effective vector as seen by the outside world
needs to be at most 128 elements wide (256/2, which is what I currently
use) and possibly as low as 64 (192 threads / 3 code paths).
Well, you obviously have a lot of practical knowledge I don't have. Any code samples
you could share?
I'll now probably test whether using CUDA directly
would give some benefit.
Maybe we can share some ideas?
CUDA is nice :)
So I'll try.
Thanks
Stéphane