[LAD] Linux-audio-dev Digest, Vol 44, Issue 6

Stéphane Letz letz at grame.fr
Tue Oct 12 18:30:22 UTC 2010

On 12 Oct 2010, at 20:11, Jens M Andreasen wrote:

> On Tue, 2010-10-12 at 16:29 +0200, Stéphane Letz wrote:
>
>> I've done some tests using OpenCL in the context of the Faust project
>> (http://faust.grame.fr/). Up to now the results are not really good, and I
>> guess CUDA/OpenCL will be usable only in specific cases.
> 
> What kinds of parallelism have you been exploring?

Well, Faust is able to generate a DAG of separate loops; some of them are data-parallelizable, others are not (the recursive ones).
Right now I'm testing a simple strategy where the DAG is reduced to a sequence of "group of parallel loops" slices, with sync points added between slices.

Data-parallelizable loops are not yet handled correctly, which obviously has to be done. So the current model is basically task parallelism and is quite naive...
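
To make the slicing concrete, here is a minimal sketch in Python. It is not Faust's code generator: the function name `slice_dag`, the `deps` mapping and the loop names are all hypothetical, chosen only to illustrate the idea. Each slice holds the loops whose inputs have all been computed in earlier slices, so the loops of one slice could run in parallel and a sync point would sit between consecutive slices.

```python
def slice_dag(deps):
    """Reduce a loop DAG to a sequence of slices of independent loops.

    deps maps each loop name to the set of loops it reads from
    (hypothetical representation, for illustration only).
    """
    remaining = {loop: set(inputs) for loop, inputs in deps.items()}
    slices = []
    while remaining:
        # Loops with no pending inputs can all run in the same slice.
        ready = sorted(loop for loop, inputs in remaining.items() if not inputs)
        if not ready:
            # Either a cycle, or an input that is not a key of deps.
            raise ValueError("unresolvable dependencies: not a valid DAG")
        slices.append(ready)
        for loop in ready:
            del remaining[loop]
        # A sync point would go here: later loops wait for this slice.
        for inputs in remaining.values():
            inputs.difference_update(ready)
    return slices


# Hypothetical graph: two independent sources, one mix, two consumers.
example = {
    "osc": set(),
    "noise": set(),
    "mix": {"osc", "noise"},
    "filter": {"mix"},
    "meter": {"mix"},
}
print(slice_dag(example))
# [['noise', 'osc'], ['mix'], ['filter', 'meter']]
```

With this grouping the number of sync points is the number of slices minus one, which is why a deep, narrow DAG gains little from this kind of task parallelism.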

> 
> I have found that the multiple channel strip approach with mixdown to
> subgroups is straightforward for a DAW, as well as for polysynths.
> 
> Global sync points between multiprocessors - for cascaded processing -
> work up to a limit, after which the squared cost of the sync eats up the
> computational value of the added multiprocessor. I am syncing the 6 MPs
> on a GT220 every 16 samples at 96 kHz with a penalty in the 5% range. On
> higher end cards this approach is not very useful though. It is also
> discouraged by Nvidia staffers ...
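
For scale, the quoted sync figures can be turned into a time budget. This is only a back-of-the-envelope sketch in Python: it assumes the 5% penalty is a fraction of each sync interval, which the quoted text does not state, and the function name `sync_budget` is made up for illustration.

```python
def sync_budget(sample_rate_hz, samples_per_sync, penalty_fraction):
    """Return (interval in microseconds, syncs per second, penalty in microseconds)."""
    interval_us = samples_per_sync / sample_rate_hz * 1e6
    syncs_per_second = sample_rate_hz / samples_per_sync
    # Assumption: the penalty is a fraction of the time between two syncs.
    penalty_us = interval_us * penalty_fraction
    return interval_us, syncs_per_second, penalty_us


# Figures from the quoted message: 16 samples at 96 kHz, roughly 5% penalty.
interval_us, syncs_per_second, penalty_us = sync_budget(96_000, 16, 0.05)
print(f"{interval_us:.1f} us between syncs")       # 166.7 us between syncs
print(f"{syncs_per_second:.0f} syncs per second")  # 6000 syncs per second
print(f"{penalty_us:.1f} us spent per sync")       # 8.3 us spent per sync
```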
> 
> 
> In theory, the granularity of the vector needs to be no higher than 32
> vector elements to be efficient - that is how the hardware
> multithreading works. In practice, for the given use case, you will find
> yourself thrashing the instruction cache if you use too many divergent
> warps. Two, perhaps three, completely different code paths on each MP
> work well.
> 
> 192 or 256 threads are the minimum to hide instruction latency, leading to
> the conclusion that the effective vector as seen by the outside world
> needs to be at most 128 elements wide (256/2, which is what I currently
> use) and possibly as low as 64 (192 threads / 3 code paths).
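
The arithmetic in the quoted paragraph can be checked with a few lines of Python. This is a sketch under the stated numbers only: 32 threads per warp, and the threads of one MP split evenly between the divergent code paths. The helper name `effective_width` is hypothetical.

```python
WARP_SIZE = 32  # threads per warp, the granularity mentioned in the quote


def effective_width(threads_per_mp, code_paths):
    """Return (vector width per code path, warps per code path)."""
    if threads_per_mp % (code_paths * WARP_SIZE) != 0:
        raise ValueError("threads do not split into whole warps per code path")
    width = threads_per_mp // code_paths
    return width, width // WARP_SIZE


print(effective_width(256, 2))  # (128, 4)
print(effective_width(192, 3))  # (64, 2)
```

Each code path then owns a whole number of warps, so no warp has to execute two divergent paths.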

Well, you obviously have a lot of practical knowledge that I don't have. Do you have any code samples you could share?
> 
> 
>> I'll probably now test if directly using CUDA would give some benefit.
>> Maybe we can share some ideas?
> 
> CUDA is nice :)

So I'll try.

Thanks

Stéphane
