Re: [LAD] vectorization

27 May 2008

On Wed, 2008-05-07 at 10:50 +0200, Fons Adriaensen wrote:
[about zita convolver]
...
  It's quite complex. I'll try to build up a picture in four steps.
-<fast FWD>-
...
  4. The scheme above is FFT-based partitioned convolution with a single
    partition size P. For efficiency you want large P, but this also
    introduces processing delay as you can only start the computation
    when P new input samples are available. To avoid this delay in a
    real-time application zita-convolver uses multiple partition sizes,
    small at the start of an IR, and larger ones for the later parts.
    There can be up to five sizes. Calculations for P == period size
    are performed directly in the JACK callback, the longer ones are
    performed by lower priority threads. A final optimisation is that
    a sparse matrix representation is used in all three dimensions,
    so no time or memory is wasted on zero-valued data.
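
To make the single-partition-size scheme in step 4 concrete, here is a
minimal sketch of uniformly partitioned overlap-save convolution in plain
Python. It is not zita-convolver's code: the function names, the
frequency-domain delay line layout and the toy radix-2 FFT are all
assumptions made for illustration, and P must be a power of two.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT. len(x) must be a power of two.
    The inverse transform is left unscaled (caller divides by n)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def partitioned_convolve(signal, ir, P):
    """Overlap-save convolution with a single partition size P (hypothetical helper)."""
    N = 2 * P
    # Split the IR into partitions of P samples, zero-pad to 2P, transform once.
    parts = [ir[i:i + P] for i in range(0, len(ir), P)]
    H = [fft([complex(v) for v in p] + [0j] * (N - len(p))) for p in parts]
    # Frequency-domain delay line: spectra of the most recent input blocks.
    fdl = [[0j] * N for _ in H]
    total = len(signal) + len(ir) - 1
    nblocks = -(-total // P)  # ceiling division, so the IR tail is flushed
    padded = list(signal) + [0.0] * (nblocks * P - len(signal))
    prev = [0.0] * P
    out = []
    for b in range(nblocks):
        # Work can only start once P new input samples are available.
        block = padded[b * P:(b + 1) * P]
        X = fft([complex(v) for v in prev + block])
        fdl = [X] + fdl[:-1]
        # Multiply-accumulate every partition against its delayed spectrum.
        acc = [0j] * N
        for Xk, Hk in zip(fdl, H):
            for i in range(N):
                acc[i] += Xk[i] * Hk[i]
        y = fft(acc, inverse=True)
        # The first P outputs are circularly aliased; keep the last P.
        out.extend(v.real / N for v in y[P:])
        prev = block
    return out[:total]

def direct_convolve(signal, ir):
    """Reference time-domain convolution, used only to check the result."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out
```

Calling partitioned_convolve(x, h, P) should agree with
direct_convolve(x, h) up to rounding error. The loop body cannot run
until a full block of P samples has arrived, which is the processing
delay described above; the multiple partition sizes are not modelled here.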
     
Since you are bound by bandwidth to main memory, it would be nice to get
you off the precious level 2 cache. There are hints available to bypass
the cache, but a better solution might be to look into nVidia's C-like
CUDA language. Here are some measurements of their FFT library on a
€200 card with a 256-bit GDDR3 memory interface:
 http://www.cv.nrao.edu/~pdemores/gpu/
... and here is an introduction where you can get an idea of how
involved that might be:
 http://www.ddj.com/hpc-high-performance-computing/207200659
Silent cards with passive cooling are available from about €50, but have
"only" ordinary 128-bit DDR2 memory and considerably less computational
power. 4 Gflops (real world) might be enough for the application at hand
though?
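
One way to put a rough number on that question is a back-of-envelope
flop count for single-size partitioned convolution. The sketch below is
an estimate under stated assumptions, not a measurement: the function
name is made up, the 2.5 * N * log2(N) cost for a real FFT of size N is
a common rule of thumb, and the IR length, sample rate and partition
size in the usage note are hypothetical.

```python
import math

def convolution_gflops(ir_seconds, sample_rate, P, channels=1):
    """Rough flop rate (in Gflops) of single-size partitioned convolution."""
    N = 2 * P                                    # FFT size for partition size P
    K = math.ceil(ir_seconds * sample_rate / P)  # number of IR partitions
    fft_flops = 2.5 * N * math.log2(N)           # rule-of-thumb cost, real FFT
    mac_flops = 8 * (P + 1) * K                  # complex multiply-add = 8 flops
    per_block = 2 * fft_flops + mac_flops        # forward FFT + inverse FFT + MACs
    blocks_per_second = sample_rate / P
    return channels * per_block * blocks_per_second / 1e9
```

For example, convolution_gflops(2.0, 48000, 1024) gives the estimate for
a hypothetical 2 second IR at 48 kHz with P = 1024. Smaller P raises the
figure because there are more partitions and more blocks per second.
The count ignores memory traffic entirely, which is the bottleneck
assumed at the top of this reply.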
Apparently nVidia has implemented some kind of hardware permute in their
later designs (from the 8400 and up), opening up their GPUs to a much
wider range of algorithms than previous generations. Real world
performance in Gflops appears to be about 1/10 of the peak shader
throughput mentioned in this table at Wikipedia:
 http://en.wikipedia.org/wiki/GeForce_8_Series#Technical_summary
As an added bonus, should somebody look into and implement any of this,
we would all get the perfect excuse for achieving insane framerates in
Quake/OpenArena :-D
Disclaimer: nVidia's product codes have become a disgustingly confusing
alphabet soup. I might have misread some comparison table.
--
