[LAD] vectorization

Fons Adriaensen fons at kokkinizita.net
Wed May 7 08:50:09 UTC 2008

On Wed, May 07, 2008 at 06:00:38AM +0200, Jens M Andreasen wrote:

> One thing that I wonder is what the pattern of addition in Fons's
> application really looks like. I assume the fftA * fftB is some windowed
> precalculated impulsresponse and a signal? The addition/accumulate
> suggests that the output fftD has been touched before, implying that
> there could be more variables to work on at once or that the vectors
> would still be in the cache if the order of addition was changed.

It's quite complex. I'll try to build up a picture in four steps.

1. The first step to imagine is just a scalar matrix operating on
   a number of input sample streams A [j][t] and producing a number
   of output sample streams B [i][t]:

   index i = output
   index j = input
   index t = time

   B [i][t] = sum_j (M [i][j] * A [j][t])

2. Now imagine that the matrix is not scalar, but a matrix of FIR
   B [i][t] = sum_j (sum_k (M [i][j][k] * A [j][t- k]))

   with k = 0...L-1, for filters of length L. We now need to
   store the past L-1 samples of each input as well.

   Assume this calculation is done for one value of t (i.e. one
   sample)  at a time (we'll see why soon). There are six ways to
   organise the three nested loops. The one actually used is such
   that k is the innermost index.

3. Now imagine that each sample in the above equation is actually a
   block of P samples. This is how FFT based partitioned convolution
   works. We need a size 2P FFT on each input block (the second half
   are zeros) and on each matrix element (the latter is precomputed
   of course), a size 2P IFFT on each output block, and each of the
   multiplications now becomes a MAC loop over P complex values. The
   output size is 2P, i.e. the second half overlaps witht the next
   iteration and has to be stored. There are size(A) * size(B) * L
   such MAC loops for each block, so it helps to optimize this

4. The scheme above is FFT-based partitioned convolution with a single
   partition size P. For efficiency you want large P, but this also
   introduces processing delay as you can only start the computation
   when P new input samples are available. To avoid this delay in a
   real-time application zita-convolver uses multiple partition sizes,
   small at the start of an IR, and larger ones for the later parts.
   There can be up to five sizes. Calculations for P == period size
   are performed directly in the JACK callback, the longer ones are
   performed by lower priority threads. A final optimisation is that
   a sparse matrix representation is used in all three dimensions,
   so no time or memory is wasted on zero-valued data.


Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.

More information about the Linux-audio-dev mailing list