Wed May 7 08:50:09 UTC 2008

```On Wed, May 07, 2008 at 06:00:38AM +0200, Jens M Andreasen wrote:

> One thing that I wonder is what the pattern of addition in Fons's
> application really looks like. I assume the fftA * fftB is some windowed
> precalculated impulsresponse and a signal? The addition/accumulate
> suggests that the output fftD has been touched before, implying that
> there could be more variables to work on at once or that the vectors
> would still be in the cache if the order of addition was changed.

It's quite complex. I'll try to build up a picture in four steps.

1. The first step to imagine is just a scalar matrix operating on
a number of input sample streams A [j][t] and producing a number
of output sample streams B [i][t]:

index i = output
index j = input
index t = time

B [i][t] = sum_j (M [i][j] * A [j][t])

2. Now imagine that the matrix is not scalar, but a matrix of FIR
filters:

B [i][t] = sum_j (sum_k (M [i][j][k] * A [j][t- k]))

with k = 0...L-1, for filters of length L. We now need to
store the past L-1 samples of each input as well.

Assume this calculation is done for one value of t (i.e. one
sample)  at a time (we'll see why soon). There are six ways to
organise the three nested loops. The one actually used is such
that k is the innermost index.

3. Now imagine that each sample in the above equation is actually a
block of P samples. This is how FFT based partitioned convolution
works. We need a size 2P FFT on each input block (the second half
are zeros) and on each matrix element (the latter is precomputed
of course), a size 2P IFFT on each output block, and each of the
multiplications now becomes a MAC loop over P complex values. The
output size is 2P, i.e. the second half overlaps witht the next
iteration and has to be stored. There are size(A) * size(B) * L
such MAC loops for each block, so it helps to optimize this
operation.

4. The scheme above is FFT-based partitioned convolution with a single
partition size P. For efficiency you want large P, but this also
introduces processing delay as you can only start the computation
when P new input samples are available. To avoid this delay in a
real-time application zita-convolver uses multiple partition sizes,
small at the start of an IR, and larger ones for the later parts.
There can be up to five sizes. Calculations for P == period size
are performed directly in the JACK callback, the longer ones are
performed by lower priority threads. A final optimisation is that
a sparse matrix representation is used in all three dimensions,
so no time or memory is wasted on zero-valued data.

--
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.

```