On Wed, May 07, 2008 at 06:00:38AM +0200, Jens M Andreasen wrote:
One thing that I wonder is what the pattern of
addition in Fons's
application really looks like. I assume the fftA * fftB is some windowed
precalculated impulsresponse and a signal? The addition/accumulate
suggests that the output fftD has been touched before, implying that
there could be more variables to work on at once or that the vectors
would still be in the cache if the order of addition was changed.
It's quite complex. I'll try to build up a picture in four steps.
1. The first step to imagine is just a scalar matrix operating on
a number of input sample streams A [j][t] and producing a number
of output sample streams B [i][t]:
index i = output
index j = input
index t = time
B [i][t] = sum_j (M [i][j] * A [j][t])
2. Now imagine that the matrix is not scalar, but a matrix of FIR
filters:
B [i][t] = sum_j (sum_k (M [i][j][k] * A [j][t- k]))
with k = 0...L-1, for filters of length L. We now need to
store the past L-1 samples of each input as well.
Assume this calculation is done for one value of t (i.e. one
sample) at a time (we'll see why soon). There are six ways to
organise the three nested loops. The one actually used is such
that k is the innermost index.
3. Now imagine that each sample in the above equation is actually a
block of P samples. This is how FFT based partitioned convolution
works. We need a size 2P FFT on each input block (the second half
are zeros) and on each matrix element (the latter is precomputed
of course), a size 2P IFFT on each output block, and each of the
multiplications now becomes a MAC loop over P complex values. The
output size is 2P, i.e. the second half overlaps witht the next
iteration and has to be stored. There are size(A) * size(B) * L
such MAC loops for each block, so it helps to optimize this
operation.
4. The scheme above is FFT-based partitioned convolution with a single
partition size P. For efficiency you want large P, but this also
introduces processing delay as you can only start the computation
when P new input samples are available. To avoid this delay in a
real-time application zita-convolver uses multiple partition sizes,
small at the start of an IR, and larger ones for the later parts.
There can be up to five sizes. Calculations for P == period size
are performed directly in the JACK callback, the longer ones are
performed by lower priority threads. A final optimisation is that
a sparse matrix representation is used in all three dimensions,
so no time or memory is wasted on zero-valued data.
--
FA
Laboratorio di Acustica ed Elettroacustica
Parma, Italia
Lascia la spina, cogli la rosa.