[LAD] vectorization
Fons Adriaensen
fons at kokkinizita.net
Wed May 7 08:50:09 UTC 2008
On Wed, May 07, 2008 at 06:00:38AM +0200, Jens M Andreasen wrote:
> One thing that I wonder is what the pattern of addition in Fons's
> application really looks like. I assume the fftA * fftB is some windowed
> precalculated impulsresponse and a signal? The addition/accumulate
> suggests that the output fftD has been touched before, implying that
> there could be more variables to work on at once or that the vectors
> would still be in the cache if the order of addition was changed.
It's quite complex. I'll try to build up a picture in four steps.
1. The first step to imagine is just a scalar matrix operating on
a number of input sample streams A [j][t] and producing a number
of output sample streams B [i][t]:
index i = output
index j = input
index t = time
B [i][t] = sum_j (M [i][j] * A [j][t])
2. Now imagine that the matrix is not scalar, but a matrix of FIR
filters:
B [i][t] = sum_j (sum_k (M [i][j][k] * A [j][t- k]))
with k = 0...L-1, for filters of length L. We now need to
store the past L-1 samples of each input as well.
Assume this calculation is done for one value of t (i.e. one
sample) at a time (we'll see why soon). There are six ways to
organise the three nested loops. The one actually used is such
that k is the innermost index.
3. Now imagine that each sample in the above equation is actually a
block of P samples. This is how FFT based partitioned convolution
works. We need a size 2P FFT on each input block (the second half
are zeros) and on each matrix element (the latter is precomputed
of course), a size 2P IFFT on each output block, and each of the
multiplications now becomes a MAC loop over P complex values. The
output size is 2P, i.e. the second half overlaps witht the next
iteration and has to be stored. There are size(A) * size(B) * L
such MAC loops for each block, so it helps to optimize this
operation.
4. The scheme above is FFT-based partitioned convolution with a single
partition size P. For efficiency you want large P, but this also
introduces processing delay as you can only start the computation
when P new input samples are available. To avoid this delay in a
real-time application zita-convolver uses multiple partition sizes,
small at the start of an IR, and larger ones for the later parts.
There can be up to five sizes. Calculations for P == period size
are performed directly in the JACK callback, the longer ones are
performed by lower priority threads. A final optimisation is that
a sparse matrix representation is used in all three dimensions,
so no time or memory is wasted on zero-valued data.
--
FA
Laboratorio di Acustica ed Elettroacustica
Parma, Italia
Lascia la spina, cogli la rosa.
More information about the Linux-audio-dev
mailing list