[LAD] Fw: Re: Some questions about the Jack callback

Fons Adriaensen

20 Sep 2014

10:40 p.m.

On Sat, Sep 20, 2014 at 04:10:13PM -0400, Mark D. McCurry wrote:

...

On 09-20, Fons Adriaensen wrote:

Having to do 256 1024-point FFTs just to start a note is insane. It almost certainly means there is something fundamentally wrong with the synthesis algorithm used.

I agree with that notion. In typical patches something between 2-10 IFFTs is expected and even this cost strikes me as too high (zero IFFTs for pure PAD/SUB synth based). In terms of worst case scenarios ZynAddSubFX can have some rather insane characteristics given multiple parts, kits, voices, etc. For instance if a user decided to use all padsynth instances at max quality, they would need 12GB of memory just to store the resulting wavetables. Such extremes are not really seen in practice, but things are slowly getting optimized to avoid them when possible.

You should really look at this from an information theory POV, combined with some psycho-acoustics.

Suppose you have to deliver 256 samples in a period when a note starts. That amounts to around 5.3 ms at 48 kHz. That time limits the amount of spectral detail that can be detected given the output from the first period. Which means that there is no point in generating more detail in the first period of a note.

Even on sustained notes the amount of spectral detail that can be detected by a human listener is limited by the critical bandwidth of human hearing (which increases with frequency). That means that any set of harmonics that fall within a critical bandwidth can be replaced by a single one with the same energy and nobody would be able to hear the difference.

All this means that you *never* need 256 harmonics, not even on bass notes below Fs / (2 * 256).

And if the final output is a weighted sum of those IFFT outputs you can as well compute the weighted sum of the inputs and then do a single IFFT - it's a linear transform after all.

Ciao,

--
FA

A world of exhaustive, reliable metadata would be an utopia. It's also a pipe-dream, founded on self-delusion, nerd hubris and hysterically inflated market opportunities. (Cory Doctorow)

----- End forwarded message -----
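The linearity argument and the timing figures in the message above can be checked with a small sketch. This is only an illustration: the naive `idft` helper, the three 8-bin spectra and the mixing weights are made up for the demonstration and are not taken from ZynAddSubFX. The 1/T figure is the usual rough estimate of frequency resolution for a window of length T.

```python
import cmath

def idft(spectrum):
    """Naive O(N^2) inverse DFT; slow, but enough for a small demonstration."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length sequences."""
    length = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(length)]

# Three made-up 8-bin spectra and mixing weights (illustrative values only).
spectra = [
    [0, 1, 0.5, 0, 0, 0, 0.5, 1],
    [0, 0.2j, 0, 0.3, 0, 0.3, 0, -0.2j],
    [1, 0, 0, 0.25, 0, 0.25, 0, 0],
]
weights = [0.6, 0.3, 0.1]

# Route A: one IFFT per spectrum, then mix the time-domain results.
many_iffts = weighted_sum([idft(s) for s in spectra], weights)
# Route B: mix the spectra first, then do a single IFFT.
single_ifft = idft(weighted_sum(spectra, weights))

# Largest difference between the two routes (rounding error only).
max_diff = max(abs(a - b) for a, b in zip(many_iffts, single_ifft))

period_ms = 1000.0 * 256 / 48000   # one 256-frame period at 48 kHz, about 5.3 ms
resolution_hz = 48000 / 256        # rough 1/T frequency resolution of that period
lowest_f0_hz = 48000 / (2 * 256)   # Fs / (2 * 256), the bass-note bound quoted above
```

Both routes give the same samples up to floating-point rounding, which is the point about mixing before transforming. Note that this only holds while the combination is a plain weighted sum; any nonlinear step between the IFFT and the mix would break the equivalence.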

Mark D. McCurry

20 Sep

11:01 p.m.

On 09-20, Fons Adriaensen wrote:

...

On Sat, Sep 20, 2014 at 04:10:13PM -0400, Mark D. McCurry wrote:

On 09-20, Fons Adriaensen wrote:

Having to do 256 1024-point FFTs just to start a note is insane. It almost certainly means there is something fundamentally wrong with the synthesis algorithm used.

If you are proposing that 256 harmonics are not needed, then is there a transformation that yields an equivalent psycho-acoustic output in less time than the FFT would have taken given any possible spectral input? (The user has full control over the full spectrum in terms of phase/magnitude.) If so, I'd be interested in reading some papers on that topic, though I'm skeptical as the work on k-sparse FFTs indicates that this FFT size is much too small to gain any measurable advantages (above k\approx2).

As it stands the source for a per note voice wavetable is a spectral representation which is combined with some frequency dependent manipulation (e.g. removing harmonics which would alias and the aforementioned adaptive harmonics) which gets thrown into an IFFT. The resulting wavetable is fairly large to make the error of linear interpolation small (so as to minimize the normal running cost). Additionally the output from traversing the wavetable can be the source for a number of nonlinear functions (FM/PM source function and distortions).

If there weren't any nonlinear functions later in the chain, then there might be some additional flexibility, but I don't perceive too much wiggle room without precalculating the possible wavetables. Also, the idea of a set critical bandwidth is broken here due to the ability to modulate wildly without recalculating the base wavetable. (This is the largest correctness issue from a signal processing perspective in that synthesis engine ATM.)

--Mark
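The trade-off between wavetable size and linear interpolation error mentioned above can be sketched as follows. This is a minimal illustration, assuming a table holding a single sine cycle; the helper names and the table sizes are hypothetical and are not ZynAddSubFX's own.

```python
import math

def make_sine_table(size):
    """One cycle of a sine wave sampled at `size` points."""
    return [math.sin(2 * math.pi * i / size) for i in range(size)]

def read_linear(table, phase):
    """Read the table at a fractional phase in [0, 1) with linear interpolation."""
    pos = phase * len(table)
    i = int(pos)
    frac = pos - i
    a = table[i % len(table)]
    b = table[(i + 1) % len(table)]   # wrap around at the end of the cycle
    return a + frac * (b - a)

def max_interp_error(size, probes=10007):
    """Largest deviation from the exact sine over a dense grid of phases."""
    table = make_sine_table(size)
    return max(
        abs(read_linear(table, p / probes) - math.sin(2 * math.pi * p / probes))
        for p in range(probes)
    )

# Interpolation error for a few hypothetical table sizes.
errors = {size: max_interp_error(size) for size in (64, 256, 1024)}
```

For a sine the worst-case error shrinks roughly with the square of the table size, so each fourfold increase in size buys about a sixteenfold reduction in error. Higher harmonics stored in the same table have proportionally more curvature per table step, which is one reason a table holding many harmonics needs to be large.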

Fons Adriaensen

11:34 p.m.

On Sat, Sep 20, 2014 at 05:01:32PM -0400, Mark D. McCurry wrote:

...

I'm certainly not claiming that there is some simple trick to simplify things.

But the information theory point still stands: if you compute 256K samples and only output 256, 512 or 1024 that means that 99% of the information you have is thrown away. Which probably means it was not necessary to compute those 256K in the first place, at least not to produce the first period of output. The only case where this is not true is for algorithms that deliberately hide or destroy information, i.e. cryptographic ones.

'The user has full control over the full spectrum'. The question is if this is necessary - if all that detail is really perceptible in the final output. If it is not, then there is no point in generating it in the first place.
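The "99%" figure can be checked with simple arithmetic. The sketch below assumes that "256K samples" means the 256 IFFTs of 1024 points each from the start of the thread; the names are illustrative only.

```python
FFT_COUNT = 256    # IFFTs per note start, from the opening complaint
FFT_SIZE = 1024    # points per IFFT

computed = FFT_COUNT * FFT_SIZE   # 262144 samples, read here as the "256K" above

def discarded_fraction(period_frames, total=computed):
    """Fraction of the computed samples not needed for the first period."""
    return 1.0 - period_frames / total

# Period sizes named in the message.
fractions = {n: discarded_fraction(n) for n in (256, 512, 1024)}
```

Under that assumption the discarded share is above 99.6% for all three period sizes, so "99%" is, if anything, an understatement.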

...

If so, I'd be interested in reading some papers on that topic, though I'm skeptical as the work on k-sparse FFTs indicates that this FFT size is much too small to gain any measurable advantages (above k\approx2). As it stands the source for a per note voice wavetable is a spectral representation which is combined with some frequency dependent manipulation (e.g. removing harmonics which would alias and the aforementioned adaptive harmonics) which gets thrown into an IFFT.

There's been a lot of research the last years into sparse representation of some kinds of signals and into compressive sampling, but these things are not simple from a computational POV. And all of it is about efficiently capturing signals, not generating them.

...

The resulting wavetable is fairly large to make the error of linear interpolation small (as to minimize the normal running cost). Additionally the output from traversing the wavetable can be the source for a number of nonlinear functions (FM/PM source function and distortions). If there weren't any nonlinear functions later in the chain, then there might be some additional flexibility, but I don't perceive too much wiggle room without precalculating the possible wavetables. Also, the idea of a set critical bandwidth is broken here due to the ability to modulate wildly without recalculating the base wavetable.

If a signal from a wavetable is used as input to non-linear processes, or modulated wildly that means that much of the detailed spectral information contained in the wavetable is modified in complex ways or smeared out or just destroyed. Which again raises the question if that information was required there in the first place.

In summary, simplifying the current algorithms while preserving the exact output you have now may well be very difficult or impossible. But I doubt very much that you need the exact same output.

Ciao,

--
FA

Mark D. McCurry

21 Sep

12:21 a.m.

On 09-20, Fons Adriaensen wrote:

...

On Sat, Sep 20, 2014 at 05:01:32PM -0400, Mark D. McCurry wrote:

I'm certainly not claiming that there is some simple trick to simplify things.

Darn, I was hoping for you to point out some sort of interesting way of working with psycho-acoustic space. I have worked with it some, but I surely have large gaps in the details and possible best ways of working with it.

...

But the information theory point still stands: if you compute 256K samples and only output 256, 512 or 1024 that means that 99% of the information you have is thrown away. Which probably means it was not necessary to compute those 256K in the first place, at least not to produce the first period of output. The only case where this is not true is for algorithms that deliberately hide or destroy information, i.e. cryptographic ones.

Yep, a good example of this was done a good while back when detuned sets of identical oscillators were made to use the same wavetable.

...

'The user has full control over the full spectrum'. The question is if this is necessary - if all that detail is really perceptible in the final output. If it is not, then there is no point in generating it in the first place.

I'd side with it not really being needed, but going for a more restricted interface does require careful thought on how the user is going to interact with the software. ZynAddSubFX is a very large complex beast even just within the oscillator generator and I'll admit that I don't have any bright ideas on how to better express the parameter space without negatively impacting some portion of the existing use cases. Designing a good user experience is hard IMO, and getting it right isn't going to be quite as simple as reading up on a book or two. The fairly general nature of some of zyn's components certainly makes nailing things down somewhat harder.

...

Yeah, that's the side of things that I've been exposed to. Lots of compressed sensing and sparse model based representations used to extract information out and manipulate it. It is somewhat hard to verify if you have managed to get yourself into a bubble and overlook nearby work which you have not directly interacted with.

...

The answer to that is that yes, most of the information in the original signal isn't really going to be present there in a useful way, but changing the original signal while remaining perceptually close to the distorted version is hard, though this is more of a problem with maintaining compatibility than anything else.

...

In summary, simplifying the current algorithms while preserving the exact output you have now may well be very difficult or impossible. But I doubt very much that you need the exact same output.

Yes, I agree completely. --Mark

Fons Adriaensen

11:19 a.m.

On Sat, Sep 20, 2014 at 06:21:02PM -0400, Mark D. McCurry wrote:

...

Critical bandwidths would be something that maybe can be exploited without introducing too much complexity. Almost all lossy audio compression schemes are based on this. So one way would be to explore in how far you could parametrise e.g. an ogg or mp3 decoder and turn it into a synth engine.

...

I'd side with it not really being needed, but going for a more restricted interface does require careful thought on how the user is going to interact with the software.

One way would be to keep the user interface, but to reduce the user input before it's used. That's cheating of course, but I don't see any fundamental reason why that wouldn't be acceptable... this is about a musical instrument, not a scientific tool. Which means that the user will explore things instead of having a predefined input that leads to a certain result.
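One way to picture "reducing the user input before it's used" is to merge harmonics that share a critical band into a single partial of the same energy, as suggested earlier in the thread. The sketch below is only an assumption about how that might look: it uses the Zwicker and Terhardt approximation of critical bandwidth, a greedy grouping chosen purely for simplicity, and it ignores phase. The function names and the 1/k test spectrum are hypothetical.

```python
import math

def critical_bandwidth_hz(f_hz):
    """Zwicker & Terhardt approximation of the critical bandwidth at f_hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

def reduce_harmonics(f0_hz, amplitudes):
    """Greedily merge harmonics that fall within one critical band.

    amplitudes[i] is the amplitude of harmonic i + 1 of f0_hz. Returns a
    list of (frequency_hz, amplitude) partials with the same total energy.
    """
    partials = []
    n = 0
    while n < len(amplitudes):
        band_start = (n + 1) * f0_hz
        band_end = band_start + critical_bandwidth_hz(band_start)
        energy = 0.0
        weighted_freq = 0.0
        # Collect every harmonic below the upper edge of this band.
        while n < len(amplitudes) and (n + 1) * f0_hz < band_end:
            e = amplitudes[n] ** 2
            energy += e
            weighted_freq += e * (n + 1) * f0_hz
            n += 1
        if energy > 0.0:
            # One partial at the energy-weighted mean frequency of the band.
            partials.append((weighted_freq / energy, math.sqrt(energy)))
    return partials

f0 = 55.0                                     # a low bass note
harmonics = [1.0 / k for k in range(1, 257)]  # made-up 256-harmonic spectrum
reduced = reduce_harmonics(f0, harmonics)
```

With these inputs the 256 harmonics collapse to a few dozen partials while the summed energy stays the same. Whether the result is perceptually equivalent is exactly the open question raised in the thread; the sketch says nothing about that.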

...

Yeah, that's the side of things that I've been exposed to. Lots of compressed sensing and sparse model based representations used to extract information out and manipulate it.

Fascinating stuff, isn't it? I only became aware of this two years or so ago, and worked my way through some of the fundamental papers by Candès, Donoho et al. If you have pointers to any material related to audio applications of compressive sensing I'd be very interested.

Ciao,

--
FA

Len Ovens

4:54 p.m.

On Sun, 21 Sep 2014, Fons Adriaensen wrote:

...

Almost all lossy audio compression schemes are based on this. So one way would be to explore in how far you could parametrise e.g. an ogg or mp3 decoder and turn it into a synth engine.

The expected latency with ogg or mp3 is 200+ms; the Celt end of Opus might be a better choice (5ms). Even Silk is probably higher latency than a sound generator wants.

The latency listed for all these codecs is the whole chain: encode plus transport plus decode. Opus gets its lower latency in some cases by throwing away missing packets that ogg might be able to wait for. So it is possible that the decode part of all of these might have useful ideas.

--
Len Ovens
www.ovenwerks.net

Will Godfrey

5:39 p.m.

On Sun, 21 Sep 2014 07:50:45 -0700 (PDT) Len Ovens <len(a)ovenwerks.net> wrote:

...

On Sun, 21 Sep 2014, Fons Adriaensen wrote:

Almost all lossy audio compression schemes are based on this. So one way would be to explore in how far you could parametrise e.g. an ogg or mp3 decoder and turn it into a synth engine.

Sheesh! I'm away for one day (live folk music actually) and the thread explodes :o

I can follow this in outline, although not the finer details, but my greatest concern is how would you go about proving there was no noticeable difference to the listener? With all the interaction possibilities I suspect there are rather a lot of corner cases.

On a slightly divergent point, someone tried to correct a spelling mistake in Yoshi that made a saved parameter invalid, but of all the sounds in all the banks I have, this made a very slight but noticeable change to just one instrument. Had it not been one I use frequently I might still not have realised it.

I'm also reminded of the situation when the web was fairly new and people would make copies of copies of jpegs. The differences only got noticeable about 3 steps down, but by then it was too late. The damage had been done.

When I show off my music to others, the sound is the first comment (I could wish otherwise) so personally I'm rather twitchy about anything that might alter that, and as a musician I'd rather spend out for a more powerful computer than have a more efficient but possibly compromised mojo.

... of course, as an engineer I would like the greatest efficiency possible - fortunately I don't talk to myself :)

Reading this back, it seems rather like a rant. I'm sorry, but 'our' sound has become critical to my compositions.

--
Will J Godfrey
http://www.musically.me.uk
Say you have a poem and I have a tune. Exchange them and we can both have a poem, a tune, and a song.

Len Ovens

6:18 p.m.

On Sun, 21 Sep 2014, Will Godfrey wrote:

...

On Sun, 21 Sep 2014 07:50:45 -0700 (PDT) Len Ovens <len(a)ovenwerks.net> wrote:

On Sun, 21 Sep 2014, Fons Adriaensen wrote:

Almost all lossy audio compression schemes are based on this. So one way would be to explore in how far you could parametrise e.g. an ogg or mp3 decoder and turn it into a synth engine.

The expected latency with ogg or mp3 is 200+ms, the Celt end of Opus might be a better choice (5ms). Even Silk is probably higher latency than a

I can follow this in outline, although not the finer details, but my greatest concern is how would you go about proving there was no noticeable difference to the listener? With all the interaction possibilities I suspect there are rather a lot of corner cases.

My first comment is that there would be slight differences. The question is really if those differences are more or less pleasing to listen to. What would be happening is not a replacement for the same sound, but rather a new sound that happened to be similar and would have to have its own merits.

...

I'm also reminded of the situation when the web was fairly new and people would make copies of copies of jpegs. The differences only got noticeable about 3 steps down, but by then it was too late. The damage had been done.

In this case it is always first generation. No matter what your first generation is, lossy encoding will give differences to the final sound. Although, a sound that started out using compression techniques might sound less different than other sounds.

...

When I show off my music to others, the sound is the first comment (I could wish otherwise) so personally I'm rather twitchy about anything that might alter that, and as a musician I'd rather spend out for a more powerful computer than have a more efficient but possibly compromised mojo.

Sound is king. Lossless encoding is best. The start of the thread was based on the present hardware and lowering CPU load... Faster hardware, more cores, etc. may not be possible for everyone... particularly someone who is using a small R-pi like board as a head-less stage box. I have seen effects boxes done this way, but not sound generators yet (though the MOD could possibly be used this way).

...

... of course, as an engineer I would like the greatest efficiency possible - fortunately I don't talk to myself :)

As a musician, I am quite willing to use 500watts for an amp delivering 50watts of sound if it just happens to be "that sound".

...

Reading this back, it seems rather like a rant. I'm sorry, but 'our' sound has become critical to my compositions.

I did not feel ranted at. It is hard to know how much time or sound matter to the person using the SW.

For example, with the note-on instance noted earlier, the sound module could, at note start, choose to do only half of its setup in the first period and finish in the second, only starting to make sound at that point (using silence in the first period, as was suggested). However, that note start delay may not be acceptable to the artist. They may be quite willing to have faster HW or even use two HW boxes for more layers rather than have that small delay.

The latency you originally used as an example was to my mind higher than I would like to use for a guitar effect, though I have with this netbook because internal sound can't go lower (jack won't even start at 64/2).

<dream helmet on>

I think the MOD is in many ways the wave of the future. I see off-loading more of the sound processing to the audio interface as the general computer interfaces become more throughput oriented and less low-latency capable. Having an audio interface that is kind of a specialty computer, but with OS access for the user just makes sense. Many AIs already have quite a lot of processing inside, but are not open. The cost is not that high for this added processing (end cost of $50?) and I would think having the ability to add processing power with cards the size of the mini/micro PCIe wireless cards should not be difficult.

If Jack is run with very low latency, then using a netjack like interface between cores could easily allow the use of 16 or more cores/threads and still have an acceptable latency. What if a second (open) video card was used for audio processing?

--
Len Ovens
www.ovenwerks.net
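The idea in this message of spreading note setup over two periods, with silence until the voice is ready, could be sketched roughly like this. Everything here is hypothetical: the `DeferredVoice` class, the notion of discrete setup steps and the per-period step budget are illustrative assumptions, not how ZynAddSubFX or any JACK client is actually structured.

```python
from collections import deque

class DeferredVoice:
    """Voice whose setup is spread over several audio periods."""

    def __init__(self, setup_steps, steps_per_period, render):
        self._pending = deque(setup_steps)   # setup work still to be done
        self._steps_per_period = steps_per_period
        self._render = render                # produces audio once set up

    def process(self, nframes):
        # Do at most a fixed amount of setup work in this period.
        for _ in range(min(self._steps_per_period, len(self._pending))):
            step = self._pending.popleft()
            step()
        if self._pending:
            return [0.0] * nframes           # not ready yet: output silence
        return self._render(nframes)         # setup finished: start sounding

log = []
voice = DeferredVoice(
    setup_steps=[lambda i=i: log.append(i) for i in range(4)],
    steps_per_period=2,
    render=lambda nframes: [1.0] * nframes,  # stand-in for real synthesis
)
first = voice.process(256)    # half the setup runs, silence comes out
second = voice.process(256)   # setup finishes, rendering starts
```

The cost is the one named in the message: the note starts one period late, which may or may not be acceptable to the player.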

Paul Davis

6:29 p.m.

New subject: [LAD] Fw: Re: Some questions about the Jack callback

On Sun, Sep 21, 2014 at 12:15 PM, Len Ovens <len(a)ovenwerks.net> wrote:

...

... this netbook because internal sound can't go lower (jack won't even start at 64/2). <dream helmet on> I think the MOD is in many ways the wave of the future. I see off-loading more of the sound processing to the audio interface as the general computer interfaces become more throughput oriented and less low-latency capable.

in other words, the precise opposite of what has happened over the last 10 years, in which we've ended up with audio interface chipsets that can't even do multiple sample rates. though to be fair, there is a pro-/pro-sumer category where "builtin FX" does seem to have some appeal.

...

Having an audio interface that is kind of a specialty computer, but with OS access for the user just makes sense. Many AIs already have quite a lot of processing inside, but are not open. The cost is not that high for this added processing (end cost of $50?) and I would think having the ability to add processing power with cards the size of the mini/micro PCIe wireless cards should not be difficult. If Jack is run with very low latency, then using a netjack like interface between cores could easily allow the use of 16 or more cores/threads and still have an acceptable latency. What if a second (open) video card was used for audio processing?

video cards have very bad latency characteristics at present. CUDA etc. are all about bandwidth, not latency. anyway, none of this matters. if the application runs on the CPU, and is responsive, then the CPU and its own infrastructure have to be able to meet the latency requirements.

Len Ovens

10:23 p.m.

New subject: [LAD] Fw: Re: Some questions about the Jack callback

On Sun, 21 Sep 2014, Paul Davis wrote:

...

video cards have very bad latency characteristics at present. CUDA etc. are all about bandwidth, not latency.

I guess that makes sense. Frame rate is a lot slower for video than audio. -- Len Ovens www.ovenwerks.net

Len Ovens

6:31 p.m.

On Sun, 21 Sep 2014, Len Ovens wrote:

...

<dream helmet on> I think the MOD is in many ways the wave of the future. I see off-loading more of the sound processing to the audio interface as the general computer interfaces become more throughput oriented and less low-latency capable. Having an audio interface that is kind of a specialty computer, but with OS access for the user just makes sense. Many AIs already have quite a lot of processing inside, but are not open. The cost is not that high for this added processing (end cost of $50?) and I would think having the ability to add processing power with cards the size of the mini/micro PCIe wireless cards should not be difficult. If Jack is run with very low latency, then using a netjack like interface between cores could easily allow the use of 16 or more cores/threads and still have an acceptable latency. What if a second (open) video card was used for audio processing?

To add to this, I am wondering, because of the higher latency of some of the newer USB AIs, if it would make sense to have a jack backend that allows jack to run at a lower latency than the AI. So the AI would run 64/2, but jack would run at 32/2 or less so that there was time to offload processing on more cores/threads. -- Len Ovens www.ovenwerks.net
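For a sense of scale, the buffer latencies behind "64/2" and "32/2" (frames per period / number of periods) can be worked out as below. The 48 kHz sample rate is an assumption carried over from earlier in the thread, and reading the difference as time available for offloading is only one interpretation of the proposal; the function name is illustrative.

```python
def buffer_latency_ms(frames_per_period, periods, sample_rate_hz=48000):
    """Nominal buffer latency: frames per period times periods over the rate."""
    return 1000.0 * frames_per_period * periods / sample_rate_hz

interface_ms = buffer_latency_ms(64, 2)   # the audio interface at 64/2
jack_ms = buffer_latency_ms(32, 2)        # jack itself at 32/2
headroom_ms = interface_ms - jack_ms      # difference between the two settings
```

At 48 kHz this gives roughly 2.7 ms for 64/2 and 1.3 ms for 32/2, so the gap being discussed is on the order of a millisecond.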

linux-audio-dev@lists.linuxaudio.org

participants (5)

Fons Adriaensen
Len Ovens
Mark D. McCurry
Paul Davis
Will Godfrey