[LAU] Analysis of monophonic audio signals on the commandline

Fons Adriaensen fons at linuxaudio.org
Sun Feb 27 13:24:51 CET 2022


On Sun, Feb 27, 2022 at 11:37:45AM +0100, Jeanette C. wrote:

> Hm, if such images are clean, I suppose a program can be written to
> translate the sonogram to values.

They are not always very clean and there are two reasons for this:
- the quality of the recording (filtering will help),
- the complexity of the sound.

> 2D representations of all kind are unfeasible really.

The problem here is that some bird sounds can only be represented
correctly in 2D parameter space.

Some contain a clear single frequency, usually sweeping and modulated.
Such modulation can be quite fast, in the tens of Hz region.
Some others contain very short broadband features, and the whole notion
of a single frequency is not valid at all. I've seen impulsive waveforms
of only a few milliseconds in some recordings. And many bird sounds are
a mix of those two extremes. A sonogram deals with both of them, which
is why it is useful. So what would be needed is some form of analysis
that produces less output but is still able to handle both cases and
everything in between.

Using classical analysis methods, there is a limit to the product of
resolution in time and frequency, similar to the uncertainty principle
in quantum physics. Human (and animal) hearing can in some cases go
beyond that limit - this is possible only by making some a-priori
assumptions about the signal.
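
To put rough numbers on that limit, here is a minimal sketch in Python.
It assumes a plain DFT with a rectangular window, where the spacing
between frequency bins is simply 1 / (window length). The function name
and the example window lengths are mine, and the exact constant depends
on the window shape and on how you define resolution.

```python
def bin_spacing_hz(window_ms):
    """Frequency spacing of DFT bins for an analysis window of window_ms.

    Assumption: plain DFT, rectangular window, so spacing = 1 / duration.
    """
    return 1000.0 / window_ms


# Hypothetical window lengths: short enough for an impulsive feature,
# a middle value, and long enough to separate closely spaced tones.
for window_ms in (2, 20, 200):
    print(f"{window_ms:4d} ms window -> {bin_spacing_hz(window_ms):6.1f} Hz per bin")
```

A window short enough to locate a feature of a few milliseconds cannot
separate tones that are only tens of Hz apart, and the other way round.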

The problem is similar to one that occurs in time-stretching of audio:
the algorithm must decide if some feature should be regarded as 
significant in the time domain or in the frequency domain. Which is
why software such as rubberband has both user options and some
not-so-simple internal decision making.

As a simple example, take a 1 kHz sinewave that is amplitude modulated
by a 10 Hz signal. The actual frequencies present then are 990, 1000,
and 1010 Hz. Now how should this be analysed ?
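
The three frequencies can be checked numerically. The sketch below uses
only the Python standard library. The modulation depth (0.5), the sample
rate (8000 Hz) and the one second duration are assumptions chosen so that
every frequency of interest falls exactly on a 1 Hz analysis bin.

```python
import math

FS = 8000          # sample rate in Hz (assumed for this sketch)
N = FS             # one second of signal, so bins are 1 Hz apart


def am_signal(fc=1000.0, fm=10.0, depth=0.5):
    """1 kHz sine, amplitude modulated by a 10 Hz cosine."""
    return [(1.0 + depth * math.cos(2 * math.pi * fm * k / FS))
            * math.sin(2 * math.pi * fc * k / FS) for k in range(N)]


def amplitude_at(x, freq):
    """Amplitude of the sinusoidal component at freq (single-bin DFT)."""
    re = sum(v * math.cos(2 * math.pi * freq * k / FS) for k, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * k / FS) for k, v in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)


x = am_signal()
spectrum = {f: round(amplitude_at(x, f), 3) for f in (980, 990, 1000, 1010, 1020)}
print(spectrum)
```

Only 990, 1000 and 1010 Hz come out non-zero: the carrier at full
amplitude and each sideband at half the modulation depth.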

Option 1: as a modulated 1 kHz signal. When time-stretched, e.g. by
a factor of 2, the amplitude as a function of time is preserved,
the modulation frequency becomes 5 Hz, and the output frequencies
are 995, 1000, and 1005 Hz.

Option 2: as three separate and unrelated frequencies. Each of them
is stretched separately, and the output is 990, 1000, and 1010 Hz.
So this will still sound as 10 Hz modulation, just longer.

Which one is correct ? The simple fact is that both are; it is just
a matter of interpretation. Deciding this is something our brains
are good at, based on experience and expectations.
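
The two options can be written down as a toy model in Python. This is
only bookkeeping on the component frequencies, not a time-stretching
algorithm, and the function names are made up for the illustration.

```python
def sidebands(carrier_hz, mod_hz):
    """Frequencies present in a sine that is amplitude modulated by a sine."""
    return [carrier_hz - mod_hz, carrier_hz, carrier_hz + mod_hz]


def stretch_as_modulated_tone(carrier_hz, mod_hz, factor):
    # Option 1: the envelope is slowed down by 'factor', the carrier
    # keeps its pitch, so the sidebands move closer to the carrier.
    return sidebands(carrier_hz, mod_hz / factor)


def stretch_as_separate_partials(carrier_hz, mod_hz, factor):
    # Option 2: each partial keeps its own frequency and just lasts
    # longer; 'factor' changes the duration only.
    return sidebands(carrier_hz, mod_hz)


opt1 = stretch_as_modulated_tone(1000.0, 10.0, 2.0)
opt2 = stretch_as_separate_partials(1000.0, 10.0, 2.0)
print(opt1)  # [995.0, 1000.0, 1005.0]
print(opt2)  # [990.0, 1000.0, 1010.0]
```

Same input, same stretch factor, two different but equally defensible
outputs. The difference is entirely in which interpretation is chosen.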

Exactly the same question arises when trying to reduce a signal 
to something that can be described by a 1D function.

Ciao,

-- 
FA




