Hello all,
Several people have asked how the pitch estimation
in zita-at1 works.
The basic method is to look at the autocorrelation
of the signal. This is a measure of how similar a
signal is to a time-shifted version of itself. It
can be computed efficiently as the inverse FFT of
the power spectrum.
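For those who want to experiment, here is a minimal
sketch of that step in C++, using FFTW3 in single
precision. This is just an illustration of the method,
not the actual zita-at1 code, and the function name is
mine:

// Autocorrelation of n samples via the FFT.
// The input is zero padded to 2n to avoid the
// wrap-around of a circular correlation.
// In real code the plans would of course be
// created only once, not on every call.
#include <cstring>
#include <fftw3.h>

void autocorr (const float *sig, float *acor, int n)
{
    int m = 2 * n;
    float *buf = fftwf_alloc_real (m);
    fftwf_complex *spec = fftwf_alloc_complex (m / 2 + 1);
    fftwf_plan fwd = fftwf_plan_dft_r2c_1d (m, buf, spec, FFTW_ESTIMATE);
    fftwf_plan inv = fftwf_plan_dft_c2r_1d (m, spec, buf, FFTW_ESTIMATE);

    memcpy (buf, sig, n * sizeof (float));
    memset (buf + n, 0, n * sizeof (float));
    fftwf_execute (fwd);
    // Replace the spectrum by the power spectrum.
    for (int i = 0; i <= m / 2; i++)
    {
        spec [i][0] = spec [i][0] * spec [i][0]
                    + spec [i][1] * spec [i][1];
        spec [i][1] = 0;
    }
    fftwf_execute (inv);
    // FFTW transforms are unnormalised, so divide by m.
    for (int i = 0; i < n; i++) acor [i] = buf [i] / m;

    fftwf_destroy_plan (fwd);
    fftwf_destroy_plan (inv);
    fftwf_free (buf);
    fftwf_free (spec);
}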
In many cases the strongest autocorrelation peak
corresponds to the fundamental period. But this can
easily become ambiguous: there will also be peaks at
integer multiples of that period, and, when harmonics
are strong, at integer fractions of it. To avoid
errors it is necessary to also look at the signal
spectrum and level, and to combine all that
information in some way. How exactly is mostly a
matter of trial and error, which is why I need more
examples.
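To give an idea of the kind of heuristics involved,
below is one possible peak picker. Again this is only
a sketch and not what zita-at1 actually does; the
function name and the thresholds are arbitrary:

// Given the autocorrelation of a frame, return the
// estimated period in samples, or 0 for unvoiced.
// pmin and pmax limit the search range, and must
// satisfy 1 <= pmin and pmax <= (length of acor) - 2.
int find_period (const float *acor, int pmin, int pmax,
                 float vthresh)
{
    // Find the highest peak in the allowed range.
    int   kmax = 0;
    float vmax = 0;
    for (int k = pmin; k <= pmax; k++)
    {
        if (   (acor [k] > acor [k - 1])
            && (acor [k] >= acor [k + 1])
            && (acor [k] > vmax))
        {
            vmax = acor [k];
            kmax = k;
        }
    }
    // Not similar enough to any shifted copy of
    // itself: call the frame unvoiced.
    if (vmax < vthresh * acor [0]) return 0;
    // Prefer the shortest period having a peak close
    // to the maximum, to avoid selecting an integer
    // multiple of the real period.
    for (int k = pmin; k < kmax; k++)
    {
        if (   (acor [k] > acor [k - 1])
            && (acor [k] >= acor [k + 1])
            && (acor [k] > 0.9f * vmax))
            return k;
    }
    return kmax;
}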
Have a look at
<http://kokkinizita.linuxaudio.org/linuxaudio/pitchdet1.png>
This is a test of the pitch detection algorithm used
in zita-at1.
The X-axis is time in seconds, a new pitch estimate is
made every 10.667 ms (512 samples at 48 kHz).
Vertically we have the autocorrelation; the Y-axis is
the lag in samples. Red is positive, blue negative.
are the detected pitch period, zero means unvoiced.
The blue line on top is signal level in dB.
Note how this singer has a habit of letting the pitch
'droop', by up to an octave, at the end of a note. He
is probably not aware of it. This happens at 28.7s,
again at 30.8s, and in fact during the entire track.
What should an autotuner do with this? Turn the glide
into a chromatic scale? The real solution here would
be to edit the recording, adding a fast fadeout just
before the 'droop'. Even a minimal amount of reverb
will hide such an edit.
The fragment from 29.7 to 30.3s is an example of a
vowel with very strong harmonics, which show up as
the red bands below the real pitch period. In this
case the 2nd and 3rd harmonics were actually about
20 dB stronger than the fundamental. This case is
resolved correctly because the autocorrelation is
still strongest at the fundamental period.
The very last estimate in the next fragment (at 30.85s)
is an example of where this goes wrong: the algorithm
selects twice the real pitch period, assuming the
first autocorrelation peak corresponds to the 2nd
harmonic. This happens because there was significant
energy at the subharmonic, actually leakage from
another track via the headphones used by the singer.
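Such octave errors can be attacked with an explicit
check, e.g. something like the sketch below (again
hypothetical, not the zita-at1 code). Whether to trust
the peak at half the period is exactly the kind of
decision that ends up being trial and error:

// If the autocorrelation at half the selected
// period is almost as strong, the selected period
// is probably a multiple of the real one.
int octave_check (const float *acor, int period)
{
    int h = period / 2;
    if ((h > 0) && (acor [h] > 0.9f * acor [period]))
        return h;
    return period;
}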
The false 'voiced' detection at 30.39s is also the
result of a signal leaking in via the headphones.
Ciao,
--
FA