alex stone wrote:
On Mon, Mar 9, 2009 at 2:59 PM, Olivier Guilyardi <list(a)samalyse.com> wrote:
alex stone wrote:
If you're intent on automating a speech analysis, voice noise removal
device of some sort, then you might do well to start with a 'pre and
post' framework. Things like lipsmacking, glottal and nasal noise at
the end of a phrase, etc., are fairly easy to identify, and generally occur
pre and post. So that may well be a decent percentage of any cleanup
done quickly. (Dependent, of course, on language. Cleaning up Russian
would be a different 'module' from cleaning up French or Finnish.)
That sounds encouraging. What do you mean by "pre and post" (sorry if that's an
obvious question to you)?
[...]
Pre and post meaning the start and finish of a recorded wav or region.
Example being the first few, and the last few, milliseconds or so. Most
of this would be obvious to the ear, so I can imagine a means to edit
this could be mechanised in some way. (Being careful, of course, not to
dehumanise the original recording too far.)
Alright, got that. We've run some experiments. On a 3-minute speech
recording, we counted 53 noises, distributed as follows:
inspire: 37
expiration: 2
lips: 6
nose: 3
glottal: 1
inspire+lips: 4
That means "inspiring" noises (breathing in between two phrases or groups of
words) make up about 70% of the noises (37 of 53). These can simply be silenced,
with no need for frequency filtering, because they always happen "pre and post"
as you say, and they're apparently always preceded and followed by short
silences.
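
For reference, the shares can be recomputed from the counts above with a few lines of Python. Only the numbers come from the list; the variable names are just for illustration.

```python
# Noise counts from the 3-minute test recording (53 noises in total).
counts = {
    "inspire": 37,
    "expiration": 2,
    "lips": 6,
    "nose": 3,
    "glottal": 1,
    "inspire+lips": 4,
}

total = sum(counts.values())
shares = {label: 100.0 * n / total for label, n in counts.items()}

# Print the categories from most to least frequent.
for label, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{label:>14}: {n:2d}  ({shares[label]:.1f}%)")
```

Pure inspire noises alone come to just under 70%; counting the "inspire+lips" cases as well would push the share a bit higher.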
Here are the spectrograms and waveforms of two inspire noises (the cursor, a
white vertical line, is on the noise). Each view shows one noise, surrounded by
speech:
http://www.samalyse.com/code/speechfilter/inspire1.png
http://www.samalyse.com/code/speechfilter/inspire2.png
We've also measured the duration of 14 inspire noises. Except for 1, all of them
are under 1 second. The durations range from 256 ms to 1024 ms, with an
average of 529 ms.
An automatic way of removing these inspire noises may largely satisfy the users
I'm dealing with. Saving 70% of manual editing time is anything but marginal.
So I'm going to concentrate on that at first, leaving all the lips, nose, ...
noises for later.
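
Once a noise region is known, the removal step itself could be as simple as the sketch below. This is only an illustration, assuming a mono signal held in a NumPy array; the function name `silence_region` and the 10 ms fade length are arbitrary choices, not something we've settled on.

```python
import numpy as np

def silence_region(signal, sample_rate, start_s, end_s, fade_ms=10.0):
    """Return a copy of a mono signal with [start_s, end_s) muted.

    Short linear fades at both edges of the region avoid clicks.
    """
    out = np.asarray(signal, dtype=np.float64).copy()
    start = max(0, int(round(start_s * sample_rate)))
    end = min(len(out), int(round(end_s * sample_rate)))
    if end <= start:
        return out  # nothing to mute

    # The fade can never be longer than half the region.
    fade = min(int(fade_ms * 1e-3 * sample_rate), (end - start) // 2)
    gain = np.zeros(end - start)
    if fade > 0:
        gain[:fade] = np.linspace(1.0, 0.0, fade, endpoint=False)   # fade out
        gain[-fade:] = np.linspace(0.0, 1.0, fade, endpoint=False)  # fade back in
    out[start:end] *= gain
    return out
```

For example, `silence_region(audio, 44100, 12.30, 12.85)` would mute a breath between 12.30 s and 12.85 s and leave the rest untouched.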
Thinking further about modules, you might consider the inclusion (should
you try this) of a user definable module, in which the user could set
parameters. Consider the lone singer at home, or the voice over artist
who uses the same 'voice' on a regular basis. They would tend to form
phrases, and speech, in the same way, more often than not, including
mouth noise, nasal, etc.... (Big generalisation here, but to get the
point across...)
If the user can use his or her own template each time as a start point,
then it might prove more efficient, and definable, as a mechanised process.
(Alex singing module, Olivier talking module, etc)
Correct me if I'm wrong, but if I code this as a plugin which exposes
parameters, I think that presets should be handled by the host, not the plugin.
Anyway, I'm not sure I could code this as a plugin, because, for detecting the
inspire noise, I would need to buffer something like 2 seconds of signal. I
suppose that might not be such a problem though, since there are already plenty
of non-RT-capable plugins...
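
To make the buffering concern concrete, here is a rough sketch of a block-based delay line in Python/NumPy. It is not tied to any real plugin API; the class name `LookaheadDelay` and its `process` method are made up for illustration. The output lags the input by the chosen delay, so the samples held in the buffer are "future" audio that a detector could inspect, or mute, before they are played.

```python
import numpy as np

class LookaheadDelay:
    """Fixed delay line: the output lags the input by `delay_s` seconds."""

    def __init__(self, sample_rate, delay_s=2.0):
        self.delay = int(round(sample_rate * delay_s))
        # Samples received but not yet output; starts as silence.
        self.buffer = np.zeros(self.delay)

    def process(self, block):
        """Take one input block and return an output block of the same length."""
        block = np.asarray(block, dtype=np.float64)
        joined = np.concatenate([self.buffer, block])
        out = joined[:len(block)]           # oldest samples leave first
        self.buffer = joined[len(block):]   # always `delay` samples long
        return out
```

The cost is that the host sees 2 seconds of latency, which is why this would not be usable as a real-time effect.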
Plus, before removal, visual/auditory review of the detected noises (in some
sort of audio editor) sounds quite important: there's always a risk of confusing
a noise with the end of a phrase or another element of speech. So I might need
to craft a little GUI, or manage to integrate this detection into ReZound,
Ardour, etc.
Anyway, before this happens I need to find a way to detect the noises. A
colleague has told me that the best technology in this field currently involves
using a database of noise recordings. You then try to find these noises in the
signal by doing a more-or-less tolerant comparison in the frequency domain.
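
I don't know the details of that technique, so the following is only my guess at what such a comparison might look like, in Python/NumPy: each reference noise is reduced to an averaged magnitude spectrum, and a candidate segment is compared to every reference with a cosine similarity. The function names, the FFT size and the `tolerance` threshold are all assumptions.

```python
import numpy as np

def spectral_signature(segment, n_fft=1024):
    """Average magnitude spectrum over Hann-windowed frames, scaled to unit norm."""
    segment = np.asarray(segment, dtype=np.float64)
    if len(segment) < n_fft:
        segment = np.pad(segment, (0, n_fft - len(segment)))
    hop = n_fft // 2
    window = np.hanning(n_fft)
    frames = [segment[i:i + n_fft] * window
              for i in range(0, len(segment) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    norm = np.linalg.norm(mag)
    return mag / norm if norm > 0 else mag

def best_match(segment, templates, tolerance=0.9):
    """Return (name, score) of the closest template, or (None, score) if too far.

    `templates` is a non-empty dict mapping a name to a spectral_signature().
    """
    sig = spectral_signature(segment)
    # Signatures are unit-norm, so the dot product is a cosine similarity.
    scores = {name: float(np.dot(sig, ref)) for name, ref in templates.items()}
    name = max(scores, key=scores.get)
    if scores[name] >= tolerance:
        return name, scores[name]
    return None, scores[name]
```

Lowering `tolerance` makes the match more permissive, which is presumably where the "more-or-less tolerant" part comes in.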
However, looking at the above spectrograms and waveforms, I think there could be
a more algorithmic way of detecting these noises, thus avoiding the need for
such a database, given the following facts:
1 - on the waveform: their amplitude is much lower than that of speech
2 - on the spectrogram: the frequencies in the noise seem to be spread rather
homogeneously (maybe a bit like white noise), whereas speech contains noticeable
peaks under 1000 Hz or so.
Do you think I could use these characteristics to detect the noises?
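
To illustrate what I have in mind, here is an untested sketch in Python/NumPy. It measures two things per frame: the RMS level (for point 1) and the spectral flatness, i.e. the ratio of the geometric to the arithmetic mean of the power spectrum, which is close to 0 for peaky spectra and much higher for white-noise-like ones (for point 2). All names and thresholds are guesses that would need tuning on real recordings; only the duration gate is loosely based on the 256 ms to 1024 ms range measured above.

```python
import numpy as np

def frame_features(signal, sample_rate, frame_ms=32.0):
    """Return per-frame RMS level and spectral flatness for a mono signal."""
    signal = np.asarray(signal, dtype=np.float64)
    n = int(sample_rate * frame_ms / 1000.0)
    count = len(signal) // n
    frames = signal[:count * n].reshape(count, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Small floor keeps the logarithm finite on digital silence.
    power = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1)) ** 2 + 1e-12
    flatness = np.exp(np.mean(np.log(power), axis=1)) / np.mean(power, axis=1)
    return rms, flatness, n

def breath_candidates(signal, sample_rate, max_rms_ratio=0.25, min_rms_ratio=0.02,
                      min_flatness=0.2, min_dur=0.2, max_dur=1.5):
    """Return (start_s, end_s) regions that are quiet, noise-like and breath-sized."""
    rms, flatness, n = frame_features(signal, sample_rate)
    if len(rms) == 0 or rms.max() == 0:
        return []
    ref = rms.max()  # loudest frame stands in for the speech level
    mask = ((rms < max_rms_ratio * ref) & (rms > min_rms_ratio * ref)
            & (flatness > min_flatness))

    # Group consecutive flagged frames into regions.
    regions, start = [], None
    for i, hit in enumerate(mask):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(mask)))

    frame_s = n / sample_rate
    return [(a * frame_s, b * frame_s) for a, b in regions
            if min_dur <= (b - a) * frame_s <= max_dur]
```

The regions returned are only candidates: they would still need to be reviewed before anything is silenced.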
PS: Alex, maybe you could try to improve the way you handle quoting when
posting? That's not essential, but it would make your replies more readable...
--
Olivier