On Wed, Oct 16, 2013 at 6:51 PM, Philipp Überbacher <murks@tuxfamily.org> wrote:
> I was hoping for something that requires less DSP knowledge.
I think we all do... note that although I dabble in DSP, I won't claim to "know" DSP...
 
> However given that those low-level tools are available,
> hints on how to combine them or on possibly useful algorithms etc.
> would be appreciated as well.

Of the three categories you mentioned (speech, music, noise), speech is probably the easiest to find...
FFT the whole track (in windows of... 8192 samples or so, perhaps), then check for frequency content in the speech range[1]: 300 - 3400 Hz.
If the content stays steadily within that frequency range (allowing for some FFT windowing error), then that should be ok to call speech.
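
To make the FFT idea a bit more concrete, here is a rough Python/NumPy sketch. The function name speech_band_ratio, the Hann taper and the non-overlapping windows are my own assumptions for illustration, not a tested recipe; what ratio counts as "steadily in range" is something you'd have to tune on real material.

```python
import numpy as np

def speech_band_ratio(samples, sample_rate, window=8192, lo=300.0, hi=3400.0):
    """Per window, return the fraction of spectral energy between lo and hi Hz.

    Hypothetical helper: mono float samples, non-overlapping windows,
    Hann taper to reduce leakage (the "FFT windowing error").
    """
    samples = np.asarray(samples, dtype=float)
    taper = np.hanning(window)
    freqs = np.fft.rfftfreq(window, d=1.0 / sample_rate)
    in_band = (freqs >= lo) & (freqs <= hi)
    ratios = []
    for start in range(0, len(samples) - window + 1, window):
        frame = samples[start:start + window] * taper
        power = np.abs(np.fft.rfft(frame)) ** 2
        total = power.sum()
        # Silent windows carry no information; report 0 for them.
        ratios.append(power[in_band].sum() / total if total > 0 else 0.0)
    return np.array(ratios)
```

If most windows come back with a high ratio, the track is at least a candidate for speech. Band-limited music would pass this check too, so treat it as a first filter only.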

Music (depending on type) is generally rhythmical, so transients should be present, and somewhat evenly spaced. Easier to detect if the music hasn't been compressed to a brick-wall.
Noise (depending on type) is generally *not* rhythmical, so transients should be present but not evenly spaced...
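
A rough sketch of the "evenly spaced transients" check, again in Python/NumPy. Everything here is an assumption on my part: the names onset_times and spacing_irregularity, the frame size, the threshold, and the very crude energy-rise onset detector. Proper onset detection is a topic of its own.

```python
import numpy as np

def onset_times(samples, sample_rate, frame=1024, threshold=2.0):
    """Crude transient detector: frames whose energy rise exceeds
    threshold times the mean rise. Returns onset times in seconds."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples) // frame
    energy = (samples[:n * frame].reshape(n, frame) ** 2).sum(axis=1)
    rise = np.diff(energy, prepend=energy[0])
    rise[rise < 0] = 0.0              # only energy increases count
    mean_rise = rise.mean()
    if mean_rise == 0:
        return np.array([])
    hits = rise > threshold * mean_rise
    # Keep only the first frame of each run of consecutive hits.
    first = hits & ~np.concatenate(([False], hits[:-1]))
    return np.flatnonzero(first) * frame / sample_rate

def spacing_irregularity(times):
    """Std / mean of the gaps between onsets (coefficient of variation).
    Near 0 means evenly spaced; None if there are too few onsets."""
    gaps = np.diff(times)
    if len(gaps) < 2:
        return None
    return gaps.std() / gaps.mean()
```

A low irregularity value hints at something rhythmical (music), a high one at unevenly spaced transients (noise). Where to draw the line depends on the content, and tempo changes will blur it.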

The above is a suggestion only: I don't know whether it is the best way to go. Depending on the content, you'll have some success with the above approach.
Advice on "music-information-retrieval" or content analysis is probably better sought on the Music-DSP mailing list; perhaps ask there?

HTH, -Harry

[1]: Voice frequencies, http://en.wikipedia.org/wiki/Voice_frequency