[LAU] Open Source Audio Interface

Len Ovens len at ovenwerks.net
Tue Sep 2 21:16:26 UTC 2014


On Mon, 1 Sep 2014, Len Ovens wrote:

> On Tue, 2 Sep 2014, Kazakore wrote:
>
>>> Madi can be sub-ms RT (return trip)
>> 
>> Really? Coax, optical or both? I used MADI routers in my work but that was 
>> more about sharing multi-channel audio across sites miles apart than 
>> low-latency monitoring... (I also have to sadly admit the MADI system is 
>> one of the ones I knew the least about by the time I left that job :( )
>
> It depends on distance, but it has almost no overhead compared to most other 
> formats (besides analog and aes3). The rest depends on the DSP time used to 
> split it up and route it.

After reading some more on Ethernet, it becomes easier to see why MADI can 
have a lot less latency than any Ethernet transport.

MADI uses a standard network physical layer, but it does not use some of 
the other layers. The MADI tx/rx buffer contains one whole MADI frame at a 
time. The frame itself carries no extra data beyond the aes3 payload, so 
each channel is 32 bits long. There is no routing information or error 
correction to be calculated beyond the aes3 parity bit; MADI is a physical 
point-to-point protocol. MADI does not use OS drivers to deal with any of 
this, but rather uses its own HW to do it as the bit stream enters the 
card. This means that the audio data from the ADC can reach a DAC at the 
other end within the same frame if the channel count is less than full, or 
by the next word clock if not. So the latency is effectively one audio 
word. When used as a computer IF, the card will of course store more than 
one frame, as the audio driver requires. However, this is SW latency and 
not required by the protocol itself.
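
To put numbers on that (my own back-of-envelope, assuming the usual 64 
channel MADI frame of 32 bit subframes; C used here and below purely as a 
calculator):

  /* madi_budget.c - rough MADI frame timing at 48k. The 64x32 frame
   * layout is my assumption of the common case, not a spec quote.
   * Build: cc -o madi_budget madi_budget.c */
  #include <stdio.h>

  int main(void)
  {
      const double fs = 48000.0;       /* word clock */
      const int channels = 64;         /* full MADI frame */
      const int subframe_bits = 32;    /* aes3 sized subframe */

      double payload_mbps = fs * channels * subframe_bits / 1e6;
      printf("payload rate: %.3f Mbit/s\n", payload_mbps);  /* 98.304 */
      printf("word period : %.2f us\n", 1e6 / fs);          /* 20.83  */
      return 0;
  }

A whole frame's worth of payload fits inside one word period, which is 
where the one-word latency comes from.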

Ethernet, on the other hand, uses HW that is not under our control. There 
needs to be an OS driver to make it work. Because of the variety of HW and 
drivers, any audio protocol has to sit at layer 2 or above. This brings 
routing information with it, and restricts data to 1500 bytes (46 audio 
channels) per packet. That in itself is not such a big deal and would only 
add one word of latency if the whole thing was done in hardware. However, 
it is dealt with by the OS at both ends, which has other things it needs 
to do, so the latency of the OS affects this too. This includes latency 
going through switches (and their OS) as well as scheduling around other 
network traffic. The possibility of collisions also exists, so dealing 
with audio in chunks of words makes sense. What this means in practice is 
that, as a computer IF, there may be no difference between MADI and a 
layer 2 Ethernet transport.
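
For the 46 channel figure: I am assuming a packing of 8 samples of 4 bytes 
per channel (my choice of packing, nothing official), which lands just 
under the 1500 byte payload limit once the wire overhead is counted:

  /* eth_cost.c - on-wire cost of one audio packet. The 8 samples per
   * channel packing is my assumption to reach 46 channels. */
  #include <stdio.h>

  int main(void)
  {
      const int payload = 46 * 8 * 4; /* 46 ch x 8 smp x 4 bytes = 1472 */
      const int hdr_fcs = 14 + 4;     /* ethernet header + FCS */
      const int guard   = 8 + 12;     /* preamble/SFD + inter-frame gap */

      printf("payload: %d bytes (limit 1500)\n", payload);
      printf("on wire: %d bytes\n", payload + hdr_fcs + guard); /* 1510 */
      return 0;
  }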

Layer 3 (IP-based) transport adds one more level of SW. It expects to deal 
with other network traffic too. It has another OS-controlled level of SW 
that checks for packet order and may delay a packet if it thinks one is 
out of order: it assumes data integrity is more important than latency. 
Convenience is a factor as well, because domain names can be used to set 
the link up without any other user input. This again increases latency. 
Latency can be tuned, but that takes user action. Netjack does this 
very successfully, but to work well, other traffic needs to be minimized.
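
As a sketch of the difference in attitude (my illustration, not netjack's 
actual code): a latency-first receiver drops a packet that shows up after 
its slot instead of stalling the stream to put it back in order:

  /* Latency-first receive policy (sketch). Sequence numbers wrap,
   * so the comparison is done in signed 16 bit arithmetic. */
  #include <stdint.h>
  #include <string.h>

  #define PERIOD_BYTES (46 * 4)    /* one word of 46 channels */

  static uint16_t expected_seq;

  int accept_packet(uint16_t seq, const void *audio, void *out)
  {
      if ((int16_t)(seq - expected_seq) < 0)
          return 0;             /* its slot has already played: drop */
      /* anything skipped was padded with silence, not waited for */
      memcpy(out, audio, PERIOD_BYTES);
      expected_seq = seq + 1;
      return PERIOD_BYTES;
  }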

In order to use Ethernet hardware in any standard fashion, layer 2 is the 
lowest level the audio protocol can run at. This means the protocol needs 
to know the MAC of the other end point. While it would be possible for the 
user to enter this info, that would make the protocol hard to use and 
should be avoided. A method of discovering other audio end points would be 
better. The setup of the system does not have to take place at layer 2, 
but could be higher.
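
On Linux that discovery could be as simple as a broadcast on a raw layer 2 
socket. A minimal sketch (the payload is made up by me, 0x88b5 is one of 
the IEEE "local experimental" EtherTypes, and this needs CAP_NET_RAW):

  /* discover.c - broadcast a hello frame and let end points answer
   * with their MAC, which then becomes the unicast audio destination.
   * Build: cc -o discover discover.c   (run as root) */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>
  #include <linux/if_packet.h>
  #include <net/ethernet.h>
  #include <net/if.h>

  #define AUDIO_ETYPE 0x88b5

  int main(void)
  {
      /* SOCK_DGRAM: the kernel builds the ethernet header for us */
      int fd = socket(AF_PACKET, SOCK_DGRAM, htons(AUDIO_ETYPE));
      if (fd < 0) { perror("socket"); return 1; }

      struct sockaddr_ll dst;
      memset(&dst, 0, sizeof dst);
      dst.sll_family   = AF_PACKET;
      dst.sll_protocol = htons(AUDIO_ETYPE);
      dst.sll_ifindex  = if_nametoindex("eth0");
      dst.sll_halen    = ETH_ALEN;
      memset(dst.sll_addr, 0xff, ETH_ALEN);   /* broadcast MAC */

      const char hello[] = "AUDIO-DISCOVER"; /* HW pads to min size */
      if (sendto(fd, hello, sizeof hello, 0,
                 (struct sockaddr *)&dst, sizeof dst) < 0)
          perror("sendto");
      close(fd);
      return 0;
  }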

Capabilities of different physical Ethernet IFs would need to be 
addressed. The protocol would need to know the speed of the slowest link 
if switches are involved. Any new installation will be using 1000 Mbit or 
higher, but the minimum would be a 100 Mbit link. A 10 Mbit link would not 
work, because the minimum packet size for Ethernet is 84 bytes (with guard 
space), which limits the packet rate, and so a word-per-packet sample 
rate, to about 14k. By using a separate protocol that sends more than one 
word per packet, it would be possible to use a 10 Mbit link at a higher 
latency (4 audio frames at a time). Considering that alsa (or jack anyway) 
seems to deal in 16*2 words at a time anyway, the whole protocol could 
work this way. Not so much so that 10 Mbit is supported, but because it 
would allow direct tunneling of IP traffic without splitting up packets.
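
The "about 14k" number falls out of the wire time of a minimum slot 
(again, my arithmetic):

  /* min_rate.c - packet rate ceiling of a 10 Mbit link when every
   * packet occupies the 84 byte minimum slot (frame + guard space). */
  #include <stdio.h>

  int main(void)
  {
      const double link_bps = 10e6;
      const int slot_bits = 84 * 8;
      printf("max packets/s: %.0f\n", link_bps / slot_bits); /* ~14881 */
      return 0;
  }

One word per packet means one packet per sample period, so the sample rate 
tops out just under 15k.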

My thought is something like this: we control all network traffic. Let's 
try for 4 words of audio latency. For sync purposes, at each word boundary 
a short audio packet of 10 channels is sent. This would be close to the 
minimum enet packet size. Then there should be room for one full size enet 
packet; in fact, even at 100 Mbit the small sync packet could contain more 
than 10 channels. (I have basically said 10 Mbit would not be supported, 
but if no other network traffic was carried, then 10 Mbit could do 3 or 4 
channels with no word sync.) So:
Word 1 - audio sync plus 10 tracks - one full network traffic packet
Word 2 - audio sync plus 10 tracks - one full audio packet, 40 tracks,
 					split between words 1 and 2
Word 3 - audio sync plus 10 tracks - one full audio packet, 40 tracks,
 					split between words 2 and 3
Word 4 - audio sync plus 10 tracks - one full audio packet, 40 tracks,
 					split between words 3 and 4
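
The wire-time budget behind that schedule, as I figure it (the 84 byte 
minimum slot covers the sync packet, since 10 channels is only 40 bytes of 
payload):

  /* word_budget.c - bytes of wire time per 48k word period, and what
   * is left after one minimum-size sync slot. My arithmetic only. */
  #include <stdio.h>

  int main(void)
  {
      const double rates[] = { 100e6, 1000e6 };
      for (int i = 0; i < 2; i++) {
          double bytes = rates[i] / 8 / 48000.0;
          printf("%5.0f Mbit: %5.0f bytes/word, %5.0f left after sync\n",
                 rates[i] / 1e6, bytes, bytes - 84.0);
      }
      return 0;
  }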

This would allow ~20 Mbit of ordinary network traffic and 40 tracks with 4 
words of latency. 1000 Mbit would allow much better network performance 
and more channels. I don't know how this would affect CPU use. I haven't 
mentioned MIDI or other control, but there is space, time wise, to add it 
to the audio sync packet. As this is an open spec, I would probably use 
MIDI or OSC as the control for levels and routing. I have yet to run the 
numbers, but the ten-track sync packet is not maxed out; it may go as high 
as 15 or 16 channels while still leaving room for one full network packet 
at each word at 4x the network speed. The thing is, this could be 
controlled on the fly: the user could choose how many channels they wish 
to use. The ice1712 gave 12/10 i/o on the d44/d66 as well as the d1010, 
but in this case that would not happen; only the channels needed would 
show, and the user could choose to make physical inputs 7/8 look like 
alsa 1/2 very easily.
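
That remapping could be nothing more than a small table in the driver. 
A sketch (all names here are mine):

  /* chan_map.c - user-set channel map: only the channels in use go
   * on the wire, and wire slot i is fed from physical input phys[i]. */
  #include <stdint.h>

  struct chan_map {
      uint8_t n_used;      /* channels actually sent */
      uint8_t phys[64];    /* physical input feeding each wire slot */
  };

  /* physical inputs 7 and 8 show up as alsa channels 1 and 2 */
  static const struct chan_map two_mic_session = {
      .n_used = 2,
      .phys   = { 7, 8 },
  };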

The driver could store more than one network traffic packet at a time, and 
if two queued packets are smaller than the full 1500 byte size, send both 
in the same window.
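
Something like this check in the driver's transmit path (a sketch, with 
my names and the overheads from the earlier arithmetic):

  /* Can two queued tunnel packets share one transmit window?
   * 38 bytes = header + FCS + preamble + inter-frame gap; short
   * payloads get padded to the 46 byte minimum by the HW. */
  static int share_window(int len_a, int len_b, int window_bytes)
  {
      int cost_a = (len_a < 46 ? 46 : len_a) + 38;
      int cost_b = (len_b < 46 ? 46 : len_b) + 38;
      return cost_a + cost_b <= window_bytes;
  }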

In this whole exercise, I am trading throughput to gain lower latency and 
(more) predictable timing. Because of HW differences, and the fact that 
the actual HW is serviced outside our control, I don't think the IF could 
be used as a sync source. As is the case now, two boxes would require 
external sync to truly be in sync. Daisy chained boxes could be close 
enough without external sync to not need resampling, but not close enough 
to deal with two mics on the same source.

Power over the cat5 cable should not be part of the spec IMO, but it may 
make sense to describe it anyway, so that if someone does do it, it is 
interchangeable. :P

This is not meant to replace things like netjack, which encapsulates 
audio/MIDI/transport all in one, or remote content protocols. This is a 
local solution meant generally for one room or hall. If other uses are 
found, that is a plus.

Does this make any sense? All calculations were done for 24 bit/48k audio. 
As with ADAT, AES3 and MADI, channels could be paired if 96k is 
required... though I think it is flexible enough on a 1000 Mbit (even 100 
Mbit) line that higher rates could be sent natively.

Enough rambling.

--
Len Ovens
www.ovenwerks.net


