<br><br><div><span class="gmail_quote">On 4/9/07, <b class="gmail_sendername">Tim Blechmann</b> &lt;<a href="mailto:tim@klingt.org">tim@klingt.org</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
> Hand written assembler is still many orders faster than what gcc is<br>> capable of doing. In Ardour peak computation (for both metering and<br>> waveform displaying) is written in SSE (the first part in pure assembly,
<br>> the second in a C-level abstraction which is almost 1:1 assembly). Both<br>> functions are more than 20x faster in raw performance than what gcc 4.1<br>> can do.<br><br>btw, is there a reason, why ardour is using assembler code instead of
<br>compiler intrinsics?</blockquote><div><br>Two issues - one of the core concepts of jack et al is the idea of a run time defined samples/period. The compiler has no idea that a typical routine is always called with some multiple of 64 samples and can't unroll well.
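<br><br>To make the first point concrete, here is a minimal sketch in plain C (not Ardour's actual code; the names `peak_generic` and `peak_unrolled4` are made up for illustration). The generic version only learns `nframes` at run time. The second version is unrolled by hand and relies on a promise from the caller, which the compiler cannot see, that `nframes` is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Local absolute value so the sketch has no libm dependency. */
static float absf(float x) { return x < 0.0f ? -x : x; }

/* Generic version: nframes is only known at run time, so the compiler
   has to generate code that copes with any count. */
float peak_generic(const float *buf, size_t nframes, float current)
{
    for (size_t i = 0; i < nframes; i++) {
        float v = absf(buf[i]);
        if (v > current)
            current = v;
    }
    return current;
}

/* Hand-unrolled version.
   ASSUMPTION (caller contract, hypothetical): nframes is a multiple of 4.
   Four independent running maxima are kept and merged at the end. */
float peak_unrolled4(const float *buf, size_t nframes, float current)
{
    assert(nframes % 4 == 0);

    float m0 = current, m1 = current, m2 = current, m3 = current;
    for (size_t i = 0; i < nframes; i += 4) {
        float a = absf(buf[i]);
        float b = absf(buf[i + 1]);
        float c = absf(buf[i + 2]);
        float d = absf(buf[i + 3]);
        if (a > m0) m0 = a;
        if (b > m1) m1 = b;
        if (c > m2) m2 = c;
        if (d > m3) m3 = d;
    }
    if (m1 > m0) m0 = m1;
    if (m3 > m2) m2 = m3;
    return m2 > m0 ? m2 : m0;
}
```

Both functions return the same result whenever the contract holds; the unrolled one simply reads out of bounds if it does not, which is why the contract has to live with the caller.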
<br><br>Secondly - the compiler intrinsics for SSE1,2,3,4 basically suck. You can, fairly effectively, use the _mm_whatever abstractions, but as soon as you get into type casting you get into a world of hurt and the compiler generates very inefficient code.
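<br><br>For readers who have not seen the `_mm_whatever` style, here is a rough sketch of a peak computation with intrinsics. It is an illustration only, not the Ardour routine: the name `peak_sse` is made up, it assumes a 16-byte aligned buffer and a frame count that is a multiple of 4, and it contains one integer-to-float cast (`_mm_castsi128_ps`, an SSE2 intrinsic) of the kind mentioned above. It makes no claim about how well any particular compiler handles that cast. A scalar fallback is used when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 header; also pulls in the SSE1 float intrinsics */
#endif

/* Peak of |buf[i]| over nframes samples, starting from `current`.
   ASSUMPTIONS (hypothetical contract): buf is 16-byte aligned and
   nframes is a multiple of 4. */
float peak_sse(const float *buf, size_t nframes, float current)
{
    assert(((uintptr_t)buf & 15) == 0);
    assert(nframes % 4 == 0);

#if defined(__SSE2__)
    /* Clearing the sign bit gives the absolute value. The mask is built
       as integers and reinterpreted as floats: this is the type cast. */
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    __m128 vmax = _mm_set1_ps(current);

    for (size_t i = 0; i < nframes; i += 4) {
        __m128 v = _mm_load_ps(buf + i);   /* aligned load of 4 floats */
        v = _mm_and_ps(v, abs_mask);       /* |v| */
        vmax = _mm_max_ps(vmax, v);        /* per-lane running maximum */
    }

    /* Reduce the four lanes to a single value. */
    float lanes[4];
    _mm_storeu_ps(lanes, vmax);
    float m = lanes[0];
    for (int k = 1; k < 4; k++)
        if (lanes[k] > m)
            m = lanes[k];
    return m;
#else
    /* Scalar fallback for targets without SSE2. */
    for (size_t i = 0; i < nframes; i++) {
        float v = buf[i] < 0.0f ? -buf[i] : buf[i];
        if (v > current)
            current = v;
    }
    return current;
#endif
}
```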
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">beside that, if ardour is using a fixed block size, using compile-time</blockquote>
<div><br>Would be nice, but not enough hardware can run at low samples/period and there are always situations where you want to run at more.<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
loop unrolling would be another point, where one could gain speed (iirc,<br>the micro-benchmarks i did for pnpd/nova indicated an additional<br>performance boost around 40%) ...</blockquote><div><br>Consistently memory aligning things is an issue on x86.
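<br><br>As a small illustration of what "aligning things" means in practice, here is a sketch of two helpers. The names `samples_until_aligned` and `alloc_aligned_frames` are hypothetical, and the allocation uses C11 `aligned_alloc`, which is just one way of getting an aligned buffer, not necessarily what any given audio application does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of leading float samples that must be handled one at a time
   before `p` reaches an `align`-byte boundary.
   ASSUMPTIONS: align is a power of two and p is at least 4-byte aligned,
   as float pointers normally are. */
size_t samples_until_aligned(const float *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    size_t misalign = (size_t)((uintptr_t)p & (align - 1));
    if (misalign == 0)
        return 0;
    return (align - misalign) / sizeof(float);
}

/* Allocate room for nframes floats on a 16-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the buffer with free(). */
float *alloc_aligned_frames(size_t nframes)
{
    size_t bytes = nframes * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, bytes);
}
```

A buffer allocated this way starts aligned, but a routine that is handed a pointer somewhere into the middle of it still has to check, which is where the test and branch at the top of the hand-written loops comes from.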
<br><br>Since the compiler can't figure it out (and it would be nice if there was some compiler intrinsic that said "this routine is nearly always called with some multiple of 32 bytes"), the hand-unrolled routines (more every day) basically have to:
<br>normally loop until you have alignment (hopefully just a test and branch)<br>on some arches, doing loops in 64 byte quantities is a bigger win than 16, so loop with 16 byte quantities until you can do 64<br>then do 64 byte quantities for a while
<br>then back to 16<br>then back to 4<br><br>It's a pretty easy pattern once you get used to it, but it pays to oprofile first, have the best algorithm second, then... SSE like crazy. :)<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
tim<br><br>--<br><a href="mailto:tim@klingt.org">tim@klingt.org</a> ICQ: 96771783<br><a href="http://tim.klingt.org">http://tim.klingt.org</a><br><br>After one look at this planet any visitor from outer space would say
<br>"I want to see the manager."<br> William S. Burroughs<br><br>_______________________________________________<br>Linux-audio-user mailing list<br><a href="mailto:Linux-audio-user@lists.linuxaudio.org">Linux-audio-user@lists.linuxaudio.org
</a><br><a href="http://lists.linuxaudio.org/mailman/listinfo.cgi/linux-audio-user">http://lists.linuxaudio.org/mailman/listinfo.cgi/linux-audio-user</a><br><br><br></blockquote></div><br><br clear="all"><br>-- <br>Mike Taht
<br>PostCards From the Bleeding Edge<br><a href="http://the-edge.blogspot.com">http://the-edge.blogspot.com</a>
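<br><br>The loop pattern described in the reply above (single samples until aligned, 16-byte steps until 64-byte steps are possible, 64-byte steps for the bulk, then back down to 16 and 4) can be sketched roughly as follows. This is only an outline of the control flow, under stated assumptions: the function names are made up, and `peak_block` is a plain C stand-in for what would be SSE instructions in a real routine. Sizes are in floats, so 4 bytes is 1 sample, 16 bytes is 4 samples and 64 bytes is 16 samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Stand-in for one fixed-width step. In a hand-written routine this
   would be one or more SSE operations; here it is a plain loop. */
static float peak_block(const float *p, size_t count, float m)
{
    for (size_t i = 0; i < count; i++) {
        float v = absf(p[i]);
        if (v > m)
            m = v;
    }
    return m;
}

float peak_staged(const float *buf, size_t nframes, float current)
{
    /* 1. Single samples (4 bytes) until buf is 16-byte aligned.
          For an already aligned buffer this is one test and branch. */
    while (nframes > 0 && ((uintptr_t)buf & 15) != 0) {
        current = peak_block(buf, 1, current);
        buf += 1; nframes -= 1;
    }

    /* 2. 16-byte steps until buf is 64-byte aligned. */
    while (nframes >= 4 && ((uintptr_t)buf & 63) != 0) {
        current = peak_block(buf, 4, current);
        buf += 4; nframes -= 4;
    }

    /* Invariant: the 64-byte loop only ever runs on 64-byte aligned data. */
    assert(nframes < 16 || ((uintptr_t)buf & 63) == 0);

    /* 3. 64-byte steps for the bulk of the buffer. */
    while (nframes >= 16) {
        current = peak_block(buf, 16, current);
        buf += 16; nframes -= 16;
    }

    /* 4. Back to 16-byte steps. */
    while (nframes >= 4) {
        current = peak_block(buf, 4, current);
        buf += 4; nframes -= 4;
    }

    /* 5. Back to single samples (4 bytes) for whatever is left. */
    while (nframes > 0) {
        current = peak_block(buf, 1, current);
        buf += 1; nframes -= 1;
    }

    return current;
}
```

With a buffer that is already aligned and a period size that is a multiple of 16 samples, stages 1, 2, 4 and 5 each cost a single failed loop test and all the work happens in stage 3.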