[linux-audio-dev] Performance problems caused by dlopen()

Will Benton willb at cs.wisc.edu
Sat Oct 5 21:02:01 UTC 2002


On Saturday, October 5, 2002, at 04:30  PM, nick wrote:

> I keep hearing about how aligning XXX on XX boundaries (or similar)
> gives huge performance increases etc..
>
> Where's a good place to start finding out about these techniques?

Techniques for improving cache performance should be common in any 
MS-level compilers class, so you may want to search for course notes on 
the web.  Here are a few other things to check out:

*  "Compiler transformations for high-performance computing"
    http://citeseer.nj.nec.com/bacon93compiler.html
    This is a survey paper that covers compiler optimizations, 
emphasizing techniques for implementing high-performance scientific 
code; you'll probably find that it shares several things in common with 
high-performance audio code (except for the RT requirements of audio 
and the massive data sizes of scientific code, of course).  You can 
probably implement most of these by hand or coax gcc to implement some 
of them for you.

*  http://oprofile.sf.net
    OProfile is a profiler that uses the hardware performance counters.  
On most modern processors, you can get counters for I-cache and D-cache 
misses.

*  _Advanced Compiler Design and Implementation_
    by Steven Muchnick

*  _Modern Compiler Implementation in {Java|ML|C}_  (pick one)
    by Andrew Appel

I don't remember how big cache lines are on the x86, but page-aligning 
data (i.e. address % 4096 == 0) should be a good start.  You also want 
to try to have inner loops operate only on data that fits in the cache.  
There is a technique called "loop tiling" that restructures loops so 
that they work on a cache-sized block of data at a time.  As an aside, 
you can usually get loop performance gains from loop unrolling or 
software pipelining, but you'll want to make sure that the code for 
your inner loop still fits in the I-cache.

The big thing to remember is that once you start getting into this 
stuff, you're getting into optimizations that are not only specific to 
one architecture, but to a particular processor.  Optimizations that 
work on a PIII might not improve performance on an Athlon, and 
optimizations for either will result in code that is much slower on a 
Celeron (and vice versa).  Therefore, most of the serious low-level 
stuff is best left to a compiler backend.  If you're willing to do 
serious autoconf work, though (I think fftw does something like this), 
you can probably implement some good stuff by hand.
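You can also do the selection at run time instead of at configure 
time.  Here is a rough sketch of the pattern; the mix functions are 
made up for illustration, __builtin_cpu_supports() is a GCC extension 
that only exists on x86 (and only in newer GCCs), and a real SSE 
version would of course not share the generic body:

/* Pick an implementation to match the host CPU once, at startup, and
   call through a function pointer afterwards. */
#include <stddef.h>

static void mix_generic(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] += in[i];
}

static void mix_sse(float *out, const float *in, size_t n)
{
    /* Stand-in: a real version would be hand-tuned SSE code. */
    for (size_t i = 0; i < n; i++)
        out[i] += in[i];
}

static void (*mix)(float *, const float *, size_t) = mix_generic;

void mixer_init(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        mix = mix_sse;
#endif
}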

In general, you can get huge performance increases by respecting the 
memory hierarchy.  That is to say, if your code (or the code your 
compiler generates) respects the fact that
     REGISTERS are an order of magnitude faster than
     L1 CACHE which is an order of magnitude faster than
     L2 CACHE which is an order of magnitude faster than
     MAIN MEMORY which is an order of magnitude faster than
     DISK...
then you'll have code that's much, much faster than a naive translation.
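A trivial example of what respecting the hierarchy buys you (N is a 
made-up size; both functions compute the same sum):

#define N 2048

static float a[N][N];

/* Walks the array in the order C lays it out (row-major), so each
   cache line is used completely before the next one is fetched. */
float sum_row_major(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Jumps N floats between consecutive accesses, so once the array is
   bigger than the cache almost every load is a miss. */
float sum_column_major(void)
{
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

On an array that size, the first version will typically run several 
times faster than the second, for exactly the reasons above.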

However, a lot of this is probably overkill for an effects plugin or 
pattern-based softsynth.  :-)




best,
wb



