On Mon, 2009-05-11 at 17:54 +0200, Jens M Andreasen wrote:
If it is indeed SSE that gives the performance advantage, rather than
the cache line, then the rule can be relaxed from powers of 2 to
multiples of 4 (floats).
Make that multiples of 16 floats (64 bytes), and the buffer will line up
with the cache line as well (for P4 through Core2). So buffer sizes of
48, 96 and, why not, 192 should be quite OK.
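
To make the arithmetic concrete, here is a minimal sketch in C. It assumes
4-byte floats, 16-byte SSE registers and 64-byte cache lines; the function
and constant names are made up for illustration and do not come from any
existing API. It only checks the buffer size, not where the buffer starts
in memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumptions for illustration only: 4-byte floats, 16-byte SSE
   registers, 64-byte cache lines (P4 through Core2). */
enum {
    FLOATS_PER_SSE_VECTOR = 4,   /* 16 bytes / 4 bytes per float */
    FLOATS_PER_CACHE_LINE = 16   /* 64 bytes / 4 bytes per float */
};

/* Hypothetical helper: true if a buffer of nframes floats splits into
   whole SSE vectors, leaving no scalar remainder to process. */
bool fills_whole_sse_vectors(size_t nframes)
{
    return nframes != 0 && nframes % FLOATS_PER_SSE_VECTOR == 0;
}

/* Hypothetical helper: true if a buffer of nframes floats covers a
   whole number of cache lines. Every such size also passes the SSE
   check above, because 16 is a multiple of 4. */
bool fills_whole_cache_lines(size_t nframes)
{
    return nframes != 0 && nframes % FLOATS_PER_CACHE_LINE == 0;
}
```

With these definitions 48, 96 and 192 pass both checks, while a size
such as 12 passes only the SSE check (12 floats is 48 bytes, which is
three SSE vectors but not a whole cache line).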