Jens M Andreasen wrote:
I did some
testing on this in the past when developing a bunch of SSE
routines. The performance difference was around 2x.
What was 2x and compared to what? Unaligned SSE or exact cacheline
match?
Sorry for not being specific. In the test it was unaligned SSE vs.
256-bit (32-byte) aligned. :)
For 96-float buffers it thus shouldn't be a problem, since 96 * 4 bytes
= 384 bytes is a multiple of 32. Something like 97
would be...
- Jussi
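
The 96 vs. 97 arithmetic above can be checked with a small sketch in C.
It is only an illustration, not code from the test mentioned in this
thread. It assumes that "256-bit aligned" means 32-byte boundaries and
that buffers are laid out back to back in memory. The names ALIGN_BYTES,
keeps_alignment and padded_count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: "256-bit aligned" means addresses that are multiples of 32 bytes. */
#define ALIGN_BYTES 32u

/* Returns nonzero if back-to-back buffers of nfloats floats all start on
   an ALIGN_BYTES boundary, provided the first buffer does. */
static int keeps_alignment(size_t nfloats)
{
    return (nfloats * sizeof(float)) % ALIGN_BYTES == 0;
}

/* Rounds nfloats up to the next count for which keeps_alignment() holds. */
static size_t padded_count(size_t nfloats)
{
    size_t per_block = ALIGN_BYTES / sizeof(float); /* 8 when float is 4 bytes */
    assert(ALIGN_BYTES % sizeof(float) == 0);
    return (nfloats + per_block - 1) / per_block * per_block;
}
```

With 4-byte floats, keeps_alignment(96) is true and keeps_alignment(97)
is false. One possible workaround for an awkward size is to pad each
buffer, for example padded_count(97) gives 104 floats per buffer.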