On Wed, 2008-04-16 at 09:19 +0200, Christian Schoenebeck wrote:
> But if you're totally sceptical, ...
Yes, it bothers me if the compiler can beat "hand-optimized" ASM :-D
It turns out that I get different results on consecutive runs of the
same binary, look:
Benchmarking mixdown (WITH coeff):
pure C++ : 380 ms
ASM SSE : 150 ms
GCC vector extensions : 140 ms <-- best
Benchmarking mixdown (WITH coeff):
pure C++ : 390 ms
ASM SSE : 140 ms <-- best
GCC vector extensions : 160 ms
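For reference, the GCC vector extensions variant is shaped roughly like
this (a minimal sketch; the v4sf typedef and the scalar tail handling
are my assumptions, not necessarily the exact code in the tree):

--8<-----------------------------------------
typedef float v4sf __attribute__((vector_size(16)));

void vec_mix_buffers_with_gain(float *dst, const float *src,
                               unsigned int nframes, float gain)
{
        /* splat the gain into all four lanes */
        v4sf g = { gain, gain, gain, gain };

        /* main loop: 4 frames per iteration; assumes dst/src are
           16-byte aligned */
        unsigned int n = nframes & ~3u;
        for (unsigned int i = 0; i < n; i += 4) {
                v4sf s = *(const v4sf *)(src + i);
                v4sf d = *(v4sf *)(dst + i);
                *(v4sf *)(dst + i) = d + s * g;
        }

        /* scalar tail for the remaining 0-3 frames */
        for (unsigned int i = n; i < nframes; ++i)
                dst[i] += src[i] * gain;
}
--8<-----------------------------------------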
This is with the ASM slightly unrolled to match the -funroll-loops option:
--8<-----------------------------------------
x86_sse_mix_buffers_with_gain --
...
.MBWG_SSE:
cmp $8, %ecx #; we know it's not zero, but if it's not >=8, then
jnge .MBWG_NONALIGN #; we jump straight to the "normal" code
#; copy gain to fill %xmm1
movss 20(%ebp), %xmm1
shufps $0x00, %xmm1, %xmm1
.MBWG_SSELOOP:
movaps (%esi), %xmm0 #; source => xmm0
addl $16, %esi #; src+=4
movaps (%esi), %xmm2 #; ++source => xmm2
addl $16, %esi #; src+=4
mulps %xmm1, %xmm0 #; apply gain to source
mulps %xmm1, %xmm2 #; apply gain to source
addps (%edi), %xmm0 #; mix with destination
movaps %xmm0, (%edi) #; copy result to destination
addl $16, %edi #; dst+=4
addps (%edi), %xmm2 #; mix with destination
movaps %xmm2, (%edi) #; copy result to destination
addl $16, %edi #; dst+=4
subl $8, %ecx #; nframes-=8
cmp $8, %ecx
jge .MBWG_SSELOOP
cmp $0, %ecx
je .MBWG_END
#; if there are remaining frames, the nonalign code will do nicely
#; for the remaining 1-7 frames.
...
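
For comparison, the same unrolled loop expressed with SSE intrinsics
would look something like this (a sketch, assuming 16-byte aligned
src/dst; the function name is made up and the 1-7 frame tail is done
in scalar code here rather than jumping to the nonalign path):

--8<-----------------------------------------
#include <xmmintrin.h>

void sse_mix_buffers_with_gain(float *dst, const float *src,
                               unsigned int nframes, float gain)
{
        __m128 g = _mm_set1_ps(gain);          /* the shufps gain splat */

        while (nframes >= 8) {                 /* two vectors per iteration */
                __m128 s0 = _mm_load_ps(src);      /* movaps (%esi), %xmm0 */
                __m128 s1 = _mm_load_ps(src + 4);  /* movaps (%esi), %xmm2 */
                __m128 d0 = _mm_load_ps(dst);
                __m128 d1 = _mm_load_ps(dst + 4);
                _mm_store_ps(dst,     _mm_add_ps(d0, _mm_mul_ps(s0, g)));
                _mm_store_ps(dst + 4, _mm_add_ps(d1, _mm_mul_ps(s1, g)));
                src += 8; dst += 8; nframes -= 8;
        }

        while (nframes--)                      /* remaining 0-7 frames */
                *dst++ += *src++ * gain;
}
--8<-----------------------------------------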