[LAD] How do you improve optimization for integer calculations?

7 Apr 2019

Hello.
I have just made my first attempt at writing a serious application,
considering the language (C), the toolkit (gtk3, cairo, pango, etc.) and
the fact that it is an audio app. Although many audio things use plain
"float" for audio data, I noticed some posts where a leaning towards
integer math appeared. That is not my case (my app is JACK-based, and
audio data is FP everywhere), but I got slightly distracted from the
audio side.
For now my app is a spectrum analyzer, which should later evolve into a
couple of spectral editing helper utilities (more exactly, to generate a
spectrogram and apply changes either by resynthesis or by filtering with
a diff spectrogram).
I got distracted by experiments with graphics post-processing for the
instant spectrum view rendering. Due to the nature of cairo, color data
in a cairo image surface may be at best argb32, rgb24 or rgb30 (for now
I have implemented support only for the first two). As a result, the
post-processing is all done in integer mode. I can understand this, as I
noticed that the channel order in memory depends on endianness, which
suggests that cairo relies on masks and bit shifts rather than on byte
arrays wherever possible.
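To illustrate the endianness point: cairo documents CAIRO_FORMAT_ARGB32 as a native-endian 32-bit quantity with alpha in the upper 8 bits, then red, green and blue. A minimal sketch of reading and writing one such pixel with shifts is below. The names struct argb, unpack_argb32, pack_argb32 and argb32_selfcheck are hypothetical helpers for this example only; nothing here calls cairo itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: one ARGB32 pixel split into its channels. */
struct argb { uint8_t a, r, g, b; };

/* Read the pixel as a native-endian 32-bit word, then extract channels with
 * shifts. The result is the same on little- and big-endian hosts, which is
 * not true when the four bytes are indexed directly. */
static inline struct argb unpack_argb32(const unsigned char *pixel)
{
    uint32_t w;
    memcpy(&w, pixel, sizeof w);   /* sidesteps alignment and aliasing issues */
    struct argb c = {
        (uint8_t)(w >> 24), (uint8_t)(w >> 16), (uint8_t)(w >> 8), (uint8_t)w
    };
    return c;
}

/* Inverse of unpack_argb32: build the 32-bit word and store it. */
static inline void pack_argb32(unsigned char *pixel, struct argb c)
{
    uint32_t w = ((uint32_t)c.a << 24) | ((uint32_t)c.r << 16)
               | ((uint32_t)c.g << 8)  |  (uint32_t)c.b;
    memcpy(pixel, &w, sizeof w);
}

/* Round-trip check: what goes in through pack comes back out of unpack. */
void argb32_selfcheck(void)
{
    unsigned char px[4];
    struct argb in = { 0xff, 0x10, 0x20, 0x30 };
    pack_argb32(px, in);
    struct argb out = unpack_argb32(px);
    assert(out.a == in.a && out.r == in.r && out.g == in.g && out.b == in.b);
}
```

With this approach only the word-level layout matters; which byte of px holds which channel differs between hosts, but the code never depends on it.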
I decided to look in the debugger at how gcc optimizes these ops, and
noticed the following. Below is a code chunk from kdbg with some asm
blocks expanded. There is only one line where I tried floating-point
ops, just for comparison (plus some context for better understanding).
            for (int l = 0; l < 3; l++)
                for (int c = 0; c < 3; c++)
                {
                    unsigned char
                        * p = bpix[l][c];
                    col[0] += p[0] ,
0x5b27 movzbl 0x60(%rsp),%esi
                    col[1] += p[1] ,
0x5a34 movzbl 0x65(%rsp),%eax
                    col[2] += p[2] ,
0x5a42 movzbl 0x62(%rsp),%edx
                    col[3] += p[3];
0x5a47 movzbl 0x67(%rsp),%esi
0x5a4c mov    0x10(%rsp),%rbp
                }
            col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
0x5a39 pxor   %xmm0,%xmm0
            op[0] += col[0] ,
0x5b8d add    %sil,0x0(%rbp)
            op[1] += col[1] ,
0x5b94 add    %cl,0x1(%rbp)
            op[2] += col[2] ,
0x5b97 add    %dl,0x2(%rbp)
            op[3] += col[3];
0x5b91 add    %al,0x3(%rbp)
            for (int l = 0; l < 3; l++)
                ip[l] += 3;
            op += ch_n;
        }
Notice how the line containing the chain of divisions is compiled to a
single SSE operation. What is interesting is that this is the only asm
line involving xmm or imm - I can't find anything else, like a mov,
involving these registers.
With division by integer 9 it is still one line, but no xmm is
involved.
                    col[0] += p[0] ,
0x5b17 movzbl 0x64(%rsp),%eax
                    col[1] += p[1] ,
0x5a35 movzbl 0x65(%rsp),%eax
                    col[2] += p[2] ,
0x5a4a movzbl 0x7e(%rsp),%esi
                    col[3] += p[3];
0x5a4f movzbl 0x7f(%rsp),%ecx
                }
            col[0] /= 9, col[1] /= 9, col[2] /= 9, col[3] /= 9;
0x5a3a mov    $0x38e38e39,%r8d
            op[0] += col[0] ,
0x5b78 add    %dl,0x0(%rbp)
            op[1] += col[1] ,
0x5b3a add    %dil,0x1(%rbp)
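The constant 0x38e38e39 in the integer listing is the usual sign of division by a constant being replaced with a multiply and a shift: it equals ceil(2^33 / 9). A minimal sketch of that identity follows, assuming unsigned 32-bit values (the type of col is not shown above; for signed values the compiler adds a sign correction). The function names div9_by_multiply and div9_selfcheck are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divide by 9 without a div instruction:
 * 0x38e38e39 == ceil(2^33 / 9), so (x * 0x38e38e39) >> 33 == x / 9
 * for every 32-bit unsigned x. The product needs 64 bits. */
static inline uint32_t div9_by_multiply(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0x38e38e39u) >> 33);
}

/* Check the identity over the whole range a 3x3 sum of bytes can take,
 * plus a few values at the top of the 32-bit range. */
void div9_selfcheck(void)
{
    for (uint32_t x = 0; x <= 9u * 255u; x++)
        assert(div9_by_multiply(x) == x / 9);

    assert(div9_by_multiply(UINT32_MAX) == UINT32_MAX / 9);
    assert(div9_by_multiply(UINT32_MAX - 3) == (UINT32_MAX - 3) / 9); /* multiple of 9 */
    assert(div9_by_multiply(UINT32_MAX - 4) == (UINT32_MAX - 4) / 9); /* one below it */
}
```

So the integer variant does no real division at all, which is one reason it may already be cheap without any SSE involvement.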
As for GCC options, I used the unusual combo "-march=native -O3 -g" in
order to get exhaustive optimization but still keep the code readable in
the kdbg disassembly.
While searching for info about SSE integer support, I was pointed to the
Wikipedia page about x86 instructions, where I found an almost
sufficient instruction set: logical, add, sub and mul, signed and
unsigned, words and bytes (at least in SSE2 which, according to the
description, enabled the MMX set to use SSE registers; integer support
was primary in MMX).
So I'm puzzled: why doesn't gcc use such ops in integer mode?
Could it be that what it did is better without SSE?
I'm going to try more FP ops in this place eventually, but I'm unsure
about the possible conversions, since source and destination are cairo
surfaces with integer data anyway.
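For comparison, here is a minimal sketch of what the 3x3 sum could look like with SSE2 integer intrinsics written by hand, so no float conversion is needed. It is only an illustration under assumptions: box3x3_avg and box3x3_selfcheck are hypothetical names, the rows argument stands in for the bpix pointers above, and a scalar fallback is used when SSE2 is not available. It handles one output pixel per call and uses only four of the eight 16-bit lanes; a real implementation would process several pixels at once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Average a 3x3 block of 4-byte pixels into out[0..3].
 * rows[l] points at the leftmost pixel of row l (12 readable bytes each). */
void box3x3_avg(const unsigned char *const rows[3], unsigned char out[4])
{
    uint16_t sum[8] = { 0 };   /* 9 * 255 = 2295 fits easily in 16 bits */
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;
    for (int l = 0; l < 3; l++)
        for (int c = 0; c < 3; c++) {
            int32_t w;
            memcpy(&w, rows[l] + 4 * c, sizeof w);
            __m128i px = _mm_cvtsi32_si128(w);            /* 4 bytes in the low lanes */
            __m128i wide = _mm_unpacklo_epi8(px, zero);   /* zero-extend bytes to 16 bit */
            acc = _mm_add_epi16(acc, wide);               /* all channels added at once */
        }
    _mm_storeu_si128((__m128i *)sum, acc);
#else
    /* Scalar fallback with the same result. */
    for (int l = 0; l < 3; l++)
        for (int c = 0; c < 3; c++)
            for (int k = 0; k < 4; k++)
                sum[k] += rows[l][4 * c + k];
#endif
    for (int k = 0; k < 4; k++)
        out[k] = (unsigned char)(sum[k] / 9);
}

/* Small check with known sums per channel: 36, 2295, 9 and 8. */
void box3x3_selfcheck(void)
{
    unsigned char r[3][12];
    for (int l = 0; l < 3; l++)
        for (int c = 0; c < 3; c++) {
            r[l][4 * c + 0] = (unsigned char)(l * 3 + c);        /* 0..8, sum 36 */
            r[l][4 * c + 1] = 255;                               /* sum 2295 */
            r[l][4 * c + 2] = 1;                                 /* sum 9 */
            r[l][4 * c + 3] = (l == 0 && c == 0) ? 8 : 0;        /* sum 8 */
        }
    const unsigned char *const rows[3] = { r[0], r[1], r[2] };
    unsigned char out[4];
    box3x3_avg(rows, out);
    assert(out[0] == 4 && out[1] == 255 && out[2] == 1 && out[3] == 0);
}
```

Whether this beats what gcc already emits for the scalar loop would have to be measured; the sketch only shows that the integer SSE2 ops cover this case.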
