On Sun, 7 Apr 2019 17:13:31 +0200
Maarten de Boer <mdb.list(a)resorama.com> wrote:
Hi,
[…]
col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
0x5a39 pxor %xmm0,%xmm0
[…]
Notice how the line containing the chain of divisions is compiled to
a single SSE operation.
I don’t see any SSE operation here. The pxor is just to zero the xmm0
register.
It’s a bit difficult to know what you are doing here, not having
context and not knowing the datatypes, but it does indeed look like
this code could benefit from vectorisation, since you are doing
calculations in blocks of 4. E.g. you can multiply 4 floats in a
single SSE instruction, add 4 floats in a single SSE instruction,
etc.
e.g.
__m128 factor = _mm_set_ps1 (1.0f/9.0f);
__m128 result = _mm_mul_ps (packed, factor);
would divide the 4 floats in packed by 9. (We could use _mm_div_ps,
but multiplication is faster than division.)
(See https://software.intel.com/sites/landingpage/IntrinsicsGuide/ )
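As a slightly fuller sketch of the two lines above: the helper below
loads four floats at a time, multiplies them by 1/9 and stores them
back. The name scale_by_ninth and the assumption that the length is a
multiple of 4 are mine for illustration, not taken from the original
code.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_mul_ps, ... */

/* Hypothetical helper: scales n floats in place by 1/9, four at a time.
 * Assumes n is a multiple of 4. No particular alignment is required,
 * because the unaligned load/store intrinsics are used. */
static void scale_by_ninth(float *data, size_t n)
{
    assert(n % 4 == 0);
    const __m128 factor = _mm_set_ps1(1.0f / 9.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 packed = _mm_loadu_ps(data + i);               /* 4 floats */
        _mm_storeu_ps(data + i, _mm_mul_ps(packed, factor));  /* x * (1/9) */
    }
}
```

Because 1/9 is not exactly representable as a float, the results can
differ from a true division in the last bit.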
Depending on your data, it might be faster to stay in the floating
point domain as long as possible to use SSE floating point
operations, and convert to integer at the last moment.
If you do want/need to stay in the integer domain, note that there is
no SIMD instruction for integer division, but you could use a
multiplication here as well:
__m128i _mm_mulhi_epi16 (__m128i a, __m128i b)
multiplies the packed 16-bit integers in a and b (so 8 at the same
time), producing intermediate 32-bit integers, and stores the high 16
bits of the intermediate integers in the result.
Taking the high 16 bits of the 32 bit intermediate result is
effectively dividing by 65536. Since x/9 can be expressed (with some
error) as x*7282/65536:
__m128i factor = _mm_set1_epi16 (7282);
__m128i result = _mm_mulhi_epi16 (packed, factor);
Of course you would have to get your 8 bit integers (I assume)
into/out of the packed 16 bit registers.
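A minimal sketch of that round trip, assuming eight unsigned 8-bit
inputs: widen them to 16 bits by interleaving with zero, multiply with
_mm_mulhi_epi16, then narrow back to 8 bits. The helper name div9_u8x8
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: out[i] = in[i] / 9 for eight unsigned bytes.
 * For inputs in 0..255 the truncated product matches integer division. */
static void div9_u8x8(const uint8_t *in, uint8_t *out)
{
    assert(in && out);
    const __m128i zero   = _mm_setzero_si128();
    const __m128i factor = _mm_set1_epi16(7282);            /* ~ 65536 / 9 */
    __m128i bytes  = _mm_loadl_epi64((const __m128i *)in);  /* 8 x 8-bit */
    __m128i packed = _mm_unpacklo_epi8(bytes, zero);        /* 8 x 16-bit */
    __m128i result = _mm_mulhi_epi16(packed, factor);       /* (x*7282) >> 16 */
    /* Narrow back to 8 bits with unsigned saturation; store the low 8 bytes. */
    _mm_storel_epi64((__m128i *)out, _mm_packus_epi16(result, zero));
}
```

Note that _mm_mulhi_epi16 treats its operands as signed, so this only
works as written while the widened values stay below 32768.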
That said, whether you want to do this kind of vectorisation by hand
is a different matter. The compiler is pretty good at doing this
kind of optimisation. Make sure you pass the right flags to turn on
SSE and AVX at the levels you want to support. But it certainly is
possible to improve on what the compiler does. I have obtained
significant speed boosts through rewriting inner loops with SSE
intrinsics. But even if you choose to stay in C, having some
knowledge of the SSE instruction set certainly might help.
Maarten
Thanks, Maarten.
It looks like you are proposing Intel-specific intrinsics. I had
already looked through the gcc docs hoping to find something similar,
but only found the vectorization section under C extensions. I had
hoped to see something perhaps gcc-specific, but not Intel-specific.
In earlier experiments I did get pure SSE code when multiplying or
adding elements of a fixed-size array with automatic storage.
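For reference, the vector support in the gcc C extensions can express
the divide-by-9 without Intel intrinsics. A minimal sketch, assuming
four float channels; the type name v4sf and the function ninth_of are
made up for illustration, and the real col[] element type may differ.

```c
#include <assert.h>

/* GCC/Clang vector extension: four floats handled as one value. */
typedef float v4sf __attribute__((vector_size(16)));
static_assert(sizeof(v4sf) == 16, "v4sf should be 16 bytes wide");

/* Hypothetical helper: element-wise division of all four lanes by 9.
 * The compiler chooses the instructions for the target. */
static v4sf ninth_of(v4sf col)
{
    const v4sf nine = {9.0f, 9.0f, 9.0f, 9.0f};
    return col / nine;
}
```

This is not tied to Intel, but it is tied to compilers that implement
the extension.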
I really did not recognize that nasty trick of clearing xmm0 :).
I also understand now why SSE can't be used there. As I understand
it, without integer division support it is not doable with SSE, since
replacing the division with a multiplication means converting to float.
Still, just as an experiment, I replaced this and the 4 add lines with:
op[0] = (col[1] + col[2]) / 2;
to see whether it would involve PAVGW or similar. It did not.
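One possible reason, as far as I can tell: PAVGW rounds up, computing
(a + b + 1) >> 1 on unsigned 16-bit values, while the C expression
(a + b) / 2 truncates, so the compiler cannot swap one for the other
without changing results. A minimal sketch that asks for PAVGW
explicitly through the _mm_avg_epu16 intrinsic (the helper name
avg_u16x8 is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_avg_epu16 maps to PAVGW */

/* Hypothetical helper: rounded average of eight unsigned 16-bit pairs,
 * out[i] = (a[i] + b[i] + 1) >> 1, computed without overflow. */
static void avg_u16x8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu16(va, vb));
}
```

For the pair (1, 2) this gives 2, where (1 + 2) / 2 in C gives 1.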
I can't really understand the meaning of that single pxor line without
other SSE code around it. But after I changed the gcc options to -O2,
more meaningful lines appeared where expected. Probably -O3 shuffled
the code too much for the debugger to represent it correctly, even
with -g :/ .
I'm now determined to try the FP way.
More about my app: I'm studying software engineering at university.
It so happened that right at the beginning of this course I finally
decided to start on something I had been looking for for a long time
(these spectral editing helpers), and during the term a new subject
came up where we had to write any application that relies on
input/output (such as any GUI program), so I dedicated my project to
this.
I'm still not sure whether it is OK to publish it before it is
defended; otherwise it is ready for that.
As for post-processing, I'm experimenting with subpixel rendering
(another weakness of mine :) ). I take 3x3 pixel blocks (so-called
super-pixels) and pass them through the complete chain. 3x1 would
probably be good too, if cairo has no problem rendering to surfaces
with such a virtual pixel ratio, but for now it stays as it is. In my
sequence the image is split into a grey minimum and a color remainder,
with the grey part mixed down at subpixel level while the color part
is simply pixelated, and both are summed in the destination. The code
chunk I showed in my previous post is for this remainder averaging
part.
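One way to read the split described above, as a sketch only: the grey
part of a pixel is the minimum of its three channels, and the
remainder is what is left in each channel after subtracting it. The
struct and function names are hypothetical, 8-bit channels are
assumed, and the actual code may well differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: grey minimum plus per-channel remainder. */
typedef struct {
    uint8_t grey;
    uint8_t rem[3];  /* r, g, b with the grey part removed */
} split_px;

/* Hypothetical helper: splits one RGB pixel into grey + color remainder. */
static split_px split_grey(uint8_t r, uint8_t g, uint8_t b)
{
    uint8_t grey = r < g ? r : g;
    if (b < grey) {
        grey = b;
    }
    split_px s = { grey, { (uint8_t)(r - grey),
                           (uint8_t)(g - grey),
                           (uint8_t)(b - grey) } };
    /* Summing both parts gives back the original channel. */
    assert(s.grey + s.rem[0] == r);
    return s;
}
```

At least one remainder channel is always zero after this split.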
With the current implementation, and with all cairo rendering to the
3x resolution surface commented out, it is ≈30% faster than simple
downsampling with cairo itself (when the 3x surface itself is used as
the drawing source). Bear in mind that for now the source and
destination surfaces are created and destroyed on every draw()
callback run (I'm just about to fix this).