Hi,
[…]
col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
0x5a39 pxor %xmm0,%xmm0
[…]
Notice, how line, containing chain of divisions, is compiled to single
sse operation.
I don’t see any vectorised division here. The pxor only zeroes the xmm0 register.
It’s a bit difficult to know what you are doing here, not having context and not knowing
the datatypes, but it does indeed look like this code could benefit from vectorisation,
since you are doing calculations in blocks of 4. E.g. you can multiply 4 floats in
a single SSE instruction, add 4 floats in a single SSE instruction, etc.
e.g.
__m128 factor = _mm_set_ps1 (1.0f/9.0f);
__m128 result = _mm_mul_ps (packed, factor);
would divide the 4 floats in packed by 9. (We could use _mm_div_ps , but multiplication is
faster than division)
(See https://software.intel.com/sites/landingpage/IntrinsicsGuide/ )
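To make the two lines above a bit more concrete, here is a minimal sketch of how they could sit in a loop. I am assuming col is an array of float whose length is a multiple of 4; div9_ps is a name I made up for illustration, and since I don't know your real datatypes, treat this as a starting point only.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper: divide every element of col by 9, four floats at a time.
   Assumes n is a multiple of 4; any leftover elements are not touched.
   Unaligned loads/stores are used, so col need not be 16-byte aligned. */
static void div9_ps(float *col, size_t n)
{
    const __m128 factor = _mm_set_ps1(1.0f / 9.0f);   /* 1/9 in all four lanes */

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 packed = _mm_loadu_ps(col + i);        /* load 4 floats */
        __m128 result = _mm_mul_ps(packed, factor);   /* 4 multiplications at once */
        _mm_storeu_ps(col + i, result);               /* write the 4 results back */
    }
}
```

Calling div9_ps(col, 4) would then replace the chain of four divisions. Note that multiplying by 1.0f/9.0f can differ from a true division in the last bit.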
Depending on your data, it might be faster to stay in the floating point domain as long as
possible to use SSE floating point operations, and convert to integer at the last moment.
If you do want/need to stay in the integer domain, note that there is no SIMD instruction
for integer division, but you could use a multiplication here as well:
__m128i _mm_mulhi_epi16 (__m128i a, __m128i b)
multiplies the packed 16-bit integers in a and b (so 8 at the same time), producing
intermediate 32-bit integers, and stores the high 16 bits of the intermediate integers in
the result.
Taking the high 16 bits of the 32 bit intermediate result is effectively dividing by
65536. Since x/9 can be expressed (with some error) as x*7282/65536:
__m128i factor = _mm_set1_epi16 (7282);
__m128i result = _mm_mulhi_epi16(packed, factor);
Of course you would have to get your 8 bit integers (I assume) into/out of the packed 16
bit registers.
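As a rough sketch of that round trip, assuming your input is 16 unsigned 8-bit values in memory: the function names div9_epi16 and div9_u8x16 are made up for illustration. Since 9 * 7282 = 65538, the result matches the truncated x / 9 for inputs from 0 to 32767; negative inputs are not handled here.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i and the integer intrinsics used below */

/* Hypothetical helper: divide 8 non-negative 16-bit integers by 9.
   mulhi keeps the high 16 bits of each 32-bit product, i.e. (x * 7282) >> 16. */
static __m128i div9_epi16(__m128i packed)
{
    const __m128i factor = _mm_set1_epi16(7282);   /* 65536 / 9, rounded up */
    return _mm_mulhi_epi16(packed, factor);
}

/* Hypothetical helper: out[i] = in[i] / 9 for 16 unsigned bytes. */
static void div9_u8x16(const uint8_t *in, uint8_t *out)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i bytes = _mm_loadu_si128((const __m128i *)in);  /* 16 x 8-bit */

    /* Widen: interleaving with zero turns each byte into a 16-bit lane. */
    __m128i lo = _mm_unpacklo_epi8(bytes, zero);   /* bytes 0..7  */
    __m128i hi = _mm_unpackhi_epi8(bytes, zero);   /* bytes 8..15 */

    lo = div9_epi16(lo);
    hi = div9_epi16(hi);

    /* Narrow: pack the two 16-bit vectors back into 16 bytes (unsigned saturation). */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo, hi));
}
```

If you are summing 9 pixels before dividing, you would keep the sums in the 16-bit lanes and only pack back to 8 bits after div9_epi16.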
That said, whether you want to do this kind of vectorisation by hand is a different
matter. The compiler is pretty good at doing this kind of optimisation. Make sure you
pass the right flags to turn on SSE and AVX at the levels you want to support. But it
certainly is possible to improve on what the compiler does. I have obtained significant
speed boosts through rewriting inner loops with SSE intrinsics. But even if you choose to
stay in C, having some knowledge of the SSE instruction set certainly might help.
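For comparison, here is a minimal sketch of the plain C form that gives the compiler a fair chance to vectorise by itself. The function name div9_plain and the float datatype are my assumptions. With GCC or Clang I would try something like -O3 together with -msse2 or -mavx2, but whether the loop actually gets vectorised depends on your compiler and version, so check the generated assembly.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = src[i] / 9 for n floats, written as a simple loop.
   restrict promises the compiler that dst and src do not overlap, which removes
   one common obstacle to auto-vectorisation. */
static void div9_plain(float *restrict dst, const float *restrict src, size_t n)
{
    const float factor = 1.0f / 9.0f;   /* multiply instead of divide */

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}
```

If the compiler output for a loop like this already uses mulps, there may be little left to gain from writing the intrinsics by hand.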
Maarten