[…]

col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
0x5a39 pxor %xmm0,%xmm0
[…]

Notice how the line containing the chain of divisions is compiled to a
single SSE operation.

It’s a bit difficult to know what you are doing here without the context or the data types, but this code does look like it could benefit from vectorisation, since you are doing calculations in blocks of 4. For example, you can multiply 4 floating-point values in a single SSE instruction, add 4 floating-point values in a single SSE instruction, and so on.

__m128 factor = _mm_set1_ps (1.0f / 9.0f);

__m128 result = _mm_mul_ps (packed, factor);

would divide the 4 floats in packed by 9. (We could use _mm_div_ps, but multiplication is faster than division.)
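As a minimal sketch of that multiplication applied to a whole array, assuming x86 with SSE and a length that is a multiple of 4. The helper name div9_ps and the choice of unaligned loads and stores are mine for illustration, not something taken from your code:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_set1_ps, _mm_mul_ps */

/* Hypothetical helper: divide n floats by 9 in place, four at a time.
   This sketch assumes n is a multiple of 4. */
static void div9_ps(float *data, size_t n)
{
    assert(n % 4 == 0);

    /* 1/9 replicated into all four lanes */
    const __m128 factor = _mm_set1_ps(1.0f / 9.0f);

    for (size_t i = 0; i < n; i += 4) {
        __m128 packed = _mm_loadu_ps(data + i);      /* load 4 floats (unaligned) */
        __m128 result = _mm_mul_ps(packed, factor);  /* 4 multiplications at once */
        _mm_storeu_ps(data + i, result);             /* store 4 floats back */
    }
}
```

Multiplying by the reciprocal can differ from a true division in the last bit of the result; if that matters for your data, _mm_div_ps is the safer choice.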

Depending on your data, it might be faster to stay in the floating point domain as long as possible to use SSE floating point operations, and convert to integer at the last moment.
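To illustrate converting only at the edges, here is a rough sketch that assumes your values are 32-bit integers and that SSE2 is available. The helper name div9_via_float is hypothetical, and in real code you would do more floating-point work between the two conversions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_cvtepi32_ps, _mm_cvttps_epi32 */

/* Hypothetical helper: divide n 32-bit integers by 9 via the float domain.
   This sketch assumes n is a multiple of 4 and that the values are small
   enough (below 2^24 in magnitude) to be represented exactly as floats. */
static void div9_via_float(const int32_t *in, int32_t *out, size_t n)
{
    assert(n % 4 == 0);

    const __m128 nine = _mm_set1_ps(9.0f);

    for (size_t i = 0; i < n; i += 4) {
        __m128i ints   = _mm_loadu_si128((const __m128i *)(in + i));
        __m128  packed = _mm_cvtepi32_ps(ints);      /* int -> float, 4 lanes */
        __m128  result = _mm_div_ps(packed, nine);   /* floating-point work goes here */
        __m128i back   = _mm_cvttps_epi32(result);   /* float -> int, truncating toward zero */
        _mm_storeu_si128((__m128i *)(out + i), back);
    }
}
```

I used _mm_div_ps rather than a reciprocal multiply here because the result is truncated at the end: a product that lands just below a whole number would otherwise be truncated to the wrong integer.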

If you do want or need to stay in the integer domain, note that there is no SSE instruction for integer division, but you could use a multiplication here as well:

__m128i _mm_mulhi_epi16 (__m128i a, __m128i b)

multiplies the packed signed 16-bit integers in a and b (so 8 at the same time), producing intermediate 32-bit integers, and stores the high 16 bits of the intermediate integers in the result.

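Taking the high 16 bits of the 32-bit intermediate result is effectively dividing by 65536. Since 65536/9 is about 7281.8, x/9 can be expressed (with some error) as x*7282/65536. With the factor rounded up to 7282, the truncated result matches x/9 for every non-negative 16-bit x: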

__m128i factor = _mm_set1_epi16 (7282);

__m128i result = _mm_mulhi_epi16 (packed, factor);
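Put together, a sketch of the integer version could look like this, assuming SSE2 and non-negative 16-bit inputs whose count is a multiple of 8. The helper name div9_epi16 is made up for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi16, _mm_mulhi_epi16 */

/* Hypothetical helper: divide n non-negative 16-bit integers by 9,
   eight at a time. This sketch assumes n is a multiple of 8. */
static void div9_epi16(const int16_t *in, int16_t *out, size_t n)
{
    assert(n % 8 == 0);

    /* 7282 is 65536/9 rounded up */
    const __m128i factor = _mm_set1_epi16(7282);

    for (size_t i = 0; i < n; i += 8) {
        __m128i packed = _mm_loadu_si128((const __m128i *)(in + i));
        /* high 16 bits of each 32-bit product, i.e. (x * 7282) >> 16 */
        __m128i result = _mm_mulhi_epi16(packed, factor);
        _mm_storeu_si128((__m128i *)(out + i), result);
    }
}
```

For negative inputs the high half of the signed product rounds toward minus infinity, so -1 would give -1 where C's -1/9 gives 0. If your data can be negative, you would need a correction step.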

Of course you would have to get your 8-bit integers (I assume) into and out of the packed 16-bit registers.
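One way to do that widening and narrowing, sketched under the assumption that your data is unsigned 8-bit and that SSE2 is available. The helper name div9_epu8 is hypothetical; the unpack and pack intrinsics are the standard SSE2 ones:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_unpacklo_epi8, _mm_packus_epi16, ... */

/* Hypothetical helper: divide n unsigned bytes by 9, sixteen at a time.
   This sketch assumes n is a multiple of 16. */
static void div9_epu8(const uint8_t *in, uint8_t *out, size_t n)
{
    assert(n % 16 == 0);

    const __m128i zero   = _mm_setzero_si128();
    const __m128i factor = _mm_set1_epi16(7282);   /* 65536/9 rounded up */

    for (size_t i = 0; i < n; i += 16) {
        __m128i bytes = _mm_loadu_si128((const __m128i *)(in + i));

        /* Interleaving with zero widens each byte to a 16-bit lane. */
        __m128i lo = _mm_unpacklo_epi8(bytes, zero);   /* bytes 0..7  */
        __m128i hi = _mm_unpackhi_epi8(bytes, zero);   /* bytes 8..15 */

        lo = _mm_mulhi_epi16(lo, factor);              /* (x * 7282) >> 16 */
        hi = _mm_mulhi_epi16(hi, factor);

        /* Narrow back to 8 bits with unsigned saturation, keeping the order. */
        _mm_storeu_si128((__m128i *)(out + i), _mm_packus_epi16(lo, hi));
    }
}
```

Because the inputs are at most 255 after widening, they are always non-negative in the signed 16-bit lanes, so the multiply-high result matches x/9 for every byte value.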

That said, whether you want to do this kind of vectorisation by hand is a different matter. The compiler is pretty good at doing this kind of optimisation. Make sure you pass the right flags to turn on SSE and AVX at the levels you want to support. But it certainly is possible to improve on what the compiler does. I have obtained significant speed boosts through rewriting inner loops with SSE intrinsics. And even if you choose to stay in plain C, having some knowledge of the SSE instruction set certainly helps.

Maarten