On Sun, 7 Apr 2019 23:06:45 +0500
Nikita Zlobin <cook60020tmp(a)mail.ru> wrote:
I really did not recognize that nasty trick of clearing
xmm0 :).
I also understand now why SSE can't be used there. Without integer
division support it is undoable with SSE: replacing division with
multiplication means converting to float.
I recently came across a fast integer division algorithm that speeds
up repeated divisions by the same divisor. I got it working this way,
but then found out that gcc already uses this method, so it is still
doable with SSE. On the other hand, I still can't find enough places
where working with a color as a single integer, rather than as
separate channel values, would give a meaningful benefit... One such
place is the accumulator used for averaging. While the input is
uint8_t[4], the accumulator is uint16_t[4]. I have to either work with
them element by element or use masks, bit shifts and OR for each
element... just to prepare a single value and store it (either
uint32_t[2] or just one uint64_t).
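
For reference, here is a minimal sketch of the division trick as I
understand it: precompute a scaled reciprocal of the divisor once, then
replace each division with a multiplication and a shift. The names
recip16, div16 and average4 are made up for this example, and the
16-bit dividend range is an assumption chosen to match a uint16_t
accumulator; this is not necessarily the exact sequence gcc emits.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute m = ceil(2^32 / d) for a fixed divisor d.
 * Stored in 64 bits because m == 2^32 when d == 1. */
static uint64_t recip16(uint16_t d)
{
    assert(d != 0);
    return ((UINT64_C(1) << 32) + d - 1) / d;
}

/* n / d for any 16-bit n, using the precomputed m.
 * Exact because n * (m * d - 2^32) < 2^32 for n, d < 2^16. */
static uint16_t div16(uint16_t n, uint64_t m)
{
    return (uint16_t)((n * m) >> 32);
}

/* Average a uint16_t[4] accumulator holding the sum of `count`
 * uint8_t[4] pixels: one reciprocal, four multiplications. */
static void average4(const uint16_t acc[4], uint16_t count, uint8_t out[4])
{
    uint64_t m = recip16(count);
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)div16(acc[i], m);
}
```

The multiplication here is 16 x 32 bits into a 64-bit product, so an
SSE version would need a different split of the constant; the sketch
only shows the scalar idea.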
Looks like benchmarks are necessary, along with these intrinsics, to
test whether integer SSE is really better than what gcc produces.
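
As a starting point for such a benchmark, here is a rough sketch of
three ways to add one uint8_t[4] pixel to the accumulator: element by
element, packed into one uint64_t with shifts and OR, and with SSE2
intrinsics. All function names are invented for this example, the SSE2
variant is only compiled when __SSE2__ is defined, and no timing code
is included, so it says nothing yet about which one is faster.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <string.h>      /* memcpy */
#endif

/* Variant 1: element by element. */
static void acc_add_elems(uint16_t acc[4], const uint8_t px[4])
{
    for (int i = 0; i < 4; i++)
        acc[i] = (uint16_t)(acc[i] + px[i]);
}

/* Variant 2: spread the four bytes into the 16-bit lanes of one
 * uint64_t with shifts and OR, then add all lanes at once.
 * Only valid while no lane exceeds 65535 (at most 257 pixels). */
static uint64_t widen_px(const uint8_t px[4])
{
    return  (uint64_t)px[0]
         | ((uint64_t)px[1] << 16)
         | ((uint64_t)px[2] << 32)
         | ((uint64_t)px[3] << 48);
}

static void acc_add_u64(uint64_t *acc, const uint8_t px[4])
{
    *acc += widen_px(px);
}

/* Read lane i (0..3) back out of the packed accumulator. */
static uint16_t acc_lane(uint64_t acc, int i)
{
    return (uint16_t)(acc >> (16 * i));
}

#if defined(__SSE2__)
/* Variant 3: widen u8 -> u16 by interleaving with a zeroed register,
 * then add the four 16-bit lanes with one instruction. */
static void acc_add_sse2(uint16_t acc[4], const uint8_t px[4])
{
    int packed;
    memcpy(&packed, px, sizeof packed);           /* 4 bytes, no aliasing issues */
    __m128i zero = _mm_setzero_si128();
    __m128i wide = _mm_unpacklo_epi8(_mm_cvtsi32_si128(packed), zero);
    __m128i a    = _mm_loadl_epi64((const __m128i *)acc);
    _mm_storel_epi64((__m128i *)acc, _mm_add_epi16(a, wide));
}
#endif
```

Processing one pixel per call leaves most of the 128-bit register
unused; a fairer SSE test would probably handle several pixels per
iteration.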