Looks like you propose to use Intel-specific intrinsics. I already
looked through the gcc docs hoping to find something similar, but only
found the vectorization section under C extensions. I hoped to see
something gcc-specific, but not Intel-specific.

I am not sure I understand what you mean, but Intel's SSE intrinsics are well supported by gcc.

This might be a good read: https://www.it.uu.se/edu/course/homepage/hpb/vt12/lab4.pdf
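
To give an idea of what this looks like with gcc, here is a minimal sketch, not taken from your code. The function name add_floats_sse is made up for illustration; the intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) come from xmmintrin.h, which ships with gcc.

```c
#include <xmmintrin.h>  /* SSE intrinsics header provided by gcc */

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
void add_floats_sse(const float *a, const float *b, float *out, int n)
{
    int i = 0;

    /* Process four floats per iteration; unaligned loads and stores
       are used so the pointers need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar loop for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

On x86-64 this should build with plain gcc, since SSE is enabled there by default; on 32-bit x86 you would probably need to add -msse.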

[…] Probably -O3 shuffled the code too much to represent it
correctly in the debugger, even with -g :/

Instead of using the debugger to look at the assembly, you could use objdump -S -l on the object file (still compiled with -g, so the source can be intermixed):

-S, --source     Intermix source code with disassembly
-l, --line-numbers     Include line numbers and filenames in output
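
As a rough sketch of how that could look in practice: the file name scale.c and the function scale_floats are made up for illustration, and the two commands in the comment are how I would expect to build and disassemble it.

```c
/* scale.c (hypothetical file name)
 *
 * Build with optimization and debug info, then disassemble:
 *   gcc -O3 -g -c scale.c -o scale.o
 *   objdump -S -l scale.o
 */

/* Multiplies each of the n floats in x by factor, in place.
   A simple loop like this is a candidate for gcc's auto-vectorizer
   at -O3, so the disassembly may show SSE instructions. */
void scale_floats(float *x, float factor, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}
```

With -O3 the source lines in the output can still appear out of order or repeated, but you at least see which instructions were generated for which line.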

Good luck.

Maarten