On Sun, 7 Apr 2019 22:27:34 +0200
Maarten de Boer <mdb.list(a)resorama.com> wrote:
Looks like you propose to use Intel-specific intrinsics. I already
looked through the gcc docs hoping to find something similar, but only
found the vectorization section among the C extensions. I hoped to see
something gcc-specific rather than Intel-specific.
I am not sure if I understand what you mean, but Intel’s SSE
intrinsics are well supported by gcc.
This might be a good read:
https://www.it.uu.se/edu/course/homepage/hpb/vt12/lab4.pdf
[…] Probably -O3 shuffled the code around too much to
represent it correctly in the debugger, even with -g :/
Instead of using the debugger to look at the assembly, you could use
objdump -S -l on the object file
-S, --source Intermix source code with disassembly
-l, --line-numbers Include line numbers and filenames in output
Good luck.
Maarten
Thanks for the advice. It is probably time for some feedback on how this
got solved. While it is not about audio, I hope it will be useful here
too, as it is about more general gcc SIMD usage. In short, I found a way
to get packed SIMD instructions without using intrinsics.
I happened to find the SIMD intrinsics in 'info gcc', under C Extensions /
Target Builtins, with MMX/SSE among the x86 builtins (__builtin_ia32_paddb
and such). However, I hoped for a less restrictive solution, and this is
when I finally grasped the real meaning of the gcc vector extension
__attribute__(( vector_size (2^n) )).
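To make the difference concrete, here is a minimal sketch of my own (it is
not taken from my actual code; the function names and typedefs are made up
for the example). The first function uses one of those target builtins,
which ties the code to SSE2; the second uses the generic vector extension
for the same packed 16-bit addition.

/**************/
#include <stdint.h>

/* Target-builtin style: tied to one ISA (this builtin needs SSE2). */
typedef short v8hi __attribute__(( vector_size(16) ));

v8hi add_builtin(v8hi a, v8hi b)
{
    return __builtin_ia32_paddw128(a, b);  /* packed 16-bit add, SSE2 only */
}

/* Generic vector extension: GCC picks the packed instruction itself,
   or synthesizes the operation on targets without SSE2. */
typedef uint16_t v8u16 __attribute__(( vector_size(16) ));

v8u16 add_generic(v8u16 a, v8u16 b)
{
    return a + b;
}
/**************/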
After I applied this, I got the packed instructions I wanted so much, for
both the float and the int versions of the code. What is interesting is
that the float variant, before this improvement, was 2-3 times slower than
the int-based variant (I am still puzzled why that isolated pxor xmm0, xmm0
instruction would ever appear without any other SIMD code). After rewriting
the float variant with the vector extension, it got barely close to the
unmodified int variant in performance :) .
The int variant also got some SIMD ops now (before, SSE showed up only in
the FP code; there are no scalar SSE ops for int, probably because they
would make no sense). And it is still ahead of the FP variant :) . It only
seems that I am now stuck at a data transfer speed bottleneck.
For a concrete example, my code looks like this:
/**************/
#include <stdint.h>  /* for uint16_t */

typedef uint16_t v8u16 __attribute__(( vector_size(16) ));  /* 8 x uint16_t = 128 bits */
v8u16
vect1 = {...some uint8_t values...},
vect2 = {............same.........},
vect3 = {...};
vect1 = (vect1 + vect2 + vect3) / 3;  /* element-wise packed add and divide */
/** writing vect1 elements back to dynamic memory **/
Each element of a vector is filled from a byte value loaded from dynamic
memory. A speedup is noticeable even when using only half of the vector's
capacity, and it is still a tiny bit better than just using the MMX-style
4*uint16 approach. But when trying to use the full capacity, the extra
speedup is negligible. My CPU is an Intel B950, Sandy Bridge. I am pretty
sure that without this bottleneck, or with more complex processing, the
speedup would be more noticeable.
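For completeness, here is a self-contained sketch of that widen / average /
narrow round trip. It deliberately ignores my real data layout (the ARGB32
block distribution is described below) and just averages three plain byte
streams; the function name and loop structure are made up for the example.

/**************/
#include <stdint.h>
#include <stddef.h>

typedef uint16_t v8u16 __attribute__(( vector_size(16) ));  /* 8 x uint16_t */

void average3_rows(uint8_t *dst, const uint8_t *a,
                   const uint8_t *b, const uint8_t *c, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        v8u16 va, vb, vc;
        for (int k = 0; k < 8; k++) {       /* widen uint8_t -> uint16_t lanes */
            va[k] = a[i + k];
            vb[k] = b[i + k];
            vc[k] = c[i + k];
        }
        v8u16 avg = (va + vb + vc) / 3;     /* packed adds + division by a constant */
        for (int k = 0; k < 8; k++)         /* narrow back to bytes */
            dst[i + k] = (uint8_t) avg[k];
    }
}
/**************/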
Another factor here appears to be the dynamic memory access order. The
above example is an almost exact excerpt from my code, with some omissions
to save space. Data from dynamic memory are processed in blocks of 3*4
uint8_t (3 ARGB32 pixels). Each block is distributed among all 3 vectors so
that it occupies exactly 4 matching elements in each, which allows 2 blocks
to be loaded into the 3 x 128-bit vectors. But if I try to do that in a
single step, the initialization of each vector has to jump back and forth
between the 2 blocks (it could be even more blocks with avx2 and avx512,
though I lack those). This reduces performance to even lower than before
vectorizing the code. The only way not to lose the speedup seems to be a
half-initialization, followed by per-element init like:
vect1[4] = , vect1[5] = ,
and so on.
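Here is a hedged sketch of those two initialization orders (the pointer
names and the exact byte-to-lane mapping are my guesses; only the order of
the loads matters for the point above):

/**************/
#include <stdint.h>

typedef uint16_t v8u16 __attribute__(( vector_size(16) ));

void init_orders(const uint8_t *src)
{
    const uint8_t *blk0 = src;        /* first 3*4-byte block (3 ARGB32 pixels) */
    const uint8_t *blk1 = src + 12;   /* second block */

    /* One-step init: the loads for a single vector alternate between
       blk0 and blk1 - the jumping described above. */
    v8u16 vect1 = { blk0[0], blk0[1], blk0[2], blk0[3],
                    blk1[0], blk1[1], blk1[2], blk1[3] };

    /* Half init from blk0 at construction time, then per-element init
       from blk1, as in the vect1[4] = ..., vect1[5] = ... pattern. */
    v8u16 vect2 = { blk0[0], blk0[1], blk0[2], blk0[3] };
    vect2[4] = blk1[0];
    vect2[5] = blk1[1];
    vect2[6] = blk1[2];
    vect2[7] = blk1[3];

    (void) vect1;
    (void) vect2;
}
/**************/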
What is really good about these vectors is that they are safe to use even
if the target arch does not support the necessary vector size, or has no
SIMD at all. According to 'info gcc':

  Specifying a combination that is not valid for the current architecture
  causes GCC to synthesize the instructions using a narrower mode. For
  example, if you specify a variable of type `V4SI' and your architecture
  does not allow for this specific SIMD type, GCC produces code that uses
  4 `SIs'.
So, theoretically it is possible to engage vector size 64 (avx512) and
still expect it to work on lesser architectures. The only issue is that I
cannot tell whether that decreases performance on those lesser
architectures compared to dedicated vector sizes or to vectorless code.
Sometimes I got a slowdown, but sometimes it was just fine. I will probably
need more tests, maybe even finally trying some intrinsics :) (I hoped to
avoid them, but why not just try).
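To illustrate that 'narrower mode' fallback, here is a minimal sketch of my
own (untested beyond Sandy Bridge; the type and function names are made
up): a 64-byte vector type that AVX-512 hardware could process in one go,
and that GCC has to synthesize from narrower operations everywhere else.

/**************/
#include <stdint.h>

/* 64 bytes = 32 x uint16_t lanes; on CPUs without AVX-512 GCC splits the
   work across narrower registers, so the code still compiles and runs. */
typedef uint16_t v32u16 __attribute__(( vector_size(64) ));

/* Pointers are used to avoid passing the oversized vectors by value. */
void average3_wide(v32u16 *out, const v32u16 *a,
                   const v32u16 *b, const v32u16 *c)
{
    *out = (*a + *b + *c) / 3;
}
/**************/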