Hi everyone,
I am new to Arm Neon programming and am having a very hard time
understanding how the GCC intrinsics map to actual Armv8 Neon
instructions. Can someone explain to me how to do the following?
I am writing a custom hashing function that uses AES rounds. All the data
is byte-wide and arrives in fixed blocks of 16 MB, and I am just trying to
feed it into the Neon SIMD unit as quickly as possible. From reading the
Arm developer documentation here:
https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/LD1--multiple-structures---Load-multiple-single-element-structures-to-one--two--three--or-four-registers-?lang=en
I want to use a sequence of these instructions:
    LD1 { V0.16B, V1.16B, V2.16B, V3.16B }, [x0], #64
    LD1 { V4.16B, V5.16B, V6.16B, V7.16B }, [x0], #64
    ...
However, after reading this page:
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#load-1
I can't find any intrinsic that actually generates this instruction,
let alone the GCC conventions for using such intrinsics.
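For reference, the ACLE intrinsics list does include `vld1q_u8_x4`, which returns a `uint8x16x4_t` and is documented as mapping to the four-register form of LD1. Below is a minimal sketch, not a tested recipe, that XOR-folds a buffer 64 bytes at a time using it. Assumptions: the helper name `fold64_xor` and the XOR workload are hypothetical placeholders for the real hashing step; the compiler version check is also an assumption (older GCC such as 9.x may not provide this intrinsic on AArch64), so it falls back to four `vld1q_u8` loads there, and to a plain C loop on non-AArch64 hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
/* Assumption: clang and GCC >= 10 provide vld1q_u8_x4 on AArch64;
   older GCC (for example 9.x) may not, so we fall back to 4 x vld1q_u8. */
#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 10)
#define HAVE_VLD1Q_U8_X4 1
#endif
#endif

/* Hypothetical helper: XOR-fold len bytes (a multiple of 64) into 16 bytes.
   Byte i of the input ends up XORed into out[i % 16]. */
void fold64_xor(const uint8_t *p, size_t len, uint8_t out[16])
{
    assert(len % 64 == 0);
#if defined(__aarch64__) && defined(__ARM_NEON)
    uint8x16_t acc = vdupq_n_u8(0);
    for (size_t i = 0; i < len; i += 64) {
#ifdef HAVE_VLD1Q_U8_X4
        /* One intrinsic call for 64 bytes; intended to become a 4-register LD1. */
        uint8x16x4_t v = vld1q_u8_x4(p + i);
#else
        /* Fallback: four 16-byte loads; the compiler picks the instructions. */
        uint8x16x4_t v;
        v.val[0] = vld1q_u8(p + i);
        v.val[1] = vld1q_u8(p + i + 16);
        v.val[2] = vld1q_u8(p + i + 32);
        v.val[3] = vld1q_u8(p + i + 48);
#endif
        acc = veorq_u8(acc, veorq_u8(veorq_u8(v.val[0], v.val[1]),
                                     veorq_u8(v.val[2], v.val[3])));
    }
    vst1q_u8(out, acc);
#else
    /* Portable scalar fallback with the same result. */
    memset(out, 0, 16);
    for (size_t i = 0; i < len; i++)
        out[i % 16] ^= p[i];
#endif
}
```

Compiling with `gcc -O2 -S` and looking for an `ld1` with four registers in the output is a quick way to check what was actually emitted; whether the pointer increment is folded into the post-indexed `[x0], #64` form is left to the compiler.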
Can someone give me an example of a simple C program, compilable with GCC
(preferably 9.4, as that is what is available on Ubuntu Bionic, the
release running on the embedded platform), that uses these
instructions? I'm fine with inline assembly if that is what is
necessary, but I can't figure out the correct syntax to make it work.
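One possible inline-assembly sketch for GCC on AArch64, under these assumptions: registers `v0`-`v3` are hard-coded and listed as clobbers because the multi-register LD1 needs consecutive registers, which ordinary `"w"` operands cannot guarantee; the pointer is advanced through a `"+r"` operand so the post-indexed `#64` update is visible to the compiler; and `"memory"` tells GCC that the asm reads memory. The helper name `fold64_xor_asm` and the XOR workload are hypothetical stand-ins for the real hashing step, and a plain C fallback keeps the file compilable on other hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: XOR-fold len bytes (a multiple of 64) into 16 bytes,
   loading 64 bytes per iteration with one post-indexed LD1. */
void fold64_xor_asm(const uint8_t *p, size_t len, uint8_t out[16])
{
    assert(len % 64 == 0);
#if defined(__aarch64__)
    uint8x16_t acc = vdupq_n_u8(0);
    const uint8_t *end = p + len;
    while (p != end) {
        __asm__(
            /* Load 4 x 16 bytes into v0-v3 and advance p by 64. */
            "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[p]], #64\n\t"
            /* Combine the four vectors, then fold into the accumulator. */
            "eor v0.16b, v0.16b, v1.16b\n\t"
            "eor v2.16b, v2.16b, v3.16b\n\t"
            "eor v0.16b, v0.16b, v2.16b\n\t"
            "eor %[acc].16b, %[acc].16b, v0.16b"
            : [p] "+r"(p), [acc] "+w"(acc)
            :
            : "v0", "v1", "v2", "v3", "memory");
    }
    vst1q_u8(out, acc);
#else
    /* Portable scalar fallback with the same result. */
    memset(out, 0, 16);
    for (size_t i = 0; i < len; i++)
        out[i % 16] ^= p[i];
#endif
}
```

Because `v0`-`v3` are declared clobbered, GCC will not place `acc` in them; `%[acc]` is printed as a `v` register name, so the `.16b` suffix is appended in the template.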
Thank you for any advice.