EmbDev.net

Forum: ARM programming with GCC/GNU tools
Topic: ARM pessimizer


by Paul D. (pderocco)


I'm using arm-none-eabi-gcc 4.7.3 or 4.8.3 from launchpad.net, compiling 
for an M4 with the following options:
-mcpu=cortex-m4
-mthumb
-mfloat-abi=hard
-mfpu=fpv4-sp-d16
-g3
-gdwarf-2
-gstrict-dwarf
-O3
-ffunction-sections
-fdata-sections
-std=gnu99
-fsigned-char
-D__VFPV4__

Here's a small code fragment, part of a FIR filter:
i = T00 + T10 + T20 + T30 + T40 + T50 + T60 + T70
        + T80 + T90 + TA0 + TB0 + TC0 + TD0 + TE0 + TF0
        - ((2 * T00) & -((s >>  0) & 1))
        - ((2 * T10) & -((s >>  2) & 1))
        - ((2 * T20) & -((s >>  4) & 1))
        - ((2 * T30) & -((s >>  6) & 1))
        - ((2 * T40) & -((s >>  8) & 1))
        - ((2 * T50) & -((s >> 10) & 1))
        - ((2 * T60) & -((s >> 12) & 1))
        - ((2 * T70) & -((s >> 14) & 1))
        - ((2 * T80) & -((s >> 16) & 1))
        - ((2 * T90) & -((s >> 18) & 1))
        - ((2 * TA0) & -((s >> 20) & 1))
        - ((2 * TB0) & -((s >> 22) & 1))
        - ((2 * TC0) & -((s >> 24) & 1))
        - ((2 * TD0) & -((s >> 26) & 1))
        - ((2 * TE0) & -((s >> 28) & 1))
        - ((2 * TF0) & -((s >> 30) & 1));

s is an unsigned int containing bits to be filtered. The T* symbols are 
#defined constants. The compiler cleverly compiles -((s >> #) & 1) into a 
signed bit-field extract instruction, which picks out the bit, right 
justifies it, and propagates it through all 32 bits. For a while, it was 
sane enough to load the initial constant (the sum of all the T* symbols) 
into a register, then for each bit, compute the mask, AND each one with 
the corresponding constant, and subtract it from the register. Then, all 
of a sudden, some other change prompted it to compute each mask and 
store it into a local variable on the stack, and then use it later. 
Since there are actually eight pieces of code like this, the result is 
huge, memory-intensive, and slow. This code previously ran at about 3x 
real time; now it's on the edge of underrunning (on a Kinetis K70).

What mechanism would prompt the compiler to do such a dumb thing? Is 
there any optimization option that relates to this? I've tried both 
compiler versions, -O1, -O2, -O3 and -Os, tried various "register" 
declarations, tried a bunch of the -fno-blahblah optimization options 
listed in the docs, but there are a ton of them. Any ideas?

--

Ciao,               Paul D. DeRocco
Paul                mailto:pderocco@ix.netcom.com

by Johann L. (gjlayde)


You could subscribe to the gcc-help @ gcc.gnu.org mailing list, see

http://gcc.gnu.org/lists.html#subscribe

and hope that some arm-gcc expert is around.

Please keep in mind that it's much more helpful when you provide code 
that can be compiled, i.e. compose a small test case that passes 
compilation (e.g. with -c) and does not contain unknown parts (like your 
private deadbeef.h header or missing definition(s) of T*).

by Paul D. (pderocco)


Johann L. wrote:
> You could subscribe to the gcc-help @ gcc.gnu.org mailing list, see
>
> http://gcc.gnu.org/lists.html#subscribe
>
> and hope that some arm-gcc expert is around.

I have asked over there, too. Apologies to those who've seen this same 
question over there--I don't know how much commonality there is between 
the two forum memberships.

> Please keep in mind that it's much more helpful when you provide code
> that can be compiled, i.e. compose a small test case that passes
> compilation (e.g. with -c) and does not contain unknown parts (like your
> private deadbeef.h header or missing definition(s) of T*).

The reason I didn't is that if I wrap that fragment in a function and 
compile it, it generates glorious, beautiful, efficient code. But when I 
include it in a much larger function, it turns into a bloody mess. I can 
fix it by factoring the function into smaller ones, but this is realtime 
DSP, and I can see that it should be easy to do the whole thing (not 
just this fragment) in the available register set. And indeed, it 
originally did just that, but in the process of adding to my code, I 
seemed to cross some complexity threshold where the result suddenly went 
from wonderful to horrible.

It looks like a register pressure issue, but I can't imagine what goal 
it is trying to achieve. It suddenly decided to start computing 
subexpressions, storing them into invented stack-based temporaries, and 
then going back and computing the final expression values based on these 
temporaries. This bumped my stack usage from a modest 16 or 20 bytes 
(for a few explicit local variables) up to somewhere between 100 and 200 
depending upon what other options I fiddled with, and it interspersed 
dozens and dozens of completely unnecessary loads and stores.

Were these "common subexpressions"? Well, some were common to multiple 
switch cases (the posted fragment was one switch case), but none that 
would ever actually get used more than once. I tried -fno-gcse, and that 
didn't help.

Another aspect of the problem is that it seems to want to schedule 
instructions as though it were compiling for some machine with a really 
deep pipeline, which the M4 is not. It frequently launches a bunch of 
loads, and then uses the results, when it could do the same work in 
fewer registers if it deferred the loading until it needed the data, or 
even one instruction before it needed the data. Since my data is in 0WS 
RAM, this isn't helpful.

So I'm just wondering if anyone has seen anything like this before, and 
knows what optimization knob to twiddle to make it go away.

Does the GCC Thumb2 backend have a reputation for being good or bad? I 
think the x86 backend is amazingly good, and I had good luck with old 
ARM7 backend years ago. This Kinetis K70 project is my first Thumb2 
experience, and so far the compiler is like Dr. Jekyll and Mr. Hyde.

by Lyon (Guest)


Hi,
You said:
>The reason I didn't is that if I wrap that fragment in a function and
>compile it, it generates glorious, beautiful, efficient code.
a) But did you try to use such a small, efficient function inside a 
bigger one?
b) Maybe you already know - here I am just asking - did you check the 
CMSIS3 library? It has some optimized DSP library functions, including 
FIR filters - I understand yours could be a special one, but…
Lyon

by Lyon (Guest)


Hi,
Check this setting again: -D__VFPV4__ seems to be meant for a Neon 
processor, so some mixed-up things could happen. CMSIS has a special 
parameter for that.
Lyon
