I'm using arm-none-eabi-gcc 4.7.3 or 4.8.3 from launchpad.net, compiling
for an M4 with the following options:
    -mcpu=cortex-m4
    -mthumb
    -mfloat-abi=hard
    -mfpu=fpv4-sp-d16
    -g3
    -gdwarf-2
    -gstrict-dwarf
    -O3
    -ffunction-sections
    -fdata-sections
    -std=gnu99
    -fsigned-char
    -D__VFPV4__
Here's a small code fragment, part of a FIR filter:
    i=T00+T10+T20+T30+T40+T50+T60+T70
     +T80+T90+TA0+TB0+TC0+TD0+TE0+TF0
     -((2*T00)&-((s>>0)&1))
     -((2*T10)&-((s>>2)&1))
     -((2*T20)&-((s>>4)&1))
     -((2*T30)&-((s>>6)&1))
     -((2*T40)&-((s>>8)&1))
     -((2*T50)&-((s>>10)&1))
     -((2*T60)&-((s>>12)&1))
     -((2*T70)&-((s>>14)&1))
     -((2*T80)&-((s>>16)&1))
     -((2*T90)&-((s>>18)&1))
     -((2*TA0)&-((s>>20)&1))
     -((2*TB0)&-((s>>22)&1))
     -((2*TC0)&-((s>>24)&1))
     -((2*TD0)&-((s>>26)&1))
     -((2*TE0)&-((s>>28)&1))
     -((2*TF0)&-((s>>30)&1));
s is an unsigned int containing bits to be filtered. The T* symbols are
#defined constants. The compiler cleverly compiles -((s >> n) & 1) into
a signed bit-field extract instruction (SBFX), which picks out the bit, right
justifies it, and propagates it through all 32 bits. For a while, it was
sane enough to load the initial constant (the sum of all the T* symbols)
into a register, then for each bit, compute the mask, AND each one with
the corresponding constant, and subtract it from the register. Then, all
of a sudden, some other change prompted it to compute each mask and
store it into a local variable on the stack, and then use it later.
Since there are actually eight pieces of code like this, the result is
huge, memory-intensive, and slow. This code previously ran at about 3x
real time, now it's on the edge of underrunning (on a Kinetis K70).
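To spell out the idiom: -((s >> n) & 1) is 0 when bit n of s is clear
and all ones when it is set, so ANDing it with 2*T and subtracting
amounts to a branchless "subtract 2*T if bit n is set". A minimal sketch
with a made-up coefficient:

    #include <stdint.h>

    #define T00 123   /* made-up coefficient, for illustration only */

    int32_t demo(uint32_t s)
    {
        int32_t acc = T00;                   /* baseline contribution: +T00 */
        acc -= (2 * T00) & -((s >> 0) & 1);  /* becomes -T00 if bit 0 is set */
        return acc;
    }

All sixteen terms in the fragment above follow that pattern, two bits of
s apart.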
What mechanism would prompt the compiler to do such a dumb thing? Is
there any optimization option that relates to this? I've tried both
compiler versions, -O1, -O2, -O3 and -Os, tried various "register"
declarations, tried a bunch of the -fno-blahblah optimization options
listed in the docs, but there are a ton of them. Any ideas?
--
Ciao,               Paul D. DeRocco
Paul                mailto:pderocco@ix.netcom.com
You could subscribe to the gcc-help @ gcc.gnu.org mailing list, see
http://gcc.gnu.org/lists.html#subscribe
and hope that some arm-gcc expert is around.
Please keep in mind that it's much more helpful when you provide code
that can be compiled, i.e. compose a small test case that passes
compilation (e.g. with -c) and does not contain unknown parts (like your
private deadbeef.h header or missing definition(s) of T*).
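For example, something along these lines, with invented T* values and
nothing external, would already be enough to reproduce against:

    /* Cut-down test case: the T* values here are invented placeholders. */
    #define T00 1
    #define T10 2
    #define T20 4
    #define T30 8

    int filter(unsigned s)
    {
        return T00 + T10 + T20 + T30
             - ((2*T00) & -((s >> 0) & 1))
             - ((2*T10) & -((s >> 2) & 1))
             - ((2*T20) & -((s >> 4) & 1))
             - ((2*T30) & -((s >> 6) & 1));
    }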
Johann L. wrote:
> You could subscribe to the gcc-help @ gcc.gnu.org mailing list, see
> http://gcc.gnu.org/lists.html#subscribe
> and hope that some arm-gcc expert is around.
I have asked over there, too. Apologies to those who've seen this same
question over there--I don't know how much commonality there is between
the two forum memberships.
> Please keep in mind that it's much more helpful when you provide code
> that can be compiled, i.e. compose a small test case that passes
> compilation (e.g. with -c) and does not contain unknown parts (like your
> private deadbeef.h header or missing definition(s) of T*).
The reason I didn't is that if I wrap that fragment in a function and
compile it, it generates glorious, beautiful, efficient code. But when I
include it in a much larger function, it turns into a bloody mess. I can
fix it by factoring the function into smaller ones, but this is realtime
DSP, and I can see that it should be easy to do the whole thing (not
just this fragment) in the available register set. And indeed, it
originally did just that, but in the process of adding to my code, I
seemed to cross some complexity threshold where the result suddenly went
from wonderful to horrible.
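For illustration, the factoring that rescues it looks roughly like this
(the helper name is mine, only two taps are shown, and it probably needs
noinline or GCC will just inline it back at -O3):

    /* Placeholders standing in for the real #defined taps. */
    #define T00 1
    #define T10 2

    /* noinline keeps GCC from merging the helper back into the big
       function, which is what restores the compact register use. */
    static int __attribute__((noinline)) fir_case0(unsigned s)
    {
        return T00 + T10
             - ((2*T00) & -((s >> 0) & 1))
             - ((2*T10) & -((s >> 2) & 1));
    }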
It looks like a register pressure issue, but I can't imagine what goal
it is trying to achieve. It suddenly decided to start computing
subexpressions, storing them into invented stack-based temporaries, and
then going back and computing the final expression values based on these
temporaries. This bumped my stack usage from a modest 16 or 20 bytes
(for a few explicit local variables) up to somewhere between 100 and 200
depending upon what other options I fiddled with, and it interspersed
dozens and dozens of completely unnecessary loads and stores.
Were these "common subexpressions"? Well, some were common to multiple
switch cases (the posted fragment was one switch case), but none that
would ever actually get used more than once. I tried -fno-gcse, and that
didn't help.
Another aspect of the problem is that it seems to want to schedule
instructions as though it were compiling for some machine with a really
deep pipeline, which the M4 is not. It frequently launches a bunch of
loads, and then uses the results, when it could do the same work in
fewer registers if it deferred the loading until it needed the data, or
even one instruction before it needed the data. Since my data is in 0WS
RAM, this isn't helpful.
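Among the scheduling-related knobs in the docs, two look most relevant
(I haven't exhausted the combinations): -fno-schedule-insns turns off
scheduling before register allocation, the pass that stretches live
ranges, while -fsched-pressure keeps that pass but makes it account for
register pressure. Something like this (fir.c being a placeholder name):

    # disable pre-register-allocation scheduling entirely:
    arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
        -mfpu=fpv4-sp-d16 -O3 -fno-schedule-insns -c fir.c

    # or keep it, but make it register-pressure aware:
    arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
        -mfpu=fpv4-sp-d16 -O3 -fsched-pressure -c fir.c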
So I'm just wondering if anyone has seen anything like this before, and
knows what optimization knob to twiddle to make it go away.
Does the GCC Thumb2 backend have a reputation for being good or bad? I
think the x86 backend is amazingly good, and I had good luck with the
old ARM7 backend years ago. This Kinetis K70 project is my first Thumb2
experience, and so far the compiler is like Dr. Jekyll and Mr. Hyde.
Hi,
You said:
> The reason I didn't is that if I wrap that fragment in a function and
> compile it, it generates glorious, beautiful, efficient code.
a) But did you try to use such a small, efficient function inside a
bigger one?
b) Maybe you already know - here I am just asking - but did you check
the CMSIS3 library? It has some optimized DSP library functions,
including FIR filters. I understand yours could be a special one, but…
Lyon
Hi,
Check this setting again: -D__VFPV4__ seems to be for a Neon processor,
so some mixed things could happen. CMSIS has a special parameter for
that.
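For instance, treating the exact macro names as version-dependent, the
CMSIS convention looks like this (fir.c is a placeholder name):

    arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
        -mfpu=fpv4-sp-d16 -DARM_MATH_CM4 -D__FPU_PRESENT=1 -c fir.c

ARM_MATH_CM4 selects the Cortex-M4 build of the CMSIS-DSP code, and
__FPU_PRESENT tells the CMSIS headers that the FPU exists.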
Lyon