I'm using arm-none-eabi-gcc 4.7.3 or 4.8.3 from launchpad.net, compiling for an M4 with the following options:
-mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -g3 -gdwarf-2 -gstrict-dwarf -O3 -ffunction-sections -fdata-sections -std=gnu99 -fsigned-char -D__VFPV4__
Here's a small code fragment, part of a FIR filter:
i = T00 + T10 + T20 + T30 + T40 + T50 + T60 + T70
  + T80 + T90 + TA0 + TB0 + TC0 + TD0 + TE0 + TF0
  - ((2 * T00) & -((s >>  0) & 1)) - ((2 * T10) & -((s >>  2) & 1))
  - ((2 * T20) & -((s >>  4) & 1)) - ((2 * T30) & -((s >>  6) & 1))
  - ((2 * T40) & -((s >>  8) & 1)) - ((2 * T50) & -((s >> 10) & 1))
  - ((2 * T60) & -((s >> 12) & 1)) - ((2 * T70) & -((s >> 14) & 1))
  - ((2 * T80) & -((s >> 16) & 1)) - ((2 * T90) & -((s >> 18) & 1))
  - ((2 * TA0) & -((s >> 20) & 1)) - ((2 * TB0) & -((s >> 22) & 1))
  - ((2 * TC0) & -((s >> 24) & 1)) - ((2 * TD0) & -((s >> 26) & 1))
  - ((2 * TE0) & -((s >> 28) & 1)) - ((2 * TF0) & -((s >> 30) & 1));
s is an unsigned int containing bits to be filtered. The T* symbols are #defined constants. The compiler cleverly compiles -((s >> #) & 1) into a signed bit-field extract instruction (SBFX), which picks out the bit, right-justifies it, and propagates it through all 32 bits.

For a while, it was sane enough to load the initial constant (the sum of all the T* symbols) into a register, then, for each bit, compute the mask, AND it with the corresponding constant, and subtract the result from the register. Then, all of a sudden, some other change prompted it to compute each mask, store it into a local variable on the stack, and use it later. Since there are actually eight pieces of code like this, the result is huge, memory-intensive, and slow. This code previously ran at about 3x real time; now it's on the edge of underrunning (on a Kinetis K70).

What mechanism would prompt the compiler to do such a dumb thing? Is there any optimization option that relates to this? I've tried both compiler versions with -O1, -O2, -O3 and -Os, tried various "register" declarations, and tried a bunch of the -fno-blahblah optimization options listed in the docs, but there are a ton of them. Any ideas?

-- Ciao, Paul D. DeRocco Paul mailto:firstname.lastname@example.org
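The conditional-negate identity behind each term can be checked in isolation on a host compiler. A minimal sketch (fir_term is a made-up helper name, and the tap values are placeholders, not the actual filter coefficients):

```c
/* For a tap T and a sign bit b extracted from s,
 *   T - ((2 * T) & -b)
 * equals +T when b == 0 and -T when b == 1, because -b is either
 * 0x00000000 or 0xFFFFFFFF (the all-ones mask the SBFX instruction
 * produces on the Cortex-M4). */
int fir_term(int T, unsigned s, int bitpos)
{
    return T - ((2 * T) & -(int)((s >> bitpos) & 1));
}
```

Summing one such term per sign bit reproduces the big expression above without any branches, which is what makes the "compute mask, AND, subtract" register-only code sequence possible.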
You could subscribe to the gcc-help @ gcc.gnu.org mailing list (see http://gcc.gnu.org/lists.html#subscribe) and hope that some arm-gcc expert is around. Please keep in mind that it's much more helpful to provide code that can actually be compiled, i.e. compose a small test case that passes compilation (e.g. with -c) and does not contain unknown parts (like your private deadbeef.h header or missing definitions of T*).
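A cut-down test case along those lines might look like the following (the tap values and the function name fir_step are invented placeholders, and it uses four taps rather than sixteen, but it compiles stand-alone with arm-none-eabi-gcc -c or -S):

```c
/* Self-contained test case: placeholder taps, no private headers. */
#define T00 3
#define T10 5
#define T20 7
#define T30 11

int fir_step(unsigned s)
{
    return T00 + T10 + T20 + T30
         - ((2 * T00) & -(int)((s >> 0) & 1))
         - ((2 * T10) & -(int)((s >> 2) & 1))
         - ((2 * T20) & -(int)((s >> 4) & 1))
         - ((2 * T30) & -(int)((s >> 6) & 1));
}
```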
Johann L. wrote:

> You could subscribe to the gcc-help @ gcc.gnu.org mailing list, see
> http://gcc.gnu.org/lists.html#subscribe
> and hope that some arm-gcc expert is around.

I have asked over there, too. Apologies to those who've seen this same question over there -- I don't know how much overlap there is between the two forums' memberships.

> Please keep in mind that it's much more helpful when you provide code
> that can be compiled, i.e. compose a small test case that passes
> compilation (e.g. with -c) and does not contain unknown parts (like your
> private deadbeef.h header or missing definition(s) of T*).

The reason I didn't is that if I wrap that fragment in a function and compile it by itself, it generates glorious, beautiful, efficient code. But when I include it in a much larger function, it turns into a bloody mess. I can fix it by factoring the function into smaller ones, but this is realtime DSP, and I can see that it should be easy to do the whole thing (not just this fragment) in the available register set. And indeed, it originally did just that, but in the process of adding to my code I seem to have crossed some complexity threshold where the result suddenly went from wonderful to horrible.

It looks like a register pressure issue, but I can't imagine what goal the compiler is trying to achieve. It suddenly decided to start computing subexpressions, storing them into invented stack-based temporaries, and then going back and computing the final expression values from those temporaries. This bumped my stack usage from a modest 16 or 20 bytes (for a few explicit local variables) up to somewhere between 100 and 200 depending upon what other options I fiddled with, and it interspersed dozens and dozens of completely unnecessary loads and stores. Were these "common subexpressions"? Well, some were common to multiple switch cases (the posted fragment was one switch case), but none that would ever actually get used more than once. I tried -fno-gcse, and that didn't help.
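Since the fragment compiles well in isolation, one workaround for the register-pressure cliff is to do the factoring in source and forbid GCC from inlining the helpers back into the big function. A minimal sketch, with made-up names (case_lo, case_hi, fir_all) and placeholder taps standing in for two of the eight real fragments:

```c
/* __attribute__((noinline)) keeps each fragment in its own small
 * frame, so the register allocator sees one expression at a time
 * instead of the whole switch. */
__attribute__((noinline))
int case_lo(unsigned s)
{
    return 3 - ((2 * 3) & -(int)(s & 1));
}

__attribute__((noinline))
int case_hi(unsigned s)
{
    return 5 - ((2 * 5) & -(int)((s >> 2) & 1));
}

int fir_all(unsigned s)
{
    return case_lo(s) + case_hi(s);
}
```

The call overhead costs a few cycles per fragment, but it bounds the live-range explosion that produces the invented stack temporaries.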
Another aspect of the problem is that it seems to want to schedule instructions as though it were compiling for a machine with a really deep pipeline, which the M4 is not. It frequently launches a bunch of loads and only then uses the results, when it could do the same work in fewer registers if it deferred each load until it needed the data, or even one instruction before. Since my data is in zero-wait-state RAM, the early loads buy nothing.

So I'm just wondering if anyone has seen anything like this before and knows which optimization knob to twiddle to make it go away. Does the GCC Thumb2 backend have a reputation for being good or bad? I think the x86 backend is amazingly good, and I had good luck with the old ARM7 backend years ago. This Kinetis K70 project is my first Thumb2 experience, and so far the compiler is like Dr. Jekyll and Mr. Hyde.
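The "launch loads early, use them late" pattern is characteristic of GCC's instruction scheduling passes, which can be switched off individually. A hedged experiment (these flags exist in GCC 4.7/4.8; fir.c is a placeholder file name, and whether they help here is untested):

```shell
# Disable the pre-register-allocation and post-register-allocation
# instruction schedulers; on an in-order, zero-wait-state core the
# aggressive load hoisting they do mostly just raises register pressure.
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
    -mfpu=fpv4-sp-d16 -O3 \
    -fno-schedule-insns -fno-schedule-insns2 \
    -S fir.c -o fir.s

# Alternative: keep scheduling but make it register-pressure aware.
#   arm-none-eabi-gcc ... -O3 -fsched-pressure -S fir.c -o fir.s
```

Comparing the generated fir.s with and without these flags should show quickly whether the scheduler is the source of the invented stack temporaries.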
Hi,

You said:

> The reason I didn't is that if I wrap that fragment in a function and
> compile it, it generates glorious, beautiful, efficient code.

a) But did you try to use such a small, efficient function inside a bigger one?

b) Maybe you already know -- here I am just asking -- but did you check the CMSIS3 library? It has some optimized DSP library functions, including FIR filters. I understand yours could be a special one, but…

Lyon
Hi,

Check this setting again: -D__VFPV4__ seems to be meant for a Neon-capable processor, so some mixed-up things could happen. CMSIS has a special parameter for that.

Lyon