EmbDev.net

Forum: ARM programming with GCC/GNU tools Cortex M0/M0+/M1/M23 BAD Optimisation in GCC


von Steven J. (strontium)


Over a year ago I found a strange optimisation issue in GCC, one that 
hasn't been addressed in the meantime. Arm just stated: "the approach 
suggested by Richard Earnshaw upstream seems like a significant amount 
of work. Therefore I'm afraid it is unlikely to be solved anytime soon." 
So it looks like it isn't a priority to address it "anytime soon".

However, I feel it's quite a bad optimisation issue for a microcontroller, 
so I am trying to make as many people aware of it as possible.

Upstream Bug Reports:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460
https://bugs.launchpad.net/gcc-arm-embedded/+bug/1502611

So what is the problem?  Well, I am an assembler programmer, and 
typically when I start working on an MCU I look at the machine code the 
compiler is producing. When I started using the Cortex M0+ I noticed 
that when I accessed registers, the compiler was producing LOTS of 
literal table entries, way more than I thought reasonable. So I checked 
the code generated for a Cortex M3, and it looked like one would 
expect. Then it became apparent that the M3 code was legal M0 code and 
would run on an M0 unmodified.  At that point I thought WTF, why is the 
M0 producing such bad code when the M3 does such a better job, using no 
new instructions?  Days of messing with compiler optimisation flags 
and different strategies to access registers revealed that there was no 
way to change it. Here is an example:
#include <stdint.h>

const uint32_t v1 = 0x80000001; // First Value
const uint32_t v2 = 0x80000002; // Second Value
const uint32_t v3 = 0x80000003; // Third Value
const uint32_t v4 = 0x80000004; // Fourth Value

/* TEST1 : Write 32 bit values to known register locations */
void test1(void)
{
    volatile uint32_t* const r1 = (uint32_t*)(0x40002800U); // First Register
    volatile uint32_t* const r2 = (uint32_t*)(0x40002804U); // Second Register
    volatile uint32_t* const r3 = (uint32_t*)(0x40002808U); // Third Register
    volatile uint32_t* const r4 = (uint32_t*)(0x4000280CU); // Fourth Register

    *r1 = v1;
    *r2 = v2;
    *r3 = v3;
    *r4 = v4;
}

So pretty straightforward, typical MCU stuff: have an IO register at a 
known location and write to it.  What does that generate?

Well on M3 it looks like this:
[code]
00000000 <test1>:
   0:  4b04        ldr  r3, [pc, #16]  ; (14 <v1+0x8>)
   2:  4a05        ldr  r2, [pc, #20]  ; (18 <v1+0xc>)
   4:  601a        str  r2, [r3, #0]
   6:  3201        adds  r2, #1
   8:  605a        str  r2, [r3, #4]
   a:  3201        adds  r2, #1
   c:  609a        str  r2, [r3, #8]
   e:  3201        adds  r2, #1
  10:  60da        str  r2, [r3, #12]
  12:  4770        bx  lr
  14:  40002800   .word  0x40002800
  18:  80000001   .word  0x80000001
[/code]

Which is pretty optimal; one would be hard-pressed to do better 
manually. It has two literal table entries and creates all the other 
addresses using offsets. It even collapses the constants, using 
arithmetic to calculate the new constants rather than having a new 
literal entry per constant.  Very nice.

But what does it do for M0/M0+/M1 and M23:
[code]
00000000 <test1>:
   0:  4b06        ldr  r3, [pc, #24]  ; (1c <v1+0x10>)
   2:  4a07        ldr  r2, [pc, #28]  ; (20 <v1+0x14>)
   4:  601a        str  r2, [r3, #0]
   6:  4b07        ldr  r3, [pc, #28]  ; (24 <v1+0x18>)
   8:  4a07        ldr  r2, [pc, #28]  ; (28 <v1+0x1c>)
   a:  601a        str  r2, [r3, #0]
   c:  4b07        ldr  r3, [pc, #28]  ; (2c <v1+0x20>)
   e:  4a08        ldr  r2, [pc, #32]  ; (30 <v1+0x24>)
  10:  601a        str  r2, [r3, #0]
  12:  4b08        ldr  r3, [pc, #32]  ; (34 <v1+0x28>)
  14:  4a08        ldr  r2, [pc, #32]  ; (38 <v1+0x2c>)
  16:  601a        str  r2, [r3, #0]
  18:  4770        bx  lr
  1a:  46c0        nop      ; (mov r8, r8)
  1c:  40002800   .word  0x40002800
  20:  80000001   .word  0x80000001
  24:  40002804   .word  0x40002804
  28:  80000002   .word  0x80000002
  2c:  40002808   .word  0x40002808
  30:  80000003   .word  0x80000003
  34:  4000280c   .word  0x4000280c
  38:  80000004   .word  0x80000004
[/code]

Yes, that's what it generates: EVERY single constant has its own literal 
table entry.  That means that every single register access requires 2 
reads from flash BEFORE the register can be written. So it goes from 2 
loads and 4 stores to 8 loads and 4 stores.  Clearly GCC can do better, 
because for the M3 it produces an optimal result, but for the M0, for 
some reason, we get what is probably the worst possible result one could 
expect.

Now I know people will say: oh, use structures, use unions, you are not 
giving GCC enough information to optimize.  It doesn't matter. No matter 
how you encode the exact same operations, on cortex-m0 GCC will produce 
very sub-optimal code, while on the M3 it's almost perfectly optimal 
without using any M3-specific instructions to do it. One such alternative 
encoding is sketched below.
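
As a sketch of what I mean (the struct and field names here are 
hypothetical, not the exact code from my test suite), a struct-based 
encoding of the same four writes still gets one literal per constant on 
cortex-m0:

#include <stdint.h>

/* Hypothetical register block overlaying 0x40002800. */
typedef struct {
    volatile uint32_t REG1;
    volatile uint32_t REG2;
    volatile uint32_t REG3;
    volatile uint32_t REG4;
} TEST_T;

#define TEST_BLOCK ((TEST_T *)0x40002800U)

void test1_struct(void)
{
    TEST_BLOCK->REG1 = 0x80000001U;  /* same stores as test1, just */
    TEST_BLOCK->REG2 = 0x80000002U;  /* encoded through a struct   */
    TEST_BLOCK->REG3 = 0x80000003U;
    TEST_BLOCK->REG4 = 0x80000004U;
}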

In the upstream bug reports is the full test suite I produced; you can 
see in it all the different register access strategies and the horrible 
code each one produces.

The reason I think this is such a bad bug is that we are talking about 
the Cortex M0 family. They are low-end micros, typically with small 
amounts of flash, and not very fast.  And they are microcontrollers, 
which means they aren't being used for much general-purpose computing; 
they are IO processors, doing bit-bashing, driving I2C peripherals, etc, 
etc. These kinds of accesses are common and form a large part of the 
work of any microcontroller.  So this bug directly hits the weakest of 
the Cortex M line and makes them slower, makes the programs take up more 
flash, and makes them use more energy (because they require more cycles 
to do the same thing), reducing battery life.  If anyone uses these 
micros, they could do themselves a favour by going to the bugs linked 
above and stating that this affects you and you would appreciate a 
solution.

To put the differences in perspective: across my 6 test cases, functions 
are an average of 66% larger than they need to be and use an average of 
28% more CPU cycles than they need.  Worst case, one function was 114% 
larger than it needed to be and used 40% more cycles.  Oh, and that's at 
zero wait states; the more wait states on your flash, the worse this 
problem becomes, because it creates many excessive loads from flash, all 
of which slow down as wait states increase.  That's a LOT of wasted 
flash and CPU cycles for just accessing IO registers.

I have tested all the way from GCC 4.9 to GCC 7.1 and all are equally 
affected.

von Martin (Guest)


Please take this slightly modified code and compile it again.
If the M3 result is still much better than the M0 result, then let's see why.

Your code is not very representative.

const uint32_t v1 = 0x80001001; // First Value
const uint32_t v2 = 0x80030002; // Second Value
const uint32_t v3 = 0x80900003; // Third Value
const uint32_t v4 = 0x800000f4; // Fourth Value

/* TEST1 : Write 32 bit values to known register locations */
void test1(void)
{
    volatile uint32_t* const r1 = (uint32_t*)(0x41002800U); // First Register
    volatile uint32_t* const r2 = (uint32_t*)(0x40302804U); // Second Register
    volatile uint32_t* const r3 = (uint32_t*)(0x40052808U); // Third Register
    volatile uint32_t* const r4 = (uint32_t*)(0x4700280CU); // Fourth Register

    *r1 = v1;
    *r2 = v2;
    *r3 = v3;
    *r4 = v4;
}

von Uwe Bonnes (Guest)


When the target is a function argument, things work as expected.

#include <stdint.h>

void test6(uint8_t* r)
{
    r[0] = 0xFF;
    r[1] = 0xFE;
    r[2] = 0xFD;
    r[3] = 0xFC;
    r[4] = 0xEE;
    r[8] = 0xDD;
    r[12] = 0xCC;
}
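
A hypothetical call site, assuming a peripheral base like the one in 
Steven's example, would look like this; the address is loaded once and 
the stores then use immediate offsets:

/* Hypothetical call (assuming test6 is not inlined): the base reaches
   test6 in a register, so the writes become strb with immediate offsets
   instead of one literal per address. */
test6((uint8_t *)0x40002800U);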

von Uwe Bonnes (Guest)


Rate this post
useful
not useful
Martin, your registers are spaced far apart. In the original code they 
are near to each other, so offset addressing is possible for the first 
example but not for yours; no such optimization is possible in your 
case.

And yes, I think the first example is representative!

von Steven J. (strontium)


Hi Martin,

Actually, I don't need to compile it. I know your test will not 
reproduce the optimization error, and will force the M3 code generator 
to use instructions not available on the M0.

Take a real Cortex M0+, the LPC824.

Its UART registers are defined by a structure like so:
/**
 * @brief UART register block structure
 */
typedef struct {
  __IO uint32_t  CFG;          /*!< Configuration register */
  __IO uint32_t  CTRL;         /*!< Control register */
  __IO uint32_t  STAT;         /*!< Status register */
  __IO uint32_t  INTENSET;     /*!< Interrupt Enable read and set register */
  __O  uint32_t  INTENCLR;     /*!< Interrupt Enable clear register */
  __I  uint32_t  RXDATA;       /*!< Receive Data register */
  __I  uint32_t  RXDATA_STAT;  /*!< Receive Data with status register */
  __IO uint32_t  TXDATA;       /*!< Transmit data register */
  __IO uint32_t  BRG;          /*!< Baud Rate Generator register */
  __IO uint32_t  INTSTAT;      /*!< Interrupt status register */
  __IO uint32_t  OSR;          /*!< Oversampling Selection register */
  __IO uint32_t  ADDR;         /*!< Address register for automatic address matching */
} LPC_USART_T;

and they are given a fixed address in the memory map like this:
#define LPC_USART0_BASE     (0x40064000UL)

#define LPC_USART0          ((LPC_USART_T *) LPC_USART0_BASE)

So we have a series of consecutive addresses starting from a fixed and 
constant base address.  UART drivers typically access multiple registers 
within the UART while processing transmit and receive: checking the 
fifo, overruns, line errors, etc, etc. This optimisation bug shows up as 
multiple literal table entries for each and every register, when one 
would suffice.  This will cause the UART driver to be bigger AND slower 
than necessary.  It will also potentially increase CPU register 
pressure, which has various other potential knock-on effects.
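
To make the access pattern concrete, here is a minimal polling-transmit 
sketch against that block (the TXRDY bit position is from the LPC82x 
manual as I recall it; treat the mask as an assumption):

/* Illustrative only: poll STAT until TXRDY (assumed to be bit 2), then
   write TXDATA. Two registers at fixed offsets from one constant base --
   exactly the pattern that triggers the excess literals on cortex-m0. */
void uart0_send(const uint8_t *buf, uint32_t len)
{
    while (len--) {
        while ((LPC_USART0->STAT & (1U << 2)) == 0) {
            /* wait for transmitter ready */
        }
        LPC_USART0->TXDATA = *buf++;
    }
}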

Your example specifically uses values and addresses which are not 
derivable from one another.  It's not the same thing.  Both GCC 
maintainers and ARM system engineers recognise this is a legitimate 
deficiency within the Cortex M0 code generation of GCC.

It is legitimate to say that the constant values in the test I presented 
are contrived, and they are, but I do that specifically to trigger the 
optimisation fault.  It doesn't change the fault, or the fact that there 
are missed optimisations which GCC is able to perform on another core 
(without resorting to an expanded instruction set).  However, accessing 
multiple registers in a tightly coupled and contiguous block IS NOT 
contrived, as demonstrated with the LPC824 code above.

Hope that helps explain how/why the tests were created and what they are 
attempting to reproduce in a simple and repeatable way for compiler 
testing purposes.

von Johann L. (gjlayde)


Steven J. wrote:
> I have tested all the way from GCC 4.9 to GCC 7.1 and all are equally
> affected.

If the intent of your post is to get a better understanding of the 
inner workings of GCC and why the generated code does not meet your 
expectations, the right place for discussion is one of the GCC mailing 
lists, like

gcc-help@gcc.gnu.org  for help on using GCC

gcc@gcc.gnu.org       for GCC development

https://gcc.gnu.org/lists.html

To start learning about the inner workings and which transformations are 
performed by GCC, you can read the dumps generated by means of

-fdump-tree-all -fdump-rtl-all -save-temps -dp

The 3-digit numbers in the file extensions of the generated files 
indicate the pass responsible for the transformation / analysis.  The 
higher the number, the later the pass runs in the compilation.

If your problem is that constants are propagated into accesses with 
known offsets, but you'd prefer to see indirect accesses with an offset, 
you can try to obfuscate the address after it's assigned to a variable, 
like so:
uint32_t volatile *sfr = (uint32_t volatile*) 0x12345678;

__asm ("" : "+r" (sfr));  /* empty asm: the pointer value becomes opaque to the optimiser */

sfr[0] = sfr[1];

von Steven J. (strontium)


Johann L. wrote:
> Steven J. wrote:
>> I have tested all the way from GCC 4.9 to GCC 7.1 and all are equally
>> affected.
>
> If the intent of your post is to get a better understanding of the
> inner workings of GCC

Actually, the purpose of my post is to make people who use M0-type cores 
aware of this mis-optimization in this common use case.

>
> To start learning about the inner workings and which transformations are
> performed by GCC, you can read the dumps generated by means of
>
> -fdump-tree-all -fdump-rtl-all -save-temps -dp

That's handy and I will look at that.  Thanks.

>
> If your problem is that constants are propagated into accesses with known
> offsets, but you'd prefer to see indirect accesses with an offset.

Actually, currently, constants are just being left as constants.  No 
offsetting is being done.

My preference is that when I tell the compiler to optimise, it does so, 
and doesn't just leave the code in the most verbose unoptimised state it 
can.  A lot of work is put into the optimisers and I fully appreciate 
the work people do on GCC, but there is something amiss here when 
compiling for cortex-m3 can generate the optimal code but compiling for 
cortex-m0 cannot, and these are trivial cases, not highly complex corner 
cases.

Like I said, I noticed this excessive literal generation in code I was 
writing for an M0 target and tried to track it down, because I believed 
I was doing something wrong.  If you have written any code for a 
cortex-m0/m0+ processor, go look at the assembler it generates; it will 
clearly not be optimal and will have excessive literals in the literal 
tables.  It's a problem affecting every M0/M0+ program that's been built 
with GCC, and I think it's worth knowing about if you are using this 
processor.

There are two optimiser faults.  The first is that GCC won't effectively 
calculate constants from one another: if one constant can be derived 
from another simply by adding 10 to it, that's what GCC should do, not 
create a literal for it, because a literal costs 4 bytes PLUS a flash 
read, which is slow.  And it can be shown that GCC is quite capable of 
doing this, just not for cortex-m0.  A source-level sketch of the 
transformation follows below.
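
For illustration, this is roughly the form the cortex-m3 output 
corresponds to if written back as C (a sketch only; on cortex-m0, per 
the bug reports, GCC folds the sums straight back into fresh literals):

#include <stdint.h>

void test1_derived(void)
{
    volatile uint32_t *r = (uint32_t *)0x40002800U; /* one address literal */
    uint32_t v = 0x80000001U;                       /* one value literal   */

    r[0] = v;      /* str  v, [r, #0]           */
    r[1] = v + 1;  /* adds #1, then str [r, #4] */
    r[2] = v + 2;
    r[3] = v + 3;
}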

The second is that addresses are not being calculated as offsets, but 
are being produced as literals for each unique address.  The two 
problems are related, but they are two problems.  The issue seems to be, 
according to the feedback on the bug, that for cortex-m0 cores it's a 
"costing issue in the thumb1 code generation path": for cortex-m0 cores 
GCC is not properly calculating the cost of each instruction or of using 
the literal table, whereas for cortex-m3 it is.

> you can
> try to obfuscate the address after it's assigned to a variable like so:
>
uint32_t volatile *sfr = (uint32_t volatile*) 0x12345678;

__asm ("" : "+r" (sfr));

sfr[0] = sfr[1];

Cute workaround.  It works for an array of bytes.  If I use it in, say, 
the test I posted in my first post, it doesn't do anything.  If I use it 
for a structure (test 3 of my test set) it will produce an offset for 
the address of the structure, but it will not properly calculate the 
values being written and still generates excessive literals.  Roughly 
what I tried is sketched below.
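
For reference, applied to a register block it looks something like this 
(a sketch, reusing the hypothetical TEST_T block from my first post):

void test3_obfuscated(void)
{
    TEST_T *t = (TEST_T *)0x40002800U;
    __asm ("" : "+r" (t));   /* hide the constant base from the optimiser */

    t->REG1 = 0x80000001U;   /* base is now offset-addressed, but each */
    t->REG2 = 0x80000002U;   /* value still gets its own literal entry */
    t->REG3 = 0x80000003U;
    t->REG4 = 0x80000004U;
}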

However, simply needing to resort to the workaround indicates a problem 
with the optimizer.  We shouldn't need to trick the compiler to have it 
generate the trivially optimal code.

My test set is attached to the GCC bug report, but for completeness I 
attach it here.
