Over a year ago I found a strange optimisation issue in GCC, one that
hasn't been addressed in the meantime. Arm just stated "the approach
suggested by Richard Earnshaw upstream seems like a significant amount
of work. Therefore I'm afraid it is unlikely to be solved anytime soon."
So it looks like it isn't a priority to address it "anytime soon".
However, I feel it's quite a bad optimisation issue for a microcontroller,
and so I am trying to make as many people aware of it as possible.
Upstream Bug Reports:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460
https://bugs.launchpad.net/gcc-arm-embedded/+bug/1502611
So what is the problem? Well, I am an assembler programmer, and
typically when I start working on an MCU I always look at the machine
code the compiler is producing. When I started using the Cortex M0+ I
noticed that when I accessed registers, the compiler was producing
LOTS of literal table entries, way more than I thought reasonable. So I
checked the code generated for a Cortex M3, and it looked like one would
expect. Then it became apparent that the M3 code was legal M0 code, and
would run on an M0 unmodified. At that point I thought WTF, why is the
M0 producing such bad code when the M3 does such a better job, and uses
no new instructions? Days of messing with compiler optimisation flags
and different strategies to access registers revealed that there was no
way to change it. Here is an example:
const uint32_t v1 = 0x80000001; // First Value
const uint32_t v2 = 0x80000002; // Second Value
const uint32_t v3 = 0x80000003; // Third Value
const uint32_t v4 = 0x80000004; // Fourth Value

/* TEST1 : Write 32 bit values to known register locations */
void test1(void)
{
    volatile uint32_t* const r1 = (uint32_t*)(0x40002800U); // First Register
    volatile uint32_t* const r2 = (uint32_t*)(0x40002804U); // Second Register
    volatile uint32_t* const r3 = (uint32_t*)(0x40002808U); // Third Register
    volatile uint32_t* const r4 = (uint32_t*)(0x4000280CU); // Fourth Register

    *r1 = v1;
    *r2 = v2;
    *r3 = v3;
    *r4 = v4;
}
So pretty straightforward, typical MCU stuff: have an IO register at a
known location, and write to it. What does that generate?
Well, on the M3 it looks like this:
[code]
00000000 <test1>:
0: 4b04 ldr r3, [pc, #16] ; (14 <v1+0x8>)
2: 4a05 ldr r2, [pc, #20] ; (18 <v1+0xc>)
4: 601a str r2, [r3, #0]
6: 3201 adds r2, #1
8: 605a str r2, [r3, #4]
a: 3201 adds r2, #1
c: 609a str r2, [r3, #8]
e: 3201 adds r2, #1
10: 60da str r2, [r3, #12]
12: 4770 bx lr
14: 40002800 .word 0x40002800
18: 80000001 .word 0x80000001
[/code]
Which is pretty optimal; one would be hard pressed to do better
manually. It has two literal table entries and creates all the other
addresses using offsets. It even collapses the constants, using maths
to calculate the new constants rather than having a new literal entry
per constant. Very nice.
But what does it do for M0/M0+/M1 and M23:
[code]
00000000 <test1>:
0: 4b06 ldr r3, [pc, #24] ; (1c <v1+0x10>)
2: 4a07 ldr r2, [pc, #28] ; (20 <v1+0x14>)
4: 601a str r2, [r3, #0]
6: 4b07 ldr r3, [pc, #28] ; (24 <v1+0x18>)
8: 4a07 ldr r2, [pc, #28] ; (28 <v1+0x1c>)
a: 601a str r2, [r3, #0]
c: 4b07 ldr r3, [pc, #28] ; (2c <v1+0x20>)
e: 4a08 ldr r2, [pc, #32] ; (30 <v1+0x24>)
10: 601a str r2, [r3, #0]
12: 4b08 ldr r3, [pc, #32] ; (34 <v1+0x28>)
14: 4a08 ldr r2, [pc, #32] ; (38 <v1+0x2c>)
16: 601a str r2, [r3, #0]
18: 4770 bx lr
1a: 46c0 nop ; (mov r8, r8)
1c: 40002800 .word 0x40002800
20: 80000001 .word 0x80000001
24: 40002804 .word 0x40002804
28: 80000002 .word 0x80000002
2c: 40002808 .word 0x40002808
30: 80000003 .word 0x80000003
34: 4000280c .word 0x4000280c
38: 80000004 .word 0x80000004
[/code]
Yes, that's what it generates: EVERY single constant has its own literal
table entry. That means that every single register access requires 2
reads from flash BEFORE the register can be written. So it goes from 2
loads and 4 stores to 8 loads and 4 stores. Clearly GCC can do better,
because for the M3 it produces an optimal result, but for the M0, for
some reason, we get this, which is probably the worst possible result
one could expect.
Now I know people will say: oh, use structures, use unions, you are not
giving GCC enough information to optimise. It doesn't matter; no matter
how you encode the exact same operations, on cortex-m0 GCC will produce
very sub-optimal code, while on the M3 it's almost perfectly optimal
without using any M3-specific instructions to do it.
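For instance, here is a struct-based re-encoding of the same four
writes, roughly what test 3 in my test set does (the names below are
just an illustrative sketch, not my actual test code). On cortex-m0 it
still produces one literal table entry per address and per value:

#include <stdint.h>

/* Illustrative register block covering 0x40002800..0x4000280C. */
typedef struct {
    volatile uint32_t REG0;
    volatile uint32_t REG1;
    volatile uint32_t REG2;
    volatile uint32_t REG3;
} periph_t;

#define PERIPH ((periph_t *) 0x40002800U)

void test1_struct(void)
{
    PERIPH->REG0 = 0x80000001U;  /* same four writes as test1 above */
    PERIPH->REG1 = 0x80000002U;
    PERIPH->REG2 = 0x80000003U;
    PERIPH->REG3 = 0x80000004U;
}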
The full test suite I produced is in the upstream bug reports; in it you
can see all the different register access strategies and the horrible
code each one produces.
The reason I think this is such a bad bug is that we are talking about
the Cortex M0 family: they are low-end micros, typically with small
amounts of flash, and not very fast. And they are microcontrollers,
which means they aren't being used for much general purpose computing;
they are IO processors, doing bit bashing, driving I2C peripherals,
etc., etc. These kinds of accesses are common and form a large part of
the work of any microcontroller. So this bug directly hits the weakest
of the Cortex M line, and makes them slower, makes the programs take up
more flash, and makes them use more energy (because they require more
cycles to do the same thing), reducing battery life. If anyone uses
these micros, they could do themselves a favour by going to the bugs
linked above and stating that this affects you and you would appreciate
a solution.
To put the differences in perspective, across my 6 test cases, functions
are an average of 66% larger than they need to be and use an average of
28% more CPU cycles than they need. Worst case, one function was 114%
larger than it needed to be and used 40% more cycles (the test1 listings
above show the size difference: 28 bytes including literals for the M3
versus 60 bytes for the M0). Oh, and that's at zero wait states; the
more wait states on your flash, the worse this problem becomes, because
it is creating many excessive loads from flash, all of which slow down
as wait states increase. That's a LOT of wasted flash and CPU cycles
just for accessing IO registers.
I have tested all the way from GCC 4.9 to GCC 7.1 and all are equally
affected.
Please take this slightly modified code and compile it again.
If the M3 is still much better than the M0 result, then let's see why.
Your code is not very representative.
const uint32_t v1 = 0x80001001; // First Value
const uint32_t v2 = 0x80030002; // Second Value
const uint32_t v3 = 0x80900003; // Third Value
const uint32_t v4 = 0x800000f4; // Fourth Value
/* TEST1 : Write 32 bit values to known register locations */
void test1(void)
{
volatile uint32_t* const r1 = (uint32_t*)(0x41002800U); // First Register
volatile uint32_t* const r2 = (uint32_t*)(0x40302804U); // Second Register
volatile uint32_t* const r3 = (uint32_t*)(0x40052808U); // Third Register
volatile uint32_t* const r4 = (uint32_t*)(0x4700280CU); // Fourth Register
*r1 = v1;
*r2 = v2;
*r3 = v3;
*r4 = v4;
}
When the target is a function argument, things work as expected.
#include <stdint.h>
void test6(uint8_t* r)
{
r[0] = 0xFF;
r[1] = 0xFE;
r[2] = 0xFD;
r[3] = 0xFC;
r[4] = 0xEE;
r[8] = 0xDD;
r[12] = 0xCC;
}
Martin, your registers are spaced far apart. In the original code they
are near to each other, so offset addressing is possible for the first
example, but not for your example; so no optimization is possible in
your case.
And yes, I think the first example is representative!
Hi Martin,
Actually, I don't need to compile it. I know your test will not reproduce
the optimisation error, and will force the M3 code generator to use
instructions not available on the M0.
Take a real Cortex M0+, the LPC824.
Its UART registers are defined by a structure like so:
/**
 * @brief UART register block structure
 */
typedef struct {
    __IO uint32_t CFG;       /*!< Configuration register */
    __IO uint32_t CTRL;      /*!< Control register */
    __IO uint32_t STAT;      /*!< Status register */
    __IO uint32_t INTENSET;  /*!< Interrupt Enable read and set register */
So we have a series of consecutive addresses starting from a fixed and
constant base address. UART drivers typically access multiple registers
while processing transmit and receive: checking the FIFO, overruns,
line errors, etc., etc. This optimisation bug shows up as multiple
literal table entries for each and every register, when one would
suffice. This causes the UART driver to be bigger AND slower than is
necessary. It also potentially increases CPU register pressure, which
has various other potential knock-on effects.
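As a rough sketch of what a driver does with that block (the typedef
name, base address and bit masks below are made up for illustration,
not taken from the LPC824 headers), a single poll-and-configure path
already touches several neighbouring registers:

#include <stdint.h>

/* Stand-in for the vendor typedef above: 'volatile' here plays the
   role of the CMSIS __IO qualifier. */
typedef struct {
    volatile uint32_t CFG;       /* configuration */
    volatile uint32_t CTRL;      /* control */
    volatile uint32_t STAT;      /* status */
    volatile uint32_t INTENSET;  /* interrupt enable set */
} uart_regs_t;

#define UART0 ((uart_regs_t *) 0x40064000U)  /* assumed base address */

void uart_service(void)
{
    uint32_t stat = UART0->STAT;   /* read status */
    if (stat & 0x1U)               /* assumed "data ready" style bit */
        UART0->INTENSET = 0x1U;    /* enable an interrupt source */
    UART0->CFG |= 0x4U;            /* assumed configuration bit */
}

With the bug, each of those accesses gets its own literal table entry
for the register address, instead of sharing one base register plus
small load/store offsets.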
Your example specifically uses both values and addresses which are not
derivable from one another. It's not the same thing. Both the GCC
maintainers AND ARM system engineers recognise this is a legitimate
deficiency within the Cortex M0 code generation of GCC.
It is legitimate to say that the constant values in the test I presented
are contrived, and they are, but I do that specifically to trigger the
optimisation fault. It doesn't change the fault, or the fact that there
are missed optimisations which GCC is able to produce on another core
(without resorting to an expanded instruction set). However, accessing
multiple registers in a tightly coupled and contiguous block IS NOT
contrived, as demonstrated by the LPC824 code above.
Hope that helps explain how/why the tests were created and what they are
attempting to reproduce in a simple and repeatable way for compiler
testing purposes.
Steven J. wrote:
> I have tested all the way from GCC 4.9 to GCC 7.1 and all are equally
> affected.
If the intent of your post is to get a better understanding of the
inner workings of GCC and why the generated code does not meet your
expectations, the right place for discussion is one of the GCC mailing
lists like
gcc-help@gcc.gnu.org for help on using GCC
gcc@gcc.gnu.org for GCC development
https://gcc.gnu.org/lists.html
To start learning about the inner workings and which transformations are
performed by GCC, you can read the dumps generated by means of
-fdump-tree-all -fdump-rtl-all -save-temps -dp
The 3-digit numbers in the file extensions of the generated files
indicate the pass number which was responsible for the transformation /
analysis. The higher the number, the later that pass runs in the
compilation.
If your problem is that constants are propagated into accesses with
known offsets, but you'd prefer to see indirect accesses with offsets,
you can try to obfuscate the address after it's assigned to a variable,
like so:
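Something along these lines, as a sketch (the function name and
addresses are just illustrative; the empty inline asm makes GCC treat
the pointer as if it were computed at run time):

#include <stdint.h>

void test_obfuscated(void)
{
    uint8_t *r = (uint8_t *) 0x40002800U;
    __asm__ ("" : "+r" (r));  /* hide the constant address from the optimiser */
    r[0] = 0xFF;              /* accesses now become base register + offset */
    r[4] = 0xFE;
    r[8] = 0xFD;
}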
Johann L. wrote:
> Steven J. wrote:
>> I have tested all the way from GCC 4.9 to GCC 7.1 and all are equally
>> affected.
>
> If the intent of your post is to get a better understanding of the
> inner workings of GCC
Actually, the purpose of my post is to make people who use M0 type cores
aware of this mis-optimization in this common use case.
> To start learning about the inner workings and which transformations
> are performed by GCC, you can read the dumps generated by means of
>
> -fdump-tree-all -fdump-rtl-all -save-temps -dp
That's handy and I will look at that. Thanks.
> If your problem is that constants are propagated into accesses with
> known offsets, but you'd prefer to see indirect accesses with offsets.
Actually, currently, constants are just being left as constants. No
offsetting is being done.
My preference is that when I tell the compiler to optimise, it does so,
and doesn't just leave the code in the most verbose unoptimised state it
can. A lot of work is put into the optimisers, and I fully appreciate
the work people do on GCC, but there is something amiss when compiling
for cortex-m3 can generate the optimal code but compiling for cortex-m0
cannot, and these are trivial cases, not highly complex corner cases.
Like I said, I noticed this excessive literal generation in code I was
writing for an M0 target and tried to track it down, because I believed
I was doing something wrong. If you have written any code for a
cortex-m0/m0+ processor, go look at the assembler it generates; it will
clearly not be optimal and will have excessive literals in the literal
tables. It's a problem affecting every M0/M0+ program that's been built
with GCC, and I think it's worth knowing about if you are using this
processor.
There are two optimiser faults. First, GCC won't effectively calculate
constants from one another: for example, if one constant can be derived
from another simply by adding 10 to it, that's what GCC should do, not
create a literal for it, because that's 4 bytes for the literal PLUS a
flash read, which is slow. AND it can be shown that GCC is quite capable
of doing this, just not for cortex-m0.
The second is that addresses are not being calculated as offsets, but
are being produced as literals for each unique address.
The two problems are related, but they are two problems. The issue
seems to be, according to the feedback on the bug, that for cortex-m0
cores it's a "costing issue in the thumb1 code generation path"; so for
cortex-m0 cores GCC is not properly calculating the cost of each
instruction or of the use of the literal table, whereas for cortex-m3
it is.
> you can try to obfuscate the address after it's assigned to a
> variable, like so:
Cute workaround. It works for an array of bytes. If I use it in, say,
the test I posted in my first post, it doesn't do anything. If I use it
for a structure (test 3 of my test set), it will produce an offset for
the address of the structure, but will not properly calculate the values
being written and still generates excessive literals.
However, simply needing to resort to the workaround indicates a problem
with the optimiser. We shouldn't need to trick the compiler to have it
generate the optimal code in these trivial cases.
My test set is attached to the GCC bug report, but for completeness I
attach it here.