EmbDev.net

Forum: ARM programming with GCC/GNU tools switch() can compile egregious padding


von Bill B. (auldreekie)


Rate this post
useful
not useful
arm-elf-gcc 4.1.0 (downloaded from this site) running under Mac OS X
10.4.7

If you compile this code with arm-elf-gcc -c:

void bug (int x, int *y)
{
  switch (x) {
  case 99:*y=0;break;
  case 100:*y=1;break;
  case 88:*y=2;break;
  case 91:*y=3;break;
  case 90:*y=4;break;
  case 101:*y=5;break;
  case 37:*y=6;break;
  default:break;
  }
  return;
}

And then disassemble the output with arm-elf-objdump -d,
you will see 260 bytes of embedded padding that serve no purpose.
Optimizing with -O1 or -O2 do NOT make the strange padding go away,
however optimizing with -Os does, so this is a workaround, albeit one
that makes debugging much more difficult.

Shall I submit this bug to the GCC maintainers?  Is there anything about
the Darwin compile of arm-elf-gcc they should know?

Thanks,
--Bill

von A.K. (Guest)


Rate this post
useful
not useful
It's not a bug, it's a feature:
-O1,-O2 optimizes for speed at the expense of size
-Os optimizes for size at the expense of speed.

BTW: gcc -S yields assembly output, no disasm required.

> Shall I submit this bug to the GCC maintainers?

You'd better not.

von Bill B. (auldreekie)


Rate this post
useful
not useful
> -O1,-O2 optimizes for speed at the expense of size
> -Os optimizes for size at the expense of speed.

OK, but this is not consistent with the GCC manual for
version 4.1.1.  Section 3.10, page 65, says "with '-O'
[same as -O1] the compiler tries to reduce code
size AND execution time."  On page 66 for -O2 it says
GCC performs "nearly all supported optimizations that
do NOT involve a space-speed tradeoff." [emphasis mine]

It wouldn't suprise me if the manual is lying about this
but it would be nice to know for sure.

With regard to optimizing for speed, can you explain
to me why 260 bytes of padding make the test code
go faster?  The blank space does not appear to be used
at all.  Not as a lookup table or anything.  It just takes up
space for no apparent reason that I can tell.  Of course I
could be missing something -- but what?

> BTW: gcc -S yields assembly output, no disasm required.

Thanks!  I'm rusty and had forgotten that.

>> Shall I submit this bug to the GCC maintainers?
>
> You'd better not.

Take a look at the disassembly yourself -- can you tell me
where I'm missing the utility of that huge amount of padding?

von Bill B. (auldreekie)


Rate this post
useful
not useful
To save the readers the trouble, here is what I get for disassembly
using arm-elf-gcc -S.  I initially omitted this to make this thread
more readable but on second thought it seems better to include it.

Note the 260 bytes of padding at symbol .L10 -- a symbol that is not
referenced anywhere else in the code.  --Bill

  .file  "armgccbug.c"
  .text
  .align  2
  .global  bug
  .type  bug, %function
bug:
  @ args = 0, pretend = 0, frame = 8
  @ frame_needed = 1, uses_anonymous_args = 0
  mov  ip, sp
  stmfd  sp!, {fp, ip, lr, pc}
  sub  fp, ip, #4
  sub  sp, sp, #8
  str  r0, [fp, #-16]
  str  r1, [fp, #-20]
  ldr  r3, [fp, #-16]
  sub  r3, r3, #37
  cmp  r3, #64
  ldrls  pc, [pc, r3, asl #2]
  b  .L11
  .p2align 2
.L10:
  .word  .L3
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L4
  .word  .L11
  .word  .L5
  .word  .L6
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L11
  .word  .L7
  .word  .L8
  .word  .L9
.L7:
  ldr  r2, [fp, #-20]
  mov  r3, #0
  str  r3, [r2, #0]
  b  .L11
.L8:
  ldr  r2, [fp, #-20]
  mov  r3, #1
  str  r3, [r2, #0]
  b  .L11
.L4:
  ldr  r2, [fp, #-20]
  mov  r3, #2
  str  r3, [r2, #0]
  b  .L11
.L6:
  ldr  r2, [fp, #-20]
  mov  r3, #3
  str  r3, [r2, #0]
  b  .L11
.L5:
  ldr  r2, [fp, #-20]
  mov  r3, #4
  str  r3, [r2, #0]
  b  .L11
.L9:
  ldr  r2, [fp, #-20]
  mov  r3, #5
  str  r3, [r2, #0]
  b  .L11
.L3:
  ldr  r2, [fp, #-20]
  mov  r3, #6
  str  r3, [r2, #0]
.L11:
  sub  sp, fp, #12
  ldmfd  sp, {fp, sp, pc}
  .size  bug, .-bug
  .ident  "GCC: (GNU) 4.1.0"

von Clifford S. (clifford)


Rate this post
useful
not useful
It looks like a jump table to me. Each entry in the .L10 table
references one of the cases. My ARM assembler is not good enough to
analyse it completely without a lot of effort, and teh compiler appears
to have done some pretty fancy analysis to reduce the table size.

Nonetheless it is difficult to class it as a bug just because it does
not generate the code you think it should! If the code works in the
manner defined by the ISO standard, it is not a bug.

Also I would not sweat about the code produced from such a perverse
example!

Clifford.

von A.K. (Guest)


Rate this post
useful
not useful
> The blank space does not appear to be used
> at all.  Not as a lookup table or anything.

It is a lookup table. The blank space (L11) is the space occupied by
those values between 37 and 101 which are not used as case labels.

L10 is implicitly referenced by the LDRLS instruction (a conditional
indirect jump) which knows that the value of PC is equal to L10 at the
time the address is calculated.

>   sub  r3, r3, #37
>   cmp  r3, #64
>   ldrls  pc, [pc, r3, asl #2]
>   b  .L11
> .L10:

This is the core of the switch code, a nice example for predicated
execution btw. It is a table based jump, and the execution time does not
depend on the number of case labels in the switch statement.

The code generated with -Os however follows a tree structured
compare-and-branch strategy and execution time increases with the number
of case labels.

Your code is a border case, because the average execution time of the
compare-and-branch tree might actually be similar to or slightly lower
as the execution time of the table jump, but only when nonsequential
instruction flow is fast (a highly system dependant parameter).

von A.K. (Guest)


Rate this post
useful
not useful
> It wouldn't suprise me if the manual is lying about this
> but it would be nice to know for sure.

It is incorrect in your case, yes. But "lying" is a hard word,
especially if you intend to talk to the developers ;-).

Note that this part of the GCC docs is common to all versions, whereas
each target architecture may have its own decisions what code to
generate for switch statements depending on what optimization level.

Generating optimized code often causes trouble when debugging, because
the amount of reordering done by the compiler. For easy debugging,
compile without optimization.

von A.K. (Guest)


Rate this post
useful
not useful
The label .L10 is explicitly referenced when compiling thumb code,
that's why it is also present in the native code.

von Bill B. (auldreekie)


Rate this post
useful
not useful
A.K. wrote:
> It is a lookup table.

A.K., Clifford, thanks so much for the clear analysis.  I am retraining
myself from 68k to ARM and I don't see things like implicit PC-based
references very well yet!  Now that you've explained it I can see the
elegance.

(The example comes from a message-passing paradigm in which there are a
multitude of "objects" that are passed a simple "directive" (such as
INSTALL, RESET, WRITE, READ, etc.).  Each "object" has a
switch(directive) inside that decides how it should proceed.  Since
there are over 150 potential directives available the switch() example I
condensed for the forum is, regrettably, not perverse at all in the
context of this paradigm.  The blessing is that I find the code very
easy to understand and maintain.)

Thanks again,
--Bill

von Clifford S. (clifford)


Rate this post
useful
not useful
Clifford Slocombe wrote:
> My ARM assembler is not good enough to
> analyse it completely without a lot of effort, and the compiler appears
> to have done some pretty fancy analysis to reduce the table size.
>

Ok, not that fancy in fact. The table simply contains and entry for
every value from 37 to 101. (101-37)+1*sizeof(word) = 260 (+1 because it
is inclusive). So the code subtracts the lowest cast value from x and
uses that value to index the table to obtain the address to jump to. It
is both simple code generation and fast.

Note that most entries refernce the default case, and one for each
explicit case.

The lesson is perhaps that perversely bad code should be avoided if
space is a concern, you are just giving the compiler a hard time! If you
genuinely need this combination of values, then an if - else if - else
chain would produce smaller code, and probably similar to that produced
when you used -Os.

It was a bold and brave statment in the forst place to assert that it
was a bug!

Clifford

von Clifford S. (clifford)


Rate this post
useful
not useful
Bill Burgess wrote:
> A.K. wrote:
> The blessing is that I find the code very
> easy to understand and maintain.)


I see others got in while I was away with explanations, so my last post
is more-or-less duplicate information. Another way to implement such
code that is just as, if not more wasteful, but probably faster and
easier to maintain would be in fact to explicitly create a lookup table:

static int lookup[101] ;

...

lookup[99] = 0 ;
lookup[100] = 2 ;
lookup[88] = 3 ;
lookup[91] = 4 ;
lookup[90] = 5 ;
lookup[101] = 5 ;
lookup[37] = 6 ;

...
int z = lookup[x] ;
if( z )
{
*y = z ;
}


Of course the actual implementation may differ. If teh table is not
really that sparse, you might use a static initialiser for example.

Clifford

von Bill B. (auldreekie)


Rate this post
useful
not useful
> It was a bold and brave statment in the forst place to assert that it
> was a bug!

For a long time it seemed that every bug I ever found was mine and
humility was the rule.  But in the last few years real bugs in
"established" systems have stung me.  A few years ago I encountered an
inconsistent bug in the manufacturing process for a 68k-based SoC:
about a third of all SoC units had such terrible jitter in the
PLL-generated system clock that they could not transfer IrDA packets at
115 kbaud though they were specified to do so.  The manufacturer
repeatedly denied any flaws, and I found the problem only after two
months of blaming my own code and my own hardware, and losing much
stomach lining in the process.

Even in the case we're discussing here, the compiler and the manual did
not agree, with the latter inaccurately claiming size-vs-space
neutrality for -O1 and -O2.

But nevertheless I accept your kind reproach and promise to remember
humility before my next outburst!

Cheers and sincere thanks,
--Bill

von Clifford S. (clifford)


Rate this post
useful
not useful
Bill Burgess wrote:
> Even in the case we're discussing here, the compiler and the manual did
> not agree, with the latter inaccurately claiming size-vs-space
> neutrality for -O1 and -O2.
>

I disagree; the manual is suitably guarded adout optimisation, and uses
words such as "tries" and "typically" when referring to optimisation.
For individual and small pieces of code, there are no guarantees. The
balance between size and speed simply does not match your expectations.
Also the statements in the manual are the general 'philosophy'; the
scope for optimisation, and the empasis applied by the developers of
individual processor optimisers are presumably guided by the processor
architecture and the typical resource availability in systems employing
such a processor.

Optimisation is an 'holistic' process, and the compiler applies a
'strategy' that generally results in a improvements in non-trivial code.
In this case the strategy presumably preferred speed over size, because
teh size implications were small, and if you were concerned about a few
hundred bytes, you would have chosen -Os - it seems a fair assumption to
me.

The particular results on a tiny fragment of code are not indicitive,
you need sufficuient code, of sufficient complexity for the optimiser to
apply all its strategies to see a genuine improvement.

Clifford

von Bill B. (auldreekie)


Rate this post
useful
not useful
Clifford Slocombe wrote:
> The balance between size and speed simply does not match your expectations.

Yes this is so.  Thanks for the clarity about the 'holistic' nature of
optimizing.

On this topic, perhaps you can clear up a remaining confusion:  how does
(or should) an optimizer treat the -O0 switch for "no optimizing"?  My
understanding has always been that a user who requests "no optimization"
is asking the compiler, in the hope of easing debugging, to use no
cleverness whatsoever.

In the example here, the jump table appears when compiling with -O0, but
is replaced by a decision tree if one omits one of the case statements.
One could argue that this jump table is a bit of "cleverness" because it
only appears in certain specific cases, and because in this case it
represents an explict decision to improve speed at the expense of size.
Wouldn't a decision tree have been the more generic "no cleverness"
approach?

So the question -- is it really correct to think of -O0 as the "no
cleverness" switch, or is it better to expect that the compiler will
always optimize for speed, to a greater or lesser extent, unless
explictly told otherwise with -Os?

Thanks, --Bill

von A.K. (Guest)


Rate this post
useful
not useful
For switch statements, there are two possibilities:
- Unconditionally use a version which produces reasonably small code.
- Use the same decisions with and without optimization.
The table variant obviously cannot be used in all cases. Apparently
GCC/ARM code generation follows the second altervative.

Optimizing code generators for switch statements use a cost based
approach. Make some statistics on the number and range of values etc,
then try to find out which code fits best. Accounting for speed/space
flags like -Os. Those heuristics are not necessarily exact though, e.g.
the length of a condition branch instruction is likely guessed instead
of calculated, on machines where it varies depending on the target
distance (x86: 2 vs. 6 bytes).

von Clifford S. (clifford)


Rate this post
useful
not useful
Bill Burgess wrote:
>
> So the question -- is it really correct to think of -O0 as the "no
> cleverness" switch, or is it better to expect that the compiler will
> always optimize for speed, to a greater or lesser extent, unless
> explictly told otherwise with -Os?
>
Well, while the optimiser is pretty clever, the basic compilation is not
really stupid either. For example if you write code like x *= 4 ;, the
compiler will recognise the "power of two" constant and use a left
shift.

Optimisation is not reccomended for debugging (unless you are debugging
the optimiser!), because it's 'cleverness' can result in out of order
execution and even entire chunks of code being removed. What you will
often see while stepping the code for example is the 'current statement'
jumping to apparently unrelated lines or executing them in a strange
order. It is difficult in these situations to follow what is happening.

Without optimisation, each statement is implemented with a distinct and
separate piece of code, occuring in the same sequential order as the
original source code. It does not mean however that that code will be
'not clever'. In your example, because the jump table is simply data,
and is not executed, it has no effect on order of execution.

As I said earlier, the code generation in your example is not that
clever, it is an 'obvious' implementation that favours speed over
compactness. More importantly it is a 'deterministic' implementation -
that is the time taken to get from the switch statment to the case is
the same for all cases. This is normally desirable and expected
behaviour - especially in hard real-time systems. The inefficiency in
this case is caused by the source code - by the sparse and unordered
sequence of cases. If you are down to your last few bytes, you might
consider an intrinsically more space efficient code implementation
rather than relying on the optimiser.

Finally to second guess the compiler and especially the optimiser rather
defeats the object of using a high level language in the first instance.
A certain degree of trust is required in order to maintain productivity,
and not "sweat the little stuff'. You have to accept the compiler as
your "assembler expert" and let it do its job. Of course that is not to
say you should be complacent, the one bug I ever encountered in a
compiler was in MSC v6.0 (circa 1989), which concidentally screwed up
switch-case statements when the cases were sparse and out of order and
the optimiser was applied; but GCC is a mature and high quality compiler
being used throughout the world in all kinds of applications, and in
this case the code was not actually wrong. As I said in the first
instance, if the code runs as defined, then it is not a bug, regardless
of the code generated, that is for your 'assembler expert' to decide!



Clifford

von Bill B. (auldreekie)


Rate this post
useful
not useful
Clifford Slocombe wrote:
> Bill Burgess wrote:
>>
>> So the question -- is it really correct to think of -O0 as the "no
>> cleverness" switch, or is it better to expect that the compiler will
>> always optimize for speed, to a greater or lesser extent, unless
>> explictly told otherwise with -Os?
>>
> Without optimisation...does not mean however that that code will be
> 'not clever'.

Understood.  This is all very helpful.  In my situation I may indeed be
down to my last few bytes; depending on the outcome of the next few
months of coding on a development system, I may need to add external RAM
to the final design, something to be avoided if at all possible.  So I
very much appreciate your and A.K.'s great advice and explanations about
all this.  I will definitely adjust my C code to improve the efficiency
of the switch() statements.

Sorry about my audacity in prematurely declaring this a bug (I recently
found an obscure bug in the GNU assembler -- sourceware.org bugzilla
#2582 -- perhaps breeding overconfidence).  This plainly was not a fault
in GCC, simply a reflection of the tradeoffs its programmers chose.  I
am glad to have a better understanding of those tradeoffs, and promise
to have greater respect for GCC in the future!

I also promise to look more carefully for implicit PC-based references
in ARM disassembly, a blind spot I acquired from the 68k.  If I were
better at catching these I would have recognized the jump table for what
it was from the get-go.

Chastened but thankful, --Bill

von Clifford S. (clifford)


Rate this post
useful
not useful
Just one last suggestion. When I really feel the need to inspect the
generated code, I find it easier to do it within the debugger. Not only
will it display interleaved assembly and source so you can directly
relate the source to the code, but you can actually execute it on the
spot, instruction by instruction to understand how it works.

Clifford

Please log in before posting. Registration is free and takes only a minute.
Existing account
Do you have a Google/GoogleMail account? No registration required!
Log in with Google account
No account? Register here.