arm-elf-gcc 4.1.0 (downloaded from this site) running under Mac OS X 10.4.7 If you compile this code with arm-elf-gcc -c: void bug (int x, int *y) { switch (x) { case 99:*y=0;break; case 100:*y=1;break; case 88:*y=2;break; case 91:*y=3;break; case 90:*y=4;break; case 101:*y=5;break; case 37:*y=6;break; default:break; } return; } And then disassemble the output with arm-elf-objdump -d, you will see 260 bytes of embedded padding that serve no purpose. Optimizing with -O1 or -O2 do NOT make the strange padding go away, however optimizing with -Os does, so this is a workaround, albeit one that makes debugging much more difficult. Shall I submit this bug to the GCC maintainers? Is there anything about the Darwin compile of arm-elf-gcc they should know? Thanks, --Bill
It's not a bug, it's a feature:
-O1,-O2 optimizes for speed at the expense of size
-Os optimizes for size at the expense of speed.
BTW: gcc -S yields assembly output, no disasm required.
> Shall I submit this bug to the GCC maintainers?
You'd better not.
> -O1,-O2 optimizes for speed at the expense of size > -Os optimizes for size at the expense of speed. OK, but this is not consistent with the GCC manual for version 4.1.1. Section 3.10, page 65, says "with '-O' [same as -O1] the compiler tries to reduce code size AND execution time." On page 66 for -O2 it says GCC performs "nearly all supported optimizations that do NOT involve a space-speed tradeoff." [emphasis mine] It wouldn't suprise me if the manual is lying about this but it would be nice to know for sure. With regard to optimizing for speed, can you explain to me why 260 bytes of padding make the test code go faster? The blank space does not appear to be used at all. Not as a lookup table or anything. It just takes up space for no apparent reason that I can tell. Of course I could be missing something -- but what? > BTW: gcc -S yields assembly output, no disasm required. Thanks! I'm rusty and had forgotten that. >> Shall I submit this bug to the GCC maintainers? > > You'd better not. Take a look at the disassembly yourself -- can you tell me where I'm missing the utility of that huge amount of padding?
To save the readers the trouble, here is what I get for disassembly using arm-elf-gcc -S. I initially omitted this to make this thread more readable but on second thought it seems better to include it. Note the 260 bytes of padding at symbol .L10 -- a symbol that is not referenced anywhere else in the code. --Bill .file "armgccbug.c" .text .align 2 .global bug .type bug, %function bug: @ args = 0, pretend = 0, frame = 8 @ frame_needed = 1, uses_anonymous_args = 0 mov ip, sp stmfd sp!, {fp, ip, lr, pc} sub fp, ip, #4 sub sp, sp, #8 str r0, [fp, #-16] str r1, [fp, #-20] ldr r3, [fp, #-16] sub r3, r3, #37 cmp r3, #64 ldrls pc, [pc, r3, asl #2] b .L11 .p2align 2 .L10: .word .L3 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L4 .word .L11 .word .L5 .word .L6 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L11 .word .L7 .word .L8 .word .L9 .L7: ldr r2, [fp, #-20] mov r3, #0 str r3, [r2, #0] b .L11 .L8: ldr r2, [fp, #-20] mov r3, #1 str r3, [r2, #0] b .L11 .L4: ldr r2, [fp, #-20] mov r3, #2 str r3, [r2, #0] b .L11 .L6: ldr r2, [fp, #-20] mov r3, #3 str r3, [r2, #0] b .L11 .L5: ldr r2, [fp, #-20] mov r3, #4 str r3, [r2, #0] b .L11 .L9: ldr r2, [fp, #-20] mov r3, #5 str r3, [r2, #0] b .L11 .L3: ldr r2, [fp, #-20] mov r3, #6 str r3, [r2, #0] .L11: sub sp, fp, #12 ldmfd sp, {fp, sp, pc} .size bug, .-bug .ident "GCC: (GNU) 4.1.0"
It looks like a jump table to me. Each entry in the .L10 table references one of the cases. My ARM assembler is not good enough to analyse it completely without a lot of effort, and teh compiler appears to have done some pretty fancy analysis to reduce the table size. Nonetheless it is difficult to class it as a bug just because it does not generate the code you think it should! If the code works in the manner defined by the ISO standard, it is not a bug. Also I would not sweat about the code produced from such a perverse example! Clifford.
> The blank space does not appear to be used > at all. Not as a lookup table or anything. It is a lookup table. The blank space (L11) is the space occupied by those values between 37 and 101 which are not used as case labels. L10 is implicitly referenced by the LDRLS instruction (a conditional indirect jump) which knows that the value of PC is equal to L10 at the time the address is calculated. > sub r3, r3, #37 > cmp r3, #64 > ldrls pc, [pc, r3, asl #2] > b .L11 > .L10: This is the core of the switch code, a nice example for predicated execution btw. It is a table based jump, and the execution time does not depend on the number of case labels in the switch statement. The code generated with -Os however follows a tree structured compare-and-branch strategy and execution time increases with the number of case labels. Your code is a border case, because the average execution time of the compare-and-branch tree might actually be similar to or slightly lower as the execution time of the table jump, but only when nonsequential instruction flow is fast (a highly system dependant parameter).
> It wouldn't suprise me if the manual is lying about this > but it would be nice to know for sure. It is incorrect in your case, yes. But "lying" is a hard word, especially if you intend to talk to the developers ;-). Note that this part of the GCC docs is common to all versions, whereas each target architecture may have its own decisions what code to generate for switch statements depending on what optimization level. Generating optimized code often causes trouble when debugging, because the amount of reordering done by the compiler. For easy debugging, compile without optimization.
The label .L10 is explicitly referenced when compiling thumb code, that's why it is also present in the native code.
A.K. wrote:
> It is a lookup table.
A.K., Clifford, thanks so much for the clear analysis. I am retraining
myself from 68k to ARM and I don't see things like implicit PC-based
references very well yet! Now that you've explained it I can see the
elegance.
(The example comes from a message-passing paradigm in which there are a
multitude of "objects" that are passed a simple "directive" (such as
INSTALL, RESET, WRITE, READ, etc.). Each "object" has a
switch(directive) inside that decides how it should proceed. Since
there are over 150 potential directives available the switch() example I
condensed for the forum is, regrettably, not perverse at all in the
context of this paradigm. The blessing is that I find the code very
easy to understand and maintain.)
Thanks again,
--Bill
Clifford Slocombe wrote: > My ARM assembler is not good enough to > analyse it completely without a lot of effort, and the compiler appears > to have done some pretty fancy analysis to reduce the table size. > Ok, not that fancy in fact. The table simply contains and entry for every value from 37 to 101. (101-37)+1*sizeof(word) = 260 (+1 because it is inclusive). So the code subtracts the lowest cast value from x and uses that value to index the table to obtain the address to jump to. It is both simple code generation and fast. Note that most entries refernce the default case, and one for each explicit case. The lesson is perhaps that perversely bad code should be avoided if space is a concern, you are just giving the compiler a hard time! If you genuinely need this combination of values, then an if - else if - else chain would produce smaller code, and probably similar to that produced when you used -Os. It was a bold and brave statment in the forst place to assert that it was a bug! Clifford
Bill Burgess wrote: > A.K. wrote: > The blessing is that I find the code very > easy to understand and maintain.) I see others got in while I was away with explanations, so my last post is more-or-less duplicate information. Another way to implement such code that is just as, if not more wasteful, but probably faster and easier to maintain would be in fact to explicitly create a lookup table: static int lookup[101] ; ... lookup[99] = 0 ; lookup[100] = 2 ; lookup[88] = 3 ; lookup[91] = 4 ; lookup[90] = 5 ; lookup[101] = 5 ; lookup[37] = 6 ; ... int z = lookup[x] ; if( z ) { *y = z ; } Of course the actual implementation may differ. If teh table is not really that sparse, you might use a static initialiser for example. Clifford
> It was a bold and brave statment in the forst place to assert that it > was a bug! For a long time it seemed that every bug I ever found was mine and humility was the rule. But in the last few years real bugs in "established" systems have stung me. A few years ago I encountered an inconsistent bug in the manufacturing process for a 68k-based SoC: about a third of all SoC units had such terrible jitter in the PLL-generated system clock that they could not transfer IrDA packets at 115 kbaud though they were specified to do so. The manufacturer repeatedly denied any flaws, and I found the problem only after two months of blaming my own code and my own hardware, and losing much stomach lining in the process. Even in the case we're discussing here, the compiler and the manual did not agree, with the latter inaccurately claiming size-vs-space neutrality for -O1 and -O2. But nevertheless I accept your kind reproach and promise to remember humility before my next outburst! Cheers and sincere thanks, --Bill
Bill Burgess wrote: > Even in the case we're discussing here, the compiler and the manual did > not agree, with the latter inaccurately claiming size-vs-space > neutrality for -O1 and -O2. > I disagree; the manual is suitably guarded adout optimisation, and uses words such as "tries" and "typically" when referring to optimisation. For individual and small pieces of code, there are no guarantees. The balance between size and speed simply does not match your expectations. Also the statements in the manual are the general 'philosophy'; the scope for optimisation, and the empasis applied by the developers of individual processor optimisers are presumably guided by the processor architecture and the typical resource availability in systems employing such a processor. Optimisation is an 'holistic' process, and the compiler applies a 'strategy' that generally results in a improvements in non-trivial code. In this case the strategy presumably preferred speed over size, because teh size implications were small, and if you were concerned about a few hundred bytes, you would have chosen -Os - it seems a fair assumption to me. The particular results on a tiny fragment of code are not indicitive, you need sufficuient code, of sufficient complexity for the optimiser to apply all its strategies to see a genuine improvement. Clifford
Clifford Slocombe wrote:
> The balance between size and speed simply does not match your expectations.
Yes this is so. Thanks for the clarity about the 'holistic' nature of
optimizing.
On this topic, perhaps you can clear up a remaining confusion: how does
(or should) an optimizer treat the -O0 switch for "no optimizing"? My
understanding has always been that a user who requests "no optimization"
is asking the compiler, in the hope of easing debugging, to use no
cleverness whatsoever.
In the example here, the jump table appears when compiling with -O0, but
is replaced by a decision tree if one omits one of the case statements.
One could argue that this jump table is a bit of "cleverness" because it
only appears in certain specific cases, and because in this case it
represents an explict decision to improve speed at the expense of size.
Wouldn't a decision tree have been the more generic "no cleverness"
approach?
So the question -- is it really correct to think of -O0 as the "no
cleverness" switch, or is it better to expect that the compiler will
always optimize for speed, to a greater or lesser extent, unless
explictly told otherwise with -Os?
Thanks, --Bill
For switch statements, there are two possibilities: - Unconditionally use a version which produces reasonably small code. - Use the same decisions with and without optimization. The table variant obviously cannot be used in all cases. Apparently GCC/ARM code generation follows the second altervative. Optimizing code generators for switch statements use a cost based approach. Make some statistics on the number and range of values etc, then try to find out which code fits best. Accounting for speed/space flags like -Os. Those heuristics are not necessarily exact though, e.g. the length of a condition branch instruction is likely guessed instead of calculated, on machines where it varies depending on the target distance (x86: 2 vs. 6 bytes).
Bill Burgess wrote: > > So the question -- is it really correct to think of -O0 as the "no > cleverness" switch, or is it better to expect that the compiler will > always optimize for speed, to a greater or lesser extent, unless > explictly told otherwise with -Os? > Well, while the optimiser is pretty clever, the basic compilation is not really stupid either. For example if you write code like x *= 4 ;, the compiler will recognise the "power of two" constant and use a left shift. Optimisation is not reccomended for debugging (unless you are debugging the optimiser!), because it's 'cleverness' can result in out of order execution and even entire chunks of code being removed. What you will often see while stepping the code for example is the 'current statement' jumping to apparently unrelated lines or executing them in a strange order. It is difficult in these situations to follow what is happening. Without optimisation, each statement is implemented with a distinct and separate piece of code, occuring in the same sequential order as the original source code. It does not mean however that that code will be 'not clever'. In your example, because the jump table is simply data, and is not executed, it has no effect on order of execution. As I said earlier, the code generation in your example is not that clever, it is an 'obvious' implementation that favours speed over compactness. More importantly it is a 'deterministic' implementation - that is the time taken to get from the switch statment to the case is the same for all cases. This is normally desirable and expected behaviour - especially in hard real-time systems. The inefficiency in this case is caused by the source code - by the sparse and unordered sequence of cases. If you are down to your last few bytes, you might consider an intrinsically more space efficient code implementation rather than relying on the optimiser. Finally to second guess the compiler and especially the optimiser rather defeats the object of using a high level language in the first instance. A certain degree of trust is required in order to maintain productivity, and not "sweat the little stuff'. You have to accept the compiler as your "assembler expert" and let it do its job. Of course that is not to say you should be complacent, the one bug I ever encountered in a compiler was in MSC v6.0 (circa 1989), which concidentally screwed up switch-case statements when the cases were sparse and out of order and the optimiser was applied; but GCC is a mature and high quality compiler being used throughout the world in all kinds of applications, and in this case the code was not actually wrong. As I said in the first instance, if the code runs as defined, then it is not a bug, regardless of the code generated, that is for your 'assembler expert' to decide! Clifford
Clifford Slocombe wrote: > Bill Burgess wrote: >> >> So the question -- is it really correct to think of -O0 as the "no >> cleverness" switch, or is it better to expect that the compiler will >> always optimize for speed, to a greater or lesser extent, unless >> explictly told otherwise with -Os? >> > Without optimisation...does not mean however that that code will be > 'not clever'. Understood. This is all very helpful. In my situation I may indeed be down to my last few bytes; depending on the outcome of the next few months of coding on a development system, I may need to add external RAM to the final design, something to be avoided if at all possible. So I very much appreciate your and A.K.'s great advice and explanations about all this. I will definitely adjust my C code to improve the efficiency of the switch() statements. Sorry about my audacity in prematurely declaring this a bug (I recently found an obscure bug in the GNU assembler -- sourceware.org bugzilla #2582 -- perhaps breeding overconfidence). This plainly was not a fault in GCC, simply a reflection of the tradeoffs its programmers chose. I am glad to have a better understanding of those tradeoffs, and promise to have greater respect for GCC in the future! I also promise to look more carefully for implicit PC-based references in ARM disassembly, a blind spot I acquired from the 68k. If I were better at catching these I would have recognized the jump table for what it was from the get-go. Chastened but thankful, --Bill
Just one last suggestion. When I really feel the need to inspect the generated code, I find it easier to do it within the debugger. Not only will it display interleaved assembly and source so you can directly relate the source to the code, but you can actually execute it on the spot, instruction by instruction to understand how it works. Clifford
Please log in before posting. Registration is free and takes only a minute.
Existing account
Do you have a Google/GoogleMail account? No registration required!
Log in with Google account
Log in with Google account
No account? Register here.