Hello there, I have been reading some articles, and even books, about C code styling to guarantee efficient machine code on ARM platforms. One example is "ARM System Developer's Guide: Designing and Optimizing System Software". Some of those articles date back to the early 2000s, and I am not sure how sophisticated compilers are now. Does anyone know whether such "tricks" (e.g. using count-down loops, avoiding huge structures as function arguments, avoiding too many function arguments, etc.) still matter during development? Or does a typical compiler now do it all? If the compiler now does it all, does this mean that we can take, for instance, C/C++ source code that was targeting an x86 architecture, recompile it with a "good" compiler targeting ARM, and end up with efficient code execution on the ARM platform? Thanks in advance!
>Or does a typical compiler now do it all?
I'm sorry to say it, but I think this question lacks some reflection beforehand.
The examples you listed are NOT recommended because of weak compilers:

The compiler in or before the year 2000 did not know how to transfer large structs by parameter efficiently, but even in the year 2012 this is inefficient, and it can by NO means ever be optimized away.

The same holds for parameter lists. Think for yourself: how could any optimizer reduce a parameter list (supposing we do not consider simple cases where a parameter is never used or may be derived by a comparatively simple rule from another parameter)? Just by dropping a parameter? No optimizer, now or at any time in the future, as distant as is thinkable, can do that! How could it? That would have a serious impact on the algorithm itself, for which the program's author is responsible, not the compiler.
The third example is a bit more specific. Most processors have faster instructions for a conditional jump whose condition is a comparison with zero than for one whose condition may be a comparison with an arbitrary integer. For this reason, count-down loops are recommended in every language and for every compiler. The reason lies in the executing processor, not in the compiler. How could any compiler modify the instruction set? It couldn't! It could only select another instruction, and there are (for reasons you should explore yourself) no such alternative instructions without a speed penalty.
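To make this concrete, here is a small sketch of my own; process() is an invented placeholder for the loop body's work:

    #include <stddef.h>

    void process(int x);    /* hypothetical work function */

    /* Count-up: the compiler generally needs a compare against n
     * before the conditional branch at the bottom of the loop. */
    void count_up(const int *a, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            process(a[i]);
    }

    /* Count-down: the decrement itself sets the zero flag, so no
     * separate compare is needed. Note that, written by hand this
     * way, it also reverses the iteration order. */
    void count_down(const int *a, size_t n)
    {
        for (size_t i = n; i-- > 0; )
            process(a[i]);
    }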
Optimization as you understand it was never an objective. It never had the goal of optimizing away properties inherent in the language. Why should it? I wish I could bring it to a point, but that is not so easy. Optimization mainly attempts to find code sequences which are duplicated, or logically (mathematically) pointless, and may be omitted. I don't want to go into depth, because that is a very, very, very wide field, but I am talking about something like an attempt to multiply an arbitrary integer a by one. Optimizers will remove the multiplication by one.
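A trivial illustration of my own (the function name is invented):

    int scale(int a)
    {
        return a * 1;   /* an optimizer will typically reduce this to: return a; */
    }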
Kind regards...
Thanks for your detailed reply... Please let me place my arguments below (taking into consideration that, after all, I am not an expert programmer).

"The compiler in or before the year 2000 did not know how to transfer large structs by parameter efficiently"

- I meant (pointers to) large structures. AFAIK, it is recommended to keep structures within a reasonable size for more efficient dereferencing of internal elements (fewer memory loads). For the same reason, it is also advisable to place smaller structure elements first (to avoid large elements at the beginning of the structure that may cause inefficient access to later, smaller elements). Also, different orders of the elements within a structure result in different sizes of the overall structure in memory (different packing/padding); see the sketch at the end of this post. Might the compiler be able to split a large structure into a hierarchy of structures? It might also reorder the elements of a struct for a more efficient memory footprint/bandwidth.

"How could any optimizer reduce a parameter list"

- Group them into structures, and pass a pointer to this structure rather than passing them as individual arguments?

"Count-down loops are recommended in every language and for every compiler. The reason lies in the executing processor, not in the compiler. How could any compiler modify the instruction set?"

- No... the compiler won't change the instruction set, but can it (for example, and where possible) replace a count-up loop with a count-down one to make use of the "free" comparison with zero?

- Finally, if I have C/C++ source code that runs efficiently on x86, do I need to consider modifying the source code if I need to recompile it to run efficiently on ARM?
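To illustrate the packing/padding point, here is a small sketch of my own (the sizes assume a typical 32-bit ABI where int is 4 bytes and 4-byte aligned):

    #include <stdio.h>

    /* A poor ordering: the int forces padding around the chars. */
    struct bad {
        char c1;    /* 1 byte + 3 bytes of padding */
        int  i;     /* 4 bytes */
        char c2;    /* 1 byte + 3 bytes of tail padding */
    };              /* typically 12 bytes */

    /* Grouping members by size removes most of the padding. */
    struct good {
        int  i;     /* 4 bytes */
        char c1;    /* 1 byte */
        char c2;    /* 1 byte + 2 bytes of tail padding */
    };              /* typically 8 bytes */

    int main(void)
    {
        printf("%zu %zu\n", sizeof(struct bad), sizeof(struct good));
        return 0;
    }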
The rule is simple: don't write optimized code. Use efficient algorithms, but do not bother with processor-specific optimizations. Micro-optimizations cost a lot of time and make your code less readable. The time saved is better invested in finding a faster algorithm for your problem. That gives more speed-up.

There are important exceptions to that. But your very first thought should always be: is what I'm about to write still easy to understand after 12 hours of tiresome coding? By someone other than me? If that is not the case, you will waste a lot of your time and introduce bugs, and your little optimizations may even become pointless.

Now to the exceptional cases: start optimizing when your program is too slow or uses too much RAM/flash/whatever. Not earlier. Even then, first try to find a better algorithm. Only after you know for sure that no one has ever devised a faster way of doing what you want to do should you go and add CPU-specific optimizations. Measure which parts of your program are too slow and optimize only those. Anything else is a waste of your precious programmer time.

By the way, many micro-optimizations are the same across CPU types, so if you take a program written for a different CPU, it will perform quite similarly. The remaining difference isn't worth your time (again, unless you have measured it to be too slow for your use case).

Another side note: compilers actually are able to perform many micro-optimizations automatically, including converting an up-counting loop to a down-counting one. In which cases and on which CPU this is supported is highly compiler- and version-dependent, however.
Thanks Sam for your reply - Fair enough!
Felix von Leitner (aka. Fefe) has given a nice talk on the problem. ftp://media.ccc.de/events/camp2007/video/m4v/cccamp07-en-1952-Know_your_compiler.m4v
IA wrote:
> "The compiler in or before the year 2000 did not know how to transfer large structs by parameter efficiently"

I have not seen any significant pressure to optimize this, in part because people know to avoid it (yes, this is a self-referential argument) and in part because passing structures by reference is a lot more efficient and likely will be for some time; a sketch follows at the end of this post. I'd guess that until someone implements a copy-on-write mechanism for this, it won't change soon.

> "How could any optimizer reduce a parameter list"
>
> - Group them into structures, and pass a pointer to this structure rather than passing them as individual arguments?

The function call mechanism is defined by the ABI. This is usually a long-term definition, not to be easily changed by some compiler author. While it sometimes happens, do not count on ABI changes within the lifetime of a given architecture. Such changes are more likely when a new architecture is introduced, or when an existing architecture changes significantly.

> - No... the compiler won't change the instruction set, but can it (for example, and where possible) replace a count-up loop with a count-down one to make use of the "free" comparison with zero?

GCC has already done this for ages.

> - Finally, if I have C/C++ source code that runs efficiently on x86, do I need to consider modifying the source code if I need to recompile it to run efficiently on ARM?

Generally not, unless you demand the maximum achievable performance instead of acceptable performance. Not in the case of those architectures. Things are somewhat different when we refer to 8-bit processors like PIC, 8051, and AVR, though, since these often have deficiencies in implementing certain C language elements and concepts.

There are limits. When the absolute maximum speed is to be achieved and the original code was sort of hand-tuned by looking at the generated code, it may still require some hand tuning for a different machine. Compilers are not perfect. There are also effects outside the compiler's scope which have to be considered when writing efficient code, and the details to optimize for can differ between machines, even those sharing the same fundamental architecture. An example is closely coupled multiprocessing, which has to account for dirty stuff like cache line sizes and coherency mechanisms.
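To put the by-value vs. by-reference difference into a minimal sketch (the struct and function names are mine; the calling behaviour assumed is that of the ARM AAPCS):

    typedef struct {
        int data[64];   /* 256 bytes: far too large to fit in registers */
    } big_t;

    /* By value: under the AAPCS the caller copies all 256 bytes to the
     * stack for the call, and the callee works on that copy. */
    int sum_by_value(big_t b);

    /* By reference: only one pointer travels, in a single register. */
    int sum_by_ref(const big_t *b)
    {
        int s = 0;
        for (int i = 0; i < 64; ++i)
            s += b->data[i];
        return s;
    }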
> - Finally, if I have C/C++ source code that runs efficiently on x86, do I need to consider modifying the source code if I need to recompile it to run efficiently on ARM?

If you can find a meaningful definition of "efficiently" which accounts for the platform differences between a 4 GHz quad-core x86 and a single-core 500 MHz (or so) ARM, then no. In case you have hand-optimized inline assembler for the x86 platform, well, I would consider staying with x86...

Oliver
Oliver wrote:
> If you can find a meaningful definition of "efficiently" which accounts for the platform differences between a 4 GHz quad-core x86 and a single-core 500 MHz (or so) ARM, then no.

While the top ARM implementations are not even remotely within range of the top Intel chips, especially with regard to single-thread performance and overall throughput, things are different in the mobile world, where the various quad-core ARMs are roughly comparable to Intel's Atom products.
Sam P. and especially A.K. (with the unpronounceable nick "prx" :-)) made some important points which I agree with. I must confess that, while thinking about your (and A.K.'s) arguments, I found some proposals I would seriously decline, some whose positive impact I doubt (like collecting parameters into structures), and some I find inconvenient (like loop order reversal), and I wonder how "any" optimization makes sense at all. :-)

For example, I myself missed a certain point which I consider relatively important: I prefer what I (literally) wrote over what might be more efficient. This is relevant in two of the cases you mentioned:

Reversing the loop order would also reverse the sequence of operations. In case I rely on a "side effect", such as accessing registers in a certain order (as with some registers of the AVRs), I desperately don't want the compiler to "optimize", because that would lead to defective behavior (see the sketch at the end of this post).

The same is relevant in the case of passing structures and the member reordering you mentioned. If I communicate those structures, I do not want the compiler to reorder the members either. (Strictly speaking, C guarantees that struct members keep their declared order; only the padding between them is implementation-defined, so a conforming compiler could not reorder them anyway.)

Optimization is not a "silver bullet". It is just a tool with its quirks. It is advisable to see optimization as something with two faces (like Jekyll and Hyde). It may help, but it may also be evil. Additionally, one may (as I have) run into practical issues with certain compilers (I prefer to switch off optimizations on IAR for the STM ARMs).

Furthermore, I have found that many programmers (some very experienced, like Karl Heinz, whom you may meet quite often here) state that a program must function equally well with and without optimizations. I have said that myself. But this is merely a hint for a beginner. An experienced programmer should (in my opinion) treat it as a rule of thumb rather than a general rule. It may give hints when optimization affects the program's behavior, but it is not negative by itself.

But to come back to your question: I think it is a bit disproportionate to consider the degree to which compilers optimized in 2000 as "incomplete" or unsophisticated in comparison to today's optimizers. It may be true for some minor, very specific issues, but not in general.

Very interesting discussion. Good night. :-)
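Here is the register-order sketch I promised (the addresses and register names are invented, not any real device); marking the registers volatile is what forbids the compiler from reordering or removing the accesses:

    #include <stdint.h>

    /* Hypothetical memory-mapped registers; the addresses are invented. */
    #define REG_HIGH (*(volatile uint8_t *)0x0040)
    #define REG_LOW  (*(volatile uint8_t *)0x0041)

    /* Suppose a 16-bit peripheral register must be written high byte
     * first. Because both accesses are volatile, the compiler must
     * preserve exactly this order and must not merge or drop the stores. */
    static void write_reg16(uint16_t value)
    {
        REG_HIGH = (uint8_t)(value >> 8);
        REG_LOW  = (uint8_t)(value & 0xFFu);
    }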
Hmm wrote:
> Reversing the loop order would also reverse the sequence of operations.

Reversing the counter does not necessarily reverse the order of operations, as that would be a bit too aggressive. If you look at GCC's loop optimizations, you may occasionally find a loop counter which loses its role as the array index. Example:
    for (i = k; i < n; ++i)
        g(a[i]);
may translate into
    .L3:
        ldr r0, [r5, #4]!
        bl  g
        add r4, r4, #1
        cmp r4, r6
        bne .L3
where the loop counter (r4) is disconnected from its role as array index, because an incremented pointer register (r5) is used instead. In this code, r4 is therefore free to count down from n-k to 0, as in
    .L3:
        ldr  r0, [r5, #4]!
        bl   g
        subs r4, r4, #1
        bne  .L3
Note that optimizations roughly like these may also help in a more subtle but sometimes very effective way when the implementation uses deep out-of-order processing instead of a simple Cortex-M pipeline, because they separate data and control flows which could otherwise be linked, creating unnecessary dependencies.
>Reversing the counter does not necessarily reverse the order of operations, ...
I see. That's a good point, A.K.!
I should have another look at the GCC source.
> Felix von Leitner (aka. Fefe) has given a nice talk on the problem.
Thanks Christian, but the video does not play. I get a still image with
the title of the talk, and there is no audio. Was anybody able to play
the video?
Thanks A. K. for your detailed reply!
Thanks Oliver... I agree with A. K. on his reply to you
> I prefer what I (literally) wrote over what might be more efficient.

Agreed... readability of source code seems to be more important than low-level optimization, as compilers are smart enough nowadays.

> Reversing the loop order would also reverse the sequence of operations. In case I rely on a "side effect", such as accessing registers in a certain order (as with some registers of the AVRs), I desperately don't want the compiler to "optimize", because that would lead to defective behavior.

Besides the example that A.K. has shown, there are some cases where you may even need to rewrite the loop by incrementing a pointer rather than using an index (i.e. avoid using a[i]); see the sketch below. This enables generating more efficient code by leveraging the post-increment addressing mode of ARM load/store instructions (hence saving the increment instruction). I invite you to take a look at the book that I mentioned in my original post. It has many such tricks, which I have tried myself. I have to admit that many of them have turned out to be useless once the compiler is set to a high level of optimization, but it is sometimes worth knowing what tricks the compiler may do.

> a program must function as well with and without optimizations

Sure, functionality should be guaranteed. Optimizations may just reduce the memory footprint/bandwidth and increase throughput.
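A minimal sketch of the rewrite I mean (g() is a placeholder, as in A.K.'s example above):

    void g(int x);  /* placeholder for the loop body's work */

    /* Indexed form: the compiler sees a[i]. */
    void indexed(const int *a, int k, int n)
    {
        for (int i = k; i < n; ++i)
            g(a[i]);
    }

    /* Pointer form: the ++p may map onto ARM's post-increment
     * addressing, e.g. "ldr r0, [r5], #4", saving a separate add. */
    void pointered(const int *a, int k, int n)
    {
        for (const int *p = a + k, *end = a + n; p < end; ++p)
            g(*p);
    }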
All,

After this fruitful discussion, I have come to the conclusion that I'd better focus on the algorithm, where there is higher potential for performance improvement. It is still a plus to know how to optimize source code at the low level. This may help in cases where a piece of code is detected to be a bottleneck, and optimizing it at the low level becomes a requirement.

I confess that I have always been a passive participant in forums (I refer to forums a lot, but I have almost never posted in any). I enjoyed the fact that I could open a discussion with a bunch of experts, and get educated as well as contribute to others. I should do that more often ;)

Thank you, and all the best!
Christian, I was able to play the video with VLC. Media player was not playing the audio. Will watch it some time soon. Thanks!
IA wrote:
> may even need to rewrite the loop by incrementing a pointer rather than using an index (i.e. avoid using a[i]). This enables generating more efficient code

Or less efficient code, depending on the implementation of the processor. If the increment operation is part of the load operation, it forces the loads in successive iterations to depend on each other, whereas they are independent if indexed addressing is used (sketched below). That could make a significant difference on an out-of-order core.

This is exactly one of those optimizations which may turn out to be de-optimizations when you want to run the code on a different machine later, maybe even within the same basic architecture. With today's compilers, better to use indexed operations in C and leave that kind of optimization to the compiler, which is (hopefully) designed to optimize for the actual target hardware and avoids some pitfalls. The example above shows that GCC already optimizes the way you would have done by hand, while it does not on x86, which has no autoincrement addressing mode but a cheap scaled index instead.
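In source terms, the two forms being compared look roughly like this (my sketch; the instruction patterns in the comments are illustrative):

    /* Pointer form: with autoincrement (e.g. "ldr r0, [r5], #4"),
     * each load's address comes from the previous iteration's pointer
     * update, creating a serial dependency chain through the pointer. */
    int sum_ptr(const int *p, const int *end)
    {
        int s = 0;
        while (p < end)
            s += *p++;
        return s;
    }

    /* Indexed form: with a scaled index (e.g. "ldr r0, [r5, r4, lsl #2]"),
     * successive load addresses depend only on the cheap counter update,
     * so an out-of-order core can overlap the loads. */
    int sum_idx(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }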
A.K.,
> With today's compilers, better to use indexed operations in C and leave that kind of optimization to the compiler, which is (hopefully) designed to optimize for the actual target hardware and avoids some pitfalls. The example above shows that GCC already optimizes the way you would have done by hand, while it does not on x86, which has no autoincrement addressing mode but a cheap scaled index instead.

That is the conclusion I have reached. The chances that the compiler will do the job are high enough to keep me from spending my time on such HW-related low-level optimization details. If I happen to fall into a situation where a piece of code is acting as a bottleneck and has to be optimized, I may need to optimize its source code (based on my knowledge of the HW core) as a step before thinking of rewriting it in assembly.