
Forum: ARM programming with GCC/GNU tools
Do we still need to write "efficient" C code targeting ARM? Or do compilers do the job?


by IA (Guest)

Hello there,

I have been reading some articles, and even books, about C code style 
intended to guarantee efficient machine code on ARM platforms. Here is 
an example:

"ARM System Developer's Guide: Designing and Optimizing System Software"

Some of those articles date back to the early 2000s, and I am not sure 
how sophisticated compilers are now. Does anyone know whether such 
"tricks" (e.g. using count-down loops, avoiding huge structures as 
function arguments, avoiding too many function arguments, etc.) still 
matter during development? Or does a typical compiler now do it all?

If the compiler now does it all, does this mean that we can take, for 
instance, C/C++ source code that was targeting an x86 architecture, 
recompile it with a "good" compiler targeting ARM, and end up with 
efficient code execution on the ARM platform?

Thanks in advance!

by Hmm (Guest)

>Or does a typical compiler now do it all?

I'm sorry to say it, but I think this question lacks some prior 
reflection.

The practices you listed are NOT recommended just because compilers in 
or before the year 2000 didn't know how to efficiently pass large 
structs as parameters: even in the year 2012 that is inherently 
inefficient and can by NO means ever be optimized away.

The same goes for parameter lists. Think about it: how could any 
optimizer reduce a parameter list (leaving aside simple cases where a 
parameter is never used, or can be derived by a comparatively simple 
rule from another parameter)? Just by dropping one parameter? No 
optimizer, now or in any future however distant, can do that. How could 
it? That would seriously affect the algorithm itself, for which the 
program author is responsible, not the compiler.

The third example is a bit more specific. Most processors have faster 
instructions for a conditional jump whose condition is a comparison 
with zero than for one whose condition involves an arbitrary integer. 
Given that, count-down loops are recommended in every language for 
every compiler. The reason lies in the executing processor, not in the 
compiler. How could any compiler modify the instruction set? It 
couldn't. It can only select different instructions, and (for reasons 
you should explore yourself) there are no such instructions without a 
speed penalty.
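
To make this concrete, a minimal sketch of my own (not from any book) 
showing the two loop shapes; in the second version the decrement itself 
sets the condition flags, so no separate compare against n is needed:

        /* count-up: the compiler must compare i against n each pass */
        void clear_up(unsigned char *buf, unsigned n)
        {
            for (unsigned i = 0; i < n; i++)
                buf[i] = 0;
        }

        /* count-down: "n--" decrements and tests against zero in one go
           (on ARM a SUBS sets the Z flag, making the compare free) */
        void clear_down(unsigned char *buf, unsigned n)
        {
            while (n--)
                buf[n] = 0;
        }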


Optimization as you understand it was never an objective. It never 
aimed to optimize away properties inherent to the language. Why should 
it? I wish I could put it more succinctly, but that's not easy.

Optimization mainly attempts to find code sequences which are 
duplicated, or logically (mathematically) pointless, and may be 
omitted. I don't want to go into depth, because that's a very, very 
wide field, but I mean something like an attempt to multiply an 
arbitrary integer a by one. Optimizers will remove the multiplication 
by one.
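
For instance, a trivial sketch of my own of such a removable operation:

        /* mathematically a no-op: any optimizing compiler reduces this
           to a plain "return a" */
        int scale(int a)
        {
            return a * 1;
        }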

Kind regards...

by IA (Guest)

Thanks for your detailed reply... Please let me place my arguments 
below (keeping in mind that, after all, I am not an expert programmer).


"The compiler in or before the year 2000 didn't know how to efficiently
transfer great structs by parameter"

- I meant (pointers to) large structures. AFAIK, it is recommended to 
keep structures within a reasonable size for more efficient 
dereferencing of internal elements (fewer memory loads). For the same 
reason, it is also advisable to place smaller structure elements first 
(large elements at the beginning of the structure may cause inefficient 
access to the smaller elements after them). Also, different orderings 
of the elements within a structure result in different overall sizes of 
the structure in memory (different packing/padding). Couldn't a 
compiler split large structures into a hierarchy of structures? Or 
reorder the elements of a struct for a more efficient memory 
footprint/bandwidth?
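
To illustrate the packing/padding point, a small made-up example 
(member names are arbitrary); assuming a typical 32-bit ARM ABI where 
an unsigned int must be 4-byte aligned, the same members in a different 
order give different structure sizes:

        struct loose {          /* typically sizeof == 12 */
            char     tag;       /* 1 byte + 3 bytes padding */
            unsigned value;     /* 4 bytes, must be 4-byte aligned */
            char     flag;      /* 1 byte + 3 bytes tail padding */
        };

        struct tight {          /* typically sizeof == 8 */
            unsigned value;     /* 4 bytes */
            char     tag;       /* 1 byte */
            char     flag;      /* 1 byte + 2 bytes tail padding */
        };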


"How could any optimizer reduce are parameter list"

- Group them into structures, and pass a pointer to this structure 
rather than passing them as individual arguments?
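
Something like this hypothetical signature, just to make the idea 
concrete (on the 32-bit ARM ABI only the first four integer arguments 
travel in registers r0-r3; the rest spill to the stack):

        /* six arguments: two of them end up on the stack */
        void draw_rect(int x, int y, int w, int h, int color, int border);

        /* grouped: a single pointer is passed in r0 */
        struct rect_args { int x, y, w, h, color, border; };
        void draw_rect2(const struct rect_args *args);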


"Under this condition it is every language for every compiler 
recommended to use count-down loops. The reason lies in the executing 
processor not in the compiler. How could any compiler modify the 
instruction set?"

- No... the compiler won't change the instruction set, but can't it 
(for example, and where possible) replace a count-up loop with a 
count-down one to make use of the "free" comparison with zero?


- Finally, if I have C/C++ source code that runs efficiently on x86, do 
I need to consider modifying the source code if I need to recompile it 
to run efficiently on ARM?

by Sam P. (Guest)

The rule is simple: don't write optimized code.

Use efficient algorithms, but do not bother with processor-specific 
optimizations. Micro-optimizations cost a lot of time and make your code 
less readable. The time saved is better invested in finding a faster 
algorithm for your problem. That gives more speed-up.

There are important exceptions to that. But your very first thought 
should always be: is what I'm about to write still easy to understand 
after 12 hours of tiresome coding? And by someone other than me?

If that is not the case, you will waste a lot of your time and introduce 
bugs, and your little optimizations may even become pointless.

Now to the exceptional cases: start optimizing when your program is too 
slow or uses too much RAM/flash/whatever. Not earlier. Even then, first 
try to find a better algorithm. Only after you know for sure that no 
one has ever devised a faster way of doing what you want to do should 
you go and add CPU-specific optimizations. Measure which parts of your 
program are too slow and optimize only those. Anything else is a waste 
of your precious programmer time.
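
A crude sketch of that advice using nothing but the standard library 
(hot_function is a hypothetical stand-in for the code under suspicion):

        #include <stdio.h>
        #include <time.h>

        static void hot_function(void)
        {
            /* stand-in for the code you suspect is slow */
        }

        int main(void)
        {
            clock_t t0 = clock();
            for (long i = 0; i < 100000L; i++)
                hot_function();
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("100000 calls: %.3f s\n", secs);
            return 0;
        }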

By the way, many micro-optimizations are the same across CPU types, so 
if you take a program written on a different CPU, it will perform quite 
similarly. The remaining difference isn't worth your time (again, 
unless you measured it to be too slow for your use case).

Another side note: Compilers actually are able to perform many 
micro-optimizations automatically, including converting an up-counting 
loop to a down-counting one. In what cases and on which CPU this is 
supported is highly compiler- and version-dependent, however.

by IA (Guest)

Thanks Sam for your reply - Fair enough!

by Christian B. (casandro)

Felix von Leitner (aka Fefe) has given a nice talk on the problem.

ftp://media.ccc.de/events/camp2007/video/m4v/cccamp07-en-1952-Know_your_compiler.m4v

by (prx) A. K. (prx)

IA wrote:
> "The compiler in or before the year 2000 didn't know how to efficiently
> transfer great structs by parameter"

I have not seen any significant pressure to optimize this, partly 
because people know to avoid it (yes, this is a self-referential 
argument) and partly because passing structures by reference is a lot 
more efficient and likely will remain so for some time. I'd guess that 
until someone implements a copy-on-write mechanism for this, it won't 
change anytime soon.
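
In C terms, the difference meant here, with an illustrative size of my 
own choosing:

        struct sample { int data[64]; };           /* 256 bytes */

        int sum_by_value(struct sample s);         /* every call copies
                                                      all 256 bytes */
        int sum_by_ref(const struct sample *s);    /* every call passes
                                                      one pointer */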

> "How could any optimizer reduce are parameter list"
>
> - Group them into structures, and pass a pointer to this structure
> rather than passing them as individual arguments?

The function call mechanism is defined by the ABI. That is usually a 
long-term definition, not something easily changed by a compiler 
author. While it sometimes happens, do not count on ABI changes within 
the lifetime of a given architecture. They are more likely when a new 
architecture is introduced, or an existing architecture changes 
significantly.

> - No... the compiler won't change the instruction set, but can't it
> (for example, and where possible) replace a count-up loop with a
> count-down one to make use of the "free" comparison with zero?

GCC has already been doing this for ages.

> - Finally, if I have C/C++ source code that runs efficiently on x86, do
> I need to consider modifying the source code if I need to recompile it
> to run efficiently on ARM?

Generally not, unless you demand the maximum achievable performance 
rather than acceptable performance. Not in the case of those 
architectures. Things are somewhat different for 8-bit processors like 
PIC, 8051 and AVR, though, since these often have deficiencies 
implementing certain C language elements and concepts.

There are limits. When the absolute maximum speed is to be achieved and 
the original code was hand-tuned by inspecting the generated code, it 
may still require some hand-tuning for a different machine. Compilers 
are not perfect.

Also, there are effects outside the compiler's scope which have to be 
considered when writing efficient code, and the details to optimize for 
can differ between machines, even ones sharing the same fundamental 
architecture. An example is closely coupled multiprocessing, which has 
to account for dirty details like cache line sizes and coherency 
mechanisms.
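
A sketch of the kind of machine-dependent detail meant here, assuming a 
64-byte cache line (the real line size varies between implementations): 
two counters written by different cores should not share a line, or 
that line will ping-pong between the cores' caches.

        struct counters {
            volatile unsigned long a;              /* written by core 0 */
            char pad[64 - sizeof(unsigned long)];  /* assumed line size */
            volatile unsigned long b;              /* written by core 1,
                                                      now on its own line */
        };

In practice the structure itself would also need to be aligned to a 
line boundary, which is again a machine-specific detail.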

by Oliver (Guest)

> - Finally, if I have C/C++ source code that runs efficiently on x86, do
> I need to consider modifying the source code if I need to recompile it
> to run efficiently on ARM?

If you can find a meaningful definition of "efficiently" which accounts 
for the platform differences between a 4 GHz quad-core x86 and a 
single-core 500 MHz (or so) ARM, then no.

If you have hand-optimized inline assembler for the x86 platform, well, 
I would consider staying with x86...

Oliver

by (prx) A. K. (prx)

Oliver wrote:
> If you can find a meaningful definition of "efficiently" which
> accounts for the platform differences between a 4 GHz quad-core x86
> and a single-core 500 MHz (or so) ARM, then no.

While the top ARM implementations are not even remotely within range of 
top Intel chips, especially with regard to single-thread performance 
and overall throughput, things are different in the mobile world, where 
the various quad-core ARMs are roughly comparable to Intel's Atom 
products.

by Hmm (Guest)

Sam P. and especially A. K. (with the unpronounceable nick "prx" :-)) 
made some important points which I agree with.

I must confess that, while thinking about your (and A. K.'s) arguments, 
I found some proposals I would seriously decline, some whose positive 
impact I doubt (like collecting parameters into structures), and some I 
find inconvenient (like loop order reversal), and I wonder whether 
"any" optimization makes sense at all. :-)

For example, I myself missed a point which I consider relatively 
important: I prefer what I literally wrote over what might be more 
efficient.

This is relevant in two of the cases you mentioned:
Reversing the loop order would also reverse the sequence of operations. 
If I rely on a "side effect", such as accessing registers in a certain 
order (as with some registers of the AVRs), I desperately don't want 
the compiler to "optimize", because that would lead to defective 
behavior.
The same applies to passing structures and the optimization you 
mentioned of reordering the members. If I communicate those structures 
externally, I do not want the compiler to reorder them at all. 
(Strictly speaking, C does guarantee that members are laid out in 
declaration order; it is only the padding between them that is 
implementation-defined, which makes this relevant in my opinion.)

Optimization is not a "silver bullet". It's just a tool with its 
quirks. It is advisable to see optimization as something with two faces 
(like Jekyll and Hyde): it may help, but it may also be evil.

Additionally, one may (as I have) experience practical issues with 
certain compilers (I prefer to switch off optimizations in IAR for the 
STM ARMs). Furthermore, I have found that many (in part very 
experienced) programmers, like Karl Heinz, whom you may meet here quite 
often, state that a program must work equally well with and without 
optimizations. I used to say that too, but it is merely a hint for 
beginners. An experienced programmer should (in my opinion) treat it as 
a rule of thumb rather than a general rule. It can give hints when 
optimization affects the program's behavior, but it is not negative in 
itself.

But to come back to your question: I think it is a bit disproportionate 
to consider the degree to which compilers optimized in 2000 
"incomplete" or unsophisticated compared with today's optimizers. That 
may be true for some minor, very specific issues, but not in general.

Very interesting discussion. Good night. :-)

by (prx) A. K. (prx)

Hmm wrote:
> Reversing the loop order would also reverse the sequence of operations.

Reversing the counter does not necessarily reverse the order of 
operations, as that would be a bit too aggressive. If you look at GCC's 
loop optimizations, you may occasionally find a loop counter which 
loses its role as array index.

Example:

        for (i = k; i < n; ++i)
                g(a[i]);

may translate into

.L3:
        ldr     r0, [r5, #4]!
        bl      g
        add     r4, r4, #1
        cmp     r4, r6
        bne     .L3

where the loop counter (r4) is disconnected from its role as array 
index because an incremented pointer register (r5) is used instead. In 
this code, r4 is therefore free to count from n-k down to 0, as in

.L3:
        ldr     r0, [r5, #4]!
        bl      g
        subs    r4, r4, #1
        bne     .L3

Note that optimizations roughly like these may also help in a more 
subtle but sometimes very effective way when the implementation uses 
deep out-of-order processing instead of a simple Cortex-M pipeline, 
because they separate data and control flows which could otherwise be 
linked, creating unnecessary dependencies.

by Hmm (Guest)

>Reversing the counter does not necessarily reverse the order of operations, ...
I see. That's a good point, A. K.!
I should have another look at the GCC source.

by IA (Guest)

> Felix von Leitner (aka. Fefe) has given a nice talk on the problem.

Thanks Christian, but the video does not play. I get a still image with 
the title of the talk, and there is no audio. Was anybody able to play 
the video?

by IA (Guest)

Thanks A. K. for your detailed reply!

by IA (Guest)

Thanks Oliver... I agree with A. K.'s reply to you.

by IA (Guest)

> I prefer what I literally wrote over what might be more efficient.

Agree... Readability of source code seems to be more important than 
low-level optimization, as compilers are smart enough nowadays.

> Reversing the loop order would also reverse the sequence of
> operations. If I rely on a "side effect", such as accessing registers
> in a certain order (as with some registers of the AVRs), I
> desperately don't want the compiler to "optimize", because that would
> lead to defective behavior.

Besides the example that A. K. has shown, there are cases where you may 
even want to rewrite the loop to increment a pointer rather than use an 
index (i.e. avoid a[i]). This enables more efficient code by leveraging 
the post-increment addressing mode of ARM load/store instructions 
(hence saving the increment instruction).
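
For example, my own sketch of the two styles:

        /* indexed: the address is recomputed from a and i each pass */
        unsigned sum_indexed(const unsigned *a, unsigned n)
        {
            unsigned s = 0;
            for (unsigned i = 0; i < n; i++)
                s += a[i];
            return s;
        }

        /* pointer walk: *a++ can map to an ARM LDR with post-increment */
        unsigned sum_pointer(const unsigned *a, unsigned n)
        {
            unsigned s = 0;
            const unsigned *end = a + n;
            while (a != end)
                s += *a++;
            return s;
        }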

I invite you to take a look at the book I mentioned in my original 
post. It contains many such tricks, which I have tried myself. I have 
to admit that many of them turned out to be useless once the compiler 
is set to a high optimization level, but it is sometimes worth knowing 
what tricks the compiler may play.

> a program must work equally well with and without optimizations

Sure, functionality should be guaranteed. Optimizations should only 
reduce the memory footprint/bandwidth and increase throughput.

by IA (Guest)

All,

After this fruitful discussion, I have come to the conclusion that I'd 
better focus on the algorithm, where there is higher potential for 
performance improvement. It is still a plus to know how to optimize 
source code at the low level. This may help in cases where a piece of 
code turns out to be a bottleneck and optimizing it at the low level 
becomes a requirement.

I confess that I have always been a passive participant in forums (I 
consult forums a lot, but I have almost never posted in any). I enjoyed 
the fact that I could open a discussion with a bunch of experts, and 
get educated as well as contribute to others. I should do that more 
often ;)

Thank you, and all the best!

by IA (Guest)

Christian, I was able to play the video with VLC; Media Player was not 
playing the audio. I will watch it sometime soon. Thanks!

by (prx) A. K. (prx)

IA wrote:
> may even want to rewrite the loop to increment a pointer rather than
> use an index (i.e. avoid a[i]). This enables more efficient code

Or less efficient code, depending on the implementation of the 
processor. If the increment operation is part of the load operation, it 
forces loads in successive iterations to depend on each other, whereas 
they are independent if indexed addressing is used. That could make a 
significant difference on an out-of-order core.

This is exactly one of those optimizations which may turn out to be 
de-optimizations when you later want to run the code on a different 
machine, maybe even one within the same basic architecture.

With today's compilers, better to use indexed operations in C and leave 
that kind of optimization to the compiler, which is (hopefully) 
designed to optimize for the actual target hardware and avoids some 
pitfalls. The example above shows that GCC already optimizes the way 
you would have done by hand, while it does not for x86, which has no 
autoincrement addressing mode but a cheap scaled index instead.

by IA (Guest)

A.K.,

> With today's compilers, better to use indexed operations in C and
> leave that kind of optimization to the compiler, which is (hopefully)
> designed to optimize for the actual target hardware and avoids some
> pitfalls. The example above shows that GCC already optimizes the way
> you would have done by hand, while it does not for x86, which has no
> autoincrement addressing mode but a cheap scaled index instead.

That is the conclusion I have reached. The chances that the compiler 
will do the job are high enough to keep me from spending my time on 
such hardware-related low-level optimization details. If I happen to 
run into a situation where a piece of code is a bottleneck and it has 
to be optimized, I may need to tune its source code (based on my 
knowledge of the hardware core) as a step before thinking of rewriting 
it in assembly.
