Forum: ARM programming with GCC/GNU tools Square root - how many cycles?

 Author: Anna Kowalska (nilee) Posted on: 2008-08-17 14:27

Rate this post
 0 ▲ useful ▼ not useful
Is this possible, that computing square root of 32-bit integer takes
about 4000 cycles? I'm using sqrt function (math.h), and str912
mikrocontroller (arm9 core). I tried that 4 cycle/bit C routine:
http://www.finesse.demon.co.uk/steven/sqrt.html (just out of curosity
because I need floating point result anyway) but still, it got only 2
times quicker.
I'm checking number of cycles with timer, which I start just before
calculating sqrt, and after that I read its value.
Does it really take so long?

 Author: Clifford Slocombe (clifford) Posted on: 2008-08-17 17:20

Rate this post
 0 ▲ useful ▼ not useful
Anna Kowalska wrote:
> Is this possible, that computing square root of 32-bit integer takes
> about 4000 cycles? I'm using sqrt function (math.h), and str912
> mikrocontroller (arm9 core). I tried that 4 cycle/bit C routine:
> http://www.finesse.demon.co.uk/steven/sqrt.html (just out of curosity
> because I need floating point result anyway) but still, it got only 2
> times quicker.
> I'm checking number of cycles with timer, which I start just before
> calculating sqrt, and after that I read its value.
> Does it really take so long?

So if you are running this at maximum speed for that part (96MHz) it is
taking about 42 microseconds? That sounds about right to me. sqrt() is a
double precision function and you have no FPU on that part, so it will
always be slow.

If you use the single precision version - sqrtf() - you will probably
find that it is about twice the speed because it has to move and process
half the amount of memory and a single precision value fits into a
single machine word, so there are fewer instructions required to process
them. Note if you use C++ and include <cmath> rather than <math.h> than
sqrt() is overloaded and will use the single or double precision
implementation as necessary determined by its argument type.

Note that Newlib is open source, if you want to look at the sqrt or
fsqrt implementations they are available, and you can replace them with
containing these functions ahead of the standard library and it will
override the standard implementation. Or you could of course just give
your function a different name, but overriding makes the code more
portable; on a platform with an FPU for example, you can simply remove
your override to use the native library.

If you have the potential for changing the device, the NXP LPC3xxx
series include VFP hardware, which will accelerate floating point
operations by 5 times in any case, but actually has double and single
precision SQRT instructions, which will obviously be far faster that a
software sqrt algorithm. The Newlib library will not use it by default,
so you will have to provide your own. I posted come code here to do that
a while back, but it seems to have been dropped from the forum. I can
dig it out if it is of use to you. A perhaps more expensive solution is
the Freescale iMX31 ARM11 with VFP.

Clifford

 Author: Anna Kowalska (nilee) Posted on: 2008-08-17 21:42

Rate this post
 0 ▲ useful ▼ not useful
Thanks for your reply. I hoped to read I did something wrong because it
couldn't take so long ;) Unfortunately I can't use another device so I
will have to change my algorithm and decrease the number of square
roots, and try to use sqrtf(). (Btw. is there any uniform acceleration
of step-motor algorithm without many floating point operations?)

I don't get one thing. Why those routines (from url I've posted)  are
described as 4 cycle/bit, for example? What is it supposed to mean if
not that calculating sqrt of 32-bit value takes about 120 cycles (not
2000)?

 Author: Clifford Slocombe (clifford) Posted on: 2008-08-17 22:32

Rate this post
 0 ▲ useful ▼ not useful
Anna Kowalska wrote:
> I don't get one thing. Why those routines (from url I've posted)  are
> described as 4 cycle/bit, for example? What is it supposed to mean if
> not that calculating sqrt of 32-bit value takes about 120 cycles (not
> 2000)?

Without knowing your test method, who can tell. There will be a software
overhead involved in measuring, which if you are timing just one
operation may be significant, and possibly larger than the operation
time. It is bet to measure the time for a large number of repeated
operations - but don't use a loop or you'll be measuring that overhead
too!

Use say 128 sequentially repeated operations in a loop (an unrolled
loop). That will minimise the loop overhead. measure a large number of
iterations (i) of these 128 operations, and divide the measured time by
the total number of operations (128 * i).

Clifford

• $formula (LaTeX syntax)$