Is this possible, that computing square root of 32-bit integer takes about 4000 cycles? I'm using sqrt function (math.h), and str912 mikrocontroller (arm9 core). I tried that 4 cycle/bit C routine: http://www.finesse.demon.co.uk/steven/sqrt.html (just out of curosity because I need floating point result anyway) but still, it got only 2 times quicker. I'm checking number of cycles with timer, which I start just before calculating sqrt, and after that I read its value. Does it really take so long?

Anna Kowalska wrote: > Is this possible, that computing square root of 32-bit integer takes > about 4000 cycles? I'm using sqrt function (math.h), and str912 > mikrocontroller (arm9 core). I tried that 4 cycle/bit C routine: > http://www.finesse.demon.co.uk/steven/sqrt.html (just out of curosity > because I need floating point result anyway) but still, it got only 2 > times quicker. > I'm checking number of cycles with timer, which I start just before > calculating sqrt, and after that I read its value. > Does it really take so long? So if you are running this at maximum speed for that part (96MHz) it is taking about 42 microseconds? That sounds about right to me. sqrt() is a double precision function and you have no FPU on that part, so it will always be slow. If you use the single precision version - sqrtf() - you will probably find that it is about twice the speed because it has to move and process half the amount of memory and a single precision value fits into a single machine word, so there are fewer instructions required to process them. Note if you use C++ and include <cmath> rather than <math.h> than sqrt() is overloaded and will use the single or double precision implementation as necessary determined by its argument type. Note that Newlib is open source, if you want to look at the sqrt or fsqrt implementations they are available, and you can replace them with your own implementations. Simply link object code or a library containing these functions ahead of the standard library and it will override the standard implementation. Or you could of course just give your function a different name, but overriding makes the code more portable; on a platform with an FPU for example, you can simply remove your override to use the native library. If you have the potential for changing the device, the NXP LPC3xxx series include VFP hardware, which will accelerate floating point operations by 5 times in any case, but actually has double and single precision SQRT instructions, which will obviously be far faster that a software sqrt algorithm. The Newlib library will not use it by default, so you will have to provide your own. I posted come code here to do that a while back, but it seems to have been dropped from the forum. I can dig it out if it is of use to you. A perhaps more expensive solution is the Freescale iMX31 ARM11 with VFP. Clifford

Thanks for your reply. I hoped to read I did something wrong because it couldn't take so long ;) Unfortunately I can't use another device so I will have to change my algorithm and decrease the number of square roots, and try to use sqrtf(). (Btw. is there any uniform acceleration of step-motor algorithm without many floating point operations?) I don't get one thing. Why those routines (from url I've posted) are described as 4 cycle/bit, for example? What is it supposed to mean if not that calculating sqrt of 32-bit value takes about 120 cycles (not 2000)?

Anna Kowalska wrote: > I don't get one thing. Why those routines (from url I've posted) are > described as 4 cycle/bit, for example? What is it supposed to mean if > not that calculating sqrt of 32-bit value takes about 120 cycles (not > 2000)? Without knowing your test method, who can tell. There will be a software overhead involved in measuring, which if you are timing just one operation may be significant, and possibly larger than the operation time. It is bet to measure the time for a large number of repeated operations - but don't use a loop or you'll be measuring that overhead too! Use say 128 sequentially repeated operations in a loop (an unrolled loop). That will minimise the loop overhead. measure a large number of iterations (i) of these 128 operations, and divide the measured time by the total number of operations (128 * i). Clifford