# [libre-riscv-dev] Libre RISC-V Requirements Specification document

Luke Kenneth Casson Leighton lkcl at lkcl.net
Thu Jan 10 13:04:40 GMT 2019

```On Thu, Jan 10, 2019 at 12:39 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> > > We would run 2 iterations of the lower loop per pipeline stage since that
> > > matches what you need for radix-4 division. The pipeline would be approx
> > 16
> > > stages long.
> >
> >  only 8 stages if you have 3 parallel comparators and do 2 bits at a time.
> >
> no, 16 since there's 2 bits/stage and you need 32 bits

ermm... ermermerm....  input is 32 bits, result is 16 bits.
therefore 16 bits *if the compare is done 1 bit at a time*.

i was referring to an augmented version of the algorithm where the
compare phase takes *two* bits (i.e. does the operation in Base 4).
this requires 3 parallel comparators.

i know this algorithm, as i learned it in Base 10 when i was
around... 15 i think :)  it took me another 20+ years to think of the
base 2 version (duh), which is exactly what you see in that wikipedia
page.

the base 10 version goes like this... let me see if i can remember it.

take the number 1024, and break it down into 2 parts, "10" and "24".

you go, "what number squared is less than 10?"  the answer is, 3.

so, you take 9 from 10 and you are left with "1".  you then "bring
down" the "24" and you put it next to the 1, to get "124".

now, you take the 3, and you double it (this is the 2a in b (2a +b)) to get 60.

now you ask the following questions (these are the "parallel compares"):

"is 60 times 0 less than 124?" (yes, continue)
"is 61 times 1 less than 124?" (yes. continue)
"is 62 times 2 less than 124?" (no! stop here)
"is 63 times 3 less than 124?" (no! you already stopped...)
...
...
"is 68 times 8 less than 124?" (no! you already stopped...)

as you stopped at 62x2 = 124, you put the "2" next to the "3" on the
top line, and the answer is "32".

so in Base 10, you do 9 compares.  you don't need to do 69x9 because
if all others failed, this one absolutely has to be the answer.

moving that algorithm to Base 4, you only need to do 3 parallel
compares to determine the point at which a 2-bit "b" can be "moved"
across the "b times (2a + b)" barrier.

in this way, a 32-bit number can be sqrt'd in only 8 stages, for a
cost of needing 3 comparators and one subtractor, compared to 1
comparator and one subtractor for doing it in binary.

i think.

there *might* be a "multiply by 1 2 or 3" in there, somewhere, which
can be done with a shift and an add.

if we *really* really wanted to do it faster (4 stages), that's
possible too.... however now we need 7 comparators (and maybe a
"multiply by between 1 and 7").

l.

```