[libre-riscv-dev] Libre RISC-V Requirements Specification document

Thu Jan 10 11:51:45 GMT 2019

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Jan 10, 2019 at 11:42 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> You will probably need another read port to read the masked-out elements
> from rd since the predicate may change often.

 hm, ok.  well, we'll find out pretty quickly.

> >
> > As a result we should be able to get away with 4 32 bit banks of 2R1W, even
> > for repeated FMAC, the proviso being there that the src accumulator must be
> > the dest of the previous FMAC. This case is the one I worked out how to
> > detect.
> >
> > Btw Daniel, Jacob, the scoreboard OoO system absolutely does not care what
> > the pipeline length is, or even if there isn't one. FSQRT and FDIV can
> > therefore be done as blocking units, without detrimental consequences.
> >
> We have to ensure that all the blocking units can be cleared without their
> previous state affecting the timing otherwise you can use them to leak data
> from mis-speculation.

 ok so now i get it.  ok, so that's pretty straightforward.

> For the divider, for both radix-4 and newton's algorithm we can share most
> of the logic between divide, sqrt, and inv-sqrt, so I think we should build
> a unified unit.
> Division is going to need to be done at least once per pixel, so I think we
> will need a pipelined divider or at least several non-pipelined dividers.
> We can share the cost of a pipelined divider between 2 cores by having them
> issue divides on alternate cycles.
>
> One of the sqrt algorithms I am thinking of is:
> https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_(base_2)

 hey, that's the method i was talking about that just basically does a
compare and an add!  it's based on (a + b)^2 = a^2 + b(2a + b), where a
is the root found so far and b is the next trial bit, and you just
settle one bit of the root at a time.
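
 a minimal python sketch of that one-bit-at-a-time method, following the
structure of the wikipedia routine linked above.  it is an integer model
only: the function name, the fixed `bits` width and the fixed iteration
count are assumptions for illustration, not anything from the actual
design.  the identity in play is (a + b)^2 = a^2 + b(2a + b): `rem`
tracks n - a^2, and each step tests whether the b(2a + b) term still fits.

```python
def isqrt_bitwise(n: int, bits: int = 32) -> int:
    """Floor square root of an unsigned integer, one result bit per iteration.

    Hypothetical software model; `bits` must be even and n must fit in it.
    """
    assert bits % 2 == 0 and 0 <= n < (1 << bits)
    rem = n                  # running remainder: n minus (root so far)^2
    root = 0                 # root so far, kept pre-shifted
    bit = 1 << (bits - 2)    # trial bit, always a power of 4
    while bit:
        # root + bit plays the role of b*(2a + b), already scaled,
        # so no multiplier is needed.
        trial = root + bit
        root >>= 1
        if rem >= trial:     # the one compare per iteration
            rem -= trial     # the one subtract per iteration
            root += bit      # accept this bit of the result
        bit >>= 2
    return root


if __name__ == "__main__":
    for value in (0, 1, 24, 25, 26, 1 << 20, (1 << 32) - 1):
        print(value, isqrt_bitwise(value))
```

 the loop always runs bits/2 times regardless of the input value, which
is the property a fixed-length pipeline would want; a software version
would normally skip leading zero bit-pairs instead.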

 obviously you can do 2 bits at a time, however you need 3 comparators
(one for 1x, one for 2x, one for 3x).  and you can do 3 bits at a
time, however there you need 7 comparators (4 bits at a time would
need 15).
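
 the 2-bits-at-a-time variant with its 3 comparators could be sketched
like this.  again an integer-only model under assumptions: the function
name and the 32-bit default are hypothetical, and the if/elif chain
stands in for three comparators that hardware would evaluate in parallel.
each step picks the largest digit d in 0..3 such that d*(8*root + d)
still fits in the remainder, which is the same (a + b)^2 = a^2 + b(2a + b)
identity with a = 4*root and b = d.

```python
def isqrt_radix4(n: int, bits: int = 32) -> int:
    """Floor square root, two result bits (one radix-4 digit) per iteration.

    Hypothetical software model; `bits` must be a multiple of 4.
    """
    assert bits % 4 == 0 and 0 <= n < (1 << bits)
    root = 0   # root of the input bits consumed so far
    rem = 0    # consumed input bits minus root^2
    for shift in range(bits - 4, -1, -4):
        # Bring down the next four input bits.
        rem = (rem << 4) | ((n >> shift) & 0xF)
        # Candidate subtrahends for digit 1, 2 and 3.
        t1 = 1 * (8 * root + 1)
        t2 = 2 * (8 * root + 2)
        t3 = 3 * (8 * root + 3)
        # Three comparisons; the largest candidate that fits wins.
        if rem >= t3:
            digit, sub = 3, t3
        elif rem >= t2:
            digit, sub = 2, t2
        elif rem >= t1:
            digit, sub = 1, t1
        else:
            digit, sub = 0, 0
        rem -= sub
        root = (root << 2) | digit
    return root


if __name__ == "__main__":
    for value in (0, 15, 16, 17, 99, 100, (1 << 32) - 1):
        print(value, isqrt_radix4(value))
```

 with a 32-bit input this loop runs 8 times, against 16 for the
one-bit-at-a-time form.  in general k bits per step needs 2^k - 1
comparators, since every nonzero digit value gets its own comparison.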

> We would run 2 iterations of the lower loop per pipeline stage since that
> matches what you need for radix-4 division. The pipeline would be approx 16
> stages long.

 only 8 stages if you have 3 parallel comparators and do 2 bits at a time.

 l.