[libre-riscv-dev] Libre RISC-V Requirements Specification document

Thu Jan 10 11:42:14 GMT 2019

On Thu, Jan 10, 2019 at 1:08 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> Yes, Daniel, feel free to go for it on an FPU (in nmigen, you good with
> that?). Want to write a draft spec first so we all are happy with it, or
> shall I do one? Also it will be a really good standalone project, useful
> for other CPUs. More later.
>
> On Thursday, January 10, 2019, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> >
> > it that way, saving both power and area. We may want to include
> > partitioning all the way to 8 8x8 fma to support low-precision neural
> > networks.
>
>
> I have a friend who would be extremely interested in that.
>
>
> >
> >  Each of the multipliers would need to support signed*signed,
> > signed*unsigned, and unsigned*unsigned.
> > I think it's a good idea to build in the extra shifters to support
> > denormal numbers without slowing down, also allowing us to avoid
> > data-dependent timing allowing the fma units to be useful for
> > cryptography and mitigating spectre-class bugs as well.
> > Notably, avoiding data-dependent timing means we can't short-circuit
> > things like Infinity/NaN or division. It also means that we don't need
> > to have as many pipeline stages that can write to the register file
> > allowing us to not need as many write ports.
>
>
>  It's complicated... I came up with a way to do full "nameless" operand
> forwarding. It bypasses the regfile entirely and cleanly, without needing
> total state destruction and rollback if an exception occurs.
>
You will probably need another read port to read the masked-out elements
from rd since the predicate may change often.
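
To illustrate the point about masked-out elements, here is a minimal C
sketch of a predicated element-wise add. The function name, the 4-element
vector length, and the bitmask form of the predicate are all assumptions
made up for this example, not part of any agreed design. Elements whose
predicate bit is clear must keep the old contents of rd, which is why the
old rd value has to be read alongside the two sources.

```c
#include <assert.h>
#include <stdint.h>

#define VLEN 4 /* assumed number of elements per operation (hypothetical) */

/* Predicated add: rd[i] = rs1[i] + rs2[i] where bit i of pred is set,
 * otherwise rd[i] keeps its previous value. */
void predicated_add(uint32_t rd[VLEN], const uint32_t rs1[VLEN],
                    const uint32_t rs2[VLEN], unsigned pred)
{
    assert((pred >> VLEN) == 0); /* only VLEN predicate bits are valid */

    for (int i = 0; i < VLEN; i++) {
        uint32_t old = rd[i];            /* the extra read of rd */
        uint32_t sum = rs1[i] + rs2[i];  /* the two ordinary source reads */
        rd[i] = ((pred >> i) & 1u) ? sum : old;
    }
}
```

If the result is forwarded as a whole register rather than written
element by element, the merge with the old rd value cannot be skipped,
whatever the predicate happens to be.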

>
> As a result we should be able to get away with 4 32 bit banks of 2R1W, even
> for repeated FMAC, the proviso being there that the src accumulator must be
> the dest of the previous FMAC. This case is the one I worked out how to
> detect.
>
> Btw Daniel, Jacob, the scoreboard OoO system absolutely does not care what
> the pipeline length is, or even if there isn't one. FSQRT and FDIV can
> therefore be done as blocking units, without detrimental consequences.
>
We have to ensure that all the blocking units can be cleared without their
previous state affecting the timing; otherwise they can be used to leak data
from mis-speculation.

For the divider, with either radix-4 or Newton's method we can share most
of the logic between divide, sqrt, and inv-sqrt, so I think we should build
a unified unit.
Division is going to need to be done at least once per pixel, so I think we
will need a pipelined divider or at least several non-pipelined dividers.
We can share the cost of a pipelined divider between 2 cores by having them
issue divides on alternate cycles.
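
As a rough illustration of how much divide, sqrt, and inv-sqrt have in
common in the Newton case, here is a minimal C sketch. It uses doubles
rather than hardware fixed-point, takes the initial estimate x0 from the
caller, and all function names are hypothetical. Both iterations are built
from multiplies and one subtract-from-constant, and sqrt falls out of
inv-sqrt with one extra multiply.

```c
#include <assert.h>

/* One Newton-Raphson step towards 1/d:       x' = x * (2 - d*x)       */
static double recip_step(double x, double d)
{
    return x * (2.0 - d * x);
}

/* One Newton-Raphson step towards 1/sqrt(a): x' = x * (1.5 - a*x*x/2) */
static double rsqrt_step(double x, double a)
{
    return x * (1.5 - 0.5 * a * x * x);
}

/* n / d, computed as n * (1/d). x0 is the caller's estimate of 1/d. */
double nr_divide(double n, double d, double x0, int iters)
{
    assert(d > 0.0 && x0 > 0.0);
    double x = x0;
    for (int i = 0; i < iters; i++) /* fixed count: no early exit */
        x = recip_step(x, d);
    return n * x;
}

/* 1/sqrt(a). x0 is the caller's estimate of 1/sqrt(a). */
double nr_rsqrt(double a, double x0, int iters)
{
    assert(a > 0.0 && x0 > 0.0);
    double x = x0;
    for (int i = 0; i < iters; i++)
        x = rsqrt_step(x, a);
    return x;
}

/* sqrt(a) = a * (1/sqrt(a)): one more multiply on the same datapath. */
double nr_sqrt(double a, double x0, int iters)
{
    return a * nr_rsqrt(a, x0, iters);
}
```

The iteration count is fixed rather than convergence-tested, which keeps
the latency independent of the operand values. A real unit would also
need correct IEEE rounding, which this sketch ignores.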

One of the sqrt algorithms I am thinking of is:
https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_(base_2)

Code from Wikipedia:
short isqrt(short num) {
    short res = 0;
    short bit = 1 << 14; // The second-to-top bit is set: 1 << 30 for 32 bits

    // "bit" starts at the highest power of four <= the argument.
    while (bit > num)
        bit >>= 2;

    while (bit != 0) {
        if (num >= res + bit) {
            num -= res + bit;
            res += bit << 1;
        }

        res >>= 1;
        bit >>= 2;
    }
    return res;
}

We would run 2 iterations of the lower loop per pipeline stage since that
matches what you need for radix-4 division. The pipeline would be approx 16
stages long. For 64-bit operations, we could feed them through the pipeline
twice and have the pipeline be 64-bits wide.
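
A minimal C sketch of that arrangement, under some assumptions of mine:
the radicand is 64 bits wide, which gives 32 iterations of the lower loop
and therefore 16 stages of 2 iterations each (one way to arrive at the
16-stage figure, not a confirmed design); the upper normalisation loop is
dropped so that every input takes the same number of steps; and the struct
and function names are hypothetical.

```c
#include <stdint.h>

/* State handed from one pipeline stage to the next (hypothetical). */
typedef struct {
    uint64_t num; /* remaining radicand */
    uint64_t res; /* partial result     */
    uint64_t bit; /* current power of 4 */
} sqrt_state;

/* One iteration of the lower loop from the code above. */
static sqrt_state sqrt_iter(sqrt_state s)
{
    if (s.num >= s.res + s.bit) {
        s.num -= s.res + s.bit;
        s.res += s.bit << 1;
    }
    s.res >>= 1;
    s.bit >>= 2;
    return s;
}

/* One pipeline stage: two iterations back to back. */
static sqrt_state sqrt_stage(sqrt_state s)
{
    return sqrt_iter(sqrt_iter(s));
}

/* Software model of the whole pipeline: always exactly 16 stages.
 * Starting bit at 1 << 62 unconditionally is safe because while
 * bit > num the comparison fails and res stays 0. */
uint32_t isqrt64(uint64_t num)
{
    sqrt_state s = { num, 0, (uint64_t)1 << 62 };
    for (int stage = 0; stage < 16; stage++)
        s = sqrt_stage(s);
    return (uint32_t)s.res;
}
```

Because the stage count does not depend on the operand, this form also
fits the requirement of avoiding data-dependent timing.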

Jacob