[libre-riscv-dev] GPU design

Fri Dec 7 12:32:59 GMT 2018

On Fri, Dec 7, 2018, 03:37 lkcl <lkcl at libre-riscv.org> wrote:

> On Fri, Dec 7, 2018 at 10:28 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > On Fri, Dec 7, 2018 at 1:19 AM Luke Kenneth Casson Leighton
> > <lkcl at lkcl.net> wrote:
>
> > >  the 6600 scoreboard rules - which are awesomely simple and actually
> > > involve D-Latches (3 gates) *not* flip-flops (10 gates) can be
> > > executed in parallel because there will be no overlap between
> > > stratified registers.
> > >
> > Yeah, I was a little surprised when I heard that. I do think, however,
> > that we should use flip-flops instead of latches since it makes it much
> > easier to design (not having to worry about glitches and stuff) and
> > doesn't use much more resources.
>
>  well, yosys has an option to disable generation of d-latches
> (substituting flip-flops instead).  so if the source code is designed
> to do d-latches and they do turn out to be problematic, they can be
> eliminated.
>
>  of course, that's if migen (or whatever we decide on) actually allows
> verilog that can *be* turned into d-latches.
>
>  we may have to be careful on this one (resource-wise), as we may end
> up with O(N^2) in several places: FU-to-FU dependency matrices for
> example.  just have to see.
>
> > I think we will need enough entries in the ROB that we have at least a
> > few more clocks than the latency of the divide unit when it's processing
> > 32-bit numbers (int or fp), so we'll probably need more than 8.
>
> the intel processors have 32 (and a separate Reservation Station table
> with the same order of size)
>
> if we also have 32, divided down modulo 4 (such that the first 2 bits
> of the ROB# *must* be equal to those of the Dest Reg#), we not only
> have a cleaner way to do 4-wide instruction issue, the bit-width of the
> ROB CAM is reduced from 8 bits (1 for INT/FP, 7 for Reg#) down to 6.
>
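
A rough sketch of how the tag arithmetic in that proposal could work out,
under some loud assumptions: "first 2 bits" is read here as the low 2 bits,
the full tag is taken to be 1 INT/FP bit above a 7-bit register number, and
every function name below is hypothetical rather than part of any agreed
design.

```python
ROB_ENTRIES = 32
BANKS = 4  # "divided down modulo 4"


def full_tag(is_fp, reg):
    """8-bit tag: 1 INT/FP bit above a 7-bit register number (assumed layout)."""
    assert 0 <= reg < 128
    return (int(is_fp) << 7) | reg


def bank_of(reg):
    """Bank/lane selected by the low 2 bits of the destination register."""
    return reg % BANKS


def cam_tag(is_fp, reg):
    """Bits the CAM still has to compare once the bank is implied (6 bits)."""
    return full_tag(is_fp, reg) >> 2


def rob_number(reg, slot):
    """ROB# whose low 2 bits are forced to match the destination register."""
    assert 0 <= slot < ROB_ENTRIES // BANKS
    return (slot << 2) | bank_of(reg)
```

With this reading each bank owns 8 of the 32 ROB entries, which is also why a
run of instructions that all target the same bank would leave the other
banks' slots unused.
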
> > For a pipelined divider, if we give the divide unit 3-4 multipliers, then
> > we can shrink the latency to around 12 cycles by using the newton-raphson
> > method. Alternatively, we could implement a radix-4 pipelined divider
> > that would shrink the latency to around 16 cycles.
>
>  well, the nice thing is: whatever the time (and even if there's no
> pipelining) there's no knock-on design impact, as long as, yes, it's
> kept below the ROB size.
>
>  where things might go a little bit astray here is if a chain of
> divide operations gets issued on the same stratification destination
> register (i.e. modulo 4 the dest regs come up with the same
> bank/lane).
>
>  in the scheme i propose, that *will* result in a lot of empty ROB
> slots.  is that ok? i honestly have no idea.  is it a likely scenario?
> i have no idea... however, hmmm, it should be easy enough to check,
> using the spike instruction trace analyser written by an IIT Madras
> student, known as "RiTA".
>
>
> > If we want to go with non-pipelined dividers, we will at least need more
> > than 1 since we need a division per pixel and 30-cycles per pixel will
> > eat up all our performance.
>
>  and it may be a good idea to do so, because we definitely want 1 per
> stratification layer.
>
> > If we do decide to use a pipelined divider, we could share 1 divider
> > between 2 cores, since that would be more than enough performance and
> > would increase the average latency by 1 cycle at most. If we do decide to
> > share the divider, we will need to take care that the division latency
> > doesn't become a side-channel that speculated instructions can leak info
> > through.
>
>  interesting.
>
>  well, with the stratification proposal, the divider could
> hypothetically be shared across banks, instead.
>
I think sharing between pairs of cores will still work since, with a
pipelined divider, you can do 1 divide per clock. For some perspective, a
quad-core Haswell using AVX instructions can do 2.29 (4 cores * 8 lanes /
14 cycles) fp32 divisions per clock, and our quad-core GPU with a pipelined
divider per pair of cores can do 2 divisions per clock.
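
To make that comparison easy to re-run with other numbers, here is a small
sketch of the arithmetic. The figures are the ones quoted in this message;
the helper name and its parameters are made up for illustration.

```python
def divisions_per_clock(units, lanes_per_unit, cycles_per_issue):
    """Peak divisions per clock for `units` dividers, each producing
    `lanes_per_unit` results every `cycles_per_issue` cycles."""
    return units * lanes_per_unit / cycles_per_issue


# Quad-core Haswell with AVX: 4 cores, 8 fp32 lanes, one vector divide per 14 cycles.
haswell = divisions_per_clock(units=4, lanes_per_unit=8, cycles_per_issue=14)

# Proposed quad-core GPU: one pipelined divider per pair of cores, 1 divide per clock.
gpu = divisions_per_clock(units=2, lanes_per_unit=1, cycles_per_issue=1)

print(f"haswell: {haswell:.2f} div/clk, gpu: {gpu:.2f} div/clk")
```
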

>
> > Note that it should be somewhat easy to add sqrt and recip-sqrt
> > operations to most divider designs. Recip-sqrt is particularly useful for
> > normalizing vectors, which is a common graphics operation.
>
>  i heard that, yes.  have you seen that hilarious approximation where
> you subtract from the magic number 0x5f3759df?
>    https://en.wikipedia.org/wiki/Fast_inverse_square_root
>
> what it does is alias a float to an int as a way to approximate
> log2(x).  that's then used as a way to approximate log2(1/sqrt(x)).
> back to float you get an approximation of pow(x,-0.5), and it's accurate
> to around 3.5%.
>
> which is really funny.
>

Yeah, it's pretty neat.
Note that having the RV base integer and fp registers be part of the same
register file, like I had suggested before, allows us to save 2 clock cycles
with the fast inverse sqrt algorithm: you can use the SV rename table to have
an integer register and an fp register renamed to the same underlying
register, removing the need to move between int and fp registers.
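
For reference, a minimal software sketch of the trick being discussed. The
magic constant is the one from the linked article; the helper names are
hypothetical, and the two struct round-trips stand in for the int/fp
register moves that a shared register file would avoid.

```python
import struct

MAGIC = 0x5F3759DF  # constant from the Fast_inverse_square_root article


def float_to_bits(x):
    """Reinterpret an fp32 value as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(i):
    """Reinterpret a 32-bit unsigned integer as an fp32 value."""
    return struct.unpack("<f", struct.pack("<I", i))[0]


def fast_rsqrt(x, newton_steps=0):
    """Approximate 1/sqrt(x) for positive, normal fp32 x."""
    i = float_to_bits(x)                  # fp -> int "move"
    i = (MAGIC - (i >> 1)) & 0xFFFFFFFF   # roughly -0.5 * log2(x) in the bit domain
    y = bits_to_float(i)                  # int -> fp "move"
    for _ in range(newton_steps):         # optional Newton-Raphson refinement
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

Calling `fast_rsqrt(4.0)` gives a value a few percent away from 0.5, and
passing `newton_steps=1` tightens that considerably at the cost of a few
multiplies.
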

Jacob