[libre-riscv-dev] GPU design
lkcl
lkcl at libre-riscv.org
Fri Dec 7 11:37:41 GMT 2018
On Fri, Dec 7, 2018 at 10:28 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Fri, Dec 7, 2018 at 1:19 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
> > the 6600 scoreboard rules - which are awesomely simple and actually
> > involve D-Latches (3 gates) *not* flip-flops (10 gates) can be
> > executed in parallel because there will be no overlap between
> > stratified registers.
> >
> Yeah, I was a little surprised when I heard that. I do think, however, that
> we should use flip-flops instead of latches since it makes it much easier
> to design (not having to worry about glitches and stuff) and doesn't use
> much more resources.
well, yosys has an option to disable generation of d-latches
(substituting flip-flops instead). so if the source code is designed
to do d-latches and they do turn out to be problematic, they can be
eliminated.
of course, that's if migen (or whatever we decide on) actually allows
verilog that can *be* turned into d-latches.
we may have to be careful on this one (resource-wise), as we may end
up with O(N^2) in several places: FU-to-FU dependency matrices for
example. just have to see.
> I think we will need enough entries in the ROB that we have at least a few
> more clocks than the latency of the divide unit when it's processing 32-bit
> numbers (int or fp), so we'll probably need more than 8.
the intel processors have 32 (and a separate Reservation Station table
with the same order of size)
if we also have 32, divided down modulo 4 (such that the low 2 bits
of the ROB# *must* be equal to the low 2 bits of the Dest Reg#), we
not only have a cleaner way to do 4-wide instruction issue, the
bit-width of the ROB CAM is reduced from 8 bits (1 for INT/FP, 7 for
Reg#) down to 6.
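to make the arithmetic concrete, here is a rough python sketch of the
stratification idea. it assumes "modulo 4" means the low 2 bits of the
register number pick the bank/lane, and every name in it (NUM_REGS,
lane_of, rob_slots_for and so on) is made up for illustration, not
taken from any existing design.

```python
NUM_REGS = 128      # 7-bit Reg#, as in the figures above
NUM_LANES = 4       # "modulo 4" stratification (assumed: low 2 bits)
ROB_ENTRIES = 32    # proposed ROB size

REG_BITS = (NUM_REGS - 1).bit_length()    # 7
LANE_BITS = (NUM_LANES - 1).bit_length()  # 2


def lane_of(reg):
    """Bank/lane that a destination register falls into."""
    return reg % NUM_LANES


def full_tag_bits():
    """CAM tag width without stratification: 1 INT/FP bit + Reg#."""
    return 1 + REG_BITS


def stratified_tag_bits():
    """CAM tag width once the lane bits are implied by the ROB#."""
    return full_tag_bits() - LANE_BITS


def rob_slots_for(reg):
    """ROB slots a given destination register is allowed to occupy."""
    return [s for s in range(ROB_ENTRIES) if s % NUM_LANES == lane_of(reg)]
```

with these numbers each register can only land in 8 of the 32 slots,
which is the trade-off behind the narrower CAM.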
> For a pipelined divider, if we give the divide unit 3-4 multipliers, then
> we can shrink the latency to around 12 cycles by using the newton-raphson
> method. Alternatively, we could implement a radix-4 pipelined divider that
> would shrink the latency to around 16 cycles.
well, the nice thing is: whatever the time (and even if there's no
pipelining) there's no knock-on design impact, as long as, yes, it's
kept below the ROB size.
where things might go a little bit astray is here: if a chain of
divide operations gets issued on the same stratification destination
register (i.e. modulo 4 the dest regs come up with the same
bank/lane).
in the scheme i propose, that *will* result in a lot of empty ROB
slots. is that ok? i honestly have no idea. is it a likely scenario?
i have no idea... however, hmmm, it should be easy enough to check,
using the spike instruction trace analyser written by an IIT Madras
student, known as "RiTA".
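as a rough idea of what such a check could look like, here is a python
sketch that scans a trace for runs of divides landing in the same
lane. the trace format (a list of (mnemonic, dest_reg) pairs) is purely
hypothetical and is *not* RiTA's or spike's actual output; the choice
to treat "div", "rem" and "fdiv" prefixes as divider operations and to
ignore any non-divide instructions in between is also an assumption.

```python
NUM_LANES = 4  # modulo-4 stratification

# assumed set of mnemonic prefixes that occupy the divide unit
DIVIDER_PREFIXES = ("div", "rem", "fdiv")


def longest_same_lane_div_chain(trace):
    """Longest run of consecutive divider ops whose dest regs share a lane.

    trace: iterable of (mnemonic, dest_reg) pairs (hypothetical format).
    Non-divider instructions are skipped and do not break a run.
    """
    best = 0
    run = 0
    prev_lane = None
    for mnemonic, rd in trace:
        if not mnemonic.startswith(DIVIDER_PREFIXES):
            continue
        lane = rd % NUM_LANES
        run = run + 1 if lane == prev_lane else 1
        prev_lane = lane
        best = max(best, run)
    return best
```

a long run reported on a real workload would suggest the empty-slot
scenario is worth worrying about; a short one would suggest it is not.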
> If we want to go with non-pipelined dividers, we will at least need more
> than 1 since we need a division per pixel and 30-cycles per pixel will eat
> up all our performance.
and it may be a good idea to do so, because we definitely want 1 per
stratification layer.
> If we do decide to use a pipelined divider, we could share 1 divider
> between 2 cores, since that would be more than enough performance and would
> increase the average latency by 1 cycle at most. If we do decide to share
> the divider, we will need to take care that the division latency doesn't
> become a side-channel that speculated instructions can leak info through.
interesting.
well, with the stratification proposal, the divider could
hypothetically be shared across banks, instead.
> Note that it should be somewhat easy to add sqrt and recip-sqrt operations
> to most divider designs. Recip-sqrt is particularly useful for normalizing
> vectors, which is a common graphics operation.
i heard that, yes. have you seen that hilarious approximation where
you subtract from the magic number 0x5f3759df?
https://en.wikipedia.org/wiki/Fast_inverse_square_root
what it does is alias a float to an int as a way to approximate
log2(x). that's then used as a way to approximate log2(1/sqrt(x)).
back to float you get an approximation of pow(x,-0.5), and it's
accurate to around 3.5% (before any newton-raphson refinement).
which is really funny.
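a minimal python sketch of the trick from the linked wikipedia page,
using the struct module to do the float/int aliasing. the function
name fast_rsqrt and the newton_steps parameter are just illustrative
choices; this is a software model, not a proposal for how the hardware
unit would do it.

```python
import struct

MAGIC = 0x5F3759DF  # the constant from the wikipedia article


def fast_rsqrt(x, newton_steps=0):
    """Approximate 1/sqrt(x) for positive, normal float32 x."""
    # alias the float32 bit pattern to a 32-bit unsigned int
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # halving the bits and subtracting from the magic number roughly
    # computes -0.5 * log2(x) in the aliased domain
    i = (MAGIC - (i >> 1)) & 0xFFFFFFFF
    # alias back to float32
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # optional newton-raphson refinement of y ~= x**-0.5
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

with newton_steps=0 the result stays within a few percent of
x ** -0.5; each refinement step costs three extra multiplies and a
subtract.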
>
> > i'll keep working on diagrams, and also reading mitch alsup's chapters
> > on the 6600. they're frickin awesome. the 6600 could do multi-issue
> > LD and ST by way of having dedicated registers to LD and ST. X1-X5
> > were for ST, X6 and X7 for LD.
> >
> Have fun!
i spilled coffee on them already :)
l.