[libre-riscv-dev] Scoreboards
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Wed May 15 06:49:03 BST 2019
On Wed, May 15, 2019 at 3:38 AM Samuel Falvo II <sam.falvo at gmail.com> wrote:
> > if you have only combinatorial ALUs, that complete in one cycle, and
> > you have absolutely no need or desire to have overlapping read /
> > execute / write phases, instead making them completely and utterly
> > separate and running via an FSM, the answer is no, you have absolutely
> >
>
> I was working from a four-cycle issue/read/execute/writeback FSM-based FU.
okaay, so then, yes, you'd have absolutely no need of a scoreboard.
> > no need of an FU-Regs / FU-FU Matrix pair (remember, you need both),
>
> But, why? I, so far, haven't seen a sound reason to explain this yet, and
> I'm so far not sure why you'd want the information these provide.
if you write out the various cases for a multi-stage single-issue
pipelined design, it will help to illustrate. the reason i asked if
you could do that is because, um, well, i did actually do it already
(a few days ago), and you missed it. so if *you* outline all of the
dreadful things that a multi-stage single-issue pipeline has to do,
you'll be "engaged" (hopefully) and the benefits will "stick"
(hopefully) :)
> Please note that I'm not saying you're wrong, or that Mitch is wrong, or
> Cray is wrong; the CDC, K5, and 88000 are obviously all successful products
> (if not in popularity, then certainly in that they met their design goals
> and application needs). What I am suggesting is that what's written about
> them is not as clear as they could be, and even if it's written, I'm not
> seeing a clear use-case for why they would be useful.
ultimately they were sold on performance (and the CDC6600 on a much
higher bang-per-buck gate ratio), however if you go through the
exercise of writing out the list of
"silly-tricks-that-have-to-be-done" on a single-issue design, it will
become clear.
> > and yes, you will NEVER encounter read or write register file hazards,
> That's not actually true. ;) However, my 6502 thought-experiment simply
> refused to issue for as long as any WAW hazard existed, and refused to
> write-back as long as WAR hazards existed. That seemed to be sufficient.
ta-daaaa. that will be "silly-tricks-that-have-to-be-done" Numbers
One and Two. you had to add:
(a) WAW detection plus stall the *entire* design
(b) WAR detection plus stall the *entire* design.
so that's "two stupid tricks" - note in particular that you had to
*have* the WAW and WAR detection, this is very very important to keep
in mind in order to answer your question.
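
to make (a) and (b) concrete, here is a minimal sketch of the two checks
that a stall-on-hazard single-issue design needs. it is an illustration
only: `Instr`, `waw_hazard`, `war_hazard`, `may_issue` and
`may_write_back` are hypothetical names, not taken from the 6502
thought-experiment or from any real design, and register numbers are
plain integers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[int]          # register written, or None if nothing is written
    srcs: Tuple[int, ...] = ()   # registers read


def waw_hazard(new: Instr, in_flight: Sequence[Instr]) -> bool:
    """True if `new` writes a register that an in-flight instruction still has to write."""
    return new.dest is not None and any(old.dest == new.dest for old in in_flight)


def war_hazard(writer: Instr, unread: Sequence[Instr]) -> bool:
    """True if `writer` would overwrite a register that an earlier instruction has not read yet."""
    return writer.dest is not None and any(writer.dest in old.srcs for old in unread)


def may_issue(new: Instr, in_flight: Sequence[Instr]) -> bool:
    # trick (a): refuse to issue (stalling everything) while a WAW hazard exists
    return not waw_hazard(new, in_flight)


def may_write_back(writer: Instr, unread: Sequence[Instr]) -> bool:
    # trick (b): refuse to write back (stalling everything) while a WAR hazard exists
    return not war_hazard(writer, unread)


if __name__ == "__main__":
    older = Instr(dest=1, srcs=(2, 3))                  # r1 <- r2 op r3, still in flight
    print(may_issue(Instr(dest=1, srcs=(4,)), [older]))       # False: WAW on r1
    print(may_write_back(Instr(dest=2, srcs=(6,)), [older]))  # False: WAR on r2
```

the point of the sketch is only that both comparisons have to exist
somewhere in the design, even when the response to a hit is simply to
stall.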
> > performance will be dog slow, however that's not the point :)
> >
>
> As I indicated, I think I can just about double the 6502's performance, so
> while it may not be as fast as it could go, it's not *that* dog slow. ;)
> At the very least, excepting some of the more complex addressing modes,
> most instructions now run in 1 cycle instead of 2. For complex addressing
> modes, it looks like the time spent executing tends to be one cycle less
> than the 6502 itself. Not too shabby.
nice. btw you do know that the 6502 is a super-scalar out-of-order
design, right? :)
> > FPGAs typically max out at 2R1W.
>
> The work-around is to use multiple block RAMs in parallel, with write-ports
> tied together, but separate read ports. That way, you simulate a 2nR1W
> register file. OTOH, that eats into other resources where block RAMs can
> be useful (e.g., caches).
that's a neat trick. and you see that it basically more than doubles
the amount of resources required?
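
a small behavioural model of that trick, as a sketch: every write goes to
all copies, and each copy serves its own read ports. the class name
`ReplicatedRegFile` and its parameters are made up for illustration, and
the model assumes each underlying RAM offers 2 read ports and 1 write
port. it says nothing about timing or about any particular FPGA's block
RAM primitives.

```python
class ReplicatedRegFile:
    """Model of a 2nR1W register file built from n copies of a 2R1W memory."""

    def __init__(self, n_regs: int, n_copies: int, reads_per_copy: int = 2):
        self.reads_per_copy = reads_per_copy
        # one full copy of the register contents per underlying RAM
        self.copies = [[0] * n_regs for _ in range(n_copies)]

    @property
    def read_ports(self) -> int:
        return len(self.copies) * self.reads_per_copy

    def write(self, reg: int, value: int) -> None:
        # the write ports are tied together: every copy receives the same write
        for copy in self.copies:
            copy[reg] = value

    def read(self, port: int, reg: int) -> int:
        # each read port is served by exactly one of the copies
        if not 0 <= port < self.read_ports:
            raise IndexError("no such read port")
        return self.copies[port // self.reads_per_copy][reg]

    def storage_words(self) -> int:
        # total storage grows linearly with the number of copies
        return sum(len(copy) for copy in self.copies)


if __name__ == "__main__":
    rf = ReplicatedRegFile(n_regs=32, n_copies=3)
    rf.write(5, 42)
    print(rf.read_ports)                                # 6
    print([rf.read(p, 5) for p in range(rf.read_ports)])  # [42, 42, 42, 42, 42, 42]
    print(rf.storage_words())                           # 96
```

in this model the storage cost is n times that of a single copy, which
is where the resource concern comes from.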
> Something like a 2nR1W is perhaps better implemented in an ASIC to keep
> things as tight as possible.
it's still enormous, unfortunately.
traditional vector processors avoid this through sequential
multi-step reads on lower-ported SRAM. the trade-off is that the
latency is extended, and to get performance back you have to multiply
the parallelism by the ratio of the amount of multi-porting that you
*didn't* do.
so, if you use 2R1W and sequential multi-step reads instead of 10R5W,
you *have* to have *FIVE* way parallel element processing in order to
get the same performance as a 10R5W multi-issue engine.
(which, btw, is precisely what we're doing in the Libre RISC-V SoC...
just at an "acceptable" (low) trade-off point of only requiring 2x
parallelism)
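
the arithmetic behind the 10R5W versus 2R1W example can be sketched as
below. this is one reading of the rule of thumb, not a formula from the
thread: the helper `required_parallelism` is hypothetical, and it assumes
the parallelism needed is the larger of the read-port ratio and the
write-port ratio, rounded up.

```python
def required_parallelism(want_r: int, want_w: int, have_r: int, have_w: int) -> int:
    """Element-parallelism needed to make up for ports that were not built.

    Assumption: the factor is the worse of the read-port and write-port
    ratios, each rounded up to a whole number.
    """
    if min(want_r, want_w, have_r, have_w) <= 0:
        raise ValueError("port counts must be positive")
    read_ratio = -(-want_r // have_r)    # ceiling division
    write_ratio = -(-want_w // have_w)
    return max(read_ratio, write_ratio)


if __name__ == "__main__":
    # 10R5W performance wanted, only 2R1W built
    print(required_parallelism(10, 5, 2, 1))  # 5
```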
l.