[libre-riscv-dev] Scoreboards

Luke Kenneth Casson Leighton lkcl at lkcl.net
Wed May 15 06:49:03 BST 2019

On Wed, May 15, 2019 at 3:38 AM Samuel Falvo II <sam.falvo at gmail.com> wrote:

> > if you have only combinatorial ALUs, that complete in one cycle, and
> > you have absolutely no need or desire to have overlapping read /
> > execute / write phases, instead making them completely and utterly
> > separate and running via an FSM, the answer is no, you have absolutely
> >
> I was working from a four-cycle issue/read/execute/writeback FSM-based FU.

 okaay, so then, yes, you'd have absolutely no need of a scoreboard.

> > no need of an FU-Regs / FU-FU Matrix pair (remember, you need both),
> But, why?  I, so far, haven't seen a sound reason to explain this yet, and
> I'm so far not sure why you'd want the information these provide.

 if you write out the various cases for a multi-stage single-issue
pipelined design, it will help to illustrate.  the reason i asked if
you could do that is because, um, well, i did actually do it already
(a few days ago), and you missed it.  so if *you* outline all of the
dreadful things that a multi-stage single-issue pipeline has to do,
you'll be "engaged" (hopefully) and the benefits will "stick"
(hopefully) :)

> Please note that I'm not saying you're wrong, or that Mitch is wrong, or
> Cray is wrong; the CDC, K5, and 88000 are obviously all successful products
> (if not in popularity, then certainly in that they met their design goals
> and application needs).  What I am suggesting is that what's written about
> them is not as clear as it could be, and even if it's written, I'm not
> seeing a clear use-case for why they would be useful.

 ultimately they were sold on performance (and the CDC6600 on a much
higher bang-per-buck gate ratio), however if you go through the
exercise of writing out the list of
"silly-tricks-that-have-to-be-done" on a single-issue design, it will
become clear.

> > and yes, you will NEVER encounter read or write register file hazards,

> That's not actually true.  ;)  However, my 6502 thought-experiment simply
> refused to issue for as long as any WAW hazard existed, and refused to
> write-back as long as WAR hazards existed.  That seemed to be sufficient.

 ta-daaaa.  that will be "silly-tricks-that-have-to-be-done" Numbers
One and Two.  you had to add:

 (a) WAW detection plus stall the *entire* design
 (b) WAR detection plus stall the *entire* design.

so that's "two stupid tricks" - note in particular that you had to
*have* the WAW and WAR detection, this is very very important to keep
in mind in order to answer your question.
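
a minimal sketch of those two checks, in python, purely as an
illustration.  the names (InFlight, waw_hazard, war_hazard) and the
idea of keeping in-flight instructions in a program-ordered list are
assumptions made for this example, not anything taken from the 6502
design being discussed:

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    """One issued-but-not-yet-retired instruction (hypothetical model)."""
    dest: int               # destination register number
    srcs: tuple             # source register numbers
    has_read: bool = False  # True once the operands have been read


def waw_hazard(new_dest, in_flight):
    """True if some in-flight instruction still has to write new_dest."""
    return any(op.dest == new_dest for op in in_flight)


def war_hazard(writer, in_flight):
    """True if an OLDER in-flight instruction has not yet read a
    register that `writer` is about to overwrite.

    `in_flight` is assumed to be in program order, oldest first.
    """
    for op in in_flight:
        if op is writer:
            return False  # only instructions older than writer matter
        if not op.has_read and writer.dest in op.srcs:
            return True
    return False


def may_issue(new_dest, in_flight):
    # trick (a): refuse to issue while any WAW hazard exists
    return not waw_hazard(new_dest, in_flight)


def may_write_back(writer, in_flight):
    # trick (b): refuse to write back while any WAR hazard exists
    return not war_hazard(writer, in_flight)
```

the point to notice is that both comparisons have to exist even in the
"simple" design: the only thing being saved is what is done with the
result (stall everything).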

>  performance will be dog slow, however that's not the point :)
> >
> As I indicated, I think I can just about double the 6502's performance, so
> while it may not be as fast as it could go, it's not *that* dog slow.  ;)
> At the very least, excepting some of the more complex addressing modes,
> most instructions now run in 1 cycle instead of 2.  For complex addressing
> modes, it looks like the time spent executing tends to be one cycle less
> than the 6502 itself.  Not too shabby.

 nice.  btw you do know that the 6502 is a super-scalar out-of-order
design, right? :)

> >  FPGAs typically max out at 2R1W.
> The work-around is to use multiple block RAMs in parallel, with write-ports
> tied together, but separate read ports.  That way, you simulate a 2nR1W
> register file.  OTOH, that eats into other resources where block RAMs can
> be useful (e.g., caches).

 that's a neat trick.  and you see that it basically more than doubles
the amount of resources required?
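
a small behavioural model of that work-around, as a sketch only.  it
assumes each block RAM is a 2R1W memory and that n copies are kept
identical by broadcasting every write; the class name and its methods
are invented for this example:

```python
class ReplicatedRegFile:
    """2nR1W register file modelled as n identical 2R1W banks."""

    READS_PER_BANK = 2  # assumption: each block RAM offers 2 read ports

    def __init__(self, num_regs, num_banks):
        self.num_regs = num_regs
        self.banks = [[0] * num_regs for _ in range(num_banks)]

    @property
    def num_read_ports(self):
        return self.READS_PER_BANK * len(self.banks)

    @property
    def storage_words(self):
        # every bank holds a full copy of the register file
        return self.num_regs * len(self.banks)

    def write(self, reg, value):
        # write ports are tied together: every copy gets the same data
        for bank in self.banks:
            bank[reg] = value

    def read(self, port, reg):
        # each read port is served by exactly one bank
        return self.banks[port // self.READS_PER_BANK][reg]


rf = ReplicatedRegFile(num_regs=32, num_banks=3)  # a 6R1W file
rf.write(5, 99)
values = [rf.read(port, 5) for port in range(rf.num_read_ports)]
```

storage grows linearly with the number of banks (96 words here for a
32-entry file), which is the resource cost being pointed out, and the
write side is still only a single port.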

> Something like a 2nR1W is perhaps better implemented in an ASIC to keep
> things as tight as possible.

it's still enormous, unfortunately.

traditional vector processors avoid this through sequential
multi-step reads on lower-ported SRAM.  the trade-off is that the
latency is extended, and to get performance back you have to multiply
the parallelism by the ratio of the amount of multi-porting that you
*didn't* do.

so, if you use 2R1W and sequential multi-step reads instead of 10R5W,
you *have* to have *FIVE* way parallel element processing in order to
get the same performance as a 10R5W multi-issue engine.
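
the arithmetic behind that figure, written out as a sketch.  the
function names are made up for this example, and the model assumes
reads and writes are each spread evenly over the available ports:

```python
import math


def read_cycles(reads_needed, read_ports):
    """Cycles needed to do `reads_needed` reads through `read_ports` ports."""
    return math.ceil(reads_needed / read_ports)


def parallelism_needed(wide, narrow):
    """Element parallelism required for a narrow (reads, writes) register
    file to match the throughput of a wide (reads, writes) one."""
    wide_r, wide_w = wide
    narrow_r, narrow_w = narrow
    return max(math.ceil(wide_r / narrow_r), math.ceil(wide_w / narrow_w))


print(read_cycles(10, 2))                  # 5
print(parallelism_needed((10, 5), (2, 1)))  # 5
```

so a 2R1W SRAM takes five sequential steps to supply what a 10R5W file
supplies in one, and each of those steps has to feed five elements'
worth of work to break even.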

(which, btw, is precisely what we're doing in the Libre RISC-V SoC...
just at an "acceptable" (low) trade-off point of only requiring 2x

