[libre-riscv-dev] Scoreboards

Wed May 15 03:45:00 BST 2019

On Tue, May 14, 2019 at 10:54 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> if you have only combinatorial ALUs, that complete in one cycle, and
> you have absolutely no need or desire to have overlapping read /
> execute / write phases, instead making them completely and utterly
> separate and running via an FSM, the answer is no, you have absolutely
>

I was working from a four-cycle issue/read/execute/writeback FSM-based FU.

> no need of an FU-Regs / FU-FU Matrix pair (remember, you need both),

But why?  So far I haven't seen a sound explanation for this, and I'm not
sure why you'd want the information these matrices provide.

Please note that I'm not saying you're wrong, or that Mitch is wrong, or
Cray is wrong; the CDC, K5, and 88000 are obviously all successful products
(if not in popularity, then certainly in that they met their design goals
and application needs).  What I am suggesting is that what's written about
them is not as clear as it could be, and even where it is written down, I'm
not seeing a clear use-case for why the matrices would be useful.

I've got nearly a full week to re-re-read this stuff, so maybe some more
meditation will help; but after several days' worth already, I'm just not
getting it.

> and yes, you will NEVER encounter read or write register file hazards,

That's not actually true.  ;)  However, my 6502 thought-experiment simply
refused to issue for as long as any WAW hazard existed, and refused to
write-back as long as WAR hazards existed.  That seemed to be sufficient.
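
To make that rule concrete, here is a minimal sketch in Python of the two
checks.  It is an illustration only, not the actual design: the InFlight
record, the program-ordered in_flight list, and both function names are
made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InFlight:
    """One issued-but-not-retired instruction (hypothetical record)."""
    dest: Optional[int]          # destination register, or None
    srcs: Tuple[int, ...]        # source registers
    operands_read: bool = False  # True once the read phase has completed


def can_issue(new_dest: Optional[int], in_flight: List[InFlight]) -> bool:
    """Refuse to issue while any in-flight op writes the same register (WAW)."""
    if new_dest is None:
        return True
    return all(op.dest != new_dest for op in in_flight)


def can_write_back(index: int, in_flight: List[InFlight]) -> bool:
    """Refuse write-back while an older op still has to read our dest (WAR).

    in_flight is assumed to be in program order, oldest first.
    """
    op = in_flight[index]
    if op.dest is None:
        return True
    older = in_flight[:index]
    return not any(
        (not o.operands_read) and op.dest in o.srcs for o in older
    )
```

With an older op reading r2 and a younger op writing r2, can_write_back
holds the younger op until the older one sets operands_read.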

> performance will be dog slow, however that's not the point :)

As I indicated, I think I can just about double the 6502's performance, so
while it may not be as fast as it could go, it's not *that* dog slow.  ;)
At the very least, excepting some of the more complex addressing modes,
most instructions now run in 1 cycle instead of 2.  For complex addressing
modes, it looks like the time spent executing tends to be one cycle less
than the 6502 itself.  Not too shabby.

> FPGAs typically max out at 2R1W.

The work-around is to use n block RAMs in parallel, with write ports tied
together but separate read ports.  That way, you simulate a 2nR1W register
file.  OTOH, that eats into block RAMs that could be useful elsewhere
(e.g., caches).
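
A behavioural sketch of that trick in Python, purely to show the idea.  The
class name, the 32-register default, and the assumption of two read ports
per block RAM are mine, not taken from any particular FPGA.

```python
class ReplicatedRegFile:
    """Models n identical banks: one shared write port, 2n read ports."""

    def __init__(self, n_banks: int, n_regs: int = 32,
                 reads_per_bank: int = 2) -> None:
        # Each bank stands in for one block RAM holding a full copy.
        self.banks = [[0] * n_regs for _ in range(n_banks)]
        self.reads_per_bank = reads_per_bank

    @property
    def read_ports(self) -> int:
        return len(self.banks) * self.reads_per_bank

    def write(self, reg: int, value: int) -> None:
        # Write ports are tied together: every copy gets the same data.
        for bank in self.banks:
            bank[reg] = value

    def read(self, port: int, reg: int) -> int:
        # Each read port is served by exactly one bank.
        return self.banks[port // self.reads_per_bank][reg]
```

Because every write lands in all banks, any read port sees the same
contents; the cost is n copies of the storage.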

Something like a 2nR1W is perhaps better implemented in an ASIC to keep
things as tight as possible.

-- 
Samuel A. Falvo II