[libre-riscv-dev] [Bug 329] coriolis2 experiment layout for Dependency Matrices

Fri May 22 14:35:08 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=329

--- Comment #7 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jean-Paul.Chaput from comment #6)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > i just also added a second one to experiment8 as an illustration, it is an
> > FU-Regs Matrix with the following parameters:
> > 
> > * Number of Regs:                          4 (to keep it small as a demo)
> > * Number of Function Units:                3 (likewise)
> > * Number of Regfile Read ports protected:  2
> > * Number of Regfile Write ports protected: 2
> > 
> > these numbers can and will vary.  we have *four* Register Files to
> > protect with these Matrices.  five when we have FP support.
> 
> Hello Luke,
> 
> I'm looking closely to the examples but I still have problem understanding
> the matrix
> nature of the design. As a first I would like to concentrate on the FU-FU
> matrix, which,
> if I understand well, manage the read/write dependencies between FUs and
> generate the
> Go_Read & Go_Write signals towards the CU (and some other signals).

that's a pretty good understanding.  it does however cover a bit more scope
than is needed here.  what is needed is an understanding of the relationship
between readable_o and go_read_o (and writable_o and go_write_o).

first: all of those are bitvectors.  every bitvector element in *all*
of *_i and *_o refers to the same FU number.

* go_rd1[0] is for FU 0
* issue_i[0] is for FU 0
* wr_pend_i[0] is for FU 0
* readable_o[0] is ... for FU...0

etc.

the FU-FU Matrix generates the *desire* (the readiness) to be readABLE
and also writABLE.  many of these could potentially be raised simultaneously,
however we have limited numbers of Regfile read ports and write ports.

we therefore need to pick one, because Regfile ports are a limited (broadcast,
global bus) resource.

thus, these signals go through a "Priority Picker" to select *which one*
of the readable_o is to be sent to the Register File as a GO_READ, and
also to notify *one* Function Unit that the data being broadcast from
the Register File Read Ports is *specifically* for that Function Unit and
that Function Unit only.

likewise writable_o goes through a *separate* Priority Picker.

you can see that process in this diagram here:

https://libre-soc.org/3d_gpu/group_pick_rd_rel.jpg
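
as a rough illustration of what such a picker does to a bitvector, here is
a minimal pure-python sketch.  it is not the actual Priority Picker used in
the design: the function name priority_pick is hypothetical, and the choice
that the lowest-numbered FU wins is an assumption made for the example only.

```python
def priority_pick(readable_o: int) -> int:
    """Keep only the lowest set bit of the input bitvector (one-hot result).

    Assumption for this sketch: the lowest-numbered FU has highest priority.
    """
    # x & -x isolates the least-significant set bit (and gives 0 for x == 0)
    return readable_o & -readable_o


# FU 1 and FU 2 both raise readable_o on the same cycle
readable_o = 0b110
go_read_o = priority_pick(readable_o)

# only one FU is told that the Regfile read port data is for it
print(f"readable_o={readable_o:03b} go_read_o={go_read_o:03b}")
```

the point is only that many readable_o bits may be set at once, whereas
at most one go_read bit comes out per picker.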

so for the purposes of this experiment we focus on readable_o and writable_o
only: go_read *OUT* and go_write *OUT* are out of scope.

however... and here is where it gets to be fun:

those go_read and go_write signals as **INPUTS** to the FU-FU Matrix tell
us something very important:

* they tell us that, right now, on this cycle, one FU (which is some distance
  away from the FU-FU Matrix) *IS* receiving its Read Registers

* that therefore, the FU-FU Cell's job, which is to protect and preserve
  the Directed Acyclic Graph of read-write hazards, is NO LONGER required
  to protect that Read Dependency

* that therefore, that FU-FU cell may CANCEL the Read Dependency by pulling
  the Latch low.

a similar process occurs for go_write, obviously.  in addition, note
that go_die_i also fires *all* go_writes and *all* go_reads, because
it is a cancel / reset signal for that particular bitvector-numbered FU.
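
the cancelling of a dependency can be sketched as a toy bitvector of
SR latches in plain python.  this is an illustration only, not the nmigen
code: the class name DepLatchRow, the set_i argument, and the choice that
reset wins over set when both arrive together are all assumptions.

```python
class DepLatchRow:
    """Toy model of a bitvector of SR latches, one bit per FU (hypothetical)."""

    def __init__(self, n_fu: int):
        self.mask = (1 << n_fu) - 1  # one bit per Function Unit
        self.q = 0                   # latched dependencies, all clear

    def step(self, set_i: int = 0, go_read_i: int = 0, go_die_i: int = 0) -> int:
        # go_die_i acts like a go_read for the same-numbered FU: both reset
        reset = (go_read_i | go_die_i) & self.mask
        # assumption for this sketch: reset takes priority over set
        self.q = (self.q | (set_i & self.mask)) & ~reset
        return self.q


row = DepLatchRow(n_fu=3)
row.step(set_i=0b011)        # FU 0 and FU 1 now hold a read dependency
row.step(go_read_i=0b001)    # FU 0 received its registers: latch pulled low
print(f"{row.q:03b}")        # FU 1 is still protected
row.step(go_die_i=0b010)     # FU 1 is cancelled: its latch is also cleared
print(f"{row.q:03b}")
```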

therefore,

* readable_o bitvector comes out, goes through PriorityPicker (out of scope of
  this layout), go_read_o bitvector comes out of that

* go_read_i bitvector (same NETLIST as go_read_o) comes back *IN* to the FU-FU
  Matrix.

note for layout:

* go_read_i bitvector wires should definitely be laid out horizontally.
* likewise go_write_i and go_die_i bitvectors.  these come in on the LEFT.
* there are *multiple* go_read_i bitvectors - go_rd1_i, go_rd2_i, go_rd3_i...
* because i used bitvectors (including in the SR Latches), the Matrix is
  effectively split into a 1D array of 1D bitvector "managers".  these
  are named dm0, dm1.. etc.
* dm0, dm1, dm2 .... etc. need to be laid out *vertically* in order to
  accept the horizontal input from go_read_i and go_write_i
* issue_i, rd_pend_i and wr_pend_i are also best done i believe as
  horizontal signals.

so that is inputs.

regarding outputs:

* wr_wait_o and rd_wait_o on the other hand: these are what go into those
  GreatBigORGates (fur_x1/2/3/4) and i *believe*, because dm0..3 need to
  be laid out vertically, the readable_o and writable_o bitvectors need
  to come out at the BOTTOM.

* on the other hand, if instead fur_x1/2/3/4 are done as very very skinny
  cells, practically empty, using available free space inside dm0/1/2/3,
  readable_o and writable_o could just as easily come out at the RIGHT.

either way does not matter for outputs: the inputs do matter though.

summary:

* anything as input (*_i) comes in on the LEFT and is wired HORIZONTAL
* anything as output (*_o) should go out on the BOTTOM and be wired VERTICAL.
  however, it may also work if it is on the RIGHT and HORIZONTAL.  this is
  up to you.

> There is two "level" of matrixes:
> 
> 1. The architectural level (that is close to what you do) and the one I
> cannot clearly
>    guess. With 3 FU, is there a 3x3 matrix or 3 FU blocks only, and in the
> later case,
>    it may not be a matrix but just a row or a column.
>      But maybe I make confusion between FU and the dependency matrix of the
> FUs.

yes.  FUs are located nowhere near the FU-FU matrix.  an FU may be a FMAC
for example, which will be... 20,000 gates, and there will be... 4 of those.

the FU-FU Matrix is tiny by comparison, and centralised, and you absolutely
do not want the two to be amalgamated.

* FU does the job of *computing* results based on operands

* FU-FU Matrix does the job of preserving the Directed Acyclic Graph (DAG) of
  relationships *BETWEEN* FUs.

> 2. The layout (cell level) into which the cells of *one* FU (or whatever
> sub-block)
>    are also arrayed in a matrix. As we may not put all the cells of a block
> in just
>    one row.
> 
> So would it be possible to send examples where one block of the matrix (in
> the
> sense of 1.) is clearly identified (best would be that it is put in a
> sub-block) ?

yaa of course.  (it is already in a sub-block).  added to experiment8,
test_fu_fu_matrix.il.  you can regenerate it at any size you wish
using this:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/scoremulti/fu_fu_matrix.py;hb=HEAD

and altering the n_fu_cols and n_fu_rows (make sure they are the same).

> 
> Or, if it is really inconvenient due to the way the design is described at
> nMigne level,
> list me what I/O signals (which bit of vectors) are specific to one element
> of the
> matrix ?

ok.

so because of the recursive nature of a Directed Acyclic Graph, what you
ask in both (1) *and* (2) is, strictly speaking, not possible.

i may have this the wrong way round, please forgive me for that:

* a row in the FU-FU Matrix expresses that one FU has a *BLOCK* on other FUs
  (and each element in that row says which one)

  this would be "outputs" on a node in the Directed Acyclic Graph of
  read-write dependencies

* a column in the FU-FU expresses that one FU is *BLOCKED BY* another FU
  (and each element in that column says which one)

  this would be "inputs" on a node in the Directed Acyclic Graph of
  read-write dependencies

therefore, technically, it is not possible to "divide" them from each other,
from an inter-relationship perspective.
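
to make the row / column view concrete, here is a small pure-python sketch.
it follows the orientation described above (which, as said, may be the wrong
way round), and the names blocks, add_block, row and column are hypothetical,
chosen for this example only.

```python
N_FU = 3

# blocks[i] is a bitvector: bit j set means "FU i has a BLOCK on FU j"
blocks = [0] * N_FU


def add_block(blocker: int, blocked: int) -> None:
    """Record an edge blocker -> blocked in the dependency graph."""
    blocks[blocker] |= 1 << blocked


def row(i: int) -> int:
    """Row i: which FUs are blocked by FU i (the node's "outputs")."""
    return blocks[i]


def column(j: int) -> int:
    """Column j: which FUs are blocking FU j (the node's "inputs")."""
    return sum(((blocks[i] >> j) & 1) << i for i in range(N_FU))


add_block(0, 1)   # FU 0 blocks FU 1
add_block(0, 2)   # FU 0 blocks FU 2
add_block(1, 2)   # FU 1 blocks FU 2

print(f"row(0)    = {row(0):03b}")     # FUs blocked by FU 0
print(f"column(2) = {column(2):03b}")  # FUs that FU 2 is waiting on
```

every bit belongs to one row *and* one column at the same time, which is
why a single cell cannot be attributed to just one FU.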

however... for convenience, what i have had to do is to *actually* divide
them into a 1D array of 1D arrays.  this is because of the simulation speed
of nmigen.

in a previous version, i had *actual* 2D cells.  each SR Latch was a single
bit (not a bitvector).  this was so horribly slow i could not tolerate it.

> A slow learner.

this is a complex topic.

> PS: I still not completely understand the color coding of the scoreboard
> schematic.
>     Is the size and position of the little blue/yellow/green squares inside
> the red one
>     significant?

bear in mind this is for the FU-Regs Matrix image (see attachments, here,
or see p23)

scoreboard_mechanics.1.pdf - section 10.5 p23 "A scoreboard using Dep Matrix"

In this diagram

* a red box denotes that this entry can read or write those registers.
* A blue box denotes that it is possible for an instruction in that
  Unit to write to this register.
* A yellow box indicates that an instruction in this Unit can read
  this register.
* Finally, a green box indicates that this Unit can either read
  this register for normal activities, or write to this register if a store
  is being performed.

For each of the boxes a clocked set-reset flip-flop is used to gate state
changes into the table so that subsequent instructions will see the state of
the current register data-flow dependencies.
