[libre-riscv-dev] GPU design

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Dec 11 14:23:06 GMT 2018

ok, so continuing some thoughts-in-order notes:

* scoreboards are not just scoreboards, they are dependency matrices,
and there are several of them:

 - one for LOAD/STORE-to-LOAD/STORE:
   + most recent LOADs prevent later STOREs
   + most recent STOREs prevent later LOADs.

 - one for Function-Unit to Function-Unit.
   + it expresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
      signals, which are stopped from proceeding by dependent 1-bit CAM latches
   + exceptions may ALSO be made "precise" by holding a "Write prevention"
      signal.  only when the Function Unit knows that an exception is not going
      to occur (memory has been fetched, for example), does it release
      the signal.
    + speculative branch execution likewise may hold a "Write prevention",
       however it also needs a "Go die" signal, to clear out the
       incorrectly-taken branch.
    + LOADs/STOREs *also* must be considered as "Functional Units" and thus
       must also have corresponding entries (plural) in the FU-to-FU Matrix
    + it is permitted for ALUs to *BEGIN* execution (read operands are valid)
       without being permitted to *COMMIT*.  thus, each FU must store (buffer)
       results, until such time as a "commit" signal is received
    + we may need to express an inter-dependence on the instruction order
       (raising the WAW hazard line to do so) as a way to preserve execution
       order.  only the oldest instructions will have this flag dropped,
       permitting execution that has *begun* to also reach "commit" phase.

 - one for Function-Unit to Registers.
    + it expresses the read and write requirements: the source and destination
       registers on which the operation depends.  source registers are marked
       "need read", dest registers marked "need write".
    + by having *more than one* Functional Unit matrix row per ALU it becomes
       possible to effectively achieve the equivalent of the "Reservation
       Stations" of the Tomasulo Algorithm.  the FU row must, like RS's,
       take and store a copy of the src register values.
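
to make the FU-to-Registers idea concrete, here is a minimal sketch in
python (not verilog, and not the actual design).  all names (FURegMatrix,
issue, raw_hazard, waw_hazard, retire) are made up for illustration, and
deriving the FU-to-FU hazards by ANDing rows together is an assumption of
the sketch, not something settled above:

```python
NUM_REGS = 32  # assumption: a 32-entry register file, for illustration only


class FURegMatrix:
    """Hypothetical FU-to-Registers matrix: one row per Function Unit."""

    def __init__(self, num_fus, num_regs=NUM_REGS):
        self.num_regs = num_regs
        # one bitmask per FU row: bit r set => that row still needs reg r
        self.need_read = [0] * num_fus
        self.need_write = [0] * num_fus

    def issue(self, fu, srcs, dest):
        """Mark the source registers "need read" and the dest "need write"."""
        for r in srcs:
            self.need_read[fu] |= 1 << r
        self.need_write[fu] |= 1 << dest

    def raw_hazard(self, fu, older):
        """True if 'fu' reads a register that 'older' has yet to write."""
        return bool(self.need_read[fu] & self.need_write[older])

    def waw_hazard(self, fu, older):
        """True if 'fu' and 'older' both have a pending write to one register."""
        return bool(self.need_write[fu] & self.need_write[older])

    def retire(self, fu):
        """Clear the row once the FU has committed its result."""
        self.need_read[fu] = 0
        self.need_write[fu] = 0
```

with row 0 holding "r3 = r1 op r2" and row 1 holding "r5 = r3 op r4",
raw_hazard(1, 0) is true until row 0 retires.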

* we may potentially have 2-issue (or 4-issue) and a simpler issue and detection
  by "striping" the register file according to modulo 2 (or 4) on the
  register number

  - the Function Unit rows are multiplied up by 2 (or 4) however they are
    actually connected to the same ALUs (pipelined and with both src and
    dest register buffers/latches).
  - the Register Read and Write signals are then "striped" such that read/write
    requests for every 2nd (or 4th) register are "grouped" and will have to
    fight for access to a multiplexer in order to access registers that do not
    have the same modulo 2 (or 4) match.
  - we MAY potentially be able to drop the destination (write) multiplexer(s)
    by only permitting FU rows with the same modulo to write to that destination
    bank.  for example, FUs with indices 0,4,8,12 may only write to registers
    whose numbers are likewise 0 modulo 4.
  - there will therefore be FOUR separate register-data buses, with (at least)
    the Read buses multiplexed so that all FU banks may read all src registers
    (even if there is contention for the multiplexers)
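
the striping rule can be sketched roughly as follows.  this is python for
illustration only; STRIPE=4 and the function names are assumptions, and
the "no write multiplexer" rule is the MAY-be-possible option from the
notes above, not a decided feature:

```python
STRIPE = 4  # assumption: 4-way striping (use 2 for dual issue)


def bank_of(reg):
    """Register bank (lane) that a register number falls into."""
    return reg % STRIPE


def may_write_directly(fu_row, dest_reg):
    """Write without a multiplexer: FU row and dest share the same modulo."""
    return fu_row % STRIPE == bank_of(dest_reg)


def read_needs_mux(fu_row, src_reg):
    """A read crosses lanes (contends for a multiplexer) if the moduli differ."""
    return fu_row % STRIPE != bank_of(src_reg)


def group_by_bank(regs):
    """Group register requests by bank; each bank has its own data bus."""
    banks = {b: [] for b in range(STRIPE)}
    for r in regs:
        banks[bank_of(r)].append(r)
    return banks
```

so FU row 8 may write register 12 directly (both are 0 modulo 4), whereas
a read of register 5 from FU row 4 has to go through a multiplexer.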

* an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
  write/read enable signals already exist as single-bits.  "normal" processors
  store the src/dest registers as an index (5 bits == 0-31), where in this
  design, that has been expanded out to 32 individual Read/Write wires.

  - the register file verilog implementation therefore must take in an
    array of 128-bit write-enable and 128-bit read-enable signals.
  - however the data buses will be multiplexed modulo 2 (or 4) according
    to the lower bits of the register number, in order to cross "lanes".
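
a rough python model of the expansion from register index to single-bit
enables (the function names are hypothetical, and num_regs=128 simply
follows the 128-bit enable width mentioned above):

```python
def one_hot(reg, num_regs=128):
    """Expand a register index into a single-bit enable (one wire per register)."""
    if not 0 <= reg < num_regs:
        raise ValueError("register index out of range")
    return 1 << reg


def enables(srcs, dests, num_regs=128):
    """Build the read-enable and write-enable bit vectors for one operation."""
    read_en = 0
    write_en = 0
    for r in srcs:
        read_en |= one_hot(r, num_regs)
    for r in dests:
        write_en |= one_hot(r, num_regs)
    return read_en, write_en


def lane(reg, stripe=4):
    """Data-bus lane from the lower bits of the register number.

    Assumes stripe is a power of two (2 or 4).
    """
    return reg & (stripe - 1)
```

an operation reading r1 and r2 and writing r3 gives read_en = 0b0110 and
write_en = 0b1000; register 13 sits on lane 1 for both 2-way and 4-way
striping.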

* with so many Function Units in RISC-V (dozens of instructions, times 2
  to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
  we almost certainly are going to have to deploy a "grouping" scheme:

  - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
    to MUL etc., instead we group the FUs by how many src and dest
    registers are required, and *pass the opcode down to them*
  - only FUs with the exact same number (and type) of register profile
    will receive like-minded opcodes.
  - when src and dest are free for a particular op (and an ALU pipeline is
    not stalled) the FU is at liberty to push the operands into the
    appropriate free ALU.
  - FUs therefore only really express the register, memory, and execution
    dependencies: they don't actually do the execution.
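
a minimal sketch of the grouping scheme, again in python for illustration.
the opcode table, the (type, num_src, num_dest) profile tuple and the
names FUGroup / build_groups / dispatch are all assumptions; the sketch
only tracks which FU row holds which opcode, and does no execution:

```python
# hypothetical opcode -> (register type, num_src, num_dest) profiles
PROFILES = {
    "add": ("int", 2, 1),
    "sub": ("int", 2, 1),
    "mul": ("int", 2, 1),
    "fmadd": ("fp", 3, 1),
    "fsqrt": ("fp", 1, 1),
}


class FUGroup:
    """A set of FU rows sharing one register profile."""

    def __init__(self, profile, num_rows):
        self.profile = profile
        self.rows = [None] * num_rows  # None marks a free row

    def issue(self, opcode, srcs, dest):
        """Place the op in the first free row; the opcode is passed down."""
        for i, row in enumerate(self.rows):
            if row is None:
                self.rows[i] = (opcode, srcs, dest)
                return i
        return None  # every row is busy: issue has to stall


def build_groups(profiles, rows_per_group):
    """One FU group per distinct register profile, not one per opcode."""
    return {p: FUGroup(p, rows_per_group) for p in set(profiles.values())}


def dispatch(groups, opcode, srcs, dest):
    """Route an opcode to the group matching its register profile."""
    profile = PROFILES[opcode]
    return profile, groups[profile].issue(opcode, srcs, dest)
```

here add, sub and mul all share the ("int", 2, 1) group, so five opcodes
need only three groups rather than five sets of dedicated FUs.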

