[libre-riscv-dev] GPU design
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Tue Dec 11 14:23:06 GMT 2018
ok, so continuing some thoughts-in-order notes:
* scoreboards are not just scoreboards, they are dependency matrices,
and there are several of them:
- one for LOAD/STORE-to-LOAD/STORE:
+ most recent LOADs prevent later STOREs
+ most recent STOREs prevent later LOADs.
- one for Function-Unit to Function-Unit.
+ it expresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
signals, which are stopped from proceeding by dependent 1-bit CAM latches
+ exceptions may ALSO be made "precise" by holding a "Write prevention"
signal. only when the Function Unit knows that an exception is not going
to occur (memory has been fetched, for example), does it release
the signal
+ speculative branch execution likewise may hold a "Write prevention",
however it also needs a "Go die" signal, to clear out the
incorrectly-taken branch.
+ LOADs/STOREs *also* must be considered as "Functional Units" and thus
must also have corresponding entries (plural) in the FU-to-FU Matrix
+ it is permitted for ALUs to *BEGIN* execution (read operands are valid)
without being permitted to *COMMIT*. thus, each FU must store (buffer)
results, until such time as a "commit" signal is received
+ we may need to express an inter-dependence on the instruction order
(raising the WAW hazard line to do so) as a way to preserve execution
order. only the oldest instructions will have this flag dropped,
permitting execution that has *begun* to also reach "commit" phase.
- one for Function-Unit to Registers.
+ it expresses the read and write requirements: the source and destination
registers on which the operation depends. source registers are marked
"need read", dest registers marked "need write".
+ by having *more than one* Functional Unit matrix row per ALU it becomes
possible to effectively achieve "Reservation Stations" orthogonality with
the Tomasulo Algorithm. the FU row must, like RS's, take and store a
copy of the src register values.
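
as a rough behavioural sketch of the FU-to-Registers and FU-to-FU matrices described above (plain python, not HDL): each FU row keeps "need read" / "need write" bitmasks, and at issue time those are compared against the other rows to work out who must wait for whom. the names (DepMatrix, go_read, go_write and so on) are made up for illustration. the WAR check is an addition of this sketch, not something listed in the notes above. write-prevention, "go die" and commit ordering are left out.

```python
def mask(regs):
    """Bitmask with one bit per register number in regs."""
    m = 0
    for r in regs:
        m |= 1 << r
    return m


class DepMatrix:
    """Toy model: FU-to-Register bitmasks plus derived FU-to-FU waits."""

    def __init__(self, num_fus):
        self.need_read = [0] * num_fus    # FU-to-Reg: sources still to read
        self.need_write = [0] * num_fus   # FU-to-Reg: dests still to write
        self.raw_wait = [set() for _ in range(num_fus)]  # blocks Go_Read
        self.waw_wait = [set() for _ in range(num_fus)]  # blocks Go_Write
        self.war_wait = [set() for _ in range(num_fus)]  # blocks Go_Write

    def issue(self, fu, srcs, dests):
        if self.need_read[fu] or self.need_write[fu]:
            raise RuntimeError("FU row is still busy")
        rd, wr = mask(srcs), mask(dests)
        for other in range(len(self.need_write)):
            if other == fu:
                continue
            if self.need_write[other] & rd:   # earlier write to my source
                self.raw_wait[fu].add(other)
            if self.need_write[other] & wr:   # earlier write to my dest
                self.waw_wait[fu].add(other)
            if self.need_read[other] & wr:    # earlier unread use of my dest
                self.war_wait[fu].add(other)
        self.need_read[fu], self.need_write[fu] = rd, wr

    def go_read(self, fu):
        return not self.raw_wait[fu]

    def do_read(self, fu):
        if not self.go_read(fu):
            raise RuntimeError("Go_Read not asserted")
        self.need_read[fu] = 0            # operands are now latched in the row
        for waits in self.war_wait:
            waits.discard(fu)

    def go_write(self, fu):
        return (self.need_read[fu] == 0
                and not self.waw_wait[fu]
                and not self.war_wait[fu])

    def do_write(self, fu):
        if not self.go_write(fu):
            raise RuntimeError("Go_Write not asserted")
        self.need_write[fu] = 0
        for waits in self.raw_wait + self.waw_wait:
            waits.discard(fu)


m = DepMatrix(3)
m.issue(0, srcs=[1, 2], dests=[3])   # FU0: r3 <- r1 op r2
m.issue(1, srcs=[3, 4], dests=[5])   # FU1: r5 <- r3 op r4 (RAW on r3)
m.issue(2, srcs=[6, 7], dests=[3])   # FU2: r3 <- r6 op r7 (WAW on r3)
```

here FU1 may not read until FU0 has written r3, and FU2 may read its operands (i.e. *begin*) straight away but has to hold its result until FU0 has written and FU1 has read.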
* we may potentially have 2-issue (or 4-issue), with simpler issue and
detection logic, by "striping" the register file according to modulo 2
(or 4) on the destination register number
- the Function Unit rows are multiplied up by 2 (or 4) however they are
actually connected to the same ALUs (pipelined and with both src and
dest register buffers/latches).
- the Register Read and Write signals are then "striped" such that read/write
requests for every 2nd (or 4th) register are "grouped" and will have to
fight for access to a multiplexer in order to access registers that do not
have the same modulo 2 (or 4) match.
- we MAY potentially be able to drop the destination (write) multiplexer(s)
by only permitting FU rows with the same modulo to write to that destination
bank. FUs with indices 0,4,8,12 may only write to registers similarly
numbered.
- there will therefore be FOUR separate register-data buses, with (at least)
the Read buses multiplexed so that all FU banks may read all src registers
(even if there is contention for the multiplexers)
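
a minimal sketch of the striping rule above, assuming 4 stripes and assuming FU rows are simply numbered 0..N-1 (the function names are hypothetical): a row writes directly only to registers in its own modulo bank, whilst reads from another bank have to go through the (contended) multiplexer.

```python
STRIPES = 4  # assumed: quad-issue striping; use 2 for dual-issue


def bank_of(reg, stripes=STRIPES):
    """Register bank ("lane") selected by the low bits of the register number."""
    return reg % stripes


def can_write_direct(fu_row, dest_reg, stripes=STRIPES):
    """True if this FU row sits in the same bank as the destination register."""
    return fu_row % stripes == bank_of(dest_reg, stripes)


def read_needs_mux(fu_row, src_reg, stripes=STRIPES):
    """True if reading src_reg means crossing into another bank."""
    return fu_row % stripes != bank_of(src_reg, stripes)


def pick_fu_row(dest_reg, free_rows, stripes=STRIPES):
    """Lowest-numbered free FU row allowed to write dest_reg, or None."""
    for row in sorted(free_rows):
        if can_write_direct(row, dest_reg, stripes):
            return row
    return None


row = pick_fu_row(9, {0, 1, 2, 4, 5})   # r9 is in bank 1, so rows 1 and 5 qualify
stalled = pick_fu_row(7, {0, 1, 2})     # no free row in bank 3
```

with this restriction an instruction whose destination bank has no free FU row has to wait, even if rows in the other banks are idle. that is the price of dropping the write multiplexer.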
* an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
write/read enable signals already exist as single-bits. "normal" processors
store the src/dest registers as an index (5 bits == 0-31), whereas in this
design that has already been expanded out to 32 individual Read/Write
wires.
- the register file verilog implementation therefore must take in an
array of 128-bit write-enable and 128-bit read-enable signals.
- however the data buses will be multiplexed modulo 2 (or 4) according
to the lower bits of the register number, in order to cross "lanes".
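
to illustrate the difference between an index and the already-expanded enable wires, here is a small sketch (python, hypothetical helper names). the width is a parameter because the notes above mention both 32 wires and 128-bit enables. the lane split uses the low bits of the register number, as described.

```python
def to_onehot(index, width=32):
    """Expand a register index into a one-bit-per-register enable mask."""
    if not 0 <= index < width:
        raise ValueError("register index out of range")
    return 1 << index


def lane_enables(enable_mask, stripes=4, width=32):
    """Group the asserted enable bits by lane (register number modulo stripes)."""
    lanes = [[] for _ in range(stripes)]
    for reg in range(width):
        if (enable_mask >> reg) & 1:
            lanes[reg % stripes].append(reg)
    return lanes


# read-enables for r5, r9 and r6 asserted at the same time
reads = to_onehot(5) | to_onehot(9) | to_onehot(6)
per_lane = lane_enables(reads)
```

any lane that ends up with more than one register in its list (r5 and r9 in this case) is where requests would have to fight over the same data bus.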
* with so many Function Units in RISC-V (dozens of instructions, times 2
to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
we almost certainly are going to have to deploy a "grouping" scheme:
- rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
to MUL etc., instead we group the FUs by how many src and dest
registers are required, and *pass the opcode down to them*
- only FUs with the exact same number (and type) of register profile
will receive like-minded opcodes.
- when src and dest are free for a particular op (and an ALU pipeline is
not stalled) the FU is at liberty to push the operands into the
appropriate free ALU.
- FUs therefore only really express the register, memory, and execution
dependencies: they don't actually do the execution.
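
the grouping scheme could be sketched roughly as follows. the opcode table, the profile tuple (kind, number of src, number of dest) and the class names are all assumptions made up for the example, not a proposed encoding.

```python
# hypothetical opcode -> register profile: (kind, n_src, n_dest)
PROFILE = {
    "add": ("int", 2, 1),
    "sub": ("int", 2, 1),
    "mul": ("int", 2, 1),
    "fmadd.s": ("fp", 3, 1),
}


class FUGroup:
    """A pool of FU rows sharing one register profile."""

    def __init__(self, profile, num_rows):
        self.profile = profile
        self.rows = [None] * num_rows   # None = free row

    def issue(self, opcode, srcs, dests):
        """Claim a free row and store the opcode with it; None if group is full."""
        for i, slot in enumerate(self.rows):
            if slot is None:
                self.rows[i] = (opcode, tuple(srcs), tuple(dests))
                return i
        return None

    def release(self, row):
        self.rows[row] = None


def build_groups(rows_per_group):
    """One FU group per distinct register profile, not one per opcode."""
    return {p: FUGroup(p, rows_per_group) for p in set(PROFILE.values())}


def dispatch(groups, opcode, srcs, dests):
    """Route an instruction to the group matching its register profile."""
    profile = PROFILE[opcode]
    if (len(srcs), len(dests)) != profile[1:]:
        raise ValueError("operand count does not match profile")
    return groups[profile].issue(opcode, srcs, dests)


groups = build_groups(rows_per_group=2)
a = dispatch(groups, "add", [1, 2], [3])
b = dispatch(groups, "sub", [4, 5], [6])
c = dispatch(groups, "mul", [7, 8], [9])            # int 2-src group is full
d = dispatch(groups, "fmadd.s", [1, 2, 3], [4])     # different group, still free
```

add, sub and mul all land in the same two rows, with the opcode carried along for whichever ALU eventually picks the operands up. the rows themselves do no arithmetic.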
l.