[libre-riscv-dev] IEEE754 FPU turning into ALU with Reservation Stations
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Sat Mar 9 03:57:01 GMT 2019
OK, I have managed to separate the code dealing with the inputs and the
output from the add itself, which now has its own internal module.
The internal add module is its own completely separate FSM, with classes
for individual stages that are chained together in a pipeline. However
where John Dawson's original code had a stage (state) for operand A and
another for B, requiring 2 clocks to proceed, the internal add module
requires A and B to be available simultaneously, only requiring 1 clock.
The "external" add is therefore nothing more than an A, B and Z handler. In
essence, it is a Reservation Station, handling the communication with the
"internal" Function Unit.
The next stage of development is to make it possible to feed an array of
MULTIPLE AB inputs through the one Function Unit, to be output to an array
of MULTIPLE Z storage latches.
I have already added a multiplex ID which is passed unmodified through
every single stage, as the inputs pass through and are turned into a
result. This ID is basically the index into the array of RSs, which, once
the output is available, allows the result to be placed in the correct
entry of the Z storage array.
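As a rough illustration of the ID pass-through (plain Python, not the
actual HDL; the name run_shared_adder and the fixed depth of 3 are made up
for this sketch, and an ordinary "+" stands in for the IEEE754 add):

```python
from collections import deque

def run_shared_adder(operands, depth=3):
    """operands: one (a, b) pair per Reservation Station.
    Returns the Z array, indexed by RS number."""
    z = [None] * len(operands)
    # Tag every AB pair with the index of the RS it came from.
    pending = deque((mid, a, b) for mid, (a, b) in enumerate(operands))
    pipe = [None] * depth              # pipe[0] is the newest entry
    while pending or any(slot is not None for slot in pipe):
        # Whatever leaves the last stage still carries its ID,
        # so it can be steered to the matching Z latch.
        done = pipe[-1]
        if done is not None:
            mid, result = done
            z[mid] = result
        # Issue at most one new AB pair per clock.
        nxt = None
        if pending:
            mid, a, b = pending.popleft()
            nxt = (mid, a + b)         # stand-in: the sum is computed up front
        pipe = [nxt] + pipe[:-1]       # advance every stage by one clock
    return z

print(run_shared_adder([(1, 2), (10, 20), (100, 200)]))
```

The ID is never inspected by the stages themselves; it is only used at the
output, to pick the Z entry.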
The interesting thing is, if the pipeline length is dynamically variable,
it does not matter, as the relevant indexed Z storage latch stays reserved
no matter how long the pipeline takes.
This is particularly relevant for the divider, because it will remain a
State Machine, but more than that, even the Add has special cases which
jump out of the pipeline early. The one catch is that the Z latches will
effectively need to be dual ported, because 2 results can be produced on
the same clock: one from the specialcases path, and one from the ADD path
that happened to have been initiated a few clocks earlier.
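A small sketch of why two write ports may be needed (plain Python; the
latencies of 4 and 1 clocks and the "zero operand counts as a special case"
rule are assumptions picked purely for illustration):

```python
def simulate(issue, full_latency=4, special_latency=1):
    """issue: list of (clock, rs_index, a, b) tuples.
    Returns (z, max_writes_per_clock)."""
    arrivals = {}                      # clock -> list of (rs_index, result)
    for clock, mid, a, b in issue:
        # Assumed rule: a zero operand takes the early-out path.
        special = (a == 0 or b == 0)
        latency = special_latency if special else full_latency
        arrivals.setdefault(clock + latency, []).append((mid, a + b))
    z = {}
    max_writes = 0
    for clock in sorted(arrivals):
        writes = arrivals[clock]
        max_writes = max(max_writes, len(writes))
        for mid, result in writes:
            z[mid] = result            # each result lands in its own Z latch
    return z, max_writes

# RS 0 starts a full add at clock 0; RS 1 hits a special case at clock 3.
z, ports = simulate([(0, 0, 1.5, 2.5), (3, 1, 0.0, 7.0)])
print(z, ports)
```

Both results arrive on clock 4, so in this model the Z storage sees two
writes in one clock, even though each goes to a different index.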
The phase after that will be to turn the internal adder into a pipeline,
rather than an FSM. At the moment, even though the stages of the internal
adder module are connected together in a pipeline chain, and are fully
capable of handling simultaneous data, the FSM arrangement *only* activates
one of the stages at a time.
That needs to change, and the previous work to make each stage do something
in only a single cycle was a necessary prerequisite.
Also, the number of stages needs to be reduced. I deliberately made each
module a combinatorial block, with sync applied only where the module is
*used*. Separating the stages into combinatorial blocks will allow several
of them to be chained together behind a single sync.
For example, there are at present separate rounding, correction and packing
combinatorial blocks: these are simple enough that they can be chained into
just the one pipeline phase. In this way it will be possible to get the
total number of pipeline stages down to only 4 (for both the ADD and the
MUL).
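To illustrate the chaining idea only (plain Python; the three toy blocks
below work on a 4-bit mantissa with one guard bit and are NOT the real
rounding, correction and packing logic; phase and run are made-up helper
names):

```python
from functools import reduce

# Each block is a pure function of its input: the software analogue of a
# combinatorial block.
def rounding(s):
    return {**s, "m": s["m"] + s["guard"], "guard": 0}

def correction(s):
    if s["m"] == 16:                   # toy 4-bit mantissa overflowed
        return {**s, "m": 8, "e": s["e"] + 1}
    return s

def pack(s):
    return {**s, "z": (s["e"] << 4) | (s["m"] & 0xF)}

def phase(*blocks):
    """Chain several combinatorial blocks into one pipeline phase."""
    return lambda s: reduce(lambda acc, blk: blk(acc), blocks, s)

def run(phases, s):
    for p in phases:                   # one sync (register) per phase
        s = p(s)
    return s

separate = [phase(rounding), phase(correction), phase(pack)]   # 3 clocks
merged = [phase(rounding, correction, pack)]                   # 1 clock

start = {"m": 0b1111, "e": 2, "guard": 1}
print(run(separate, start)["z"], run(merged, start)["z"])
```

The result is the same either way; only the number of syncs, and so the
number of clocks, changes.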
There is one other niggle, and it's how STB and ACK work. Passing data
between phases currently costs an extra clock cycle, because the STB needs
an ACK before its data is passed on. I don't know how to deal with that.
However, it does not affect functionality, just performance, so it can be
deferred until later, with the ZipCPU investigated for insights.
That is still a heck of a lot to do. By comparison, integer operations
are trivial, involving single stage Function Units. Turning the above into
an ALU will be a simple matter of passing the operand in, next to the ID.
I would like to make it possible to create arbitrary ALUs by having
specifications that allow each instance to select the number and type of
operators to put into the ALU, not just the size of the Reservation Station
Arrays. We don't yet know precisely what number of operations will be
needed, so flexibility is just... sensible.
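One possible shape for such a specification, purely as a sketch (plain
Python; ALUSpec, ALU, OPERATORS and the field names are all hypothetical,
nothing like this exists in the code yet):

```python
import operator
from dataclasses import dataclass

# Hypothetical catalogue of available operators.
OPERATORS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

@dataclass(frozen=True)
class ALUSpec:
    name: str
    operators: tuple                   # which operators this ALU contains
    num_rs: int                        # size of its Reservation Station array

    def __post_init__(self):
        unknown = [op for op in self.operators if op not in OPERATORS]
        if unknown:
            raise ValueError(f"unknown operators: {unknown}")
        if self.num_rs < 1:
            raise ValueError("need at least one Reservation Station")

class ALU:
    def __init__(self, spec):
        self.spec = spec
        self.ops = {op: OPERATORS[op] for op in spec.operators}

    def execute(self, op, rs_id, a, b):
        """The operator selector travels next to the RS ID."""
        if not 0 <= rs_id < self.spec.num_rs:
            raise IndexError(f"no such RS: {rs_id}")
        return rs_id, self.ops[op](a, b)

fp_alu = ALU(ALUSpec("fp", ("add", "mul"), num_rs=4))
print(fp_alu.execute("mul", 2, 3.0, 4.0))
```

Each ALU instance is then just a different ALUSpec, which makes it cheap to
try out different mixes of operators and RS counts.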
For example, the effect of the interaction between the number of RS's and
the number of operations within each ALU is unknown at the moment. We do
know that the total number of RS's needs to be kept down, otherwise the
size of the FU-FU dependency matrix could get seriously out of hand. If
there are too few RS's, though, they would create a bottleneck and also
leave the ALUs significantly underutilised.
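A back-of-envelope way to see the growth (plain Python; this ASSUMES every
RS shows up as one row and one column of the FU-FU matrix, so the cell
count is the square of the total; fu_fu_matrix_cells is a made-up name):

```python
def fu_fu_matrix_cells(rs_per_alu):
    """rs_per_alu: number of Reservation Stations in each ALU.
    Returns (total_rs, cells), assuming one row and column per RS."""
    total = sum(rs_per_alu)
    return total, total * total

for config in ([2, 2, 2], [4, 4, 2], [8, 8, 4]):
    print(config, fu_fu_matrix_cells(config))
```

Under that assumption, doubling the RS count in every ALU quadruples the
matrix.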
It is... complex! So flexibility to be able to dynamically adjust and
investigate will be key.
Two other tasks remain: making it possible to share 2x 32 bit FMUL pipeline
stages to create a 64 bit FMUL, and adding in a 3rd operand to turn FMUL
into an FMAC. Oh, and creating a sqrt and inv-sqrt.
Quite a lot to get done.
L.
--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68