[libre-riscv-dev] crazy idea using tomasulo algorithm

Tue Dec 4 00:14:54 GMT 2018

[transferring discussion to list]

On Mon, Dec 3, 2018 at 6:16 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> For GPU workloads FP64 is not common so I think having 1 FP64 alu would be sufficient.

 yes agreed.  what i like about tomasulo is, it's possible to drop as
many ALU units (of as many independent types) onto the bus as are
needed.

> Since indexed loads and stores are not supported, it will be important to support 4x64
> integer operations to generate addresses for loads/stores.

 sounds reasonable.

> I was thinking we would use scoreboarding to keep track of operations and
> dependencies since it doesn't need a cam per alu.

 http://users.utcluj.ro/~sebestyen/_Word_docs/Cursuri/SSC_course_5_Scoreboard_ex.pdf

 it looks like scoreboarding will stall the pipeline as a way to keep
things in order.  that doesn't give me a warm fuzzy feeling :)   in
particular, if you look at the example towards the end of those
slides, the last operation, which is a div, takes a whopping *forty*
cycles to complete.

 now, if that's accurate, it would mean basically stopping the entire
core dead in its tracks, until that *one* ALU had completed.
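
 a rough sketch of that concern, assuming a heavily simplified
scoreboard: instructions issue strictly in order, and issue stalls
while the functional unit is busy (structural hazard) or the
destination register still has a write pending (WAW).  read-operand
(RAW) and write-back (WAR) stalls are left out.  the function name,
the unit names, and the add/mul latencies are made up for
illustration; only the 40-cycle div comes from the slides.

```python
def issue_cycles(program, latency):
    """Return the cycle in which each instruction issues.

    program: list of (unit, dest_reg) pairs, in program order.
    latency: dict mapping unit name -> cycles until its result is written.
    Simplified model: issue waits only on a busy unit or a pending
    write to the same destination register (WAW).
    """
    unit_free = {}   # unit -> first cycle the unit is free again
    reg_free = {}    # dest reg -> cycle its pending write completes
    issued = []
    earliest = 0     # in-order issue: at most one instruction per cycle
    for unit, dest in program:
        t = max(earliest, unit_free.get(unit, 0), reg_free.get(dest, 0))
        done = t + latency[unit]
        unit_free[unit] = done
        reg_free[dest] = done
        issued.append(t)
        earliest = t + 1
    return issued


program = [
    ("div", "f10"),  # long-running divide
    ("add", "f6"),   # independent, issues right away
    ("add", "f10"),  # WAW on f10: must wait for the divide
    ("mul", "f2"),   # independent, but stuck behind the stalled add
]
print(issue_cycles(program, {"div": 40, "add": 2, "mul": 10}))
```

 in this model independent work does get past the divide, but the
first instruction that hits a hazard against it blocks everything
behind it, because issue is in order.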

 it also means that we still have to find ways to solve the other
issues (which tomasulo takes care of).  i'll write up a section in
the microarchitecture doc about this.  basically, i think if we extend
the ROB tag with an 8-bit bitfield (one bit per byte) and use it as a
"predicate-style mask", we can drop SIMD-style ALU units onto the data
bus to deal even with 8-bit operations.
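
 a minimal sketch of the byte-mask idea, assuming a 64-bit data bus
and a mask in which bit i enables byte i of the result.  the function
name and the merge-into-old-value behaviour are assumptions for
illustration, not a worked-out design.

```python
def apply_byte_mask(old, new, mask):
    """Merge 64-bit `new` into 64-bit `old`, one byte at a time.

    mask is the 8-bit field carried alongside the tag: if bit i is
    set, byte i is taken from `new`, otherwise byte i of `old` is kept.
    """
    result = old
    for i in range(8):
        if (mask >> i) & 1:
            byte_bits = 0xFF << (8 * i)
            result = (result & ~byte_bits) | (new & byte_bits)
    return result & 0xFFFFFFFFFFFFFFFF


old = 0x1111111111111111
new = 0xAABBCCDDEEFF0011
# only the two lowest bytes are enabled, e.g. a 16-bit operation
print(hex(apply_byte_mask(old, new, 0b00000011)))
```

 with all eight bits set this behaves like an ordinary 64-bit write;
with a single bit set it is an 8-bit operation on one lane.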

> We should be able to design it to forward past the register file to allow for 0-latency forwarding.
> If we combined that with register renaming it should prevent most war and waw data hazards.

 what i particularly like about the tomasulo algorithm is: it takes
care of *all* WAR and WAW data hazards.  they're completely gone.
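
 a small sketch of why that works, assuming each issued instruction
gets a fresh tag (its ROB entry number) for its destination and
sources are looked up at issue time.  the function and the register
names are hypothetical; a source that has no in-flight producer is
shown by its register name, meaning "value read from the register
file at issue".

```python
def rename(program):
    """Rename (dest, src1, src2) tuples of architectural register names.

    Returns a list of (dest_tag, [source tags or register names]).
    Every destination gets a unique tag, so two writes to the same
    architectural register no longer collide (no WAW), and a source is
    bound at issue time, so a later write cannot clobber it (no WAR).
    """
    latest = {}    # architectural reg -> tag of its most recent producer
    renamed = []
    for tag, (dest, *srcs) in enumerate(program):
        src_tags = [latest.get(s, s) for s in srcs]
        latest[dest] = tag
        renamed.append((tag, src_tags))
    return renamed


program = [
    ("f0", "f2", "f4"),
    ("f6", "f0", "f8"),    # RAW on f0: waits for tag 0
    ("f8", "f10", "f14"),  # WAR against the read of f8 above
    ("f6", "f10", "f8"),   # WAW against the earlier write of f6
]
for entry in rename(program):
    print(entry)
```

 only the true (RAW) dependencies remain visible as tags in the
source lists.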

> I think branch prediction will be essential if only to fetch and decode operations since it will
> reduce the branch penalty substantially.

 ok. it makes me slightly nervous, however it's been done for decades
and is well-understood.  importantly, it can reuse the logic for
exceptions (which has to be there anyway): in the case of tomasulo
that simply erases the reorder buffer and reservation stations, and
is quite straightforward.
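
 a toy sketch of that flush, assuming results only reach the register
file when they commit from the head of the reorder buffer.  the class
and method names are invented for illustration and do not describe an
actual design.

```python
from collections import deque


class ToyCore:
    """Toy model: in-order commit from a reorder buffer, plus a flush."""

    def __init__(self):
        self.regfile = {}    # committed (architectural) state
        self.rob = deque()   # in-flight entries: (dest_reg, value)
        self.stations = {}   # reservation station name -> dest_reg

    def issue(self, station, dest, value):
        # value is supplied directly; execution is not modelled here
        self.rob.append((dest, value))
        self.stations[station] = dest

    def commit(self):
        # retire the oldest entry and make its result architectural
        dest, value = self.rob.popleft()
        self.regfile[dest] = value

    def flush(self):
        # exception or mispredicted branch: discard everything in flight
        self.rob.clear()
        self.stations.clear()


core = ToyCore()
core.issue("add0", "x1", 5)
core.issue("mul0", "x2", 7)   # imagine this one is on the wrong path
core.commit()                 # x1 becomes architectural
core.flush()                  # x2 is thrown away before it commits
print(core.regfile, len(core.rob), core.stations)
```

 because nothing speculative has touched the register file, clearing
the two structures is enough to recover.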

> This really should be on the mailing list or added to the design docs.

 agreed.  noted here https://libre-riscv.org/3d_gpu/microarchitecture/

l.