[libre-riscv-dev] crazy idea using tomasulo algorithm
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Tue Dec 4 00:14:54 GMT 2018
[transferring discussion to list]
On Mon, Dec 3, 2018 at 6:16 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> For GPU workloads FP64 is not common so I think having 1 FP64 alu would be sufficient.
yes, agreed. what i like about tomasulo is that it's possible to drop
as many ALUs (of whatever types, and however many of each) onto the
bus as are needed.
> Since indexed loads and stores are not supported, it will be important to support 4x64
> integer operations to generate addresses for loads/stores.
sounds reasonable.
> I was thinking we would use scoreboarding to keep track of operations and
> dependencies since it doesn't need a cam per alu.
http://users.utcluj.ro/~sebestyen/_Word_docs/Cursuri/SSC_course_5_Scoreboard_ex.pdf
it looks like scoreboarding will stall the pipeline as a way to keep
things in order. that doesn't give me a warm fuzzy feeling :) in
particular, if you look at the example towards the end of those
slides, the last operation, which is a div, takes a whopping *forty*
cycles to complete.
now, if that's accurate, it would mean basically stopping the entire
core dead in its tracks, until that *one* ALU had completed.
it also means that we still have to find ways to solve the other
issues (which tomasulo takes care of). i'll write up a section in
the microarchitecture doc about this. basically, i think if we extend
the ROB tag with an 8-bit bitfield (one bit per byte) and use it as a
"predicate-style mask", we can drop SIMD-style ALU units onto the data
bus, even for dealing with 8-bit operations.
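
A minimal sketch of the byte-mask idea, assuming a 64-bit result bus
and an 8-bit mask in which bit i enables byte lane i. The function
names apply_byte_mask and simd_add8 are hypothetical and are not taken
from any existing design; this only illustrates how a per-byte mask
could select which lanes of a SIMD-style result get written.

```python
def apply_byte_mask(old: int, new: int, mask: int) -> int:
    """Merge `new` into `old` for each byte lane whose mask bit is set.

    old, new: 64-bit values; mask: 8-bit field, bit i covers byte i
    (an assumed layout, chosen only for illustration).
    """
    result = 0
    for lane in range(8):
        lane_bits = 0xFF << (8 * lane)
        src = new if (mask >> lane) & 1 else old
        result |= src & lane_bits
    return result


def simd_add8(a: int, b: int) -> int:
    """Add eight independent 8-bit lanes, wrapping each lane at 256."""
    out = 0
    for lane in range(8):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        out |= (lane_sum & 0xFF) << shift
    return out


# Only lanes 0 and 2 of the destination are updated with the SIMD result.
dest = 0x1111111111111111
total = simd_add8(0x0101010101010101, 0x2121212121212121)
merged = apply_byte_mask(dest, total, 0b00000101)
```

With a mask of 0xFF the same path handles a full 64-bit write, so a
single result bus could in principle serve both cases.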
> We should be able to design it to forward past the register file to allow for 0-latency forwarding.
> If we combined that with register renaming it should prevent most war and waw data hazards.
what i particularly like about the tomasulo algorithm is: it takes
care of *all* WAR and WAW data hazards. they're completely gone.
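
A rough sketch of why renaming removes WAR and WAW hazards, assuming
instructions are (dest, src1, src2) triples of architectural register
numbers. The rename function is hypothetical, and the tags are plain
integers standing in for ROB or reservation station entries: every
write gets a fresh tag, so no later write can clobber a value that an
earlier instruction still needs, while true (RAW) dependencies are
kept.

```python
from itertools import count


def rename(instrs, num_arch_regs=8):
    """Rename (dest, src1, src2) triples so every write gets a fresh tag."""
    tags = count(num_arch_regs)
    # Register alias table: architectural register -> current tag.
    rat = {r: r for r in range(num_arch_regs)}
    renamed = []
    for dest, src1, src2 in instrs:
        sources = (rat[src1], rat[src2])  # read sources before updating dest
        rat[dest] = next(tags)            # fresh tag for every write
        renamed.append((rat[dest], *sources))
    return renamed


program = [
    (1, 2, 3),  # r1 = r2 op r3
    (2, 4, 5),  # r2 = r4 op r5  (WAR on r2 against the first instruction)
    (1, 6, 7),  # r1 = r6 op r7  (WAW on r1 against the first instruction)
    (4, 1, 1),  # r4 = r1 op r1  (RAW on r1: must see the third result)
]
renamed = rename(program)
```

After renaming, the three writes target distinct tags, and the last
instruction reads the tag produced by the third one, so only the true
dependency remains.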
> I think branch prediction will be essential if only to fetch and decode operations since it will
> reduce the branch penalty substantially.
ok. it makes me slightly nervous. however, it's been done for decades
and is well-understood. importantly, the logic that's used for
exceptions (which has to be there anyway) is quite straightforward:
in the case of tomasulo it simply erases the reorder buffer and the
reservation stations.
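
A minimal sketch of that flush, assuming a reorder buffer that retires
in program order and a dictionary standing in for the reservation
stations. The Core class and its method names are hypothetical. A
faulting entry that reaches the head of the buffer erases everything
still in flight; a mispredicted branch could plausibly take the same
path.

```python
from collections import deque


class Core:
    def __init__(self):
        self.rob = deque()   # program order, oldest entry on the left
        self.stations = {}   # ROB tag -> operation waiting to execute
        self.next_tag = 0

    def issue(self, op):
        tag = self.next_tag
        self.next_tag += 1
        self.rob.append({"tag": tag, "op": op, "done": False, "fault": False})
        self.stations[tag] = op
        return tag

    def complete(self, tag, fault=False):
        """Mark an entry finished (possibly out of order) and free its station."""
        for entry in self.rob:
            if entry["tag"] == tag:
                entry["done"], entry["fault"] = True, fault
        self.stations.pop(tag, None)

    def commit(self):
        """Retire finished entries in order; flush everything on a fault."""
        retired = []
        while self.rob and self.rob[0]["done"]:
            head = self.rob.popleft()
            if head["fault"]:
                self.rob.clear()
                self.stations.clear()
                break
            retired.append(head["op"])
        return retired


core = Core()
t_add = core.issue("add")
t_div = core.issue("div")
t_mul = core.issue("mul")
t_sub = core.issue("sub")        # still executing when the fault is seen
core.complete(t_mul)             # finishes early, must wait to commit
core.complete(t_add)
core.complete(t_div, fault=True)
retired = core.commit()
```

Here "add" retires normally, while "mul" (already finished) and "sub"
(still in a reservation station) are both discarded because they come
after the faulting "div" in program order.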
> This really should be on the mailing list or added to the design docs.
agreed. noted here https://libre-riscv.org/3d_gpu/microarchitecture/
l.