[libre-riscv-dev] Why The Dual ISA

Sun Jan 19 13:55:14 GMT 2020

On Sunday, January 19, 2020, Jacob Lifshay <programmerjake at gmail.com> wrote:

> On Sun, Jan 19, 2020, 01:08 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > the "state" would also include the *value* of predication bits, rather than
> > the *register* in which that predication is stored.
> >
>
> I think we should also support running a side-effect free instruction (no
> stores, no loads not in L1) before the predication bits are known, then
> skip writing unneeded results once the predicate bits are known, this will
> allow us to increase performance by utilizing ALUs that otherwise would run
> a NOP, reducing latency.

i apologise, it's been a year now already and i got sidetracked into the
IEEE754FPU for several months.  below is from memory.

what i had implemented, or planned, was something that involves "shadowing".
it is not merely similar to branch prediction shadowing: it is identical to
branch prediction shadowing.

the Function Unit on which the (unknown predicated) instruction is to be
executed is "reserved".  the instruction can in fact be executed
immediately, regardless of the predicate bit being known or not (just like
branch prediction).

specifically: the reservation of the FU *absolutely must* occur, whilst the
actual execution does not have to.

once the register lookup for the predicate is performed, then any FUs which
have "zeros" are given the "Go Die" signal.  this firstly cancels any
execution in the pipeline, and secondly frees up the Function Unit
*managing* the *result* from that pipeline.

for anything with "ones", the shadow is released and the execution is
permitted to proceed as any other normal (non-predicated) execution does.
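
as a rough illustration of the reserve / shadow / "Go Die" sequence described
above, here is a minimal python sketch.  it is not the actual implementation:
the names (State, FunctionUnit, resolve_predicate) and the one-element-per-FU
mapping are assumptions made up purely for this example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # FU is free to be reserved
    SHADOWED = auto()   # FU reserved, predicate bit not yet known
    COMMITTED = auto()  # shadow released, proceeds like a normal op


@dataclass
class FunctionUnit:
    name: str
    state: State = State.IDLE
    element: int = -1  # hypothetical: index of the predicate bit waited on

    def reserve(self, element):
        # the reservation must always happen, even if execution is deferred
        assert self.state is State.IDLE
        self.state = State.SHADOWED
        self.element = element

    def go_die(self):
        # cancel any in-flight execution and free the FU
        self.state = State.IDLE
        self.element = -1

    def release_shadow(self):
        # the result may now be written, as for a non-predicated op
        self.state = State.COMMITTED


def resolve_predicate(fus, predicate_mask):
    """Apply the now-known predicate bits to every shadowed FU."""
    for fu in fus:
        if fu.state is not State.SHADOWED:
            continue
        if (predicate_mask >> fu.element) & 1:
            fu.release_shadow()
        else:
            fu.go_die()


fus = [FunctionUnit(f"fu{i}") for i in range(4)]
for i, fu in enumerate(fus):
    fu.reserve(element=i)

# predicate register lookup completes: bits 0 and 2 set, bits 1 and 3 clear
resolve_predicate(fus, 0b0101)
states = [fu.state.name for fu in fus]
```

the FUs waiting on bits 1 and 3 go back to idle and can be reserved again,
whilst the other two carry on as if they had never been predicated.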

so there isn't exactly a NOP (empty slot): instead there is speculative
execution, which may or may not turn out to be wasted resources.

the reason i say "may or may not" is because there ideally needs to be a
larger number of Function Units than there are stages in the pipeline.

to explain that, 6600-style Dependency Matrix (DM) managed pipelines work as
follows:

* a DM exists with registers on the rows and Function Units (FUs) on the
columns.
* an instruction "reserves" an FU, with the reads and writes it depends on
marked as "bits".
* the FUs are *not* the same as pipelines.  they are just the FRONT (and
back) of a pipeline: its "data sync management".
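
to make the row/column idea concrete, here is a small python sketch of such a
matrix.  it is a deliberately simplified model, not the real 6600 scoreboard
logic and not the project's code: the class name, the method names and the
separate read/write bit arrays are all assumptions for illustration.

```python
class DependencyMatrix:
    """Toy model: one row per register, one column per Function Unit."""

    def __init__(self, n_regs, n_fus):
        self.n_regs = n_regs
        self.n_fus = n_fus
        # rd[reg][fu] == 1 means "fu still needs to read reg"
        self.rd = [[0] * n_fus for _ in range(n_regs)]
        # wr[reg][fu] == 1 means "fu still needs to write reg"
        self.wr = [[0] * n_fus for _ in range(n_regs)]
        self.busy = [False] * n_fus

    def reserve(self, fu, reads, writes):
        """An instruction reserves a FU and marks its register bits."""
        if self.busy[fu]:
            raise ValueError(f"FU {fu} is already reserved")
        self.busy[fu] = True
        for reg in reads:
            self.rd[reg][fu] = 1
        for reg in writes:
            self.wr[reg][fu] = 1

    def release(self, fu):
        """Clear the FU's whole column (result written, or op cancelled)."""
        for reg in range(self.n_regs):
            self.rd[reg][fu] = 0
            self.wr[reg][fu] = 0
        self.busy[fu] = False

    def pending_readers(self, reg):
        return [fu for fu in range(self.n_fus) if self.rd[reg][fu]]

    def pending_writers(self, reg):
        return [fu for fu in range(self.n_fus) if self.wr[reg][fu]]


dm = DependencyMatrix(n_regs=8, n_fus=4)
dm.reserve(0, reads=[1, 2], writes=[3])  # e.g. r3 = r1 op r2
dm.reserve(1, reads=[3], writes=[4])     # e.g. r4 = f(r3), depends on FU 0
writers_before = dm.pending_writers(3)
readers_before = dm.pending_readers(3)
dm.release(0)
writers_after = dm.pending_writers(3)
```

note that nothing here is a pipeline: the matrix only tracks which FU is
waiting on which register, which is the "data sync management" role.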

so if you have a 4 stage pipeline, you need at least 4 FUs, one to manage
each "timeslot" in that pipeline.

now, if we don't want to waste resources (no NOPs or cancelled execution)
we need *more* than 4 FUs, so that the cancellation can occur *whilst the
operations are still in the backlog, waiting to go into the pipeline*.

in other words, by having 4 extra always-waiting FUs you have 4 cycles in
which you can look up the predicate, cancel the ones that have zero bits, and
yet the pipeline (which can only handle 4 simultaneous outstanding FUs because
it is 4 timeslots long) will remain 100% occupied.

ta-daaaa

if you have mitch's book chapters, jacob, this is what mitch calls
"concurrent pipelines" except he allocates 4 FUs for a 4 stage pipeline,
in which circumstances the above speculation is *guaranteed* to
end up with empty pipeline slots.

only when more FUs exist than there are pipeline stages is it possible to
run 100%.
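
a toy cycle-by-cycle model of the FU-count argument, in python.  this is only
a sketch under stated assumptions, none of which come from real hardware: ops
wait in their FU until their predicate bit is known (the "no cancelled
execution" option), each lookup takes a fixed number of cycles, every idle FU
is re-reserved immediately, and the pipeline accepts one op per cycle.  the
function name and parameters are made up for the example.

```python
def simulate(n_fus, depth, lookup, mask, cycles):
    """Return one bool per cycle: True if an op entered the pipeline."""
    fus = [None] * n_fus  # None means the FU is idle
    next_elem = 0
    launched = []
    for now in range(cycles):
        # 1. retire: an op leaving the pipeline frees its FU
        for i, fu in enumerate(fus):
            if fu is not None and fu["done_at"] == now:
                fus[i] = None
        # 2. reserve: every idle FU takes the next element (shadowed)
        for i, fu in enumerate(fus):
            if fu is None:
                fus[i] = {"elem": next_elem,
                          "known_at": now + lookup,
                          "done_at": None}
                next_elem += 1
        # 3. resolve: known zero bits get "Go Die" before entering the pipe
        for i, fu in enumerate(fus):
            waiting = fu is not None and fu["done_at"] is None
            if waiting and fu["known_at"] <= now:
                if not mask[fu["elem"] % len(mask)]:
                    fus[i] = None
        # 4. launch: the oldest op with a known "one" bit enters the pipe
        ready = [fu for fu in fus
                 if fu is not None
                 and fu["done_at"] is None
                 and fu["known_at"] <= now]
        if ready:
            op = min(ready, key=lambda fu: fu["elem"])
            op["done_at"] = now + depth
            launched.append(True)
        else:
            launched.append(False)
    return launched


# 4-stage pipeline, 4-cycle predicate lookup, all predicate bits set
starved = simulate(n_fus=4, depth=4, lookup=4, mask=[1], cycles=40)
full = simulate(n_fus=8, depth=4, lookup=4, mask=[1], cycles=40)
# same again with every second predicate bit clear
mixed = simulate(n_fus=8, depth=4, lookup=4, mask=[1, 0], cycles=40)

print(sum(starved[4:]), sum(full[4:]), sum(mixed[4:]))
```

in this model, with only as many FUs as pipeline stages the pipeline sits idle
whilst each batch of predicates is looked up, whereas the extra FUs give the
lookup somewhere to happen without starving the pipeline.  how many extra FUs
are needed once zero bits are present depends on the mask and on the
assumptions above.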

whewww

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68