[libre-riscv-dev] [Bug 187] SimpleV vector operation semantics: reading scalar inputs before writing any outputs

Tue Feb 25 01:21:17 GMT 2020

http://bugs.libre-riscv.org/show_bug.cgi?id=187

--- Comment #7 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #6)
> First, the reason to change the semantics:
> 
> It could increase performance a lot (at least several %) by reducing
> compiler-generated copies and reducing register pressure in hot loops (due
> to the compiler not needing to copy the scalar inputs out of the way first).

generally i have found that when suggestions like this come up, the suggestion
does not topologically change or alter the overall complexity of the State
Machine that encompasses the hardware-software combination.

what it does instead is: *move* some of the required resources to maintain that
FSM from hardware into software or vice versa.

the fact that the scalar inputs need copying should tell you that this complex,
resource-intensive task is being requested to be moved - *in full* - into
hardware.

and that is not a good idea.

not least because it is in a key area that is already complex, and we are
under time pressure to meet the Oct 2020 deadline.

> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > first, it breaks the design of SV, entirely.  SV is no longer described as
> > "a macro unrolling for loop at the hardware level"
> > 
> > it would have to be described as, "a hardware forloop except for LD which is
> > now complicated because it reads the address of the first register and uses
> > it as a base address".
> 
> The reading scalar/subvector (henceforth just called scalar) inputs before
> writing any outputs would be applied to all instructions uniformly, not just
> LD.

if this had been raised even three months ago there would be plenty of time to
discuss such a drastic change to SimpleV, leaving me then with the months of
time needed to think through the implications properly and fully.
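
for reference, the two semantics being compared can be sketched roughly as
below. this is only an illustrative model, not the SV spec pseudocode: the
register file is a plain python list, the operation is a vector-plus-scalar
add, and the function names (sv_add_current, sv_add_scalars_first) are
hypothetical. the interesting case is when the scalar input register overlaps
the vector destination.

```python
# Illustrative model only: regs is a flat integer register file.
# Function names and the add operation are assumptions for this sketch.

def sv_add_current(regs, rd, rs1, rs2, vl):
    """Current SV: a plain hardware for-loop, scalar re-read per element."""
    for i in range(vl):
        # rs2 is scalar: read from the regfile on every iteration
        regs[rd + i] = regs[rs1 + i] + regs[rs2]
    return regs

def sv_add_scalars_first(regs, rd, rs1, rs2, vl):
    """Proposed: capture the scalar input once, before any output is written."""
    scalar = regs[rs2]          # the extra "capture" step under discussion
    for i in range(vl):
        regs[rd + i] = regs[rs1 + i] + scalar
    return regs

# scalar lives in r0, vector source in r4..r6, destination r0..r2 (overlaps r0)
initial = [10, 0, 0, 0, 1, 2, 3, 0]
current = sv_add_current(list(initial), rd=0, rs1=4, rs2=0, vl=3)
proposed = sv_add_scalars_first(list(initial), rd=0, rs1=4, rs2=0, vl=3)
```

with the current semantics the first element overwrites r0, so the later
elements see the new value; the compiler avoids this today by copying the
scalar out of the way first, which is the copy the proposal wants to remove.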

> You can consider this conceptually as if all scalar inputs are copied to the
> context-switching state CSRs as an additional operation which executes
> before the hardware for loop, it can be considered to be a new operation
> inserted into the issue queue before the vector element operations. Also,
> the vector element operations conceptually read all the scalar inputs from
> the context-switching state CSRs written by the new inserted operation
> instead of directly from the input registers.

that it is multiple CSRs, not just the one, has me even more concerned.

like i said: the overall state remains the same: pressure on regfile to save
state becomes pressure on CSRs to save state, with the detriment being that the
hardware is made far more complex.

> This can be implemented in HW without any additional delay by sending the
> value of the scalar input registers to the CSRs and to the element
> operations simultaneously, they don't have to wait for the CSR write to
> complete.

which is yet more hardware complexity.

> 
> When resuming from a context-switch with a partially-executed SimpleV
> instruction (vstart != 0), the copy to the CSR is omitted, and the element
> operations just read the scalar inputs from the CSRs.

which is even more complexity in the data routing paths.

when we are already under pressure to implement the first prototype.
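
to show what the extra state looks like, here is a rough sketch of the
proposed resume behaviour as described in the quoted text: the scalar is
captured into a CSR only when vstart is zero, and on resume the elements read
from the CSR instead of the regfile. everything here is an assumption made for
illustration (the SVState class, the single scalar_csr field, and the budget
argument used to fake an interrupt); it is not an implementation from the spec.

```python
from dataclasses import dataclass

@dataclass
class SVState:
    """Hypothetical context-switched state: vstart plus one captured scalar."""
    vstart: int = 0
    scalar_csr: int = 0

def vadd_scalar(regs, state, rd, rs1, rs2, vl, budget=None):
    """Vector + scalar add; returns False if 'interrupted' after budget elements."""
    if state.vstart == 0:
        # fresh instruction: capture the scalar input into the CSR
        state.scalar_csr = regs[rs2]
    # else: resuming, so the capture is skipped and the saved CSR value is used
    done = 0
    while state.vstart < vl:
        if budget is not None and done == budget:
            return False        # simulated context switch mid-instruction
        i = state.vstart
        regs[rd + i] = regs[rs1 + i] + state.scalar_csr
        state.vstart += 1
        done += 1
    state.vstart = 0            # instruction complete
    return True

regs = [10, 0, 0, 0, 1, 2, 3, 0]
state = SVState()
first = vadd_scalar(regs, state, rd=0, rs1=4, rs2=0, vl=3, budget=1)
saved_vstart, saved_csr = state.vstart, state.scalar_csr
second = vadd_scalar(regs, state, rd=0, rs1=4, rs2=0, vl=3)
```

note that scalar_csr has to be saved and restored across the context switch
along with vstart, and that the element operations need a second source for
their scalar operand. those are the extra CSRs and routing paths referred to
above.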

> > this is the point at which i say "that is a bad idea", not least because in
> > the current design it is completely unnecessary, it also complicates the
> > design *and* increases context switch latency *and* increases the number of
> > CSRs.
> > 
> > so, to recap:
> > 
> > * SV as a concept, the simplicity is destroyed.
> 
> It's much less additional conceptual complexity than the register renaming
> tables you had wanted to add.

that turns out to be a misunderstanding given that we were unaware that the
6600's FUs *automatically* provide register renaming.

if i had known that, i wouldn't have gone, "we need reg renaming!" :)

> > * context switch latency is increased
> 
> Yes, but it could increase performance a lot (at least several %) by
> reducing compiler-generated copies and reducing register pressure in hot
> loops. I'm sure that's an acceptable tradeoff for increasing context-switch
> state by 512-bits (worst-case estimate).
> 
> > * hardware complexity is increased
> 
> not by much, a mux to allow substituting the CSR values at the ALU (and
> other ops) inputs and the logic to write to the CSRs.

too much.

as in: it's just too much for me to integrate into the design at this very late
stage, doing what is effectively a full redesign and review of something that
took me nearly eight months to understand and, from the ISA perspective, took
us 18 months to discuss and review.

if we try to implement this, right now, it will set us back about five months.

no, really, that is not an exaggeration.

it would be about 2 weeks of discussion, minimum, for us, doing a full review.

i will need about a month to think about it, to fully understand in my head.

i will also need to write out proper diagrams (floor plans) and study them. two
weeks minimum.

it will take about four to five weeks just to go through the spec, reviewing
each of the types of instructions to add the new proposed stage to.  probably
longer.

then the simulator would need updating (3 weeks).  unit tests written (3
weeks).

then after all that, which is about four months, i will have a clear idea of
how it works, and all the implications.

*then* i will be able to spend time on the hardware, which will be about a
month.

so - five months - because it is such a massive conceptual core redesign,
because it adds an entirely new layer to SV (the capture and storage of scalars
in CSRs).

none of the above steps look unreasonably long.

bottom line, we simply have not got the time at this late phase to do such a
massive intrusive redesign.
