[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Luke Kenneth Casson Leighton lkcl at lkcl.net
Fri Feb 1 09:16:46 GMT 2019


---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Jan 31, 2019 at 8:44 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Thu, Jan 31, 2019, 07:53 Luke Kenneth Casson Leighton <lkcl at lkcl.net
> wrote:
>
> > On Thu, Jan 31, 2019 at 8:09 AM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> > > we could also use some method to encode sign/zero extension for scalar
> > > vl-mul=1 results (have 2 vl-mul=1 encodings in vlp?).
> >
> >  unlike elwidth, zero/sign extending bit comes straight from the
> > opcode, in all cases.  surprisingly, if ".W" it's sign-extend (i had
> > to do a full audit for SV-orig).
> >
> I don't think that will work for the same reason that we can't use OP/OP-32
> as a bit of elwidth:
> if you want a sign-extending scalar xor.b:

 ... ah, the point being: there isn't a scalar sign-extended xor in
RISC-V, therefore there would not be a vectorised one either.  thus,
it would require 2 ops: an xor.b followed by a sign-extended MV.
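 a rough python sketch of the semantics (illustrative only: `xor_b` and
`sext_mv_b` are hypothetical names modelling the two-op sequence, not
real mnemonics):

```python
def xor_b(a: int, b: int) -> int:
    """xor.b: byte-wide xor, result zero-extended (there is no
    scalar sign-extending xor in RISC-V)."""
    return (a ^ b) & 0xFF

def sext_mv_b(x: int, width: int = 64) -> int:
    """sign-extending MV: replicate bit 7 of the byte up to
    `width` bits, giving the sign-extended scalar result."""
    x &= 0xFF
    if x & 0x80:
        x |= ((1 << width) - 1) & ~0xFF
    return x

# the 2-op sequence: xor.b followed by a sign-extended MV
r = sext_mv_b(xor_b(0x0F, 0xF7))   # 0x0F ^ 0xF7 = 0xF8, sign bit set
```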


> Ok. I still can't think of any reason to sign-extend since we would be
> almost always accessing it as a vector of vl-mul elements so
> sign/zero/nan-boxing shouldn't matter.

 that's good, because it's a pain :)

> not implementing sign extension from
> every byte still saves both some gates and instruction encoding space
> (combining with vlp).

 ack.

> We may still need a lot of the gates for sign-extending conversions from
> vectors of i8 to i16/i32/i64 vectors though.

 *sigh* basically, the i8 has to go into the ALUs of width of the
destination.  it's unavoidable.
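 i.e. the widening conversion is, per element, something like this
(a python sketch of the semantics only; `sext8` is a hypothetical
helper, not part of any spec):

```python
def sext8(x: int, width: int) -> int:
    """sign-extend an i8 element into an ALU lane of the
    *destination* elwidth (the i8 src goes into an ALU of the
    destination's width)."""
    x &= 0xFF
    mask = (1 << width) - 1
    return (x | (mask & ~0xFF)) & mask if x & 0x80 else x

# a vector of i8 elements widened to i32 elements
src = [0x01, 0xFF, 0x80, 0x7F]
dst = [sext8(e, 32) for e in src]
# dst == [0x00000001, 0xFFFFFFFF, 0xFFFFFF80, 0x0000007F]
```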


> > > For vector rd, I agree that elements past VL should be left unchanged. It
> > > will make it more difficult for register-renaming/tomasulo
> > implementations
> > > though since they will need to read from rd for unpredicated cases (they
> > > need to read from rd for predicated cases anyway).
> >
> >  it was complicated as hell, however i managed to create a workable
> > register scheme that did not involve overwriting (or even reading) of
> > the end of a register file entry.
> >
> Ok.
>
> >
> >  it requires byte-level write-enable lines, dividing the register file
> > into 32-bit banks, having pairs of those banks shared across 2
> > 32-bit-wide Function Units, and a separate set of Function Units with
> > 8/16-bit ALUs.
> >
> I think we should probably just use SIMD-like micro-ops for elwidth < 32
> since we can still pass a separate byte-write-mask. That way we can save on
> FU count, saving both area and power. The extra energy required by the ALUs
> should be negligible compared to the extra area, complexity, and power used
> by having dedicated 8/16-bit FUs. the 32-bit FUs will just pass the elwidth
> as another input to the ALU. If we have each 32-bit alu write to a separate
> register bank, like originally planned,

 ... which hasn't changed...

> we won't be able to repack 8/16-bit
> operations in almost all cases anyway since the active lanes would be the
> lsb lanes. we would have to match an op with the lsb lanes predicated off
> (uncommon) to be able to repack.

 ok so the way i envisage it is that the routing occurs on the src
side, and there are 4 FUs, one for each byte.

> we will still need 64-bit FUs unless you think implementing matched pairs
> of 32-bit FUs is reasonable.

 they're a necessity, otherwise it gets to be absolute hell.

 bear in mind that FUs are *NOT* the same thing as ALUs.  multiple FUs
share a pipelined ALU (in both the 32 bit FU side as well as the 8-bit
FU side).

 if you have a look at mitch alsup's book chapters, see the diagram on
p38, section 11.4.9.3.  it describes how 4 FUs each with completely
separate src operand and dest result latches share the same pipelined
ALU.

 for 64-bit ops, that would be *pairs* of 32-bit-wide src operands
farming through the same 64-bit ALU out to *pairs* of 32-bit-wide dest
result latches.

 for 8/16-bit ops i expect that pipelines would be completely
unnecessary as it is highly likely for results to be calculable in a
single cycle.  thus, the 8/16-bit ALUs can be duplicated rather than
waste gates on routing.

 in the same vein, the reason why the 8-bit src operands are routed to
the *byte* that they correspond to in the dest reg is so that, again,
gates do not need to be wasted on post-result routing.

 once calculated, the byte (or pair of bytes) may be *directly* thrown
at the register file, along with the corresponding write-enable byte
signal, to write *directly* into the correct byte(s) of the register.
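 the behaviour of that byte-level write-enable is, in a python sketch
(illustrative model only, `regfile_write` is a made-up name):

```python
def regfile_write(reg: int, data: int, byte_we: int) -> int:
    """write `data` into a 64-bit register entry under a per-byte
    write-enable mask: only bytes whose byte_we bit is set are
    updated; all other bytes (e.g. past VL, or predicated out)
    are left completely untouched."""
    result = reg
    for byte in range(8):          # 64-bit register = 8 bytes
        if byte_we & (1 << byte):
            lane = 0xFF << (byte * 8)
            result = (result & ~lane) | (data & lane)
    return result

# write only byte 0: the upper 7 bytes of the register survive
r = regfile_write(0xDEADBEEF_CAFEF00D, 0x42, 0b0000_0001)
# r == 0xDEADBEEF_CAFEF042
```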

 likewise, the same trick may be applied to 32-bit ops (and paired
32-bits for 64-bit ops).  with 4 32-bit banks, this requires that the
src operands be *pre-routed* to the *correct* bank.

 the implications for when scalar-only 64-bit operations are thrown at
the engine are a bit... strange.  if we have 2R1W on all 4 banks, the
maximum number of instructions per clock cycle that can be handled is
*TWO* not four, and only then as long as one of the 2 instructions'
destination registers is odd and the other even.

 if the dest regs of the 2 instructions are both even (or both odd),
*both of them have to go to the same bank*, and consequently the
instructions per cycle reduces to *one*.

 however, when doing 32-bit vectorised ops, the elements hit
independent banks, and we're back up to a rate of 4 32-bit ops per
clock.
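 the throughput maths works out like this (a toy python model;
the mapping of dest reg / element to bank is an assumption for
illustration, e.g. `reg % nbanks`):

```python
from collections import Counter

def ipc(dest_banks):
    """toy model of issue rate limited by dest-register bank
    conflicts: with one write port per bank, the cycles needed
    are bounded by the most-contended bank, so
    IPC = total ops / max ops hitting any single bank."""
    worst = max(Counter(dest_banks).values())
    return len(dest_banks) / worst

# 2 scalar 64-bit ops, odd + even dest regs -> different bank
# pairs: 2 instructions per clock
two_ipc = ipc([0, 1])
# 2 scalar 64-bit ops, both even dest regs -> same bank pair,
# serialised: 1 instruction per clock
one_ipc = ipc([0, 0])
# a 32-bit vectorised op striped across all 4 independent banks:
# back up to 4 32-bit ops per clock
four_ipc = ipc([0, 1, 2, 3])
```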

 yes, it's really odd :)

l.



More information about the libre-riscv-dev mailing list