[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Jacob Lifshay programmerjake at gmail.com
Fri Feb 1 11:01:22 GMT 2019


On Fri, Feb 1, 2019 at 1:17 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
> On Thu, Jan 31, 2019 at 8:44 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > On Thu, Jan 31, 2019, 07:53 Luke Kenneth Casson Leighton <lkcl at lkcl.net
> > wrote:
> >
> > > On Thu, Jan 31, 2019 at 8:09 AM Jacob Lifshay <
> programmerjake at gmail.com>
> > > wrote:
> > > > we could also use some method to encode sign/zero extension for
> scalar
> > > > vl-mul=1 results (have 2 vl-mul=1 encodings in vlp?).
> > >
> > >  unlike elwidth, zero/sign extending bit comes straight from the
> > > opcode, in all cases.  surprisingly, if ".W" it's sign-extend (i had
> > > to do a full audit for SV-orig).
> > >
> > I don't think that will work for the same reason that we can't use
> OP/OP-32
> > as a bit of elwidth:
> > if you want a sign-extending scalar xor.b:
>
>  ... ah, the point being: there isn't a scalar sign-extended xor in
> RISC-V, therefore there would not be a vectorised one either.  thus,
> it would require 2 ops: an xor.b followed by a sign-extended MV.
>
>
> > Ok. I still can't think of any reason to sign-extend since we would be
> > almost always accessing it as a vector of vl-mul elements so
> > sign/zero/nan-boxing shouldn't matter.
>
>  that's good, because it's a pain :)
>
> > not implementing sign extension from
> > every byte still saves both some gates and instruction encoding space
> > (combining with vlp).
>
>  ack.
>
> > We may still need a lot of the gates for sign-extending conversions from
> > vectors of i8 to i16/i32/i64 vectors though.
>
>  *sigh* basically, the i8 has to go into the ALUs of width of the
> destination.  it's unavoidable.
>
>
> > > > For vector rd, I agree that elements past VL should be left
> unchanged. It
> > > > will make it more difficult for register-renaming/tomasulo
> > > implementations
> > > > though since they will need to read from rd for unpredicated cases
> (they
> > > > need to read from rd for predicated cases anyway).
> > >
> > >  it was complicated as hell, however i managed to create a workable
> > > register scheme that did not involve overwriting (or even reading) of
> > > the end of a register file entry.
> > >
> > Ok.
> >
> > >
> > >  it requires byte-level write-enable lines, dividing the register file
> > > into 32-bit banks, having pairs of those banks shared across 2
> > > 32-bit-wide Function Units, and a separate set of Function Units with
> > > 8/16-bit ALUs.
> > >
> > I think we should probably just use SIMD-like micro-ops for elwidth < 32
> > since we can still pass a separate byte-write-mask. That way we can save
> on
> > FU count, saving both area and power. The extra energy required by the
> ALUs
> > should be negligible compared to the extra area, complexity, and power
> used
> > by having dedicated 8/16-bit FUs. the 32-bit FUs will just pass the
> elwidth
> > as another input to the ALU. If we have each 32-bit alu write to a
> separate
> > register bank, like originally planned,
>
>  ... which hasn't changed...
>
Yup.

>
> > we won't be able to repack 8/16-bit
> > operations in almost all cases anyway since the active lanes would be the
> > lsb lanes. we would have to match an op with the lsb lanes predicated off
> > (uncommon) to be able to repack.
>
>  ok so the way i envisage it is that the routing occurs on the src
> side, and there are 4 FUs, one for each byte.
>
I think we should avoid the extra complexity of having a separate FU per
byte; a 32-bit FU is good enough. Since we don't have routing for
individual bytes on the dest side of the ALU, we can only use the byte
lanes that we write to the destination anyway.
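As a concrete illustration of the SIMD-style micro-op idea, here is a
minimal Python model (my own sketch; the function name and the
predicate/packing format are assumptions, not the actual design). A 32-bit
ALU slice processes 32/elwidth lanes at once and emits a per-byte
write-enable mask derived from the predicate, so masked-off lanes leave
the destination bytes untouched:

```python
def simd_alu_op(op, a, b, elwidth_bits, pred):
    """Execute a 32-bit-wide SIMD micro-op.

    a, b: 32-bit packed source operands
    elwidth_bits: 8, 16 or 32 (element width)
    pred: per-lane predicate bits (lane 0 = LSB lanes)
    Returns (result_word, byte_write_mask).
    """
    lanes = 32 // elwidth_bits
    lane_mask = (1 << elwidth_bits) - 1
    bytes_per_lane = elwidth_bits // 8
    result = 0
    byte_we = 0
    for lane in range(lanes):
        shift = lane * elwidth_bits
        if (pred >> lane) & 1:
            val = op((a >> shift) & lane_mask,
                     (b >> shift) & lane_mask) & lane_mask
            result |= val << shift
            # enable the write-back bytes for this lane only
            byte_we |= ((1 << bytes_per_lane) - 1) << (lane * bytes_per_lane)
    return result, byte_we
```

For example, an 8-bit xor with only lanes 0 and 2 predicated on returns a
byte-enable of 0b0101, so only those two destination bytes get written and
the masked-off bytes keep their old values.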

>
> > we will still need 64-bit FUs unless you think implementing matched pairs
> > of 32-bit FUs is reasonable.
>
>  they're a necessity, otherwise it gets to be absolute hell.
>
 I was thinking that ensuring the pairs of 32-bit FUs always issue
simultaneously would be a pain.

>
>  bear in mind that FUs are *NOT* the same thing as ALUs.  multiple FUs
> share a pipelined ALU (in both the 32 bit FU side as well as the 8-bit
> FU side).
>
Yup.

>
>  if you have a look at mitch alsup's book chapters, see the diagram on
> p38, section 11.4.9.3.  it describes how 4 FUs each with completely
> separate src operand and dest result latches share the same pipelined
> ALU.
>
>  for 64-bit ops, that would be *pairs* of 32-bit-wide src operands
> farming through the same 64-bit ALU out to *pairs* of 32-bit-wide dest
> result latches.
>
Operand packing doesn't work for 8/16-bit operations when writing directly
to the register file (as predication requires, since the masked-off lanes
must keep their old values), because the ALUs don't have result routing.

For pairs of 32-bit FUs acting as a 64-bit FU, we need to ensure that they
always execute simultaneously, for things such as carry propagation, even
when one half has all its operands ready but the other half is still
waiting. That is why I thought we might want separate 64-bit FUs.
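To make the carry-propagation point concrete, here is a toy model (my own
illustration, not the actual FU design) of a 64-bit add performed as two
32-bit halves. The upper half cannot complete until the lower half's
carry-out arrives, which is why the paired 32-bit FUs have to proceed in
lockstep:

```python
def add64_as_two_halves(a, b):
    """64-bit add split into paired 32-bit operations with a carry chain."""
    MASK32 = 0xFFFFFFFF
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32                      # carry-out of the low half
    # the high half *depends* on the low half's carry, so the paired
    # FU cannot retire independently of its partner
    hi = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry
    return ((hi & MASK32) << 32) | (lo & MASK32)
```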

>
>  for 8/16-bit ops i expect that pipelines would be completely
> unnecessary as it is highly likely for results to be calculable in a
> single cycle.

Not necessarily: we have fmul for f16, mul for i16/i8, and other slow ops
(div/mod/sqrt/etc.).

>   thus, the 8/16-bit ALUs can be duplicated rather than
> waste gates on routing.


>  in the same vein, the reason why the 8-bit src operands are routed to
> the *byte* that they correspond to in the dest reg is so that, again,
> gates do not need to be wasted on post-result routing.
>
>  once calculated the byte (or pair of bytes) may be *directly* thrown
> at the register file, along with the corresponding write-enable byte
> signal, to write *directly* into the correct byte(s) of the register.
>
Yup. This is why you can't easily pack multiple 4x8 or 2x16 ops into a
single 32-bit ALU: almost all ops use either all lanes or only the lower
lanes, and we don't have the post-ALU circuitry to route the upper lanes
to the lower lanes of a different register.
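A one-line sketch (my own illustration) of that packing constraint:
without post-ALU routing, two sub-32-bit micro-ops can share one 32-bit
ALU pass only if their active byte lanes are disjoint, which almost never
happens, since unpredicated ops use all lanes and short vectors use the
LSB lanes:

```python
def can_pack(byte_we_a, byte_we_b):
    """Two sub-32-bit micro-ops can share one 32-bit ALU pass only if
    their active byte lanes don't overlap: there is no post-ALU routing
    to move a result from one byte lane to another."""
    return (byte_we_a & byte_we_b) == 0
```

Two ops that both use the LSB byte lane collide (`can_pack(0b0011,
0b0001)` is False); packing succeeds only for complementary masks such as
0b0011 and 0b1100, matching the "uncommon" repacking case above.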

>
>  likewise, the same trick may be applied to 32-bit ops (and paired
> 32-bits for 64-bit ops).  with 4 32-bit banks, this requires that the
> src operands be *pre-routed* to the *correct* bank.
>
This is not the same as the 8/16-bit case: 32-bit and 64-bit ops can be
issued together, since the register file can access a different register
in each 32-bit slice.

>
>  the implications for when scalar-only 64-bit operations are thrown at
> the engine are a bit... strange.  if we have 2R1W on all 4 banks, the
> maximum number of instructions per clock cycle that can be handled is
> *TWO* not four, as long as the destination registers of the 2
> instructions are odd and even.
>
The engine is not optimized for scalar ops, so if it can execute 2 scalars
per cycle in a large proportion (50%?) of cases, I'd say we're doing well.
Since we have an out-of-order processor design and can speculate past
branches, we will be able to run appreciably faster than in-order
processors like Rocket or the ARM Cortex-A5, meaning that the 800MHz clock
speed doesn't hobble us as much for non-vector programs.

>
>  if the dest regs of the 2 instructions are both even (or both odd),
> *both of them have to go to the same bank*, and consequently the
> instructions per cycle reduces to *one*.
>
Yeah. Kind of odd.

>
>  however, when doing 32-bit vectorised ops, the elements hit
> independent banks, and we're back up to a rate of 4 32-bit ops per
> clock.
>
>  yes, it's really odd :)
>
:)
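A rough issue-rate model of the banked register file discussed above (my
own sketch; the exact bank-striping rule is an assumption): four 32-bit
banks with one write port each, a 64-bit scalar result occupying a bank
pair selected by the parity of its dest reg, and 32-bit vector elements
striping across all four banks:

```python
def dest_banks(reg, is_64bit):
    """Banks written by a destination. Assumed striping: 64-bit reg r
    occupies 32-bit slots 2r and 2r+1, bank = slot mod 4, so even regs
    hit banks {0,1} and odd regs hit banks {2,3}; a 32-bit element
    write hits a single bank (slot mod 4)."""
    if is_64bit:
        lo = (2 * reg) % 4
        return {lo, lo + 1}
    return {reg % 4}

def writes_per_cycle(dests, is_64bit):
    """Greedy count of how many destination writes can issue in one
    cycle, given one write port per bank."""
    used, issued = set(), 0
    for r in dests:
        banks = dest_banks(r, is_64bit)
        if not (banks & used):
            used |= banks
            issued += 1
    return issued
```

Two 64-bit scalars with even dest regs collide on the same bank pair (one
per clock); an odd/even pair issues both (two per clock); and four
consecutive 32-bit vector element writes hit all four banks (four per
clock), matching the rates described above.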

Jacob


More information about the libre-riscv-dev mailing list