[libre-riscv-dev] Divider pipeline structure

Sun Feb 3 11:21:25 GMT 2019


On Sun, Feb 3, 2019 at 10:11 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> >  we can't share between cores: resource contention == spectre.  *sigh*.
> >
> My proposed scheduling algorithm trivially resolves those problems, since
> you can think of the pipeline as 4 separate pipelines, one assigned to each
> core, each being able to issue an operation every 4 clock cycles.

 that works... it does mean that some [otherwise unnecessary] latency
is introduced (average of 2 cycles).

> Note that this requires each core to only use its own slot, so we can't
> get smart about the scheduling and try to steal another core's slot if it
> doesn't use it.

 agreed / understood.
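
 a minimal sketch of the fixed-slot scheme in python, to make the latency
cost concrete.  the names (NUM_CORES, slot_owner, wait_cycles) are made up
for illustration and are not from any existing code; the only assumptions
are the 4 cores and one-slot-per-core rule quoted above, plus operations
becoming ready at a uniformly random phase.

```python
NUM_CORES = 4  # one fixed issue slot per core, as in the proposal quoted above


def slot_owner(cycle):
    """Core allowed to issue on this cycle (fixed round-robin, no slot stealing)."""
    return cycle % NUM_CORES


def wait_cycles(core, ready_cycle):
    """Cycles a core waits from becoming ready until its own slot comes round."""
    return (core - ready_cycle) % NUM_CORES


# Average the wait over every possible arrival phase for core 0.
waits = [wait_cycles(0, t) for t in range(NUM_CORES)]
average_wait = sum(waits) / len(waits)
```

 under those assumptions the mean wait comes out at 1.5 cycles, which is in
the same ballpark as the "average of 2" estimate.  because a core never
issues in a slot it does not own, its timing does not depend on what the
other cores are doing.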

> I think we should share the divider because it's not used enough that we
> need 6.4 GFlops of f32 divide performance.

 :) awww

> Also, the divider is rather big (32 64-bit subtractors for 32/64-bit
> div/rem only, probably 50% bigger when we add in everything else).

 eek.

> Alternatively, we could have a divider per-core and implement a 4 or
> 8-stage pipeline that uses the same
> feed-instructions-through-multiple-times scheme for wider operations.

 yeah that makes sense.  i wonder if it's possible to check if the
remainder is zero and terminate early (i.e. not need multiple
feed-throughs)
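
 one possible reading of that early-out, as a rough python model rather
than a hardware design: a restoring divider that produces bits_per_pass
quotient bits per feed-through, and stops as soon as the partial remainder
is zero *and* the dividend bits still to be consumed are all zero (the
second condition is an assumption added here; without it the remaining
quotient bits are not known to be zero).  the function name and parameters
are hypothetical.

```python
def divide_in_passes(dividend, divisor, width=64, bits_per_pass=8):
    """Restoring division, MSB first, bits_per_pass quotient bits per feed-through.

    Returns (quotient, remainder, passes_used).
    """
    assert divisor > 0 and 0 <= dividend < (1 << width)
    assert width % bits_per_pass == 0
    quotient = 0
    remainder = 0
    remaining = width  # dividend bits not yet consumed
    passes = 0
    while remaining > 0:
        for _ in range(bits_per_pass):
            remaining -= 1
            bit = (dividend >> remaining) & 1
            remainder = (remainder << 1) | bit
            quotient <<= 1
            if remainder >= divisor:  # trial subtraction succeeds
                remainder -= divisor
                quotient |= 1
        passes += 1
        low_bits = dividend & ((1 << remaining) - 1)
        if remainder == 0 and low_bits == 0:
            quotient <<= remaining  # every remaining quotient bit is zero
            break
    return quotient, remainder, passes
```

 for example, (6 << 56) / 3 finishes after a single pass in this model,
whereas 100 / 7 needs all 8.  whether the zero-detect logic is worth its
area and routing is not something this sketch answers.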

> >  basically, FU to ALU-function (ADD, MUL etc.) is not a one-to-one
> > relationship, it's a many-to-many relationship with duplicated ADDs,
> > duplicated MULs, and in this way (a) you don't get resource
> > bottlenecks and (b) the amount of data routing (which is extremely
> > costly) is reduced.
> >
> I was envisioning having a (possibly shared) div/rem/sqrt/rsqrt 64-bit wide
> ALU

 the case for that makes sense.

> and an everything-else 128-bit wide ALU that has 2 outputs for slow
> (like mul) and fast (like xor) operations (ignoring load/store, CSR, and
> other misc instructions).

 lost me :)  can you draw it (or ASCII-art it)?

> I think it would work better with 1 output and 2 inputs, the slow and fast
> inputs. Basically, a fast instruction can start unless the slow instruction
> needs the last ALU stage. this prevents stalling older instructions, since
> they are already partially executed.
>
> The 128-bit ALU would have 4 separate micro-op inputs, one for each 32-bit
> portion.
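
 to check the fast/slow input rule as i understand it, a small python
model.  everything here is an assumption for illustration: a slow op is
taken to occupy SLOW_STAGES stages, a fast op is taken to use only the last
stage, and the slow op always wins the last stage because it is already
partially executed.

```python
SLOW_STAGES = 3  # hypothetical depth: a slow op uses stages 0..2, a fast op only stage 2


def simulate(slow_issue_cycles, fast_request_cycles, num_cycles):
    """Return {request_cycle: cycle on which that fast op got the last stage}."""
    # Cycles on which a previously issued slow op sits in the last stage.
    last_stage_busy = {c + SLOW_STAGES - 1 for c in slow_issue_cycles}
    pending = []  # fast ops waiting to start, oldest first
    started = {}
    for cycle in range(num_cycles):
        if cycle in fast_request_cycles:
            pending.append(cycle)
        # A fast op may start unless the slow op needs the last stage now.
        if pending and cycle not in last_stage_busy:
            started[pending.pop(0)] = cycle
    return started


result = simulate(slow_issue_cycles=[0], fast_request_cycles=[1, 2], num_cycles=6)
```

 with a slow op issued on cycle 0, the fast op requested on cycle 1 starts
immediately, while the one requested on cycle 2 is held for one cycle
because the slow op reaches the last stage then.  the slow op itself is
never stalled.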