[libre-riscv-dev] Divider pipeline structure
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Sun Feb 3 11:21:25 GMT 2019
---
On Sun, Feb 3, 2019 at 10:11 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> > we can't share between cores: resource contention == spectre. *sigh*.
> >
> My proposed scheduling algorithm trivially resolves those problems, since
> you can think of the pipeline as 4 separate pipelines, one assigned to each
> core, each being able to issue an operation every 4 clock cycles.
that works... it does mean that some [otherwise unnecessary] latency
is introduced (average of 2 cycles).
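to make the latency cost concrete, here is a minimal sketch of a
fixed-slot schedule in python. everything in it is an assumption for
illustration (the names issue_cycle and average_wait, and the model
itself): core N may only issue on cycles where cycle % 4 == N, and an
op may issue in the same cycle it becomes ready.

```python
def issue_cycle(core, ready_cycle, n_cores=4):
    """First cycle >= ready_cycle that belongs to this core's slot.

    Hypothetical model: core N owns every cycle where cycle % n_cores == N.
    """
    wait = (core - ready_cycle) % n_cores
    return ready_cycle + wait


def average_wait(n_cores=4):
    """Mean wait for core 0 when an op is equally likely to become
    ready at any phase of the n_cores-cycle rotation."""
    waits = [issue_cycle(0, t, n_cores) - t for t in range(n_cores)]
    return sum(waits) / n_cores
```

under this model the waits are 0, 3, 2 and 1 cycles depending on the
phase, so the mean lands in the same ballpark as the figure above. the
exact number depends on whether an op is allowed to issue in the cycle
it becomes ready.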
> Note that this requires each core to only use its own slot, so we can't
> get smart about the scheduling and try to steal another core's slot if it
> doesn't use it.
agreed / understood.
> I think we should share the divider because it's not used enough that we
> need 6.4 GFlops of f32 divide performance.
:) awww
> Also, the divider is rather big (32 64-bit subtractors for 32/64-bit
> div/rem only, probably 50% bigger when we add in everything else).
eek.
> Alternatively, we could have a divider per-core and implement a 4 or
> 8-stage pipeline that uses the same
> feed-instructions-through-multiple-times scheme for wider operations.
yeah that makes sense. i wonder if it's possible to check if the
remainder is zero and terminate early (i.e. not need multiple
feed-throughs)?
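a rough sketch of that early-termination idea, in python. this is an
assumed model, not the actual divider: a plain shift-and-subtract
loop producing a fractional quotient (so it requires n < d), with the
work split into passes of bits_per_pass bits each. the function name
and parameters are made up for illustration.

```python
def divide_multipass(n, d, total_bits=32, bits_per_pass=8):
    """Return (floor((n << total_bits) / d), passes_used).

    Hypothetical model of feeding an operation through a short
    pipeline several times. Requires 0 <= n < d.
    """
    if not 0 <= n < d:
        raise ValueError("this sketch assumes 0 <= n < d")
    if total_bits % bits_per_pass != 0:
        raise ValueError("total_bits must be a multiple of bits_per_pass")

    quotient, remainder = 0, n
    bits_done = 0
    passes = 0
    while bits_done < total_bits:
        # One trip through the pipeline: bits_per_pass quotient bits.
        for _ in range(bits_per_pass):
            remainder <<= 1
            quotient <<= 1
            if remainder >= d:
                remainder -= d
                quotient |= 1
        bits_done += bits_per_pass
        passes += 1
        if remainder == 0:
            # Every remaining quotient bit would be 0, so skip them.
            quotient <<= total_bits - bits_done
            break
    return quotient, passes
```

for example, 1/4 finishes in one pass because the remainder hits zero
after two bits, whereas 1/3 never terminates early and needs all four
passes. one thing to keep in mind is that this makes the latency
data-dependent.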
> > basically, FU to ALU-function (ADD, MUL etc.) is not a one-to-one
> > relationship, it's a many-to-many relationship with duplicated ADDs,
> > duplicated MULs, and in this way (a) you don't get resource
> > bottlenecks and (b) the amount of data routing (which is extremely
> > costly) is reduced.
> >
> I was envisioning having a (possibly shared) div/rem/sqrt/rsqrt 64-bit wide
> ALU
the case for that makes sense.
> and an everything-else 128-bit wide ALU that has 2 outputs for slow
> (like mul) and fast (like xor) operations (ignoring load/store, CSR, and
> other misc instructions).
lost me :) can you draw it (or ASCII-art it)?
> I think it would work better with 1 output and 2 inputs, the slow and fast
> inputs. Basically, a fast instruction can start unless the slow instruction
> needs the last ALU stage. this prevents stalling older instructions, since
> they are already partially executed.
>
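one possible reading of the slow/fast arrangement, sketched in python.
the model is an assumption, not the proposed design: slow ops take
`depth` stages and always proceed, fast ops only need the last stage,
and a fast op is held back on any cycle where a slow op needs that
last stage. schedule_fast_ops is a hypothetical name.

```python
def schedule_fast_ops(slow_issue_cycles, fast_request_cycles, depth=3):
    """Return the cycle at which each fast op (in request order,
    earliest first) actually gets the last ALU stage.

    Assumed model: a slow op issued at cycle c occupies the last stage
    at cycle c + depth - 1 and is never stalled.
    """
    # Cycles where an in-flight slow op needs the last stage.
    taken = {c + depth - 1 for c in slow_issue_cycles}
    starts = []
    for request in sorted(fast_request_cycles):
        cycle = request
        # Wait until the last stage is free of slow ops and of
        # earlier fast ops.
        while cycle in taken:
            cycle += 1
        taken.add(cycle)
        starts.append(cycle)
    return starts
```

with one slow op issued at cycle 0 and depth=3, the last stage is busy
at cycle 2, so fast requests at cycles 1, 2 and 3 get the stage at
cycles 1, 3 and 4. the slow op itself never waits.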
> The 128-bit ALU would have 4 separate micro-op inputs, one for each 32-bit
> portion.
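a small python sketch of what "one micro-op per 32-bit portion" could
look like. the op names and the alu128 function are assumptions for
illustration only: each 32-bit lane of the 128-bit operands gets its
own operation, and nothing (e.g. carries) crosses lane boundaries.

```python
MASK32 = 0xFFFFFFFF

# Hypothetical micro-op set; each works on one 32-bit lane.
OPS = {
    "add": lambda a, b: (a + b) & MASK32,
    "xor": lambda a, b: a ^ b,
    "and": lambda a, b: a & b,
}


def alu128(a, b, micro_ops):
    """Apply micro_ops[i] to 32-bit lane i of the 128-bit values a, b.

    Lane 0 is the least significant 32 bits.
    """
    if len(micro_ops) != 4:
        raise ValueError("expected one micro-op per 32-bit lane")
    result = 0
    for lane, op in enumerate(micro_ops):
        shift = 32 * lane
        lane_a = (a >> shift) & MASK32
        lane_b = (b >> shift) & MASK32
        result |= OPS[op](lane_a, lane_b) << shift
    return result
```

in this sketch an add that overflows in one lane simply wraps, so the
neighbouring lane is unaffected.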