[libre-riscv-dev] Divider pipeline structure

Sun Feb 3 10:11:31 GMT 2019

On Fri, Feb 1, 2019 at 7:20 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Sat, Feb 2, 2019 at 1:19 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > I propose having a radix-4 pipelined div/rem/sqrt/rsqrt unit with 16 stages
> > (plus a few for fp rounding and misc stuff) that is 64 bits wide and can be
> > partitioned into 2x32, 4x16, and 8x8
>
>  like it.  the partitioning is going to need to be a general
> requirement for all ALUs (the 8/16 ALUs are for non-radix-aligned
> "finishing of vectors").
>
> > with the plan being that it can be
> > shared between 2 or 4 cores
>
>  we can't share between cores: resource contention == spectre.  *sigh*.
>
My proposed scheduling algorithm trivially resolves those problems, since
you can think of the pipeline as 4 separate pipelines, one assigned to each
core, each being able to issue an operation every 4 clock cycles.

| Cycle  | 0      | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | ... |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|
| Pipe 0 | Core 0 | Empty  | Empty  | Empty  | Core 0 | Empty  | Empty  | Empty  | Core 0 | ... |
| Pipe 1 | Empty  | Core 1 | Empty  | Empty  | Empty  | Core 1 | Empty  | Empty  | Empty  | ... |
| Pipe 2 | Empty  | Empty  | Core 2 | Empty  | Empty  | Empty  | Core 2 | Empty  | Empty  | ... |
| Pipe 3 | Empty  | Empty  | Empty  | Core 3 | Empty  | Empty  | Empty  | Core 3 | Empty  | ... |

We're just saving HW, but have the exact same scheduling algorithm:

| Cycle | 0      | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | ... |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|
| Pipe  | Core 0 | Core 1 | Core 2 | Core 3 | Core 0 | Core 1 | Core 2 | Core 3 | Core 0 | ... |

This works as long as we have a simple pipeline, where each operation
progresses to the end, not affecting previous or later stages.

Note that this requires each core to only use its own slot, so we can't
get smart about the scheduling and try to steal another core's slot if it
doesn't use it.

Operations that need to go through the pipeline more than once can take
their own core's slot again when they get to the end of the pipeline.
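
To make the slot scheme concrete, here is a minimal Python sketch of it. It
is only a model under my own assumptions, not a proposed implementation: the
names (slot_owner, simulate) are made up for illustration, and it assumes the
pipeline depth is a multiple of the number of cores, so that an operation
leaving the last stage lands back on its own core's slot (16 stages and 4
cores satisfy that).

```python
# Hypothetical model of the fixed time-slot scheme; all names are illustrative.
NUM_CORES = 4
PIPELINE_DEPTH = 16  # assumed to be a multiple of NUM_CORES


def slot_owner(cycle, num_cores=NUM_CORES):
    """Core that owns the issue slot on this cycle (fixed round-robin)."""
    return cycle % num_cores


def simulate(requests, num_cycles, depth=PIPELINE_DEPTH, num_cores=NUM_CORES):
    """Simulate issue slots.

    requests maps core -> list of pass counts, one entry per pending op.
    Returns a list of (cycle, core, passes_left_after_this_pass) events.
    """
    pending = {core: list(ops) for core, ops in requests.items()}
    in_flight = {}  # exit cycle -> (core, remaining passes)
    events = []
    for cycle in range(num_cycles):
        core = slot_owner(cycle, num_cores)
        if cycle in in_flight:
            # An op leaving the last stage re-enters in its own core's slot.
            owner, passes = in_flight.pop(cycle)
            assert owner == core  # holds because depth % num_cores == 0
        elif pending.get(core):
            passes = pending[core].pop(0)
        else:
            continue  # slot stays empty; no other core may use it
        events.append((cycle, core, passes - 1))
        if passes > 1:
            in_flight[cycle + depth] = (core, passes - 1)
    return events
```

For example, simulate({0: [2], 1: [1]}, 20) issues core 0's two-pass
operation on cycles 0 and 16 and core 1's operation on cycle 1. The cycles on
which core 0 issues do not change when core 1's requests change, which is the
property that matters for the contention argument.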

>
> > and would support 64-bit operations by passing
> > through the pipeline twice?
>
>  funny, that's how MIPS advocated a special variant of DIV that would
> do 12-bit accuracy (good enough for 3D), and could be "finished off"
> with a 2nd instruction to better accuracy.
>
It would still be only one instruction; it would just go through the
pipeline twice.

>
> > It would implement:
> > - div/rem for i8, i16, i32, and i64
> > - fdiv/fsqrt/frsqrt for f16, f32, and f64
> > - maybe fmod/frem for f16, f32, and f64
> >
> > If needed, we could have sqrt/rsqrt be radix-2 and take 2 trips through the
> > pipeline for fp32, 1 for fp16, and 4 for fp64.
> >
> > If shared between 4 cores, it would still have a 32-bit throughput of 1/2
> > operation per clock per core, which is sufficient.
>
>  given that routing tends to be more costly (gates and power-wise)
> than the ALUs that data is routed to, and given that spectre is based
> around resource contention, i'm inclined towards not having shared
> ALUs.
>
I think we should share the divider, because division isn't used often
enough to justify 6.4 GFlops of f32 divide performance.
Also, the divider is rather big (32 64-bit subtractors for 32/64-bit
div/rem only, probably 50% bigger when we add in everything else).

Alternatively, we could have a divider per-core and implement a 4 or
8-stage pipeline that uses the same
feed-instructions-through-multiple-times scheme for wider operations.
This is actually starting to seem like the best option, because it will
have lower latency for vector division operations.

This would give us (assuming 4 cores, an 800MHz clock, and 8-stage 64-bit
wide radix-4 pipelines):
- 12.8 GFlops of f16 fdiv
- 3.2 GFlops of f32 fdiv
- 0.8 GFlops of f64 fdiv
- 51.2 Giops of i8 combined div/rem
- 25.6 Giops of i16 combined div/rem
- 6.4 Giops of i32 combined div/rem
- 1.6 Giops of i64 combined div/rem

For a 4-stage pipeline, every figure except i8 would be halved (i8 is
unaffected, since 4 radix-4 stages already give 8 bits).
We shouldn't just switch to a bunch of serial dividers because that just
recreates the routing problem.
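
The figures above can be reproduced with a few lines of Python. This is a
back-of-the-envelope sketch, and the function name and parameters are mine.
It assumes each radix-4 stage produces 2 result bits, that an operation needs
as many result bits as its element width (a simplification for fp, where the
mantissa is narrower), and that a combined div/rem counts as two integer
results per operation, which is what makes the integer numbers come out as
quoted.

```python
import math


def ops_per_second(elem_bits, results_per_op=1, cores=4,
                   clock_hz=800_000_000, stages=8, width_bits=64,
                   bits_per_stage=2):
    """Aggregate results per second across all cores (illustrative model)."""
    lanes = width_bits // elem_bits               # SIMD partitions per pass
    bits_per_pass = stages * bits_per_stage       # radix-4: 2 bits per stage
    passes = math.ceil(elem_bits / bits_per_pass) # trips through the pipeline
    return cores * clock_hz * lanes * results_per_op // passes


# Floating-point divide: one result per operation.
for name, bits in (("f16", 16), ("f32", 32), ("f64", 64)):
    print(name, ops_per_second(bits) / 1e9, "GFlops")

# Integer combined div/rem: assumed to count as two results per operation.
for name, bits in (("i8", 8), ("i16", 16), ("i32", 32), ("i64", 64)):
    print(name, ops_per_second(bits, results_per_op=2) / 1e9, "Giops")
```

Passing stages=4 halves every figure except i8, and passing cores=1 with
stages=16 gives the shared 16-stage case.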

>
>  mitch alsup pointed out that the processors he designed, he always
> made sure that every ALU had an adder, for example, and thus every FU
> would have an adder, such that it did not matter which FU the data
> went to (as far as ADD is concerned).  page 15-18 of the 2nd chapter
> of his book covers the analysis of the percentage of instructions.
>
>  basically, FU to ALU-function (ADD, MUL etc.) is not a one-to-one
> relationship, it's a many-to-many relationship with duplicated ADDs,
> duplicated MULs, and in this way (a) you don't get resource
> bottlenecks and (b) the amount of data routing (which is extremely
> costly) is reduced.
>
I was envisioning having a (possibly shared) div/rem/sqrt/rsqrt 64-bit wide
ALU and an everything-else 128-bit wide ALU that has 2 outputs for slow
(like mul) and fast (like xor) operations (ignoring load/store, CSR, and
other misc instructions).
I think it would work better with 1 output and 2 inputs, the slow and fast
inputs. Basically, a fast instruction can start unless the slow instruction
needs the last ALU stage. This prevents stalling older instructions, since
they are already partially executed.
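
As a rough illustration of that arbitration rule, here is a small Python
model. It is one possible reading, not a design: the 3-stage depth, the
function name step, and the idea that fast operations enter directly at the
last stage are all assumptions of mine.

```python
SLOW_STAGES = 3  # hypothetical depth of the slow path


def step(stages, slow_op=None, fast_op=None):
    """Advance the ALU by one clock cycle.

    stages[0] is the first slow stage and stages[-1] is the shared last
    stage. Returns (new_stages, result, fast_accepted).
    """
    result = stages[-1]     # whatever sat in the last stage leaves the ALU
    advancing = stages[-2]  # slow op that needs the last stage this cycle
    # The older, partially executed slow op always wins the last stage.
    fast_accepted = fast_op is not None and advancing is None
    last = advancing if advancing is not None else (
        fast_op if fast_accepted else None)
    new_stages = [slow_op] + stages[:-2] + [last]
    return new_stages, result, fast_accepted


stages = [None] * SLOW_STAGES
stages, out, ok = step(stages, slow_op="mul")  # mul enters the slow input
stages, out, ok = step(stages, fast_op="xor")  # last stage is free: accepted
stages, out, ok = step(stages, fast_op="and")  # mul needs it: and must wait
```

In this model the slow input never stalls, and a refused fast operation is
simply offered again on the next cycle.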

The 128-bit ALU would have 4 separate micro-op inputs, one for each 32-bit
portion.

Jacob Lifshay