[libre-riscv-dev] buffered pipeline

Jacob Lifshay programmerjake at gmail.com
Wed Mar 13 08:27:02 GMT 2019


note that in my pipeline stage design, succ_accepting to pred_accepting
doesn't go through a flip-flop so it isn't delayed a clock cycle, meaning
that a stage can block all predecessor stages in a single clock cycle,
eliminating the need to have extra stage registers.
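
As a rough illustration of that combinational path, here is a minimal Python sketch (not project code) that ripples the quoted equation pred_accepting = ~data_valid | succ_accepting backwards through a chain of stages within a single cycle. The function name accepting_chain and the list-of-booleans representation are assumptions made for the example.

```python
def accepting_chain(data_valid, final_succ_accepting):
    """Return pred_accepting for every stage, first stage at index 0.

    The value is computed from the last stage backwards with no registers
    in between, which models the un-flopped *_accepting path.
    """
    accepting = [False] * len(data_valid)
    succ = final_succ_accepting
    for i in reversed(range(len(data_valid))):
        accepting[i] = (not data_valid[i]) or succ
        succ = accepting[i]  # this stage's pred_accepting feeds the stage before it
    return accepting


full = [True, True, True, True]
print(accepting_chain(full, False))  # [False, False, False, False]
print(accepting_chain(full, True))   # [True, True, True, True]
# An empty stage (a bubble) absorbs the stall for everything before it.
print(accepting_chain([True, False, True, True], False))  # [True, True, False, False]
```

The loop is also where the fan-in concern comes from: each extra stage adds one more gate to the same combinational chain.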

I didn't include the table in the email, but I did check all combinations
of succ_accepting, pred_sending, and data_valid and it works just fine.
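
The table itself is not in this email, so the following is only a sketch of how such a check could be enumerated, using the three assign statements from the quoted Verilog. It covers the control signals only, not the data register, and the helper names stage_outputs and truth_table are invented for the example.

```python
from itertools import product


def stage_outputs(succ_accepting, pred_sending, data_valid):
    """Combinational outputs, copied from the quoted assign statements."""
    succ_sending = data_valid
    pred_accepting = (not data_valid) or succ_accepting
    next_data_valid = pred_sending or ((not succ_accepting) and data_valid)
    return succ_sending, pred_accepting, next_data_valid


def truth_table():
    """All 8 input combinations, each followed by the three outputs."""
    rows = []
    for sa, ps, dv in product([False, True], repeat=3):
        rows.append((sa, ps, dv) + stage_outputs(sa, ps, dv))
    return rows


print("succ_acc pred_snd valid | succ_snd pred_acc next_valid")
for row in truth_table():
    print(" ".join(str(int(v)) for v in row))
```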

I'm assuming our pipelines are going to be short enough that we won't need
to start worrying about the fan-in on the gates in the *_accepting path.

Jacob

On Tue, Mar 12, 2019, 19:38 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Tue, Mar 12, 2019 at 3:11 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > the strategy I'm planning on using for the simple barrel processor is
> > just to have the pipeline never stop, if we encounter a reason an
> > instruction can't proceed in the current cycle, it is shunted into a
> > delay pipeline to be retried the next time around.
>
>  dan's post contains some other strategies that may help here.  i will
> be implementing the IEEE754 FPU pipeline as a non-stoppable design
> (potentially adding detection to see if anything is in any stage, and
> stopping the whole pipe if it isn't), with a variation of the
> single-stage buffered pipe to take *multiple* inputs (multiple strobe
> lines) and multiplex a given input group to the output (along with its
> multiplexer ID).
>
>  dan, this is probably extremely similar to wishbone or AXI N-to-1 bus
> arbitration.
>
>  that's what this is about:
>
> https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/add/nmigen_add_experiment.py;h=f53037d1a88c912566cd13fd32db1945346a1751;hb=HEAD#l81
>
>  except... due to using john dawson's STB/ACK strategy, it can only
> handle one incoming set of operands every 2 clock cycles.
>
>  my point is, jacob: to handle the delay-shunting you'll almost
> certainly need to deploy the exact same strategy (and hence could use
> exactly the code that i am writing).
>
>  the requirements of a barrel processor (with a delay phase) are:
>
>  * to have a round-robin test of whether an instruction shall be
> passed into the pipeline
>  * to have no delays except if an instruction cannot proceed
>  * if an instruction cannot proceed, it must not be lost (buffered)
>  * all other instructions must continue unaffected
>  * on detection of no longer being busy, the buffered instruction must
> rejoin the round-robin scheduling
>  * it must be possible for MULTIPLE instructions to be busy (and buffered).
>
> so you need an *array* of instruction store/delay buffers, an *array*
> of STB and BUSY lines to look after them, where unstalled instructions
> are to be multiplexed to a single output of data, STB, and BUSY.
>
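
The round-robin requirement in the list above can be sketched in a few lines of Python. This is an illustrative model, not the code under discussion: the function name round_robin_select, the per-slot stb/busy lists, and the convention that the search starts just after the last granted slot are all assumptions.

```python
def round_robin_select(stb, busy, last):
    """Pick the next slot after `last` that has data (stb) and is not busy.

    Returns the slot index, or None if no slot can proceed this cycle.
    Busy slots are skipped but keep their contents, so nothing is lost.
    """
    n = len(stb)
    for offset in range(1, n + 1):
        i = (last + offset) % n
        if stb[i] and not busy[i]:
            return i
    return None


stb = [True, True, False, True]
busy = [False, True, False, False]
print(round_robin_select(stb, busy, last=0))  # 3 (slot 1 is busy, slot 2 is empty)
print(round_robin_select(stb, busy, last=3))  # 0

# With every slot ready, the grant rotates: 1, 2, 3, 0.
last = 0
for _ in range(4):
    last = round_robin_select([True] * 4, [False] * 4, last)
    print(last)
```
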
> that's *exactly* what i am working on, right now.
>
> the code that i'm writing specifically meets these very precise
> requirements, with the exception that i am using a priority encoder
> instead of a round-robin selection strategy.
>
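
For comparison, a priority-encoder selection over the same kind of stb/busy arrays might look like the sketch below. The name priority_select and the "lowest index wins" ordering are assumptions for illustration; the actual code may order its inputs differently.

```python
def priority_select(stb, busy):
    """Return the lowest-numbered slot that has data and is not busy, else None."""
    for i, (s, b) in enumerate(zip(stb, busy)):
        if s and not b:
            return i
    return None


print(priority_select([True, True, False, True], [False, True, False, False]))  # 0
print(priority_select([True, True, False, True], [True, True, False, False]))   # 3

# If slot 0 always has data and is never busy, it wins every cycle,
# which is the fairness difference from a round-robin scheme.
print([priority_select([True] * 4, [False] * 4) for _ in range(3)])  # [0, 0, 0]
```
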
>
> > For stallable pipelines, I think we should name the pipeline control
> > signals pred_sending, succ_sending, pred_accepting and succ_accepting.
>
>  funnily enough i added prefix letters as the first thing when writing
> the first unit test, i named them i_p_stb, o_n_stb, o_p_busy and
> i_n_busy, and wrote this ascii art which is now in the docstring:
>
>         stage-1   i_p_stb  >>in   stage   o_n_stb  out>>   stage+1
>         stage-1   o_p_busy <<out  stage   i_n_busy <<in    stage+1
>         stage-1   i_data   >>in   stage   o_data   out>>   stage+1
>                               |             |
>                               +------->  process
>                               |             |
>                               +-- r_data ---+
>
>  the shortened names need a second's thought, however i believe
> they're clear, and, crucially, do not result in line-wrap to use them.
> also, "STB" for "Strobe" is a standard hardware convention
> synchronously indicating "data ready right now".
>
> > A simple example stage:
> >
> > module stage(clk, rst, pred_sending, pred_accepting, pred_data,
> > succ_sending, succ_accepting, succ_data);
> >     input clk;
> >     input rst;
> >     input pred_sending;
> >     output pred_accepting;
> >     input [63:0] pred_data;
> >     output succ_sending;
> >     input succ_accepting;
> >     output [63:0] succ_data;
> >
> >     reg data_valid;
> >     reg [63:0] data;
> >     wire next_data_valid;
> >
> >     assign succ_sending = data_valid;
> >     assign pred_accepting = ~data_valid | succ_accepting;
> >     assign next_data_valid = pred_sending | (~succ_accepting & data_valid);
> >
> >     assign succ_data = data + 1; // stage operation
> >
> >     initial data_valid = 0;
> >     initial data = 0;
> >
> >     always @(posedge clk or posedge rst) begin
> >         if(rst) begin
> >             data_valid <= 0;
> >             data <= 0;
> >         end
> >         else begin
> >             data_valid <= next_data_valid;
> >             data <= pred_data;
> >         end
> >     end
> > endmodule
>
>  from what i understand, data will be lost, here, under certain
> conditions. or, it will be sub-optimal (result in unnecessary delays).
> i'm not skilled enough in logic analysis to identify which.
>
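
One way to probe the question above is a cycle-level Python model of the quoted module. The sketch below follows the quoted Verilog as written: data is loaded from pred_data on every clock, the "+ 1" stage operation is left out, and reset is not modelled. The gate_data option is a hypothetical variant, not something proposed in this thread, that only loads data when a transfer actually happens (pred_sending and pred_accepting both high).

```python
class Stage:
    """Cycle-level model of the quoted stage (control plus data register)."""

    def __init__(self, gate_data=False):
        self.data_valid = False
        self.data = 0
        self.gate_data = gate_data  # hypothetical load-enable, not in the quoted code

    def pred_accepting(self, succ_accepting):
        return (not self.data_valid) or succ_accepting

    def clock(self, pred_sending, pred_data, succ_accepting):
        accepting = self.pred_accepting(succ_accepting)
        next_valid = pred_sending or ((not succ_accepting) and self.data_valid)
        # As quoted, data is overwritten unconditionally on every clock edge.
        if not self.gate_data or (pred_sending and accepting):
            self.data = pred_data
        self.data_valid = next_valid


def run(gate_data):
    s = Stage(gate_data)
    # Cycle 1: stage is empty, so it accepts item 10.
    s.clock(pred_sending=True, pred_data=10, succ_accepting=False)
    # Cycle 2: successor is still not accepting; predecessor now offers 11.
    s.clock(pred_sending=True, pred_data=11, succ_accepting=False)
    return s.data_valid, s.data


print(run(gate_data=False))  # (True, 11): item 10 was replaced before being taken
print(run(gate_data=True))   # (True, 10): the stalled item is held
```

In this model the control signals stall correctly in both cases; the difference is only in whether the data register holds its value during the stall.
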
>  dan's original post makes it clear that there are 4 cases involved
> (it's not quite as straightforward as it first appears).  there's a
> situation where the input has valid data (and the next stage is busy
> so a stall must happen), yet because this is all based on clocks,
> there's not yet been an opportunity to *tell* the input "please stop
> sending".
>
>  so due to that one-clock delay where you are *going* to tell the
> input "please stop sending", you absolutely must buffer the input
> data, otherwise it's irrevocably lost.  at the same time, you tell the
> input that on the next clock, "please stop sending".
>
>  now, when the next stage is no longer busy, the processing must
> "flip" to process the *stored* data, *not* the incoming data.  the
> stage's attention is therefore effectively multiplexed between the
> input and the buffer.
>
>  in other words it's quite a complex state machine, for such a
> seemingly-innocuously-simple set of requirements.
>
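
The buffering behaviour described above can be sketched as a small Python cycle model. This is an interpretation of the description, not the nmigen code in the repository: the class name BufferedStage, the r_stb flag marking the buffer as full, and the choice to drive o_p_busy directly from r_stb are assumptions, and the "process" step is omitted so data passes through unchanged.

```python
class BufferedStage:
    """Single-stage buffered pipe: output register plus one spare buffer."""

    def __init__(self):
        self.o_n_stb = False  # output register holds valid data
        self.o_data = None
        self.r_stb = False    # spare buffer holds valid data
        self.r_data = None

    @property
    def o_p_busy(self):
        # Registered busy: the predecessor only sees it one clock later.
        return self.r_stb

    def clock(self, i_p_stb, i_data, i_n_busy):
        incoming = i_p_stb and not self.r_stb
        if not self.o_n_stb or not i_n_busy:
            # Output register is free or being consumed this cycle.
            if self.r_stb:
                # Drain the stored data first, not the incoming data.
                self.o_n_stb, self.o_data = True, self.r_data
                self.r_stb = False
            elif incoming:
                self.o_n_stb, self.o_data = True, i_data
            else:
                self.o_n_stb = False
        elif incoming:
            # Next stage is busy and the input did not know yet: buffer it.
            self.r_stb, self.r_data = True, i_data


def run(items, busy_pattern):
    """Drive the stage from a list of items; return what the next stage receives."""
    stage, pending, received = BufferedStage(), list(items), []
    for i_n_busy in busy_pattern:
        i_p_stb = bool(pending)
        i_data = pending[0] if pending else None
        if stage.o_n_stb and not i_n_busy:
            received.append(stage.o_data)
        accepted = i_p_stb and not stage.o_p_busy
        stage.clock(i_p_stb, i_data, i_n_busy)
        if accepted:
            pending.pop(0)
    return received


print(run([1, 2, 3], [False, True, True, False, False, False, False]))  # [1, 2, 3]
```

In the run above, item 2 arrives while the next stage is busy and lands in the spare buffer; when the stall ends, the stage sends the buffered item before taking new input, so order is preserved and nothing is dropped.
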
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev
>

