[libre-riscv-dev] buffered pipeline

Wed Mar 20 10:32:33 GMT 2019

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Wed, Mar 20, 2019 at 8:52 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Wed, Mar 20, 2019, 00:58 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On Wed, Mar 20, 2019 at 7:34 AM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> > >
> > > I'm in the midst of converting the pipeline stages I wrote to using
> > Record.
> >
> >  jacob, please: i already said that dependence on a specific class or
> > data structure is undesirable, and restricts options as well as
> > increases code size.  removal of SignalGroup and replacing it with
> > Record will not solve that particular problem.
> >
> >  i've also *already developed* an example that uses Record, and shown
> > that the pipeline infrastructure and the API associated with it can
> > support the use of Records.
> >
> I created some record utility functions:
>
> visit_records function that calls a callback once for each signal in a
> record that behaves like zip() does on iterators
>
> connect_records: creates a list of assignments, can be told to assign in to
> out or out to in.
>
> functions that manipulate and check the direction of all signals in a
> Layout.
>
> a function to create a reversed Layout

 sounds handy.  will likely be very useful in the pinmux, because it's
looking like Record is a good way to specify things like SPI, SD/MMC,
SDRAM buses and so on.


>
>
> Yeah, I had seen that. I may have gotten a little carried away, because I
> think the API of my redesigned stage classes is really nice:

 appreciated... please do bear in mind we're running the project along
the lines of unanimous consensus (which i'm required to abide by just
as much as anyone else).  and it's making me twitchy that i sometimes
don't get a reply from you.

 one company i worked for, they hired someone who was intended to
replace me.  i had already written a web-based GUI control (manually
generating a png from a simple set of real-time input data, to display
CPU usage).  took about 800 lines.  this was in a high-security
environment, so code review was critically important.

 my "replacement" was given the task of writing a "better" version of
that code.  in under six weeks we were shocked to find that he'd
written fourteen THOUSAND lines of code, doing the same job: display
CPU usage, graphically in a web page.  he'd brought in an entire web
framework (without consulting anyone), he'd written a full
Model-View-Controller compliant application...

...and yet there wasn't anything that his code ultimately did that was
*actually* functionally different from the incredibly simple 800 lines
i'd done in about one day.

it was not my responsibility, as i was just a contractor (he was an
employee) so he should have been under way *way* better supervision.
and it wasn't his fault either, plus, it was within the "grace
period".

so... i appreciate this is a learning experience for you, and that
you're using it for the important dual / secondary purpose of
exploring python (and doing extremely well, btw, discovering features
of python that's taken me 20 years to become familiar with).

however some of those features are so obscure that, as i have found in
the past, if you use them too liberally (or even in some cases *at
all*), the resultant code is quite literally impossible for anyone
else to understand.

so just... please bear that in mind ok!  even the code here which uses
a common feature of python (introspection: overloading of __getattr__
and __setattr__) can be particularly challenging to understand,
despite the example code being really clear:

https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/add/pipeline_example.py;h=544b745b0a5d7b710b7d9eea38397acab5f4799a;hb=dd44faf5abdf0205aa35c3a774ea60d8467e8e7e



> code for simple pipeline:
>
> m = Module()
> input_stage = InputStage() # user defined
> add_stage = CombStage([('src1', 64), ('src2', 64)],
>     [('data', 64)],
>     lambda pred, succ, m: # can also be member fn that reads from self.prev
> and writes to self.next
>         m.d.comb += succ.data.eq(pred.src1 + pred.src2))

 ok so CombStage takes 3 arguments:

 * input spec
 * output spec
 * a function (which can be a lambda function) which creates the result

 i believe it should be possible to support that by creating a class
that conforms to the ispec/ospec/process/setup API (aka "pattern" in
modern software engineering terminology).

 the one thing i'm not keen on is the explicit assignment to m.d.comb,
and the requirement to set succ.data.  we *know* it's got to go into
succ.data, and we *know* it's got to go into m.d.comb (for this style
of pipeline).  therefore, m is an unnecessary argument to pass in, as
is succ.  which means that the lambda may simply return "pred.src1 +
pred.src2'.

 if you look at the code that i wrote which is based directly on the
code that you  wrote, you'll find that this is what it does.  the
result-stage functions return an expression, and it is the *pipeline*
class's responsibility to carry out the assignment.

 in addition: what about a different type of pipeline class, that
conforms to the same API?  what about BufferedPipeline?  that does
*not* assign to succ.data, it assigns to an intermediary local
variable which is not even a member of succ.

 should we force users to write two near-identical stage classes that
do exactly the same job?  or worse, trap them in an API that, in order
to convert to a different pipeline class, requires that they rewrite
hundreds of lines of code?


> input_stage.succ.connect(add_stage.pred, m)
> reg_stage = RegStage(add_stage.succ.data.layout)
> reg_stage.pred.connect(add_stage.succ, m)
> output_stage = OutputStage() # user defined
> output_stage.pred.connect(reg_stage.succ, m)

 this looks reasonable / natural, as part of what a pipeline is.

 i'd really really like to work towards the simplicity of that
pipeline_example.py (even though it's a non-buffered pipeline), with
buffering being transparently hidden from the user via underlying
infrastructure.

> >  i now have *four* different examples all of which conform to the same
> > API

 five. i just added a class example, and fixed a bug as a result


and, to my eye, i believe are already extremely clear and simple,
> > and i've found that efforts to further simplify or clarify them have
> > disadvantages, and actually restrict the flexibility of the API.
> >
> >  please can you stop for a moment, and take a look at ExampleStage,
> > ExampleAddStage, ExampleAddRecordStage and LTStage, and let me know if
> > there's a way that the API which supports *all* of these examples may
> > be improved?
> >
> I think having the input and output signals be named is a good idea.

there's some definite use-cases for adding names: when interacting
with verilog, it's essential.  we can't have names of verilog modules
changing on us.  and, also, debugging (use of gtkwave) is a lot
clearer if there are meaningful names.

i added names to the (anonymous) signals in one of the examples, and
it worked well:

--- a/src/add/example_buf_pipe.py
+++ b/src/add/example_buf_pipe.py
@@ -322,10 +322,10 @@ class ExampleStage:
     """

     def ispec():
-        return Signal(16)
+        return Signal(16, name="example_input_signal")

     def ospec():
-        return Signal(16)
+        return Signal(16, name="example_output_signal")

it resulted in the attached graph: you can see example_input_signal in
the top left, and example_output_signal in the top far right.  beyond
that: prefixes / hints couuuld be added... i'm not hugely keen on the
idea though, as it's a complication of the API (ispec and ospec taking
a name hint argument, just to change the name of intermediary
variables).

> Also, support creating a pipeline stage from a lambda.

 lambdas are appropriate when there's no additional context.  in this
case, there is additional context: the input format spec and the
output format spec, without which the lambda has no meaning.

 if it was possible to *only* pass in a lambda function, i'd agree immediately.

 however, the very fact that the lambda function has to be passed in
to a class and assigned as a member of that class should tell you what
you need to know, namely that it is far, far clearer to *derive* from
CombStage.

 and, if it's clearer to *derive* from CombStage, then it's just as
easy to set up a "pattern" (i.e. you don't actually need to derive
from CombStage at all, as long as the resultant non-derived class has
the *appearance* of looking *like* a CombStage).

 this progressive chain of logical reasoning is why i came up with the
ospec / ispec / process API (aka "pattern").

> I think we should also have a separation of concerns:
> my combstage can be connected in a row and all of them have 0 clocks of
> latency because succ and pred's ready and valid signals are wired directly
> together.

 it's not that simple: take a look at the graphviz.  you'll see that
both succ o_valid and prev o_ready are both critically dependent on
_data_valid, which is a sync'd signal *not* a combinatorial one, and
_data_valid is itself dependent on next i_ready.

 therefore a chain of stages are not combinatorial at all: they're
clock-synchronised.

 now, the generation of the *results* of each stage are entirely
combinatorial, that *is* true.  so yes it is possible to connect
individual CombStages together to create one massive long
CombStage-out-of-CombStages.

 this is a trick that i did in the IEEE754 FPU code, after
successfully isolating the various stages by their inputs and outputs,
and turning them into combinatorial modules.  once that had been done
i was able to create a new combinatorial module which comprised a
chain of three other combinatorial modules: normalisation, special
cases, alignment and so on.

 so... i'm not sure what you may be referring to.

> my regstage does nothing but be a 1 clock latency register.
> my ReadyIsolateStage does nothing but isolate the ready line by inserting a
> register when ready is low, it has 0 cycles of delay when ready is high.
> each of them has a separate function:
> you don't need to supply a comb operation when you just want a regstage.
> you don't need a cycle of delay when you want a combinatorial function
> (combstage).

 i 've not seen the code so can't say.

> >
> >  bear in mind that both CombPipe and BufferedPipe may be used as
> > mix-ins with all of those four examples, to create *interoperable*
> > pipelines with different characteristics, as both CombPipe and
> > BufferedPipe conform to the *exact* same pipeline / control API.
>
> your combpipe is equivalent to my regstage followed by my combstage.

 the combpipe.  i don't use personal pronouns when referring to code
that's developed by a group.  it creates artificial barriers.

> your bufferedpipe

 the bufferedpipe.

> is equivalent to my regstage followed by my
> ReadyIsolateStage followed by my CombStage
>
> using my stage designs, if you have a pipeline already split into stages
> and you want to make it take less clock cycles, you just have to remove
> some regstage instances. with your design, you have to do some major
> refactoring, because the pipeline registers are built into your stage
> classes.

 funnily enough i've been thinking about exactly that, as i'm not
impressed with quite how much of a dog's dinner mess the glue logic is
in the IEEE754 FPU code.  each combinatorial "stage" module comes with
an associated helper-class that has to manage the setup /
inter-connect between the FPADD module and the "stage" module.

 i *believe* it may be possible to create a MetaStage class that can
be passed a list of API-compliant stage class instances, and, from
their input and output specifications, chain them together
automatically.

 once that's done, the end result is that the MetaStage class instance
would simply be passed in to one single Pipeline instance: end result,
only one clock cycle.

> Note: none of the code I referred to is up on salsa yet, I haven't finished
> updating the tests yet.

 ok will take a look when it's available.

l.