[libre-riscv-dev] [Bug 60] N-stage 64-bit multiplier pipeline needed (signed/unsigned)

Sat May 18 03:35:50 BST 2019

http://bugs.libre-riscv.org/show_bug.cgi?id=60

--- Comment #14 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Jacob Lifshay from comment #7)
> > I don't have any tests for having pipeline registers in the multiplier yet.
> 
>  ok, so this is for multi-stage multiply?
it has an input that allows you to specify which levels of the carry-save adder
tree have registers and which levels are wired straight through.
> 
> > I also didn't yet add the adder input for FMA, that should be relatively
> > trivial.
> 
>  yes.
> 
> > I also need to add additional documentation.
> > 
> > The multiplier supports every combination of aligned 8/16/32/64 that fits in
> > 64-bits. For each part, it supports all 4 RISC-V multiply ops: mul, mulh,
> > mulhsu, and mulhu.
> 
>  excellent.  it likely qualifies as "completed milestone".  we can move
>  more extensive unit tests to a separate milestone.
> 
> 
> > Review and comments welcome.
> 
>  the formatting is a little... weird.
> 
>  it's generally good practice to use vertical alignment to make it easier
>  for the reader to spot errors and differences vertically:
> 
>         lanes = [SIMDMulLane(True,
>                              True,
>                              8,
>                              True),
>                  SIMDMulLane(False,
>                              False,
>                              8,
>                              True),
>                  SIMDMulLane(True,
>                              True,
>                              16,
>                              False),
>                  SIMDMulLane(True,
>                              False,
>                              32,
>                              True)]
> 
>  is almost impossible to see what the differences are between the 4 cases.
> 
>         lanes = [SIMDMulLane(True, True, 8, True),
>                  SIMDMulLane(False, False, 8, True),
>                  SIMDMulLane(True, True, 16, False),
>                  SIMDMulLane(True, False, 32, True)]
> 
>  is trivial to see what the differences are
> 
>         lanes = [SIMDMulLane(True,  True,  8,  True),
>                  SIMDMulLane(False, False, 8,  True),
>                  SIMDMulLane(True,  True,  16, False),
>                  SIMDMulLane(True,  False, 32, True)]
> 
>  is extremely clear what the differences between the four cases are.
yeah, I'll fix that.
> 
> 
> ----
> 
>             def async_process() -> AsyncProcessGenerator:
>                 for a_signed in False, True:
> 
>  is so heavily indented (due to it being a nested function) that again,
>  readability is compromised.
> 
>                 yield from self.subtest_mul8_16_32_64([SIMDMulLane(False,
>                                                                    False,
>                                                                    32,
>                                                                    True),
> 
>  moving it to a global function would reduce the indentation by a whopping
>  3 levels, allowing the code size to be hugely reduced in the process
unfortunately it can't be a global function since it uses self from the
surrounding function but it can't take any arguments
> 
> 
> ----
> 
>  all of these:
> 
>         self._intermediate_output = Signal(128)
>         self._part_8 = [Signal(name=f"_part_8_{i}") for i in range(8)]
...
>         self._neg_lsb_b_term_64 = Signal(128)
> 
>  can be moved *into* the elaborate function, the self. removed *and* the
>  underscore.
> 
>  they're entirely local to the function named elaborate: therefore there
>  is absolutely no need to expose them (at all) to the public API.
They're exposed so the test framework can access them, otherwise they are
private.
> 
>  that they're in the __init__ means that it is almost impossible to
>  work out what the inputs and outputs of the function are.
> 
>  unless there is an intention to derive from Mul8_16_32_64 at some point
>  in the future?
> 
>  commenting and grouping the inputs and outputs is pretty essential.
> 
>         # are these inputs, are they outputs? or intermediaries?
>         # it's impossible to tell
>         self.part_pts = PartitionPoints()
>         for i in range(8, 64, 8):
>             self.part_pts[i] = Signal(name=f"part_pts_{i}")
>         self.part_op = [Signal(2, name=f"part_op_{i}") for i in range(8)]
> 
>         # input operand
>         self.a = Signal(64)
>         self.b = Signal(64)
> 
>         # output operand
>         self.output = Signal(64)
all of part_pts, part_op, a, and b are inputs, output is an output.
> 
> ----
>  
> 
>  add_term is small enough (in this case) to keep as a locally-scoped
> function.
>  the use of typing however is still interfering with readability, by forcing
>  the unnecessary single-line-per-function-argument that most python
> developers
>  are used to.
> 
>  
> -----
> 
>     def __init__(self, partition_points: PartitionPointsIn = {}):
>         super().__init__()
> 
>  no!  it is *absolutely essential* never to declare a dictionary as an
>  argument to a function!
> 
>  this is the pattern for creating a *singleton*!
> 
I get that, however, partition_points is not modified or stored in an
attribute, so having a singleton default argument doesn't hurt and actually
might save memory, since an extra dict doesn't need to be allocated each time.

> ----
> 
>         for not_a_term, \
>             neg_lsb_a_term, \
>             not_b_term, \
>             neg_lsb_b_term, \
>             parts in [
>                 (self._not_a_term_8, 
>                  self._neg_lsb_a_term_8,
>                  self._not_b_term_8,
>                  self._neg_lsb_b_term_8,
>                  self._part_8),
>                 (self._not_a_term_16,
>                  self._neg_lsb_a_term_16,
>                  self._not_b_term_16,
>                  self._neg_lsb_b_term_16,
>                  self._part_16),
>                 (self._not_a_term_32,
>                  self._neg_lsb_a_term_32,
>                  self._not_b_term_32,
>                  self._neg_lsb_b_term_32,
>                  self._part_32),
>                 (self._not_a_term_64,
>                  self._neg_lsb_a_term_64,
>                  self._not_b_term_64,
>                  self._neg_lsb_b_term_64,
>                  [self._part_64]),
>                 ]:
> 
> 
>  yuck :)
I should probably change them to arrays or dicts, so there would be a single
self._not_a_term initialized to {8: Signal(...), 16: Signal(...), ...}.
similarly for _neg_lsb_a_term, _not_b_term, _neg_lsb_b_term, and _part (which I
might name parts)
> 
>  and unfortunately, the names of those variables are too long (even when
>  self._ is removed) to include on one line.
> 
>  reducing them to "notaterm_8"  or removing the word "term" - not_a_8,
>  neg_lsb_a_32, etc. would allow that to happen.
> 
>  this would improve readability greatly and reduce code length (again,
>  increasing readability)
> 
> 
> ---- (continuing from above)
> 
>  the fact that they are "terms" may be commented above the block
>  (which has a comment missing explaining what the for-loop is for)
> 
>         # something about this for-loop, which is impossible to
>         # determine from the contents of the loop itself as it's
>         # too code-dense, which deliberately includes the word "term"
>         # to make it clear that the 4 things are "terms"
> 
>         for not_a, neg_lsb_a, not_b, neg_lsb_b, parts in [
>            (not_a_8, neg_lsb_a_8, not_b_8, neg_lsb_b_8, part_8),
>            ....
>            ....
> 
> ----
> 
>  generally, there are comments missing which explain each code block
>  and allow the reader to understand what is going on.  in particular,
>  any code-heavily-dense block it is essential to have an introductory
>  explanation as to what that block is for.
> 
yeah, I was just writing code to get it to pass the tests, rather than writing
lots of comments. I'll be fixing that.
> 
> ----
> 
>         groups = self.full_adder_groups()
>         if len(groups) == 0:
>             if len(self.inputs) == 0:
>                 m.d.comb += self.output.eq(0)
>             elif len(self.inputs) == 1:
>                 m.d.comb += self.output.eq(self._resized_inputs[0])
>             else:
>                 assert len(self.inputs) == 2
>                 adder = PartitionedAdder(len(self.output),
>                                          self._reg_partition_points)
>                 m.submodules.final_adder = adder
>                 m.d.comb += adder.a.eq(self._resized_inputs[0])
>                 m.d.comb += adder.b.eq(self._resized_inputs[1])
>                 m.d.comb += self.output.eq(adder.output)
>             return m
> 
>  a comment is needed to divide this block from the code above it.
>  also, why does the module return early if len(groups) == 0?
it returns early because this is the base case in a recursive module
instantiation. if it didn't, there would be a stack overflow at elaborate-time.

basically, each invocation converts the input terms to 2 * (len(inputs) // 3) +
len(inputs) % 3 terms for the next recursive call using len(inputs) // 3
carry-save adders, where len(groups) is len(inputs) // 3.

Once len(inputs) < 3, then it executes the base case, which adds the last 2
inputs together using a traditional adder.

I'll be adding that or similar to a comment.

-- 
You are receiving this mail because:
You are on the CC list for the bug.