[libre-riscv-dev] multiplier 8x8 products

Fri Aug 23 22:25:44 BST 2019

On Fri, Aug 23, 2019, 14:12 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> Hi Jacob,
>
> The partitioned multiplier hits every cycle with a massive 64 8x8
> multiplies. Can you think of a way to reduce that?
>
One thing that will reduce that is that multiplication is commutative, which
brings it down to 32 8x8 multiplies. I would be surprised if yosys doesn't
already share the multipliers after running the SAT sharing pass.
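
For reference, the 64 products in the quoted question come from splitting
each 64-bit operand into eight bytes and multiplying every byte of one by
every byte of the other. Below is a minimal software sketch of that
decomposition, not the actual hardware design; the function name
`mul64_from_8x8` and the returned product count are illustrative only.

```python
def mul64_from_8x8(a, b):
    """Multiply two unsigned 64-bit values using only 8x8 -> 16-bit products.

    Hypothetical helper for illustration; returns (product, number_of_8x8_products).
    """
    # Split each operand into eight bytes, least significant first.
    a_bytes = [(a >> (8 * i)) & 0xFF for i in range(8)]
    b_bytes = [(b >> (8 * j)) & 0xFF for j in range(8)]

    total = 0
    count = 0
    for i in range(8):
        for j in range(8):
            # Each 8x8 partial product is weighted by 2**(8 * (i + j)).
            total += (a_bytes[i] * b_bytes[j]) << (8 * (i + j))
            count += 1
    return total, count
```

In hardware the shifted partial products would feed an adder tree rather
than a sequential sum; the loop only shows which products exist and how
they are weighted.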

>
> Also, I have been looking at the Dadda Tree algorithm, it is amazingly
> elegant.
>
> However, adapting it to be early out capable has me slightly puzzled.
>
> Here is what I am thinking: having a suite of 8x8 multiplies (as straight
> DSP blocks) that would go directly out (early), but for 16 bit values those
> 8x8 products would go into a Dadda tree that produced 16 bit outputs, again
> early out.
>
> Then those would *again* be added to yet more of the 8x8 products, another
> suite of Dadda Trees, this time to create a 32 bit mul.
>
> Finally the 64 bit phase.
>
> It would not be as efficient as a dedicated Dadda 64 bit Mul, because at
> the 16, 32 and 64 phases a full 128 bit adder is needed.
>
> A straight 64x64 Dadda would only need the one full 128 bit adder.
>
> Thoughts?
>

I think you may be too fixated on early-out: I would guess that the initial
8x8 multiplies take up around half of the multiplier delay and the adders
afterwards take the other half. For all but 8x8 multiplies, I think we'll
end up taking 2 clock cycles and the 8x8 multiplies might fit in 1 clock
cycle. For all the other cases, early-out would vastly increase gate count
without being able to output the results much earlier (less than a clock
cycle). Also, each additional early-out needs a lot of extra signalling to
route the output and control signals.

Even for 8x8 multiplies, trailing additions to handle signed/unsigned are
required.
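
As a rough illustration of those trailing additions, here is one common way
to get a signed 8x8 product from an unsigned multiplier: subtract a shifted
copy of the other operand for each operand whose sign bit is set. This is a
sketch under that assumption, not necessarily what the multiplier will end
up doing; `mul8_signed_via_unsigned` and `to_signed` are hypothetical names.

```python
def to_signed(value, bits):
    """Interpret a `bits`-wide bit pattern as a two's complement integer."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def mul8_signed_via_unsigned(a, b):
    """Signed 8x8 multiply built from an unsigned 8x8 product.

    `a` and `b` are 8-bit two's complement bit patterns (0..255).
    Returns the 16-bit two's complement bit pattern of the signed product.
    """
    product = a * b          # the unsigned 8x8 multiply
    if a & 0x80:             # a is negative: correct by -(b << 8)
        product -= b << 8
    if b & 0x80:             # b is negative: correct by -(a << 8)
        product -= a << 8
    return product & 0xFFFF  # keep the low 16 bits
```

For example, the bit patterns 0xFF and 0x02 (that is, -1 and 2) give 0xFFFE,
which is -2 as a 16-bit value. The corrections are the extra terms that
still have to be summed after the 8x8 product itself.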

Jacob Lifshay