[libre-riscv-dev] [Bug 178] first coriolis2 tutorial, workflow and "test project" page

Wed Feb 26 12:30:08 GMT 2020

http://bugs.libre-riscv.org/show_bug.cgi?id=178

--- Comment #129 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Staf Verhaegen from comment #127)

> Existing algorithms will also not be able to cope with it either (proprietary
> placers allow multi-row macros for, for example, 4-bit registers).

interesting.  so a 64 bit register latch would be done as a batch of 16 4-bit
Standard Library Cells because clearly those go together.
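to make the grouping concrete, here is a minimal sketch of how a 64 bit register would split into 4-bit cells.  the function name and the (low, high) bit-range representation are purely illustrative, not anything from coriolis2 or a real cell library.

```python
# illustrative only: split a register into fixed-width multi-bit cells
REG_WIDTH = 64   # register width discussed above
CELL_WIDTH = 4   # width of the 4-bit register macro mentioned by Staf


def cell_bit_ranges(reg_width, cell_width):
    """Return the (low_bit, high_bit) range covered by each cell."""
    if reg_width % cell_width:
        raise ValueError("register width must be a multiple of cell width")
    return [(lo, lo + cell_width - 1)
            for lo in range(0, reg_width, cell_width)]


ranges = cell_bit_ranges(REG_WIDTH, CELL_WIDTH)
print(len(ranges), ranges[0], ranges[-1])
```

each tuple is one cell instance, so the placer would be handed 16 multi-bit items rather than 64 single-bit latches.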

> 
> And it is not only optimization in placing that you block but also
> optimization in synthesis. If you have for example an inverter on the output
> of such a low level block that is connected to the input of another block
> with an inverter on it these two inverters will be removed during synthesis
> after flattening the design. 

understood.

we may just have to eat inverters, then, in some high level cases.
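a toy sketch of the effect Staf describes, assuming a signal path is just a list of gate names in series.  this is not how a real synthesis tool represents a netlist, and the block contents are made up for illustration.

```python
# toy model only: a "path" is a list of gate names in series.
# two inverters back to back are a logical no-op.

def cancel_inverter_pairs(path):
    """Remove adjacent INV-INV pairs from a series path."""
    out = []
    for gate in path:
        if gate == "INV" and out and out[-1] == "INV":
            out.pop()            # drop the earlier inverter as well
        else:
            out.append(gate)
    return out


# hypothetical blocks: A ends with an inverter, B starts with one
block_a = ["AND", "INV"]
block_b = ["INV", "OR"]

# hierarchical: each block is optimised alone, the boundary is opaque
hierarchical = cancel_inverter_pairs(block_a) + cancel_inverter_pairs(block_b)

# flattened: the boundary is gone, so the pair is visible and removed
flattened = cancel_inverter_pairs(block_a + block_b)

print(len(hierarchical), len(flattened))
```

in this toy case the hierarchical path keeps both inverters and the flattened one loses them, which is the cost we would be accepting at the boundaries of manually placed blocks.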

in the crossover (we were writing at the same time :) ) i explained that it is
unlikely that we will go all the way manual to the leaf nodes.

so for each FPU pipeline stage (5000 gates maybe) we will do a block.

these blocks we *know* in advance, they are *only* connected by register
latches.

no inverters or other opportunities for synthesis optimisation.

not even buffers needed [actually not true because there are global
"cancellation" lines needed, which go to every stage in the pipeline, saying
"please discard result with ID 0b01101, right now"]

> This can be generalized in that you block
> synthesis optimization in each path that goes over the block boundary if you
> don't flatten.

500,000 gates, flattened, it's just not going to work.  you can check for
yourself by increasing the chip size block in, say, ao68000, to 10,000 x 10,000

or, in one of the bench tests that has an ioring.py, change the ARM chip size
to 20000 x 20000, for example.

the completion time will jump from 5 minutes to about... 2 hours or more, with
each placement of a Standard Cell taking *minutes*

fortunately there are known clear boundaries, and the lower levels we can
flatten, the top levels then do not matter so much.

> I can agree that if you make a multi-core chip it may make sense to do the
> P&R on one core and manually place the different instances of the cores in
> the floorplan.

we need 4 of the IEEE754 FPMUL.

that is around 40,000 gates *for just one FPMUL*!

it is a monstrous cascade Wallace Tree (we actually need to replace it with the
Dadda algorithm).

likewise, FPDIV/SQRT/RSQRT is an 8 stage pipeline, and we need 4 of those, they
are all identical.
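a rough back-of-envelope tally of what those repeats add up to.  the FPMUL numbers are the ones quoted in this comment.  the FPDIV total is an assumption: it reuses the "5000 gates maybe" per-pipeline-stage guess from earlier in this comment, which may not hold for FPDIV.

```python
# figures quoted in this comment
FPMUL_GATES = 40_000     # "around 40,000 gates" for one FPMUL
FPMUL_COUNT = 4
FPDIV_STAGES = 8         # FPDIV/SQRT/RSQRT is an 8 stage pipeline
FPDIV_COUNT = 4

# ASSUMPTION: the rough per-stage guess also applies to FPDIV stages
GATES_PER_STAGE_GUESS = 5_000

fpmul_total = FPMUL_GATES * FPMUL_COUNT
fpdiv_one = FPDIV_STAGES * GATES_PER_STAGE_GUESS
fpdiv_total = fpdiv_one * FPDIV_COUNT

# flattened: every instance is placed gate by gate
flat_gates = fpmul_total + fpdiv_total

# repeated blocks: each unique unit is laid out once, then instantiated
unique_gates = FPMUL_GATES + fpdiv_one

print(flat_gates, unique_gates)
```

under those assumptions the two unit types alone are 320,000 gates flattened, against 80,000 gates of unique layout if each unit is done once and repeated.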

the LD and ST units, 4 of those (possibly 8).

for VPU processing we also need bit manipulation, and we may also have to add
multiple DCT blocks as part of the ALU.

this is a *massive* chip with a lot of regular blocks, to meet the expected
performance levels of 3D and Video processing.

> But the use case I am focusing on is people that develop their design using
> HDL like nmigen or SpinalHDL on a FPGA and then order an ASIC for that. For
> them the ASIC compilation should be a fully automated process and they
> should not have to take care of floorplanning.

if the entire chip in such designs is even as high as 100,000 gates, like
jeanpaul said, it would take a long time but would still be fine.

this design is a massive regular repetition of ALU and SIMD resources.  these
computation resources far exceed the size of the main processor core and even
the L1 caches.

therefore doing them as repeatable blocks makes sense to me.
