[Libre-soc-bugs] [Bug 199] Layout using coriolis2 main core, 180nm

Thu Jul 30 17:37:11 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=199

--- Comment #47 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
On Thu, Jul 30, 2020 at 4:51 PM bugzilla-daemon--- via libre-soc-bugs
<libre-soc-bugs at lists.libre-riscv.org> wrote:
>
> https://bugs.libre-soc.org/show_bug.cgi?id=199
>
> --- Comment #46 from Jean-Paul.Chaput at lip6.fr ---
> (In reply to Luke Kenneth Casson Leighton from comment #45)
> > http://www.aholme.co.uk/6502/Main.htm
> >
> > apparently there is an algorithm called "SubGemini" which solves the
> > recursive netlist walking issue: finding sub-circuit instances in a
> > larger circuit.
> >
> > 1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini:
> > Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In
> > Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.
>
>   I will look into it if needs be. Thanks for the tips.

i mention it because it is likely an actually *designed* and properly
researched version of that recursive netlist-cell-netlist-cell algorithm
i described and implemented.
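
to make the problem concrete, here is a brute-force sketch of "find instances of a small sub-circuit in a larger netlist".  this is *not* SubGemini (which is a much smarter two-phase algorithm) and it is not the code i implemented either: it is a plain backtracking matcher.  the netlist representation (instance name -> (cell type, {pin: net})) and all cell, pin and net names are made up for illustration.  it also does not check that internal nets of the pattern have no extra connections, and does not handle swappable pins.

```python
def find_subcircuits(pattern, circuit):
    """yield dicts mapping pattern instance names to circuit instance names.

    both netlists are dicts: name -> (cell_type, {pin_name: net_name}).
    """
    p_names = list(pattern)

    def extend(idx, inst_map, net_map):
        if idx == len(p_names):
            yield dict(inst_map)
            return
        p_name = p_names[idx]
        p_type, p_pins = pattern[p_name]
        for c_name, (c_type, c_pins) in circuit.items():
            # same cell type, same pin names, and not already used
            if c_type != p_type or c_name in inst_map.values():
                continue
            if set(c_pins) != set(p_pins):
                continue
            trial = dict(net_map)
            ok = True
            for pin, p_net in p_pins.items():
                c_net = c_pins[pin]
                if p_net in trial:
                    # pattern net already bound: must hit the same net
                    if trial[p_net] != c_net:
                        ok = False
                        break
                elif c_net in trial.values():
                    # two different pattern nets may not share one net
                    ok = False
                    break
                else:
                    trial[p_net] = c_net
            if ok:
                inst_map[p_name] = c_name
                yield from extend(idx + 1, inst_map, trial)
                del inst_map[p_name]

    yield from extend(0, {}, {})


# hypothetical pattern: two inverters in series
pattern = {
    "i0": ("inv", {"a": "in", "y": "mid"}),
    "i1": ("inv", {"a": "mid", "y": "out"}),
}
# hypothetical larger circuit containing one such pair
circuit = {
    "u1": ("inv", {"a": "n1", "y": "n2"}),
    "u2": ("inv", {"a": "n2", "y": "n3"}),
    "u3": ("nand2", {"a": "n3", "b": "n1", "y": "n4"}),
}
matches = list(find_subcircuits(pattern, circuit))
```

this is exponential in the worst case, which is exactly why a properly researched algorithm is worth looking at for real netlists.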

>   I'm almost done for the P&R of the sub-blocks of test_issuer, and now
>   it would be very helpful if you could provide me with a rough floorplan
>   of the blocks:
>
>   * fus (maybe an ordering of the FUs, but not mandatory).

all in a line, left-right, all the same "height".

the ordering will need to be worked out, based on how close they are to their
respective register files.  at some point this could be determined by a
1-dimensional algorithm which optimises them; however, right now that's not a
high priority.  once you have committed something i can take a look and
experiment.
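
a minimal sketch of what such a 1-dimensional optimisation could look like, assuming every FU has the same width and that the cost is simply the horizontal distance from each FU to the regfiles it talks to.  the regfile x-positions and the FU-to-regfile mapping below are invented for the example, not taken from the real design, and trying every permutation is only viable for a handful of FUs.

```python
from itertools import permutations


def order_cost(order, regfile_x, fu_regfiles, fu_width=1.0):
    """sum of horizontal distances from each FU centre to its regfiles."""
    cost = 0.0
    for slot, fu in enumerate(order):
        fu_x = (slot + 0.5) * fu_width  # centre of this slot in the row
        for regfile in fu_regfiles[fu]:
            cost += abs(fu_x - regfile_x[regfile])
    return cost


def best_order(fus, regfile_x, fu_regfiles, fu_width=1.0):
    """try every left-to-right ordering and keep the cheapest one."""
    return min(
        permutations(fus),
        key=lambda order: order_cost(order, regfile_x, fu_regfiles, fu_width),
    )


# hypothetical data: x-centres of three regfiles, and which regfile
# each FU connects to (the real connectivity is different and richer)
regfile_x = {"int": 0.5, "fast": 1.5, "cr": 2.5}
fu_regfiles = {"alu0": ["cr"], "logical0": ["fast"], "ldst0": ["int"]}

order = best_order(["alu0", "logical0", "ldst0"], regfile_x, fu_regfiles)
```

with real bus widths as weights on each distance, the same cost function would favour putting the widest buses on the shortest routes.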

each of the FUs should definitely be flattened though: alu0, logical0, etc.

* all registers (srcN_i, destN_i, *_ok) should be on SOUTH.
* oper_i_* should be on WEST (or EAST, your choice)

one addition: for ldst0 the port interface (pi) to data bus should be on NORTH.

>   * int
>   * fast

each of these flattened (and spr, and xer, and cr as well, they are all
regfiles) - i expect all of their inputs and outputs to be on the NORTH side.

>   * pdecode2

flattened.  raw_opcode_in would be on one side (from imem): LOTS of signals go
out, and this i know is a problem that needs to be solved - but iteratively.

>   * l0

again flattened: Port Interface (pi) to go on SOUTH (so that LDST can attach to
it) and the Wishbone D-Bus on NORTH which will go out of the whole block.

>   Other blocks at core level, like the priority pickers are too small
>   to be taken into account as "blocks to place separately".
>
>   And maybe some hint about the big busses...

a clear space in between the FUs and L0 (top half), and the regfiles below them
(bottom half).  priority pickers _should_ end up placed arbitrarily in that
same middle space.

pdecode should probably be right in the middle, either at the top or pretty
much dead centre, and the i-bus come in at the top middle as well (aka imem).

l0 should definitely be at the top, somewhere along the top edge, with the
d-bus coming into it, and its SOUTH port connected directly to the NORTH of the
ldst0.
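
the pin-side requests above can be written down as plain data, which makes it easy to check that ports meant to connect actually face each other.  this is only a sketch: the port group names ("regs", "oper_i", "io", "dbus") are shorthand invented here, and nothing below is coriolis2 API.

```python
OPPOSITE = {"NORTH": "SOUTH", "SOUTH": "NORTH", "EAST": "WEST", "WEST": "EAST"}

# side of each block on which a group of pins should be placed
PIN_SIDES = {
    "alu0": {"regs": "SOUTH", "oper_i": "WEST"},
    "logical0": {"regs": "SOUTH", "oper_i": "WEST"},
    "ldst0": {"regs": "SOUTH", "oper_i": "WEST", "pi": "NORTH"},
    "int": {"io": "NORTH"},
    "fast": {"io": "NORTH"},
    "l0": {"pi": "SOUTH", "dbus": "NORTH"},
}


def faces(block_a, port_a, block_b, port_b, sides=PIN_SIDES):
    """true if the two port groups sit on opposite, facing sides."""
    return OPPOSITE[sides[block_a][port_a]] == sides[block_b][port_b]


# l0 (top edge) sits above ldst0: its SOUTH pi faces ldst0's NORTH pi
l0_to_ldst = faces("l0", "pi", "ldst0", "pi")
# FUs (top half) sit above the regfiles (bottom half)
alu_to_int = faces("alu0", "regs", "int", "io")
```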

>   I'm finishing this because I'm stubborn, but it is already clear at 99%
>   that it will give results *much* worse than the "flat" approach:

it does however show clearly the places where routing does not "work"?

>   As we place each block independently, we create huge contention points
>   at the border of most blocks due to the amount of large buses. Then we
>   have to route those buses *between* the blocks, forcing us to push
>   them farther apart, 

yes.  192 wires in some cases.  i have a plan to reduce that to only 32 but it
requires quite a bit of work: each FU will have its *own* decoder and receive
*only* the 32-bit instruction.

> not even talking about the capacitance/drive problem.

hmmm.

>   Moreover, the box of a block can't stray too far from square if we
>   want the placer to work (that is, an AR between 0.5 and 2.0).

ah.  this i was expecting - idealistically - to "work" i.e. not be a problem. 
the "long" ones (alu0 for example, or spr0), i expected it to be possible to
auto-Place them efficiently even if they were long-ratio rectangles.

if this becomes a problem then potentially we can look at merging some of them
together, if they have similar enough register profiles.
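
a quick sketch of the aspect-ratio constraint and of why merging could help.  the 0.5 to 2.0 range is the one quoted above; the block dimensions are invented, and "stacking" two blocks is just one simple way of modelling a merge.  because the range is symmetric, it does not matter whether AR is taken as width/height or height/width.

```python
def aspect_ratio(width, height):
    return width / height


def placer_friendly(width, height, lo=0.5, hi=2.0):
    """true if the block's aspect ratio is within the placer's range."""
    return lo <= aspect_ratio(width, height) <= hi


def stack(block_a, block_b):
    """merge two (width, height) blocks by putting one on top of the other."""
    return (max(block_a[0], block_b[0]), block_a[1] + block_b[1])


# hypothetical long, thin block: AR = 4.0, outside the range
long_block = (400, 100)
too_long = not placer_friendly(*long_block)

# stacking two of them gives 400 x 200: AR = 2.0, just inside the range
merged = stack(long_block, long_block)
merged_ok = placer_friendly(*merged)
```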

> There are
>   exceptions, but that's the general idea. It would be a problem for the
>   clock tree as its depth may vary between blocks of different sizes.
>   And lastly, to reduce the size of the channels, we would need a careful
>   analysis of where to place the buses (and "combing" the bits to avoid
>   "flipping" a whole bus), which is a lengthy task.
>    So, if we compare a "flat" block with maybe up to 20% of margin space
>   and the sum of blocks at 5% to 20% of free space plus channels, the
>   winner is clear. Staf wins again.

:)

it is more to be able to point, clearly, "here is the regfile, here is the
logical pipeline" etc.

but....

when we add the GPU version of the DIV/RSQRT/SQRT, and add the *MULTIPLE*
IEEE754 FPUs, this will make the layout ***TEN*** times larger than it
currently is.

at that point any bus space inefficiencies will be absolutely dwarfed by the
size of *TWO* partitioned FP64 multiplier blocks and so on.

*one* of the 64-bit DIV/RSQRT/SQRT pipelines takes the size up from 75,000 to
*200,000* cells, all on its own!  when you were on holiday i experimented and
i managed to get it down to "only" 130,000 cells.

then, when we go to multi-issue it becomes even *more* interesting.  remember
for the GPU version (single-core) we are expecting a size of around 300,000 to
400,000 cells.

at that point any hope of iterative development in a reasonable timeframe is
out the window.  hence this exploration.
