[libre-riscv-dev] [Bug 376] Assess 40/45 nm 2022 target and interfaces

Fri Jun 12 23:55:18 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=376

--- Comment #20 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #12)
> Assuming we're building a higher than 2W version, I think we should double
> the int/fpmul to 8x32-bit per core or maybe even quadruple it to 16x32-bit
> and add more cores to 8 or 16 or more cores.

ok.  so expanding the number of cores is easier (less hassle) than expanding
the number of Function Units (which directly correlates with the size of the
Dependency Matrices).  so there is that in its favour.

the problem is that if you significantly increase the number of cores, SMP
coherency meets a point of diminishing returns unless you start doing very
advanced L1/L2 cache design and interconnect.  so we had better take that
into consideration and budget and plan for it, appropriately.

take OpenPITON for example: it uses a full "L1, L1.5, L2" write-through cache
strategy which, although it scales massively, might not provide good enough
localised performance.

basically if you double to 8x 32-bit per core, that means 4x 64-bit LD/ST
operations per clock cycle, because we would have 64-bit 2xFP32 SIMD units
(dynamic partitioning if we can do it, otherwise literally just 2x FP32
units).

the L0CacheBuffer will take care of amalgamating those into 128-bit cache-line 
requests (cache line width to be evaluated, below).

with 4x 64-bit LD/ST operations, we will have out-of-order requests for at
least one pair of LDs and STs "in flight", simultaneously, therefore we need
to be able to hold *both* in the Reservation Stations, and preferably the
next set of LDs (at least), as well.

that's *twelve* LD/ST Computation Units.

that's *just* the LD/ST Computation Units - it does not include the
FP Computation Units as well.  4x 64-bit SIMD operations being processed,
let us assume those are FMACs, and assume a 4-stage pipeline, we would
need say 8 Reservation Stations because we want 4 in operation and another
4 "in-flight" to be held (so that we do not get a stall after processing
the previous ST).

- 4x LD                        in RS, being processed
-   4x 64-bit SIMD FP32        in RS, waiting
-     4x ST                    in RS, waiting
-       4x LD                  in RS, waiting
-          4x 64-bit SIMD FP32 in RS, waiting
-             4x ST            stall (only 12 LD/ST RS's)

the next clock cycle, after the 1st LDs are complete:

- 4x LD                        done
-   4x 64-bit SIMD FP32        in RS, waiting
-     4x ST                    in RS, waiting
-       4x LD                  in RS, waiting
-          4x 64-bit SIMD FP32 in RS, waiting
-             4x ST            now in RS, waiting (1st 4 LD/ST RS's were free)

so that's 20 Reservation Stations: 12 for LD/ST, 8 for FP.
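
as a rough sanity check of the two listings above, here is a toy model of
the RS budget.  it is only a sketch: the function name `first_stall`, the
fixed group size of 4, and the rule that LDs and STs share one pool of
LD/ST RSes are assumptions made for illustration, not anything taken from
the actual design.

```python
# toy model of the Reservation Station (RS) budget described above.
# assumption: every group issues 4 operations and holds its RSes.
GROUP = 4

def first_stall(groups, ldst_rs, fp_rs):
    """Return the index of the first group that cannot get RSes, or None."""
    free = {"ldst": ldst_rs, "fp": fp_rs}
    for i, kind in enumerate(groups):
        # LD and ST groups share the LD/ST pool; FP groups use the FP pool
        pool = "fp" if kind == "FP" else "ldst"
        if free[pool] < GROUP:
            return i
        free[pool] -= GROUP
    return None

window = ["LD", "FP", "ST", "LD", "FP", "ST"]
print(first_stall(window, ldst_rs=12, fp_rs=8))   # 5: the second ST group stalls
print(first_stall(window, ldst_rs=16, fp_rs=8))   # None: everything fits
```

with 12 LD/ST RSes the sixth group (the second set of STs) is the one
that has to wait, which matches the first listing.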

let's go back to the 12x LD/ST operations.  each would have 2x PortInterfaces:
one for aligned, one for misaligned.  the L0CacheBuffer would therefore have
12 rows, 2 side-by-side odd/even addresses, receiving 24x PortInterfaces in
total.

each PortInterface would be around 160 wires wide (64 data, 64 addr, control):
that's 24 x 160 = 3840 wires into the L0CacheBuffer.

this is one hell of a lot of wires going into one small piece of silicon.
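
the arithmetic, as a back-of-envelope sketch (the constant and function
names are made up for illustration, and 160 wires per PortInterface is
only the approximate figure quoted above):

```python
# back-of-envelope wire count for the inputs to the L0CacheBuffer.
# names are illustrative only, not taken from the Libre-SOC source.
LDST_UNITS = 12        # LD/ST Computation Units
PORTS_PER_UNIT = 2     # one aligned, one misaligned PortInterface
WIRES_PER_PORT = 160   # approx: 64 data + 64 addr + control

def l0_input_wires(units=LDST_UNITS, ports=PORTS_PER_UNIT,
                   wires=WIRES_PER_PORT):
    """Total number of wires entering the L0CacheBuffer."""
    return units * ports * wires

print(l0_input_wires())   # 3840
```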

now let's see if 2x 128-bit L1 Caches are ok.

* 4x 64-bit requests will result in 2x 128-bit cache line requests.
* with 2x 128-bit Caches (one odd, one even), we are *just* ok

this is as long as those requests can be made in a single cycle.  this
does however mean that we need 4x 64-bit Wishbone Buses down to the
L2 Cache.
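
a minimal sketch of the amalgamation step, to show why 4x 64-bit requests
become 2x 128-bit cache line requests.  the function name
`cache_line_requests` is hypothetical, and selecting the odd/even cache by
the lowest bit of the line number (address bit 4) is an assumption.

```python
LINE_BYTES = 16   # 128-bit cache line (width still to be evaluated)

def cache_line_requests(addrs, line_bytes=LINE_BYTES):
    """Group 64-bit (8-byte) request addresses into distinct cache lines.

    Returns {line_number: bank}, where bank 0 is the "even" cache and
    bank 1 the "odd" one (assumed: lowest bit of the line number).
    """
    lines = {}
    for a in addrs:
        line = a // line_bytes
        lines[line] = line & 1
    return lines

reqs = [0x1000, 0x1008, 0x1010, 0x1018]   # 4 contiguous 64-bit accesses
print(cache_line_requests(reqs))          # {256: 0, 257: 1}
```

in this model, contiguous accesses split neatly into one even and one odd
line.  strided accesses such as 0x1000, 0x1020, 0x1040, 0x1060 would all
land in the even bank, so the "just ok" case depends on the access pattern.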

it also means that, because of the pass-through nature of GPU workloads
(data in, process, data out), we might need to sustain those 4x 64-bit
pathways *right* the way through to memory.

actually... no.  this might not be enough.  or, if it is, it's barely
enough.

we need the in-flight requests to be in-flight because they constitute
"advance notice" to the L1 and L2 caches.  with the LDs and STs being
matched pairs, we *might* need just the one more LD set (20 LD/ST
RSes) and 1 more FP set (12 FP RSes) so as to be able to do this:

- 4x LD                              in RS, being processed
-   4x 64-bit SIMD FP32              in RS, waiting
-     4x ST                          in RS, waiting
-       4x LD                        in RS, waiting
-         4x 64-bit SIMD FP32        in RS, waiting
-           4x ST                    in RS, waiting
-             4x LD                  in RS, waiting
-               4x 64-bit SIMD FP32  in RS, waiting
-                 4x ST              stall (only 20 LD/ST RS's)

this would give us *advance notice* of the *next* set of 4x LD/ST,
which would do:

* Effective Address Computation (no problem)
* then the next phase push through to the L0CacheBuffer and
* pass through the requests to the L1 Cache and TLB and
* initiate a L2 Cache and L2 TLB lookup

whilst the request to memory might take a while to complete, at least
it would be outstanding, the reservation of the L1 Cache Line and
the L2 cache line would be made.

and for that to work, i think that 2x 128-bit cache lines isn't going
to cut it: we'd need 4.

so i think the L0CacheBuffer would need to do 2-bit striping:

* bits 4 and 5 == 0b00  Bank 0, L1 Cache number #1
* bits 4 and 5 == 0b01  Bank 1, L1 Cache number #2
* bits 4 and 5 == 0b10  Bank 2, L1 Cache number #3
* bits 4 and 5 == 0b11  Bank 3, L1 Cache number #4
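
as a sketch, the bank selection in the list above amounts to the following
(the function name `l1_bank` is made up for illustration; it assumes
128-bit lines, so address bits 0-3 are the byte offset within a line):

```python
def l1_bank(addr):
    """Select 1 of 4 L1 caches from address bits 4 and 5."""
    return (addr >> 4) & 0b11

# consecutive 128-bit lines rotate round-robin across the 4 banks
for addr in (0x00, 0x10, 0x20, 0x30, 0x40):
    print(hex(addr), "-> Bank", l1_bank(addr))
```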

this is basically monstrous.

for 16x 32-bit FP, i would not recommend going to 3-bit striping (8 L1
caches), i'd recommend expanding 4x L1 caches to 256-bit wide cache lines, 
instead.

we would also need 48 LD/ST Reservation Stations, and 20 FP RSes.

that's starting to get a little scary.  we could conceivably split
the Regfiles into odd-even numbering so that this number could be
halved.

whilst there would still be 48 LD/ST RSes and 20 FP RSes in total,
there would be 2 separate (square) Dependency Matrices @ (24+20) 40 wide.

honestly i would not recommend trying it, not without a much larger
budget.

even 8x FP32 is a little scary.  those 40 RSes have to be joined by
Branch FUs, Condition Register FUs, predication FUs and so on.

which is why i said, it's probably much simpler to double the number
of SMP cores, instead.
