[libre-riscv-dev] cache SRAM organisation

Wed Mar 25 12:33:24 GMT 2020

On Wednesday, March 25, 2020, Staf Verhaegen <staf at fibraservi.eu> wrote:

> Libre-SOC developers,
>
> That discussion is mainly on system level and I don't want to get too deep
> into this as I don't have time for that.
>

implicitly understood, it's why i linked to the (two) relevant discussions
in which Mitch mentioned Cache SRAM design, and summarised.

if you are busy, no need to reply below, it is mostly "informative" for
other readers.

> I am providing the SRAM blocks and then it is up to the system guys to see
> how they use them. In this case you guys (libre-soc + LIP6) are the system
> guys.
>

appreciated and accepted.

> On ASICs commonly three types of SRAM are provided: a single port RAM, a
> 2-port RAM and a dual port RAM. Currently for NLNet only a single port SRAM
> is foreseen as this is the most common, the smallest in area per bit and
> the fastest.
>

yes.  i deduced that from Mitch's description: appreciate the summary.

> A single port SRAM has one port where you can do a read or a write each
> clock cycle. The 2-port one has one read port and one write port so you can
> do a read and write each clock cycle. The dual port one now has two ports
> that each can do a read or write each clock cycle. So you can do two reads,
> two writes or a read+write each clock cycle.
> For each of them you can have a synchronous or an asynchronous version. A
> synchronous RAM has a clock input and the address and data inputs are
> latched on that clock signal. It thus means that the FFs are integrated in
> the SRAM, e.g. thus very close :) . The RAM currently being developed in my
> NLNet project is a synchronous SRAM as this is easier from timing point of
> view because all the timing can be related to the clock. A synchronous RAM
> actually functions as an addressable bunch of FFs and the synthesis and P&R
> tools know how to handle them.
>

ok.  later, we will definitely need an asynchronous version.

this is because it turns out that asynchronous SRAM can act, when used in a
Register File, as if it was a (separate) Register Bypass / Forwarding
Port.  with the Out-of-Order Engine being a huge cyclic feedback loop
between ALUs and RegFile, clock delays are an impediment, and having
completely separate (extra) Regfile Bypass ports dramatically increases the
number of wires and Multiplexers.

(actually it is highly likely that we will need to do the Register File as
FFs, initially, because we need i think a minimum 3R1W, plus the
forwarding.  also we *may* do unary addressing - not binary addressing -
in which case, as i understand it, it is pointless to use SRAM cells when
you already have unary individual row-enables).
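
a minimal sketch of the unary-vs-binary addressing point, in plain Python rather than nmigen.  it assumes "unary" means a one-hot row-enable mask (one bit per register); the names `binary_to_unary` and `UnaryRegfile` are hypothetical, for illustration only.

```python
def binary_to_unary(addr: int, depth: int) -> int:
    """Decode a binary register number into a one-hot row-enable mask."""
    if not 0 <= addr < depth:
        raise ValueError("address out of range")
    return 1 << addr


class UnaryRegfile:
    """Flip-flop style regfile addressed directly by row-enable bits.

    Assumption: one enable bit per row, so no address decoder sits
    in front of the storage.
    """

    def __init__(self, depth: int):
        self.regs = [0] * depth

    def write(self, row_en: int, value: int) -> None:
        # every row whose enable bit is set captures the value
        for i in range(len(self.regs)):
            if (row_en >> i) & 1:
                self.regs[i] = value

    def read(self, row_en: int) -> int:
        # OR together the enabled rows; with a one-hot mask that is one row
        out = 0
        for i, reg in enumerate(self.regs):
            if (row_en >> i) & 1:
                out |= reg
        return out


rf = UnaryRegfile(depth=8)
rf.write(binary_to_unary(3, 8), 42)   # binary 3 -> mask 0b00001000
print(rf.read(0b00001000))
```

if the rest of the design already produces the one-hot mask, the `binary_to_unary` step (the job an SRAM's internal row decoder would otherwise do) is not needed at all.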

> Given this building block you can now make blocks that look to the outside
> world as higher number port blocks. You do this by instantiating multiple
> RAM blocks and make sure that the content is mirrored between all the
> blocks. This way you can read from the different blocks in parallel.
> Writing in the blocks still has to happen to all the blocks at the same
> time.
>

i've heard of this trick being used in FPGAs, usually combining pairs of
2R1W to get 4R1W.
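
a behavioural sketch of the mirroring trick, in plain Python rather than nmigen.  it assumes each block has a fixed number of read ports plus one write port, and that reads in a cycle see the contents from before that cycle's write; `MirroredRAM` is a hypothetical name.

```python
class MirroredRAM:
    """N mirrored copies of a block with `reads_per_block` read ports
    and one write port.  Writes are broadcast to every copy so that
    all copies always hold identical contents."""

    def __init__(self, depth: int, copies: int, reads_per_block: int):
        self.blocks = [[0] * depth for _ in range(copies)]
        self.reads_per_block = reads_per_block

    @property
    def n_read_ports(self) -> int:
        return len(self.blocks) * self.reads_per_block

    def cycle(self, read_addrs, write=None):
        """One clock cycle: up to n_read_ports reads plus at most one write."""
        if len(read_addrs) > self.n_read_ports:
            raise ValueError("not enough read ports")
        # read port i is served by copy i // reads_per_block
        data = [self.blocks[i // self.reads_per_block][addr]
                for i, addr in enumerate(read_addrs)]
        if write is not None:
            addr, value = write
            for block in self.blocks:      # broadcast keeps copies mirrored
                block[addr] = value
        return data


ram = MirroredRAM(depth=32, copies=2, reads_per_block=2)  # two 2R1W -> 4R1W
ram.cycle([], write=(5, 0xAB))
print(ram.n_read_ports, ram.cycle([5, 5, 5, 5]))
```

the cost is area: storage is duplicated once per copy, in exchange for the extra read ports.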

> So if you take four single port SRAM blocks you can make a four port SRAM
> block. Each cycle you can do 1-4 reads or 1 write but you can't read and
> write at the same time. With four 2-port RAMs you can do 4 reads and 1
> write each clock cycle. With four dual port RAMs you can do 4 reads or 3
> reads + 1 write or 2 reads + 2 writes each cycle.
>

ok i always wondered how you'd get 4R2W.

ok so actually we *could* do 3R1W or 4R1W with the single-port SRAM blocks,
for the Regfile.
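
a sketch of what that would look like with single-port blocks, in plain Python rather than nmigen.  going by Staf's description above, mirrored single-port blocks give several reads *or* one write per cycle, not both at once, so the sketch also counts cycles.  `MirroredSinglePort` and `cycles_needed` are hypothetical names, and the cycle count assumes reads and writes are simply issued in separate cycles.

```python
import math


class MirroredSinglePort:
    """Mirrored copies of a single-port block: each copy does one read
    or one write per cycle, and a write must go to every copy."""

    def __init__(self, depth: int, copies: int):
        self.blocks = [[0] * depth for _ in range(copies)]

    def cycle(self, read_addrs=(), write=None):
        if read_addrs and write is not None:
            raise ValueError("single-port blocks: reads OR a write, not both")
        if len(read_addrs) > len(self.blocks):
            raise ValueError("more reads than mirrored copies")
        if write is not None:
            addr, value = write
            for block in self.blocks:      # the write occupies every copy
                block[addr] = value
            return []
        # one read per copy
        return [block[addr] for block, addr in zip(self.blocks, read_addrs)]


def cycles_needed(n_reads: int, n_writes: int, copies: int) -> int:
    """Cycles to service the accesses if reads and writes never share a cycle."""
    return math.ceil(n_reads / copies) + n_writes


rf = MirroredSinglePort(depth=32, copies=3)
rf.cycle(write=(7, 123))
print(rf.cycle(read_addrs=[7, 7, 7]), cycles_needed(3, 1, copies=3))
```

so an operation that needs 3 reads and 1 write would occupy such a regfile for two cycles, rather than one as with a true 3R1W.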

> I will provide the single block, the combining of the block has to happen
> in RTL/HDL. For Libre-SOC this means in nmigen and using Coriolis for
> placement and connecting the single blocks.
>

ok.  one for Jean-Paul to help advise with, when we get to it.

> Although the SRAM does an operation each clock cycle the clock frequency
> could be different from the rest of the logic. If the RAM is fast enough it
> could run at double the frequency of the core so basically a single port
> RAM could look like a dual port RAM to the rest of the logic which is
> running at half the frequency. If the RAM is not fast enough wait  states
> need to be implemented for each operation. The maximum clock frequency will
> go down when you increase the size of a RAM block. So on CPU typically L1
> cache runs at the same clock frequency as the core without any wait states
> and higher level caches are bigger but also introduce more wait states for
> accessing them.
>

hmmm...
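
for other readers, a behavioural sketch of the double-frequency idea Staf describes, in plain Python rather than nmigen.  it assumes the RAM clock is exactly twice the core clock and that "port A" is always served in the first RAM cycle and "port B" in the second; `SinglePortRAM` and `DoublePumped` are hypothetical names.

```python
class SinglePortRAM:
    """One read or one write per RAM clock cycle."""

    def __init__(self, depth: int):
        self.mem = [0] * depth
        self.ram_cycles = 0

    def access(self, addr: int, we: bool = False, wdata: int = 0):
        self.ram_cycles += 1
        if we:
            self.mem[addr] = wdata
            return None
        return self.mem[addr]


class DoublePumped:
    """Presents two ports per core cycle by using two RAM cycles."""

    def __init__(self, ram: SinglePortRAM):
        self.ram = ram

    def core_cycle(self, op_a=None, op_b=None):
        """Each op is (addr, we, wdata) or None; A is served before B."""
        results = []
        for op in (op_a, op_b):
            if op is None:
                self.ram.ram_cycles += 1   # idle slot: RAM clock still ticks
                results.append(None)
            else:
                results.append(self.ram.access(*op))
        return tuple(results)


dp = DoublePumped(SinglePortRAM(depth=16))
# write on port A, read the same address on port B, in one core cycle
print(dp.core_cycle((4, True, 99), (4, False, 0)))
print(dp.ram.ram_cycles)
```

note that the fixed A-then-B ordering means port B sees port A's write from the same core cycle; a real design would have to decide whether that is wanted.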

> If you are thinking about having different clock frequencies in your
> design you have to first discuss this with Jean-Paul/LIP6 as doing multi
> clock designs is opening its own can of worms (cross clock domain problems
> etc).

yes.  no, we're not crossing that chasm in the first chip.  i know nmigen
can do it: we *may* be forced to do it for some peripherals, and i know
nmigen supports clock domains, however the main core? no, we reaaally want
to avoid it.

> For the October prototype I feel we need to stick with use of single port
> SRAM block and run the whole chip from the same clock. IMO, on this
> prototype you should take any performance implication this has.
>

yes.  we have enough to deal with.  the important thing for the 180nm ASIC
is to cover as much ground as is practical, and prove several aspects of
the design without taking on too much.

thanks for clarifying, Staf - we'll adapt accordingly.  will let you get on.

l.