[libre-riscv-dev] cache SRAM organisation

Wed Mar 25 10:54:20 GMT 2020

Libre-SOC developers,

That discussion is mainly at system level, and I don't want to get too deep into it as I don't have the time.
I am providing the SRAM blocks and then it is up to the system guys to see how they use them. In this case you guys (libre-soc + LIP6) are the system guys.

On ASICs, three types of SRAM are commonly provided: a single port RAM, a 2-port RAM and a dual port RAM. Currently only a single port SRAM is foreseen for NLNet, as this is the most common type, the smallest in area per bit and the fastest.
A single port SRAM has one port on which you can do a read or a write each clock cycle. The 2-port one has one read port and one write port, so you can do a read and a write each clock cycle. The dual port one has two ports that can each do a read or a write each clock cycle, so you can do two reads, two writes or a read + write each clock cycle.
Each of them can come in a synchronous or an asynchronous version. A synchronous RAM has a clock input, and the address and data inputs are latched on that clock signal. This means the FFs are integrated in the SRAM, i.e. very close :) . The RAM currently being developed in my NLNet project is a synchronous SRAM, as this is easier from a timing point of view because all the timing can be related to the clock. A synchronous RAM actually functions as an addressable bunch of FFs, and the synthesis and P&R tools know how to handle them.
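
To make the single port, synchronous behaviour concrete, here is a minimal behavioural sketch in Python. It is not the actual SRAM block and not nmigen RTL; the class name, the `clock()` method and the choice that a write cycle leaves the read data unchanged are assumptions made only for illustration.

```python
class SinglePortSRAM:
    """Behavioural model of a synchronous single port SRAM (illustrative only)."""

    def __init__(self, depth, width):
        self.depth = depth
        self.mask = (1 << width) - 1   # data is truncated to 'width' bits
        self.mem = [0] * depth
        self.dout = 0                  # registered read data, updated on a read cycle

    def clock(self, addr, we=False, din=0):
        """One clock edge: the single port does either a read or a write."""
        if not 0 <= addr < self.depth:
            raise IndexError("address out of range")
        if we:
            self.mem[addr] = din & self.mask   # write cycle, dout keeps its old value
        else:
            self.dout = self.mem[addr]         # read cycle
        return self.dout


ram = SinglePortSRAM(depth=16, width=8)
ram.clock(3, we=True, din=0x1AB)   # only the low 8 bits are stored
value = ram.clock(3)               # a read needs its own clock cycle
```

Address, write enable and data are all sampled by the same call, which mirrors the point that all timing is related to the one clock.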

Given this building block you can now make blocks that look to the outside world like blocks with a higher number of ports. You do this by instantiating multiple RAM blocks and making sure that the content is mirrored between all the blocks. This way you can read from the different blocks in parallel. A write still has to go to all the blocks at the same time.

So if you take four single port SRAM blocks you can make a four port SRAM block. Each cycle you can do 1-4 reads or 1 write, but you can't read and write at the same time. With four 2-port RAMs you can do 4 reads and 1 write each clock cycle. With four dual port RAMs you can do 4 reads or 3 reads + 1 write or 2 reads + 2 writes each cycle.
I will provide the single block; combining the blocks has to happen in RTL/HDL. For Libre-SOC this means in nmigen, using Coriolis for placing and connecting the single blocks.
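
The mirroring scheme for the single port case can be sketched behaviourally in Python as below. This is only a model of the idea, not the nmigen code that would actually be needed; the class name `MirroredSRAM` and its `cycle()` interface are hypothetical.

```python
class MirroredSRAM:
    """N read ports built from N mirrored single port banks (behavioural sketch)."""

    def __init__(self, nports, depth):
        # one plain list per bank; each bank stands in for one single port SRAM
        self.banks = [[0] * depth for _ in range(nports)]

    def cycle(self, reads=(), write=None):
        """One clock cycle: either up to nports reads, or one write, never both."""
        if write is not None and reads:
            raise ValueError("single port banks: no read and write in the same cycle")
        if len(reads) > len(self.banks):
            raise ValueError("more reads than banks")
        if write is not None:
            addr, data = write
            for bank in self.banks:    # the write goes to every bank to keep them mirrored
                bank[addr] = data
            return []
        # read i is served by bank i, so all reads happen in parallel
        return [self.banks[i][addr] for i, addr in enumerate(reads)]


ram4 = MirroredSRAM(nports=4, depth=8)
ram4.cycle(write=(2, 11))
ram4.cycle(write=(5, 22))
data = ram4.cycle(reads=[2, 5, 2, 5])   # four reads in one cycle
```

The cost is visible in the model: four banks hold four copies of the same data, so the area per stored bit is four times that of a single block.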

Although the SRAM does an operation each clock cycle, its clock frequency could be different from that of the rest of the logic. If the RAM is fast enough it could run at double the frequency of the core, so a single port RAM could look like a dual port RAM to the rest of the logic, which is running at half the frequency. If the RAM is not fast enough, wait states need to be implemented for each operation. The maximum clock frequency goes down when you increase the size of a RAM block. So on a CPU the L1 cache typically runs at the same clock frequency as the core without any wait states, while higher level caches are bigger but also introduce more wait states for accessing them.
If you are thinking about having different clock frequencies in your design you first have to discuss this with Jean-Paul/LIP6, as doing multi clock designs opens its own can of worms (clock domain crossing problems etc). For the October prototype I feel we need to stick with a single port SRAM block and run the whole chip from the same clock. IMO, on this prototype you should accept any performance implication this has.
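
The double frequency case can be illustrated with a small behavioural Python sketch: one core cycle is modelled as two RAM cycles on the same single port. The class name, the `(addr, we, din)` tuple format and the fixed ordering (operation A in the first RAM cycle, operation B in the second) are assumptions for illustration only.

```python
class DoublePumpedSRAM:
    """Single port RAM clocked at twice the core clock (behavioural sketch)."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.ram_cycles = 0            # counts fast (RAM) clock cycles

    def core_cycle(self, op_a=None, op_b=None):
        """One core cycle = two RAM cycles; each op is (addr, we, din) or None."""
        results = []
        for op in (op_a, op_b):        # op_a uses the first RAM cycle, op_b the second
            self.ram_cycles += 1       # the RAM clock ticks even if the slot is idle
            if op is None:
                results.append(None)
                continue
            addr, we, din = op
            if we:
                self.mem[addr] = din   # the single port does a write in this RAM cycle
                results.append(None)
            else:
                results.append(self.mem[addr])   # or a read
        return results


ram2x = DoublePumpedSRAM(depth=8)
# seen from the core: a write and a read in the same (slow) cycle
out = ram2x.core_cycle((1, True, 42), (1, False, 0))
```

Because the two operations are serialised inside the core cycle, a read in slot B sees a write done in slot A of the same core cycle; a real design would have to define this ordering explicitly.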

greets,
Staf.
Luke Kenneth Casson Leighton wrote on Tue 24-03-2020 at 22:32 [+0000]:
> https://groups.google.com/d/msg/comp.arch/cbGAlcCjiZE/mgMZVINVIAAJ
> Staf can i ask you the favour of reviewing Mitch's comments about cache design?
> in particular the comments about the possibility of using multiported SRAM cells as long as only 1R or 1W is done on any given cell?
> also something about doing the FFs yourself, close to the SRAM cells?
> l.