[libre-riscv-dev] TLB Initial Proposal

Luke Kenneth Casson Leighton lkcl at lkcl.net
Mon Jan 21 10:26:00 GMT 2019


On Mon, Jan 21, 2019 at 6:20 AM Daniel Benusovich
<flyingmonkeys1996 at gmail.com> wrote:

> I read over a paper discussing TLBs and believe we could have 2 64-entry
> fully associative TLB caches (128 entries total) using CAMs (Content
> Addressable Memory). One cache would be used as an "active" list and the
> second as an "inactive" list. Linux uses a "Two-List Strategy" (which is
> where I am pulling this from) in evicting cache entries.

 cool.

> All translations when initially called would be placed into the active
> list. Entries in the inactive list would be moved into the active list when
> hit.
>
> If the active table fills up or gets too large, the head entry should be
> popped off and added to the inactive list. If both (active and inactive)
> lists are full then: pop the head entry from the inactive list into the
> ether, pop the head entry from the active list into the inactive table, and
> place the new translation into the active list.
>
> When lulls in requests occur and the inactive list exceeds a given
> threshold, entries should be popped off to ensure that both lists never
> fully fill up.
>
> The benefit of this is that only a head and tail pointer need to be
> maintained for each list. This would use 7 bits per pointer, totalling
> 28 bits for the 4 pointers, which is nice.

 cool!
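
 just to be sure we're thinking of the same thing, here's a quick
behavioural model of the two-list scheme in python (purely
illustrative software: the list size, names and the dict-based "CAM"
are placeholder assumptions, nothing to do with the eventual
hardware):

from collections import OrderedDict

ENTRIES = 64   # per-list size: 64 active + 64 inactive = 128 total

class TwoListTLB:
    def __init__(self, size=ENTRIES):
        self.size = size
        self.active = OrderedDict()     # vpn -> ppn, insertion order = age
        self.inactive = OrderedDict()

    def lookup(self, vpn):
        if vpn in self.active:          # hit in active list: nothing to do
            return self.active[vpn]
        if vpn in self.inactive:        # hit in inactive list: promote it
            ppn = self.inactive.pop(vpn)
            self._insert_active(vpn, ppn)
            return ppn
        return None                     # miss: caller does the table walk

    def refill(self, vpn, ppn):
        # new translations always go into the active list
        self._insert_active(vpn, ppn)

    def _insert_active(self, vpn, ppn):
        if len(self.active) >= self.size:
            # active list full: demote its head (oldest entry)
            old_vpn, old_ppn = self.active.popitem(last=False)
            if len(self.inactive) >= self.size:
                # inactive list also full: its head drops into the ether
                self.inactive.popitem(last=False)
            self.inactive[old_vpn] = old_ppn
        self.active[vpn] = ppn

 in hardware the two OrderedDicts would be the two CAMs, and the
"oldest entry" behaviour is exactly what the four head/tail pointers
you describe would track.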

> Alternatively, a single 128- or 64-entry fully associative TLB is also
> possible using a more standard LRU.

 the less power used, the better.  larger CAMs have a performance
penalty, obviously, however smaller CAMs would result in TLB
thrashing, which is bad.
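
 for comparison, the single fully-associative TLB with a standard LRU
that you mention is even simpler to model (again just illustrative
python for behaviour only; the 64-entry size is an assumption):

from collections import OrderedDict

class LRUTLB:
    def __init__(self, size=64):
        self.size = size
        self.entries = OrderedDict()    # vpn -> ppn, oldest first

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)     # mark most recently used
            return self.entries[vpn]
        return None                           # miss

    def refill(self, vpn, ppn):
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[vpn] = ppn

 bear in mind that tracking true LRU across a 64- or 128-entry CAM is
where a lot of the power and area cost comes in: that's part of the
trade-off.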

> This logic would have to be implemented in software, not hardware, correct?

 the consensus is that hardware is better: the performance hit of a
software TLB is awful.  hundreds of CPU cycles of latency, instead
of... err... 2.

 if you can investigate precisely what the latency hit would be, and
what effect that would have on performance, that would be good.

 if it's say... only... mmm... 0.5% of the total CPU time spent doing
TLB lookups, that's probably tolerable.  if however it is 10% or even
2%, that's not really acceptable.
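
 a rough way to frame that estimate (every number below is a made-up
placeholder, purely to show the arithmetic involved):

mem_refs_per_instr = 0.3    # assumed memory references per instruction
miss_rate          = 0.001  # assumed TLB miss rate per memory reference
hw_walk_cycles     = 20     # assumed hardware page-table-walk penalty
sw_trap_cycles     = 300    # assumed software-refill trap penalty
base_cpi           = 1.0    # assumed cycles per instruction, no misses

for name, penalty in [("hardware walk", hw_walk_cycles),
                      ("software refill", sw_trap_cycles)]:
    overhead_cpi = mem_refs_per_instr * miss_rate * penalty
    fraction = overhead_cpi / (base_cpi + overhead_cpi)
    print(f"{name}: {fraction:.1%} of cpu time in TLB handling")

 with those (made-up) numbers the hardware walk comes out around 0.6%
and the software refill around 8%, which is exactly the kind of
difference worth pinning down with real figures.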


> Most of the design papers I read have the OS perform the logic for cache
> misses and control what goes where, which is a RISC styling. Older designs
> had all the logic controlled in hardware, which is a CISC styling.

 interesting.


> I am not sure if this changes for a mobile application, as everything I
> read was quite general purpose.

 general purpose OS: usually linux-kernel-based, however there's no
reason why microkernels or RTOSes shouldn't be used.


> A few questions appear from this:
>
> 1. What page size we will be supporting?

 4K is the standard.  anything other than this requires too great an
amount of work on pretty much all software, at a very low level.

> 2. What is the maximum physical memory we will be supporting?

 i'd really like to go up to 32GB, however this is a "mobile class"
processor (initially).  connecting more than 4GB to a single DDR3/4
32-bit-wide channel becomes really really difficult: expensive in
terms of the RAM ICs, as well as needing an unusual fly-by topology
(multiple RAM ICs on the *same data bus*).

 there's a standard for RISC-V 64-bit OSes, called a "platform", where
the numbers that i remember are 39-bit and 48-bit.  39-bit is probably
the physical RAM limit, 48-bit the virtual memory limit.
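
 for reference, with 4K pages a 39-bit virtual address (the Sv39
scheme in the RISC-V privileged spec) splits into a 12-bit page offset
plus three 9-bit VPN fields; a quick sketch of the field extraction
(the example address is arbitrary):

def split_sv39(vaddr):
    offset = vaddr & 0xfff          # bits 11:0  - offset within 4K page
    vpn0 = (vaddr >> 12) & 0x1ff    # bits 20:12 - leaf page-table index
    vpn1 = (vaddr >> 21) & 0x1ff    # bits 29:21 - mid-level index
    vpn2 = (vaddr >> 30) & 0x1ff    # bits 38:30 - root page-table index
    return vpn2, vpn1, vpn0, offset

print(split_sv39(0x12_3456_7abc))   # arbitrary illustrative address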

> 3. Will the operating system on the chip

 there won't be an operating system on the chip.  there will be an
absolutely tiny boot ROM, well under 16k in size: that's all.

>  be reserving any part of the
> virtual memory space as kernel memory space?

 there shouldn't be... or, if there is, there should be no
"dependence" or restriction.  it's a general-purpose processor.

> Any feedback and or guidance would be much appreciated!
>
> Hope you are having a good one and possibly had a chuckle reading this,

 :)
