[libre-riscv-dev] store computation unit

Wed Jun 5 00:51:09 BST 2019

On Wednesday, June 5, 2019, Mitchalsup <mitchalsup at aol.com> wrote:

>
>
>
>
> LD<->LD are independent in order (reading does not harm)
>
>
> Except for IO Domain. I am seriously considering just doing LD - LD deps
> for this reason, and to actually just reuse the FURegs Matrix source code,
> the "registers" being the LDs and STs.
>
> I don't know what your PTE/TLB looks like,
>

Ariane. Standard RISC-V 4 levels. Still being developed.

>
> but basically anything that is not-cacheable falls into the
> category where it "must run in program order". If I were designing this, I
> might be tempted to perform
> a micro-exception, back up to the non-cacheable instruction, and rerun it
> with the new knowledge that
> it and several friends are going to need hard memory ordering. Since we
> have the ability to take faults
> and back up, and since I/O is painfully slow and dangerous to do OoO upon,
> slow and safe is the proper
> mental model.
>

The recommended approach to IO Domains in the Linux world is to have the
PhysAddr in fixed ranges.
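
A minimal sketch of what "fixed ranges" could mean for the ordering
decision, in Python purely for illustration. The range values and the names
IO_RANGES, is_io and must_run_in_order are made up for this example; real
ranges would come from the platform memory map, and a real implementation
would be a comparator in hardware, not software.

```python
# Hypothetical fixed physical-address ranges treated as IO (non-cacheable).
# These values are placeholders, not taken from any real memory map.
IO_RANGES = [
    (0x1000_0000, 0x1FFF_FFFF),
    (0xF000_0000, 0xFFFF_FFFF),
]


def is_io(paddr):
    """Return True if the physical address falls in a fixed IO range."""
    return any(lo <= paddr <= hi for lo, hi in IO_RANGES)


def must_run_in_order(paddr):
    """Non-cacheable (IO) accesses are forced into program order;
    everything else may use the relaxed LD/ST dependency rules."""
    return is_io(paddr)
```

With fixed ranges the check depends only on the physical address, so it
needs no extra per-page attribute lookup beyond the translation itself.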

I cannot recall immediately if this information propagates up to the ISA
when it comes to AMO operations.

I believe that if the order really matters, LR/SC and AMO are supposed to
be used, with FENCE instructions used as hints that inform execution about
the memory order.

Haven't investigated this in enough depth yet.

> Note: the memory SB will be dimensionally 1/3×1/3 to 1/2×1/2 the size of
> the reg-reg SB.
> 1/3×1/3 is the smallest reasonable size, 1/2×1/2 is the largest.
>

That makes sense.

>
>
>
> ST<->LD are dependent so the ST happens before the LD
>                when ST[address] mismatches LD[address] the load becomes
> independent.
>
>
> This is where the parallel between FURegs and LDST is different.
>
> FURegs does not have the cache miss case--it just has pure latency between
> Go_Rd and
> Go_Wt. FUMem has to deal with unknown and unbounded latency.
>

I do not believe that affects the connections from the first MD Matrix to
the second. It shouldn't.

I appreciate it is weird: because the 2 Mem Matrices are in effect
near-identical to each other, it seems ridiculous to have 1 Matrix with
nearly the same information as the other, then OR its row data together and
pass that in as the input to the 2nd.

I get it, though.

>
>
>                OR when the ST is performed the LD is enabled
> LD<->ST are dependent so the LD happens before the ST
>                when LD[address] mismatches ST[address] the store becomes
> independent.
>                OR when the LD is performed the ST is enabled.
>
>
> These ORs are the direct mirror of the big Read_Pending OR gates for
> FURegs.
>
> These ORs are dependence removal times. The top statement tells you to set
> dependencies at issue.
> The second states you relax dependences to lines that "cannot be the same
> as" mine, and the last
> one states that once the MemRef has been performed, you are no longer
> dependent upon it.
>

Yes. And the ADDR clash detection is like an "oh look, we didn't know what
the reg# was but now we do and it's different so drop the dependency" kind
of thing.
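
The three rules (set dependencies at issue, relax on address mismatch,
clear once the MemRef is performed) can be sketched as a small software
model. This is only an illustration under stated assumptions, not the
actual Matrix logic: the names MemRef and MemDepTracker are hypothetical,
full addresses are compared, and ST<->ST ordering is not modelled because
the rules quoted above only cover LD/ST pairs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemRef:
    kind: str                      # "LD" or "ST"
    addr: Optional[int] = None     # unknown until address generation is done
    done: bool = False
    deps: set = field(default_factory=set)  # indices of older refs waited on


class MemDepTracker:
    """Toy model of the LD/ST dependency rules (hypothetical names)."""

    def __init__(self):
        self.refs = []

    def issue(self, kind):
        # Rule 1: at issue, depend on every older unfinished ref of the
        # opposite kind. LD<->LD stays independent.
        ref = MemRef(kind)
        ref.deps = {i for i, old in enumerate(self.refs)
                    if not old.done and old.kind != kind}
        self.refs.append(ref)
        return len(self.refs) - 1

    def set_addr(self, idx, addr):
        # Rule 2: once both addresses of a dependent pair are known and
        # they differ, the dependency is dropped.
        self.refs[idx].addr = addr
        for ref in self.refs:
            for i in list(ref.deps):
                a, b = self.refs[i].addr, ref.addr
                if a is not None and b is not None and a != b:
                    ref.deps.discard(i)

    def perform(self, idx):
        # Rule 3: once a MemRef has been performed, nothing depends on it.
        ref = self.refs[idx]
        assert not ref.deps, "still has dependencies"
        ref.done = True
        for other in self.refs:
            other.deps.discard(idx)

    def ready(self, idx):
        ref = self.refs[idx]
        return ref.addr is not None and not ref.deps and not ref.done
```

For example, a ST followed by two LDs: the LD whose address differs from
the ST becomes ready as soon as both addresses are known, while the LD to
the same address only becomes ready after the ST has been performed.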

> Virtual addresses have the same lower bits as physical addresses, up to 4k
> [bit 12].
>
> So we have to take up to those 4k, because if we start going beyond that
> it would be necessary to incorporate the PA bits into the miss, and that
> means going through the cache, and consequently DELAYING things and
> generally making it massively complicated.
>
> I am hoping that Jacob can confirm that if Vulkan has data structures on
> fixed 1MiB boundaries, we do not need to take that into account.
>
> The important thing to remember is that when the vast majority mismatch at
> address check, you get the
> vast majority of the perf gains while ensuring proper order on those you
> "can't tell".
>

Yehyeh, it's really smart and effective.
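
A rough sketch of the low-bits check, assuming 4k pages so that bits 0-11
are identical in the virtual and physical address. The names
PAGE_OFFSET_MASK and may_alias are made up for this example, and access
width and overlap are ignored (both accesses are assumed to be the same
size and aligned).

```python
PAGE_OFFSET_BITS = 12                      # 4k pages: bits 0..11 untranslated
PAGE_OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1


def may_alias(addr_a, addr_b):
    """Conservative clash check on the untranslated low bits only.

    False means the two addresses definitely differ, so the dependency can
    be dropped. True means "can't tell": the dependency is kept and the
    accesses stay in order.
    """
    return (addr_a & PAGE_OFFSET_MASK) == (addr_b & PAGE_OFFSET_MASK)
```

Note that two addresses exactly 1MiB apart (e.g. 0x100040 and 0x200040)
compare equal under this check, so they land in the "can't tell" group and
stay ordered even though they are different locations.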

>
> Vulkan, Jacob informs me, requires texture data structures (which are
> absolutely massive and very regular) to be accessed in sequential order.
>
> Texture is often accessed in Z-order which is not sequential. But texture
> is all READ and no writes in
> the same scope as the Reads. GPUs have special Texture caches to help with
> the order and with the
> known properties.
>
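
For reference, Z-order (Morton order) interleaves the bits of the x and y
texel coordinates, which is why walking along a row does not give
sequential addresses. The sketch below is the standard 2D bit-interleave
for 16-bit coordinates; the function names are illustrative, and the layout
a real texture unit uses is implementation-specific.

```python
def part1by1(n):
    """Spread the low 16 bits of n so there is a zero bit between each."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(x, y):
    """Z-order index: x bits in even positions, y bits in odd positions."""
    return part1by1(x) | (part1by1(y) << 1)


# Walking along the first row of texels gives non-sequential indices.
row0 = [morton2d(x, 0) for x in range(4)]   # [0, 1, 4, 5]
```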

Nggh we are being hit by the fact that this is a hybrid CPU-VPU-GPU. There
*is* no separate GPU, here. So even the 3D Vulkan API is being executed in
*userspace*, as opposed to the normal case (even with MALI or Vivante
"embedded" GPUs) of being handed over to another (binary incompatible) core.

The textures are therefore in the *userspace* memory.

We did agree when we started the project that additional features and
instructions would be added as necessary.

We have already seen the light on creation of Z buffer instructions and on
creating a separate pixel tile memory area.

I will talk to Jacob about a Texture Mem Area.

> Try this
> https://www.youtube.com/watch?v=qw03mtr6uOg
>
>
Funniest film ever, they just don't make em like that any more.

L.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68