[libre-riscv-dev] power pc

Tue Oct 22 17:08:19 BST 2019

On Tuesday, October 22, 2019, Jacob Lifshay <programmerjake at gmail.com>
wrote:

> On Sat, Oct 19, 2019 at 7:11 AM Luke Kenneth Casson Leighton
> <lkcl at lkcl.net> wrote:
> >
> > I took a look at the ISA (P1146) and we do not need vectors (OP4 or
> > OP60). If tdi is moved to OP56, twi to OP60, and mulli to OP11, the
> > entire 000 row of 8 is clear for use as Compressed and escape-sequences
> > for 48, 64 bit and VBLOCK.
>
> I think we should avoid moving instructions to improve compatibility.
> Will look through the ISA tables later.

The problem with that is that we would need to "redirect" 8 spare major
opcodes through a lookup table.

It would be much cleaner to use ISAMUX/NS.

The NS would activate only in 3D mode (and possibly VPU as well) then drop
back to "standard" Power ISA afterwards.
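
A rough sketch of what that could look like at decode, in C++ rather than
HDL. It assumes the NS state can be modelled as a single boolean, and that
only the 000 row (major opcodes 0 to 7) is reinterpreted while it is
active. The names DecodeClass, major_opcode and classify are hypothetical
and used purely for illustration; they are not taken from any ISAMUX draft.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decode classes, for illustration only.
enum class DecodeClass { StandardPower, EscapeOrCompressed };

// The major (primary) opcode is the top 6 bits of a 32-bit Power
// instruction word.
constexpr std::uint32_t major_opcode(std::uint32_t insn) {
    return insn >> 26;
}

// ns_active is an assumed one-bit model of the ISAMUX/NS state.
// With it clear, every major opcode keeps its standard Power meaning,
// so nothing (tdi, twi, mulli) has to be moved. With it set, the 000
// row is treated as Compressed / escape-sequence space.
constexpr DecodeClass classify(std::uint32_t insn, bool ns_active) {
    // Majors 0..7: the top three bits of the 6-bit opcode are zero.
    const bool in_row_000 = (major_opcode(insn) >> 3) == 0;
    return (ns_active && in_row_000) ? DecodeClass::EscapeOrCompressed
                                     : DecodeClass::StandardPower;
}
```

For example, a word with major opcode 7 (mulli in the standard ISA)
classifies as StandardPower with NS off and as EscapeOrCompressed with NS
on, with no lookup table involved.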

> > Jacob, what's the deal with C++11 memory models? Why does that matter
> > and how is Power not able to cope?
>
> The problem is that Power requires quite a few expensive instructions
> for common atomic operations:
>
> from https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>
> Load Relaxed: ld
> Load Acquire: ld; cmp; bc; isync
> Load Seq Cst: hwsync; ld; cmp; bc; isync
> Store Relaxed: st
> Store Release: lwsync; st
> Store Seq Cst: hwsync; st
> Cmpxchg Relaxed (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop;
> _exit:
> Cmpxchg Acquire (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc
> _loop; isync; _exit:
> Cmpxchg Release (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.;
> bc _loop; _exit:
> Cmpxchg AcqRel (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.;
> bc _loop; isync; _exit:
> Cmpxchg SeqCst (32 bit): hwsync; _loop: lwarx; cmp; bc _exit; stwcx.;
> bc _loop; isync; _exit:
> Acquire Fence: lwsync
> Release Fence: lwsync
> AcqRel Fence: lwsync
> SeqCst Fence: hwsync
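
For reference, the C++11 side of two rows of that table might look like
the sketch below. The names flag, payload, publish and try_consume are
made up for the example. The instruction sequences in the comments are the
ones from the mapping quoted above, not something checked against a
particular compiler.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared state for a one-shot message hand-off.
std::atomic<std::uint32_t> flag{0};
std::uint32_t payload = 0;

// Producer: write the data, then publish it with a release store.
void publish(std::uint32_t value) {
    payload = value;
    // "Store Release" row of the table: lwsync; st
    flag.store(1, std::memory_order_release);
}

// Consumer: an acquire load that observes the flag also makes the
// earlier write to payload visible.
bool try_consume(std::uint32_t &out) {
    // "Load Acquire" row of the table: ld; cmp; bc; isync
    if (flag.load(std::memory_order_acquire) == 0) {
        return false;
    }
    out = payload;
    return true;
}
```

Switching both orderings to std::memory_order_seq_cst selects the
"Seq Cst" rows instead, which is where the hwsync comes in.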

Ok, so they do have them. These look like they were designed "a la RISC",
i.e. intended to be macro-op fused.

>
> Through testing on Compiler Explorer (both clang and gcc; using power9
> as the cpu, so including recent extensions), Power requires an lr/sc
> style loop (lwarx/stwcx.) for atomic add, which is another very common
> atomic operation.
>
> Also, note all the isync operations in the above code (used for
> acquire/seqcst loads); these also synchronize the icache, further
> slowing the operations down (though the isync instructions may never
> be executed due to being branched over, not sure).
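
To make the atomic add point concrete, here is a sketch of the two shapes
in C++. The function names add_via_cas_loop and add_direct are
hypothetical. The first spells out a retry loop by hand, which is roughly
the structure a lwarx/stwcx. sequence has; the second is the plain
fetch_add, which on RISC-V with the A extension can be a single amoadd.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomic add written as an explicit retry loop. Returns the value that
// was in target before the add, like fetch_add does.
std::uint32_t add_via_cas_loop(std::atomic<std::uint32_t> &target,
                               std::uint32_t delta) {
    std::uint32_t expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads expected with the
    // current value, so each retry recomputes expected + delta.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}

// The same operation left to the compiler and ISA.
std::uint32_t add_direct(std::atomic<std::uint32_t> &target,
                         std::uint32_t delta) {
    return target.fetch_add(delta, std::memory_order_relaxed);
}
```

Both functions give the same result; the difference under discussion is
only in how many instructions, and how much retrying under contention,
each one costs on a given ISA.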

The implementation cost of RV atomics is extremely high. Both RocketChip
and Ariane refused to try to implement them *in the core*: they actually
put them in the L2 cache!

There is only the one AMO ALU, and access to it is done through L2 Bus
Contention, with special messages sent over AXI4 (TileLink in the case of
RocketChip), to start an operation and get access to the results.

> Most of the above operations have a single instruction on RISC-V,
> making them potentially faster due to being able to do the atomic
> operation at the cache/memory controller instead of at the CPU.

Except that doesn't scale. Remember, IBM are smart buggers. Cell Processor:
hundreds if not thousands of PowerISA cores.

With separate cache coherency instructions, and a properly designed cache
(one that scales), the atomic operations can proceed with zero contention
if there is none, whether there are 4 cores or 4,000.

If we ever want to go "big GPU" (AI accelerator, NVIDIA CUDA) this will, I
feel, make more sense.

L.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68