[libre-riscv-dev] power pc

Tue Oct 22 02:19:11 BST 2019

On Sat, Oct 19, 2019 at 7:11 AM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> I took a look at the ISA (P1146) and we do not need vectors (OP4 or OP60).
> If tdi is moved to OP56, twi to OP60, and mulli to OP11, the entire 000 row
> of 8 is clear for use as Compressed and escape-sequences for 48, 64 bit and
> VBLOCK.

I think we should avoid moving instructions, in order to preserve compatibility.
Will look through the ISA tables later.

> It will be very tight.
>
> Jacob what's the deal with c++11 memory models? Why does that matter and
> how is Power not able to cope?

The problem is that Power requires quite a few expensive instructions
for common atomic operations:

from https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Load Relaxed: ld
Load Acquire: ld; cmp; bc; isync
Load Seq Cst: hwsync; ld; cmp; bc; isync
Store Relaxed: st
Store Release: lwsync; st
Store Seq Cst: hwsync; st
Cmpxchg Relaxed (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg Acquire (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg Release (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg AcqRel (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg SeqCst (32 bit): hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Acquire Fence: lwsync
Release Fence: lwsync
AcqRel Fence: lwsync
SeqCst Fence: hwsync
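
For reference, here is a minimal sketch of the C++11 source that the
Store Release, Load Acquire and Cmpxchg AcqRel rows above correspond to.
The names ready, payload, publish, try_consume and claim are made up for
illustration and are not taken from the mappings page.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical message-passing pair: payload is plain data, ready is the flag.
std::atomic<std::uint32_t> ready{0};
std::uint32_t payload = 0;

// Store Release: per the table above, lwsync; st on Power.
void publish(std::uint32_t value) {
    payload = value;                            // ordinary store
    ready.store(1, std::memory_order_release);  // orders the payload store before the flag
}

// Load Acquire: per the table above, ld; cmp; bc; isync on Power.
bool try_consume(std::uint32_t &out) {
    if (ready.load(std::memory_order_acquire) == 0) {
        return false;                           // nothing published yet
    }
    out = payload;                              // safe: ordered after the acquire load
    return true;
}

// Cmpxchg AcqRel (32 bit): per the table above, an lwarx/stwcx. loop
// bracketed by lwsync and isync on Power.
bool claim(std::atomic<std::uint32_t> &slot,
           std::uint32_t expected, std::uint32_t desired) {
    return slot.compare_exchange_strong(expected, desired,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);
}
```

The exact instructions emitted depend on the compiler and target flags,
so treat the comments as the expected mapping rather than a guarantee.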

From testing on Compiler Explorer (both clang and gcc, targeting power9
so that recent extensions are included), Power requires a
load-reserve/store-conditional (lwarx/stwcx.) loop for atomic add, which
is another very common atomic operation.

Also, note all the isync instructions in the above sequences (used for
acquire/seq-cst loads). These also synchronize the icache, further
slowing the operations down (though the isync instructions may never
be executed due to being branched over; I'm not sure).

Most of the above operations have a single instruction on RISC-V,
making them potentially faster due to being able to do the atomic
operation at the cache/memory controller instead of at the CPU.

Jacob