[libre-riscv-dev] power pc

Mon Oct 28 02:27:41 GMT 2019

On Tue, Oct 22, 2019 at 9:08 AM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Tuesday, October 22, 2019, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Sat, Oct 19, 2019 at 7:11 AM Luke Kenneth Casson Leighton
> > > Jacob what's the deal with c++11 memory models? Why does that matter and
> > > how is Power not able to cope?
> >
> > The problem is that Power requires quite a few expensive instructions
> > for common atomic operations:
> >
> > from https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> >
> > <snip>
>
>
> Ok so they do have them. These look like they have been designed "a la
> RISC" i.e intended to be macro op fused.

Power does have the necessary operations. However, because C++11's
memory model doesn't map simply to Power's memory fence instructions,
and because Power doesn't have non-macro-fused atomic RMW
instructions, implementations have a much harder time making the
atomic operations efficient.
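
For concreteness, here is a minimal C++11 sketch of the kinds of
operations being discussed. The variable and function names are
hypothetical and chosen only for illustration. The comments about
generated code summarize what is described in this thread (the mapping
table linked above and the Compiler Explorer test); they are not
output I am reproducing here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared state, named here only for illustration.
std::atomic<std::uint64_t> counter{0};  // shared counter
std::atomic<std::uint64_t> payload{0};  // data being published
std::atomic<bool> ready{false};         // "payload is valid" flag

// Atomic read-modify-write. On RISC-V with the A extension this can be a
// single amoadd instruction; for Power, the compilers tested in this thread
// emit a load-reserve/store-conditional loop surrounded by fences.
std::uint64_t bump() {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}

// Release store: makes the earlier payload write visible to any thread
// that later observes ready == true with an acquire load.
void publish(std::uint64_t value) {
    payload.store(value, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);
}

// Acquire load: the kind of load that, per the mapping table, is followed
// by a compare/branch/isync sequence on Power.
bool try_consume(std::uint64_t &out) {
    if (!ready.load(std::memory_order_acquire)) return false;
    out = payload.load(std::memory_order_relaxed);
    return true;
}

// Single-threaded usage check; it exercises the API only, not the ordering.
void single_thread_demo() {
    std::uint64_t v = 0;
    assert(!try_consume(v));  // nothing published yet
    publish(7);
    assert(try_consume(v) && v == 7);
    assert(bump() == 0);  // fetch_add returns the previous value
    assert(bump() == 1);
}
```

Pasting bump() and try_consume() into Compiler Explorer for each
target is the easiest way to compare the instruction sequences.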

>
>
> >
> > Through testing on Compiler Explorer (both clang and gcc; using power9
> > as the cpu, so including recent extensions), Power requires a lr/sc
> > loop for atomic add, which is another very common atomic operation.
> >
> > Also, note all the isync operations in the above code (used for
> > acquire/seqcst loads), these also synchronize the icache -- further
> > slowing the operations down. (though the isync instructions may never
> > be executed due to being branched over, not sure.)
>
>
> The implementation cost of RV atomics is extremely high. Both RocketChip
> and Ariane refused to try to implement them *in the core*, they actually
> put them in the L2 cache!

From what I can tell, RocketChip actually implements the AMO ALU in
the L1 d-cache.

The reason to put the AMO ALU in the L2 cache is not because of the
high implementation cost (it's about as expensive as a simple integer
ALU -- similar cost to 2-3 64-bit adders), but because that's the spot
where contended atomic operations are most efficient to execute -- run
the operation where the data is rather than moving the whole cache
block to the core just to move it to a different core through the L2
cache for the next atomic operation.

Atomic operations can be implemented in several different places, which
balance different aspects of performance:
- in the L1 cache/in the core: uncontended atomic operations are
  fastest, contended atomic operations are slower
- in the L2 (or L3) cache: uncontended atomic operations are much
  slower, contended atomic operations are faster due to not needing to
  move the whole cache block

Basically, moving the AMO ALU(s) closer to the cores improves
performance of uncontended atomics due to parallelism and faster
signalling to/from the cores, while moving the AMO ALU(s) closer to
the memory crossbar (or other shared structure) improves performance
of contended atomics, since the data doesn't need to move and locking
a block for the atomic operation becomes less expensive due to needing
less synchronization.
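
To make the trade-off a bit more concrete, here is a toy C++ model
that only counts bytes moved for a sequence of atomic operations on
one cache block. Everything in it is an assumption for illustration:
the 64-byte block, the 16-byte request/response message, and the
function names. It ignores latency entirely, which is the main reason
uncontended atomics are slower at the L2, so treat it as a rough
sketch rather than a performance model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative assumptions, not measurements of any real chip.
constexpr std::size_t kBlockBytes = 64;    // assumed cache block size
constexpr std::size_t kMessageBytes = 16;  // assumed request + response size

// AMO ALU at each core/L1: the whole block has to be fetched into a core's
// L1 whenever a different core performed the previous operation.
std::size_t bytes_moved_amo_at_l1(const std::vector<int> &core_sequence) {
    std::size_t bytes = 0;
    int owner = -1;  // -1 means no L1 holds the block yet
    for (int core : core_sequence) {
        if (core != owner) {
            bytes += kBlockBytes;  // migrate the block to this core
            owner = core;
        }
    }
    return bytes;
}

// AMO ALU at the shared L2: the block stays where it is and every
// operation costs one small request/response message.
std::size_t bytes_moved_amo_at_l2(const std::vector<int> &core_sequence) {
    return core_sequence.size() * kMessageBytes;
}
```

With these assumed sizes, eight operations from a single core move 64
bytes in the L1 model and 128 bytes in the L2 model, while eight
operations issued round-robin by four cores move 512 bytes in the L1
model and still 128 bytes in the L2 model.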

>
> There is only the one AMO ALU, and access to it is done through L2 Bus
> Contention, with special messages sent over AXI4 (TileLink in the case of
> RocketChip), to start an operation and get access to the results.
>
>
> > Most of the above operations have a single instruction on RISC-V,
> > making them potentially faster due to being able to do the atomic
> > operation at the cache/memory controller instead of at the CPU.
>
>
> Except that doesn't scale.

If there are AMO ALUs at each of the cores/L1 caches and at the L2
cache(s) (and lower levels), it scales better due to getting the best
of both worlds: fast uncontended atomics due to an AMO ALU at each
core/L1 and faster contended atomics due to transporting the operation
& synchronization to the data rather than transporting the much larger
cache block to the core and back.

> Remember, IBM are smart buggers. Cell Processor:
> hundreds if not thousands of PowerISA cores.

Actually, the PlayStation 3 only has 1 Power core; the rest are not
Power cores and can't even address main memory without using DMA.

>
> With separate cache coherency instructions, and a properly designed cache
> (one that scales), the atomic operations can proceed with zero contention
> if there is none, whether there are 4 cores or 4,000.
>
> If we ever want to go "big GPU" (AI accelerator, NVIDIA CUDA) this will I
> feel make more sense.

Yes, which is why I am concerned about PowerISA, since it doesn't have
the right set of cache coherency/atomic instructions. Either we have
to macro-op fuse a 4 (or more) instruction sequence or have to execute
multiple cache coherency/atomic operations where we would only need
one.

Additionally, unless we macro-op fuse the 6 (!) instruction sequence
for atomic add [1] (one of the most common atomic RMW operations), the
operation must be done at the core rather than being able to do it at
the L2 cache due to the LR/SC loop being executed on the core.

1: https://gcc.godbolt.org/z/3q4dYX -- lwsync through isync

Also, the above atomic add code sequence *does* execute the isync
instruction (when the loop exits), causing a needless i-cache
flush/sync.
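
As a rough portable-C++ analogy for the difference between the two
shapes of atomic add discussed above, here is a sketch with two
hypothetical helper functions. The retry loop uses
compare_exchange_weak, which is only structurally similar to a
load-reserve/store-conditional loop; it is not the actual Power
instruction sequence from the link above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Retry-loop form: the core reads the value, computes the sum, and tries to
// store it back, repeating if another core got in between. Because the
// arithmetic happens on the core, the operation cannot be shipped to an ALU
// at the L2 unless the whole sequence is recognized and fused.
std::uint32_t add_with_retry_loop(std::atomic<std::uint32_t> &x,
                                  std::uint32_t n) {
    std::uint32_t old = x.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, much like a
    // store-conditional; on failure it reloads the current value into old.
    while (!x.compare_exchange_weak(old, old + n,
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed)) {
    }
    return old;  // previous value, matching fetch_add's return convention
}

// Single-operation form: the whole read-modify-write is one request, which
// an ISA with AMO instructions can express as a single instruction.
std::uint32_t add_single_op(std::atomic<std::uint32_t> &x, std::uint32_t n) {
    return x.fetch_add(n, std::memory_order_seq_cst);
}
```

Both functions return the previous value and leave the same result in
memory; the difference is only in where the addition can be performed.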

Jacob Lifshay