[Libre-soc-dev] bigmul and the difference between designing an ISA and designing an *implementation* of an ISA

Luke Kenneth Casson Leighton lkcl at lkcl.net
Mon Sep 25 15:16:39 BST 2023


am just reading this:
https://libre-soc.org/irclog/%23libre-soc.2023-09-18.log.html#t2023-09-18T20:28:04

sadoon, this is a very common mistaken assumption, one that is made
about SIMD ISAs vs SIMD implementations just as much as it is about
Vector ISAs and Vector implementations.

the concept of "Vector Chaining" was first introduced by Cray. basically,
as long as all elements are sequential in nature and independent, you
can do *one element Load, one element process, one element store*
as a fully pipelined operation...

    *or any number of elements in parallel at your discretion*

of course one might naturally think that the maximum number will be the size of
the Vector Length, but *this* turns out not to be the case either, because you
can always do Multi-Issue *even on Vector operations* and even mix-and-match the
individual elements as to the number carried out at any one time.

AMD just applied the 50+ year old "Vector Chaining" concept to their recent
implementation of AVX-512, for example.
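
as a rough illustration of what chaining buys (a toy model, not a description
of Cray's or AMD's actual pipelines): assume three hypothetical stages, load,
process and store, each handling one element per cycle with single-cycle
latency. all function names below are made up for this sketch.

```python
STAGES = ("load", "process", "store")  # hypothetical 3-stage model


def unchained_cycles(n, stages=len(STAGES)):
    """Each stage finishes the whole n-element vector before the next starts."""
    return stages * n


def chained_cycles(n, stages=len(STAGES)):
    """Element i enters stage s at cycle i + s, so stages overlap."""
    if n == 0:
        return 0
    return n + stages - 1


def chained_timeline(n, stage_names=STAGES):
    """Return, per cycle, which element index each stage is working on."""
    timeline = []
    for cycle in range(chained_cycles(n, len(stage_names))):
        active = {}
        for s, name in enumerate(stage_names):
            elem = cycle - s  # element that reached stage s this cycle
            if 0 <= elem < n:
                active[name] = elem
        timeline.append(active)
    return timeline


if __name__ == "__main__":
    for cycle, active in enumerate(chained_timeline(4)):
        print(cycle, active)
```

under these assumptions an 8-element vector takes 24 cycles unchained and 10
cycles chained, and the program that requested the operation is identical in
both cases.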

so assuming "there WILL be full parallelism" is the wrong approach.

from an ISA perspective we have to *think through* the implications of every
possible internal Micro-architecture: right the way from a 0.1 IPC Finite State
Machine all the way through to a massive Multi-Issue GBOoO back-end, and
make damn sure that whatever goes into the ISA is *in no way* compromised
by the implementor's choices.
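
a minimal sketch of that requirement, assuming a plain element-wise add and a
hypothetical `schedule` list saying how many elements the hardware chooses to
complete in each step (neither is taken from any real ISA or simulator):

```python
def execute_vector_add(a, b, schedule):
    """Element-wise add of a and b, completing schedule[k] elements in step k.

    `schedule` stands in for the implementor's choice: [1, 1, 1, ...] is a
    slow one-element-at-a-time machine, [len(a)] is fully parallel, and
    anything in between is a mixed-width implementation.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if any(g <= 0 for g in schedule) or sum(schedule) != len(a):
        raise ValueError("schedule must cover every element exactly once")
    result = []
    pos = 0
    for group in schedule:
        # the elements are independent, so any grouping gives the same answer
        result.extend(x + y for x, y in zip(a[pos:pos + group],
                                            b[pos:pos + group]))
        pos += group
    return result, len(schedule)  # (architectural result, steps taken)


if __name__ == "__main__":
    a = list(range(8))
    b = [10] * 8
    for schedule in ([1] * 8, [8], [3, 1, 4]):
        print(schedule, execute_vector_add(a, b, schedule))
```

every schedule produces the same result vector; only the step count changes,
and the step count is the one thing the "program" (the call itself) never
mentions.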

therefore as far as the assembly is concerned *you literally don't care*:
it's not your problem what the "performance" will be. that's down to
the hardware implementor to decide.

all assembler written should *in no way* even *consider* what its "performance"
might be.

which is very weird and disconcerting, but the analogy would be writing a
portable function and then starting to ask questions like "the performance MUST
be parallel even on an STM32F and as fast as it will be on a 48-core 4.8 GHz
Xeon, right? because it's the same function, right?"

see how weird that is? :)

l.

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


