[Libre-soc-dev] funding & go-ahead requested for #1134

Tue Aug 22 01:00:07 BST 2023

On Fri, Aug 18, 2023, 19:55 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

>  * v3.1 prefixed (if we decide to do that): plf[s/d]
>
> i already said NO repeatedly on that, didn't i?

I put them in because they were on the list. I'll remove them.

Regarding v3.1 PO1 prefixed instructions, I'm fine with not doing them as
part of this grant because we don't have enough time/money/etc.

However, I do think that rejecting PO1 prefixes forever is a serious
mistake. If you are rejecting them because you think the Apple M1 goes
faster due to having a 32-bit-only instruction set, you are mistaken.
According to some very interesting research I found:
https://dougallj.github.io/applecpu/firestorm.html
https://drive.google.com/file/d/1WrMYCZMnhsGP4o3H33ioAUKL_bjuJSPt/view

the Apple M1 fetches 64 bytes (16 aarch64 instructions) from the L1I cache,
and then decodes 32 bytes (8 aarch64 instructions), *however* those 32
bytes are really decoded into some 32-bit single instructions and some
64-bit fused-pair-instructions. Those 64-bit fused instructions, for nearly
all purposes, behave identically (in the cpu frontend at least) to PO1
prefixed instructions in PowerISA.

I expect, though haven't verified, that many of the PO1 prefixed
instructions (mostly paddi and load/stores) have a substantial speedup over
the several instructions that they replace, even assuming they lower the
number of instructions that can be decoded due to them taking up twice the
decoder-input space.
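
The trade-off above can be put as simple slot accounting. This is only a
minimal sketch under stated assumptions: the function name is hypothetical,
and the 2-instruction sequence is an illustrative case, not a measurement.
It assumes a prefixed instruction occupies two 32-bit decoder-input slots
but issues as one instruction, while the sequence it replaces occupies one
slot and one issued instruction per 32-bit instruction.

```python
from typing import Tuple


def slots_and_ops(n_replaced: int, use_prefixed: bool) -> Tuple[int, int]:
    """Return (decoder-input slots, issued instructions) for one logical op.

    Hypothetical model: n_replaced is the number of plain 32-bit
    instructions that a single prefixed instruction would replace.
    """
    if use_prefixed:
        # Prefix word + suffix word: two slots, one issued instruction.
        return 2, 1
    # Plain sequence: one slot and one issued instruction per 32-bit word.
    return n_replaced, n_replaced


# Illustrative case: a 2-instruction sequence vs. one prefixed instruction.
plain = slots_and_ops(2, use_prefixed=False)
prefixed = slots_and_ops(2, use_prefixed=True)
```

Under this model the decoder-input cost is break-even when two instructions
are replaced, and the number of instructions issued to the backend is halved;
whether that gives a real speedup still needs verifying.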

Additionally, we already have to pay the costs of supporting 64-bit
instructions due to the PO9 prefix, so the incremental cost of supporting
PO1 is substantially smaller.

If you think you can avoid the cost of decoding 64-bit instructions by
making SVP64 prefixes be their own instruction which sets a register which
by being set modifies the next instruction, this can be done nearly
identically for PO1 prefixes too. Note, though, that for a wide OoO cpu,
treating the PO9/PO1 prefix as a separate instruction is likely more complex
(due to needing complex forwarding/prediction logic for speed) than
supporting natively 64-bit instructions in the first place.

>
> why did i say NO and state very clearly and repeatedly
> that it was a final and irrevocable decision?
>

Because you thought that the M1 got a lot of its speed from a uniform
32-bit instruction set (which isn't actually true due to instruction
fusion), and because adding PO1 prefixes is too much work for right now
(fine, have it your way, but this should be just for now, not forever), and
because you had other unstated reasons (you implied you can't state some of
them due to OPF NDAs).

>
> please now elaborate (here) the suite of tasks needed (the full
> chain, which requires ISACaller modification as well
> as unit tests for appx 20 instructions)

We're not doing PO1 prefixes now; all I want is that we will later
seriously consider adding them, and that you will not base your opinion on
our previous mistaken understanding of how the Apple M1 works.

Therefore, listing all the tasks needed for doing it right now is wasting
our time, especially because our cpu architecture will have changed (e.g.
actually having OoO superscalar support) by the time we actually want to
add PO1 support, so what we need to do will have also substantially changed.

> and also summarise
> the damage caused to our chances of reaching high performance
> with a simple design in nanoscale geometries (3nm or less).
>

Our chances of getting high performance at small geometries are actually
improved somewhat by instruction fusion and supporting PO1 64-bit
instructions due to the lowered number of instructions programs would need
to run.

If we use a design like I suggested a while ago (trying to explain in email
doesn't work well due to needing diagrams), our design complexity will not
be seriously impacted, because basically all we need is:

 * a 32-bit feedback register in the decode pipe (to carry the first half
   of a 64-bit insn that crosses fetch blocks over from the previous fetch
   block so it can be decoded),
 * a length calculation stage (which can easily overlap the first stage of
   the decoder, so it doesn't require extra latency),
 * a decoder for every 32 bits of input (needed anyway, so nearly no extra
   cost),
 * the logic to suppress issuing instructions from the unused decoder for
   the other 32 bits of each 64-bit instruction.

Basically, every 64-bit instruction would take up two 32-bit-wide
fetch-decode pipe slots.
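
The scheme described above can be sketched as a small behavioural model.
This is only an illustration under stated assumptions, not a description of
any existing Libre-SOC code: the names (`Insn`, `decode_block`, `carry`) are
hypothetical, a 32-bit word is treated as a prefix when its primary opcode
(top 6 bits) is 1 or 9, and the length calculation is written as a sequential
loop for readability where hardware would do it in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: PO1 marks a v3.1 prefix and PO9 marks an SVP64 prefix.
PREFIX_OPCODES = {1, 9}


def primary_opcode(word: int) -> int:
    """Top 6 bits of a 32-bit instruction word."""
    return (word >> 26) & 0x3F


@dataclass
class Insn:
    prefix: Optional[int]  # prefix word for a 64-bit insn, else None
    word: int              # the 32-bit insn, or the suffix of a 64-bit one


def decode_block(words: List[int],
                 carry: Optional[int]) -> Tuple[List[Insn], Optional[int]]:
    """Decode one fetch block of 32-bit words.

    carry models the 32-bit feedback register: a prefix word left over
    from the previous fetch block, or None. Returns the issued
    instructions and the new carry value.
    """
    issued: List[Insn] = []
    pending = carry
    for w in words:
        if pending is not None:
            # This slot is a suffix: issue one 64-bit insn for both slots
            # (the suffix slot never issues as a standalone instruction).
            issued.append(Insn(pending, w))
            pending = None
        elif primary_opcode(w) in PREFIX_OPCODES:
            # Length calculation: this slot starts a 64-bit insn, so its
            # own issue is suppressed until the suffix arrives.
            pending = w
        else:
            issued.append(Insn(None, w))
    return issued, pending


NOP = 0x60000000           # ori 0,0,0 (primary opcode 24)
PREFIX = 1 << 26           # illustrative word with primary opcode 1
SUFFIX = 0x38000000        # illustrative suffix word

# A 64-bit insn that straddles two fetch blocks of two words each.
first, carry = decode_block([NOP, PREFIX], None)
second, carry_after = decode_block([SUFFIX, NOP], carry)
```

In this model the prefix at the end of the first block issues nothing on its
own; it is held in `carry` and issued together with its suffix at the start
of the next block, so the pair consumes two slots but produces one
instruction.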

Jacob