[Libre-soc-bugs] [Bug 1135] add FPSCR and Rounding classes to ieee754fpu

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Thu Aug 10 21:37:43 BST 2023


https://bugs.libre-soc.org/show_bug.cgi?id=1135

--- Comment #6 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Jacob Lifshay from comment #0)
> > the idea is that the cpu will have all three parts in separate registers and
> > will speculatively execute fp insns with the current value of the sticky
> > part register (not the one from the previous instruction, but the one from
> > the register, avoiding needing a dependency chain),
> 
> please do not do that.

I'm not saying we need that now, but we will later, because otherwise every fp
arithmetic op potentially both reads and writes the sticky bits (since it
or-s in its detected exceptions). so if an fp add takes 3 cycles, then:
VL = 64
sv.fadd *0, *0, *64

will take at least *192 cycles* (64 elements * 3 cycles each), no matter *how
wide the SIMD units are or how many there are*, because the first element
potentially writes the sticky bits which the second reads, so the second
element has to wait for the first. the second element potentially writes the
sticky bits which the third reads, so the third has to wait for the second, and
so on. some of those bits (e.g. the inexact flag) can't be calculated until the
very last pipeline stage, since they're calculated as part of the final
rounding.
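
A minimal sketch of that arithmetic, as a toy latency model in Python. The
function names and the `lanes` parameter are made up for illustration, and the
second function assumes fully pipelined units that each accept one element per
cycle, which is an assumption and not a description of any real design:

```python
# Toy latency model; names and parameters are illustrative only.

def serialized_cycles(vl: int, latency: int) -> int:
    """Every element waits for the previous element's sticky-bit output."""
    finish = 0
    for _ in range(vl):
        # the next element can't start until the previous flags are known
        finish += latency
    return finish


def independent_cycles(vl: int, latency: int, lanes: int) -> int:
    """No sticky-bit chain: `lanes` elements issue per cycle (assumed)."""
    issue_cycles = -(-vl // lanes)  # ceil(vl / lanes)
    return issue_cycles - 1 + latency


print(serialized_cycles(64, 3))      # 192, regardless of SIMD width
print(independent_cycles(64, 3, 8))  # 10 with 8 assumed lanes
```

The serialized figure doesn't depend on `lanes` at all, which is the point
being made above.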

>  if there is a dependency chain it is just tough luck.
> the programmer is already warned in the spec "some things might be slower"
> and surprise, that's what they get.

basically every other OoO cpu handles the sticky bits specially, for exactly
the reason that I explained above. SO can be handled using dependency chains
since writing to it is uncommon; the sticky bits can't, because nearly every fp
insn potentially writes to them. other cpus often have special hardware that
accumulates the sticky bits from all instructions currently in flight, so fp
ops can run at full speed, since the accumulation can be done for many insns
per clock. the slow part is *reading* from that special accumulation hardware,
because the cpu has to wait for *all* prior fp ops to complete first, often
doing a full cpu flush.
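
A rough behavioural sketch of that kind of accumulator, in Python. The class,
its methods and the flag encodings are all hypothetical (this is not the real
FPSCR bit layout, nor how any particular cpu implements it); it only shows
that or-ing in flags is order-independent and cheap, while a read has to wait
for everything in flight:

```python
from dataclasses import dataclass

# Hypothetical flag encodings for this sketch only.
FLAG_INEXACT = 0b001
FLAG_OVERFLOW = 0b010
FLAG_INVALID = 0b100


@dataclass
class StickyAccumulator:
    sticky: int = 0
    in_flight: int = 0

    def issue(self) -> None:
        """An fp op enters the pipeline; it does not read the sticky bits."""
        self.in_flight += 1

    def complete(self, flags_list) -> None:
        """Several ops may finish in one clock; OR all their flags in at once."""
        for flags in flags_list:
            self.sticky |= flags
        self.in_flight -= len(flags_list)

    def can_read(self) -> bool:
        return self.in_flight == 0

    def read(self) -> int:
        """Reading is the slow path: all prior fp ops must have completed."""
        if not self.can_read():
            raise RuntimeError("must wait for all prior fp ops to complete")
        return self.sticky
```

For example, after three `issue()` calls and `complete([FLAG_INEXACT,
FLAG_OVERFLOW])`, `can_read()` is still false until the third op completes.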

> 
> > and then will cancel and
> > retry all later insns if it turns out that the insn changed the sticky part
> > (which is rare).
> 
> no, you REALLY do not want to be doing that.

that's exactly what we'll need much later, though for now we can just use
dependency chains and accept really slow fp.

> 
> follow EXACTLY how XER works, please, starting with adding FPSCR as
> "its own register file".

none of the FPSCR registers are being added by this bug or by any of the
ieee754fpu work; that all happens in soc.git, later.

> 
> do NOT attempt repeat DO NOT attempt to add "speculation" of ANY KIND.

none of that is being added here, I'm just planning ahead for when we'll need
it much later.

> please follow this procedure:
> 
> * split the FPSCR-regfile into the four (or more) parts that you advocated
> * pass in the parts of FPSCR that *might* be written to, as "read operands"
>   (these will be written-out *if* needed)
> * pass in an immediate operand (in the Record)
>   "fp_overflow_just_like_xer_so_overflow"
>   - this if clear is how you know that the copy of FPSCR will not be
>     read, and consequently not be written to

that doesn't really work because, unlike OE=1 which is usually switched off,
*all* fp computation ops *always* generate sticky-bit outputs that need to be
or-ed into FPSCR.

some of those sticky outputs are extremely commonly set (such as inexact, which
is set whenever there is any rounding error whatsoever). the reason FPSCR
doesn't commonly change is that the corresponding flags will usually have
already been set in FPSCR, so or-ing in more 1s doesn't change the 1 that's
already there.
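
To illustrate with a small Python sketch (the flag values and the helper name
are invented for the example, not the real FPSCR encoding): over a stream of
ops that nearly all raise inexact, only the first occurrence of each flag
actually changes the sticky value.

```python
# Hypothetical flag encodings for this sketch only.
FLAG_INEXACT = 0b01
FLAG_OVERFLOW = 0b10


def count_sticky_changes(flag_stream, sticky=0):
    """Return (final sticky value, number of ops that actually changed it)."""
    changes = 0
    for flags in flag_stream:
        updated = sticky | flags
        if updated != sticky:
            changes += 1
        sticky = updated
    return sticky, changes


# nine ops, all inexact, one of which also overflows
stream = [FLAG_INEXACT] * 5 + [FLAG_INEXACT | FLAG_OVERFLOW] + [FLAG_INEXACT] * 3
print(count_sticky_changes(stream))  # (3, 2)
```

Every op in the stream produces a non-zero sticky output, yet only two of the
nine change the accumulated value.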

>   - however if set then you pass through the copy of the FPSCR bits
>     right the way through all pipeline stages.

I'm planning on just passing the FPSCR parts through the pipeline stages,
modifying the parts as needed.

> * and EXACTLY as is done with XER.SO when overflow is enabled,
>   have the final stage of the pipeline set or clear the "data.ok"
>   bit.

i can do that, and just set the .ok bits when those FPSCR parts need to
change. the volatile part will need to change nearly every time (but doesn't
usually get read, so that's fine), the sticky part rarely (but insns can't
easily tell until the last pipe stage), and the control part only for specific
control insns.
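
Roughly, as a plain-Python sketch of the last pipe stage of an arithmetic op.
`Part` and `final_stage` are hypothetical names for illustration, not the
actual soc.git or ieee754fpu classes, and treating "ok" as "value differs from
the input" is my assumption about how the write-enable would be derived:

```python
from dataclasses import dataclass


@dataclass
class Part:
    data: int
    ok: bool = False  # True means "this part needs writing back"


def final_stage(volatile_in: int, sticky_in: int, control_in: int,
                result_volatile: int, detected_flags: int) -> dict:
    """Last stage of an fp arithmetic op: decide which parts to write."""
    sticky_out = sticky_in | detected_flags
    return {
        # changes for nearly every op, but is rarely read
        "volatile": Part(result_volatile, ok=result_volatile != volatile_in),
        # only known here, after final rounding; rarely actually changes
        "sticky": Part(sticky_out, ok=sticky_out != sticky_in),
        # arithmetic ops never write the control part
        "control": Part(control_in, ok=False),
    }
```

e.g. `final_stage(0b01, 0b1, 0, 0b10, 0b1)` gives `ok` set only on the volatile
part, since the detected flag was already set in the sticky part.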

> BEFORE BEGINNING please can you describe in your own words precisely and
> exactly how XER.SO XER.CA/32 and XER.OV/32 work, and how they are part
> of a special "regfile".

I'm simplifying slightly since I don't want to write 10 pages of text:
SO/CA[32]/OV[32] are passed as inputs from registers/dependency-tracking to all
relevant ALUs. those ALUs check OE; if OE=1, they or-in their overflow output
and signal that SO/OV[32] need to be written. dependency tracking then checks
whether the output is marked as written, and if so delays until the output is
computed, then writes that output to the registers and/or forwards it to other
insns' inputs as necessary. if the output is not marked as written (computable
at decode time, but i think we delay for some insns), then dependency tracking
uses the old SO/OV[32]/CA[32] and forwards that from registers/etc. to later
insns.
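
The same thing as a small Python sketch for a 64-bit add, covering SO only.
`add_with_oe` and its return convention are invented for the example and
leave out CA/CA32/OV32; it is a model of the description above, not the actual
ALU code:

```python
MASK64 = (1 << 64) - 1
SIGN64 = 1 << 63


def add_with_oe(ra: int, rb: int, so_in: int, oe: bool):
    """Return (result, so_out, so_written) for a 64-bit add."""
    result = (ra + rb) & MASK64
    if not oe:
        # SO not marked as written: later insns just get the old SO forwarded
        return result, so_in, False
    # signed overflow: operands have the same sign, result's sign differs
    ov = bool(~(ra ^ rb) & (ra ^ result) & SIGN64)
    # SO is sticky: or-in the overflow, and mark SO as needing a write
    return result, so_in | int(ov), True


print(add_with_oe(2**63 - 1, 1, 0, oe=True))   # (9223372036854775808, 1, True)
print(add_with_oe(2**63 - 1, 1, 0, oe=False))  # (9223372036854775808, 0, False)
```

With OE=0 the dependency tracking never has to wait on this insn for SO, which
is the common case; the fp sticky bits have no equivalent of that common case.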
