[libre-riscv-dev] IEEE754FPU pipeline API and its use

Tue Aug 6 09:04:14 BST 2019

A question was just asked on hw-dev about pipelining in verilog, with the newly announced hardfloat-verilog library.

The ieee754fpu library is being developed not just for the Libre RISCV CPU/GPU but for general-purpose use. As such it is being designed for maximum reconfigurability and readability, and has comprehensive tests (using John's excellent softfloat-3 with the sfpy python bindings).

The ieee754fpu library is based on combinatorial building blocks that can be chained together either combinatorially or using IO modules that create pipeline registers.

Stage API:
http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/iocontrol.py;h=04e4123839a03b8737f92d4a9cd8b2a0cc2dd1e5;hb=HEAD#l10

IO Handling Base class:
http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/iocontrol.py;h=04e4123839a03b8737f92d4a9cd8b2a0cc2dd1e5;hb=HEAD#l68

With very few exceptions (data multiplexing), neither the combinatorial blocks nor the pipeline IO "Data Handlers" know anything about each other. They are entirely opaque to each other.
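To make the separation concrete, here is a minimal plain-python model (not nmigen, and not the actual nmutil code). The class names AddOne, Double, StageChain and RegisteredPipe are made up for illustration; the only thing the blocks expose is a process() function, and the data handler never looks inside the data it moves.

```python
class AddOne:
    """Hypothetical combinatorial block: only knows how to transform data."""
    def process(self, i):
        return i + 1


class Double:
    """A second hypothetical combinatorial block."""
    def process(self, i):
        return i * 2


class StageChain:
    """Chains blocks combinatorially: no registers, looks like one block."""
    def __init__(self, stages):
        self.stages = list(stages)

    def process(self, i):
        for stage in self.stages:
            i = stage.process(i)
        return i


class RegisteredPipe:
    """Hypothetical data handler: one register after each block.

    It treats the data as opaque and only calls process() on each block.
    None is used here to model "no valid data" (a bubble).
    """
    def __init__(self, stages):
        self.stages = list(stages)
        self.regs = [None] * len(self.stages)

    def clock(self, i=None):
        """Advance one clock; return the last register's new contents."""
        # Update from the last register backwards so that each register
        # sees the *previous* cycle's value of the one before it.
        for n in range(len(self.stages) - 1, 0, -1):
            prev = self.regs[n - 1]
            self.regs[n] = None if prev is None else self.stages[n].process(prev)
        self.regs[0] = None if i is None else self.stages[0].process(i)
        return self.regs[-1]
```

The same two blocks give the same answer either way: StageChain produces it in zero clocks, RegisteredPipe after two.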

Different "rules" (modules/classes) can be applied to give different behaviour, such as buffered pipelines, which have an extra register (in effect a single-entry FIFO) so that at least one "stall" at the end of the pipe does not force the *start* to stall immediately as well (see the zipcpu post further down for details).

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/singlepipe.py;h=7f146585ad95924faa8290a545d4603f59abfe40;hb=HEAD#l324
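The effect of the extra register can be seen in a small cycle-by-cycle model. This is a plain-python sketch under my own assumptions, not the nmigen class linked above: BufferedStage and run are hypothetical names, and None stands for "no valid data".

```python
class BufferedStage:
    """One output register plus one spare register used only during a stall."""
    def __init__(self):
        self.out = None   # main output register (None = empty)
        self.buf = None   # spare register (the single-entry FIFO)

    @property
    def ready(self):
        """Upstream may send as long as the spare register is empty."""
        return self.buf is None

    def clock(self, data_in, next_ready):
        """One clock edge; returns the item downstream accepted, if any."""
        incoming = data_in if self.ready else None
        sent = None
        if next_ready and self.out is not None:
            sent, self.out = self.out, None
        if self.out is None:
            if self.buf is not None:
                # Drain the spare register first to keep ordering.
                self.out, self.buf = self.buf, None
            else:
                self.out = incoming
        elif incoming is not None:
            # Downstream stalled: park the new item instead of refusing it.
            self.buf = incoming
        return sent


def run(items, ready_pattern):
    """Feed items through one stage; ready_pattern is downstream's ready."""
    stage = BufferedStage()
    pending = list(items)
    received, upstream_ready = [], []
    for next_ready in ready_pattern:
        offer = pending[0] if pending else None
        was_ready = stage.ready
        sent = stage.clock(offer, next_ready)
        if offer is not None and was_ready:
            pending.pop(0)          # the stage took the item
        upstream_ready.append(was_ready)
        if sent is not None:
            received.append(sent)
    return received, upstream_ready
```

With run([1, 2, 3], [True, False, True, True, True]) the downstream stall happens in cycle 2, yet the stage still reports ready in cycle 2 and only drops ready in cycle 3, which is the one-cycle grace that the buffer buys.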

Another class provides global "cancellation" capability. This one is suitable for both speculative and SIMD pipelines. Each stage is given both an "active" mask and a global "stop" mask. Only data which has a non-zero "active" mask is passed from register to register. However, if at any time the global "stop" mask is activated, the ongoing ripple will cease.

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/singlepipe.py;h=7f146585ad95924faa8290a545d4603f59abfe40;hb=HEAD#l407

Both masks are unary encoded, and the width of the masks must equal the total pipeline length. In this way it is possible not only to uniquely identify each result as it is being computed by the pipeline, but also to individually cancel any *and all* partial result(s) in a single cycle.
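As a rough illustration of the mask idea (a plain-python model with made-up names survives, clock and demo, not the nmigen implementation): each in-flight item carries a one-hot "active" mask, and a "stop" mask with the matching bits set removes those items wherever they currently are in the pipe.

```python
def survives(item, stop_mask):
    """An item ripples onwards only if active and not being stopped."""
    if item is None:
        return False
    active_mask, _data = item
    return active_mask != 0 and (active_mask & stop_mask) == 0


def clock(slots, new_item, stop_mask=0):
    """Advance a pipe by one clock.

    slots[n] holds (active_mask, data) or None; index 0 is the first stage.
    Returns (new_slots, item_leaving_the_pipe_or_None).
    """
    moved = [new_item] + slots[:-1]
    leaving = slots[-1]
    new_slots = [it if survives(it, stop_mask) else None for it in moved]
    out = leaving if survives(leaving, stop_mask) else None
    return new_slots, out


def demo():
    """3-stage pipe, 3-bit masks: cancel 'a' and 'b' in a single cycle."""
    slots = [None] * 3
    emerged = []
    feed = [
        ((0b001, "a"), 0),
        ((0b010, "b"), 0),
        ((0b100, "c"), 0b011),   # stop mask hits both 'a' and 'b'
        (None, 0),
        (None, 0),
        (None, 0),
    ]
    for item, stop_mask in feed:
        slots, out = clock(slots, item, stop_mask)
        if out is not None:
            emerged.append(out)
    return emerged
```

Only the item tagged 0b100 comes out of the far end; the other two are dropped in the same cycle regardless of which stage they had reached.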

For anyone familiar with out of order speculative design, in particular with Mitch Alsup's augmented "precise" 6600 Dependency Matrix design, this mask/cancellation capability will be immediately recognised as being extremely useful.

Two additional very important modules were also written: fan-in and fan-out routing. These basically allow, as expected, many-to-one and one-to-many data routing. These classes are where the separation between data and its "handling" presently breaks down (to be fixed shortly). Fascinatingly, they're combinatorial, so do not create a synchronous clock delay.

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/multipipe.py;h=c0d9127c37b8feab0ed79a1e4033e7a5d06b1f38;hb=HEAD#l375
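A sketch of the routing idea in plain python follows. The function names fan_in and fan_out, and the choice of lowest-numbered-port-wins arbitration, are assumptions made for this example rather than a description of multipipe.py. The point is that both are pure functions of their inputs, i.e. combinatorial, and that the port number travels with the data so the result can find its way back.

```python
def fan_in(inputs):
    """Many-to-one: pick one port's data and tag it with the port number.

    inputs has one entry per port; None means that port has nothing to send.
    Arbitration here is simple priority (lowest port number wins), which is
    an assumption made for this sketch.
    """
    for port_id, data in enumerate(inputs):
        if data is not None:
            return (port_id, data)
    return None


def fan_out(tagged, n_ports):
    """One-to-many: deliver tagged data to the port named in its tag."""
    outputs = [None] * n_ports
    if tagged is not None:
        port_id, data = tagged
        outputs[port_id] = data
    return outputs


def shared_unit(inputs, operation):
    """N requesters sharing one unit: fan-in, compute, fan-out."""
    tagged = fan_in(inputs)
    if tagged is not None:
        port_id, data = tagged
        tagged = (port_id, operation(data))
    return fan_out(tagged, len(inputs))
```

For instance shared_unit([None, 7, 9], lambda x: x * 2) serves port 1 this time round and returns its result on output port 1 only.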

The multi-in, multi-out modules are extremely important for the creation of AXI4/Wishbone data routing units, as well as for creating Reservation Stations for out of order designs.

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/concurrentunit.py;h=da63d32209612023a0e5e1fa171fd245313c950c;hb=HEAD#l40

There are also much simpler Data Handling IO modules, and one that uses a Queue (FIFO).  This particular module is based on a direct line-for-line translation of the chisel3 Queue module into nmigen.

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/queue.py;h=069aa22cc33d8507a0644157e70bf0a39abbf450;hb=HEAD#l34

Surprisingly, its WEN/RDY/REN/RDY characteristics fitted perfectly and directly (bar some name-changing shim logic) with the Data Handling IO API.

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/singlepipe.py;h=7f146585ad95924faa8290a545d4603f59abfe40;hb=HEAD#l782

All of this gives a firm fundamental basis for creating really powerful ALUs and other units. It would be flat-out impossible to do in pure verilog, as it critically relies on the OO capabilities and multiple inheritance features of the python programming language.

[with the *output* from nmigen being verilog, we get the best of both worlds: tool verification associated with verilog, without the pain of writing in a language that has its roots firmly in the late 80s...]

The separation between data creation and data handling is really what has made this flexibility possible. The FPDIV pipeline, for example, takes three python class parameters: the number of combinatorial blocks that may acceptably be chained together, the FP width, and the "radix" (the number of bits of the result that each combinatorial block shall produce).

The FP width (16, 32, 64) is passed down one OO inheritance tree, whilst the other two parameters are passed down another. The result is that a *single* codebase *automatically* generates a pipeline of the required length, creating latches/registers at the appropriate points.
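The arithmetic behind "a pipeline of the required length" can be sketched as below. This is an illustration only: the names pipeline_layout and register_points are hypothetical, and the real FPDIV pipeline also has setup, normalisation and rounding stages that this ignores, so its actual stage counts will differ.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def pipeline_layout(result_bits, radix_bits, max_chain):
    """Return (number of combinatorial blocks, number of pipeline stages).

    result_bits: bits of result to produce (hypothetical parameter)
    radix_bits:  bits of result produced by each combinatorial block
    max_chain:   blocks allowed to be chained before a register is needed
    """
    n_blocks = ceil_div(result_bits, radix_bits)
    n_stages = ceil_div(n_blocks, max_chain)
    return n_blocks, n_stages


def register_points(n_blocks, max_chain):
    """1-based indices of the blocks after which a register is placed."""
    return [i for i in range(1, n_blocks + 1)
            if i % max_chain == 0 or i == n_blocks]
```

So producing 24 bits at 2 bits per block with up to 4 blocks chained gives 12 blocks in 3 stages, with registers after blocks 4, 8 and 12. Changing any one of the three numbers changes the pipeline length without touching the blocks themselves.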

All of this from general-purpose combinatorial building blocks that know nothing about how they are to be connected together, conforming to a simple stable OO API.

Also important to note is that the usual strict standards that go with libre software development intended for public usage have been applied (please do inform us of discrepancies!). That means strict pep8 compliance, docstrings on all modules and classes (sigh, not all of them), and, crucially, MASSIVE amounts of unit tests.

These unit tests serve two distinct key purposes: firstly to check the correctness of the code, and secondly to serve as examples.  Anything found in a test/ subdirectory is runnable and should not fail. If it does, please let us know immediately.

These are the tests that run tens of thousands of conformance tests for FPMUL:

http://git.libre-riscv.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/fpmul/test;h=0d9d34ef7887d88c0b8ada2ff54ac2a14dc89a86;hb=HEAD

Each IEEE754 subdirectory similarly contains such comprehensive unit tests covering all bitwidths, and NaN, Inf, Zero, subnormal, nearly-zero, and random numbers.
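For a flavour of what "corner case" operands look like at the bit level, here is a stdlib-only sketch for FP32. It is not taken from the test suite (which uses sfpy/softfloat-3 as the reference); the names CORNER_CASES_FP32, classify, bits_to_float and operands are made up for this example.

```python
import random
import struct

# FP32 bit patterns: 1 sign bit, 8 exponent bits, 23 mantissa bits.
CORNER_CASES_FP32 = {
    "zero":          0x00000000,
    "neg_zero":      0x80000000,
    "min_subnormal": 0x00000001,   # nearly zero
    "max_subnormal": 0x007FFFFF,
    "min_normal":    0x00800000,
    "inf":           0x7F800000,
    "neg_inf":       0xFF800000,
    "quiet_nan":     0x7FC00000,
}


def classify(bits):
    """Classify an FP32 bit pattern from its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "nan" if mantissa else "inf"
    if exponent == 0:
        return "subnormal" if mantissa else "zero"
    return "normal"


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a python float (via FP32)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def operands(n_random, seed=0):
    """Corner cases first, then n_random reproducible random bit patterns."""
    rng = random.Random(seed)
    return list(CORNER_CASES_FP32.values()) + [
        rng.getrandbits(32) for _ in range(n_random)
    ]
```

A conformance run would then feed every pair of these operands to both the hardware simulation and the reference, and compare the resulting bit patterns.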

These tests for example demonstrate how to use the pipeline API:

http://git.libre-riscv.org/?p=ieee754fpu.git;a=tree;f=src/nmutil/test;h=c8e2bdf7e4b2dda431a316cf3e7449da9a832057;hb=HEAD

In test_buf_pipe.py the most immediately obvious thing is the sheer number of data formats that the Stage API supports: classes, straight signals, Records and more.  This is down to the capabilities of python OO programming techniques.

One last extremely important strategic design decision actually required going to stackexchange to get some assistance:

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/dynamicpipe.py;h=f9c649c4ced0c132767d448becd9e7925873ae38;hb=HEAD#l23

Some developers may not want the "cancellation" capability which is needed for OoO architectures. They may instead want "stall" capability, for an in-order design. They may wish to make a Data Handling IO module that does the stalling on a "global" basis, rather than the rippling (travelling enable/data) model. More on this difference is here:

https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html

However, to expect developers to cut/paste fifty classes that they may not be familiar with is unreasonable. Therefore, a decision was taken to make it possible to *change* the mixin class that the entire pipeline inherits from, as a *single parameter*.  An example of where this has been deployed is in FPDIV:

http://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/fpdiv/pipeline.py;h=8b9e108639bdcfcde975ec2fb49777f7129c2c38;hb=HEAD#l179

Here, the default class is overridden and the "cancellation" class used instead. This turns the entire pipeline from an in-order design into a speculative cancellable OoO one, with a few lines of code.
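The general python technique of choosing a base class at construction time can be sketched as follows. This uses the three-argument form of the built-in type() and is only an illustration of the idea; the class and function names are hypothetical and it is not the mechanism used in dynamicpipe.py.

```python
class StallHandler:
    """Hypothetical data-handling mixin for an in-order (stalling) pipe."""
    handling = "stall"


class CancelHandler:
    """Hypothetical data-handling mixin with mask-based cancellation."""
    handling = "cancel"


class DivStageLogic:
    """Hypothetical data-processing side: unaware of how data is moved."""
    def process(self, i):
        return i + 1


def make_pipeline_class(handler=StallHandler):
    """Build a pipeline class whose data-handling base is a parameter.

    type(name, bases, namespace) creates a new class at run time, so the
    same processing code can be combined with any handler mixin.
    """
    name = "Pipeline_" + handler.__name__
    return type(name, (DivStageLogic, handler), {})


# Default: in-order behaviour.
InOrderPipe = make_pipeline_class()
# One parameter changed: the whole pipeline now inherits cancellation.
CancellablePipe = make_pipeline_class(CancelHandler)
```

Both classes share exactly the same process() code; only the inherited handling differs.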

Lastly, it is worth pointing out that the IEEE754FPU is about a couple of months away from full RV64G capabilities. It already has DIV, ADD, MUL, SQRT, RSQRT, FCLASS, FCVT from int to float and vice versa, at 16, 32 and 64 bit capability, all as "pipelined" designs.

FP128 could also trivially be added (one line of code, in one location, specifying the IEEE754 FP128 mantissa and exponent bitwidths), and SIMD capability (with predication) is also to be added shortly and, due to the modular design, is expected to be very simple and straightforward.

http://git.libre-riscv.org/?p=ieee754fpu.git;a=tree;f=src/ieee754;hb=HEAD
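To show why a new width can be a one-line change when everything else is derived from the format parameters, here is a small sketch. The table name FORMATS and the function fields are made up for this example and are not the library's own; the exponent and mantissa widths are the standard IEEE754 binary interchange values.

```python
# width -> (exponent bits, stored mantissa bits)
FORMATS = {
    16: (5, 10),
    32: (8, 23),
    64: (11, 52),
}

# The "one line": adding FP128 is just one more table entry.
FORMATS[128] = (15, 112)


def fields(width):
    """Derive everything else from the two bitwidths."""
    e_bits, m_bits = FORMATS[width]
    # One sign bit plus exponent plus mantissa must fill the whole word.
    assert 1 + e_bits + m_bits == width
    return {
        "exponent_bits": e_bits,
        "mantissa_bits": m_bits,
        "bias": (1 << (e_bits - 1)) - 1,
        "max_exponent_field": (1 << e_bits) - 1,   # all ones: Inf/NaN
    }
```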

The missing features, which are under development, are tininess, rounding modes, FP flags, FMAC (complicated), FSGN (trivial to do), and the addition of multi-stage multiply so that it is not necessary to have a full 53 bit multiply unit producing a 108 bit result in a single cycle.
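One textbook way of breaking up a wide multiply is sketched below: split each operand in two, do four narrower multiplies, then add the shifted partial products. This is only an illustration of the general idea, with a hypothetical function name and split point, and is not necessarily the decomposition that will be used in the library.

```python
def split_multiply(a, b, part_bits=27):
    """Multiply two unsigned integers using four narrower multiplies.

    Stage 1 (could be one pipeline stage): four partial products.
    Stage 2 (could be the next stage): shift and add them together.
    """
    mask = (1 << part_bits) - 1
    a_lo, a_hi = a & mask, a >> part_bits
    b_lo, b_hi = b & mask, b >> part_bits

    # Stage 1: each multiply is at most part_bits wide per operand
    # when a and b fit in 2 * part_bits bits.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Stage 2: recombine with the appropriate shifts.
    return lo_lo + ((lo_hi + hi_lo) << part_bits) + (hi_hi << (2 * part_bits))
```

With part_bits=27 a 53 bit mantissa splits into a 27 bit low half and a 26 bit high half, so no single multiplier needs to be wider than 27 bits.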

The full list of issues being tracked for the IEEE754 FPU is here:
http://bugs.libre-riscv.org/show_bug.cgi?id=48

If anyone would like to assist, you are most welcome: we do have funding available, thanks to the NLNet Foundation Grant under their Privacy and Enhanced Trust Programme. http://nlnet.nl/PET

Contributions and participation are covered by the Libre RISCV Charter
https://libre-riscv.org/charter/

L.