[libre-riscv-dev] Migen Conversions and Update

Thu Jan 3 03:28:25 GMT 2019

On Thu, Jan 3, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Wed, Jan 2, 2019, 17:04 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > Ok Daniel, so would you like to take a look at nmigen? The code is quite
> > small and the output very tidy; you need the latest yosys from git master.
> > I really love the "with" statement, see the nmigen examples. Ah, we need to
> > check that it can generate a regfile: in migen there is a "memory.py"
> > example which does all the right things, including multiporting,
> > passthrough of writes to reads, and byte-level write-enable lines. I am
> > unsure if that has been ported over yet.
> >
> > Jacob, about virtual regs table. VRegs table does not have to be unary, it
> > can be binary. However as unary it is automatically the "enable" lines on
> > each individual reg file memory cell, which is really nice.
> >
> I think we're going to be hard pressed to get a sram without an attached
> decoder since all the HDLs I'm aware of only support a binary address for
> memory blocks.

 i wonder if the auto-generators would support a degenerate 1-bit (or
even 0-bit) address?  only one way to find out! :)

 well, if nothing else, we just either put in a binary address table
(VReg-to-RealReg), or an (otherwise unnecessary) unary-to-binary
encoder.
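
 as a rough sketch of what that (otherwise unnecessary) unary-to-binary
encoder has to do, here is a behavioural model in plain python, not
nmigen.  the function names and the 128-entry default width are
assumptions for illustration only, not anything that exists in the
codebase.

```python
def unary_to_binary(onehot: int, width: int = 128) -> int:
    """Convert a one-hot (unary) select word into a binary register index.

    Hypothetical helper: 'width' is the number of select lines (assumed 128).
    """
    if onehot <= 0 or onehot >= (1 << width):
        raise ValueError("select word is zero or wider than the table")
    # A one-hot word has exactly one bit set, so clearing the lowest
    # set bit must leave zero.
    if onehot & (onehot - 1):
        raise ValueError("more than one select line is set")
    return onehot.bit_length() - 1


def binary_to_unary(index: int, width: int = 128) -> int:
    """Inverse direction: the decoder that a binary-addressed SRAM hides inside."""
    if not 0 <= index < width:
        raise ValueError("register index out of range")
    return 1 << index
```

 the "only 1 of those may ever be set" property is what the second
check enforces; in hardware that is a property of the table rather than
something tested at run time.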

 i do want to make sure that for FPGAs we use available 2R1W on-board SRAM.

> > Hypothetically this would mean independent simultaneous writes or reads to
> > each reg entry in one clock, however in practice it would mean 128 64-bit
> > data buses and we ain't doing that :)
> >
> > If binary then that binary address has to be decoded inside the regfile
> > "box" itself. The unary matrix VReg to RealReg does away with that, so less
> > gates.
> >
> > Also size of table not so bad as it is split to 4 banks and each 128 regs
> > get their own muxes. Remember, muxes are on src but not on dest.
> >
> > Mitch Alsup did point out that line driving of 128 gates results in
> > significant latency, however our target is 800MHz, certainly not 2 to 3 GHz.
> > Also, as it is driving unary encoders (128 mutually exclusive latches, only
> > 1 of those may ever be set at any given time) it is really not as bad as it
> > initially looks.
> >
> I would really like to try to design for being able to run at >1GHz so that
> we can run at higher clock speeds for less power critical applications, for
> short performance bursts, and for non-gpu execution, especially if the gpu
> is going to also be the cpu.
> I think 1.5GHz or 2GHz is a good target for max clock speed when ignoring
> power.

 i've been working out how to make it a minimum of dual-issue (even
for 64-bit CPU workloads), with quad-issue for 32-bit vectorised, and
8-issue for 16 and 8-bit.  it's doable.

 so there's less pressure, in a first iteration, to have a super-fast
clock rate.  additionally, if power goes roughly as the square of the
clock rate, dual-issue 800MHz is *half* the power consumption of
single-issue 1600MHz for the same nominal throughput.
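
 the arithmetic behind that, as a small python sketch.  it assumes
(and these are assumptions, not measurements) that power per issue lane
goes as the square of the clock rate and that a second issue lane simply
doubles the power.  the function name and the 800MHz baseline are made
up for the example.

```python
def relative_power(issue_width: int, clock_mhz: float,
                   base_clock_mhz: float = 800.0) -> float:
    """Power relative to a single-issue core at base_clock_mhz.

    Assumed model: each issue lane costs (f / f_base)**2, lanes add linearly.
    """
    return issue_width * (clock_mhz / base_clock_mhz) ** 2


dual_800 = relative_power(2, 800.0)       # 2 lanes * 1**2 = 2.0
single_1600 = relative_power(1, 1600.0)   # 1 lane  * 2**2 = 4.0
ratio = dual_800 / single_1600            # 0.5, i.e. half the power

# Both configurations have the same nominal throughput:
# 2 * 800 = 1 * 1600 instructions per microsecond.
```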

 that said: yes it would be nice to reach the highest clock rate
possible, and we should not *deliberately* make decisions that
restrict that.  on the other hand if trying to reach higher clock
rates places a limit on the architectural decisions, i'm inclined
towards prioritising the architectural layout and backing off from the
secondary goal of a higher clock rate, at least for this first
revision.

 the only thing that stops us going usefully beyond dual-issue @
64-bit for CPU workloads is the 4-bank hi/lo odd/even 32-bit 2R1W
register file arrangement.  or, more to the point: we *can* go
quad-issue for 64-bit CPU workloads... it's just that the 2R1W 4-banks
will need 2 of the 32-bit banks for each odd-even-numbered register,
such that it becomes the bottleneck, and the scoreboard (and ALUs)
will experience a constant backlog.

 in a future version if the regfile is made out of 4 banks of 4R2W
SRAM, that backlog will disappear.  quad-issue @ 64-bit even on CPU
workloads would be perfectly achievable, and we'd also be around the
48 GFLOPs mark for 32-bit FMACs.
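
 a back-of-envelope port count shows where the issue-width numbers come
from.  this is a sketch under stated assumptions: 4 banks of 32-bit
words, every instruction reading 2 source registers and writing 1
destination, and accesses spread perfectly evenly across the banks (the
best case).  the function and constant names are invented for the
example, and 3-operand instructions such as FMAC are not modelled.

```python
BANKS = 4        # hi/lo x odd/even
BANK_BITS = 32   # width of one bank word


def max_issue(read_ports: int, write_ports: int, operand_bits: int,
              srcs: int = 2, dests: int = 1) -> int:
    """Best-case instructions per clock that the regfile ports can sustain.

    Assumes accesses are evenly distributed over all banks.
    """
    if operand_bits not in (32, 64):
        raise ValueError("only 32-bit and 64-bit operands are modelled")
    words = operand_bits // BANK_BITS       # 32-bit words per operand
    reads_needed = srcs * words             # bank reads per instruction
    writes_needed = dests * words           # bank writes per instruction
    by_reads = (BANKS * read_ports) // reads_needed
    by_writes = (BANKS * write_ports) // writes_needed
    return min(by_reads, by_writes)


issue_64_2r1w = max_issue(2, 1, 64)   # 2: dual-issue at 64-bit
issue_32_2r1w = max_issue(2, 1, 32)   # 4: quad-issue at 32-bit
issue_64_4r2w = max_issue(4, 2, 64)   # 4: quad-issue at 64-bit
```

 so under these assumptions the 2R1W banks cap 64-bit work at 2 per
clock, and anything issued beyond that just queues up, whereas 4R2W
banks lift the cap to 4.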

l.