[libre-riscv-dev] Fall 2022 Interfaces

Fri Jun 12 12:21:46 BST 2020

(offlist conversation moved onlist after removing names).

phone messing up formatting.  blech.  sorry.

the bandwidth requirements for the video framebuffer are so insane that i am
leaning towards suggesting a multi scanline hardware compression algorithm,
similar to the old fax machines: huffman encoding, blah blah.

multi scanline because this will pick up vertical compression opportunities.
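
a rough sketch of why looking across scanlines helps, assuming a toy scheme: XOR each scanline against the one above it, then run-length encode the result. this is an illustration only, not a proposed hardware design and not the actual fax coding; the function names and the test frame are made up.

```python
from itertools import groupby

def rle(values):
    """Run-length encode a sequence into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def encode_frame(frame):
    """Toy multi-scanline coder (hypothetical): XOR with previous line, then RLE.

    A line identical to the one above becomes all zeros, i.e. a single run.
    """
    prev = [0] * len(frame[0])
    out = []
    for line in frame:
        delta = [a ^ b for a, b in zip(line, prev)]
        out.append(rle(delta))
        prev = line
    return out

def decode_frame(encoded):
    """Inverse of encode_frame: expand the runs, then undo the XOR line by line."""
    prev = None
    frame = []
    for runs in encoded:
        delta = [v for v, n in runs for _ in range(n)]
        if prev is None:
            prev = [0] * len(delta)
        line = [a ^ b for a, b in zip(delta, prev)]
        frame.append(line)
        prev = line
    return frame

# made-up 4-line frame of vertical stripes: bad for per-line RLE, good vertically
frame = [[0, 255, 0, 255, 0, 255, 0, 255]] * 4
per_line_runs = sum(len(rle(line)) for line in frame)
multi_line_runs = sum(len(runs) for runs in encode_frame(frame))
print(per_line_runs, multi_line_runs)
```

on this frame the per-line coder emits 8 runs for every line, while the multi-line coder pays for the first line only and then emits one run per repeated line.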

On Friday, June 12, 2020, Yehowshua <yimmanuel3 at gatech.edu> wrote:

> two uses
>
> 1. Stand alone SBC
> 2. GPU for POWER systems, and any system really.

yep.  this you can see on the original pages and descriptions i created.

glad to see you also worked it out.

> We can then literally provide identical GPU drivers to the host machine,
> except now, commands are sent over PCIE to LibreSOC where they are
> then issued as instructions on the hybrid CPU/GPU.
>

indeed.  with associated insanity and complexity, for which an entire
special team will be needed.

> And remember, LibreSOC already has video output -
>

only if Richard Herveille's RGBTTL interface is included.

> so just send out the frame!
>

yep :)

fortunately PCIe just memory-maps the framebuffer (i think)

so for nonaccelerated video it at least becomes braindead simple.

what is nice however is that by running an entire OS on the GPU, and it
being a full OS, a protocol can be invented which transfers video.

OR

just run the XServer *on the GPU*.

run xhost +

and... um... that's it.

done.

no need to write any drivers.

just treat the graphics card as if it was a networked Xserver.

> OK now to talk performance. When doing the discrete GPU, we can lift power
> restrictions.
> So let's say we double vector lane widths, run the chip at 2GHz…
>
> We're talking 100 maybe 200GFLOPS.
>

yes.

it comes down to memory bandwidth at L0CacheBuffer.

4 cores at 800MHz, dual issue, is 1600 x 4 = 6400 M, i.e. 6.4 Giga vector
operations per second. and a FMAC already counts as 2 FLOPs.

each vector is 4 FP32, therefore we are at 6.4 x 4 x 2 = 51.2, call it 50 GFLOPs

double the clock rate and 100 GFLOPs is achievable.
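
the same arithmetic as a small sketch. it assumes the peak case where every issue slot on every core retires a 4-wide FP32 FMAC every cycle; the function and parameter names are made up for illustration.

```python
def peak_gflops(cores, clock_mhz, issue_width, lanes_per_vector, flops_per_fmac=2):
    """Peak GFLOPs, assuming every issue slot is a vector FMAC every cycle."""
    vector_ops_per_sec = cores * clock_mhz * 1e6 * issue_width
    return vector_ops_per_sec * lanes_per_vector * flops_per_fmac / 1e9

# figures from the post: 4 cores, 800MHz, dual issue, 4x FP32 per vector
base = peak_gflops(cores=4, clock_mhz=800, issue_width=2, lanes_per_vector=4)
# same configuration with the clock rate doubled
doubled = peak_gflops(cores=4, clock_mhz=1600, issue_width=2, lanes_per_vector=4)
print(base, doubled)
```

this comes to roughly 51 GFLOPs at 800MHz and roughly 102 GFLOPs with the clock doubled. it is a ceiling, not a sustained figure.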

however to sustain that we need *SIXTEEN* LDST FunctionUnits.

and each one will have 2x LDST Ports.

32 ports @ around 160 bits wide

that's an internal routing of FIVE THOUSAND wires between the Function
Units and the L0 Cache Buffer.
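
the wire count worked through, taking the "around 160 bits" per port at face value (the real width is not pinned down here, and the function name is made up):

```python
def ldst_wires(function_units, ports_per_unit, bits_per_port):
    """Return (total ports, total wires) between the LDST units and the L0 buffer."""
    ports = function_units * ports_per_unit
    return ports, ports * bits_per_port

# 16 LDST Function Units, 2 ports each, approximately 160 bits per port
ports, wires = ldst_wires(16, 2, 160)
print(ports, wires)  # 32 5120
```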

in addition we will need a minimum of something mad like 8 way striped L1
data caches, and 16x 64 bit Wishbone Buses down into the L2 cache.

it's achievable, it's just going to be a handful.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68