[libre-riscv-dev] [Bug 168] create naturally aligned partition points

Thu Feb 13 17:55:18 GMT 2020

http://bugs.libre-riscv.org/show_bug.cgi?id=168

--- Comment #4 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Michael Nolan from comment #3)
> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > ok so the idea is, that the only options for partition sizes are:
> > 
> > * 64
> > * 32,32
> > * 16,16,16,16
> > * 8,8,8,8,8,8,8,8
> > 
> > is that the idea?
> 
> Yes
> 
> > 
> > if so, this restricts us to not being able to run 32-bit arithmetic in one
> > "lane" and 16-16 bit arithmetic in the other.
> 
> > 
> > this so that on vectorised instructions, if there are 32-bit instructions
> > that happen to hit the 32-LO register port, the 32-*HI* port can be used
> > for 32, 16-16, 8-8-8-8 *completely different* instructions that *HAPPEN*
> > to occur (or are deliberately arranged to occur) on the exact same cycle
> > and happen to be the exact same operation.
> > 
> > now, whether these conditions turn out to be reasonable or not is another
> > matter, hence why, yeah, it should be fine to consider this, and thus
> > perhaps greatly simplify the partitioning.
> > 
> > would we end up with a huge number of 32-bit-adds mixed in with 8-8-8-8
> > adds?  i don't honestly know.
> 
> This would make scheduling a bit more complicated,

i'm anticipating it to be quite straightforward, by way of pushing the
"predicate bits" directly into the regfile write-enable lines, and by
breaking down operations into 32-bit "chunks".

so, where a sequence of elements (say 16 bit) are to be ADDed, that will
be "converted" into 2x 16-16 SIMD operations: one will go to HI-32 regfile,
the other will go to LO-32 regfile.
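
as a rough illustration of the chunking described above, here is a minimal
python sketch.  the function name, the port labels and the choice that
even-numbered chunks land on LO-32 are all assumptions made for the example,
not the actual implementation.

```python
CHUNK_BITS = 32  # chunk width, taken from the text above


def split_into_chunks(num_elements, elwidth):
    """Group vector element indices into 32-bit SIMD chunks.

    Returns a list of (port, element_indices) tuples.  Assumption for this
    sketch: even-numbered chunks go to the LO-32 port, odd-numbered chunks
    to the HI-32 port.
    """
    if elwidth <= 0 or CHUNK_BITS % elwidth != 0:
        raise ValueError("element width must divide the 32-bit chunk")
    per_chunk = CHUNK_BITS // elwidth
    chunks = []
    for start in range(0, num_elements, per_chunk):
        # the last chunk may be only partly filled
        indices = list(range(start, min(start + per_chunk, num_elements)))
        port = "LO-32" if (start // per_chunk) % 2 == 0 else "HI-32"
        chunks.append((port, indices))
    return chunks


# four 16-bit elements become two 16-16 operations, one per regfile half
print(split_into_chunks(4, 16))
```

with four 16-bit elements this gives one 16-16 chunk for each half of the
64-bit register, which is the 2x 16-16 case described above.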

it's pretty straightforward.  it'll be slightly wasteful where the vector
length is not an exact multiple of 32-bits (3x8 for example); however, as
a first iteration i'm not that concerned.
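
to make the 3x8 case concrete, here is a hedged python sketch of how
predicate bits might map onto write-enable lines.  byte-granular write
enables, the function name and its parameters are assumptions for the
example only; the text above only says that predicate bits go directly
to the regfile write-enable lines.

```python
def byte_write_enables(predicate, num_elements, elwidth, chunk_bits=32):
    """Return one byte-level write-enable mask per 32-bit chunk.

    predicate: integer bitmask, bit N set means element N is active.
    Assumption: the regfile has one write-enable line per byte.
    """
    bytes_per_el = elwidth // 8
    per_chunk = chunk_bits // elwidth
    masks = []
    for start in range(0, num_elements, per_chunk):
        mask = 0
        for lane in range(per_chunk):
            el = start + lane
            # lanes past the end of the vector are never written
            if el < num_elements and (predicate >> el) & 1:
                lane_bits = (1 << bytes_per_el) - 1
                mask |= lane_bits << (lane * bytes_per_el)
        masks.append(mask)
    return masks


# 3x8: one 32-bit chunk with only three of its four byte lanes enabled
print([bin(m) for m in byte_write_enables(0b111, 3, 8)])
```

in the 3x8 case the single chunk has a write-enable of 0b0111: the fourth
byte lane does no useful work, which is the slight waste mentioned above.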

> but it might be beneficial to do this only for some modules. 

honestly it would complicate the decode phase, along these lines:
"if operation == NOT_CAPABLE_OF_DYNAMIC_PARTITIONING { do something else }"

whether that's ok compared to the complexity of the partitioned ALU ops?

> I don't think it'd make a huge
> difference for the adder or comparator to use an aligned partition, but it
> might simplify the shifter a good bit (because it eliminates a couple of the
> matrix entries).

it does... however i think the Switch statement really has to go.  if you
run "proc" "opt" then "show top" on a 64-bit shifter, it's awful.
the MUX chain is absolutely dreadful: each "switch" statement gets turned
into an "if x == 0b00001 if x == 0b00010 if x == 0b00011"... with the
results *chained* between each!

by comparison, for the gt_combiner, the mux-chains aren't 64-bit long,
they're only 7-long, because they're done on the *partition* gates,
not per-permutation-of-all-values-of-partitions.
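
to put numbers on the difference, here is a small python sketch that counts
partition permutations.  it assumes 8-bit granularity on a 64-bit value
(giving 7 partition points, matching the "7-long" above) and a mask encoding
where bit i set means "break after byte i".  that encoding is made up for
the example and is not necessarily the one the real code uses.

```python
WIDTH = 64
GRANULE = 8
NUM_POINTS = WIDTH // GRANULE - 1  # 7 partition points between 8 bytes


def lane_widths(mask):
    """Lane widths, LSB lane first, for a partition mask.

    Assumed encoding: bit i of mask set means a break after byte i.
    """
    widths = []
    run = GRANULE
    for i in range(NUM_POINTS):
        if (mask >> i) & 1:
            widths.append(run)
            run = GRANULE
        else:
            run += GRANULE
    widths.append(run)
    return widths


def is_naturally_aligned(mask):
    """True when every lane has the same width (64, 32, 16 or 8 bits)."""
    return len(set(lane_widths(mask))) == 1


all_masks = range(1 << NUM_POINTS)
aligned = [m for m in all_masks if is_naturally_aligned(m)]

print(len(all_masks))                    # every permutation of 7 gates
print([lane_widths(m) for m in aligned])  # the naturally aligned options
```

under these assumptions there are 128 possible partition masks, of which
only 4 are naturally aligned (64 / 32,32 / 16,16,16,16 / 8x8), whereas
logic built per partition gate only ever deals with the 7 gates themselves.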

you did manage to convert the "switch statement" from the original that
i did, of eq_combiner, and i am confident that the same thing can be done
here, based on the tables:

https://libre-riscv.org/3d_gpu/architecture/dynamic_simd/shift/

the only thing being, each table (each column output, o0...o7) is
computed independently, you can't share data *between* each column,
and that's fine.
