[Libre-soc-bugs] [Bug 713] PartitionedSignal enhancement to add partition-context-aware lengths
bugzilla-daemon at libre-soc.org
Fri Oct 8 00:06:40 BST 2021
https://bugs.libre-soc.org/show_bug.cgi?id=713
--- Comment #35 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #31)
>
> > This is exactly how all SIMT works (which is exactly what we're
> > trying to do with transparent vectorization). The types and sizes are the
> > type/size of a *single* lane, not all-lanes-mushed-together.
>
> there is a lot of misinformation about SIMT. SIMT is standard cores
> (which may or may not have Packed SIMD ALUs) that are "normal"
> cores in every respect *except* that they share one single PC,
> one single L1 I-cache, and one single fetch-and-decode unit, which
> *broadcasts* that one instruction synchronously to *all* cores.
I don't care how SIMT is traditionally implemented in a GPU; that's
irrelevant here and not what I intended. What I meant is that our HDL would be
written the way SIMT is used from a game programmer's perspective -- where, if
a game programmer writes:
float a, b;
int c, d;
...
a = a > b ? c : d;
the GPU actually runs (vectorized with 64 lanes):
f32x64 a, b;
i32x64 c, d, muxed;
boolx64 cond;
...
cond = a > b; // lane-wise compare
muxed = mux(cond, c, d);
a = convert<f32x64>(muxed);
and if a programmer were to write (generalizing a bit to support dynamic XLEN):
xlen_int_t a, b, c, d;
...
a = b + c * d;
what would actually run is:
vec_with_xlen_lanes_t a, b, c, d;
...
for(lane_t lane : currently_active_lanes()) { // C++11 for-each
    xlen_int_t p = c[lane] * d[lane];
    a[lane] = b[lane] + p;
}
> > the input argument for all
> > current Signals *is* the bit-width of the current lane (aka. elwidth or
> > XLEN) except that our code currently is specialized for the specific case of
> > elwidth=64.
>
> yes. in the discussions with Paul and Toshaan i seriously considered
> an XLEN parameter in the HDL which would propagate from runtime through
> a PSpec (see test_issuer.py for an example) and would allow us to test
> a scalar 32-bit Power ISA core, to see how many fewer gates are needed.
> and, just for laughs, to try an XLEN=16 core.
>
> but... time being what it is...
If we want any chance of matching the spec pseudo-code, we will need
xlen-ification of our HDL in some form, since currently it's hardwired for only
xlen=64.
Why not make it look more like the pseudo-code, with arithmetic on an XLEN
constant? I could write the class needed for XLEN in a day; it's not that
complicated:
import operator
from collections.abc import Mapping
# (ElWid -- the elwidth enum with I8/I16/I32/I64 members -- and nmigen's
#  Shape are assumed to come from the existing code)

class SimdMap:
    def __init__(self, values, *, convert=True):
        if convert:
            # convert values to a dict by letting map do all the hard work
            values = SimdMap.map(lambda v: v, values).values
        self.values = values

    @staticmethod
    def map(f, *args):
        """like the builtin map, but over SimdMap/Mapping/scalar arguments
        rather than iterables: apply a function `f` lane-wise to arguments
        which are each one of:
        * a Mapping with ElWid-typed keys
        * a SimdMap instance
        * or a scalar (broadcast to every lane)
        return a SimdMap of the results
        """
        retval = {}
        for i in ElWid:
            mapped_args = []
            for arg in args:
                if isinstance(arg, SimdMap):
                    arg = arg.values[i]
                elif isinstance(arg, Mapping):
                    arg = arg[i]
                mapped_args.append(arg)
            retval[i] = f(*mapped_args)
        return SimdMap(retval, convert=False)

    def __add__(self, other):
        return SimdMap.map(operator.add, self, other)
    def __radd__(self, other):
        return SimdMap.map(operator.add, other, self)
    def __sub__(self, other):
        return SimdMap.map(operator.sub, self, other)
    def __rsub__(self, other):
        return SimdMap.map(operator.sub, other, self)
    def __mul__(self, other):
        return SimdMap.map(operator.mul, self, other)
    def __rmul__(self, other):
        return SimdMap.map(operator.mul, other, self)
    def __floordiv__(self, other):
        return SimdMap.map(operator.floordiv, self, other)
    def __rfloordiv__(self, other):
        return SimdMap.map(operator.floordiv, other, self)
    ...
XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})
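# worked examples (illustrative only, not part of the proposal itself):
# scalars are broadcast to every lane and the operators apply lane-wise, so
#   4 + XLEN   ->  SimdMap({ElWid.I8: 12, ElWid.I16: 20, ElWid.I32: 36, ElWid.I64: 68})
#   XLEN // 4  ->  SimdMap({ElWid.I8: 2, ElWid.I16: 4, ElWid.I32: 8, ElWid.I64: 16})
# which is exactly the "arithmetic on an XLEN constant" used by the addg6s
# example further down.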
# layouts really should be a subclass or wrapper over SimdMap
# with Shapes as values, but lkcl insisted...
def layout(elwid, part_counts, lane_shapes):
    lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values
    signed = lane_shapes[ElWid.I64].signed
    # rest unmodified...
    assert all(i.signed == signed for i in lane_shapes.values())
    # widest ceil(lane_width / part_count) across all elwidths
    part_wid = -min(-lane_shapes[i].width // c for i, c in part_counts.items())
    ...
...
# now the following works, because PartitionedSignal uses SimdMap on
# inputs for shapes, slicing, etc.
# example definition for addg6s, basically directly
# translating pseudo-code to nmigen+simd.
# intentionally not using standard ALU interface, for ease of exposition:
class AddG6s(Elaboratable):
    def __init__(self):
        with simd_scope(self, IntElWid, make_elwid_attr=True):
            self.RA = PartitionedSignal(XLEN)
            self.RB = PartitionedSignal(XLEN)
            self.RT = PartitionedSignal(XLEN)

    def elaborate(self, platform):
        m = Module()
        with simd_scope(self, IntElWid, m=m):
            wide_RA = PartitionedSignal(unsigned(4 + XLEN))
            wide_RB = PartitionedSignal(unsigned(4 + XLEN))
            sum = PartitionedSignal(unsigned(4 + XLEN))
            carries = PartitionedSignal(unsigned(4 + XLEN))
            ones = PartitionedSignal(XLEN)
            nibbles_need_sixes = PartitionedSignal(XLEN)
            z4 = Const(0, 4)
            m.d.comb += [
                # widen by 4 bits so the top nibble's carry is visible
                wide_RA.eq(Cat(self.RA, z4)),
                wide_RB.eq(Cat(self.RB, z4)),
                sum.eq(wide_RA + wide_RB),
                # carries[i] is the carry *into* bit i of the addition
                carries.eq(sum ^ wide_RA ^ wide_RB),
                # a 1 in the lowest bit of every nibble
                ones.eq(Repl(Const(1, 4), XLEN // 4)),
                nibbles_need_sixes.eq(~carries[0:XLEN-1] & ones),
                # each set mask bit becomes a 6 in that nibble
                self.RT.eq(nibbles_need_sixes * 6),
            ]
        return m