Hi Folks,
This is a revision of a message I sent to the RISC-V isa-dev mailing list.
I have made more progress by fine-tuning the 16-bit compressed opcode
space, and I have refined the rather slim documentation on this fledgling
virtual machine architecture. At this point it covers just a subset of the
user-mode ISA for the 16-bit compressed opcode space of an architectural
proof-of-concept: a virtual machine that could map reasonably well to
hardware. It has some unique characteristics: a RISC CPU architecture
with constant memory and CISC-like relocations, a vector-optimized
instruction packet with some interesting combinatorial decode
characteristics, and an interesting future linker. The general idea is a
CPU-architecture virtual machine with a few GPU-like features, the first
being constant memory and an instruction packet optimized for vectorized
decoding. Here is some related work:
- https://github.com/michaeljclark/zvec
The idea to add constant memory was to address the immediate-encoding
issue without introducing instruction parse complexity, and the idea to
bifurcate the instruction stream into instructions and constants came
from delta encoding in compression formats. Essentially, all constants
except for small immediate constants that fit into bonded register slots
are read from constant memory. There is a front-end constant stream with
a dedicated branch instruction, 'ibl', along with a refined
'jump-and-link' (call), 'jump-to-link' (ret), and a new 'pack-indirect'
instruction. These changes address branching instructions and constants
at the same time, as well as calling virtual functions with arbitrary
addresses in registers, though necessarily within +/-2GiB of (PC,IB) for
backward compatibility with a single link register.
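As a concrete illustration, here is a minimal Python sketch of the
bifurcated fetch model: a separate constant pointer (IB) walks constant
memory independently of PC, and an 'ibl'-style branch rebases it. The
trace format and mnemonics ('li', 'lk', 'ibl') are illustrative
assumptions on my part, not the actual encoding.

```python
def run(insns, consts):
    """Execute a toy trace against a toy machine state.
    ('li', rd, imm)  - small immediate bonded into the packet
    ('lk', rd)       - large constant read from constant memory at IB
    ('ibl', disp)    - rebase the constant pointer, like 'ibl' might"""
    regs = {}
    ib = 0  # constant-stream pointer, advanced independently of PC
    for op in insns:
        if op[0] == 'li':        # immediate fits in a bonded register slot
            regs[op[1]] = op[2]
        elif op[0] == 'lk':      # constant comes from the constant stream
            regs[op[1]] = consts[ib]
            ib += 1
        elif op[0] == 'ibl':     # branch the constant stream, not PC
            ib = op[1]
    return regs
```

The point of the sketch is that the instruction stream never carries the
large constants themselves, only the side effects on IB.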
The microarchitectural principle is to switch from a "control+data"
architecture to a "control+operand+data" architecture, where a third
immediate operand bus is fed at the front-end next to IFETCH instead of
at the back-end via LOAD-STORE. This is not uncommon in GPUs, and I
believe it may have first appeared in the Argonaut RISC Core (ARC) in
1996, 29 years ago, though there may be earlier references to this
technique in expired IBM or other patents.
control - operand - data
It would be possible to replicate the operand bus, or an immediate and
operand caching bus, across execution ports and put a constant fetcher
and a constant-fetch branch predictor a cycle later in the front-end
pipeline of a design using this architecture, so that in some
incarnations it may only result in increased pipe length and branch
stall latency for microarchitectures that bypass from constant memory to
this 'third' operand bus. A simple architecture, on the other hand,
could translate constant references to load instructions. Additionally,
constant memory can be treated like instruction text, unlike gp-relative
data, which can be read-write, so a translator can statically translate
instructions and constants, stitching them back into a single stream for
conventional RISC and CISC architectures without constant memory. It
probably needs a fence instruction for systems code that modifies the
main memory backing constant memory, similar to FENCE.I; perhaps a
FENCE.K or FENCE.C instruction:
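A minimal sketch of that static translation, using the same toy trace
format as above (the mnemonics are illustrative assumptions): constant
references are rewritten as loads against an emitted literal pool,
stitching the two streams back into one for a target without constant
memory.

```python
def stitch(insns, consts):
    """Rewrite ('lk', rd) constant references as ('load', rd, index)
    against a literal pool emitted alongside the code, returning a
    single conventional stream plus its pool."""
    out, pool, ib = [], [], 0
    for op in insns:
        if op[0] == 'lk':
            pool.append(consts[ib])               # clone the constant
            out.append(('load', op[1], len(pool) - 1))
            ib += 1
        else:
            out.append(op)                        # pass through unchanged
    return out, pool
```

Because constant memory is immutable like instruction text, this
rewrite can be done statically, ahead of time.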
I$ K$ D$ i-fetcher k-fetcher load-store
Because instructions and constants are bifurcated, the encoding is very
particular about not mixing wires between the opcode portion of the
instruction and the bondable register slots. This means fewer
instruction forms and less multiplex routing for the decoder, as there
are no instructions like LUI and AUIPC with larger immediate constants
that need to multiplex opcode bits. The very regular instruction packet
is the reason behind the term "super-regular RISC". I have also looked
at the instruction config bits in ISAs like Intel's EVEX encoding for
the AVX-512 SIMD ISA, and the number of configuration bits maps quite
well to the 32-bit and 64-bit packets. For this reason, I am also
working on a new x86 disassembler and assembler for use in a virtual
machine translator. I don't have any association with Intel, but I have
an AVX-512-capable machine at home. You can see some of the vector
compression principles in my earlier work, like the faster-than-DRAM
compression algorithm for AVX-512:
- https://github.com/michaeljclark/x86
This could easily map to a modified or extended RISC-V with an
alternative compressed encoding, but I don't think the RISC-V isa-dev
mailing list is the right place to discuss it. Is it me, or did Google
turn 'comp.arch' read-only last year? So, with some private email as a
prompt, I have created a new mailing list for folks to talk openly about
more general computer-architecture-related things:
- https://lists.anarch128.org/mailman/listinfo/comp-arch
For example: what would be required in a conventional CPU compiler and
linker for a general-purpose architecture like this, as opposed to
something specific to GPUs? Constant propagation needs to be a little
different, and code size may go up slightly due to cloning of constants,
because each function needs its own immediate constant blocks with small
displacements, as opposed to gp-relative addressing in RISC-V. The
proposal also differs slightly from constant islands because the two
streams are completely bifurcated: the linkage for constants is not
PC-relative; rather, ib-link uses displacements in constant blocks to
traverse the constants required for a specific function translation.
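A minimal sketch of that displacement traversal, assuming a toy block
layout I made up for illustration (each block holds a displacement to
the next block plus its constant values, with displacement 0 ending the
chain):

```python
def gather(constant_mem, start):
    """Follow displacement links between constant blocks and collect
    every constant reachable from 'start'. constant_mem maps block
    offsets to (disp_to_next, [values]) pairs; disp 0 terminates."""
    values, at = [], start
    while True:
        disp, vals = constant_mem[at]
        values.extend(vals)
        if disp == 0:
            break
        at += disp               # links are relative, not PC-relative
    return values
```

The traversal never consults PC, which is the point: a function's
constants are reachable purely through the constant-block linkage.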
It needs the 32-bit opcode space and a compiler to test the resulting
instruction density. The short constant references may make code size go
down, but constant cloning may make it go up, so I am unsure what the
net result will be. It still seems worthwhile to test this concept in a
CPU virtual machine architecture. Note also that the design sacrifices
one bit in the 16-bit compressed packet to reduce combinatorial
instruction-length decoding complexity, as well as vectorized software
decode complexity.
For software decoding on x86, an AVX-512 BITALG2 with VPEXT and VPDEP
would help, as opposed to the scalar versions that exist presently.
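To make the PEXT connection concrete, here is a scalar Python emulation
of the bit-gather that a hypothetical vector VPEXT would apply lane-wise
across a register full of packets; with fixed-position register slots,
one mask extracts the same field from every packet. The masks a real
decoder would use depend on the actual encoding.

```python
def pext(value, mask):
    """Scalar emulation of the x86 PEXT instruction: gather the bits of
    'value' selected by 'mask' into the low bits of the result."""
    out, bit = 0, 0
    while mask:
        low = mask & -mask       # isolate the lowest set mask bit
        if value & low:
            out |= 1 << bit      # deposit the selected bit contiguously
        bit += 1
        mask &= mask - 1         # clear that mask bit and continue
    return out
```

For example, a fixed 4-bit register slot at bits 4-7 of a 16-bit packet
would be extracted with mask 0x00f0 for every packet at once.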
Michael