My Instruction Set



General instruction encoding sequence:

   LSB                                                                                        MSB
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
   Z  C  V  NZ NC NV Opcode---------------- Destination------ Source1---------- Source2----------

Flags:
	Z = last operation yielded a zero
	C = last operation had an unsigned carry
	V = last operation had a signed overflow
	I = interrupt flag
	P = privileged instructions allowed
	X = paging in use

The processor is little endian, so the first byte in memory holds the 6 mask bits in its low bits and the
low two bits of the opcode in its highest two bits.

The first three mask bits (Z, C, V) skip the instruction if the flag is on; the second three (NZ, NC, NV) skip it if the flag is off.
Many instructions have two forms, one that writes to the flags and one that doesn't. Some instructions
never write to the flags at all.
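
The field layout and the skip rule above can be sketched in Python. This is only an illustration: the helper
names (encode, decode, is_skipped) are hypothetical, and it assumes that any single matching mask bit is
enough to skip the instruction, which the text does not state explicitly.

```python
def encode(opcode, dest, src1, src2, skip_if_set=0, skip_if_clear=0):
    """Pack one 32-bit instruction word; bit 0 is the LSB (hypothetical helper)."""
    assert 0 <= opcode < 256
    assert all(0 <= r < 64 for r in (dest, src1, src2))
    assert 0 <= skip_if_set < 8 and 0 <= skip_if_clear < 8
    return (skip_if_set            # bits 0-2:   Z, C, V
            | skip_if_clear << 3   # bits 3-5:   NZ, NC, NV
            | opcode << 6          # bits 6-13:  opcode
            | dest << 14           # bits 14-19: destination
            | src1 << 20           # bits 20-25: source1
            | src2 << 26)          # bits 26-31: source2


def decode(word):
    """Split a 32-bit instruction word back into its fields."""
    return {
        "skip_if_set": word & 0x7,
        "skip_if_clear": (word >> 3) & 0x7,
        "opcode": (word >> 6) & 0xFF,
        "dest": (word >> 14) & 0x3F,
        "src1": (word >> 20) & 0x3F,
        "src2": (word >> 26) & 0x3F,
    }


def is_skipped(word, z, c, v):
    """Assumption: the instruction is skipped if ANY selected condition matches."""
    flags = int(z) | int(c) << 1 | int(v) << 2
    fields = decode(word)
    on_match = fields["skip_if_set"] & flags
    off_match = fields["skip_if_clear"] & ~flags & 0b111
    return bool(on_match or off_match)


word = encode(0x23, dest=1, src1=2, src2=3)       # addf r1, r2, #3
first_byte = word.to_bytes(4, "little")[0]         # mask bits plus low 2 opcode bits
```

Under this reading, a word with both the Z and NZ bits set is skipped regardless of the flags, which is
consistent with the long-form constant slots setting all six mask bits.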

Writing to IP without using a JMP instruction imposes a large stall penalty.

There are 64 addressable registers, of which 4 have special meanings to the processor.
Register 0 = all zeros.
Register 1 = IP
Register 2 = return address
Register 3 = multiply/divide high half

By convention, the following registers are reserved for their purpose
Register 63 = stack pointer
Register 62 = thread-local storage pointer
Register 61 = frame pointer (if frame pointers are used)

Special Opcodes

0x00  Fault         This instruction is always invalid, raises an error if not flag masked
0x01  Syscall       This instruction raises a system call

Bit twiddling instructions

0x02  and   r, r, r
0x03  andf  r, r, r
0x04  or    r, r, r
0x05  orf   r, r, r
0x06  xor   r, r, r
0x07  xorf  r, r, r
0x08  nor   r, r, r
0x09  norf  r, r, r

nop is made as and 0, 0, 0
not is made as nor dest, src, 0

Bit shifting

0x0A  shl   r, r, r
0x0B  shlf  r, r, r
0x0C  shl   r, r, i
0x0D  shlf  r, r, i
0x0E  shr   r, r, r
0x0F  shrf  r, r, r
0x10  shr   r, r, i
0x11  shrf  r, r, i
0x12  sar   r, r, r
0x13  sarf  r, r, r
0x14  sar   r, r, i
0x15  sarf  r, r, i
0x16  rol   r, r, r
0x17  ror   r, r, r
0x18  rol   r, r, i	There's no need for a ror r, r, i, as it can be encoded as rol r, r, 64 - i
0x1D  swab  r, r, i	Register value conversion
	i = 0: alias for mov
	i = 1: reverse bits
	i = 2: swap endian for 4x 2 byte numbers
	i = 4: swap endian for 2x 4 byte numbers
	i = 8: swap endian for 1x 8 byte number
	i = 17: sign extend byte to 8 bytes
	i = 19: sign extend two bytes to 8 bytes
	i = 21: sign extend four bytes to 8 bytes
	i = 33: zero extend byte to 8 bytes
	i = 34: zero extend two bytes to 8 bytes
	i = 36: zero extend four bytes to 8 bytes
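
As a rough reference model, the rotate identity and the swab modes listed above can be written in Python.
The function names are hypothetical, and the mode numbers are copied from the list as given; unlisted
values of i are treated as errors here, which is an assumption.

```python
MASK64 = (1 << 64) - 1


def rol(x, n):
    """Rotate a 64-bit value left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def ror(x, n):
    """Rotate right, expressed as rol by 64 - n as the table suggests."""
    return rol(x, (64 - n) % 64)


def _swap_lanes(x, width):
    """Byte-swap each lane of `width` bytes independently."""
    out = 0
    for lane in range(8 // width):
        shift = lane * width * 8
        chunk = (x >> shift) & ((1 << (width * 8)) - 1)
        out |= int.from_bytes(chunk.to_bytes(width, "little"), "big") << shift
    return out


def _extend(x, width, signed):
    """Sign or zero extend the low `width` bytes to 64 bits."""
    bits = width * 8
    low = x & ((1 << bits) - 1)
    if signed and low >> (bits - 1):
        low |= MASK64 ^ ((1 << bits) - 1)
    return low


def swab(x, i):
    """Model of swab r, r, i for the modes listed in the table."""
    x &= MASK64
    if i == 0:
        return x                                   # alias for mov
    if i == 1:
        return int(format(x, "064b")[::-1], 2)     # reverse bits
    if i in (2, 4, 8):
        return _swap_lanes(x, i)                   # endian swap per lane
    sign_modes = {17: 1, 19: 2, 21: 4}
    zero_modes = {33: 1, 34: 2, 36: 4}
    if i in sign_modes:
        return _extend(x, sign_modes[i], signed=True)
    if i in zero_modes:
        return _extend(x, zero_modes[i], signed=False)
    raise ValueError("mode not listed in the table")
```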

JMP instructions

0x1A  jmp near          The register range encodes a signed number between -2^20 and 2^20-4 where the bottom 2 bits must be zero
0x1B  call near         like jmp, but writes return address to r2
0x19  call indirect     Jumps to address in source1, writes return address to dest (should be r2; r0 results in jmp indirect)

Addition and subtraction

0x20  add   r, r, r
0x21  addf  r, r, r
0x22  add   r, r, i	Unsigned 0-63
0x23  addf  r, r, i
0x26  adcf  r, r, r     Also adds carry bit; this instruction is unusual in that there is no non-f version
0x27  sbbf  r, r, r     Also subtracts carry bit
0x28  sub   r, r, r
0x29  subf  r, r, r
0x2A  sub   r, r, i     Unsigned 0-63
0x2B  subf  r, r, i     Unsigned 0-63
0x2E  flags (dest, source1, i)
			i = 0: clear carry
			i = 1: set carry
			i = 2: flip carry
			i = 4: copy source1 to flags (does not touch flags other than Z C V)
			i = 5: copy flags to dest (ditto)
			i = 6: copy source1 to flags including privileged flags
			i = 7: copy flags to dest (including privileged flags)

Special long-form instructions for dealing with large constants. Add and sub take up 2 slots, where the first 6
bits of the second slot are set to 1 bits in encoding, while the 6 bits in source2 slot provide the actual value.
To reconstruct the constant, take the bottom 6 bits from the second 32 bit word, move them to the top 6 bits,
then take the bottom six bits of the first word and move them into the second word.

0x24  add   r, r, i     Unsigned 0-2^32-1
0x25  addf  r, r, i     Unsigned 0-2^32-1
0x2C  sub   r, r, i     Unsigned 0-2^32-1
0x2D  subf  r, r, i     Unsigned 0-2^32-1

loadc and jmp far occupy 3 slots each with the same basic idea; the second slot is repaired from source2 slot
while the third slot is repaired from the source1 slot
0x1C  jmp far           Unsigned 0-2^64-1
0x1D  jmp far           Unsigned 0-2^64-1
0x1E  loadc r, i        Unsigned 0-2^64-1

Memory access

0x30  load8  r, r, i    Loads 8 byte value into register; integer i is small offset (0-63)
0x31  load8h r, r, i    Continues load by handling alignment errors
0x32  load4  r, r, i    Loads 4 byte value into register
0x33  load4h r, r, i    Continues load by handling alignment errors
0x34  load2  r, r, i    Loads 2 byte value into register
0x35  load2h r, r, i    Continues load by handling alignment errors
0x36  load1  r, r, i    Loads single byte into register
0x37  watch  0, r, i    monitors aligned 16 byte region for writes; each core can monitor 1 address
                        do not watch the highest possible 16 bytes of virtual address space; this
			is where the bogus watch is (in case of handling an interrupt)
0x38  sto8   r, r, i    Writes 8 byte value from register to memory
0x39  sto8h  r, r, i    Continues store by handling alignment errors
0x3A  sto4   r, r, i    Writes 4 byte value from register to memory
0x3B  sto4h  r, r, i    Continues store by handling alignment errors
0x3C  sto2   r, r, i    Writes 2 byte value from register to memory
0x3D  sto2h  r, r, i    Continues store by handling alignment errors
0x3E  sto1   r, r, i    Stores one byte value from register to memory
0x3F  storef r, r, i    If watch is good, writes 8 byte value from register to memory, clears all flags

;The difference between the 0x3x and the 0x4x memory operations is the range of i. Structures of 64 bytes
;or less may be accessed at any byte offset with no additional instructions; larger structures can only be
;accessed in 8-byte aligned chunks without one additional instruction.

0x40  load8  r, r, i    Loads 8 byte value into register; integer i is small offset (0-63)*8 + 64
0x41  load8h r, r, i    Continues load by handling alignment errors
0x42  load4  r, r, i    Loads 4 byte value into register
0x43  load4h r, r, i    Continues load by handling alignment errors
0x44  load2  r, r, i    Loads 2 byte value into register
0x45  load2h r, r, i    Continues load by handling alignment errors
0x46  load1  r, r, i    Loads single byte into register
0x47  watch  0, r, i    monitors aligned 16 byte region for writes; each core can monitor 1 address
                        do not watch the highest possible 16 bytes of virtual address space; this
			is where the bogus watch is (in case of handling an interrupt)
0x48  sto8   r, r, i    Writes 8 byte value from register to memory
0x49  sto8h  r, r, i    Continues store by handling alignment errors
0x4A  sto4   r, r, i    Writes 4 byte value from register to memory
0x4B  sto4h  r, r, i    Continues store by handling alignment errors
0x4C  sto2   r, r, i    Writes 2 byte value from register to memory
0x4D  sto2h  r, r, i    Continues store by handling alignment errors
0x4E  sto1   r, r, i    Stores one byte value from register to memory
0x4F  storef r, r, i    If watch is good, writes 8 byte value from register to memory, clears all flags
                        If watch is bad or was not watching this address, sets all flags
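
To make the two offset ranges concrete, here is a small Python sketch of how an assembler might choose
between the two opcode groups for a given byte offset. The function name and return shape are
hypothetical; the ranges come from the table text (0-63 for the 0x3x group, (0-63)*8 + 64 for the 0x4x group).

```python
def pick_memory_form(offset):
    """Return (opcode_group, encoded_i) for a byte offset, or None if the
    offset needs an extra instruction to compute the address first."""
    if 0 <= offset <= 63:
        return ("0x3x", offset)                    # any byte offset in 0..63
    if 64 <= offset <= 63 * 8 + 64 and offset % 8 == 0:
        return ("0x4x", (offset - 64) // 8)        # 8-byte steps from 64 to 568
    return None
```

For example, offset 72 encodes as the 0x4x form with i = 1, while offset 65 fits neither form and would
need the address computed into a register first.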

watch and storef implement atomic primitives. An atomic increment (lock inc [ra]) is as follows:

	watch	ra
	load8	rs1, ra, #0
	addf	rs1, rs1, #1
	flags   rs2, r0, #5
	storef  rs1, ra, #0
      z jmp near @-6         ; goes back to watch above
        flags   r0, rs2, #4
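
The retry loop above can be modelled in Python to show why it is atomic. This is a toy simulation, not a
description of the hardware: the Memory and Core classes are hypothetical, and it assumes that a write
from another core to the watched 16-byte region invalidates the watch and that storef consumes the
watch whether or not it succeeds.

```python
REGION = 16  # watch granularity in bytes, from the watch description


class Memory:
    def __init__(self):
        self.cells = {}    # address -> 64-bit value
        self.cores = []

    def store(self, addr, value, writer=None):
        self.cells[addr] = value & (2**64 - 1)
        # Assumption: a write invalidates every other core's watch on that region
        for core in self.cores:
            if core is not writer and core.watch_region == addr // REGION:
                core.watch_region = None


class Core:
    def __init__(self, mem):
        self.mem = mem
        self.watch_region = None   # each core can monitor one address
        mem.cores.append(self)

    def watch(self, addr):
        self.watch_region = addr // REGION

    def load8(self, addr):
        return self.mem.cells.get(addr, 0)

    def storef(self, value, addr):
        """True models 'flags cleared' (success), False models 'flags set'."""
        ok = self.watch_region == addr // REGION
        self.watch_region = None   # assumption: the watch is consumed
        if ok:
            self.mem.store(addr, value, writer=self)
        return ok


def atomic_inc(core, addr, interfere=None):
    """watch / load8 / add / storef, retried until storef succeeds.
    `interfere` runs once between the load and the store to simulate a race."""
    attempts = 0
    while True:
        attempts += 1
        core.watch(addr)
        value = core.load8(addr)
        if interfere is not None:
            interfere()
            interfere = None
        if core.storef(value + 1, addr):
            return attempts
```

In this model, an increment that races with another core's increment takes two attempts and neither
update is lost.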

Multiplication and Division:

Multiply sets no flags. Division sets the Z flag when a divide overflow occurs
(division by zero or largest negative number).

0x50  mul8  r, r, r	Unsigned multiply 8 bytes; high output goes to r3
0x51  imul8 r, r, r	Signed multiply 8 bytes; high output goes to r3
0x52  mul4  r, r, r	Unsigned multiply 4 bytes
0x53  imul4 r, r, r	Signed multiply 4 bytes
0x54  mul2  r, r, r	Unsigned multiply 2 bytes
0x55  imul2 r, r, r	Signed multiply 2 bytes
0x56  mul1  r, r, r	Unsigned multiply 1 byte
0x57  imul1 r, r, r	Signed multiply 1 byte
0x58  div8  r, r, r	Unsigned divide 8 bytes; high input comes from r3, modulus goes to r3
0x59  idiv8 r, r, r	Signed divide 8 bytes, high input comes from r3, modulus goes to r3
0x5A  div4  r, r, r	Unsigned divide 4 bytes, modulus goes to r3
0x5B  idiv4 r, r, r	Signed divide 4 bytes, modulus goes to r3
0x5C  div2  r, r, r	Unsigned divide 2 bytes, modulus goes to r3
0x5D  idiv2 r, r, r	Signed divide 2 bytes, modulus goes to r3
0x5E  div1  r, r, r	Unsigned divide 1 byte, modulus goes to r3
0x5F  idiv1 r, r, r	Signed divide 1 byte, modulus goes to r3
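
The role of r3 in the 8-byte unsigned forms can be sketched in Python. The function names are
hypothetical, and the sketch assumes the dividend is the 128-bit value r3:source1 and the divisor is
source2; the result values on overflow are left unspecified, since the text only defines the Z flag.

```python
MASK64 = (1 << 64) - 1


def mul8(a, b):
    """Unsigned 64x64 multiply. Returns (dest, r3): low and high halves."""
    product = (a & MASK64) * (b & MASK64)
    return product & MASK64, product >> 64


def div8(r3_high, low, divisor):
    """Unsigned 128/64 divide. Returns (dest, r3, z_flag) where r3 receives
    the modulus and z_flag reports divide overflow."""
    if divisor == 0:
        return None, None, True                    # division by zero
    dividend = ((r3_high & MASK64) << 64) | (low & MASK64)
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK64:
        return None, None, True                    # quotient does not fit in 64 bits
    return quotient, remainder, False
```

Because mul8 leaves the high half in r3 and div8 reads it from there, a multiply followed by a divide by
the same value round-trips without overflow in this model.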

Floating point operations (double):  All operations touch flags as expected

0x60  fadd  r, r, r
0x61  fsub  r, r, r
0x62  fmul  r, r, r
0x63  fdiv  r, r, r
0x64  fmod  r, r, r (modulus of negative divided by positive is negative)
0x65  f_op  r, r, identifier
	i = 0: load constant
		r = r0: 0
		r = r1: 1
		r = r2: 2
		r = r3: -1
		r = r4: log2 e
		r = r5: log2 10
		r = r6: 1/log2 e
		r = r7: 1/log2 10
		r = r8: pi
		r = r9: e
		r = r10: golden ratio
		r = r11: sqrt(2)
		r = r12: the correct constant for implementing 4/5 rounding (hint: +0.5 is wrong)
		r = r30: +nan
		r = r31: inf
		r = r62: -nan
		r = r63: -inf
	i = 1: log2
	i = 2: 2^x
	i = 3: sin
	i = 4: cos
	i = 5: tan
	i = 6: invsin
	i = 7: invcos
	i = 8: invtan
	i = 9: sinh
	i = 10: cosh
	i = 11: tanh
	i = 12: 1/x
	i = 13: sqrt
	i = 14: floor
	i = 15: ceil
	i = 60: convert double to integer (floor rounding)
	i = 61: convert integer to double
	i = 62: convert double to single
	i = 63: convert single to double
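
The hint on constant r12 can be illustrated in Python, whose floats are IEEE doubles. The document does
not state the value of the constant; this sketch assumes it is the largest double below 0.5
(0.49999999999999994), and the function names are hypothetical. Only non-negative inputs are considered.

```python
import math

# Assumed value of the r12 constant: the largest double strictly below 0.5
ROUND_CONSTANT = 0.49999999999999994


def round_naive(x):
    """floor(x + 0.5): the addition itself can round up and give a wrong answer."""
    return math.floor(x + 0.5)


def round_45(x):
    """floor(x + constant): avoids the double rounding for non-negative x."""
    return math.floor(x + ROUND_CONSTANT)


just_below_half = 0.49999999999999994
big_odd = 2.0**52 + 1   # integers this large have no room for a .5 fraction
```

With +0.5, just_below_half rounds to 1 and big_odd rounds to the next even integer, because the sum is
not representable and is rounded before the floor is taken. With the smaller constant both come out as
expected, while 0.5 itself still rounds up to 1.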

Privileged instructions

0x70    setmsr  msr, r, #0
0x71	readmsr r, msr, #0
0x72	cli	0, 0, 0
0x73	sti	0, 0, 0
0x74	hlt	0, 0, 0
0x75	iret	0, 0, 0
0x76	read	i, r, #size
0x77	read	r, r, #size
0x78	write	r, i, #size
0x79	write	r, i, #size
0x7F	readmagic r

The following MSRs exist.
	0 = address of page table
	1 = address of ISR page table
	2 = address of ISR
	3 = interrupt reason
		0 = hardware interrupt line
		1 = page read fault
		2 = page write fault
		3 = page execute fault
		4 = bus fault
		5 = memory fault
		6 = memory checksum error
		7 = instruction fault
		8 = syscall
	4 = interrupt line or faulting instruction
	5 = fault address
	8 = saved address of page table
	9 = saved IP
	10 = saved flags register
	12 = ISR register #1 (recommendation: ISR stack pointer)
	13 = ISR register #2 (recommendation: user stack pointer)
	14 = ISR register #3 (recommendation: pointer to process structure)
	15 = ISR register #4 (spare)

Instruction 0x6F is special. When in user mode, it doesn't fault but returns the real value of MSR #0.
In kernel mode, it simply faults like executing an invalid instruction. This bizarreness is intentional
and is designed so that virtualization of the CPU can be detected but nobody will use readmagic where
they meant readmsr r, msr, #0.


Calling convention:
	Stack frame is 16 byte aligned
	Register 63 is the stack pointer
	Registers 2, 4-31 and 61-62 are preserved registers.
	Function return value goes in register 32
		if return doesn't fit in eight bytes, register 32 points to where to write the return data
	Register 3 may be clobbered at call time by trampolines inserted by the linker; since the compiler
		knows where this can occur this does not conflict with its use as mul/div high half
	Register 33 is the trampoline binding register (for pointer to function with closure)
	Register 34 contains the this pointer for target call site
	The first 16 arguments are written to registers 35-51
		if an argument is larger than 8 bytes it uses more than one register and therefore counts
		as more than one argument here
	Any remaining arguments are pushed to the stack, last argument first
	Integers shorter than 8 bytes are zero extended or sign extended as appropriate
	_chkstk takes its argument in r60, preserves all registers except r2 and r57-59 and returns nothing
	Typically, r2 will be saved in r56 when calling _chkstk

_chkstk:
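	add	r57, r0, #4096		; r57 = page size
	add	r58, r63, r0		; r58 = probe pointer, starts at the current stack pointer
	add	r59, r63, r0
	sub	r59, r59, r60		; r59 = stack pointer after the allocation
	sto8	r58, r58, #0		; touch the current page
	sub	r58, r58, r57		; step down one page
	subf	r0, r58, r59		; compare probe pointer against the target (must set flags)
  nz nc jmp	@-4			; loop back to the sto8 until the target is reached
	sto8	r59, r59, #0		; touch the final page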
	jmp	r2
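
The probing loop that _chkstk appears intended to perform can be modelled in Python as the list of
addresses it touches. This is a sketch under assumptions: the function name is hypothetical, and the
"nz nc" loop condition is read as "probe pointer still above the target", which depends on how the
carry flag is defined for subtraction.

```python
def chkstk_probes(sp, size, page=4096):
    """Return the addresses touched when growing the stack by `size` bytes
    from stack pointer `sp`, one probe per page plus a final probe."""
    target = sp - size          # r59 in the listing
    probe = sp                  # r58 in the listing
    touched = []
    while True:
        touched.append(probe)   # sto8 r58, r58, #0
        probe -= page           # step down one page
        if not probe > target:  # assumed meaning of the nz nc condition
            break
    touched.append(target)      # final store at the new stack pointer
    return touched
```

Consecutive probes are never more than one page apart, so each guard page is hit in order before the
stack pointer moves past it.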

Function types:
	A lightweight leaf function does not touch r63, retains its return address in r2 throughout, and catches nothing.
	We note that _chkstk is an example of a lightweight leaf function despite touching the stack area; but unless the
	platform defines a red zone we may not depend on anything being preserved there.  A trampoline looks like a
	lightweight leaf function that executes a tail call.

	An unregistered middleweight function always has a frame pointer, is laid out in linear order so that disassembly
	can always find the function epilog, and never throws or catches exceptions itself. On most platforms, middleweight
	functions can only exist if they are dynamically generated because code pages can be paged out to source media and
	that media can fail to be accessible. (If the page file fails, that's probably unrecoverable.) Unregistered functions
	may not call _chkstk

	A heavyweight function needs to be registered so that the frame unwinder can find its epilog code and catch blocks.
	Loaded modules have their functions registered at link time.

Function prolog always looks like this:

	3 dd 0x00	; Optional hotpatch slab
function:
	and	r0, r0, r0			; Optional hotpatch nop hitpoint
	add	r60, r0, #stackspace		; Allocate stack space
	add	r56, r0, r2			; If #stackspace > 4096 we must call _chkstk
	call	_chkstk
	add	r2, r0, r56
	sub	r63, r63, r60
	; If we are a varargs function, spill all ... registers here (16 - number of fixed arguments)
	; Store additional registers to be preserved below largest stack space required for call
	; we *must* use r3 as the scribble register for large offsets so the unwind code doesn't hate us
	add	r3, r63, #large-offset
	sto8	r4, r3, #0			; save preserved registers
	sto8	r2, r63, #some-offset		; store return address second to last
	sto8	r61, r63, #some-offset+1	; Store frame pointer last

	;; If the function has a large number of locals, additional frame pointers may be established
	add	r4, r61, #320

	;; Do function body

	; Epilog code; again r3 is the scribble register for large offsets
	lod8	r61, r63, #some-offset+1	; restore old frame pointer
						; the unwind disassembler locks on to this instruction to find the
						; start of the epilog; either variant of lod8 may be used with any
						; offset in range
	lod8	r2, r63, #some-offset		; restore old return address
	; Restore additional registers
	add	r3, r63, #large-offset
	lod8	r4, r3, #0
	; Return
	jmp	r2

If the function calls a function that takes an excessive number of arguments, the saved registers may be accessed
via r3 in a manner similar to how the stack is grown for allocating a buffer on the stack.

This processor is designed around a complete address space split: no kernel-mode memory is mapped in user mode, and no user-mode memory is mapped in kernel mode.
To ensure TLB consistency, the kernel must write to MSR #0 after editing page tables.

Here follows a vaguely plausible interrupt service routine:

_isr:	; Interrupt service routine
	; Please note the ISR is at the bottom of the stack and can't really follow the calling conventions
	; It would be registered in the unwinder so that the unwinder knows to give up
	setmsr	13, r63
	loadc	r63, _isrfault
	setmsr	1, r63
	setmsr	15, r2
	readmsr	r2, 14

	; Save registers
	lod8	r2, r2, #(ptable.registers - ptable)
	sto8	r3, r2, 3*8
	readmsr	r3, 15
	sto8	r3, r2, 2*8
	readmsr	r3, 10
	sto8	r3, r2, 0*8
	readmsr	r3, 9
	sto8	r3, r2, 1*8
	sto8	r4, r2, 4*8
	sto8	r5, r2, 5*8
	;...
	sto8	r63, r2, 63*8

	readmsr	r63, 12
	readmsr	r33, 3			; Arguments for specific handler
	readmsr r34, 4
	readmsr r35, 5
	readmsr	r36, 14
	subf	r0, r33, 6
 c	jmp	_isrbad			; Processor is too new
	lod8	r37, _isrmastertable
	shl	r33, r33, 3
	add	r33, r33, r37
	call	r2, r33			; Dispatch to handlers

_transition: ; interrupt return, also jumped to by syscall handler to change execution context
             ; so that system calls can block and the kernel can be preempted
	; Clear any userspace watch as we almost certainly messed it up
	sub	r2, r0, 16
	watch	r2, #0
	; Restore registers
	readmsr	r2, 14			; Might have changed
	lod8	r2, r2, #(ptable.registers - ptable)
	lod8	r63, r2, 63*8
	;...
	lod8	r4, r2, 4*8
	lod8	r3, r2, 1*8
	setmsr	9, r3
	lod8	r3, r2, 0
	setmsr	10, r3
	lod8	r3, r2, 2*8
	setmsr	15, r3
	lod8	r3, r2, 3*8
	readmsr	r2, 15
	iret