Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- Goat2 pipeline computer
- 2020-07-11 by goatflyer
- 2502/183/-2475
- Overview
- --------
- The Goat2 is a Harvard-architecture RISC machine, implemented as
- a pipeline of 2-stages, running with a clock period of 26 ticks.
- Due to the goal of keeping critical signals short, it might be better
- described as a DPU (Distributed Processing Unit) as opposed to CPU!
- Its internal registers are:
- - a program counter register "P" (yellow near ROM) 2524/197/-2510
- - an 8-bit accumulator "A" (red beside ALU) 2476/219/-2496
- - an 8-bit pointer register "X" (purple behind RAM) 2519/219/-2450
- - a "link" "L" to hold a return address (purple near Incrementor) 2507/195/-2493
- - 2 single-bit flags: "Z" zero (blue), and "C" carry (red) (under ALU) 2491/197/-2479
- Instruction words are 13 bits wide: 8 bits of data (white) and a 5 bit opcode (pink).
- Depending on the instruction, the 8 bits of data is either a RAM address,
- a literal constant, a branch target address in ROM, or otherwise ignored.
- Instruction decoding is scattered everywhere along the (pink) inverted
- opcode bus, with the decoding of interest close to where control lines are needed.
- Although the instruction set supports full 8-bit address for both ROM and RAM,
- I implemented only 56 instruction words of ROM and 16 bytes of RAM
- for reasons of speed and plot size.
- Therefore registers "P" and "L" implement only 6 bits each, and only the lower 4 bits
- of register "X" are connected to the RAM address mux.
- Clock (light blue) and reset (black) control are in front of the ROM and low.
- 2504/181/-2474
- Demonstration Program: Double Dabble
- ------------------------------------
- The demo program is an implementation of the well-known DoubleDabble algorithm
- which converts a binary byte into BCD.
- It illustrates many of the Goat2 features:
- calling functions
- different types of return
- conditional branching
- delay slot instructions
- indirection
- register forwarding
- To use the demo, there is a (magenta) column of levers 2470/216/-2447
- for the input byte inside one corner of the RAM (cell 0xf)
- and 3 decimal digit displays designed by Cadenjb attached
- outside the orange ALU output bus. 2449/217/-2502
- The levers presently have a value of 0xA5 = 165 decimal.
- Enter a binary number on the levers and start the computer (see below).
- Let it run for about 25 minutes to view the result.
- Reset/Restart Procedure
- -----------------------
- - Turn on RESET
- - Turn on CLK
- - Wait at least 2 full periods
- - Turn off RESET
- The program execution begins at ROM location 0,
- All registers and flags are considered garbage
- and should be initialized by software before use.
- Although the clock will automatically shut off
- due to a breakpoint I left at the end address 3,
- it is polite to turn off the CLK when you are
- finished using the computer.
- Explanation of Features
- -----------------------
- Indirection
- -----------
- The single pointer register "X" is used for all indirection.
- To illustrate its use, here is a possible "memcpy" implementation:
- memcpy:
- ldx srcptr ; load X with current source pointer
- ldax ; load source byte
- ldx dstptr ; now load X with current destination pointer
- stax ; store byte to destination
- inc srcptr
- inc dstptr ; bump both pointers
- dec loopcount ; reduce count
- jnz memcpy ; repeat for the length of the transfer
- ...
- Delay Slots
- -----------
- Pipeline means the next instruction is fetched while the current one is executing.
- When the current instruction is a branch or call or return,
- the Goat2 will still execute the fetched instruction,
- rather than discard it as some other CPUs do when a branch is taken.
- (Goat2 is similar to SPARC in this regard)
- Because no instruction is ever "wasted", the need for "branch prediction"
- is completely eliminated for 2-stage pipeline computers.
- A delay slot instruction is never skipped.
- If you really can't think of anything useful, then put a "nop".
- Call and Return
- ---------------
- The Goat2 has no hardware stack or stack pointer. Instead, the "call" instruction
- is actually a "jump-and-link" instruction: the return address is stored into
- the link register "L", and the program counter "P" receives the target address.
- (Similar to ARM Cortex behaviour).
- This approach allows single-cycle dispatch to called functions.
- Note: the return address is saved into "L" by the 2nd stage of the pipeline,
- so it (correctly) contains the address AFTER the delay slot instruction.
- The Goat2 offers three different "return" instructions:
- retl
- This loads "P" from "L" directly.
- This is useful for simple functions which do not call other functions,
- the return address simply stays in "L" and is returned to at the end.
- ret <addr>
- This loads "P" from the contents of memory address <addr>.
- The called function should store "L" at some address using the
- "stl <addr>" instruction somewhere near the beginning of the function.
- This is an easy way to write non-recursive functions.
- Note: ret <addr> can also be used to perform indirect jumps or computed goto.
- retx
- This loads "P" from the location pointed to by register "X"
- Recursive functions must employ some kind of software stack;
- one such implementation is as follows:
- myfunction:
- stl temp ; hold "L" temporarily
- ldx stackpointer
- lda temp ; get "L" back
- stax ; store return address to whereever "X" points
- inc stackpointer ; bump address (this example uses post-increment "push")
- ... ; do function work which may involve calls to myself
- dec stackpointer ; un-bump (here we must pre-decrement for "pop")
- ldx stackpointer
- retx ; jump to the return address saved where x points to
- ... ;(delay slot) something useful like loading a return value
- Register Forwarding
- -------------------
- As mentioned previously there are two stages in the pipeline which operate in parallel.
- The first stage fetches instructions and arguments for the ALU.
- The arguments are read midway through the stage after the instruction was fetched.
- The second stage latches the ALU arguments and opcode from the first stage,
- then performs ALU computation and prepares for the writeback of the ALU output.
- The writeback actually occurs at the end of the stage (at the following clock edge).
- Without register forwarding this produces surprising results;
- consider the naive, pipeline-unaware code sequence
- lda value1 ; load "A" from "value1"
- aca value2 ; add-with-carry "A" and contents of "value2", result to "A"
- sta result ; store sum here
- The problems with this code (without forwarding) are:
- - when the first stage is fetching the arguments for aca,
- register "A" has still not been updated with the contents of value1 from lda.
- - When reading "A" for sta, register "A" does not yet contain the result of the add.
- What forwarding does is to connect the first stage's argument reader
- EITHER to the present register output (if the register is not being updated),
- OR directly to the alu output bus (if the register is being updated).
- This connection is safe to make even if the ALU has not finished calculating output,
- since the final value will only be latched at the (next) clock edge.
- With forwarding, the naive example above works correctly as the comments indicate.
- The Goat2 supports forwarding for registers "A", "X", and the two flags "Z" and "C" only.
- But specifically, memory forwarding is NOT supported!
- Examples such as this will still give surprising results...
- sta temp ; write new value
- lda temp ; !!this value of temp is the old value from before the sta!!
- Instruction Set
- ---------------
- There are 3 logical (or, and, xor) and 1 arithmetic (add-with-carry) instructions
- which read and update the accumulator "A".
- The second argument is either an immediate constant or a RAM address.
- The 4 monadic operators (increment, decrement, rotate right, and ?reserved?)
- operate on a single byte in RAM only.
- All the above instructions affect the "Z" flag (Z=1 if ALU result is zero)
- Adding, shifting, or rotating also affect "C" flag (C=1 if carry out is set)
- All other instructions, including loading, storing,
- and jump/call/return instructions never affect the flags.
- Instruction decoding table
- --------------------------
- top2 00 01 10 11
- bot3
- 000 nop ori # inc m ora m
- 001 jl t ani # dec m ana m
- 010 jc t xri # ? m xra m
- 011 jnc t aci # ror m aca m
- 100 jz t mvia # sta m lda m
- 101 jnz t retl stl m ret m
- 110 jle t mvix # stx m ldx m
- 111 jgt t retx stax ldax
- t = a target (ROM) address
- # = a numeric constant
- m = a memory (RAM) address
- jle = jump if either carry or zero is set
- jgt = jump if both carry and zero are not set
- ? was originally rol, or rotate left
- but it seems such a waste because the sequence
- lda m, aca m, sta m produces the same output in memory and carry,
- whereas there is no substitute for ror (it can divide by 2).
- Also, without memory forwarding, a sequence of rols or rors
- on the same address does not do the expected useful thing.
- For now, this instruction spot is reserved for a better idea (see below)
- Single-Step & Breakpoints
- -------------------------
- Breakpoints are added by placing repeaters in the topmost level of ROM.
- There is no limit to the number of breakpoints set.
- When the selected ROM is read, the system clock will stop
- and the note block at 2515/190/-2472 will play.
- To continue normal execution at full speed again,
- press the Single-Step button once.
- It is not necessary to remove the breakpoint; this is
- useful for stopping at the same point in each iteration
- of a loop for example.
- If instead you wish to single-step the next few instructions,
- first stop the clock, then press Single-Step.
- Only 1 instruction will execute for each press of Single-Step.
- To resume full speed operation when not stopped at a breakpoint,
- just turn the clock back on.
- Program Listing: Double Dabble
- ------------------------------
- // RAM addresses used:
- #define input 0x0E // memory mapped levers
- #define units 0x0F // top nibble drives lamps
- #define tens 0x0D // top nibble drives lamps
- #define hundreds 0x0B // top nibble drives lamps
- #define binary 0x0C
- #define binret 0x0A
- #define loopcnt 0x09
- #define tmp2 0x02
- #define tmp1 0x01
- #define call jl // call is an alias for the jump-and-link instruction "jl"
- #define jmp jl // so is jmp
- // we use call to show we expect the called function to return
- // whereas we use jmp to show an unconditional jump with no return
- // but remember "L" is overwritten no matter which alias is used!
- // timing information cycle count: 1~ = one instruction
- // 535~ * 2.6 seconds/~ = 1391 seconds, or 23 min 11 seconds.
- program:
- 00 1c0e lda input ; read levers
- 01 0105 call bin2dec
- 02 140c sta binary ; (delay slot) store input argument
- done:
- 03 0103 jmp done ; loop forever
- 04 0000 nop ; (delay slot)
- ; this routine converts the value stored in "binary"
- ; into the top nibble of 3 bcd bytes named units, tens, hundreds.
- bin2dec: // 11~ + 8*65~ = 531~
- 05 150a stl binret ; save return address
- 06 0c00 mvia #0
- 07 140f sta units
- 08 140d sta tens
- 09 140b sta hundreds ; clear output variables
- 0a 0c08 mvia #8 ; we loop 8 times...
- 0b 1409 sta loopcnt ; because "binary" contains 8 bits
- 0c 0b00 aci #0 ; clear carry (add 0 to "A" with carry yields no carry)
- 0d 1c0c lda binary
- ddloop: // 11~ + 3*18~ = 65~ for one iteration of the loop
- 0e 1b0c aca binary ; shift left (add to itself), msbit goes to cy
- 0f 140c sta binary
- 10 011b call dd ; do double-dabble on this digit
- 11 0e0f mvix #units ;(delay slot) point to units variable
- 12 011b call dd
- 13 0e0d mvix #tens ;(delay slot) point to 10s
- 14 011b call dd
- 15 0e0b mvix #hundreds ;(delay slot) point to 100s
- ; maximum hundreds value of 2 never overflows so carry is always clear
- 16 1109 dec loopcnt
- 17 050e jnz ddloop
- 18 1c0c lda binary ; (delay slot) get ready to double in next iteration
- 19 1d0a ret binret ; return to caller
- 1a 0000 nop ;(delay slot) nothing useful to put here
- ; this is double dabble on a single bcd digit.
- ; the 4 bits of digit are in the top nibble of the byte
- ; therefore, to add 3 we really need to add 0x30
- dd: // 18~
- 1b 031e jnc savecy
- 1c 0c00 mvia #0 ; (delay slot) we will save this value if input carry was 0
- 1d 0b0f aci #0x0F ; otherwise we will save 0x10 (cy + 0x0F)
- ; note: this also clears carry
- savecy:
- 1e 1401 sta tmp1 ; save shifted input cy here (0x00 or 0x10)
- 1f 1f00 ldax ; get bcd digit (high nibble)
- 20 0b30 aci #0x30 ; trial add 3
- 21 1402 sta tmp2 ; save result of trial addition here
- 22 0980 ani #0x80 ;if bcd digit was originally 5 or more, then bit 7 will now be set
- 23 0526 jnz double ; if result after AND is non-zero, this means bit 7 is set
- 24 1c02 lda tmp2 ;(delay slot) re-get trial result on our way there
- 25 1f00 ldax ; re-get original bcd digit since it was less than 5
- double:
- 26 1402 sta tmp2 ; this is what we want to double
- 27 0b00 aci #0 ; but first we clear carry
- 28 1c01 lda tmp1 ; get (shifted) input carry
- 29 1b02 aca tmp2 ; *1
- 2a 1b02 aca tmp2 ; *2
- 2b 0d00 retl ; return to caller (output cy goes to the next higher digit)
- 2c 1700 stax ;(delay slot) store the updated bcd digit as we leave
- Program size: 45 instructions ROM.
- From 56 leaves 11 instruction words unused (empty).
- Data size: 9 bytes used.
- From 16 leaves 7 bytes unused.
Add Comment
Please, Sign In to add comment