Guest User

The Goat2 Pipeline Computer

a guest
Jul 11th, 2020
29
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 13.95 KB | None | 0 0
  1.  
  2. Goat2 pipeline computer
  3. 2020-07-11 by goatflyer
  4.  
  5.  
  6. 2502/183/-2475
  7.  
  8. Overview
  9. --------
  10.  
  11. The Goat2 is a Harvard-architecture RISC machine, implemented as
  12. a pipeline of 2-stages, running with a clock period of 26 ticks.
  13. Due to the goal of keeping critical signals short, it might be better
  14. described as a DPU (Distributed Processing Unit) as opposed to CPU!
  15.  
  16. Its internal registers are:
  17. - a program counter register "P" (yellow near ROM) 2524/197/-2510
  18. - an 8-bit accumulator "A" (red beside ALU) 2476/219/-2496
  19. - an 8-bit pointer register "X" (purple behind RAM) 2519/219/-2450
  20. - a "link" "L" to hold a return address (purple near Incrementor) 2507/195/-2493
  21. - 2 single-bit flags: "Z" zero (blue), and "C" carry (red) (under ALU) 2491/197/-2479
  22.  
  23. Instruction words are 13 bits wide: 8 bits of data (white) and a 5 bit opcode (pink).
  24. Depending on the instruction, the 8 bits of data is either a RAM address,
  25. a literal constant, a branch target address in ROM, or otherwise ignored.
  26.  
  27. Instruction decoding is scattered everywhere along the (pink) inverted
  28. opcode bus, with the decoding of interest close to where control lines are needed.
  29.  
  30. Although the instruction set supports full 8-bit address for both ROM and RAM,
  31. I implemented only 56 instruction words of ROM and 16 bytes of RAM
  32. for reasons of speed and plot size.
  33.  
  34. Therefore registers "P" and "L" implement only 6 bits each, and only the lower 4 bits
  35. of register "X" are connected to the RAM address mux.
  36.  
  37. Clock (light blue) and reset (black) control are in front of the ROM and low.
  38. 2504/181/-2474
  39.  
  40. Demonstration Program: Double Dabble
  41. ------------------------------------
  42.  
  43. The demo program is an implementation of the well-known DoubleDabble algorithm
  44. which converts a binary byte into BCD.
  45.  
  46. It illustrates many of the Goat2 features:
  47. calling functions
  48. different types of return
  49. conditional branching
  50. delay slot instructions
  51. indirection
  52. register forwarding
  53.  
  54. To use the demo, there is a (magenta) column of levers 2470/216/-2447
  55. for the input byte inside one corner of the RAM (cell 0xf)
  56. and 3 decimal digit displays designed by Cadenjb attached
  57. outside the orange ALU output bus. 2449/217/-2502
  58.  
  59. The levers presently have a value of 0xA5 = 165 decimal.
  60. Enter a binary number on the levers and start the computer (see below).
  61. Let it run for about 25 minutes to view the result.
  62.  
  63.  
  64. Reset/Restart Procedure
  65. -----------------------
  66.  
  67. - Turn on RESET
  68. - Turn on CLK
  69. - Wait at least 2 full periods
  70. - Turn off RESET
  71.  
  72. The program execution begins at ROM location 0,
  73. All registers and flags are considered garbage
  74. and should be initialized by software before use.
  75.  
  76. Although the clock will automatically shut off
  77. due to a breakpoint I left at the end address 3,
  78. it is polite to turn off the CLK when you are
  79. finished using the computer.
  80.  
  81.  
  82. Explanation of Features
  83. -----------------------
  84.  
  85. Indirection
  86. -----------
  87.  
  88. The single pointer register "X" is used for all indirection.
  89. To illustrate its use, here is a possible "memcpy" implementation:
  90.  
  91. memcpy:
  92. ldx srcptr ; load X with current source pointer
  93. ldax ; load source byte
  94. ldx dstptr ; now load X with current destination pointer
  95. stax ; store byte to destination
  96. inc srcptr
  97. inc dstptr ; bump both pointers
  98. dec loopcount ; reduce count
  99. jnz memcpy ; repeat for the length of the transfer
  100. ...
  101.  
  102.  
  103. Delay Slots
  104. -----------
  105.  
  106. Pipeline means the next instruction is fetched while the current one is executing.
  107.  
  108. When the current instruction is a branch or call or return,
  109. the Goat2 will still execute the fetched instruction,
  110. rather than discard it as some other CPUs do when a branch is taken.
  111. (Goat2 is similar to SPARC in this regard)
  112.  
  113. Because no instruction is ever "wasted", the need for "branch prediction"
  114. is completely eliminated for 2-stage pipeline computers.
  115.  
  116. A delay slot instruction is never skipped.
  117. If you really can't think of anything useful, then put a "nop".
  118.  
  119.  
  120. Call and Return
  121. ---------------
  122.  
  123. The Goat2 has no hardware stack or stack pointer. Instead, the "call" instruction
  124. is actually a "jump-and-link" instruction: the return address is stored into
  125. the link register "L", and the program counter "P" receives the target address.
  126. (Similar to ARM Cortex behaviour).
  127. This approach allows single-cycle dispatch to called functions.
  128.  
  129. Note: the return address is saved into "L" by the 2nd stage of the pipeline,
  130. so it (correctly) contains the address AFTER the delay slot instruction.
  131.  
  132. The Goat2 offers three different "return" instructions:
  133.  
  134. retl
  135. This loads "P" from "L" directly.
  136. This is useful for simple functions which do not call other functions,
  137. the return address simply stays in "L" and is returned to at the end.
  138.  
  139. ret <addr>
  140. This loads "P" from the contents of memory address <addr>.
  141. The called function should store "L" at some address using the
  142. "stl <addr>" instruction somewhere near the beginning of the function.
  143. This is an easy way to write non-recursive functions.
  144. Note: ret <addr> can also be used to perform indirect jumps or computed goto.
  145.  
  146. retx
  147. This loads "P" from the location pointed to by register "X"
  148. Recursive functions must employ some kind of software stack;
  149. one such implementation is as follows:
  150.  
  151. myfunction:
  152. stl temp ; hold "L" temporarily
  153. ldx stackpointer
  154. lda temp ; get "L" back
  155. stax ; store return address to whereever "X" points
  156. inc stackpointer ; bump address (this example uses post-increment "push")
  157.  
  158. ... ; do function work which may involve calls to myself
  159.  
  160. dec stackpointer ; un-bump (here we must pre-decrement for "pop")
  161. ldx stackpointer
  162. retx ; jump to the return address saved where x points to
  163. ... ;(delay slot) something useful like loading a return value
  164.  
  165.  
  166.  
  167. Register Forwarding
  168. -------------------
  169.  
  170. As mentioned previously there are two stages in the pipeline which operate in parallel.
  171.  
  172. The first stage fetches instructions and arguments for the ALU.
  173. The arguments are read midway through the stage after the instruction was fetched.
  174.  
  175. The second stage latches the ALU arguments and opcode from the first stage,
  176. then performs ALU computation and prepares for the writeback of the ALU output.
  177.  
  178. The writeback actually occurs at the end of the stage (at the following clock edge).
  179.  
  180. Without register forwarding this produces surprising results;
  181. consider the naive, pipeline-unaware code sequence
  182.  
  183. lda value1 ; load "A" from "value1"
  184. aca value2 ; add-with-carry "A" and contents of "value2", result to "A"
  185. sta result ; store sum here
  186.  
  187. The problems with this code (without forwarding) are:
  188. - when the first stage is fetching the arguments for aca,
  189. register "A" has still not been updated with the contents of value1 from lda.
  190. - When reading "A" for sta, register "A" does not yet contain the result of the add.
  191.  
  192. What forwarding does is to connect the first stage's argument reader
  193. EITHER to the present register output (if the register is not being updated),
  194. OR directly to the alu output bus (if the register is being updated).
  195.  
  196. This connection is safe to make even if the ALU has not finished calculating output,
  197. since the final value will only be latched at the (next) clock edge.
  198.  
  199. With forwarding, the naive example above works correctly as the comments indicate.
  200.  
  201. The Goat2 supports forwarding for registers "A", "X", and the two flags "Z" and "C" only.
  202. But specifically, memory forwarding is NOT supported!
  203. Examples such as this will still give surprising results...
  204.  
  205. sta temp ; write new value
  206. lda temp ; !!this value of temp is the old value from before the sta!!
  207.  
  208.  
  209. Instruction Set
  210. ---------------
  211.  
  212. There are 3 logical (or, and, xor) and 1 arithmetic (add-with-carry) instructions
  213. which read and update the accumulator "A".
  214. The second argument is either an immediate constant or a RAM address.
  215.  
  216. The 4 monadic operators (increment, decrement, rotate right, and ?reserved?)
  217. operate on a single byte in RAM only.
  218.  
  219. All the above instructions affect the "Z" flag (Z=1 if ALU result is zero)
  220. Adding, shifting, or rotating also affect "C" flag (C=1 if carry out is set)
  221.  
  222. All other instructions, including loading, storing,
  223. and jump/call/return instructions never affect the flags.
  224.  
  225.  
  226. Instruction decoding table
  227. --------------------------
  228.  
  229. top2 00 01 10 11
  230. bot3
  231.  
  232. 000 nop ori # inc m ora m
  233. 001 jl t ani # dec m ana m
  234. 010 jc t xri # ? m xra m
  235. 011 jnc t aci # ror m aca m
  236.  
  237. 100 jz t mvia # sta m lda m
  238. 101 jnz t retl stl m ret m
  239. 110 jle t mvix # stx m ldx m
  240. 111 jgt t retx stax ldax
  241.  
  242.  
  243. t = a target (ROM) address
  244. # = a numeric constant
  245. m = a memory (RAM) address
  246.  
  247. jle = jump if either carry or zero is set
  248. jgt = jump if both carry and zero are not set
  249.  
  250. ? was originally rol, or rotate left
  251. but it seems such a waste because the sequence
  252. lda m, aca m, sta m produces the same output in memory and carry,
  253. whereas there is no substitute for ror (it can divide by 2).
  254.  
  255. Also, without memory forwarding, a sequence of rols or rors
  256. on the same address does not do the expected useful thing.
  257.  
  258. For now, this instruction spot is reserved for a better idea (see below)
  259.  
  260.  
  261. Single-Step & Breakpoints
  262. -------------------------
  263.  
  264. Breakpoints are added by placing repeaters in the topmost level of ROM.
  265. There is no limit to the number of breakpoints set.
  266. When the selected ROM is read, the system clock will stop
  267. and the note block at 2515/190/-2472 will play.
  268.  
  269. To continue normal execution at full speed again,
  270. press the Single-Step button once.
  271.  
  272. It is not necessary to remove the breakpoint; this is
  273. useful for stopping at the same point in each iteration
  274. of a loop for example.
  275.  
  276. If instead you wish to single-step the next few instructions,
  277. first stop the clock, then press Single-Step.
  278. Only 1 instruction will execute for each press of Single-Step.
  279.  
  280. To resume full speed operation when not stopped at a breakpoint,
  281. just turn the clock back on.
  282.  
  283.  
  284.  
  285.  
  286.  
  287.  
  288. Program Listing: Double Dabble
  289. ------------------------------
  290.  
  291.  
  292.  
  293. // RAM addresses used:
  294. #define input 0x0E // memory mapped levers
  295. #define units 0x0F // top nibble drives lamps
  296. #define tens 0x0D // top nibble drives lamps
  297. #define hundreds 0x0B // top nibble drives lamps
  298.  
  299. #define binary 0x0C
  300. #define binret 0x0A
  301. #define loopcnt 0x09
  302.  
  303. #define tmp2 0x02
  304. #define tmp1 0x01
  305.  
  306. #define call jl // call is an alias for the jump-and-link instruction "jl"
  307. #define jmp jl // so is jmp
  308.  
  309. // we use call to show we expect the called function to return
  310. // whereas we use jmp to show an unconditional jump with no return
  311. // but remember "L" is overwritten no matter which alias is used!
  312.  
  313.  
  314.  
  315. // timing information cycle count: 1~ = one instruction
  316. // 535~ * 2.6 seconds/~ = 1391 seconds, or 23 min 11 seconds.
  317. program:
  318. 00 1c0e lda input ; read levers
  319.  
  320. 01 0105 call bin2dec
  321. 02 140c sta binary ; (delay slot) store input argument
  322.  
  323. done:
  324. 03 0103 jmp done ; loop forever
  325. 04 0000 nop ; (delay slot)
  326.  
  327.  
  328. ; this routine converts the value stored in "binary"
  329. ; into the top nibble of 3 bcd bytes named units, tens, hundreds.
  330. bin2dec: // 11~ + 8*65~ = 531~
  331. 05 150a stl binret ; save return address
  332.  
  333. 06 0c00 mvia #0
  334. 07 140f sta units
  335. 08 140d sta tens
  336. 09 140b sta hundreds ; clear output variables
  337.  
  338. 0a 0c08 mvia #8 ; we loop 8 times...
  339. 0b 1409 sta loopcnt ; because "binary" contains 8 bits
  340.  
  341. 0c 0b00 aci #0 ; clear carry (add 0 to "A" with carry yields no carry)
  342. 0d 1c0c lda binary
  343.  
  344. ddloop: // 11~ + 3*18~ = 65~ for one iteration of the loop
  345. 0e 1b0c aca binary ; shift left (add to itself), msbit goes to cy
  346. 0f 140c sta binary
  347.  
  348. 10 011b call dd ; do double-dabble on this digit
  349. 11 0e0f mvix #units ;(delay slot) point to units variable
  350.  
  351. 12 011b call dd
  352. 13 0e0d mvix #tens ;(delay slot) point to 10s
  353.  
  354. 14 011b call dd
  355. 15 0e0b mvix #hundreds ;(delay slot) point to 100s
  356. ; maximum hundreds value of 2 never overflows so carry is always clear
  357.  
  358. 16 1109 dec loopcnt
  359. 17 050e jnz ddloop
  360. 18 1c0c lda binary ; (delay slot) get ready to double in next iteration
  361.  
  362. 19 1d0a ret binret ; return to caller
  363. 1a 0000 nop ;(delay slot) nothing useful to put here
  364.  
  365.  
  366.  
  367. ; this is double dabble on a single bcd digit.
  368. ; the 4 bits of digit are in the top nibble of the byte
  369. ; therefore, to add 3 we really need to add 0x30
  370.  
  371. dd: // 18~
  372. 1b 031e jnc savecy
  373. 1c 0c00 mvia #0 ; (delay slot) we will save this value if input carry was 0
  374. 1d 0b0f aci #0x0F ; otherwise we will save 0x10 (cy + 0x0F)
  375. ; note: this also clears carry
  376.  
  377. savecy:
  378. 1e 1401 sta tmp1 ; save shifted input cy here (0x00 or 0x10)
  379.  
  380. 1f 1f00 ldax ; get bcd digit (high nibble)
  381. 20 0b30 aci #0x30 ; trial add 3
  382. 21 1402 sta tmp2 ; save result of trial addition here
  383.  
  384. 22 0980 ani #0x80 ;if bcd digit was originally 5 or more, then bit 7 will now be set
  385. 23 0526 jnz double ; if result after AND is non-zero, this means bit 7 is set
  386. 24 1c02 lda tmp2 ;(delay slot) re-get trial result on our way there
  387.  
  388. 25 1f00 ldax ; re-get original bcd digit since it was less than 5
  389.  
  390. double:
  391. 26 1402 sta tmp2 ; this is what we want to double
  392. 27 0b00 aci #0 ; but first we clear carry
  393.  
  394. 28 1c01 lda tmp1 ; get (shifted) input carry
  395. 29 1b02 aca tmp2 ; *1
  396. 2a 1b02 aca tmp2 ; *2
  397.  
  398. 2b 0d00 retl ; return to caller (output cy goes to the next higher digit)
  399. 2c 1700 stax ;(delay slot) store the updated bcd digit as we leave
  400.  
  401.  
  402.  
  403. Program size: 45 instructions ROM.
  404. From 56 leaves 11 instruction words unused (empty).
  405.  
  406. Data size: 9 bytes used.
  407. From 16 leaves 7 bytes unused.
Add Comment
Please, Sign In to add comment