The Interpreter

Interpreted languages are executed indirectly by the machine, there is a program which reads your source code, and then executes it, but your program is never translated into machine code. In contrast with compiled languages which take your source code and make machine code out of it, then the program is loaded into RAM and the CPU jumps to it and starts executing instruction by instruction.

There are Forth compilers, and even Forth computers where the machine code is basically Forth, but we will make a lightweight Forth interpreter, as close to the metal as possible.

As with everything we will start small and build up, we need to execute this program: 2 3 + 4 + . cr bye. You havent seen the word . so far in Forth it means pop a value from the stack and show it on screen. In our SUBLEQ computer we didnt have a screen, but you can imagine how we can create a circuit with a grid of LEDs to and maybe few 28at64c eeproms to control them via their I/O pins and a register that controls the eeproms's address lines, so we just set the register to a value which will then set some address to the eeproms and then they will enable or disable specific LEDs.

If we have 8x8 grid of leds, We could create the number 2 by enabling the right rows and disabling the right columns (to drive the LEDs to ground).

---**---
--*--*--
-*----*-
------*-
-----*--
----*---
--*-----
--*****-

The screen itself is beyond the scope of this book, though I encourage you to look up the various ways to show pixels, from huge led arrays to oled screens, eink, 7 segment displays, liquid crystal displays and so on. What is more important for me is how does the CPU "talk" to a complicated circuit like a screen. Or a keyboard or mouse for that matter. If you have enough wires between the two components so that you can fit all the information in one go, you just set them up, HIGH, LOW.. HIGH.. whatever the information is, pulse a clock so the other circuit knows to latch or use them however it sees fit, and it is done, but if we want to send 'hello world' to a screen, and each character is 8 bits, we will need 88 wires plus 1 for the clock, so 89 wires to send it in one go. Not that its not impossible to have that many wires, its just impractical.

We could build a circuit which expects the data to come piece by piece, so we send it 'h', 'e', 'l', 'l', 'o', one by one, each time the clock pulses, the screen will append the character to an internal buffer, maybe small RAM or few registers depending on the size, and then display it. We might have few bit patterns that tell the screen to clear the buffer, or maybe move the cursor so that the next character will be displayed on a specific position. This is a communication protocol. A protocol sounds a bit scary, but you know a lot of social protocols, for example when you meet somebody you say 'hello', this is expected of you, and you expect the other person to say 'hello' back. If they dont, the social protocol is not followed, and there are some consequences and the communication is broken (not always, but you see my point). A protocol is just a series of expectations. Some protocols have extreme consequences and very strict rules, for example, you must pay for what you buy from the store, or you will go to jail. The circuit designer wants to make it as easy as possible for us to use their circuit at its maximum potential, in the same time they have certain limitations, cost of manufacturing for example, why do you think the 74LS181's outputs are inverted, I doubt it is just to annoy us. So the circuit designer says 'ok, if you send this bit pattern, the circuit will do this, and this is what you should expect, this is how long the clock pulse should be..' and so on. If we follow the expected protocol, and the circuit is not damaged, we should be able to display the information we want. And we have never met the manufacturer, nor the designer. There could be hundreds, maybe thousands, of people working on the parts of that circuit, and we never met any of them, we just read few pages of text they wrote explaining the communication protocol and bam! we could use their circuit. The fact that this happens just blows my mind.

Lets look at an example of an imaginary protocol for our 8x8 LED display. Imagine we have 8 data wires (D0-D7), and 2 control wires: CLK (clock) and CMD (command mode). When CMD is HIGH, the data is interpreted as a command, when LOW it's interpreted as regular data.

Command format (CMD = HIGH):
0000 0001: Clear display
0000 0010: Home cursor
0000 0100: Move cursor right
0000 1000: Move cursor left

Data format (CMD = LOW):
Just send ASCII character codes. For example:
0110 1000: 'h'
0110 0101: 'e'
0110 1100: 'l'
0110 1100: 'l'
0110 1111: 'o'

To send "hello" and clear screen:
1. Set CMD HIGH, send 0000 0001 (clear)
2. Pulse CLK
3. Set CMD LOW
4. Send 'h' (0110 1000)
5. Pulse CLK
6. Send 'e' (0110 0101)
7. Pulse CLK
...and so on.

This is a very simple protocol and it will not handle the "real world" properly, for example what if there is noise in the wires? or what how do we know if the data is done sending? how do we know if the display is done showing the data?

Real World protocols like i2c, SPI, UART, USB, PCIe, and etc handle tremendous amount of edge cases and have various tradeoffs with speed and complexity. The important thing is that a protocol is is just an agreed upon sequence of actions or signals.

So how would our . pop from stack and display on screen word work? We will use a virtual computer. We built our SUBLEQ computer with flipflops and wires, we could simulate it - create a program that pretends to be those chips.

QEMU is a machine emulator - it pretends to be a computer. When you run QEMU, it creates a virtual CPU, virtual RAM, virtual devices, all living inside your real computer's memory. Just as we can write a program that simulates our SUBLEQ computer's ALU and RAM, QEMU simulates entire processors like RISC-V or x86.

When the virtual CPU executes an instruction like addi x5, x0, 42, QEMU calculates what would happen if a real CPU executed that instruction - which registers would change, how the flags would be set, what memory would be accessed. The virtual CPU doesn't know it's not real. Our programs running inside QEMU don't know they're running in a simulation.

The magic of QEMU is that it can also simulate devices like screens, keyboards, hard drives and so on..entire computers. We can crash it as many times we want, or corrupt it, and most importantly, we can pause it and debug it, step through each instruction and see what is the state of the registers and the memory. When you are programming you must execute the instructions in your head, and think what would the computer would do, when you make a mistake what you have in your head is not what the computer state is, and you must look at the computer's memory and try to understand where things went wrong. Why it is the way it is? Being able to debug your program step by step is very very powerful. Of course you can do that with any program on your computer, there is no need to use QEMU for that, you can just break into a program with gdb (a debugger program) and execute it instruction by instruction. Our goal is however to make an operating system for an actual physical computer (either esp32c3 or Pico 2350), and starting with a virtual computer will make the development much much.. much easier.

There are few things you need to install, QEMU, RISC-V GNU Compiler Toolchain, GDB: The GNU Project Debugger, an editor like Visual Studio Code, or Emacs, and GNU Make. Depending on your operating system they will require different steps, I suggest you ask ChatGPT or Sonnet how to do it.

  • QEMU https://www.qemu.org/
  • RISC-V GNU Compiler Toolchain https://github.com/riscv-collab/riscv-gnu-toolchain
  • GNU Make https://www.gnu.org/software/make/
  • GDB https://www.sourceware.org/gdb/
  • Emacs https://www.gnu.org/software/emacs/
  • Visual Studio Code https://code.visualstudio.com/

Make sure you enable support for RISC-V 32bit.


Create a directory where we will put the files, I will call mine part1, and we will start by making a simple RISCV assembly program that will print 'hello world' on the virtual screen of QEMU, just so that we make sure all the tooling is working. I use linux and macos, but if you are using windows, you can ask Sonnet to translate the commands and make it work.

Create a file called boot.s and type this code in it, as you are typing it, try to think about it, and its totally OK to be confused. This was quite common in the 80s btw, havint pages and pages of code in a magazine and you have to type it in. I was too young at the time to experience it, I got my first computer in 1997 or so, but I just love the paper code medium.

You can also take a picture with your phone and copy the text from there.

  .section .text
  .globl _start

_start:
  li a0, 'h'
  call putc
  li a0, 101
  call putc
  li a0, 'l'
  call putc
  li a0, 108
  call putc
  li a0, 'o'
  call putc
  li a0, 32
  call putc
  li a0, 'w'
  call putc
  li a0, 111
  call putc
  li a0, 'r'
  call putc
  li a0, 108
  call putc
  li a0, 'd'
  call putc
  li a0, 10
  call putc

wait_for_q:
  call getch
  li t1, 'q'
  beq t1, a0, exit_qemu
  call putc
  j wait_for_q

unreachable:
  j unreachable

  ####
  # Subroutine: getch
  # Reads a character from UART
  # Returns: a0 - the character read
getch:
  li t0, 0x10000000      # t0 = 0x10000000, this is UART's base address  Load UART base address into t0
1:
  lbu t1, 5(t0)          # t1 = mem[t0 + 5], base + 5 is UART status register
  andi t1, t1, 0x01      # t1 = t1 & 0x01, use only the last bit
  beqz t1, 1b            # If no data ready, keep polling until the bit is 1
  lbu a0, 0(t0)          # a0 = mem[t0], base + 0 is the dasta register
  ret

  ####
  # Subroutine: putc
  # Writes a character to UART
  # Parameters: a0 - the character to write
putc:
  li t0, 0x10000000      # t0 = 0x10000000, again t0 = UART base address
1:
  lbu t1, 5(t0)          # t1 = mem[t0 + 5], load 1 byte from the UART status register
  andi t1, t1, 0x20      # t1 = t1 & 0x20, 0x20 is 00100000, check if this bit is 1
  beqz t1, 1b            # if not, we are not ready to transmit, try again
  sb a0, 0(t0)           # mem[t0] = a0, store a0 character to UART data register
  ret

exit_qemu:
  li t0, 0x100000        # t0 = 0x100000, QEMU exit device address
  li t1, 0x5555          # t1 = 0x5555, success exit code
  sw t1, 0(t0)           # mem[t0] = t1, store exit code to QEMU exit device
  j .                    # infinite loop until QEMU exits

  .end

Now we need another file linker.ld:

OUTPUT_ARCH( "riscv" )
ENTRY( _start )

MEMORY
{
  RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 128M
}

SECTIONS
{
  .text :
  {
    *(.text.init)
    *(.text)
  } > RAM

  .rodata :
  {
    *(.rodata)
  } > RAM

  .data :
  {
    *(.data)
  } > RAM

  .bss :
  {
    *(.bss)
    . = ALIGN(8);
  } > RAM

  _bss_end = .;

  _stack_top = ORIGIN(RAM) + LENGTH(RAM);
  _ram_end = ORIGIN(RAM) + LENGTH(RAM);

  _end = .;
}

If you execute the following commands now:

riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments \
  -m elf32lriscv boot.o -o boot.elf

This will create a file boot.elf which is our machine code executable program, we could ask QEMU to run it:

qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf

And you should see 'hello world' printed, if you press any character you will see it echoed at the terminal, if you press 'q' then qemu will exit. riscv64-unknown-elf-as is the assembler, it takes the source code and creates machine code object file, it contains just relative addresses and might even reference unresolved symbols (e.g. we might want to call a subroutine from another file, which is not even compiled yet, and even if it is we dont know where it will sit in RAM, so how can we jump to it?). The linker however has all the information, in our linker file we say RAM starts at address 0x80000000, then various sections are in this order, first .text section then .rodata then .data then .bss, and then we have few symbols where does bss end, where does the ram end, where would we like to put the top of our stack, in this case the stack "grows" downwards, the program is in the start of the RAM, and the stack will start from the end of the ram and grow down.

The sections:

  • .text: the program itself, the machine code
  • .rodata: read only data, like constants we would like to have
  • .data: initialized variables, can be modified during execution,
  • .bss: uninitialized variables, but this does not actually take size in the executable, we can just say we want 10kb array of bytes, and the executable wont increase with 10kb, as opposed of the other sections.

Don't worry about those for now, we will get back to them later. In the assembler we use .section and .end to define a section and we can specify where would it live in RAM. ENTRY( _start ) this specifies where should the computer jump to when the program is loaded.

When the linker creates the .elf file, inside of it it will put all this information, plus the machine code itself. ELF means Executable and Linkablke Format, it is very common format for executables. We have not spoken about files yet, but a file is just an array of bytes on non volatile storage, it has a name or some way for you to find it. How you would interptet the bytes inside of it is up to you. Whatever program reads .elf file it will have expectation that the ELF format is followed.

You can examine .elf files with the readelf program. The options -h is to view the header, a header is a piece of structured information in the beginning of a byte sequence, -S is to view the section headers, -l to view the program headers.

ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           RISC-V
  Version:                           0x1
  Entry point address:               0x80000000
  Start of program headers:          52 (bytes into file)
  Start of section headers:          5476 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         2
  Size of section headers:           40 (bytes)
  Number of section headers:         12
  Section header string table index: 11

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        80000000 001000 0000c0 00  AX  0   0  4
  [ 2] .bss              NOBITS          800000c0 0010c0 000000 00  WA  0   0  1
  [ 3] .riscv.attributes RISCV_ATTRIBUTE 00000000 0010c0 00004c 00      0   0  1
  [ 4] .debug_line       PROGBITS        00000000 00110c 000156 00      0   0  1
  [ 5] .debug_info       PROGBITS        00000000 001262 000026 00      0   0  1
  [ 6] .debug_abbrev     PROGBITS        00000000 001288 000014 00      0   0  1
  [ 7] .debug_aranges    PROGBITS        00000000 0012a0 000020 00      0   0  8
  [ 8] .debug_str        PROGBITS        00000000 0012c0 000043 01  MS  0   0  1
  [ 9] .symtab           SYMTAB          00000000 001304 000150 10     10  16  4
  [10] .strtab           STRTAB          00000000 001454 000095 00      0   0  1
  [11] .shstrtab         STRTAB          00000000 0014e9 000078 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  RISCV_ATTRIBUT 0x0010c0 0x00000000 0x00000000 0x0004c 0x00000 R   0x1
  LOAD           0x001000 0x80000000 0x80000000 0x000c0 0x000c0 RWE 0x1000

 Section to Segment mapping:
  Segment Sections...
   00     .riscv.attributes
   01     .text

You see the ELF file starts with 7f 45 4c 46, in decimal that is 127 69 76 70 which is 7f then the ascii for E, L and F. Every ELF file starts with this 4 bytes, but not every file that starts with them is an ELF file.

The linker will create the right ELF file from our machine code. I wont go deeper into it, but there are amazing guides online to explain the ELF format, and since it is very well defined and documented format the language models know a lot about it, so asking if you are confused just ask ChatGPT or Sonnet.

  ...
  li a0, 'h'
  call putc
  li a0, 101
  call putc
  ...

Back to our assembly, you see I used letters and numbers to put a value in a0, they are of course the same thing. We have decided which letter is which number and defined it in a standard called ASCII, which stands for 'American Standard Code for Information Interchange'. It was defined in 1972. The letter 'A' is 65, 'B' is 66.. and so on. This is the whole table:

.-------------------------- ASCII Table ------------------------------------.
|                                                                           |
|  Dec  Hex  Char     Dec  Hex  Char     Dec  Hex  Char     Dec  Hex  Char  |
|  ----------------   ----------------   ----------------   --------------  |
|   0   00   NUL      32   20  space     64   40    @       96   60    `    |
|   1   01   SOH      33   21    !       65   41    A       97   61    a    |
|   2   02   STX      34   22    "       66   42    B       98   62    b    |
|   3   03   ETX      35   23    #       67   43    C       99   63    c    |
|   4   04   EOT      36   24    $       68   44    D      100   64    d    |
|   5   05   ENQ      37   25    %       69   45    E      101   65    e    |
|   6   06   ACK      38   26    &       70   46    F      102   66    f    |
|   7   07   BEL      39   27    '       71   47    G      103   67    g    |
|   8   08   BS       40   28    (       72   48    H      104   68    h    |
|   9   09   TAB      41   29    )       73   49    I      105   69    i    |
|  10   0A   LF       42   2A    *       74   4A    J      106   6A    j    |
|  11   0B   VT       43   2B    +       75   4B    K      107   6B    k    |
|  12   0C   FF       44   2C    ,       76   4C    L      108   6C    l    |
|  13   0D   CR       45   2D    -       77   4D    M      109   6D    m    |
|  14   0E   SO       46   2E    .       78   4E    N      110   6E    n    |
|  15   0F   SI       47   2F    /       79   4F    O      111   6F    o    |
|  16   10   DLE      48   30    0       80   50    P      112   70    p    |
|  17   11   DC1      49   31    1       81   51    Q      113   71    q    |
|  18   12   DC2      50   32    2       82   52    R      114   72    r    |
|  19   13   DC3      51   33    3       83   53    S      115   73    s    |
|  20   14   DC4      52   34    4       84   54    T      116   74    t    |
|  21   15   NAK      53   35    5       85   55    U      117   75    u    |
|  22   16   SYN      54   36    6       86   56    V      118   76    v    |
|  23   17   ETB      55   37    7       87   57    W      119   77    w    |
|  24   18   CAN      56   38    8       88   58    X      120   78    x    |
|  25   19   EM       57   39    9       89   59    Y      121   79    y    |
|  26   1A   SUB      58   3A    :       90   5A    Z      122   7A    z    |
|  27   1B   ESC      59   3B    ;       91   5B    [      123   7B    {    |
|  28   1C   FS       60   3C    <       92   5C    \      124   7C    |    |
|  29   1D   GS       61   3D    =       93   5D    ]      125   7D    }    |
|  30   1E   RS       62   3E    >       94   5E    ^      126   7E    ~    |
|  31   1F   US       63   3F    ?       95   5F    _      127   7F   DEL   |
|                                                                           |
|------------------------ Control Characters -------------------------------|
|                                                                           |
|  NUL  Null             SO   Shift Out        FS   File Separator          |
|  SOH  Start of Header  SI   Shift In         GS   Group Separator         |
|  STX  Start of Text    DLE  Data Link Escape RS   Record Separator        |
|  ETX  End of Text      DC1  Device Control 1 US   Unit Separator          |
|  EOT  End of Trans.    DC2  Device Control 2 SP   Space                   |
|  ENQ  Enquiry          DC3  Device Control 3 DEL  Delete                  |
|  ACK  Acknowledge      DC4  Device Control 4                              |
|  BEL  Bell             NAK  Negative Ack.                                 |
|  BS   Backspace        SYN  Synchronous Idle                              |
|  TAB  Horizontal Tab   ETB  End of Trans. Blk                             |
|  LF   Line Feed        CAN  Cancel                                        |
|  VT   Vertical Tab     EM   End of Medium                                 |
|  FF   Form Feed        SUB  Substitute                                    |
|  CR   Carriage Return  ESC  Escape                                        |
|                                                                           |
'---------------------------------------------------------------------------'

When we use 'x' with single quotes it literally means take the ascii code of that character and substitute it, in the case for x its the number 120, or 0x78 in hex. li a0, 'h' is the same as li a0, 104.

You will also notice I am using the mnemonic name a0 name instead of x10, it is just easier to read the code that way, a* is for arguments and return values, but its just because we use it that way, we could pass a parameter using t1(x6) or whatever, anything but zero(x0).

I used call instead of jal, call is a pseudo instruction, since jal has relative offset but the offset can only be 20 bits, 1 of which is sign bit, so we cant jump more than 524287 bytes away, we need to use jalr, and we need to use auipc to put the upper 20 bits in the register where we store the offset to. basically call is rewritten to :

auipc x6, offset[31:12]     # Upper 20 bits of offset, PC-relative
jalr  x1, offset[11:0](x6)  # Lower 12 bits of offset

It could also be rewritten to:

auipc x1, offset[31:12]     # Upper 20 bits of offset, PC-relative
jalr  x1, offset[11:0](x1)  # Lower 12 bits of offset

In some scenarios gcc uses t1.

  li a0, 'h'
  call putc

This is clear, we put 104 into a0 and then jump to putc while putting pc+4 into ra(x1).

putc:
  li t0, 0x10000000      # t0 = 0x10000000, again t0 = UART base address
1:
  lbu t1, 5(t0)          # t1 = mem[t0 + 5], load 1 byte from the UART status register
  andi t1, t1, 0x20      # t1 = t1 & 0x20, 0x20 is 00100000, check if this bit is 1
  beqz t1, 1b            # if not, we are not ready to transmit, try again
  sb a0, 0(t0)           # mem[t0] = a0, store a0 character to UART data register
  ret

lbu means 'load byte unsigned' which just means it will load 1 byte from a specific memory address, in our case address 0x10000005, then the next instruction is andi t1, t1, 0x20, which will do binary and operation. You already know the and truth table, you apply the AND logic bit by bit and write the result, for example:

      01010101
AND   00001111
     ----------
      00000101

Only if both bits are 1 then the output bit is 1. QEMU's UART status register will put 1 on bit 5, so if we and with 00100000 then the result will not be zero only if the 5th bit is one, otherwise we will get zero in the result. then we have beqz t1, 1b means if t1 is zero jump to the label 1 backwards, it is just a handy way to use temporary labels without us naming them, and this will just read again the status register. This pattern is very common, it is called a busy wait, you keep checking something over and over. It is also called 'polling', but usually when people say poll they mean 'check every second' or 'every millisecond' or some time interval, in a busy wait we use 100% of the cpu resources until the status changes.

If the 5th bit is 1 and t1 is not zero, it means that the UART is ready for us to write to it, you can think from the UART's point of view, it has some buffer, and when the buffer is full, because it might be printing slower than your writing speed, you will have to wait. Then we just write a0, which is the character we passed as parameter, into the UART data register, which for QEMU is at address 0x10000000. Then we do ret which is just jalr zero, 0(ra), it will jump to the value of ra which is pc+4 of wherever we called the call pseudo instruction.

This is how we print a character using QEMU's UART.

We keep printing 'e', 'l', 'l'. .. and so on '10' is ASCII for new line, and then we have a getch loop.

wait_for_q:
  call getch
  li t1, 'q'
  beq t1, a0, exit_qemu
  call putc
  j wait_for_q

again we call getch which is a subroutine like putc, putc was writing the a0 parameter, getch returns che character that the user typed into a0, then we compare it with the letter 'q' and if its equal we jump to exit_qemu, if not we will call putc which will read from a0 and send it to the UART, so the character you typed will appear on the terminal, and then we jump again to wait for 'q' to appear.

getch:
  li t0, 0x10000000      # t0 = 0x10000000, this is UART's base address
1:
  lbu t1, 5(t0)          # t1 = mem[t0 + 5], base + 5 is UART status register
  andi t1, t1, 0x01      # t1 = t1 & 0x01, use only the last bit
  beqz t1, 1b            # If no data ready, keep polling until the bit is 1
  lbu a0, 0(t0)          # a0 = mem[t0], base + 0 is the data register
  ret

getch is very similar to putc, it checks a status register, but it checks for the last bit instead of the 5th bit like putc, then keeps busy looping until this bit is 1, which QEMU's UART system will set once there is something in the input buffer, which happens when the user types a character on the keyboard. If the bit is set, we read from the data register and put the value of a0 and we return back.

qemu_exit just writes a specific value to specific address specified by QEMU so that we can tell it to shut down the virtual computer and exit.

exit_qemu:
  li t0, 0x100000        # t0 = 0x100000, QEMU exit device address
  li t1, 0x5555          # t1 = 0x5555, success exit code
  sw t1, 0(t0)           # mem[t0] = t1, store exit code to QEMU exit device
  j .                    # infinite loop until QEMU exits

When we call getch and putc I call them subroutine, but they are actually functions, subroutines dont take or return anything, they are just a sequence of instructions, functions take inputs and produce outputs. From now on I will use the term function, and this is similar to functions you learn in math, for example y = 3x + 2 is a function that has one parameter and returns one value. The return value of the function depends on the parameter. You can also see it as a map from input to output.

input | output
-------------
    0 | 2
    1 | 5
    2 | 8
    3 | 11
    ...

Calling a sequence of instructions that take some parameter and return some output a function is as good as calling a 32 bit value an integer. It is an integer, but it can not fit the whole number line, the math variables have no limit x can be infinity, can be zero, can be infinitely precise fraction, can even be infinite irrational number like π. The funciton y = 3x + 2 works fine, in our 32 bit computer however our output will only approximate the abstract function. There are a lot of symbols in programming that are kind of like math but not quite, the x = y equal symbol in math means equality whatever is on the left is the same as whatever is on the right, in programming languages x = y typically means store copy whatever the value of y is in memory into wherever x is in memory.

y = 3x + 2
y - 2 = 3x
(y - 2)/3 = x
x = (y - 2)/3

Those are all equivalent in math, but make no sense in almost all programming languages.

That does not stop us from saying x = 3 or li a5, 7 being a5 = 7, but it is more of a 'set' operation, of course after the operation is executed a5 will be 7, but 7 is not a5, as in 7 = a5 doesnt even make sense when you think of the wires.

It is similar with functions, even with lambda calculus and functional languages, things are not "quite" alright, and thats totally OK, you have to understand abstract operations, but you also have to understand the limit of the machine, and then you can get the best of both.


OK obviously nobody writes characters instruction by instruction, we can just put the string 'hello world' somewhere in memory and make a function puts that will take its address as parameter and print each character in a loop.

This is again the whole program, first type it in, and then we will discuss it, just replace boot.s with this code:

  .section .text
  .globl _start

_start:
  la sp, _stack_top

  la a0, message    # Load address of message into a0
  call puts         # Call our new puts function

  la a0, messageb
  call puts

  la a0, messaged
  call puts

  la a0, messageh
  call puts

wait_for_q:
  call getch
  li t1, 'q'
  beq t1, a0, exit_qemu
  call putc
  j wait_for_q

unreachable:
  j unreachable

  ####
  # Subroutine: puts
  # Prints a null-terminated string
  # Parameters: a0 - address of string to print
puts:
  addi sp, sp, -8     # Allocate stack space
  sw ra, 0(sp)        # Save return address
  sw s0, 4(sp)        # Save s0 (we'll use it as our string pointer)
  mv s0, a0           # Copy string address to s0

puts_loop:
  lbu a0, 0(s0)       # Load byte from string
  beqz a0, puts_done  # If byte is 0, we're done
  call putc           # Print the character
  addi s0, s0, 1      # Move to next character
  j puts_loop         # Repeat

puts_done:
  lw ra, 0(sp)        # Restore return address
  lw s0, 4(sp)        # Restore s0
  addi sp, sp, 8      # Deallocate stack space
  ret

getch:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x01
  beqz t1, 1b
  lbu a0, 0(t0)
  ret

putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

exit_qemu:
  li t0, 0x100000
  li t1, 0x5555
  sw t1, 0(t0)
  j .


messaged:
  .byte 104              # h
  .byte 101              # e
  .byte 108              # l
  .byte 108              # l
  .byte 111              # o
  .byte 32               # space
  .byte 119              # w
  .byte 111              # o
  .byte 114              # r
  .byte 108              # l
  .byte 100              # d
  .byte 10               # newline
  .byte 0                # null terminator

messageb:
  .byte 0b01101000       # h (104 or 0x68)
  .byte 0b01100101       # e (101 or 0x65)
  .byte 0b01101100       # l (108 or 0x6C)
  .byte 0b01101100       # l (108 or 0x6C)
  .byte 0b01101111       # o (111 or 0x6F)
  .byte 0b00100000       # space (32 or 0x20)
  .byte 0b01110111       # w (119 or 0x77)
  .byte 0b01101111       # o (111 or 0x6F)
  .byte 0b01110010       # r (114 or 0x72)
  .byte 0b01101100       # l (108 or 0x6C)
  .byte 0b01100100       # d (100 or 0x64)
  .byte 0b00001010       # newline (10 or 0x0A)
  .byte 0                # null terminator

messageh:
  .byte 0x68             # h
  .byte 0x65             # e
  .byte 0x6c             # l
  .byte 0x6c             # l
  .byte 0x6f             # o
  .byte 0x20             # space
  .byte 0x77             # w
  .byte 0x6f             # o
  .byte 0x72             # r
  .byte 0x6c             # l
  .byte 0x64             # d
  .byte 0x0a             # newline
  .byte 0x00             # null terminator

message:
  .asciz "hello world\n"   # .asciz adds null terminator automatically

  .end

You see how in the .data section I predefined some bytes, when the assembler makes the machine code it will put those specific bytes in the start of the .data segment, which is just after .rodata which is after the .text segment. Immediately after you will see bytes 104, 101, 108, 108.. la a0, messageb la is a pseudo instruction means load address, it is similar to li, but might use auipc which is Add Upper Immediate to PC, auipc rd, immediate means rd = pc + immediate << 12, immediate shifted left 12 bits, so we can use it for relative offsets and then we could add to the lower 12 bits with addi. Anyway, la a0, messageb will just put in a0 the address of wherever the label messageb is in memory.

I used messageb messageh messaged and message, all are exactly the same in memory. A sequence of characters is a string, a null terminated string is a sequence of characters that ends with 0. This means we dont need to know the length of the string we just print until we reach zero byte. This simple convenience, you will later find out, is the root cause of billions of dollars lost due to bugs, memory corruption, security exploits, and all kinds of pain and suffering.

There is one other big change, in start we do la sp, _stack_top and you can see in linker.ld we set _stack_top to be at the end of RAM, so now the register sp(x2) will be set to the very end of our RAM address space.

_start:
  la sp, _stack_top

  la a0, message
  call puts
  ...

puts:
  addi sp, sp, -8     # Allocate stack space
  sw ra, 0(sp)        # Save return address
  sw s0, 4(sp)        # Save s0 (we'll use it as our string pointer)
  mv s0, a0           # Copy string address to s0

puts_loop:
  lbu a0, 0(s0)       # Load byte from string
  beqz a0, puts_done  # If byte is 0, we're done
  call putc           # Print the character
  addi s0, s0, 1      # Move to next character
  j puts_loop         # Repeat

puts_done:
  lw ra, 0(sp)        # Restore return address
  lw s0, 4(sp)        # Restore s0
  addi sp, sp, 8      # Deallocate stack space
  ret
  ...

The puts function takes one argument in a0, which is a pointer to the null terminated string we will print. We do call puts which will set ra to pc+4, but inside of puts we need to call putc now the second call will also set ra to pc+4, then if we do ret from puts, which again is just jal zero, 0(ra), it will actually jump to the wrong place.

_start:
  la sp, _stack_top

  la a0, message
  jal ra, puts # call puts
  ...  <------------------------------------------+
                                                  |
puts:                                             |
  addi sp, sp, -4                                 |
  sw s0, 0(sp)                                    |
  mv s0, a0                                 we want to jump
                                              back there
puts_loop:                                        |
  lbu a0, 0(s0)                                   |
  beqz a0, puts_done                              |
  jal ra, putc             # call putc            |
  addi s0, s0, 1 <-------+                        |
  j puts_loop            | it will actually       |
                         | jump here              |
puts_done:               | as ra was overwritten  |
  lw s0, 0(sp)           |                        |
  addi sp, sp, 4         |                        |
  jalr zero, 0(ra) ------+ # ret  /---------------+
  ...

So we need to store ra somewhere and take it back, before we return. For that we will use the system's stack, we use sp (x2) to keep track of where the top of the stack is. When we call a function that is going to call another function, it must store the return address on the stack, and then take it out. The stack is also used for all kinds of local variables, we can allocate as much space as we need by moving sp down, and then we move it back up. There is a convention that the s* registers are also saved by the callee if they are going to use them, in our case we use s0 to keep track of the index that we are printing at the moment, if we call a function that also uses s* they will also store it on the stack and then make sure its restored, the same way we do.

This is what this code does, it allocates 8 bytes of stack space


  RAM BASE: (0x80000000)
_stack_top: (0x80000000 + 128M)

         address | value
sp -> _stack_top |
                 |
                 |
                 |
                 |
                 |
                 |
  data & program | xx
  data & program | xx
  data & program | xx
  data & program | xx
        RAM BASE |

after executing:
  addi sp, sp, -8
  sw ra, 0(sp)
  sw s0, 4(sp)

         address | value
      _stack_top |
           4(sp) | s0
    sp ->  0(sp) | ra
                 |
                 |
                ...
                 |
                 |
  data & program | xx
  data & program | xx
  data & program | xx
  data & program | xx
        RAM BASE |

sw ra, 0(sp) is memory[sp + 0] = ra and sw s0, 4(sp) is memory[sp + 4] = s0

This is called the function prologue, the stack preparation, storing the s* registers, preparing local variables and so on. And restoring the stack is called function epilogue.

puts:
  # prologue
  addi sp, sp, -8     # Allocate stack space
  sw ra, 0(sp)        # Save return address
  sw s0, 4(sp)        # Save s0 (we'll use it as our string pointer)

  ...

  # epilogue
  lw ra, 0(sp)        # Restore return address
  lw s0, 4(sp)        # Restore s0
  addi sp, sp, 8      # Deallocate stack space
  ret

You see after we return from puts, we are fetching the value for ra from where we stored it at 0(sp) and the value for s0 from 4(sp). This way when we do ret it will jump back to where it is supposed to.

Those two things combined, the fact that we store the return address on the stack lead to a whole generation of exploits, if you just find a bug that allows you to write on the stack, you can make the program jump wherever you want, you can overwrite the program itself even, if true can become if false, as the program is just data. There are all kinds of protections in place to prevent this from happening, but, it seems like people find ways around them.

OK now we are ready to discuss the actual meat of the puts function.

  ...
  mv s0, a0           # Copy string address to s0

puts_loop:
  lbu a0, 0(s0)       # Load byte from string
  beqz a0, puts_done  # If byte is 0, we're done
  call putc           # Print the character
  addi s0, s0, 1      # Move to next character
  j puts_loop         # Repeat
  ...

First we store a0 into s0 (s0 = a0), thats what mv s0, a0 does, its the same as addi a0, s0, 0, so we start from position 0, we load the value at s0 + 0, if its zero then we have reached the null termination and we jump to done, if not we call putc, as we already have the proper character in a0, and putc uses a0 as its argument, so that works out nicely, then we want to move to the next character, so we increment s0 += 1, and we jump back to the loop, which again loads from s0+0 but now this is pointing to the next character, and so on until we get to the 0 byte.

PHEW! now we can print more than one character, and also know how to call functions that call functions, we know about the system stack and about prologues and epilogues.

We are ready to write a forth interpreter program that parses and executes our tictactoe program, but of course we will start small, with the very core of Forth.

  .section .text
  .globl _start

_start:
  la sp, _stack_top
  la s1, FORTH_STACK_END # SP
  la s0, bytecode        # IP

  # start the program
  j NEXT


  # the program should terminate by itself,
  # in case it doesnt, we will print Z as a
  # debug message and exit
  li a0, 'Z'
  call putc
  j qemu_exit


##########################
NEXT:
  lw t0, 0(s0)           # IP
  addi s0, s0, 4         # IP
  jr t0

PLUS:
  # POP t0
  lw t0, 0(s1)           # SP
  addi s1, s1, 4         # SP

  # POP t1
  lw t1, 0(s1)           # SP
  addi s1, s1, 4         # SP

  add t0, t0, t1

  # PUSH t0
  addi s1, s1, -4        # SP
  sw t0, 0(s1)

  j NEXT

CR:
  li a0, '\n'
  call putc
  j NEXT

LITERAL:
  lw t0, 0(s0)           # IP
  addi s0, s0, 4         # IP

  # PUSH t0
  addi s1, s1, -4        # SP
  sw t0, 0(s1)           # SP

  j NEXT

EMIT:
  # POP a0
  lw a0, 0(s1)          # SP
  addi s1, s1, 4        # SP

  add a0, a0, '0'
  call putc
  j NEXT

BYE:
  j qemu_exit

  ##########################

putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

getch:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x01
  beqz t1, 1b
  lbu a0, 0(t0)
  ret

qemu_exit:
  li t0, 0x100000
  li t1, 0x5555
  sw t1, 0(t0)
  j .

bytecode:
  # our program writting in our new language
  # "2 3 + 4 + . cr bye"
  .word LITERAL
  .word 2
  .word LITERAL
  .word 3
  .word PLUS
  .word LITERAL
  .word 4
  .word PLUS
  .word EMIT
  .word CR
  .word BYE

  # allocate 1024 zero bytes for the FORTH Stack
  .space 1024
FORTH_STACK_END:
  .end

Save this in place of boot.s assemble it and run it

riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments -m elf32lriscv boot.o -o boot.elf
qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf

You should see then number 9 printed and then qemu will exit. First we will make a quality of life improvement, it must be annoying to write those 3 commands all the time, so we will create a Makefile which will just execute them when we type the command make, Makefiles are just a recipe of steps, it can be very complicated, and honestly I hate it, as I think it is very complicated, but we will use just a small part of the Make language to describe our recipe. Create a file in the same directory as boot.s and call it Makefile, inside of it write those instructions:

.RECIPEPREFIX = >
all:
> riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
> riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments \
  -m elf32lriscv boot.o -o boot.elf

run:
> qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf

It can also be written with <tab> as prefix, the tab character is usually displayed as 8 spaces, depending it has ASCII code of 9, but some editors display it as 2 or as 4, depending on their configuration, and of course in some editors when you press the tab button it will insert spaces instead the single ascii character 9, but when the make program is processing the Makefile it expects ASCII 9 instead of a 32,32,32,32,32,32,32,32 8 spaces. In newer GNU Make versions we can change the prefix with .RECIPEPREFIX = >.

all:
    riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
    riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments \
      -m elf32lriscv boot.o -o boot.elf

run:
    qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf

If you have no issues with <tab> use it, its much easier to read.

Now if you type make in the directory it will run the assembler and linker and get boot.elf, if you type make run it will run qemu. We will later build a more complicated Makefile that will allow us to work with more assembly file and help us run the debugger.

Now, lets discuss our program.

  # "2 3 + 4 + . cr bye"
  .word LITERAL
  .word 2
  .word LITERAL
  .word 3
  .word PLUS
  .word LITERAL
  .word 4
  .word PLUS
  .word EMIT
  .word CR
  .word BYE

.word means 4 bytes, there is also .byte we use those directives to put specific data in the binary. This .word LITERAL .word 2 .. sequence is the same as writing the bytes 0x80000058, 0x00000002, 0x80000058, 0x00000003, 0x8000002c, 0x80000058, 0x00000004, 0x8000002c, 0x8000006c, 0x8000004c, 0x80000080, as you will see in a bit.

Once the whole binary is compiled into an .elf file, you can use objdump to see its dissassembled machine code, dissassembly is the process of taking bytes and converting them to mnemonic insturctions, for example 00008067 is jalr zero, 0(ra). Because in the linker we say that our program will be loaded at address 0x80000000, which is where QEMU's RAM starts, in real hardware you know by now that those addressess are just enabled or disabled wires, in the .elf file the address 0x80000000 is specified as Entry Point Address, it is also specified that the program should be loaded at this address. When it makes the machine code it knows very well where every instruction will be. So at address 0x80000000 we have auipc sp,0x8000 and then immediately after we have 0x80000004 mv sp, sp which is the same as addi sp, sp, 0, those two instructions are the result of the expanded pseudo instruction la sp, _stack_top, auipc means Add Upper Immediate to PC, our pc is at 0x80000000, we will add 0x8000 to the upper 20 bits, which means 0x8000 << 12, or 0x8000000, and this of course is 134217728 in decimal, or 128MB and in our linker we have defined that _stack_top is _stack_top = ORIGIN(RAM) + LENGTH(RAM), and sp(x2) will be set at 0x80000000+0x8000000 0x88000000.

The next 2 instructions will come from la s1, FORTH_STACK_END, now this is more interesting, you can see the label FORTH_STACK_END in the end of our .data section, before it we have said .space 1024, the assembler will create 1024 bytes empty spaces and then know where FORTH_STACK_END exactly is going to be. auipc s1, 0x0 will put pc in s1, then we have s1 = s1 + 1268, you have to be careful when reading objdump, if some numbers are decimal some are hexadecimal, the addressess on the left and the machinecode are hexadecimal, and they dont start with 0x, but the arguments to instructions are decimal, so s1 will be 0x800004fc, and then we have la s0, bytecode, which will put 0x800000d0 in s0.

And then the magic happens, we have j 80000020, which is the code for our NEXT function.


$ riscv64-unknown-elf-objdump -D boot.elf

boot.elf:     file format elf32-littleriscv


Disassembly of section .text:

80000000 <_start>:
80000000:	08000117          	auipc	sp,0x8000
80000004:	00010113          	mv	sp,sp
80000008:	00000497          	auipc	s1,0x0
8000000c:	4f448493          	addi	s1,s1,1268 # 800004fc <FORTH_STACK_END>
80000010:	00000417          	auipc	s0,0x0
80000014:	0c040413          	addi	s0,s0,192 # 800000d0 <bytecode>
80000018:	0080006f          	j	80000020 <NEXT>
8000001c:	0980006f          	j	800000b4 <qemu_exit>

80000020 <NEXT>:
80000020:	00042283          	lw	t0,0(s0)
80000024:	00440413          	addi	s0,s0,4
80000028:	00028067          	jr	t0

8000002c <PLUS>:
8000002c:	0004a283          	lw	t0,0(s1)
80000030:	00448493          	addi	s1,s1,4
80000034:	0004a303          	lw	t1,0(s1)
80000038:	00448493          	addi	s1,s1,4
8000003c:	006282b3          	add	t0,t0,t1
80000040:	ffc48493          	addi	s1,s1,-4
80000044:	0054a023          	sw	t0,0(s1)
80000048:	fd9ff06f          	j	80000020 <NEXT>

8000004c <CR>:
8000004c:	00a00513          	li	a0,10
80000050:	034000ef          	jal	80000084 <putc>
80000054:	fcdff06f          	j	80000020 <NEXT>

80000058 <LITERAL>:
80000058:	00042283          	lw	t0,0(s0)
8000005c:	00440413          	addi	s0,s0,4
80000060:	ffc48493          	addi	s1,s1,-4
80000064:	0054a023          	sw	t0,0(s1)
80000068:	fb9ff06f          	j	80000020 <NEXT>

8000006c <EMIT>:
8000006c:	0004a503          	lw	a0,0(s1)
80000070:	00448493          	addi	s1,s1,4
80000074:	03050513          	addi	a0,a0,48
80000078:	00c000ef          	jal	80000084 <putc>
8000007c:	fa5ff06f          	j	80000020 <NEXT>

80000080 <BYE>:
80000080:	0340006f          	j	800000b4 <qemu_exit>

80000084 <putc>:
80000084:	100002b7          	lui	t0,0x10000
80000088:	0052c303          	lbu	t1,5(t0) # 10000005 <_start-0x6ffffffb>
8000008c:	02037313          	andi	t1,t1,32
80000090:	fe030ce3          	beqz	t1,80000088 <putc+0x4>
80000094:	00a28023          	sb	a0,0(t0)
80000098:	00008067          	ret

8000009c <getch>:
8000009c:	100002b7          	lui	t0,0x10000
800000a0:	0052c303          	lbu	t1,5(t0) # 10000005 <_start-0x6ffffffb>
800000a4:	00137313          	andi	t1,t1,1
800000a8:	fe030ce3          	beqz	t1,800000a0 <getch+0x4>
800000ac:	0002c503          	lbu	a0,0(t0)
800000b0:	00008067          	ret

800000b4 <qemu_exit>:
800000b4:	001002b7          	lui	t0,0x100
800000b8:	00005337          	lui	t1,0x5
800000bc:	55530313          	addi	t1,t1,1365 # 5555 <_start-0x7fffaaab>
800000c0:	0062a023          	sw	t1,0(t0) # 100000 <_start-0x7ff00000>
800000c4:	0000006f          	j	800000c4 <qemu_exit+0x10>

Disassembly of section .data:

800000d0 <bytecode>:
800000d0:	0058                	.insn	2, 0x0058
800000d2:	8000                	.insn	2, 0x8000
800000d4:	0002                	.insn	2, 0x0002
800000d6:	0000                	.insn	2, 0x
800000d8:	0058                	.insn	2, 0x0058
800000da:	8000                	.insn	2, 0x8000
800000dc:	00000003          	lb	zero,0(zero) # 0 <_start-0x80000000>
800000e0:	002c                	.insn	2, 0x002c
800000e2:	8000                	.insn	2, 0x8000
800000e4:	0058                	.insn	2, 0x0058
800000e6:	8000                	.insn	2, 0x8000
800000e8:	0004                	.insn	2, 0x0004
800000ea:	0000                	.insn	2, 0x
800000ec:	002c                	.insn	2, 0x002c
800000ee:	8000                	.insn	2, 0x8000
800000f0:	006c                	.insn	2, 0x006c
800000f2:	8000                	.insn	2, 0x8000
800000f4:	004c                	.insn	2, 0x004c
800000f6:	8000                	.insn	2, 0x8000
800000f8:	0080                	.insn	2, 0x0080
800000fa:	8000                	.insn	2, 0x8000
	...

NEXT

NEXT:
  lw t0, 0(s0)           # IP
  addi s0, s0, 4         # IP
  jr t0

NEXT loads 4 bytes from memory at address s0 into t0, then increments s0 with 4 and jumps to t0. the value of s0 is 0x800000d0, and the value at memory[0x800000d0] is 0x800000058.

800000d0:	0058                	.insn	2, 0x0058
800000d2:	8000                	.insn	2, 0x8000

You can see it here, but it is written backwards, 0058 8000. How we print numbers and how we use them depend on which bytes we think are first. There are two ways, Big-endian and Little-endian.

In our case we compiling the code for a Little Endian RISC-V processor.

Memory Address | Byte Value
--------------------------
800000d0      | 58        (least significant byte)
800000d1      | 00
800000d2      | 00
800000d3      | 80        (most significant byte)

The term "endian" comes from Gulliver's Travels where two groups fought if they should break eggs at the big end or little end.

Objdump is showing the data 2 bytes at a time for memory contents, and thats why it looks backwards.

Honestly this endianness thing always annoys me, I wish it was only one, but we are where we are.

s0: 0x800000d0           # Forth Instruction Pointer
s1: 0x800004fc           # Forth Stack Pointer

        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
s0 ->   800000d0:   0x80000058           | LITERAL
        800000d4:   0x00000002           | 2
        800000d8:   0x80000058           | LITERAL
        800000dc:   0x00000003           | 3
        800000e0:   0x8000002c           | PLUS
        800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
        800004f4:   0x00000000           |
        800004f8:   0x00000000           |
s1 ->   800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

Anyway, if you look up you will see that at address 0x80000058 we have our LITERAL function, so NEXT will jump to LITERAL. We will follow the value of s0 through the process.

LITERAL

NEXT added 4 to s0 before jumping, so it is at 0x800000d4 when we come into LITERAL

s0: 0x800000d4           # Forth Instruction Pointer
s1: 0x800004fc           # Forth Stack Pointer

        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
        800000d0:   0x80000058           | LITERAL
s0 ->   800000d4:   0x00000002           | 2
        800000d8:   0x80000058           | LITERAL
        800000dc:   0x00000003           | 3
        800000e0:   0x8000002c           | PLUS
        800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
        800004f4:   0x00000000           |
        800004f8:   0x00000000           |
s1 ->   800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

Literal will load the value at memory[s0], in this case you can see its the value 2, then it will add 4 to s0, and push it on the forth stack, we use s1 to keep track of it. Our stack grows upwards, meaning it starts at a high address and we just decreas its value, it is all relative this upwards downwards thing, I call it upwards because I have the low addressess on top when I write, so the stack grows up, but if you draw the memory the other way it will grow down. Anyway, we decrease the value of s1.

LITERAL:
  lw t0, 0(s0)           # IP
  addi s0, s0, 4         # IP

  # PUSH t0
  addi s1, s1, -4        # SP
  sw t0, 0(s1)           # SP

  j NEXT

After LITERAL is done we will have 2 on the Forth stack, and then we jump to NEXT.

s0: 0x800000d8           # Forth Instruction Pointer
s1: 0x800004f8           # Forth Stack Pointer

        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
        800000d0:   0x80000058           | LITERAL
        800000d4:   0x00000002           | 2
s0 ->   800000d8:   0x80000058           | LITERAL
        800000dc:   0x00000003           | 3
        800000e0:   0x8000002c           | PLUS
        800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
        800004f4:   0x00000000           |
s1 ->   800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

NEXT

NEXT again will load the value at memory[s0] into t0, in this case memory[0x800000d8], which is again 0x80000058, it will increment s0 with 4 and jump to t0.

This will be the memory state after NEXT.

s0: 0x800000dc           # Forth Instruction Pointer
s1: 0x800004f8           # Forth Stack Pointer

        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
        800000d0:   0x80000058           | LITERAL
        800000d4:   0x00000002           | 2
        800000d8:   0x80000058           | LITERAL
s0 ->   800000dc:   0x00000003           | 3
        800000e0:   0x8000002c           | PLUS
        800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
        800004f4:   0x00000000           |
s1 ->   800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

LITERAL

Again literal will load memory[s0] which is 3 and push it on the forth stack by also decrementing s1 with 4 and incrementing s0 with 4.

s0: 0x800000e0           # Forth Instruction Pointer
s1: 0x800004f4           # Forth Stack Pointer


        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
        800000d0:   0x80000058           | LITERAL
        800000d4:   0x00000002           | 2
        800000d8:   0x80000058           | LITERAL
        800000dc:   0x00000003           | 3
s0 ->   800000e0:   0x8000002c           | PLUS
        800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
s1 ->   800004f4:   0x00000003           | 3
        800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

NEXT

Same story, load memory[s0] into t0, memory[0x800000e0] is 0x8000002c, and that is the address of our PLUS function, add 4 to s0 and jump to t0,

s0: 0x800000e4           # Forth Instruction Pointer
s1: 0x800004f4           # Forth Stack Pointer

        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
        800000d0:   0x80000058           | LITERAL
        800000d4:   0x00000002           | 2
        800000d8:   0x80000058           | LITERAL
        800000dc:   0x00000003           | 3
        800000e0:   0x8000002c           | PLUS
s0 ->   800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
s1 ->   800004f4:   0x00000003           | 3
        800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

PLUS

Plus will pop two values from the stack, add them and push back to the stack, lets follow the stack.

8000002c <PLUS>:
8000002c:	0004a283          	lw	t0,0(s1)
80000030:	00448493          	addi	s1,s1,4
80000034:	0004a303          	lw	t1,0(s1)
80000038:	00448493          	addi	s1,s1,4
8000003c:	006282b3          	add	t0,t0,t1
80000040:	ffc48493          	addi	s1,s1,-4
80000044:	0054a023          	sw	t0,0(s1)
80000048:	fd9ff06f          	j	80000020 <NEXT>

lw t0, 0(s1), memory[800004f4] is 3
              t0 is set to 3

---------------------------------------------
s1 ->   800004f4:   0x00000003           | 3
        800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | 0
---------------------------------------------


addi	s1,s1,4
---------------------------------------------
        800004f4:   0x00000003           | 3
s1 ->   800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | 0
---------------------------------------------


lw t1, 0(s1), memory[800004f8] is 2
              t1 is set to 2
---------------------------------------------
        800004f4:   0x00000003           | 3
s1 ->   800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | 0
---------------------------------------------

addi s1,s1,4
---------------------------------------------
        800004f4:   0x00000003           | 3
        800004f8:   0x00000002           | 2
s1 ->   800004fc:   0x00000000           | 0
---------------------------------------------

add	t0,t0,t1, t0 = t0 + t1
              t0 is set to 5
---------------------------------------------
        800004f4:   0x00000003           | 3
        800004f8:   0x00000002           | 2
s1 ->   800004fc:   0x00000000           | 0
---------------------------------------------

addi	s1,s1,-4
---------------------------------------------
        800004f4:   0x00000003           | 3
s1 ->   800004f8:   0x00000002           | 2
        800004fc:   0x00000000           | 0
---------------------------------------------

sw t0, 0(s1), t0 is 5,
              memory[800004f8] is set to 5

---------------------------------------------
        800004f4:   0x00000003           | 3
s1 ->   800004f8:   0x00000005           | 5
        800004fc:   0x00000000           | 0
---------------------------------------------

After the PLUS function you see the top of the stack has value 5, we have "consumed" 2 and 3 and inserted "5" in their place, 3 is still left in memory but it is just a garbage value, we wont bother cleaning it up, next time we add something to the stack it will be overwritten. And when its done it will jump to NEXT.

This is how the memory looks after PLUS

s0: 0x800000e4           # Forth Instruction Pointer
s1: 0x800004f8           # Forth Stack Pointer

        Address     Value (big-endian)  | Meaning
        ----------------------------------------------
        800000d0:   0x80000058           | LITERAL
        800000d4:   0x00000002           | 2
        800000d8:   0x80000058           | LITERAL
        800000dc:   0x00000003           | 3
        800000e0:   0x8000002c           | PLUS
s0 ->   800000e4:   0x80000058           | LITERAL
        800000e8:   0x00000004           | 4
        800000ec:   0x8000002c           | PLUS
        800000f0:   0x8000006c           | EMIT
        800000f4:   0x8000004c           | CR
        800000f8:   0x80000080           | BYE
        800000fc:   0x00000000           |
        ...                              |
        800004f4:   0x00000003           | 3
s1 ->   800004f8:   0x00000005           | 5
        800004fc:   0x00000000           | Top of stack
        80000500:   0x00000000           | unused memory
        ...                              | unused memory

NEXT

Same old next, doing the same thing, load memory[s0] into t0, add 4 to s0, jump to t0. so we go to LITERAL again, which will put 4 on the stack, then again we go to PLUS, which will pop 4 and pop 5 and add them and push 9., then we go to EMIT. EMIT pops 9 from the stack and adds 48 to it, puts the result in a0 and calls putc to print the character on screen (48 is the ascii for '0' and 48 + 9 is the ascii for 9). after EMIT is done it jumps to NEXT, then NEXT jumps to CR, which prints a new line, and jumps to NEXT again, and then we get to BYE which exits qemu.

--

You see we have a language inside assembly, it weaving like a thread, function -> next -> function -> next -> function next. So tiny and nice, it took us only few lines of code. Just like a silk thread weaving through memory.

Imagine WRITE function that pops two values from the stack, one a memory address, and one a value. Almost like PLUS but instead of pushing to the stack, we will write to the specific value to the specified address.

WRITE:
  # POP t0, address
  lw t0, 0(s1)
  addi s1, s1, 4

  # POP t1, value
  lw t1, 0(s1)
  addi s1, s1, 4

  sw t1, 0(t0)

  j NEXT

We could write this program that writes 7 to address 0x800000fc

.word LITERAL
.word 7
.word LITERAL
.word 0x800000fc
.word WRITE

We put the value 8 on the stack with .word LITERAL .word 7, and then we put 0x800000fc on the stack .word LITERAL .word 0x800000fc, then LITERAL's NEXT will jump into WRITE, which will pop the two values from the stack, the first pop is the address into t0, then it will pop the value 7 into t1 and finally it will write 7 into memory[t0], or memory[0x800000fc], now imagine if the program itself is there at address 0x800000fc.

I wrote this small program, the addressess are different than the ones we had so far because I addedd the WRITE code which will move everyuthing by 6 instructions, each instruciton is 4 bytes, so everything will be off by 24 bytes, but anyway, I just want to illustrate the point:

bytecode:
  .word LITERAL
  .word 0x800000a0
  .word LITERAL
  .word 0x80000104
  .word WRITE

bytecode is at 0x800000f0 and ends at 0x80000104, so with this small program we write the value 0x800000a0 at address 0x80000104 and the value 0x800000a0 happens to be the address of BYE. I could've written it using labels:

bytecode:
  .word LITERAL
  .word BYE
  .word LITERAL
  .word bytecode+20 # 4 * 5
  .word WRITE
  # -> we want to write here

Or we could create a bytecode_end label that we can use.

bytecode:
  .word LITERAL
  .word BYE
  .word LITERAL
  .word bytecode_end
  .word WRITE
bytecode_end:

The assembler knows where everything will be in memory, it knows that bytecode will be at address X then each .word is 4 bytes, it knows bytecode_end is going to be at bytecode + 20 bytes. .word bytecode_end will be replaced with the apropriate value. The labeling in modern assembler is really cool! And what is even more cool, is that we wrote our program using our small bytecode language, tha modified the memory where it lives. Such power!

Think for a second, what would this do.

  .word LITERAL
  .word LITERAL
  .word LITERAL
  .word LITERAL
  .word WRITE

We will do few quality of life improvements to allow us to write code easier, for example I am constantly confused by stacks growing direction, I often forget to do -4 or +4 and that leads to a lot of pain and suffering and hours of debugging and then facepalming.

We will use MACROs, a macro is just a piece of code that gets executed before the program is compiled, it is like a program that the compiler executes on the source code itself.

  .macro PUSH reg
  addi s1, s1, -4
  sw \reg, 0(s1)
  .endm

  .macro POP reg
  lw \reg, 0(s1)
  addi s1, s1, 4
  .endm

then PLUS becomes:

PLUS:
  POP t0
  POP t1
  add t0, t0, t1
  PUSH t0
  j NEXT

see its much clearer, POP reg will be expanded into "lw reg, 0(s1); addi s1, s1, 4" and PUSH reg will be expanded in "addi s1, s1, -4; sw reg, 0(s1)". For example POP t0 will expand to "lw t0, 0(s1); addi s1, s1, 4".

You can see that macros allow you to extend the language. Some programming languages have extremely flexible macro system that is a language in itself, and the best languages have a macro system that is the language itself (like LISP). In our case we will just use macros to help us not repeat the same few lines of code over and over again.

We could of course create a POP function and a PUSH function, and call into it, but this will at least double the amount of instructions per operation, its really not worth it.

We will create another quality of life improvement, instead of using s0 and s1 we will use :IP and :SP, IP as in instruction pointer, it is the same as program counter, just a register we will use to point where are we in the program, our index finger if you will, and :SP we will use for a stack pointer in the Forth stack. Sadly it is not possible to do that with a macro or in any other way in RISC-V assembly, so we will use an external program to replace :SP to s1 and :IP to s0 before we give the source code to the assembler to make machine code.

We will use the sed command to replace all occurances. Now we are entering a bit more complicated territory because the project is going to grow, we could put everything in boot.s but its going to be really hard to read, also we will need a way to debug if there is an issue and be able to execute instructions step by step.

This is a new version of the Makefile that allows us to have many .s files and in the end they are linked into one .elf file, it also creates a build/ directory and puts there all the object file (unlinked machine code files) and the elf file. It also uses sed to replace :IP, :SP and other registers we would use later into their corresponding s0, s1, s2 registers.

Its beyond the scope of the book to dig deeper into (.. I was going to use the word delve here, but now people will think that chatgpt wrote this if I do) the GNU Make language, I also dont think its worth spending time on it, just ask chatgpt to copy the code from the page and explain it.

# Compiler and linker
AS = riscv64-unknown-elf-as
LD = riscv64-unknown-elf-ld
GDB = riscv64-unknown-elf-gdb

# Flags
ASFLAGS = -g -march=rv32g -mabi=ilp32
LDFLAGS = -T linker.ld --no-warn-rwx-segments -m elf32lriscv

# QEMU command
QEMU = qemu-system-riscv32
QEMU_FLAGS = -nographic -machine virt -bios none

# Directories
SRC_DIR = .
BUILD_DIR = build
OBJ_DIR = $(BUILD_DIR)/obj

# Source files
SRC_FILES = $(wildcard $(SRC_DIR)/*.s)
OBJ_FILES = $(patsubst $(SRC_DIR)/%.s,$(OBJ_DIR)/%.o,$(SRC_FILES))

# Target executable
TARGET = $(BUILD_DIR)/boot.elf

# GDB script
GDB_SCRIPT = $(BUILD_DIR)/gdb_commands.gdb

# Default target
all: directories $(TARGET) $(GDB_SCRIPT)

# Create necessary directories
directories:
	@mkdir -p $(OBJ_DIR)

# Compile .s files to object files
$(OBJ_DIR)/%.o: $(SRC_DIR)/%.s
	@sed -e 's/:IP/s0/g' \
       -e 's/:SP/s1/g' \
       -e 's/:RSP/s2/g' \
       -e 's/:CSP/s3/g' \
       -e 's/:HERE/s4/g' \
       -e 's/:XT/s5/g' \
       -e 's/:LATEST/s6/g' \
       -e 's/:MODE/s7/g' \
       -e 's/:ESP/s8/g' $< > $@.pre.s
	$(AS) $(ASFLAGS) $@.pre.s -o $@

# Link object files to create the executable
$(TARGET): directories $(OBJ_FILES)
	$(LD) $(LDFLAGS) $(OBJ_FILES) -o $@

# Create GDB script
$(GDB_SCRIPT):
	@echo "target remote localhost:1234" > $@
	@echo "tui enable" >> $@
	@echo "tui layout reg" >> $@
	@echo "file $(TARGET)" >> $@
	@echo "break _start" >> $@
	@echo "continue" >> $@

# Clean up
clean:
	rm -rf $(BUILD_DIR)

# Run the program in QEMU
run: $(TARGET)
	$(QEMU) $(QEMU_FLAGS) -kernel $(TARGET)

# Run QEMU with GDB server enabled
qemu-gdb: $(TARGET)
	reset ; $(QEMU) $(QEMU_FLAGS) -kernel $(TARGET) -S -s

# Run GDB and connect to QEMU
gdb: $(TARGET) $(GDB_SCRIPT)
	$(GDB) -x $(GDB_SCRIPT)
kill:
	killall -9 qemu-system-riscv32
objdump:
	riscv64-unknown-elf-objdump -D build/boot.elf

objdump-data:
	riscv64-unknown-elf-objdump -s -j .data  build/boot.elf

.PHONY: all clean run run-gdb-server run-gdb debug directories objdump kill

When you replace your Makefile with this version you will have the commands make, make run, make clean, make gdb, make qemu-gdb, make kill, make objdump if you want to debug the program you need to run make qemu-gdb which starts QEMU waiting for gdb to hook into it, and does not execute any instruction until its connected to gdb, then in another terminal you must run make gdb which will start gdb with the right parameters to connect to QEMU. Then you can run 'si' which means 'step into' and it will run one instruction at a time. You can also add breakpoints and pause the program at various places.

OK, now we can rewrite our code with all the quality of life improvements, we will split it into 3 files, boot.s where we will just do super basic preparation and jump into the forth interpreter, qemu.s where we will have all the qemu dependent code, like putc, getch, qemu_exit, and forth.s where we will keep the forth stuff.

  # boot.s
  .section .text
  .globl _start

_start:
  la sp, _stack_top

  j forth

  li a0, 'Z'
  call putc
  j qemu_exit

  .end
   # qemu.s
  .section .text

  .globl putc
  .globl getch
  .globl qemu_exit

putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

getch:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x01
  beqz t1, 1b
  lbu a0, 0(t0)
  ret

qemu_exit:
  li t0, 0x100000
  li t1, 0x5555
  sw t1, 0(t0)
  j .

  .end
  # forth.s
  .section .text
  .globl forth

  .macro PUSH reg
  addi :SP, :SP, -4
  sw \reg, 0(:SP)
  .endm

  .macro POP reg
  lw \reg, 0(:SP)
  addi :SP, :SP, 4
  .endm

forth:
  la :SP, FORTH_STACK_END
  la :IP, bytecode

  # start the program
  j NEXT

NEXT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  jr t0

PLUS:
  POP t0
  POP t1
  add t0, t0, t1
  PUSH t0
  j NEXT

LITERAL:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  PUSH t0
  j NEXT

EMIT:
  POP a0
  add a0, a0, '0'
  call putc
  j NEXT

WRITE:
  POP t0 # address
  POP t1 # value
  sw t1, 0(t0)
  j NEXT

BYE:
  j qemu_exit

CR:
  li a0, '\n'
  call putc
  j NEXT

bytecode:
  # "2 3 + 4 + . cr bye"
  .word LITERAL
  .word 2
  .word LITERAL
  .word 3
  .word PLUS
  .word LITERAL
  .word 4
  .word PLUS
  .word EMIT
  .word CR
  .word BYE
  .space 1024
FORTH_STACK_END:
  .end

You will notice we use this .globl directive, which tells the assembler that this symbol (e.g getch) will be accessible from other object files.


Just like a silk thread weaving through memory.

I want to be able to write the text "2 3 + 4 + . cr bye" somewhere in memory instead of writing the bytecode by hand.

...
program:
  .asciz "2 3 + 4 + . cr bye"

.asciz means null terminated ascii string, it will write the bytes 50 32 51 32 43 32 52 32 43 32 46 32 99 114 32 98 121 101 0 in the binary, which then will be loaded in memory on wherever the program: label happens to fall at.

You have experienced many levels of programming languages so far, from the microcode and the EEPROM wires, to SUBLEQ, to assembly, and now the mini forth bytecode thread jumping language. All of them allow you to program the computer. If you made a language on top of the wires lets call it W, and then a language on top of this language lets call it A, then whatever A can do W can do, as ultimately, A is executed by W. Why do we keep building higher and higher level languages, further and further from the wires? Those languages for sure can program the machine, but, in order for you express your thoughts into it, you have to think in the language you are using, and those very low level languages are much harder to think in, you cant keep track of 9548 wires and if they are on or off and what is going to happen next, its just not possible. But you can think about higher concepts, like remembering what is in a stack of values, you know the number 2 and 3 and you want to add them, this is how you think, the programming language has to be good both for you AND the machine to think in.

The problem is that everyone of us thinks differently, and certain things are easy for one and hard for another, as I said, if I were to make the perfect chair for me, it will be a torture device for you. Keep that in mind when studying programming, the languages we have are a compromise between how most people think and how the machines we made think. Do not worry if you struggle to express yourself, it takes time. It is not the same as learning another man made language, like knowing Dutch and learning English, those are languages made by people for people, they dont change faster than we change.

We will now build up from our bytecode language to the ascii "human" like language, but you know by now, it is just wires all the way down.

The first step we must do is to be able to know where a symbol starts and where it ends, for example 2 and + are 1 character long, cr is 2, bye is 3. Looking at the program we can just split the symbols by space, and things will work out. This is the very first step in any programming languages, tokenizing the program. Tokenization is the process of splitting something into chunks that you would work with, for example in language this is usually words, but you could also make character tokens, or you can make bigrams (twowords) or trigrams triwordstogether, or character ngrams li ke th is, for us we want to create a token out of each symbol, word, digit etc.

When I am working with something in memory I always imagine it in some random address, in this case I will think that our text program will be located at address 0x80001000.

Memory Address  ASCII    Hex    Dec
-----------------------------------
0x80001000      '2'     0x32    50
0x80001001      ' '     0x20    32
0x80001002      '3'     0x33    51
0x80001003      ' '     0x20    32
0x80001004      '+'     0x2B    43
0x80001005      ' '     0x20    32
0x80001006      '4'     0x34    52
0x80001007      ' '     0x20    32
0x80001008      '+'     0x2B    43
0x80001009      ' '     0x20    32
0x8000100A      '.'     0x2E    46
0x8000100B      ' '     0x20    32
0x8000100C      'c'     0x63    99
0x8000100D      'r'     0x72   114
0x8000100E      ' '     0x20    32
0x8000100F      'b'     0x62    98
0x80001010      'y'     0x79   121
0x80001011      'e'     0x65   101
0x80001012      '\0'    0x00     0

We will make a function that takes a memory address and then returns the address of the next token and how big it is.

If we give it address 0x80001000 it should return 0x80001000 and length 1, if we give it 0x80001001, it should return 0x80001002 and 1. It will skip the leading spaces, then count the bytes of the token, if it reaches the null termination 0 or space it will stop.

Create a new file string.s and write the following code:

  # string.s
  .section .text
  .globl token

  # input:
  #  a0 address an ascii string
  # output:
  #  a0 start token address
  #  a1 token length
token:
  mv t3, a0            # t3 = initial address
  li a1, 0             # length = 0
  li t1, '!'           # ascii 33, space is 32

  # Skip leading spaces
.L_skip_spaces:
  lbu t0, 0(t3)        # load byte at current position
  beqz t0, .L_done     # if null termination, done
  bge t1, t0, .L_count_token # if char >= 33 start counting
  addi t3, t3, 1       # increment address
  j .L_skip_spaces

  # Count token length until space or null
.L_count_token:
  mv a0, t3 # a0 is the start of the token

.L_count_token_next:
  lbu t0, 0(t3)        # load byte
  blt t0, t1, .L_done  # it char < 33 (including 0) done
  addi a1, a1, 1       # increment length
  addi t3, t3, 1       # increment address
  j .L_count_token_next

.L_done:
  ret

  .end

There is a convention in RISC-V assembly to use .L_ for local labels, nothing will stop you to jump to them from anywhere, but at least its clear that its not intended to be jumped into from random places.

This code skips more than space, it skips anything in the ascii table below 33 (!), which includes new line, tab and other weird characters.

Our language will be the inverse of python. WHITESPACE FREEDOOOMM!

2
        3 +

  4


  +
. cr

bye

We will use string.s to write other string functions we need, like is_number, atoi (ascii to integer), puts and print integer. As you might have guessed, we need to know if a token is a number or not, so we know if we should make put LITERAL, 3 or PLUS in the bytecode.


I will just show a bunch of code now, nothing you havent seen, but just more of it. There are a bunch of helper functions to help us manipulate the stack, to make from 1 2 3 -> 3 1 2, or 1 2 -> 1 2 1 2 and few more, if you read the code you will see how they work, its just pops and pushes.

Ther are also few helper functions to allow us to do memcompare, print integers, convert strings to integers etc. Again I wont go into a lot of detail, I have asked chatgpt to redo the comments so they are clearer, at least I found them clearer than the ones I wrote.

  # string.s
  # I actually wrote this but o1 pro styled it, it makes such beautiful and clear comments.
  # After I confirmed they are correct, I couldnt resist using its version
  #=====================================================================
  #  RISC-V Assembly Utilities
  #
  #  This file provides:
  #    - token      : Extract the next token (non-whitespace) from a string
  #    - is_number  : Check if a substring is purely decimal digits
  #    - atoi       : Convert a decimal string to an integer
  #    - puts       : Print a null-terminated string
  #    - puts_len   : Print a string up to a given length
  #    - print_int  : Print an integer in decimal format
  #    - memcmp     : Compare two memory arrays
  #    - print_unsigned_hex : Print integer in hex format (useful for address print)
  #=====================================================================
  .section .text
  .globl token
  .globl is_number
  .globl atoi
  .globl puts
  .globl puts_len
  .globl print_int
  .globl memcmp
  .globl print_unsigned_hex

  #---------------------------------------------------------------------
  # token
  #
  # Input:
  #   a0 = address of a null-terminated ASCII string
  #
  # Output:
  #   a0 = start of the next token
  #   a1 = length of that token
  #
  # Description:
  #   1) Skips leading whitespace (ASCII < 33).
  #   2) Returns the address at which the non-whitespace data begins.
  #   3) Counts characters until the next whitespace or null terminator.
  #---------------------------------------------------------------------
token:
  mv      t3, a0               # t3 = current pointer in string
  li      a1, 0                # a1 = token length = 0
  li      t1, 33               # ASCII 33 = '!' (first non-space, e.g. ' ' = 32)

  #--- Skip leading spaces
.L_skip_spaces:
  lbu     t0, 0(t3)            # load byte
  beqz    t0, .L_done_token    # if null terminator -> done (empty token)
  bge     t0, t1, .L_count_token
  addi    t3, t3, 1            # else skip this whitespace char
  j       .L_skip_spaces

  #--- Count token length
.L_count_token:
  mv      a0, t3               # a0 = start of token
.L_count_token_next:
  lbu     t0, 0(t3)            # load byte
  blt     t0, t1, .L_done_token
  addi    a1, a1, 1            # increment token length
  addi    t3, t3, 1            # move to next character
  j       .L_count_token_next

.L_done_token:
  ret

  #---------------------------------------------------------------------
  # is_number
  #
  # Input:
  #   a0 = address of the substring
  #   a1 = length of the substring
  #
  # Output:
  #   a0 = -1 if the substring is a valid integer (negative or positive)
  #   a0 = 0 if not
  #
  # Notes:
  #   - A leading minus sign is optional.
  #   - A lone minus sign ("-") is invalid.
  #   - Any non-digit character immediately disqualifies the string.
  #---------------------------------------------------------------------
is_number:
  beqz    a1, .L_not_number       # if length == 0, not a number

  mv      t0, a0                  # t0 = current string pointer
  mv      t1, a1                  # t1 = remaining length

  # Check for optional leading minus sign
  lbu     t2, 0(t0)               # look at first character
  li      t3, '-'
  beq     t2, t3, .L_handle_minus # if '-', skip it

  #---------------------------------------------------------------------
  # .L_check_digit_loop:
  #   Check each character must be '0'..'9'.
  #---------------------------------------------------------------------
.L_check_digit_loop:
  lbu     t2, 0(t0)               # load current character
  li      t3, '0'                 # ASCII '0' (48)
  li      t4, '9'                 # ASCII '9' (57)

  blt     t2, t3, .L_not_number   # if char < '0' -> not number
  bgt     t2, t4, .L_not_number   # if char > '9' -> not number

  # Move to next character
  addi    t0, t0, 1
  addi    t1, t1, -1

  bnez    t1, .L_check_digit_loop # keep checking until length=0

  # If we exit the loop normally, all checked chars are digits
  li      a0, -1                  # indicate "valid number"
  ret

  #---------------------------------------------------------------------
  # .L_handle_minus:
  #   Skip the minus sign and then check digits.
  #---------------------------------------------------------------------
.L_handle_minus:
  addi    t0, t0, 1               # skip '-'
  addi    t1, t1, -1
  beqz    t1, .L_not_number       # if no chars after '-', not number

  j       .L_check_digit_loop

  #---------------------------------------------------------------------
  # .L_not_number:
  #   If anything fails above, return 0.
  #---------------------------------------------------------------------
.L_not_number:
  li      a0, 0
  ret

  #---------------------------------------------------------------------
  # atoi (ASCII to Integer)
  #
  # Input:
  #   a0 = address of decimal string (may start with '-', followed by digits)
  #   a1 = length of the string
  #
  # Output:
  #   a0 = integer value of that string
  #
  # Description:
  #   - If the first character is '-', then parse the rest as digits
  #     and return the negative of that value.
  #   - Otherwise, treat all characters as digits ('0'..'9').
  #
  # Assumptions:
  #   - The string is valid and contains only an optional '-' plus digits,
  #     or the function’s caller already ensures validity.
  #---------------------------------------------------------------------
atoi:
  # Prologue: save RA and s-registers
  addi    sp, sp, -20
  sw      ra,  0(sp)
  sw      s0,  4(sp)
  sw      s1,  8(sp)
  sw      s2, 12(sp)
  sw      s3, 16(sp)

  mv      s0, a0        # s0 = pointer to string
  mv      s1, a1        # s1 = remaining length
  li      s2, 0         # s2 = accumulator (result)
  li      t0, 10        # t0 = base (10)
  li      s3, 0         # s3 = sign flag (0 = positive, 1 = negative)

  # If string is empty, result stays 0
  beqz    s1, .L_atoi_done

  # Check for optional leading '-'
  lbu     t1, 0(s0)      # load first character
  li      t2, '-'
  bne     t1, t2, .L_parse_digits  # if not '-', skip sign logic

  # If '-' is found, set sign flag to negative
  li      s3, 1
  addi    s0, s0, 1      # skip the '-'
  addi    s1, s1, -1     # adjust the remaining length

.L_parse_digits:
  # Loop over remaining digits
.L_atoi_loop:
  beqz    s1, .L_atoi_done  # stop if no characters left

  # result = result * 10
  mul     s2, s2, t0

  # add current digit
  lbu     t1, 0(s0)       # load ASCII digit
  addi    t1, t1, -48     # convert '0'..'9' to 0..9
  add     s2, s2, t1

  # advance pointers
  addi    s0, s0, 1
  addi    s1, s1, -1
  j       .L_atoi_loop

.L_atoi_done:
  # If negative flag was set, flip the sign
  beqz    s3, .L_return_result
  neg     s2, s2

.L_return_result:
  mv      a0, s2

  # Epilogue: restore RA and s-registers
  lw      ra,  0(sp)
  lw      s0,  4(sp)
  lw      s1,  8(sp)
  lw      s2, 12(sp)
  lw      s3, 16(sp)
  addi    sp, sp, 20
  ret

  #---------------------------------------------------------------------
  # puts
  #
  # Input:
  #   a0 = address of a null-terminated string
  #
  # Description:
  #   Prints characters one at a time until it hits a null terminator.
  #   Assumes an external function putc is available to print a single char.
  #---------------------------------------------------------------------
puts:
  # Prologue
  addi    sp, sp, -8
  sw      ra, 0(sp)
  sw      s0, 4(sp)

  mv      s0, a0

.L_puts_loop:
  lbu     a0, 0(s0)        # load current char
  beqz    a0, .L_puts_done # if '\0', stop
  call    putc             # print char
  addi    s0, s0, 1        # next char
  j       .L_puts_loop

.L_puts_done:
  # Epilogue
  lw      ra, 0(sp)
  lw      s0, 4(sp)
  addi    sp, sp, 8
  ret

  #---------------------------------------------------------------------
  # puts_len
  #
  # Input:
  #   a0 = address of string
  #   a1 = length
  #
  # Description:
  #   Prints exactly 'length' characters from the given address.
  #   Calls an external function putc to print a single char.
  #---------------------------------------------------------------------
puts_len:
  # Prologue
  addi    sp, sp, -12
  sw      ra,  0(sp)
  sw      s0,  4(sp)
  sw      s1,  8(sp)

  mv      s0, a0         # string address
  mv      s1, a1         # length

.L_puts_len_loop:
  beqz    s1, .L_puts_len_done  # if length == 0, done

  lbu     a0, 0(s0)       # load current char
  call    putc            # print char

  addi    s0, s0, 1
  addi    s1, s1, -1
  j       .L_puts_len_loop

.L_puts_len_done:
  # Epilogue
  lw      ra,  0(sp)
  lw      s0,  4(sp)
  lw      s1,  8(sp)
  addi    sp, sp, 12
  ret


#---------------------------------------------------------------------
  # print_unsigned_hex
  #
  # Input:
  #   a0 = unsigned integer to print in hexadecimal format
  #
  # Description:
  #   1) Prints "0x" prefix
  #   2) Extracts each 4-bit nibble from most to least significant
  #   3) Converts each nibble to its ASCII hex digit ('0'-'9', 'a'-'f')
  #   4) Skips leading zeros but always prints at least one digit
  #
  # Notes:
  #   - Uses putc to print individual characters
  #   - Prints lowercase hex digits (a-f) for values 10-15
  #   - Always includes "0x" prefix for clarity
  #---------------------------------------------------------------------
print_unsigned_hex:
  # Prologue
  addi    sp, sp, -20
  sw      ra,  0(sp)
  sw      s0,  4(sp)
  sw      s1,  8(sp)
  sw      s2, 12(sp)
  sw      s3, 16(sp)

  mv      s0, a0             # s0 = number to print
  li      s1, 28             # s1 = current shift amount (7 nibbles * 4)
  li      s2, 0              # s2 = leading zeros flag (0 = still skipping)

  # Print "0x" prefix
  li      a0, '0'
  call    putc
  li      a0, 'x'
  call    putc

.L_print_hex_loop:
  # Extract current nibble
  mv      t0, s0
  srl     t0, t0, s1        # shift right to get current nibble
  andi    t0, t0, 0xf       # mask to get just the nibble

  # Skip this digit if it's a leading zero (unless it's the last digit)
  bnez    t0, .L_print_digit    # if non-zero, must print it
  bnez    s2, .L_print_digit    # if already printed something, must continue
  beqz    s1, .L_print_digit    # if it's the last digit, must print even if zero

  # This is a leading zero we can skip
  j       .L_next_nibble

.L_print_digit:
  li      s2, 1             # mark that we're now printing digits

  # Convert to ASCII
  li      t1, 10
  blt     t0, t1, .L_numeric    # if < 10, use '0'-'9'

  # Handle a-f (value 10-15)
  addi    t0, t0, 'a' - 10
  j       .L_print_char

.L_numeric:
  # Handle 0-9
  addi    t0, t0, '0'

.L_print_char:
  mv      a0, t0
  call    putc

.L_next_nibble:
  addi    s1, s1, -4        # move to next nibble
  bgez    s1, .L_print_hex_loop

  # Epilogue
  lw      ra,  0(sp)
  lw      s0,  4(sp)
  lw      s1,  8(sp)
  lw      s2, 12(sp)
  lw      s3, 16(sp)
  addi    sp, sp, 20
  ret



  #---------------------------------------------------------------------
  # print_int
  #
  # Input:
  #   a0 = integer to print
  #
  # Description:
  #   1) Checks if the number is 0; prints '0' if so.
  #   2) If negative, print a '-', then flip it positive.
  #   3) Continuously take remainder by 10, push ASCII digit onto stack,
  #      then pop them off in reverse order to print.
  #---------------------------------------------------------------------
print_int:
  # Prologue
  addi    sp, sp, -16
  sw      ra,  0(sp)
  sw      s0,  4(sp)
  sw      s1,  8(sp)
  sw      s2, 12(sp)

  mv      s0, a0             # s0 = integer to print
  mv      s1, sp             # s1 = stack pointer for pushing digits
  li      s2, 10             # divisor = 10

  # Handle zero as special case
  bnez    s0, .L_pi_convert
  li      a0, '0'
  call    putc
  j       .L_pi_done

.L_pi_convert:
  # Handle negative numbers
  bgez    s0, .L_pi_digits
  li      a0, '-'
  call    putc
  neg     s0, s0

.L_pi_digits:
  # Repeatedly divide s0 by 10, push remainder digit onto stack
  beqz    s0, .L_pi_print
  rem     t0, s0, s2         # remainder
  addi    t0, t0, 48         # + '0'
  addi    s1, s1, -4
  sw      t0, 0(s1)
  div     s0, s0, s2
  j       .L_pi_digits

.L_pi_print:
  # Pop digits and print
  beq     s1, sp, .L_pi_done
  lw      a0, 0(s1)
  call    putc
  addi    s1, s1, 4
  j       .L_pi_print

.L_pi_done:
  # Epilogue
  lw      ra,  0(sp)
  lw      s0,  4(sp)
  lw      s1,  8(sp)
  lw      s2, 12(sp)
  addi    sp, sp, 16
  ret

  #---------------------------------------------------------------------
  # memcmp
  #
  # Inputs:
  #   a0 = ptr1 (start address of first buffer)
  #   a1 = len1 (number of bytes in first buffer)
  #   a2 = ptr2 (start address of second buffer)
  #   a3 = len2 (number of bytes in second buffer)
  #
  # Output:
  #   a0 = -1 if buffers have same length and contents
  #   a0 =  0 otherwise (length mismatch or byte mismatch)
  #---------------------------------------------------------------------
memcmp:
  # First, check if lengths are equal
  bne     a1, a3, .L_not_equal   # lengths differ => not equal

  # If length is 0 and they are both the same size, they're "equal" (both empty)
  beqz    a1, .L_equal

.L_compare_loop:
  lbu     t0, 0(a0)              # load byte from first buffer
  lbu     t1, 0(a2)              # load byte from second buffer
  bne     t0, t1, .L_not_equal   # mismatch => not equal

  addi    a0, a0, 1              # advance ptr1
  addi    a2, a2, 1              # advance ptr2
  addi    a1, a1, -1             # decrement length
  bnez    a1, .L_compare_loop    # if more bytes to compare, continue

.L_equal:
  li      a0, -1                 # indicate "equal"
  ret

.L_not_equal:
  li      a0, 0                  # indicate "not equal"
  ret

  .end
  # forth.s
  .section .text
  .globl forth

  .macro PUSH reg
  addi :SP, :SP, -4
  sw \reg, 0(:SP)
  .endm

  .macro POP reg
  lw \reg, 0(:SP)
  addi :SP, :SP, 4
  .endm

forth:
  la :SP, FORTH_STACK_END
  la :IP, bytecode

  # start the program
  j NEXT

NEXT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  jr t0

  # ( a b -- c )
PLUS:
  POP t0
  POP t1
  add t0, t0, t1
  PUSH t0
  j NEXT

  # ( -- n )
LITERAL:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  PUSH t0
  j NEXT

  # ( n -- )
EMIT:
  POP a0
  jal print_int
  j NEXT

  # ( value addr -- )
WRITE:
  POP t0 # address
  POP t1 # value
  sw t1, 0(t0)
  j NEXT

  # ( -- )
BYE:
  j qemu_exit

  # ( -- )
CR:
  li a0, '\n'
  jal putc
  j NEXT

  # ( addr -- len addr )
PARSE_TOKEN:
  POP a0
  jal token
  PUSH a1 # length
  PUSH a0 # token address
  j NEXT

  # ( len addr -- n )
ATOI:
  POP a0 # address
  POP a1 # length
  jal atoi
  PUSH a0
  j NEXT

  # ( len addr -- f )
IS_NUMBER:
  POP a0 # address
  POP a1 # length
  jal is_number
  PUSH a0
  j NEXT

  # ( a -- a a )
DUP:
  POP t0
  PUSH t0
  PUSH t0
  j NEXT

  # ( a b -- b a )
SWAP:
  POP t0 # b
  POP t1 # a
  PUSH t0
  PUSH t1
  j NEXT

  # ( a -- )
DROP:
  POP zero
  j NEXT

  # ( a b -- )
TWODROP:
  POP zero
  POP zero
  j NEXT

  # ( a b -- a b a b )
TWODUP:
  POP t0 # b
  POP t1 # a
  PUSH t1 # a
  PUSH t0 # b
  PUSH t1 # a
  PUSH t0 # b
  j NEXT

  # ( n1 n2 -- n1 n2 n1 )
OVER:
  POP t0 # n2
  POP t1 # n1
  PUSH t1 # n1
  PUSH t0 # n2
  PUSH t1 # n1
  j NEXT


  # (x1 x2 x3 x4 -- x3 x4 x1 x2)
TWOSWAP:
  POP t0 # x4
  POP t1 # x3
  POP t2 # x2
  POP t3 # x1

  PUSH t1
  PUSH t0
  PUSH t3
  PUSH t2
  j NEXT
  # (x1 x2 x3 -- x2 x3 x1 )
ROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t1 # x2
  PUSH t0 # x3
  PUSH t2 # x1
  j NEXT

  # (x1 x2 x3 -- x3 x1 x2)
NROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t0 # x3
  PUSH t2 # x1
  PUSH t1 # x2
  j NEXT

  # ( a b -- f)
EQUAL:
  POP t0
  POP t1
  beq t0, t1, .L_equal
  li t0, 0
  PUSH t0
  j NEXT
.L_equal:
  li t0, -1
  PUSH t0
  j NEXT

  # ( len1 addr1 len2 addr2 -- flag)
MEMCMP:
  POP a2
  POP a3
  POP a0
  POP a1
  call memcmp
  PUSH a0
  j NEXT

  # ( f -- )
BRANCH_ON_ZERO:
  POP t0
  beqz t0, .L_do_branch
  addi :IP, :IP, 4
  j NEXT
.L_do_branch:
  lw :IP, 0(:IP)
  j NEXT

  # ( -- )
JUMP:
  lw :IP, 0(:IP)
  j NEXT

  # just a debug function to print the whole stack
  # print debugging.. some people hate it some people love it
  # i both hate it and love it
DEBUG_STACK:
  addi sp, sp, -12
  sw ra,  0(sp)
  sw s8,  4(sp)
  sw s9,  8(sp)

  li a0, '<'
  call putc
  li a0, '>'
  call putc
  li a0, ' '
  call putc

  mv s9, :SP
  add s9, s9, -4
  la s8, FORTH_STACK_END
  add s8, s8, -4
.L_debug_stack_loop:
  beq s8, s9, .L_debug_stack_loop_end
  lw a0, 0(s8)
  call print_unsigned_hex
  li a0, ' '
  call putc
  addi s8, s8, -4
  j .L_debug_stack_loop

.L_debug_stack_loop_end:
  li a0, '\n'
  call putc

  lw ra,  0(sp)
  lw s8,  4(sp)
  lw s9,  8(sp)
  addi sp, sp, 12
  j NEXT


human_program:
  .asciz "842 31 + 721 + 3 + . bye"

  # This bytecode says:
  #   1) Push address of human_program onto stack.
  #   2) Go parse tokens from that string.
  #   3) Decide if each token is a number or a known word (+, ., bye).
  #   4) Execute the corresponding Forth logic.

  # lets assume human_program is at address 1000

bytecode:
  .word LITERAL
  .word human_program    # 1000

  # parse the token, and check if we have reached end of string
next_token:
  .word PARSE_TOKEN            # ( addr -- len addr)
                         # 3 1000 for the first token 842, length is 3, address is 1000
                         # 2 1004 for the second token: 31, length is 2 address is 1004

  .word OVER             # ( n1 n2 -- n1 n2 n1 )
                         # 3 1000 3
  .word LITERAL
  .word 0                # push 0 to the stack
                         # 3 1000 3 0
                         # we want to compare if the token's length is 0
                         # so we push 0 and call equal

  .word EQUAL            # ( n1 n2 -- flag )
                         # push -1 if n1 == n2, 0 otherwise
                         # 3 1000 -1/0

  .word BRANCH_ON_ZERO   # ( flag -- )
                         # pop the flag and if 0, jump to the next instruction, if not continue

  .word check_is_number  # if we have a token (flag is -1) check if its a number

  .word BYE              # no token left, quit qemu


  # check if the token is a number, and if it is convert it to integer and push it to the stack
check_is_number:
                         # when we come here, the stacks is: len addr of the token
  .word TWODUP           # (n1 n2 -- n1 n2 n1 n2)
                         # duplicate the token len and addr because IS_NUMBER
                         # will pop len addr and return a flag if the token is number, and
                         # we still want to use the actual token after that
  .word IS_NUMBER        # ( len addr -- flag )
                         # 3 1000 3 1000 -> 3 1000 -1/0
  .word LITERAL
  .word -1               # push -1, stack becomes: len addr flag -1
                         # we want to compare IS_NUMBER with -1 (true), so we push -1
                         # and call equal
  .word EQUAL            # ( n1 n2 -- flag)
  .word BRANCH_ON_ZERO   # ( flag -- )
  .word not_a_number     # if the result of equal is zero, means the token is not a number
  .word TWODUP           # otherwise it is a number
                         # duplicate the len addr so we can convert it from string to a 4 byte number
                         # stack is now 3 1000 3 1000

  .word ATOI             # ( len addr -- value )
                         # stack: 3 1000 842
                         # now the token is properly converted to a number and is on top of the stack

  .word NROT             # ( n1 n2 n3 -- n3 n1 n2 )
                         # stack: 842 3 1000
                         # we want to -rot the stack so that the token length and address are on top
                         # we want to add the length to the address and go parse the next token
  .word PLUS             # ( n1 n2 -- n )
                         # stack 852 1003
  .word JUMP             # jump to next token
  .word next_token

  # if its not a number, check if its a dot . for EMIT
not_a_number:
  .word TWODUP           # when we come here the stack is: ... len addr
                         # duplicate the token to be compared with "."
  .word LITERAL
  .word 1                # length of "."
  .word LITERAL
  .word string_dot       # address of the string "."
  .word MEMCMP           # ( len1 addr1 len2 addr2 -- flag)
  .word BRANCH_ON_ZERO   # ( flag -- )
  .word not_a_dot        # if memcmp pushes 0 to the stack, then the token is not "."
                         # otherwise prepare the stack to call EMIT to print it
  .word ROT              # ( x1 x2 x3 -- x2 x3 x1 )
                         # rotate the stack so we get the value that shoud've been pushed
                         # to the stack before we come here, so the stack is len addr value
                         #
  .word EMIT             # ( v -- )
                         # print the top of the stack, after it becomes len addr of the token
  .word PLUS             # add the token addr and its length, and go to the next token
  .word JUMP             # jump to next token
  .word next_token       #
not_a_dot:
  .word TWODUP
  .word LITERAL
  .word 1
  .word LITERAL
  .word string_plus
  .word MEMCMP
  .word BRANCH_ON_ZERO
  .word not_a_plus
  .word TWOSWAP
  .word PLUS
  .word NROT
  .word PLUS
  .word JUMP
  .word next_token
not_a_plus:
  .word TWODUP
  .word LITERAL
  .word 3
  .word LITERAL
  .word string_bye
  .word MEMCMP
  .word BRANCH_ON_ZERO
  .word do_next_token
  .word BYE
do_next_token:
  .word PLUS
  .word JUMP
  .word next_token


string_dot:
  .ascii "."
  .zero 3

string_plus:
  .ascii "+"
  .zero 3

string_bye:
  .ascii "bye"
  .zero 1
  .space 1024
FORTH_STACK_END:

  .end

First relax, thats a lot of code.

This is the compiled machine code. The program is to be loaded at address 0x80000000. Lets look at its purest form, where nothing is hidden, no secrets, no pseudo instructions, no macros, no words, no comments. As close as we can get to the wires. And yet, there are no wires, our QEMU computer is a computer within a computer.

80000000 <_start>:
80000000:	08000117          	auipc	sp,0x8000
80000004:	00010113          	addi	sp,sp,0 # 88000000 <_ram_end>
80000008:	0040006f          	jal	zero,8000000c <forth>

8000000c <forth>:
8000000c:	00001497          	auipc	s1,0x1
80000010:	87148493          	addi	s1,s1,-1935 # 8000087d <FORTH_STACK_END>
80000014:	00000417          	auipc	s0,0x0
80000018:	36540413          	addi	s0,s0,869 # 80000379 <bytecode>
8000001c:	0040006f          	jal	zero,80000020 <NEXT>

80000020 <NEXT>:
80000020:	00042283          	lw	t0,0(s0)
80000024:	00440413          	addi	s0,s0,4
80000028:	00028067          	jalr	zero,0(t0)

8000002c <PLUS>:
8000002c:	0004a283          	lw	t0,0(s1)
80000030:	00448493          	addi	s1,s1,4
80000034:	0004a303          	lw	t1,0(s1)
80000038:	00448493          	addi	s1,s1,4
8000003c:	006282b3          	add	t0,t0,t1
80000040:	ffc48493          	addi	s1,s1,-4
80000044:	0054a023          	sw	t0,0(s1)
80000048:	fd9ff06f          	jal	zero,80000020 <NEXT>

8000004c <LITERAL>:
8000004c:	00042283          	lw	t0,0(s0)
80000050:	00440413          	addi	s0,s0,4
80000054:	ffc48493          	addi	s1,s1,-4
80000058:	0054a023          	sw	t0,0(s1)
8000005c:	fc5ff06f          	jal	zero,80000020 <NEXT>

80000060 <EMIT>:
80000060:	0004a503          	lw	a0,0(s1)
80000064:	00448493          	addi	s1,s1,4
80000068:	28d000ef          	jal	ra,80000af4 <print_int>
8000006c:	fb5ff06f          	jal	zero,80000020 <NEXT>

80000070 <WRITE>:
80000070:	0004a283          	lw	t0,0(s1)
80000074:	00448493          	addi	s1,s1,4
80000078:	0004a303          	lw	t1,0(s1)
8000007c:	00448493          	addi	s1,s1,4
80000080:	0062a023          	sw	t1,0(t0)
80000084:	f9dff06f          	jal	zero,80000020 <NEXT>

80000088 <BYE>:
80000088:	0290006f          	jal	zero,800008b0 <qemu_exit>

8000008c <CR>:
8000008c:	00a00513          	addi	a0,zero,10
80000090:	7f0000ef          	jal	ra,80000880 <putc>
80000094:	f8dff06f          	jal	zero,80000020 <NEXT>

80000098 <PARSE_TOKEN>:
80000098:	0004a503          	lw	a0,0(s1)
8000009c:	00448493          	addi	s1,s1,4
800000a0:	025000ef          	jal	ra,800008c4 <token>
800000a4:	ffc48493          	addi	s1,s1,-4
800000a8:	00b4a023          	sw	a1,0(s1)
800000ac:	ffc48493          	addi	s1,s1,-4
800000b0:	00a4a023          	sw	a0,0(s1)
800000b4:	f6dff06f          	jal	zero,80000020 <NEXT>

800000b8 <ATOI>:
800000b8:	0004a503          	lw	a0,0(s1)
800000bc:	00448493          	addi	s1,s1,4
800000c0:	0004a583          	lw	a1,0(s1)
800000c4:	00448493          	addi	s1,s1,4
800000c8:	091000ef          	jal	ra,80000958 <atoi>
800000cc:	ffc48493          	addi	s1,s1,-4
800000d0:	00a4a023          	sw	a0,0(s1)
800000d4:	f4dff06f          	jal	zero,80000020 <NEXT>

800000d8 <IS_NUMBER>:
800000d8:	0004a503          	lw	a0,0(s1)
800000dc:	00448493          	addi	s1,s1,4
800000e0:	0004a583          	lw	a1,0(s1)
800000e4:	00448493          	addi	s1,s1,4
800000e8:	019000ef          	jal	ra,80000900 <is_number>
800000ec:	ffc48493          	addi	s1,s1,-4
800000f0:	00a4a023          	sw	a0,0(s1)
800000f4:	f2dff06f          	jal	zero,80000020 <NEXT>

800000f8 <DUP>:
800000f8:	0004a283          	lw	t0,0(s1)
800000fc:	00448493          	addi	s1,s1,4
80000100:	ffc48493          	addi	s1,s1,-4
80000104:	0054a023          	sw	t0,0(s1)
80000108:	ffc48493          	addi	s1,s1,-4
8000010c:	0054a023          	sw	t0,0(s1)
80000110:	f11ff06f          	jal	zero,80000020 <NEXT>

80000114 <SWAP>:
80000114:	0004a283          	lw	t0,0(s1)
80000118:	00448493          	addi	s1,s1,4
8000011c:	0004a303          	lw	t1,0(s1)
80000120:	00448493          	addi	s1,s1,4
80000124:	ffc48493          	addi	s1,s1,-4
80000128:	0054a023          	sw	t0,0(s1)
8000012c:	ffc48493          	addi	s1,s1,-4
80000130:	0064a023          	sw	t1,0(s1)
80000134:	eedff06f          	jal	zero,80000020 <NEXT>

80000138 <DROP>:
80000138:	0004a003          	lw	zero,0(s1)
8000013c:	00448493          	addi	s1,s1,4
80000140:	ee1ff06f          	jal	zero,80000020 <NEXT>

80000144 <TWODROP>:
80000144:	0004a003          	lw	zero,0(s1)
80000148:	00448493          	addi	s1,s1,4
8000014c:	0004a003          	lw	zero,0(s1)
80000150:	00448493          	addi	s1,s1,4
80000154:	ecdff06f          	jal	zero,80000020 <NEXT>

80000158 <TWODUP>:
80000158:	0004a283          	lw	t0,0(s1)
8000015c:	00448493          	addi	s1,s1,4
80000160:	0004a303          	lw	t1,0(s1)
80000164:	00448493          	addi	s1,s1,4
80000168:	ffc48493          	addi	s1,s1,-4
8000016c:	0064a023          	sw	t1,0(s1)
80000170:	ffc48493          	addi	s1,s1,-4
80000174:	0054a023          	sw	t0,0(s1)
80000178:	ffc48493          	addi	s1,s1,-4
8000017c:	0064a023          	sw	t1,0(s1)
80000180:	ffc48493          	addi	s1,s1,-4
80000184:	0054a023          	sw	t0,0(s1)
80000188:	e99ff06f          	jal	zero,80000020 <NEXT>

8000018c <OVER>:
8000018c:	0004a283          	lw	t0,0(s1)
80000190:	00448493          	addi	s1,s1,4
80000194:	0004a303          	lw	t1,0(s1)
80000198:	00448493          	addi	s1,s1,4
8000019c:	ffc48493          	addi	s1,s1,-4
800001a0:	0064a023          	sw	t1,0(s1)
800001a4:	ffc48493          	addi	s1,s1,-4
800001a8:	0054a023          	sw	t0,0(s1)
800001ac:	ffc48493          	addi	s1,s1,-4
800001b0:	0064a023          	sw	t1,0(s1)
800001b4:	e6dff06f          	jal	zero,80000020 <NEXT>

800001b8 <TWOSWAP>:
800001b8:	0004a283          	lw	t0,0(s1)
800001bc:	00448493          	addi	s1,s1,4
800001c0:	0004a303          	lw	t1,0(s1)
800001c4:	00448493          	addi	s1,s1,4
800001c8:	0004a383          	lw	t2,0(s1)
800001cc:	00448493          	addi	s1,s1,4
800001d0:	0004ae03          	lw	t3,0(s1)
800001d4:	00448493          	addi	s1,s1,4
800001d8:	ffc48493          	addi	s1,s1,-4
800001dc:	0064a023          	sw	t1,0(s1)
800001e0:	ffc48493          	addi	s1,s1,-4
800001e4:	0054a023          	sw	t0,0(s1)
800001e8:	ffc48493          	addi	s1,s1,-4
800001ec:	01c4a023          	sw	t3,0(s1)
800001f0:	ffc48493          	addi	s1,s1,-4
800001f4:	0074a023          	sw	t2,0(s1)
800001f8:	e29ff06f          	jal	zero,80000020 <NEXT>

800001fc <ROT>:
800001fc:	0004a283          	lw	t0,0(s1)
80000200:	00448493          	addi	s1,s1,4
80000204:	0004a303          	lw	t1,0(s1)
80000208:	00448493          	addi	s1,s1,4
8000020c:	0004a383          	lw	t2,0(s1)
80000210:	00448493          	addi	s1,s1,4
80000214:	ffc48493          	addi	s1,s1,-4
80000218:	0064a023          	sw	t1,0(s1)
8000021c:	ffc48493          	addi	s1,s1,-4
80000220:	0054a023          	sw	t0,0(s1)
80000224:	ffc48493          	addi	s1,s1,-4
80000228:	0074a023          	sw	t2,0(s1)
8000022c:	df5ff06f          	jal	zero,80000020 <NEXT>

80000230 <NROT>:
80000230:	0004a283          	lw	t0,0(s1)
80000234:	00448493          	addi	s1,s1,4
80000238:	0004a303          	lw	t1,0(s1)
8000023c:	00448493          	addi	s1,s1,4
80000240:	0004a383          	lw	t2,0(s1)
80000244:	00448493          	addi	s1,s1,4
80000248:	ffc48493          	addi	s1,s1,-4
8000024c:	0054a023          	sw	t0,0(s1)
80000250:	ffc48493          	addi	s1,s1,-4
80000254:	0074a023          	sw	t2,0(s1)
80000258:	ffc48493          	addi	s1,s1,-4
8000025c:	0064a023          	sw	t1,0(s1)
80000260:	dc1ff06f          	jal	zero,80000020 <NEXT>

80000264 <EQUAL>:
80000264:	0004a283          	lw	t0,0(s1)
80000268:	00448493          	addi	s1,s1,4
8000026c:	0004a303          	lw	t1,0(s1)
80000270:	00448493          	addi	s1,s1,4
80000274:	00628a63          	beq	t0,t1,80000288 <EQUAL+0x24>
80000278:	00000293          	addi	t0,zero,0
8000027c:	ffc48493          	addi	s1,s1,-4
80000280:	0054a023          	sw	t0,0(s1)
80000284:	d9dff06f          	jal	zero,80000020 <NEXT>
80000288:	fff00293          	addi	t0,zero,-1
8000028c:	ffc48493          	addi	s1,s1,-4
80000290:	0054a023          	sw	t0,0(s1)
80000294:	d8dff06f          	jal	zero,80000020 <NEXT>

80000298 <MEMCMP>:
80000298:	0004a603          	lw	a2,0(s1)
8000029c:	00448493          	addi	s1,s1,4
800002a0:	0004a683          	lw	a3,0(s1)
800002a4:	00448493          	addi	s1,s1,4
800002a8:	0004a503          	lw	a0,0(s1)
800002ac:	00448493          	addi	s1,s1,4
800002b0:	0004a583          	lw	a1,0(s1)
800002b4:	00448493          	addi	s1,s1,4
800002b8:	0c5000ef          	jal	ra,80000b7c <memcmp>
800002bc:	ffc48493          	addi	s1,s1,-4
800002c0:	00a4a023          	sw	a0,0(s1)
800002c4:	d5dff06f          	jal	zero,80000020 <NEXT>

800002c8 <BRANCH_ON_ZERO>:
800002c8:	0004a283          	lw	t0,0(s1)
800002cc:	00448493          	addi	s1,s1,4
800002d0:	00028663          	beq	t0,zero,800002dc <BRANCH_ON_ZERO+0x14>
800002d4:	00440413          	addi	s0,s0,4
800002d8:	d49ff06f          	jal	zero,80000020 <NEXT>
800002dc:	00042403          	lw	s0,0(s0)
800002e0:	d41ff06f          	jal	zero,80000020 <NEXT>

800002e4 <JUMP>:
800002e4:	00042403          	lw	s0,0(s0)
800002e8:	d39ff06f          	jal	zero,80000020 <NEXT>

800002ec <DEBUG_STACK>:
800002ec:	ff410113          	addi	sp,sp,-12
800002f0:	00112023          	sw	ra,0(sp)
800002f4:	01812223          	sw	s8,4(sp)
800002f8:	01912423          	sw	s9,8(sp)
800002fc:	03c00513          	addi	a0,zero,60
80000300:	580000ef          	jal	ra,80000880 <putc>
80000304:	03e00513          	addi	a0,zero,62
80000308:	578000ef          	jal	ra,80000880 <putc>
8000030c:	02000513          	addi	a0,zero,32
80000310:	570000ef          	jal	ra,80000880 <putc>
80000314:	00048c93          	addi	s9,s1,0
80000318:	ffcc8c93          	addi	s9,s9,-4
8000031c:	00000c17          	auipc	s8,0x0
80000320:	561c0c13          	addi	s8,s8,1377 # 8000087d <FORTH_STACK_END>
80000324:	ffcc0c13          	addi	s8,s8,-4
80000328:	019c0e63          	beq	s8,s9,80000344 <DEBUG_STACK+0x58>
8000032c:	000c2503          	lw	a0,0(s8)
80000330:	730000ef          	jal	ra,80000a60 <print_unsigned_hex>
80000334:	02000513          	addi	a0,zero,32
80000338:	548000ef          	jal	ra,80000880 <putc>
8000033c:	ffcc0c13          	addi	s8,s8,-4
80000340:	fe9ff06f          	jal	zero,80000328 <DEBUG_STACK+0x3c>
80000344:	00a00513          	addi	a0,zero,10
80000348:	538000ef          	jal	ra,80000880 <putc>
8000034c:	00012083          	lw	ra,0(sp)
80000350:	00412c03          	lw	s8,4(sp)
80000354:	00812c83          	lw	s9,8(sp)
80000358:	00c10113          	addi	sp,sp,12
8000035c:	cc5ff06f          	jal	zero,80000020 <NEXT>

80000360 <human_program>:
80000360:	20323438          	.word	0x20323438
80000364:	2b203133          	.word	0x2b203133
80000368:	31323720          	.word	0x31323720
8000036c:	33202b20          	.word	0x33202b20
80000370:	2e202b20          	.word	0x2e202b20
80000374:	65796220          	.word	0x65796220
	...

80000379 <bytecode>:
80000379:	8000004c          	.word	0x8000004c
8000037d:	80000360          	.word	0x80000360

80000381 <next_token>:
80000381:	80000098          	.word	0x80000098
80000385:	8000018c          	.word	0x8000018c
80000389:	8000004c          	.word	0x8000004c
8000038d:	00000000          	.word	0x00000000
80000391:	80000264          	.word	0x80000264
80000395:	800002c8          	.word	0x800002c8
80000399:	800003a1          	.word	0x800003a1
8000039d:	80000088          	.word	0x80000088

800003a1 <check_is_number>:
800003a1:	80000158          	.word	0x80000158
800003a5:	800000d8          	.word	0x800000d8
800003a9:	8000004c          	.word	0x8000004c
800003ad:	ffffffff          	.word	0xffffffff
800003b1:	80000264          	.word	0x80000264
800003b5:	800002c8          	.word	0x800002c8
800003b9:	800003d5          	.word	0x800003d5
800003bd:	80000158          	.word	0x80000158
800003c1:	800000b8          	.word	0x800000b8
800003c5:	80000230          	.word	0x80000230
800003c9:	8000002c          	.word	0x8000002c
800003cd:	800002e4          	.word	0x800002e4
800003d1:	80000381          	.word	0x80000381

800003d5 <not_a_number>:
800003d5:	80000158          	.word	0x80000158
800003d9:	8000004c          	.word	0x8000004c
800003dd:	00000001          	.word	0x00000001
800003e1:	8000004c          	.word	0x8000004c
800003e5:	80000471          	.word	0x80000471
800003e9:	80000298          	.word	0x80000298
800003ed:	800002c8          	.word	0x800002c8
800003f1:	80000409          	.word	0x80000409
800003f5:	800001fc          	.word	0x800001fc
800003f9:	80000060          	.word	0x80000060
800003fd:	8000002c          	.word	0x8000002c
80000401:	800002e4          	.word	0x800002e4
80000405:	80000381          	.word	0x80000381

80000409 <not_a_dot>:
80000409:	80000158          	.word	0x80000158
8000040d:	8000004c          	.word	0x8000004c
80000411:	00000001          	.word	0x00000001
80000415:	8000004c          	.word	0x8000004c
80000419:	80000475          	.word	0x80000475
8000041d:	80000298          	.word	0x80000298
80000421:	800002c8          	.word	0x800002c8
80000425:	80000441          	.word	0x80000441
80000429:	800001b8          	.word	0x800001b8
8000042d:	8000002c          	.word	0x8000002c
80000431:	80000230          	.word	0x80000230
80000435:	8000002c          	.word	0x8000002c
80000439:	800002e4          	.word	0x800002e4
8000043d:	80000381          	.word	0x80000381

80000441 <not_a_plus>:
80000441:	80000158          	.word	0x80000158
80000445:	8000004c          	.word	0x8000004c
80000449:	00000003          	.word	0x00000003
8000044d:	8000004c          	.word	0x8000004c
80000451:	80000479          	.word	0x80000479
80000455:	80000298          	.word	0x80000298
80000459:	800002c8          	.word	0x800002c8
8000045d:	80000465          	.word	0x80000465
80000461:	80000088          	.word	0x80000088

80000465 <do_next_token>:
80000465:	8000002c          	.word	0x8000002c
80000469:	800002e4          	.word	0x800002e4
8000046d:	80000381          	.word	0x80000381

80000471 <string_dot>:
80000471:	0000002e          	.word	0x0000002e

80000475 <string_plus>:
80000475:	0000002b          	.word	0x0000002b

80000479 <string_bye>:
80000479:	00657962          	.word	0x00657962
	...

8000087d <FORTH_STACK_END>:
8000087d:	0000                	.insn	2, 0x
	...

80000880 <putc>:
80000880:	100002b7          	lui	t0,0x10000
80000884:	0052c303          	lbu	t1,5(t0) # 10000005 <_start-0x6ffffffb>
80000888:	02037313          	andi	t1,t1,32
8000088c:	fe030ce3          	beq	t1,zero,80000884 <putc+0x4>
80000890:	00a28023          	sb	a0,0(t0)
80000894:	00008067          	jalr	zero,0(ra)

80000898 <getch>:
80000898:	100002b7          	lui	t0,0x10000
8000089c:	0052c303          	lbu	t1,5(t0) # 10000005 <_start-0x6ffffffb>
800008a0:	00137313          	andi	t1,t1,1
800008a4:	fe030ce3          	beq	t1,zero,8000089c <getch+0x4>
800008a8:	0002c503          	lbu	a0,0(t0)
800008ac:	00008067          	jalr	zero,0(ra)

800008b0 <qemu_exit>:
800008b0:	001002b7          	lui	t0,0x100
800008b4:	00005337          	lui	t1,0x5
800008b8:	55530313          	addi	t1,t1,1365 # 5555 <_start-0x7fffaaab>
800008bc:	0062a023          	sw	t1,0(t0) # 100000 <_start-0x7ff00000>
800008c0:	0000006f          	jal	zero,800008c0 <qemu_exit+0x10>

800008c4 <token>:
800008c4:	00050e13          	addi	t3,a0,0
800008c8:	00000593          	addi	a1,zero,0
800008cc:	02100313          	addi	t1,zero,33
800008d0:	000e4283          	lbu	t0,0(t3)
800008d4:	02028463          	beq	t0,zero,800008fc <token+0x38>
800008d8:	0062d663          	bge	t0,t1,800008e4 <token+0x20>
800008dc:	001e0e13          	addi	t3,t3,1
800008e0:	ff1ff06f          	jal	zero,800008d0 <token+0xc>
800008e4:	000e0513          	addi	a0,t3,0
800008e8:	000e4283          	lbu	t0,0(t3)
800008ec:	0062c863          	blt	t0,t1,800008fc <token+0x38>
800008f0:	00158593          	addi	a1,a1,1
800008f4:	001e0e13          	addi	t3,t3,1
800008f8:	ff1ff06f          	jal	zero,800008e8 <token+0x24>
800008fc:	00008067          	jalr	zero,0(ra)

80000900 <is_number>:
80000900:	04058863          	beq	a1,zero,80000950 <is_number+0x50>
80000904:	00050293          	addi	t0,a0,0
80000908:	00058313          	addi	t1,a1,0
8000090c:	0002c383          	lbu	t2,0(t0)
80000910:	02d00e13          	addi	t3,zero,45
80000914:	03c38663          	beq	t2,t3,80000940 <is_number+0x40>
80000918:	0002c383          	lbu	t2,0(t0)
8000091c:	03000e13          	addi	t3,zero,48
80000920:	03900e93          	addi	t4,zero,57
80000924:	03c3c663          	blt	t2,t3,80000950 <is_number+0x50>
80000928:	027ec463          	blt	t4,t2,80000950 <is_number+0x50>
8000092c:	00128293          	addi	t0,t0,1
80000930:	fff30313          	addi	t1,t1,-1
80000934:	fe0312e3          	bne	t1,zero,80000918 <is_number+0x18>
80000938:	fff00513          	addi	a0,zero,-1
8000093c:	00008067          	jalr	zero,0(ra)
80000940:	00128293          	addi	t0,t0,1
80000944:	fff30313          	addi	t1,t1,-1
80000948:	00030463          	beq	t1,zero,80000950 <is_number+0x50>
8000094c:	fcdff06f          	jal	zero,80000918 <is_number+0x18>
80000950:	00000513          	addi	a0,zero,0
80000954:	00008067          	jalr	zero,0(ra)

80000958 <atoi>:
80000958:	fec10113          	addi	sp,sp,-20
8000095c:	00112023          	sw	ra,0(sp)
80000960:	00812223          	sw	s0,4(sp)
80000964:	00912423          	sw	s1,8(sp)
80000968:	01212623          	sw	s2,12(sp)
8000096c:	01312823          	sw	s3,16(sp)
80000970:	00050413          	addi	s0,a0,0
80000974:	00058493          	addi	s1,a1,0
80000978:	00000913          	addi	s2,zero,0
8000097c:	00a00293          	addi	t0,zero,10
80000980:	00000993          	addi	s3,zero,0
80000984:	02048e63          	beq	s1,zero,800009c0 <atoi+0x68>
80000988:	00044303          	lbu	t1,0(s0)
8000098c:	02d00393          	addi	t2,zero,45
80000990:	00731863          	bne	t1,t2,800009a0 <atoi+0x48>
80000994:	00100993          	addi	s3,zero,1
80000998:	00140413          	addi	s0,s0,1
8000099c:	fff48493          	addi	s1,s1,-1
800009a0:	02048063          	beq	s1,zero,800009c0 <atoi+0x68>
800009a4:	02590933          	mul	s2,s2,t0
800009a8:	00044303          	lbu	t1,0(s0)
800009ac:	fd030313          	addi	t1,t1,-48
800009b0:	00690933          	add	s2,s2,t1
800009b4:	00140413          	addi	s0,s0,1
800009b8:	fff48493          	addi	s1,s1,-1
800009bc:	fe5ff06f          	jal	zero,800009a0 <atoi+0x48>
800009c0:	00098463          	beq	s3,zero,800009c8 <atoi+0x70>
800009c4:	41200933          	sub	s2,zero,s2
800009c8:	00090513          	addi	a0,s2,0
800009cc:	00012083          	lw	ra,0(sp)
800009d0:	00412403          	lw	s0,4(sp)
800009d4:	00812483          	lw	s1,8(sp)
800009d8:	00c12903          	lw	s2,12(sp)
800009dc:	01012983          	lw	s3,16(sp)
800009e0:	01410113          	addi	sp,sp,20
800009e4:	00008067          	jalr	zero,0(ra)

800009e8 <puts>:
800009e8:	ff810113          	addi	sp,sp,-8
800009ec:	00112023          	sw	ra,0(sp)
800009f0:	00812223          	sw	s0,4(sp)
800009f4:	00050413          	addi	s0,a0,0
800009f8:	00044503          	lbu	a0,0(s0)
800009fc:	00050863          	beq	a0,zero,80000a0c <puts+0x24>
80000a00:	e81ff0ef          	jal	ra,80000880 <putc>
80000a04:	00140413          	addi	s0,s0,1
80000a08:	ff1ff06f          	jal	zero,800009f8 <puts+0x10>
80000a0c:	00012083          	lw	ra,0(sp)
80000a10:	00412403          	lw	s0,4(sp)
80000a14:	00810113          	addi	sp,sp,8
80000a18:	00008067          	jalr	zero,0(ra)

80000a1c <puts_len>:
80000a1c:	ff410113          	addi	sp,sp,-12
80000a20:	00112023          	sw	ra,0(sp)
80000a24:	00812223          	sw	s0,4(sp)
80000a28:	00912423          	sw	s1,8(sp)
80000a2c:	00050413          	addi	s0,a0,0
80000a30:	00058493          	addi	s1,a1,0
80000a34:	00048c63          	beq	s1,zero,80000a4c <puts_len+0x30>
80000a38:	00044503          	lbu	a0,0(s0)
80000a3c:	e45ff0ef          	jal	ra,80000880 <putc>
80000a40:	00140413          	addi	s0,s0,1
80000a44:	fff48493          	addi	s1,s1,-1
80000a48:	fedff06f          	jal	zero,80000a34 <puts_len+0x18>
80000a4c:	00012083          	lw	ra,0(sp)
80000a50:	00412403          	lw	s0,4(sp)
80000a54:	00812483          	lw	s1,8(sp)
80000a58:	00c10113          	addi	sp,sp,12
80000a5c:	00008067          	jalr	zero,0(ra)

80000a60 <print_unsigned_hex>:
80000a60:	fec10113          	addi	sp,sp,-20
80000a64:	00112023          	sw	ra,0(sp)
80000a68:	00812223          	sw	s0,4(sp)
80000a6c:	00912423          	sw	s1,8(sp)
80000a70:	01212623          	sw	s2,12(sp)
80000a74:	01312823          	sw	s3,16(sp)
80000a78:	00050413          	addi	s0,a0,0
80000a7c:	01c00493          	addi	s1,zero,28
80000a80:	00000913          	addi	s2,zero,0
80000a84:	03000513          	addi	a0,zero,48
80000a88:	df9ff0ef          	jal	ra,80000880 <putc>
80000a8c:	07800513          	addi	a0,zero,120
80000a90:	df1ff0ef          	jal	ra,80000880 <putc>
80000a94:	00040293          	addi	t0,s0,0
80000a98:	0092d2b3          	srl	t0,t0,s1
80000a9c:	00f2f293          	andi	t0,t0,15
80000aa0:	00029863          	bne	t0,zero,80000ab0 <print_unsigned_hex+0x50>
80000aa4:	00091663          	bne	s2,zero,80000ab0 <print_unsigned_hex+0x50>
80000aa8:	00048463          	beq	s1,zero,80000ab0 <print_unsigned_hex+0x50>
80000aac:	0240006f          	jal	zero,80000ad0 <print_unsigned_hex+0x70>
80000ab0:	00100913          	addi	s2,zero,1
80000ab4:	00a00313          	addi	t1,zero,10
80000ab8:	0062c663          	blt	t0,t1,80000ac4 <print_unsigned_hex+0x64>
80000abc:	05728293          	addi	t0,t0,87
80000ac0:	0080006f          	jal	zero,80000ac8 <print_unsigned_hex+0x68>
80000ac4:	03028293          	addi	t0,t0,48
80000ac8:	00028513          	addi	a0,t0,0
80000acc:	db5ff0ef          	jal	ra,80000880 <putc>
80000ad0:	ffc48493          	addi	s1,s1,-4
80000ad4:	fc04d0e3          	bge	s1,zero,80000a94 <print_unsigned_hex+0x34>
80000ad8:	00012083          	lw	ra,0(sp)
80000adc:	00412403          	lw	s0,4(sp)
80000ae0:	00812483          	lw	s1,8(sp)
80000ae4:	00c12903          	lw	s2,12(sp)
80000ae8:	01012983          	lw	s3,16(sp)
80000aec:	01410113          	addi	sp,sp,20
80000af0:	00008067          	jalr	zero,0(ra)

80000af4 <print_int>:
80000af4:	ff010113          	addi	sp,sp,-16
80000af8:	00112023          	sw	ra,0(sp)
80000afc:	00812223          	sw	s0,4(sp)
80000b00:	00912423          	sw	s1,8(sp)
80000b04:	01212623          	sw	s2,12(sp)
80000b08:	00050413          	addi	s0,a0,0
80000b0c:	00010493          	addi	s1,sp,0
80000b10:	00a00913          	addi	s2,zero,10
80000b14:	00041863          	bne	s0,zero,80000b24 <print_int+0x30>
80000b18:	03000513          	addi	a0,zero,48
80000b1c:	d65ff0ef          	jal	ra,80000880 <putc>
80000b20:	0440006f          	jal	zero,80000b64 <print_int+0x70>
80000b24:	00045863          	bge	s0,zero,80000b34 <print_int+0x40>
80000b28:	02d00513          	addi	a0,zero,45
80000b2c:	d55ff0ef          	jal	ra,80000880 <putc>
80000b30:	40800433          	sub	s0,zero,s0
80000b34:	00040e63          	beq	s0,zero,80000b50 <print_int+0x5c>
80000b38:	032462b3          	rem	t0,s0,s2
80000b3c:	03028293          	addi	t0,t0,48
80000b40:	ffc48493          	addi	s1,s1,-4
80000b44:	0054a023          	sw	t0,0(s1)
80000b48:	03244433          	div	s0,s0,s2
80000b4c:	fe9ff06f          	jal	zero,80000b34 <print_int+0x40>
80000b50:	00248a63          	beq	s1,sp,80000b64 <print_int+0x70>
80000b54:	0004a503          	lw	a0,0(s1)
80000b58:	d29ff0ef          	jal	ra,80000880 <putc>
80000b5c:	00448493          	addi	s1,s1,4
80000b60:	ff1ff06f          	jal	zero,80000b50 <print_int+0x5c>
80000b64:	00012083          	lw	ra,0(sp)
80000b68:	00412403          	lw	s0,4(sp)
80000b6c:	00812483          	lw	s1,8(sp)
80000b70:	00c12903          	lw	s2,12(sp)
80000b74:	01010113          	addi	sp,sp,16
80000b78:	00008067          	jalr	zero,0(ra)

80000b7c <memcmp>:
80000b7c:	02d59663          	bne	a1,a3,80000ba8 <memcmp+0x2c>
80000b80:	02058063          	beq	a1,zero,80000ba0 <memcmp+0x24>
80000b84:	00054283          	lbu	t0,0(a0)
80000b88:	00064303          	lbu	t1,0(a2)
80000b8c:	00629e63          	bne	t0,t1,80000ba8 <memcmp+0x2c>
80000b90:	00150513          	addi	a0,a0,1
80000b94:	00160613          	addi	a2,a2,1
80000b98:	fff58593          	addi	a1,a1,-1
80000b9c:	fe0594e3          	bne	a1,zero,80000b84 <memcmp+0x8>
80000ba0:	fff00513          	addi	a0,zero,-1
80000ba4:	00008067          	jalr	zero,0(ra)
80000ba8:	00000513          	addi	a0,zero,0
80000bac:	00008067          	jalr	zero,0(ra)

Amazing. 800003a9 ffffffff is .word LITERAL .word -1, you can see it in address 800003ab where we have the check_number code.

800003a1:	80000158          	.word	0x80000158
800003a5:	800000d8          	.word	0x800000d8
800003a9:	8000004c          	.word	0x8000004c
800003ad:	ffffffff          	.word	0xffffffff
800003b1:	80000264          	.word	0x80000264
800003b5:	800002c8          	.word	0x800002c8
800003b9:	800003d5          	.word	0x800003d5
800003bd:	80000158          	.word	0x80000158
800003c1:	800000b8          	.word	0x800000b8
800003c5:	80000230          	.word	0x80000230
800003c9:	8000002c          	.word	0x8000002c
800003cd:	800002e4          	.word	0x800002e4
800003d1:	80000381          	.word	0x80000381

At the end of IS_NUMBER, it calls NEXT, by jumping to 80000020, s0 will be 800003a9, and then it NEXT will move s0 to 800003ad and then jump into LITERAL at 8000004c. LITERAL will load the value from memory[800003ad] and push it into the stack, then it will move s0 to 800003b1, and call NEXT again by jumping to 80000020.

8000004c:	00042283          	lw	t0,0(s0)
80000050:	00440413          	addi	s0,s0,4
80000054:	ffc48493          	addi	s1,s1,-4
80000058:	0054a023          	sw	t0,0(s1)
8000005c:	fc5ff06f          	jal	zero,80000020 <NEXT>

See again, how the thread is woven. From NEXT to NEXT to NEXT..

Examine address 8000005c containing the machine code fc5ff06f. It is 'jal x0, -60', as we discussed in RISCV jumps are relative to the jal instruction itself. And 0x8000005c - 60 is.. you guessed it 0x80000020 :) In binary 0xfc5ff06f is 11111100010111111111000001101111. The most right bits are the jal instruction iself, then 5 bits are for rd the destination register where pc+4 will be stored, in this case it is the zero register, and then, the instruction's immediate value, the pc relative offset, in somewhat strange encoding, first we take the 20th bit, then the bits 10 to 1 then bit 11 then 19 to 12 to construct the actual value

This is how it is actually decoded:

signed Bits<21> imm = sext({$encoding[31], $encoding[19:12], $encoding[20], $encoding[30:21], 1'd0});

Sext means sign extension,if its 1 then it is a 2s complement number. and the sign must be preserved. For example if we had 4 bit number -3 1101, and want to extend it to 8 bits we must make it 11111101 not 00001101. That is what Sign Extension mean, because we want to convert 20 bit number into 32 bit number.

1'd0 means one bit of value 0 added to the end. This multiplies the result by 2, and guarantees that the address we are jumping to is multiple of 2. So if the actual immediate value is 10 we will jump to pc + 20.

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1  1  1  1  1  1  0  0  0  1  0  1  1  1  1  1  1  1  1  1  0  0  0 0 0 1 1 0 1 1 1 1

So our immediate value is:

sext(1, 11111111, 1, 1111100010, 0), or in 32 sign extended bits 11111111111111111111111111000100, to convert it from twos complement, we have to invert the bits and add 1 to it, 00000000000000000000000000111011 + 1 is 00000000000000000000000000111100 which in decimal is 60, so 11111111111111111111111111000100 two's complement is -60 in decimal

I wanted to show you the manual decoding of fc5ff06f because we jump around so much, the whole Forth inner interpreter is about jumping, so understanding the jal instruction seemed appropriate, however you can do fine without understanding the bits of it. You can "forget" how the bits are, and just know that it will jump to where you want, you don't even need to know that its a relative jump rather than absolute, you can see even objdump shows absolute values for the jumps, but then they are compiled to relative offsets in the machine code. At some point however you might want to jump further than 20 bits away, and then it will error out, relocation truncated to fit: R_RISCV_JAL against 'xyz', you will google it, someone will say, 'just use call instead of jal' and you will go on with your life. A tiny specticle of confusion will be left in your soul. It makes no sense why cant you just jump to a label? You know it is somewhere in memory, you know the instruction for jumping, why cant it jump? This kind of questions happen more often than you think, and you how deep difference is between 'when I see jal error I must use call' and 'oh my jal offset is more than 20 bits away, I should use call'. You can code in assembly all their life, and be satisfied with 'I must use call', you wont even code more or less bugs, you wont be more productive. But you will be incomplete. Knowledge and understanding grows, like a huge interconnected graph, inside of this graph there are nodes of doubt and confusion that spread their tenticles. For most you are not even aware until you reach them. It is a rare and valuable opportunity to turn a doubt node into a light node, do not miss it. Cherish the moments when things make no sense, because you are about to grow. The more confused you are the better.

Lets go back to our bytecode, There are few important concepts that you should pay attention to, one is BRANCH_ON_ZERO, JUMP and EQUAL. We will discuss them in detail.

...
NEXT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  jr t0
...

  # ( f -- )
BRANCH_ON_ZERO:
  POP t0
  beqz t0, .L_do_branch
  addi :IP, :IP, 4
  j NEXT
.L_do_branch:
  lw :IP, 0(:IP)
  j NEXT

  # ( -- )
JUMP:
  lw :IP, 0(:IP)
  j NEXT

  # ( a b -- f)
EQUAL:
  POP t0
  POP t1
  beq t0, t1, .L_equal
  li t0, 0
  PUSH t0
  j NEXT
.L_equal:
  li t0, -1
  PUSH t0
  j NEXT

NEXT reads the current word from wherever :IP points, then adds 4 to it, and then jumps to the original value. So every time we call NEXT wherever we jump to, IP is pointing to the next instruction.

JUMP is the easiest, coming to it from NEXT you it will just read the current value of memory[:IP] and set :IP to there, so NEXT will jump to there.


Memory layout at 0x80000400:

Address     |  Value        | Meaning
------------------------------------
0x80000400  |  LITERAL     | Push number onto stack
0x80000404  |  42          | The number to push
0x80000408  |  JUMP        | Jump instruction
0x8000040C  |  0x80000418  | Jump target address
0x80000410  |  LITERAL     | (skipped)
0x80000414  |  99          | (skipped)
0x80000418  |  EMIT        | Print top of stack
0x8000041C  |  BYE         | Exit

Step by step execution:

1. Initial state:
   :IP = 0x80000400
   :SP = FORTH_STACK_END

2. Execute NEXT:
   - Load t0 = memory[0x80000400] = LITERAL
   - :IP += 4 (now 0x80000404)
   - Jump to LITERAL

3. Execute LITERAL:
   - Load value from memory[:IP] = 42
   - Push 42 onto stack
   - :IP += 4 (now 0x80000408)
   - Jump to NEXT

4. Execute NEXT:
   - Load t0 = memory[0x80000408] = JUMP
   - :IP += 4 (now 0x8000040C)
   - Jump to JUMP

5. Execute JUMP:
   - Load new_ip = memory[:IP] = 0x80000418
   - Set :IP = 0x80000418
   - Jump to NEXT
   (Notice we skip over addresses 0x80000410-0x80000414)

6. Execute NEXT:
   - Load t0 = memory[0x80000418] = EMIT
   - :IP += 4 (now 0x8000041C)
   - Jump to EMIT

7. Execute EMIT:
   - Pop 42 from stack
   - Print it
   - Jump to NEXT

8. Execute NEXT:
   - Load t0 = memory[0x8000041C] = BYE
   - :IP += 4 (now 0x80000420)
   - Jump to BYE

An example of infinite loop, jump that jumps to itself:

Address     |  Value
--------------------------
0x80000408  |  JUMP
0x8000040C  |  0x80000408

I hope you understand the unconditional JUMP, it is pretty much the same as the jump we did in our SUBLEQ computer, we just set PC to some value and thats where the next instruction is loaded from.

The conditional jump BRANCH_ON_ZERO is very similar but we decide if we should jump to the argument or not depending if the top of the stack is 0.

EQUAL is quite straight forward, it pops 2 elements from the stack, if they are equal it pushes -1 otherwise it pushes 0. So for example if the stack is 1 2 after equal it will be 0, if it was 3 3 it will be -1, if you examine the code for BRANCH_ON_ZERO, any non zero value is true for us, 745762 is just as true as 1 and as -1 and as -487327, anything but 0. In Forth it is convention to use -1, I am not sure why, could be because as 32 bit two's complement value its 11111111111111111111111111111111.

Let's look at a simple example that checks if two numbers are equal
and branches based on that:

Memory layout at 0x80000400:
        Address     |  Value          | Meaning
        ------------------------------------
        0x80000400  |  LITERAL        | Push first number
        0x80000404  |  42             | Value 42
        0x80000408  |  LITERAL        | Push second number
        0x8000040C  |  42             | Value 42
        0x80000410  |  EQUAL          | Compare numbers
        0x80000414  |  BRANCH_ON_ZERO | Branch if not equal
        0x80000418  |  0x80000428     | Branch target (skip to BYE)
        0x8000041C  |  LITERAL        | Push success number
        0x80000420  |  7              | Success value
        0x80000424  |  EMIT           | Print it
        0x80000428  |  BYE            | Exit

Step by step execution (when numbers are equal):

1. Start with empty stack
   :IP = 0x80000400

2. After first LITERAL: stack = [42]
   :IP = 0x80000408

3. After second LITERAL: stack = [42, 42]
   :IP = 0x80000410

4. After EQUAL: stack = [-1]  (because 42 == 42)
   :IP = 0x80000414

5. BRANCH_ON_ZERO sees -1:
   - Since top of stack is not zero, don't branch
   - :IP += 4 (moves to 0x80000418)

6. LITERAL pushes 7: stack = [7]
   :IP = 0x80000420

7. EMIT prints 7
   :IP = 0x80000424

8. BYE exits

If we changed the second LITERAL to push 43 instead:
- EQUAL would push 0 (because 42 != 43)
- BRANCH_ON_ZERO would see 0 and jump to 0x80000428
- We would skip the LITERAL 7 and EMIT
- Program would exit immediately

The key insight is that BRANCH_ON_ZERO makes a decision based on the stack's top value:
- If top of stack is 0: jump to the target address
- If top of stack is anything else: continue to next instruction

Now you can read our mini bytecode interpreter again, it can compile the program "2 2 + 3 + 5 + . bye"

  .word LITERAL
  .word human_program
next_token:
  .word PARSE_TOKEN
  .word OVER
  .word LITERAL
  .word 0
  .word EQUAL
  .word BRANCH_ON_ZERO
  .word check_is_number
  .word BYE
check_is_number:
  .word TWODUP
  .word IS_NUMBER
  .word LITERAL
  .word -1
  .word EQUAL
  .word BRANCH_ON_ZERO
  .word not_a_number
  .word TWODUP
  .word ATOI
  .word NROT
  .word PLUS
  .word JUMP
  .word next_token
not_a_number:
  .word TWODUP
  .word LITERAL
  .word 1
  .word LITERAL
  .word string_dot
  .word MEMCMP
  .word BRANCH_ON_ZERO
  .word not_a_dot
  .word ROT
  .word EMIT
  .word PLUS
  .word JUMP
  .word next_token
not_a_dot:
  .word TWODUP
  .word LITERAL
  .word 1
  .word LITERAL
  .word string_plus
  .word MEMCMP
  .word BRANCH_ON_ZERO
  .word not_a_plus
  .word TWOSWAP
  .word PLUS
  .word NROT
  .word PLUS
  .word JUMP
  .word next_token
not_a_plus:
  .word TWODUP
  .word LITERAL
  .word 3
  .word LITERAL
  .word string_bye
  .word MEMCMP
  .word BRANCH_ON_ZERO
  .word do_next_token
  .word BYE
do_next_token:
  .word PLUS
  .word JUMP
  .word next_token

If I just expand the interpreter to support WRITE then I could write a program that writes a program "800003a9 0x80000420 WRITE 4 0x8000042d WRITE ... ", you can see its not difficult to do this expansion, as WRITE is no different than + or bye. However my forth program will be completely unportable between computers, because on some other computer it wont be compiled for address 80000000, and LITERAL wont be at 800003a9. If we could only know where LITERAL is then our Forth program wont have actual hardcoded memory values. Not only that but also expending the program with hardcoded write instructions is similar to writing a program in machine code, it requires great dedication and possibly desparation, pen paper and confiedence beyond my abilities.

Forth solves this problem by having a dictionary of words, each word has a link to the previous word in the dictionary, and you can search for words.

dictionary:
word_bye:
    .word 0         # link
    .word 3         # token length
    .ascii "bye\0"  # first 4 characters of token
    .word BYE       # address of execution token

word_plus:
    .word word_bye
    .word 1
    .ascii "+\0\0\0"
    .word PLUS


word_write:
    .word word_plus
    .word 5
    .ascii "writ"
    .word WRITE

word_dup:
    .word word_write
    .word 3
    .ascii "dup\0"
    .word DUP

Each entry has at least 4 values, link, token len, first 4 characters of the token, execution address. Usually the token is actually variable length, but for simplicity I decided to use fixed size of 4 bytes, wo WRITE and WRITZ will actually find the same forth word, both are 5 letters and the first 4 are WRIT, but that is ok for our version.

The first value is very important, the link, it is the address of the previous dictionary entry. If our example dictionary starts at address 8000087d this is how our memory would look like:

8000002c <PLUS>:
8000002c:    0004a283              lw    t0,0(s1)         <-.
80000030:    00448493              addi  s1,s1,4            |
80000034:    0004a303              lw    t1,0(s1)           |
80000038:    00448493              addi  s1,s1,4            |
8000003c:    006282b3              add   t0,t0,t1           |
80000040:    ffc48493              addi  s1,s1,-4           |
80000044:    0054a023              sw    t0,0(s1)           |
80000048:    fd9ff06f              jal   zero,80000020      |
...                                                         |
80000070 <WRITE>:                                           |
80000070:    0004a283              lw    t0,0(s1)           | <-.
80000074:    00448493              addi  s1,s1,4            |   |
80000078:    0004a303              lw    t1,0(s1)           |   |
8000007c:    00448493              addi  s1,s1,4            |   |
80000080:    0062a023              sw    t1,0(t0)           |   |
80000084:    f9dff06f              jal   zero,80000020      |   |
...                                                         |   |
80000088 <BYE>:                                             |   |
80000088:    0290006f              jal   zero,800008b0 <-.  |   |
...                                                      |  |   |
800000f8 <DUP>:                                          |  |   |
800000f8:    0004a283              lw    t0,0(s1)        |  |   | <-.
800000fc:    00448493              addi  s1,s1,4         |  |   |   |
80000100:    ffc48493              addi  s1,s1,-4        |  |   |   |
80000104:    0054a023              sw    t0,0(s1)        |  |   |   |
80000108:    ffc48493              addi  s1,s1,-4        |  |   |   |
8000010c:    0054a023              sw    t0,0(s1)        |  |   |   |
80000110:    f11ff06f              jal   zero,80000020   |  |   |   |
...                                                      |  |   |   |
8000087d:   00000000 0 <--.                              |  |   |   |
80000881:   00000003 3    |                              |  |   |   |
80000885:   65796200 bye  |                              |  |   |   |
80000889:   80000088 BYE -+------------------------------'  |   |   |
8000088d:   8000087d -----'  <-.                            |   |   |
80000891:   00000001 1         |                            |   |   |
80000895:   2b000000 +         |                            |   |   |
80000899:   8000002c PLUS------+----------------------------'   |   |
8000089d:   8000088d ----------'  <-.                           |   |
800008a1:   00000005 5              |                           |   |
800008a5:   74697277 writ           |                           |   |
800008a9:   80000070 WRITE----------+---------------------------'   |
800008ad:   8000089d ---------------'                               |
800008b1:   00000003 3                                              |
800008b5:   70756400 dup                                            |
800008b9:   800000f8 DUP--------------------------------------------'

This data structure where one entry points to another is called a linked list, it is incrediblu useful and powerful, just as the stack data structure is powerful. I wont spend much time on it, but its power is in allowing variable size entries to reference each other even if they are in different places in memory. You only need to know where the last element is and you can keep adding to the chain of entries, If you know where the first element is you can add from the head (thats how the first element is called) or from the tail (thats how we call the last element). You can also remove any element without having to copy anything, as you traverse it you just make the parent's link point to the link of the element you want to remove, and it just vanishes. Anyway there are also doubly linked lists and skip lists and so on, all have different powers. For now it is safe to think of it as a chain of things. In our case it is a chain of forth words.

A pseudo code for a FIND function looks something like this:

find(tok)
    entry = last entry
    while true:
        if entry == 0
            break

        compare entry's length with tok length
        if not equal
            entry = entry's link
            continue

        compare first 4 characters of entry and tok
        if not equal
            entry = entry's link
            continue

        both the length and first 4 characters are equal
        this is our token
        return entry's execution token address

    return not found

This pattern is very common while scanning a linked list, you start from the tail, and go backwards element by element. (or the head, depending if the link is backwards or forwards).

We will do one more change, it is very annoying to keep the token on the stack, because we have to keep rotating things to get it back on top so we can calculate the next address, but we dont know how much stack would the word use. So we will just move the token length and address in a global variables. And we will modify the PARSE_TOKEN function to read them and update them, plus we will add NEXT_TOKEN that moves the the address to address + length so we can read the next token next time PARSE_TOKEN is called.

This is the modified code, plus the FIND function, and the global variables, and the refactored interpreter.

  # ...

  # ( -- )
NEXT_TOKEN:
  la a0, cur_token_address
  lw t0, 0(a0)
  la t1, cur_token_len
  lw t1, 0(t1)
  add t0, t0, t1 # len + addr
  sw t0, 0(a0)
  j NEXT

  # ( -- len addr )
PARSE_TOKEN:
  # load the variables
  la a0, cur_token_address
  lw a0, 0(a0)
  la a1, cur_token_len
  lw a1, 0(a1)

  jal token

  PUSH a1 # length
  PUSH a0 # token address

  # store the new values
  la t0, cur_token_address
  sw a0, 0(t0)
  la t1, cur_token_len
  sw a1, 0(t1)

  j NEXT

  # Input:
  #   a0: token address
  #   a1: token length
  # Output:
  #   a0: execution token address (or 0 if not found)
do_find:
  li t1, 0
  mv t3, a1

  # The shananigans here are so we can build little endian version of the token
  # in 4 bytes dont be intimidated by them, I just made the tokens in the
  # dictionary as "bye\0" instead of "\0eyb" to be easier to read

  beqz t3, .L_not_found # zero length token
  lbu t1, 0(a0)
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 1(a0)
  sll t2, t2, 8
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2

  # t1: has the input token as 4 byte number
  # a1: is the length of the input token
  # t0: pointer to the entry, we will start at the end
.L_find_start:
  la t0, dictionary_end     # t0 = last dictionary entry

.L_find_loop:
  beqz t0, .L_not_found     # if the entry is 0, means we didnt find a match
  lw t2, 4(t0)              # load the length of the entry
  bne t2, a1, .L_next_entry # compare lengths
  lw t2, 8(t0)              # load entry name
  bne t2, t1, .L_next_entry # compare names
  lw a0, 12(t0)             # load the actual execution token
  ret                       # return the execution token

.L_next_entry:
  lw t0, 0(t0)              # follow link to next entry
  j .L_find_loop

.L_not_found:
  li a0, 0                  # return 0 for not found
  ret

  # ( len addr -- xt )
FIND_WORD:
  POP a0              # token address
  POP a1              # token length
  call FIND
  PUSH a0             # push execution token or 0
  j NEXT

  # ...

human_program:
  .asciz "842 31 + 721 + 3 + . bye"

cur_token_address:
  .word human_program
cur_token_len:
  .word 0

bytecode:
next_token:
  .word NEXT_TOKEN
  .word PARSE_TOKEN
  .word OVER
  .word LITERAL
  .word 0
  .word EQUAL
  .word BRANCH_ON_ZERO
  .word check_is_number
check_is_number:
  .word TWODUP
  .word IS_NUMBER
  .word LITERAL
  .word -1
  .word EQUAL
  .word BRANCH_ON_ZERO
  .word not_a_number
  .word ATOI
  .word JUMP
  .word next_token
not_a_number:
  .word FIND_WORD
  .word DUP              # we want a copy otherwise EQUAL will pop the word we need
  .word LITERAL
  .word 0
  .word EQUAL
  .word BRANCH_ON_ZERO
  .word forth_word_found # find word is not zero, meaning we found something
  .word BYE              # word not found, just exit
forth_word_found:
  .word LITERAL
  .word execute_placeholder  # we want to write the execution token there
  .word WRITE                # ( value addr -- )
                             # value is the execution token address (XT)
                             # returned from FIND_WORD and is on the stack
                             # address is execute_placeholder
execute_placeholder:
  .word 0                    # <-- magic! WRITE will write at this location, and then NEXT will jump to it
  .word JUMP
  .word next_token

dictionary:
word_bye:
    .word 0         # link
    .word 3         # token length
    .ascii "bye\0"  # first 4 characters of token
    .word BYE       # address of execution token

word_plus:
    .word word_bye
    .word 1
    .ascii "+\0\0\0"
    .word PLUS

word_write:
    .word word_plus
    .word 5
    .ascii "writ"
    .word WRITE

word_dup:
    .word word_write
    .word 3
    .ascii "dup\0"
    .word DUP

word_emit:
dictionary_end:
    .word word_dup
    .word 1
    .ascii ".\0\0\0"
    .word EMIT

FIND_WORD and FIND are cool, I think you will understand them on your own, but I am not sure you will appreciate the beauty of WRITE.

  .word LITERAL
  .word execute_placeholder
  .word WRITE
execute_placeholder:
  .word 0
  .word JUMP
  .word next_token

When assembled looks like this

...
800004ad:       8000004c LITERAL
800004b1:       800004b9 execute_placeholder
800004b5:       80000070 WRITE
800004b9:       00000000 [ execute placeholder value ]
800004bd:       80000330 JUMP
800004c1:       80000445 next_token
...

At this point the stack is the address of the execution token, for example PLUS 8000002c, so the stack is 842 31 8000002c, then we have LITERAL 800004b9, the stack becomes 842 31 8000002c 800004b9 and then LITERAL calls NEXT which executes WRITE. WRITE will write the value 8000002c at location 800004b9, so memory[800004b9] = 8000002c or memory[execute_placeholder] = PLUS, then it will call NEXT, and lo and behold, instead of 00000000, which was going to be executed have not been for our WRITE, we have 8000002c and we will execute PLUS.

The next token again will modify this location, and again it will be executed.

The program changes itself in order to execute itself. How cool is that.

There we have it, now we can trivially add words to our dictionary and expand our language. The only thing is missing is the power to easilly expand the dictionary form the program itself. We can kind of do it now with WRITE, but, it will be beyond painfull, and will require carefull planning and patience that I dont have.


Few things are needed to expand the dictionary, first now we know where it ends by using the dictionary_end label, which has to be dynamic, and we need some helper functions to make it easy to create new words. We also need 4 more bytes per dictionary entry for flags, as you will see some words will be different than others.

Imagine this program : square dup * ; 3 square . it will create the word square, when we jump into it it will execute dup and then multiplication. : is also a word of course and ; as well. but when we get to square, we should not try to find it, but we should create it, and then dup * should not be executed, but we have to store their bytecode into the square dictionary entry to be executed when square is invoked. We will just have a MODE variable that defines if we are in compilation mode (where we are creating a dictionary entry) or evaluation mode where we are exeucting the words. Some words however will have to be immidiately executed even in compilation mode, you will see later why, so we need flags per word to know if its immediate word or not.

After we compile the square word into the dictionary it could look something like this:

...

word_emit:
    .word word_dup
    .word 1
    .ascii ".\0\0\0"
    .word 0 # flag
    .word EMIT

word_square:
    .word word_emit
    .word 6
    .ascii "squa"
    .word 0 # flag
    .word DUP
    .word MULTIPLY

The issue with this structure is that once we get to execute the execution token of square, then we must have our :IP jump to there, as if we create a new thread, we we have to break out of the thread we were on, and go there to execute DUP and MULTIPLY and then somehow we have to go back.

word_square:
    .word word_emit
    .word 6
    .ascii "squa"
    .word 0
IP->.word DUP
    .word MULTIPLY

This is very similar issue to how we jal, we need to store where are we coming back. We will use another stack for that purpose, we will store :IP there before jumping, and then before returning we will jump back.

word_square:
    .word word_emit
    .word 6
    .ascii "squa"
    .word 0
    .word DOCOL <- push :IP in the return stack, and set it to our thread and call NEXT
    .word DUP
    .word MULTIPLY
    .word EXIT  <- pop :IP from the return stack and call NEXT

Remember how NEXT works, it first loads the value from memory[:IP] then it increments it, so when it jumps somewhere, we have the intended NEXT :IP before it jumped to us.

NEXT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  jr t0

Which means if I have a thread like this, when PLUS calls NEXT, the IP we will get is the address of SQUARE.

.word LITERAL
.word 8
.word LITERAL
.word 7
.word PLUS
.word SQUARE
.word EMIT
.word BYE

in DOCOL we will capture this value, add 4 to it, and push it in the return stack, because when we return from SQUARE we want to get to EMIT.

We will have two threads to weave. 8 7 + square . bye

.word LITERAL
.word 8
.word LITERAL
.word 7
.word PLUS
.word SQUARE >------.
                     \
                      `
                       .word DOCOL
                       .word DUP
                       .word MULTIPLY
                       .word EXIT
                       .
                      /
.word EMIT <---------'
.word BYE

And of course you can imagine SQUARE being more complicated, lets make gigantize which does : double dup + ; : gigantize double dup * ; 7 gigantize . bye, so gigantize will double the stack value and then square it.

Main Thread                 GIGANTIZE Thread             DOUBLE Thread
-------------              -----------------            --------------
LITERAL 7
GIGANTIZE ---------------> DOCOL
                           DOUBLE --------------------> DOCOL
                                                        DUP
                                                        PLUS
                                                        EXIT
                                 <---------------------'
                           DUP
                           MULTIPLY
                           EXIT
EMIT     <----------------'
BYE

Stack evolution:
7                         # After LITERAL 7
7                         # Enter GIGANTIZE
7 7                       # Enter DOUBLE, DUP
14                        # PLUS
14                        # Return to GIGANTIZE
14 14                     # DUP
196                       # MULTIPLY
196                       # Return to main, EMIT

The term thread is used in Forth to mean this idea of a silk thread of instructions. In modern programming the term thread means something else, and yet somehow similar, a thread of execution in modern programming is a lightweight process (runnign program) that shares memory with the main process allowing threads to communicate and execute instructions independently of each other. You can see it is quite different than the Forth thread, but you can also see how the weaving metaphor works spot on, and thats how they call it in the Forth magazines and books, so I will stick to that word. Also the term "forth word" has nothing to do with the assembly notation ".word" in our case .word just means declaration of a 4 byte value, a forth word now you see is a entry in the dictionary.

In the code so far in do_find we do lw a0, 12(t0) which will load the actual machine code address, which I loosely call execution token. It is where NEXT will jump to.

NEXT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  jr t0

Our words at the moment have actual machine code pointer at memory[:IP], e.g. If the address of DUP is 800000f8, then the value at memory[:IP] will be 800000f8 when we are about to execute DUP, it wont be the address of the word definition of DUP in the dictionary.

This is a major decision when making a Forth interpreter, do you point to the machine code or to the word. And in general the question of "how do you actually execute words from the dictionary". In our case we will point to the machine code.

There is a slight problem with the current explanation. Making our word gigantize use the word double would look like this:

word_double:
    .word word_square
    .word 6
    .ascii "doub"
    .word 0
    .word DOCOL
    .word DUP
    .word PLUYS
    .word EXIT
word_gigantize:
    .word word_double
    .word 9
    .ascii "giga"
    .word 0
    .word DOCOL
    .word DOCOL # this DOCOL is double's execution token
    .word DUP
    .word PLUS
    .word EXIT

When we execute DOCOL for gigantize it will properly store the :IP in the return stack, but, then how are we going to move :IP inside gigantize's thread? That is the first problem. We we have EXECUTE code that at the moment writes the execution token from FIND in memory and then NEXT jumps to it, so for "dup", FIND_WORD will put the machine code of DUP then NEXT will jump to it. So far our :IP has been within the interpreter thread, jumping up and down through the bytecode, once we were executing a word when we call NEXT from the DUP machine code it will move IP onw down to JUMP in the interpreter, and we go again.

        ...
        .word BRANCH_ON_ZERO
        .word forth_word_found
        .word BYE
      forth_word_found:
        .word LITERAL
        .word execute_placeholder
        .word WRITE
      execute_placeholder:
IP ->   .word 0
        .word JUMP
        .word next_token
        ...

The question is how do we make the IP jump within the word's thread?

word_gigantize:
      .word word_double
      .word 9
      .ascii "giga"
      .word 0
IP -> .word DOCOL
      .word DOCOL
      .word DUP
      .word PLUS
      .word EXIT

We do know at FIND's time the actual address of the thread, as we have found the word, so we just have to change find to not dereference it (dereference is just a fancy name of follow the pointer), if we replace this line lw a0, 12(t0) with addi a0, t0, 12 so that we return the pointer not the dereferenced value we can then put the address in a register (usually called W or XT), and then in DOCOL we can do IP = XT + 4 (since we want to jump over the DOCOL) to start executing the thread. This will work if your word does not call other words, as you can see in the gigantize example, we just have DOCOL and then DOCOL again, so we we will lose the XT value. This is a big annoying, it can be solved in many ways, I wont go into details, but we will solve it in the coolest way. At the time when we create word, we will create machine code instructions with the value of the current address and we will set XT to this value from there and then we will jump to DOCOL.

This is how a word will look in memory:

  # : square dup * ;
  #
  # ...
  #               DOCOL:
  # 80000534:      RPUSH :IP    <-----------------.
  # 80000538:                                     |
  # 8000053c:      mv :IP, :XT                    |
  # 80000540:      j NEXT                         |
  # ...                                           |
  # 80000148 <DUP>:                               |
  # 80000148:      lw t0, 0(:SP)                  |
  # 8000014c:      PUSH t0                        |
  # ...                                           |
  # 80000880: w_square:                           |
  # 80000880:      80000..# link                  |
  # 80000884:      6      # size                  |
  # 80000888:      "squa" # token                 |
  # 8000088c:      0      # flags                 |
  # 80000890:      80000894 # CODE FIELD >--------|---.
  # 80000894:      lui :XT, 0x80001  >---.        | <-'
  # 80000898:      addi :XT, :XT, 0x8a8 >--.      |
  # 8000089c:      lui t0, 0x80000 >---.   |      |
  # 800008a0:      addi t0, t0, 0x534 >----|------'
  # 800008a4:      jr t0                   |
  # 800008a8:      80000148 # DUP <--------'
  # 800008ac:      80000... # MUL
  # 800008b0:      80000... # EXIT
  # ...

One more change we need is to add flags field, which is going to be used to tell us if a word is going to be executed in compile mode or not. In forth : is the symbol for 'create a new subroutine word', it puts the interpreter into compiler mode. For example : square creates a word square that will be put in the dictionary, then all the words after are going to be compiled into square's thread, and when ; is seen the word is complete. And the interpreter changes back to interpret mode.

Generating machine code on the fly is called just in time compilation, we are not doing exactly what modern jit compilers do, and in the context of forth words it means something slightly different, but it is just as cool. To be able to put instructions by hand in memory and jump to them is the ultimate expression of the man machine interraction. There is nothing more beautiful than that.

GCC for example takes C code and generates machine code, this is called ahead of time compilation, compilers are more complicated than assemblers, they understand more about the higher level semantics of the program and can make executive decisions about the generated code, for example:

a = 5
b = 4
a = b

An optimizing compiler can see that a=5 is irrelevant, it wont even generate the machine code for li t0, 5; sw t0, 40(sp) (if a is on the function stack).

if (0) {
    a = 6
    b = 8
    c = a + b
}

It will know that this branch will never be taken, so no code will be generated. The assembler is much simpler than that, it tries to map one to one what you wrote into machine code. An interpreter is a program that evaluates your program. Different than compilers and assemblers, interpreters themselves are compiled to machine code and they will execute the program, in our case our interpreter has bytecode that goes through and finds tokens and so on. However now we will have a compile mode which can compile new bytecode, and even more we will have a on the fly machine code generation which will assemble machine code. So it is safe to say we have everything, an interpreter, a compiler and an assembler. We actually have two interpreters, inner one, the one that is j NEXT that jumps through the memory threads, and the outer one which in our case is written in the bytecode of the inner interpreter, and now we will have a compuler and also the ability to generate machine code, and of course forth bytecode, from inside our forth program and execute it.

All is the one.

First, lets see the code, boot.s and string.s are the same, I am putting the whole code here even the code for PLUS, EMIT, etc even though it didnt change, but I think its easier to read that way, it will give you some anchor to the things you are familiar with.

Take a deep breath, and just read it, it is code made by a human, to be read by other humans. It might seem frightening, some parts of it are easy, some make no sense and that's OK.

  .section .text
  .globl forth
  .globl NEXT

  .macro PUSH reg
  addi :SP, :SP, -4
  sw \reg, 0(:SP)
  .endm

  .macro POP reg
  lw \reg, 0(:SP)
  addi :SP, :SP, 4
  .endm

  .macro RPUSH reg
  addi :RSP, :RSP, -4
  sw \reg, 0(:RSP)
  .endm

  .macro RPOP reg
  lw \reg, 0(:RSP)
  addi :RSP, :RSP, 4
  .endm

forth:
  la :SP, FORTH_STACK_END
  la :RSP, RETURN_STACK_END
  mv :IP, zero
  mv :XT, zero
  la :HERE, dictionary_end
  la :LATEST, dictionary_end - 5*4
  li :MODE, 0

  la t1, human_program
  la t0, cur_token_address
  sw t1, 0(t0)
  la t0, cur_token_len
  sw zero, 0(t0)

  la :IP, interpreter_bytecode
  la :XT, interpreter_bytecode
  j NEXT

NEXT:
  lw   t0, 0(:IP)       # load the actual code address from [IP]
  addi :IP, :IP, 4      # move IP to next cell
  jr   t0               # jump

  # ( a b -- c )
PLUS:
  POP t0
  POP t1
  add t0, t0, t1
  PUSH t0
  j NEXT

  # ( a b -- c )
MUL:
  POP t0
  POP t1
  mul t0, t0, t1
  PUSH t0
  j NEXT


  # ( -- n )
LIT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  lw t1, 0(:IP)
  PUSH t0
  j NEXT

  # ( n -- )
EMIT:
  POP a0
  jal print_int
  j NEXT

  # ( value addr -- )
BANG:
  POP t0 # address
  POP t1 # value
  sw t1, 0(t0)
  j NEXT

  # ( -- )
BYE:
  j qemu_exit

  # ( -- )
CR:
  li a0, '\n'
  jal putc
  j NEXT

  # ( len addr -- n )
ATOI:
  POP a0 # address
  POP a1 # length
  jal atoi
  PUSH a0
  j NEXT

  # ( len addr -- f )
IS_NUMBER:
  POP a0 # address
  POP a1 # length
  jal is_number
  PUSH a0
  j NEXT

  # ( a -- a a )
DUP:
  POP t0
  PUSH t0
  PUSH t0
  j NEXT

  # ( a b -- b a )
SWAP:
  POP t0 # b
  POP t1 # a
  PUSH t0
  PUSH t1
  j NEXT

  # ( a -- )
DROP:
  POP zero
  j NEXT

  # ( a b -- )
TWODROP:
  POP zero
  POP zero
  j NEXT

  # ( a b -- a b a b )
TWODUP:
  POP t0 # b
  POP t1 # a
  PUSH t1 # a
  PUSH t0 # b
  PUSH t1 # a
  PUSH t0 # b
  j NEXT

  # ( n1 n2 -- n1 n2 n1 )
OVER:
  POP t0 # n2
  POP t1 # n1
  PUSH t1 # n1
  PUSH t0 # n2
  PUSH t1 # n1
  j NEXT


  # (x1 x2 x3 x4 -- x3 x4 x1 x2)
TWOSWAP:
  POP t0 # x4
  POP t1 # x3
  POP t2 # x2
  POP t3 # x1

  PUSH t1
  PUSH t0
  PUSH t3
  PUSH t2
  j NEXT

  # (x1 x2 x3 -- x2 x3 x1 )
ROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t1 # x2
  PUSH t0 # x3
  PUSH t2 # x1
  j NEXT

  # (x1 x2 x3 -- x3 x1 x2)
NROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t0 # x3
  PUSH t2 # x1
  PUSH t1 # x2
  j NEXT

  # ( a b -- f)
EQUAL:
  POP t0
  POP t1
  beq t0, t1, .L_equal
  li t0, 0
  PUSH t0
  j NEXT
.L_equal:
  li t0, -1
  PUSH t0
  j NEXT

  # ( len1 addr1 len2 addr2 -- flag)
MEMCMP:
  POP a2
  POP a3
  POP a0
  POP a1
  call memcmp
  PUSH a0
  j NEXT

  # ( f -- )
BRANCH_ON_ZERO:
  POP t0
  beqz t0, .L_do_branch
  addi :IP, :IP, 4
  j NEXT
.L_do_branch:
  lw :IP, 0(:IP)
  j NEXT

  # ( -- )
JUMP:
  lw :IP, 0(:IP)
  j NEXT

  # just a debug function to print the whole stack
  # print debugging.. some people hate it some people love it
  # I both hate it and love it
DEBUG_STACK:
  addi sp, sp, -12
  sw ra,  0(sp)
  sw s8,  4(sp)
  sw s9,  8(sp)

  li a0, '<'
  call putc
  li a0, '>'
  call putc
  li a0, ' '
  call putc

  mv s9, :SP
  add s9, s9, -4
  la s8, FORTH_STACK_END
  add s8, s8, -4
.L_debug_stack_loop:
  beq s8, s9, .L_debug_stack_loop_end
  lw a0, 0(s8)
  call print_unsigned_hex
  li a0, ' '
  call putc
  addi s8, s8, -4
  j .L_debug_stack_loop

.L_debug_stack_loop_end:
  li a0, '\n'
  call putc
  lw ra, 0(sp)
  lw s8, 4(sp)
  lw s9, 8(sp)
  addi sp, sp, 12
  j NEXT

do_next_token:
  la a0, cur_token_address
  lw t0, 0(a0)
  la t1, cur_token_len
  lw t1, 0(t1)
  add t0, t0, t1 # len + addr
  sw t0, 0(a0)
  ret

  # ( -- )
NEXT_TOKEN:
  jal do_next_token
  j NEXT


do_parse_token:
  addi sp, sp, -4
  sw ra,  0(sp)

  # load the variables
  la a0, cur_token_address
  lw a0, 0(a0)
  la a1, cur_token_len
  lw a1, 0(a1)

  jal token # parse the token

  # store the new values
  la t0, cur_token_address
  sw a0, 0(t0)
  la t1, cur_token_len
  sw a1, 0(t1)

  lw ra,  0(sp)
  addi sp, sp, 4
  # return a0 a1 from token
  ret

  # ( -- len addr )
PARSE_TOKEN:
  jal do_parse_token

  PUSH a1 # length
  PUSH a0 # token address

  j NEXT

  # Input:
  #   a0: token address
  #   a1: token length
  # Output:
  #   a0: execution token address (or 0 if not found)
do_find:
  li t1, 0
  mv t3, a1

  # The shananigans here are so we can build little endian version of the token
  # in 4 bytes dont be intimidated by them, I just made the tokens in the
  # dictionary as "bye\0" instead of "\0eyb" to be easier to read
  beqz t3, .L_not_found # zero length token
  lbu t1, 0(a0)
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 1(a0)
  sll t2,t2, 8
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2


  # t1: has the input token as 4 byte number
  # a1: is the length of the input token
  # t0: pointer to the entry, we will start at the end
.L_find_start:
  mv t0, :LATEST

.L_find_loop:
  beqz t0, .L_not_found     # if the entry is 0, means we didnt find a match
  lw t2, 4(t0)              # load the length of the entry
  bne t2, a1, .L_next_entry # compare lengths
  lw t2, 8(t0)              # load entry name
  bne t2, t1, .L_next_entry # compare names
  add a0, t0, 16            # return the code address
  ret

.L_next_entry:
  lw t0, 0(t0)              # follow link to next entry
  j .L_find_loop

.L_not_found:
  li a0, 0                  # return 0 for not found
  ret

  # ( len addr -- xt )
FIND_WORD:
  POP a0              # token address
  POP a1              # token length
  jal do_find
  PUSH a0
  j NEXT

DOCOL:
  RPUSH :IP
  mv :IP, :XT
  j NEXT

EXIT:
  RPOP :IP
  j NEXT

COLON:
  li :MODE, -1 # enter compile mode

  jal do_create

  # we want to achieve this, creating a new word
  #
  # : square dup * ;
  #
  # ...
  #               DOCOL:
  # 80000534:      RPUSH :IP    <-----------------.
  # 80000538:                                     |
  # 8000053c:      mv :IP, :XT                    |
  # 80000540:      j NEXT                         |
  # ...                                           |
  # 80000148 <DUP>:                               |
  # 80000148:      lw t0, 0(:SP)                  |
  # 8000014c:      PUSH t0                        |
  # ...                                           |
  # 80000880: w_square:                           |
  # 80000880:      80000..# link                  |
  # 80000884:      6      # size                  |
  # 80000888:      "squa" # token                 |
  # 8000088c:      0      # flags                 |
  # 80000890:      80000894 # CODE FIELD >--------|---.
  # 80000894:      lui :XT, 0x80001  >---.        | <-'
  # 80000898:      addi :XT, :XT, 0x8a8 >--.      |
  # 8000089c:      lui t0, 0x80000 >---.   |      |
  # 800008a0:      addi t0, t0, 0x534 >----|------'
  # 800008a4:      jr t0                   |
  # 800008a8:      80000148 # DUP <--------'
  # 800008ac:      80000... # MUL
  # 800008b0:      80000... # EXIT
  # ...

  # 1. EXECUTION CODE FIELD point to HERE + 4, where we will
  # put the machine code: memory[HERE] = HERE+4
  mv t0, :HERE
  add t0, t0, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4


  # 2. Generate absolute address for where we want DOCOL to jump, in our case we want HERE+20
  mv t0, :HERE
  addi t0, t0, 20

  # 3. Generate the machine code
  # li :XT, value of :HERE + 20
  # la t0, DOCOL
  # jr t0

  # and expanded
  # lui :XT, value << 12
  # addi :XT, :XT, value << 20 >> 20
  # lui t0, value << 12
  # addi t0, t0, value << 20 >> 20
  # jr t0


  # 3.1 Generate machine code for XT = HERE + 20 at time of compilation
  li a0, 21 # XT is s5, which is register x21
  mv a1, t0
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4


  # 3.1 Generate machine code for la t0, DOCOL
  li a0, 5 # t0 is x5
  la a1, DOCOL
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4


  # 3.2 Generate machine code for jr t0
  li a0, 5 # t0 is x5
  jal do_jr
  sw a0, 0(:HERE) # jr
  addi :HERE, :HERE, 4

  j NEXT

  # ( -- )
SEMICOLON:
  mv :MODE, zero # exit compile mode
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT

  # ( x -- )
COMMA:
  POP t0
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT

  # ( -- flag )
MODE:
  PUSH :MODE
  j NEXT

do_create:
  addi sp, sp, -4
  sw ra, 0(sp)

  jal do_next_token
  jal do_parse_token
  beqz a1, .L_create_error

  # link field (4 bytes)
  sw :LATEST, 0(:HERE)

  # length field (4 bytes)
  sw a1, 4(:HERE)

  # token field (4 bytes)
  li t1, 0
  mv t3, a1

.L_create_build_token:
  lbu t1, 0(a0)
  addi t3, t3, -1

  beqz t3, .L_create_write_token
  lbu t2, 1(a0)
  sll t2, t2, 8
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_create_write_token
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_create_write_token
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2

.L_create_write_token:
  sw t1, 8(:HERE)

  # flags field
  sw zero, 12(:HERE)

  # move the dictionary end
  mv :LATEST, :HERE

  # update HERE to point to the end of the word
  addi :HERE, :HERE, 16

  lw ra, 0(sp)
  addi sp, sp, 4
  ret

.L_create_error:
  la a0, err_create_error
  j panic

panic:
  jal puts
  jal getch
  j qemu_exit

  # ( xt -- f )
SHOULD_COMPILE_WORD:
  POP t0

  beqz :MODE, .L_dont_compile
  # if we are in compile mode, check the flag

  lw t0, -4(t0) # flag value
  bnez t0, .L_dont_compile # flag is immediate, execute it

  li t1, -1
  PUSH t1

  j NEXT
.L_dont_compile:
  PUSH zero
  j NEXT

  # ( addr -- value )
AT:
  POP t0
  lw t0, 0(t0)
  PUSH t0
  j NEXT

  # ( xt -- )
EXECUTE:
  POP t0          # xt
  lw t0, 0(t0)    # load code pointer
  jr t0

  # ( -- c )
KEY:
  jal getch
  PUSH a0
  j NEXT


  # ( -- addr )
PUSH_HERE:
  PUSH :HERE
  j NEXT

  # Li ( a0: reg, a1: imm  -- a0: opcode_lui a1: opcode_addi )
do_li:
  # Extract upper immediate
  # compensating for sign extension if needed
  srli t0, a1, 12        # First get upper 20 bits
  li t3, 0x800
  and t1, a1, t3         # Check bit 11
  beqz t1, no_adjust
  addi t0, t0, 1         # Adjust for sign extension

no_adjust:
  # LUI
  #
  # bits [31:12] = immediate
  # bits [11:7]  = rd
  # bits [6:0]   = 0x37 (opcode)
  #
  li a2, 0x37            # LUI opcode
  slli t2, t0, 12        # upper immediate
  or a2, a2, t2
  slli t2, a0, 7         # rd
  or a2, a2, t2

  # ADDI
  #
  # bits [31:20] = immediate
  # bits [19:15] = rs1
  # bits [14:12] = 0 (funct3)
  # bits [11:7]  = rd
  # bits [6:0]   = 0x13 (opcode)
  #

  li a3, 0x13            # ADDI opcode
  li t1, 0xfff
  and t0, a1, t1         # lower 12 bits
  slli t2, t0, 20        # immediate
  or a3, a3, t2
  slli t2, a0, 15        # rs1
  or a3, a3, t2
  slli t2, a0, 7         # rd
  or a3, a3, t2

  mv a0, a2
  mv a1, a3
  ret

  # ( reg imm -- lui addi )
LI:
  POP a1   # imm
  POP a0   # reg

  call do_li

  PUSH a0 # lui
  PUSH a1 # addi

  j NEXT

  # JR ( a0: reg -- a0: opcode_jr )
do_jr:
  mv t0, a0

  # bits [31:20] = 0 for imm=0
  # bits [19:15] = reg
  # bits [14:12] = 0 (funct3=0)
  # bits [11:7]  = x0 => 0
  # bits [6:0]   = 0x67 (opcode for JALR)
  #
  # So the entire instruction is:
  #   (reg << 15) | 0x67

  slli t1, t0, 15      # reg << 15
  li   t2, 0x67        # opcode JALR
  or   t1, t1, t2      # final 32-bit instruction

  mv a0, t1

  ret

  # JR ( reg -- opcode_jr )
JR:
  POP a0
  call do_jr
  PUSH a0
  j NEXT

dictionary:
word_bye:
  .word 0         # link
  .word 3         # token length
  .ascii "bye\0"  # first 4 characters of token
  .word 0         # flags
  .word BYE       # address of execution token

word_plus:
  .word word_bye
  .word 1
  .ascii "+\0\0\0"
  .word 0
  .word PLUS

word_mul:
  .word word_plus
  .word 1
  .ascii "*\0\0\0"
  .word 0
  .word MUL

word_bang:
  .word word_mul
  .word 1
  .ascii "!\0\0\0"
  .word 0
  .word BANG

word_at:
  .word word_bang
  .word 1
  .ascii "@\0\0\0"
  .word 0
  .word AT

word_dup:
  .word word_at
  .word 3
  .ascii "dup\0"
  .word 0
  .word DUP

word_emit:
  .word word_dup
  .word 1
  .ascii ".\0\0\0"
  .word 0
  .word EMIT

word_cr:
  .word word_emit
  .word 2
  .ascii "cr\0\0"
  .word 0
  .word CR

word_debug_stack:
  .word word_cr
  .word 2
  .ascii ".s\0\0"
  .word 0
  .word DEBUG_STACK

word_colon:
  .word word_debug_stack
  .word 1
  .ascii ":\0\0\0"
  .word 0
  .word COLON

word_semicolon:
  .word word_colon
  .word 1
  .ascii ";\0\0\0"
  .word 1 # immediate
  .word SEMICOLON

word_li:
  .word word_semicolon
  .word 2
  .ascii "li\0\0"
  .word 0
  .word LI

word_jr:
  .word word_li
  .word 2
  .ascii "jr\0\0"
  .word 0
  .word JR

word_key:
  .word word_jr
  .word 3
  .ascii "key\0"
  .word 0
  .word KEY


word_here:
  .word word_key
  .word 4
  .ascii "here"
  .word 1
  .word PUSH_HERE

word_comma:
  .word word_here
  .word 1
  .ascii ",\0\0\0"
  .word 1
  .word COMMA

dictionary_end:
  # forth stack
  .space 2048
FORTH_STACK_END:
  # forth return stack
  .space 2048
RETURN_STACK_END:

  # token variables
cur_token_address:
  .word 0
cur_token_len:
  .word 0

  # the outer interpreter
interpreter_bytecode:
next_token:
  .word NEXT_TOKEN
  .word PARSE_TOKEN
  .word OVER
  .word BRANCH_ON_ZERO
  .word exit
check_is_number:
  .word TWODUP
  .word IS_NUMBER
  .word BRANCH_ON_ZERO
  .word not_a_number
  .word ATOI         # the number is on the stack
  .word MODE
  .word BRANCH_ON_ZERO
  .word next_token # we are in eval mode
  .word LIT
  .word LIT
  .word COMMA
  .word COMMA
  .word JUMP
  .word next_token
not_a_number:
  .word FIND_WORD
  .word DUP
  .word BRANCH_ON_ZERO
  .word exit # word not found, just exit for now
forth_word_found:
  .word DUP
  .word SHOULD_COMPILE_WORD
  .word BRANCH_ON_ZERO
  .word execute_word # we are in eval mode, execute the word
  .word AT           # we are in compile mode, dereference the execution token
  .word COMMA        # write the code address in the thread
  .word JUMP
  .word next_token
execute_word:
  .word EXECUTE
  .word JUMP
  .word next_token
exit:
  .word BYE
  # error messages
err_create_error:
  .asciz "\nerror: create missing name, usage: create [name]\n"

  # our actual human readable program
human_program:
  .asciz "
: plus3 3 + ; 2 plus3 + . cr
: square dup + ;
: double dup * ;
: gigantize square double ;
3 gigantize . cr
bye
  "
  .end

There are few important things, oen I renamed WRITE to BANG which is a synonym for !, thats how Forth calls it, and I added AT which is just reading a value from memory and pushes it to the stack. I changed FIND to return the exeuction token instead of dereferencing it, and added few helper functtions. I added the flag field in the dictionary entries, and I added 5 new registers, :RSP (s2). :HERE (s4), :XT (s5), :MODE (s6), and :LATEST (s7), will explain them in a bit. Renamed LITERAL to LIT, we wiull add LITERAL as a different kind of word that will use LIT, and added new macros RPUSH and RPOP that push and pop values from the Forth Return Stack. Changed do_find to use :LATEST as the end of the dictionary, since we will modify it, we need to know where it ends in order to add to it and to search it.

This is how the interpreter bytecode looks in memory.

address         value                        label
--------------------------------------------------------
80001988 <next_token>:
80001988:       800003cc               # NEXT_TOKEN
8000198c:       8000041c               # PARSE_TOKEN
80001990:       800001d4               # OVER
80001994:       80000310               # BRANCH_ON_ZERO
80001998:       80001a44               # exit

8000199c <check_is_number>:
8000199c:       800001a0               # TWODUP
800019a0:       80000120               # IS_NUMBER
800019a4:       80000310               # BRANCH_ON_ZERO
800019a8:       80001a04               # not_a_number
800019ac:       80000100               # ATOI
800019b0:       80000590               # MODE
800019b4:       80000310               # BRANCH_ON_ZERO
800019b8:       80001988               # next_token
800019bc:       800000b0               # LIT
800019c0:       800000b0               # LIT
800019c4:       8000057c               # COMMA
800019c8:       8000057c               # COMMA
800019cc:       8000032c               # JUMP
800019d0:       80001988               # next_token

80001a04 <not_a_number>:
80001a04:       800004b0               # FIND_WORD
80001a08:       80000140               # DUP
80001a0c:       80000334               # BRANCH_ON_ZERO
80001a10:       80001a44               # exit

80001a18 <forth_word_found>:
80001a18:       80000140               # DUP
80001a1c:       80000634               # SHOULD_COMPILE_WORD
80001a20:       80000310               # BRANCH_ON_ZERO
80001a24:       80001a38               # execute_word
80001a28:       80000678               # AT
80001a2c:       8000057c               # COMMA
80001a30:       8000032c               # JUMP
80001a34:       80001988               # next_token

80001a38 <execute_word>:
80001a38:       80000690               # EXECUTE
80001a3c:       8000032c               # JUMP
80001a40:       80001988               # next_token

80001a44 <exit>:
80001a44:       800000f0               # BYE

Lets start by looking at what the bytecode does, step by step.

next_token:
  .word NEXT_TOKEN
  .word PARSE_TOKEN
  .word OVER
  .word BRANCH_ON_ZERO
  .word exit

NEXT_TOKEN moves just addst the current token's length to the current token's pointer, hence it advances the token pointer by len. PARSE_TOKEN will then skip whitespaces until it finds the next token and it will push its length and address on the stack, OVER will copy the length and push it on top, and then BRANCH_ON_ZERO will not branch to BYE if the length is zero, if not we continue. You can see i removed a bunch of EQUAL and LITERAL 0 and so on, they of course are not needed, I was using them in order to exercise your ability to step through the bytecode and think what it is doing.

check_is_number:
  .word TWODUP
  .word IS_NUMBER
  .word BRANCH_ON_ZERO
  .word not_a_number
  .word ATOI         # the number is on the stack
  .word MODE
  .word BRANCH_ON_ZERO
  .word next_token # we are in eval mode
  .word LIT
  .word LIT
  .word COMMA
  .word COMMA
  .word JUMP
  .word next_token

When we enter check_is_number the stack is again length, address as BRANCH_ON_ZERO popped OVER's copy of the length. TWODUP will copy the top 2 elements of the stack, so it will become length, address, length, address, we need to duplicate them so we can give a copy to IS_NUMBER which will consume top 2 elements and return a flag if the string is a number or not, then BRANCH_ON_ZERO will jump to not_a_number if IS_NUMBER returned 0, otherwise we continue, at the point of ATOI the stack is again length, address, and ATOI will take those and convert the string of ASCII symbols into a single 4 byte integer and will push it to the stack, after that we will use the MODE keyword which just pushes the value of the :MODE register to the stack, in our case thats s7, and we use it to keep track if we are in compile mode or in interpreter mode. If we are in interpreter mode at this point we are good to go, and just jump to the next token, as the number is properly on the stack, if not we must compile the vauke of the number into the dictionary of the word we are creating. For example lets say we have this definition : plus3 3 + ; we could use it like this 2 plus3 + . which will push 2 to the stack then we jump to plus3 which pushes 3 on the stack, calls plus which pushes the result on the stack and then we print the number top of the stack, which will be 5. The plus3 word should inside of it have code that pushes the number 3 on the stack, so it should have .word LIT .word 3 .word PLUS inside its thread. To do that we must write the address of LIT and the value from the stack into the thread..

The stack at this point is just the number, e.g. 3, LIT LIT will push the address of LIT on top, so it will become 3 800000b8 (3 is the number, and 800000b8 is the address of LIT), COMMA takes the top of the stack and writes it wherever :HERE is, and increments :HERE += 4, :HERE is a register (s4) where we keep the value of where we are writing now in the dictionary.

The first COMMA call will take the top of the stack, which now is the address of LIT and write it wherever HERE is pointing, and set HERE += 4 , then the second COMMA will write the number 3 at HERE (which is now HERE+4).

This writes exactly what we want in the dictionary entry .word LIT .word 3.

Then it will just jump to next_token.


not_a_number:
  .word FIND_WORD
  .word DUP
  .word BRANCH_ON_ZERO
  .word exit # word not found, just exit for now

Reaching not_a_number we have the length, address, we are coming here all the way up from the IS_NUMBER BRANCH_ON_ZERO check, we call FIND_WORD which is going to return the execution token, or 0 if not found, we dup it so we can check if its zero, meaning word is not found, BRANCH_ON_ZERO will jump to BYE Otherwise we continue to process the word.

forth_word_found:
  .word DUP
  .word SHOULD_COMPILE_WORD
  .word BRANCH_ON_ZERO
  .word execute_word # we are in eval mode, execute the word
  .word AT           # we are in compile mode, dereference the execution token
  .word COMMA        # write the code address in the thread
  .word JUMP
  .word next_token

Again we copy the execution token, and check if we should compile it or not, if SHOULD_COMPILE_WORD returns zero means we should execute it, SHOULD_COMPILE_WORD will return 0 either if we are in compile mode, meaning we are writing bytecode into a thread, or the word is marked as immediate in which case it will be executed in immediate mode. If the word is supposed to be compiled, we continue to AT, COMMA, The top of the stack is still the execution token, AT reads the value at specific address, which means it will read the value at the address of the execution token and dereference it, as we will get the pointer to the actual machine code, then COMMA will write it at HERE.

execute_word:
  .word EXECUTE
  .word JUMP
  .word next_token

At this point again we have the execution token at the top of the stack, we will jump to EXECUTE which will dereference it and jump to the machine code. At this point IP is pointing to JUMP so when the word is executed we will come back to our JUMP and go to the start again.

exit:
  .word BYE

This is quite self explanatory, just exit qemu by NEXT jumping to BYE.

You can guess by now, maybe SHOULD_COMPILE_WORD was a big enough hint, that this whole interpreter can be also written in few lines of assembly, there is zero reason to write it in the inner interpreter's bytecode, but I thought this way is more fun, and I think we should have more fun with computers. Make them do things, the more bizzare the better. Language that writes itself in itself while overwriting itself with machine code of the machine that is running, whats better than that.

Stepping through this forth program : square dup * ; 5 square . cr bye. First : will set the interpreter in compile mode then create the word square in the dictionary, inside with the thread of dup * then ; as it is immediate word, will be executed in compile mode and will set the interpreter back in evaluation mode, then 5 will push 5 to the stack and square will execute the word square, it will move IP to its thread and execute dup and then +, dup will dup the top of the stack which is 5, so now the stack will be 5 5, then * will multiply the top 2 elements and push the result, after that we will exit from square and go to the main thread, . will print the top of the stack which is noe 25, cr will print a new line and bye will finally exit.

I want to talk speicfically in how we move IP from the main thread into the word's thread, as I think its really cool, and for that we will have to dig into :.

Imagine we are the 'square' word. Empathize with it, think as if you are it, and the other words will interract with you and from somewhere, you dont know where, they will jump into your execution address.

Reminder of how the of the dictionary word looks like:

  LINK : points to the previous word
LENGHT : the token length, e.g `begin` is 5 letters
 TOKEN : first 4 characters of the token, begi in case of begin
 FLAGS : is the word going to be executed in compile time or not
  EXEC : where to jump to when the word is executed

Lets look at the dictionary around DUP


800001ac <DUP>:
800001ac:       0004a283 <-----.  lw      t0,0(s1)
800001b0:       00448493       |  addi    s1,s1,4
800001b4:       ffc48493       |  addi    s1,s1,-4
800001b8:       0054a023       |  sw      t0,0(s1)
800001bc:       ffc48493       |  addi    s1,s1,-4
800001c0:       0054a023       |  sw      t0,0(s1)
800001c4:       eedff06f       |  jal     zero,800000b0 <NEXT>
                               |
...                            |
                               |
                               |
80000aa8 <word_at>:            |
80000aa8:       80000a94 <---. |
80000aac:       00000001     | |
80000ab0:       00000040     | |
80000ab4:       00000000     | |
80000ab8:       800006dc     | |
                             | |
80000abc <word_dup>:         | |
80000abc:       80000aa8 ----' | points to previous word at 80000aa8
80000ac0:       00000003       | lenght 3
80000ac4:       00707564       | the ascii for d u p
80000ac8:       00000000       | flags are 0
80000acc:       800001ac ------' address of the DUP function

80000ad0 <word_emit>:       |
80000ad0:       80000abc ---' points to previous word at 80000abc
80000ad4:       00000001
80000ad8:       0000002e
80000adc:       00000000
80000ae0:       80000134

...

when we have the code 3 dup our interpreter will first push 3 on the stack, then it will find the word rup, and we will call EXECUTE, which will load the value at the code field and jump to it, in our case at address 80000acc, and the value is 800001ac, and so it will jump to 800001ac, where we have the machine code for DUP, we will execute the machine code which pops the value from the stack and pushes it twoce and it will then jump to NEXT.

Now thats all OK because NEXT will jump to the value of the :IP and then do :IP + 4, and our :IP is in the interpreter thread, so all good, NEXT will jump back to the interpreter. For user defined words however we need to make IP inside the word's thread. As we discussed we will create a tiny bit of machine code at the time we are creating the word that stores the location of its thread inside the machine code, so later when we jump to it the machine code has the correct value.

OK time to imagine you are the word 'square', someone jumps to you, first you want to jump to the machine code you have prepared, lets say you are at address 80000880 in the dictionary, and your thread starts at 800008a8 you want to do this

li :XT, 0x800008a8
la t0, DOCOL
jr t0

DOCOL will push the current :IP, which in our case will be somewhere in the interpreter's thread, to the return stack. and then move the :IP thread to :XT which our tiny machine code would've set to 800008a8.

DOCOL:
  RPUSH :IP
  mv :IP, :XT
  j NEXT

As our word is being compiled inside COLON: we know exactly where are writing in memory, we keep moving the :HERE register to the right location. You know when square is to be executed someone will jump to your code field's value, your execution token, so we will use that, we will write our machine code just below it, and make it point to our machine code. Then inside the machine code, as we know exactly how many instructions we need for it, we will put :XT to just after the machine code itself, then DOCOL will do the rest and jump after. We could ofcourse write the machine code for DOCOL itself, but this way seemed more fun for me.

li and la are pseudo instructions, both are broken into lui and addi; lui loads the upper 20 bits of the value, and addi the lower 12 bits. So our machine code is exactly 5 instructions, or 20 bytes.

This is what we want for : square dup * ;

              DOCOL:
80000534:      RPUSH :IP    <-----------------.
80000538:                                     |
8000053c:      mv :IP, :XT                    |
80000540:      j NEXT                         |
...                                           |
80000148 <DUP>:                               |
80000148:      lw t0, 0(:SP)                  |
8000014c:      PUSH t0                        |
...                                           |
80000880: w_square:                           |
80000880:      80000..# link                  |
80000884:      6      # size                  |
80000888:      "squa" # token                 |
8000088c:      0      # flags                 |
80000890:      80000894 # CODE FIELD >--------|---.
80000894:      lui :XT, 0x80001  >---.        | <-'
80000898:      addi :XT, :XT, 0x8a8 >--.      |
8000089c:      lui t0, 0x80000 >---.   |      |
800008a0:      addi t0, t0, 0x534 >----|------'
800008a4:      jr t0                   |
800008a8:      80000148 # DUP <--------'
800008ac:      80000... # MUL
800008b0:      80000... # EXIT

Thats a lot of arrows, but I hope you get the idea, our execution token is just below our codefield, if our code field is at 80000890 then the exeuction token will be 80000894, so when someone finds our word in the dictionary, they will dereference is as in they will load the avalue at address 80000890 and jump to the value, which will be 80000894, and thats where our machine code lives, then the machine code in the end will jump to DOCOL which will make NEXT jump to our actual thread, in our case DUP and MUL.

Then we have EXIT which will pop :IP from the return stack and call NEXT to go back wherever we were called from.

Now lets discuss now would we make lui, addi and jr as machine code. Imagine we want to write the instruction li :XT, 0x80000534. for us :XT is s5, s5 is register 21.

| x0     | zero  |
| x1     | ra    |
| x2     | sp    |
| x3     | gp    |
| x4     | tp    |
| x5     | t0    |
| x6     | t1    |
| x7     | t2    |
| x8     | s0/fp |
| x9     | s1    |
| x10    | a0    |
| x11    | a1    |
| x12    | a2    |
| x13    | a3    |
| x14    | a4    |
| x15    | a5    |
| x16    | a6    |
| x17    | a7    |
| x18    | s2    |
| x19    | s3    |
| x20    | s4    |
| x21    | s5    |
| x22    | s6    |
| x23    | s7    |
| x24    | s8    |
| x25    | s9    |
| x26    | s10   |
| x27    | s11   |
| x28    | t3    |
| x29    | t4    |
| x30    | t5    |
| x31    | t6    |

This li is going to be split into two instructions, liu x21, 0x80000 and addi x21, x21, 0x534, if you take the number 0x80000, 10000000000000000000 in binary, and shift it to the left 12 bits, it becomes 10000000000000000000000000000000, or 2147483648 in decimal or 0x80000000 in hex, and when you add 0x534 to it, or 10100110100 in binary, or 1332 in decimal, you 0x80000000 + 0x534 = 0x80000534, which is what we wanted to do.

In 32 bit RISCV there is no one instruction which can move 32 bits to a register, and you might have guessed why, the instructions themselves are 32 bits, and they have parameters, as in we need few bits to know which register destination we will use, and what is the instruction itself so we can execute the right sequence of micro instructions on the wires, enable this on the bus, disable that on the bus..

The machine code for liu x21, 0x80000 is 80000ab7. 10000000000000000000101010110111 in binary.

For addi addi x21, x21, 0x534 is 534a8a93, 01010011010010101000101010010011 in binary.

addi x21, x21, 0x534 means x21 = x21 + 0x534 and lui before that put 0x80000000 into x21 so we get 0x80000000 + 0x534.

You see addi has two registers as paramters, rd and rs1, the format is addi rd, rs1, 12 bit value, in out case both rd and rs are the same, 21, or in binary 10101, you can see those in the machine code. The left most 12 bits of the instructions are the actual value we will add to rs1 and the result will be stored in rd. You can see 000 those are also part of the instruction, actually 0010011 just means integer instruction, 000 means addi, 111 means andi 110 ori and etc, its just different kind of integer operations, if you remember 74LS181 how you can control what exact operation it does with S 0 1 2 3, so I think thats why they decided to put integer instructions closer together, so you can decode the fact that its integer operation and then route the operation kind to the wires.

OK now, we have to come up with a function that when give the parameter 21 and 0x80000534, it produces the numbers 80000ab7 and 534a8a93.

The recipe is quite straight forward, but there is a slight complication with the sign extention.

This is the snippet of the code, with ridiculous amount of comments.

# Input:
#   a0 = destination register number (e.g., 21 for x21/:XT)
#   a1 = immediate value we want to load (e.g., 0x80000534)
# Output:
#   a0 = LUI instruction machine code
#   a1 = ADDI instruction machine code
do_li:
  # For example, for li x21, 0x80000534:
  # 0x80000534 = 1000 0000 0000 0000 0000 0101 0011 0100

  # First, handle the upper bits for LUI
  srli t0, a1, 12        # Shift right by 12 to get upper 20 bits
                         # 0x80000534 >> 12 = 0x80000
                         #                    1000 0000 0000 0000 0000
  # Check if we need to adjust for sign extension
  # This is needed because ADDI sign-extends its 12-bit immediate
  li t3, 0x800          # 0x800 = 1000 0000 0000
  and t1, a1, t3        # Check bit 11 of original value
                        # If bit 11 is 1, ADDI will sign-extend negatively
                        # So we need to add 1 to upper bits to compensate

  beqz t1, no_adjust    # If bit 11 is 0, no adjustment needed
  addi t0, t0, 1        # Add 1 to upper bits to compensate for sign extension

no_adjust:
  # Build LUI instruction: lui rd, imm
  # Format: [imm[31:12]] [rd] [0110111]
  #         [20 bits   ] [5 ] [7 bits ]
  li a2, 0x37           # 0x37 = 0110111 = LUI opcode
  slli t2, t0, 12       # Shift immediate to bits 31:12
  or a2, a2, t2         # Combine with opcode
  slli t2, a0, 7        # Shift rd (dest reg) to bits 11:7
  or a2, a2, t2         # Combine with prev result

  # Example for x21, 0x80000534:
  # LUI x21, 0x80000 becomes:
  # imm=10000000000000000000 rd=10101 opcode=0110111
  # = 1000 0000 0000 0000 0000 1010 1011 0111 = 0x80000ab7

  # Build ADDI instruction: addi rd, rs1, imm
  # Format: [imm[11:0]] [rs1] [000] [rd] [0010011]
  #         [12 bits  ] [5  ] [3  ] [5 ] [7 bits ]
  li a3, 0x13           # 0x13 = 0010011 = ADDI opcode
  li t1, 0xfff          # Mask for lower 12 bits
  and t0, a1, t1        # Get lower 12 bits of immediate
  slli t2, t0, 20       # Shift immediate to bits 31:20
  or a3, a3, t2         # Combine with opcode
  slli t2, a0, 15       # Shift rs1 (source reg) to bits 19:15
  or a3, a3, t2         # Combine with prev result
  slli t2, a0, 7        # Shift rd (dest reg) to bits 11:7
  or a3, a3, t2         # Combine with prev result

  # Example for x21, 0x80000534:
  # ADDI x21, x21, 0x534 becomes:
  # imm=010100110100 rs1=10101 f3=000 rd=10101 opcode=0010011
  # = 0101 0011 0100 1010 1000 1010 1001 0011 = 0x534a8a93

  mv a0, a2             # Return LUI instruction in a0
  mv a1, a3             # Return ADDI instruction in a1

We call the function like so:

    li a0, 21
    li a1, 0x80000534
    jal do_li

    # a0 contains LUI
    # a1 contains ADDI

We will use this function to do both li :XT, HERE + 20 and la t0, DOCOL

COLON first creates the base of the word, where we have its link, length, token, and flags. :HERE is a register that points to the last value we added to the dictionary, we keep moving it as we add more and more values.

COLON:
  ...
  # word is created, HERE points to just after the flags

  mv t0, :HERE
  add t0, t0, 4 # t0 = HERE + 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4

Then we store here + 4 at memory[here] and increment here +4 for the next write.

Compute HERE + 20, that is the address that we want to put in :XT so DOCOL moves :IP to it, and then generate the lui and addi machine code for it.

mv t0, :HERE
addi t0, t0, 20 # t0 = HERE + 20

# 3.1 Generate machine code for XT = HERE + 20 at time of compilation
li a0, 21 # XT is s5, which is register x21
mv a1, t0
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4

After that we do the same but we want to put DOCOL's address in t0

li a0, 5 # t0 is x5
la a1, DOCOL
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4

Now we have written li :XT, HERE+20 and la t0, DOCOL, next we want to write the machine code for jalr zero, 0(t0)

jalr zero, 0(t0) is 0x28067 or 00000000000000101000000001100111.

For our purposes this is actually a constant value, as none of the parameters change, t0 is always the 5th register or 00101, zero or x0 is always 00000, and the offset is alwas 000000000000, we dont have to recompute it, it will always be 0x28067, but we will do it anyway.

# Input:
#   a0 = register number to jump to (e.g., 5 for t0)
# Output:
#   a0 = JALR instruction machine code to jump to that register
do_jr:
  mv t0, a0            # Save register number in t0

  # We want to generate: jalr x0, reg, 0
  # This means: jump to address in 'reg', don't save return address
  #
  # JALR instruction format:
  # [imm[11:0]] [rs1] [000] [rd] [1100111]
  # [12 bits  ] [5  ] [3  ] [5 ] [7 bits ]
  #
  # For jr, we want:
  # - imm = 0 (no offset to add to jump address)
  # - rs1 = input register (where to jump to)
  # - funct3 = 000 (JALR variant)
  # - rd = x0 (don't save return address)
  # - opcode = 1100111 (0x67) (JALR opcode)
  #
  # Example for jr t0 (x5):
  # imm=000000000000 rs1=00101 000 rd=00000 1100111
  # = 0000 0000 0000 0010 1000 0000 0110 0111
  # = 0x00028067

  slli t1, t0, 15      # Shift register number to rs1 position (bits 19:15)
                       # e.g., 5 << 15 = 0x00028000

  li   t2, 0x67        # Load JALR opcode (0x67 = 1100111)

  or   t1, t1, t2      # Combine register bits with opcode
                       # e.g., 0x00028000 | 0x67 = 0x00028067
                       # The middle zeros are:
                       # - imm[11:0] = 0 (bits 31:20)
                       # - funct3 = 0 (bits 14:12)
                       # - rd = 0 (bits 11:7)

  mv a0, t1            # Return final instruction

  ret

In COLON we use it like this:

# 3.2 Generate machine code for jr t0
li a0, 5 # t0 is x5
jal do_jr
sw a0, 0(:HERE) # jr
addi :HERE, :HERE, 4

Now when COLON finishes, :HERE points just after the jr, so the execution toknes will be added just below, whatever we want DUP MUL etc, as we are parsing the tokens since we are in compile mode, we will keep adding execution toknes to the thread, until ; is executed, and you see ; is immediate word, which means it is executed in compile mode, and what it does is it adds the execution token EXIT to the end of the word, it actually just adds it to wherever :HERE is, which is at the end of the current word, and moves back intro evaluation mode.

SEMICOLON:
  mv :MODE, zero # exit compile mode
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT

You might notice HERE is a bit like a stack for the dictionary memory, we just keep pushing. We can do soooo much with it, for example as we are creating a new word, we can store here on the data stack, then we can use it as a parameter to the jump word. or we can use a placeholder and go back to patch with a value. This is how we will create all kinds of control flow logic in Forth, from if to loops.

Again there are many many ways to make a Forth, what is the absolute minimum needed to build a complete and expressive language, which words are fundamental? I am actually quite new to Forth, learned about it few months ago, but I got excited by this exact question. In math for example I can say a = a, which means a thing must be equal to itself, 5 = 5, 3.2 = 3.2, there cant be anything more fundamental than it right? But what about a - a = 0, as 5 - 5 = 0, if we say that a thing subtracted from itself gives nothing, it follows that a thing is equal to itself, so which one is more fundamental? Are they not the same? The symbolic manipulation, when evaluated, must be evaluated in context of the evaluator. For example there could be a system where a = a is broken, imagine the evaluator is evaluating the expression symbol by symbol, and there is some temporal nature to a as in a changes with time, how would it know that it is the same a? How much time would it take for the expression to be evaluated, and by the time it is done, how would we know that a is the same as it was? So we abstract it away, we pretend that the evaluation is instant, and there are things in our universe that are like that. For example gravity is instant, it seems, once object moves its field moves with it, electricity however is not instant, once you move the electron, the electric field takes time realize, wait.. my electron moved, I have to move. Quite strange that we have both instant and non instant evaluations of the fundamental forces. Anyway.. a = a needs some contex when you are evaluating it, you must know its surroundings, it itself is nothing, but it plus its evaluator together, they are something!

You see in this language, symbols exist in so many layers, for example I can expose the most primitive words PUSH and POP, that all use the same temporary storage, t0 for example,, and then make DUP a word : dup pop push push ;, push and pop are few machine code instructions, which are then few sequential micro instructions, voltage or no voltage on some wires driven by a clock. How related are the wires to our dup word? We could make a dup word from biological cells, or from dominos, we can make it with water and valves as well. It seems that programming languages live somewhere else, not exactly in the machine, not exactly in their syntax, not exactly in their grammar, and not exactly in the programmer, what a weird place must that be.

In our Forth we have the words jr and li, they take values from the stack and push the assembled instructions into the stack. We also have the word here that pushes the current location of the word we are compiling, and of course we have ! that can write to any memory location.

  # JR ( reg -- opcode_jr )
JR:
  POP a0
  call do_jr
  PUSH a0
  j NEXT

  # ( reg imm -- lui addi )
LI:
  POP a1   # imm
  POP a0   # reg

  call do_li

  PUSH a0 # lui
  PUSH a1 # addi

  j NEXT

We could write machine code from forth itself with cleaver manipulations something like here 20 + dup dup 5 12345 li rot ! swap 4 + ! would write li t0, 12345, lui at here + 20 and and addi at here + 24. So now it is even harder to say what is the language, and what is the machine. As Ada Lovelance said, the limit is in us, what we can think of, the possibilities are endless.


We will build few more forth functions that allow us to manipulate the return stack and we will add one more stack, called control flow stack to save jump addressess for if and else, and we will add the ability for a word to write bytecode in the word that is compiling it, you will see how powerful that is.

We will add a macto to PUSH and POP from the control flow stack, we will use register s3 for :CSP, and we will setup some stack space after the return stack.

  .macro CFPUSH reg
  addi :CSP, :CSP, -4
  sw \reg, 0(:CSP)
  .endm

  .macro CFPOP reg
  lw \reg, 0(:CSP)
  addi :CSP, :CSP, 4
  .endm

forth:
  la :CSP, CONTROL_FLOW_STACK_END
  ...
...

  .space 2048
FORTH_STACK_END:
  # forth return stack
  .space 2048
RETURN_STACK_END:
  # forth control flow stack
  .space 2048
CONTROL_FLOW_STACK_END:

Few words are needed to copy data from the return stack into the data stack and from vice versa. r> pops from the return stack and pushes to data stack, >r pops from the data stack and pushes to the return stack and r@ copies the top element from the return stack and pushes it to the data stack, leaving the return stack unchainged. We have the same for the control flow stack, cf> >cf cf@

  # ( x -- ) (R: -- x)
TO_R:
  POP t0
  RPUSH t0
  j NEXT

  # ( -- x ) (R: x -- )
FROM_R:
  RPOP t0
  PUSH t0
  j NEXT

  # ( -- x ) (R: x -- x)
R_FETCH:
  lw t0, 0(:RSP)
  PUSH t0
  j NEXT

  # ( x -- ) (CF: -- x)
TO_CF:
  POP t0
  CFPUSH t0
  j NEXT

  # ( -- x ) (CF: x -- )
FROM_CF:
  CFPOP t0
  PUSH t0
  j NEXT

  # ( -- x ) (CF: x -- x)
CF_FETCH:
  lw t0, 0(:CSP)
  PUSH t0
  j NEXT

...
word_to_r:
  .word ...
  .word 2
  .ascii ">r\0\0"
  .word 0
  .word TO_R

word_from_r:
  .word word_to_r
  .word 2
  .ascii "r>\0\0"
  .word 0
  .word FROM_R

word_r_fetch:
  .word word_from_r
  .word 2
  .ascii "r@\0\0"
  .word 0
  .word R_FETCH

word_to_cf:
  .word word_r_fetch
  .word 3
  .ascii ">cf\0"
  .word 0
  .word TO_CF

word_from_cf:
  .word word_to_cf
  .word 3
  .ascii "cf>\0"
  .word 0
  .word FROM_CF

word_cf_fetch:
  .word word_from_cf
  .word 3
  .ascii "cf@\0"
  .word 0
  .word CF_FETCH

The other two very important words we will add are postpone and immediate

  # ( -- )
IMMEDIATE:
  li t1, 1
  sw t1, 12(:LATEST) # flag value
  j NEXT

POSTPONE:
  jal do_next_token
  jal do_parse_token
  jal do_find
  beqz a0, .L_word_not_found

  la t1, LIT
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4

  lw a0, 0(a0) # dereference
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4

  la t1, COMMA
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4

  j NEXT

.L_word_not_found:
  la a0, err_word_not_found
  j panic

...

word_immediate:
  .word ...
  .word 9
  .ascii "imme"
  .word 0
  .word IMMEDIATE

word_postpone:
  .word word_immediate
  .word 8
  .ascii "post"
  .word 1 # immediate
  .word POSTPONE

immediate sets the flag of the latest word to 1, so the interpreter will execute it in compile time instead of embedding its execution token in the thread of the word being compiled. Its pretty straight forward, once we create a word with do_create we update the :LATEST register, so it always points to the right place, and LATEST + 12 is the exact place of the flag field.

postpone however is a bit more subtle, I mean it is easy when you read it, it compiles LIT, execution token, COMMA into the compiled word's thread.

: begin
   here
   >cf
; immediate

: again
   postpone jump
   cf>
   ,
; immediate

: forever
  begin
    1 . cr
  again
;

forever

begin-again loops are infinite loops in forth, there is no way to exit them, we build begin and again just with here, jump , and postpone, I am using the control flow stack instead of the return stack, because of the way I made EXIT work, and begin's EXIT will pop the wrong value from the return stack, so we use the return stack only for subroutines and do-loops. Lets see what gets compiled in the threads.

After compilation begin's bytecode looks like this in memory:

80000e40 <word_begin>:
80000e40:   80000e10    .word 0x80000e10   # link to previous word
80000e44:   00000005    .word 0x5          # length
80000e48:   69676562    .ascii "begi"      # token
80000e4c:   00000001    .word 0x1          # immediate flag
80000e50:   80000e54    .word 0x80000e54   # code field

80000e54:   80001537    lui     a0,0x80001 # jit code
80000e58:   0f450513    addi    a0,a0,244
80000e5c:   800002b7    lui     t0,0x80000
80000e60:   53428293    addi    t0,t0,1332
80000e64:   00028067    jr      t0

80000e68:   80000534    .word DOCOL
80000e6c:   80000678    .word PUSH_HERE
80000e70:   80000690    .word TO_CF
80000e74:   800004f8    .word EXIT

Those addressess are just some plausible numbers, but this is very effective method to think like the computer, just pick some number and imagine values there, on those addresses. I usually pick small numbers, like 1042 or something, but now I want to make numbers kind of consistend with what you would see from objdump.

Again's code will look a bit weird at first, but thats OK.

80000e78 <word_again>:
80000e78:   80000e40    .word 0x80000e40   # link to previous word
80000e7c:   00000005    .word 0x5          # length
80000e80:   69616761    .ascii "agai"      # token
80000e84:   00000001    .word 0x1          # immediate flag
80000e88:   80000e8c    .word 0x80000e8c   # code field

80000e8c:   80001537    lui     a0,0x80001 # jit code
80000e90:   0f450513    addi    a0,a0,244
80000e94:   800002b7    lui     t0,0x80000
80000e98:   53428293    addi    t0,t0,1332
80000e9c:   00028067    jr      t0

80000ea0:   80000534    .word DOCOL
80000ea4:   800000b0    .word LIT
80000ea8:   8000032c    .word JUMP         # address of JUMP
80000eac:   8000057c    .word COMMA
80000eb0:   80000698    .word FROM_CF
80000eb4:   8000057c    .word COMMA
80000eb8:   800004f8    .word EXIT

This is what postpone does, it will add LIT, X, COMMA to the bytecode, LIT X will put the value X on the stack, and COMMA will write the value on the stack to memory at location :HERE, and will move :HERE += 4, so LIT,X,COMMA is the same as memory[:HERE] = X.

Pretend you are the outer interpreter, in compile mode, compiling the word forever, go step by step. First : creates a dictionary word for the next token, which is forever, it creates just the basic word, link, length, token, flags, :HERE points to the end of it. Next token is begin, you lookup the word, find it in the dictionary, see its flag is immediate, you jump to it to execute it, it executes here which will put the value of :HERE on the stack, and then >cf, will pop the data stack and push to the control flow stack. In the end of executing begin nothing new has been added to the forever word's thread, then we have 1 . cr which gets compiled to

80000f00 <word_forever>:
80000f00:   80000e78    .word 0x80000e78   # link to previous word
80000f04:   00000008    .word 0x8          # length
80000f08:   65726f66    .ascii "fore"      # token
80000f0c:   00000000    .word 0x0          # flag
80000f10:   80000f14    .word 0x80000f14   # code field

80000f14:   80001537    lui     a0,0x80001 # jit code
80000f18:   0f450513    addi    a0,a0,244
80000f1c:   800002b7    lui     t0,0x80000
80000f20:   53428293    addi    t0,t0,1332
80000f24:   00028067    jr      t0

80000f28:   80000534    .word DOCOL
80000f2c:   800000b0    .word LIT   # <- HERE when begin was executed
80000f30:   00000001    .word 0x1
80000f34:   80000134    .word EMIT
80000f38:   800000f0    .word CR
80000f48:   00000000    .word  ____ # <- HERE before again is executed

Finding the execution tokens one by one in the dictionary. Now we reach again. and HERE points to the end of the current definition, so again has LIT, ADDR OF JUMP, COMMA which will write the address of JUMP to the location of HERE. And our forever word will look like this:

80000f28:   80000534    .word DOCOL
80000f2c:   800000b0    .word LIT
80000f30:   00000001    .word 0x1
80000f34:   80000134    .word EMIT
80000f38:   800000f0    .word CR
80000f3c:   8000032c    .word JUMP
80000f48:   00000000    .word  ____ <--- HERE

We continue to execute again, next we have FROM_CF and COMMA, FROM_CF will pop from the value begin stored in the control flow stack and push to the data stack, then comma will write it to the location at HERE.

This is how the forever word would look after again is executed.

80000f00 <word_forever>:
80000f00:   80000e78    .word 0x80000e78   # link to previous word
80000f04:   00000008    .word 0x8          # length
80000f08:   65726f66    .ascii "fore"      # token
80000f0c:   00000000    .word 0x0          # flag
80000f10:   80000f14    .word 0x80000f14   # code field

80000f14:   80001537    lui     a0,0x80001 # jit code
80000f18:   0f450513    addi    a0,a0,244
80000f1c:   800002b7    lui     t0,0x80000
80000f20:   53428293    addi    t0,t0,1332
80000f24:   00028067    jr      t0

80000f28:   80000534    .word DOCOL
80000f2c:   800000b0    .word LIT          # <--------------------.
80000f30:   00000001    .word 0x1                                 |
80000f34:   80000134    .word EMIT                                |
80000f38:   800000f0    .word CR                                  |
80000f3c:   8000032c    .word JUMP                                |
80000f40:   80000f2c    .word 0x80000f2c   # jumps back to LIT ---'
80000f44:   800004f8    .word EXIT
80000f48:   00000000    .word  ____ <--- HERE

Pretty cool right?

: until
   postpone 0branch
   cf>
   ,
; immediate

Using almost the same code we can implement begin-until, at the end of the loop we check if the stack has 0 and if it does we break.

: test-until
  begin
    key
    dup . cr
    113 =
  until
;

test-until

This code for example will exit the begin loop if you press 'q' (ascii 113). Everything is the same as begin again but here we use BRANCH_ON_ZERO , so we will jump back only if there is 0 on the stack.

80001000:   80000534    .word DOCOL
80001004:   800006a8    .word KEY       <-------.
80001008:   80000140    .word DUP               |
8000100c:   80000134    .word EMIT              |
80001010:   800000f0    .word CR                |
80001014:   800000b0    .word LIT               |
80001018:   00000071    .word 113               |
8000101c:   80000264    .word EQUAL             |
80001020:   80000310    .word BRANCH_ON_ZERO    |
80001024:   80001004    .word 0x80001004 -------'
80001028:   800004f8    .word EXIT

We will use similar methods to create for loops, if, else, while, until etc, and then we will have quite expressive language, that is build from itself.

The code that is compiled is quite efficient, we did a lot of work during the compilation but the actual bytecode of the word is just the things needed for the BEGIN AGAIN to work, there is no control flow stack shananigans there, just JUMP to specific address.

We will make a program that waits forever for the key 'q' to be pressed and if it is it quits the program.

: if
   postpone 0branch
   here
   0
   ,
   >cf
; immediate

: then
   here
   cf>
   !
; immediate

: forever
   begin
    key dup 113 = if
      bye
    then
   . cr
   again
;

forever

First we need if and then, it is quite similar to begin again, but instead of jump we need to use BRANCH_ON_ZERO

Start by thinking how forever's bytecode should look, we have the old unconditionkal jump to the top, but inside we have a branch that jumps over the if content if the top of the stack is 0.

80001000:   80000534    .word DOCOL
80001004:   800006a8    .word KEY   <-----------.
80001008:   80000140    .word DUP               |
8000100c:   800000b0    .word LIT               |
80001010:   00000071    .word 113               |
80001014:   80000264    .word EQUAL             |
80001018:   80000310    .word BRANCH_ON_ZERO    |
8000101c:   80001024    .word 0x80001024  ---.  |
80001020:   800000f0    .word BYE            |  |
80001024:   80000134    .word EMIT  <--------'  |
80001028:   800000f0    .word CR                |
8000102c:   8000032c    .word JUMP              |
80001030:   80001004    .word 0x80001004  ------'

Now how would we construct that at compile time, if must store the location of the 0branch's jump argument, 8000101c in this case somewhere, then leave an empty placeholder cell, when then is compiled it it has to write the value of HERE at the placeholder, so that BRANCH_ON_ZERO will jump over the if block.

: if
   postpone 0branch \ put BRANCH_ON_ZERO in the word that is being compiled
   here \ push the current end of the word's bytecode
   0 ,  \ write a placeholder with value 0 and move HERE + 4
   >cf  \ store the previous HERE location in CF stack
; immediate

: then
   here \ push the current end of the word's bytecode
   cf>  \ pop the placeholder location from CF stack
   !    \ write the value of here into the placeholder
; immediate

So again, if prepares a placeholder for BRANCH_ON_ZERO and stores the placeholder's address on CF, then loads the placeholder's address and inside of it writes the last bytecode address in it.

else is a bit more involved.

: else
   postpone jump
   here
   0
   ,
   here
   cf>
   !
   >cf
; immediate

: forever
   begin
    key dup 113 = if
     drop
     0 . cr
     bye
    else
     . cr
    then
   again
;

forever

Now we will change the program a bit, it will wait for 'q' and exit but before that it will drop the top of the stack and print 0, otherwise we will print the ascii of the key.

80001000:   80000534    .word DOCOL
80001004:   800006a8    .word KEY <-----------------------.
80001008:   80000140    .word DUP                         |
8000100c:   800000b0    .word LIT                         |
80001010:   00000071    .word 113                         |
80001014:   80000264    .word EQUAL                       |
80001018:   80000310    .word BRANCH_ON_ZERO              |
8000101c:   80001034    .word 0x8000103c ----------.      |
80001020:   80000138    .word DROP                 |      |
80001024:   800000b0    .word LIT                  |      |
80001028:   00000000    .word 0                    |      |
8000102c:   80000134    .word EMIT                 |      |
80001030:   800000f0    .word CR                   |      |
80001032:   800000f0    .word BYE                  |      |
80001034:   8000032c    .word JUMP                 |      |
80001038:   80001044    .word 0x80001044 -----.    |      |
8000103c:   80000134    .word EMIT <----------|----'      |
80001040:   800000f0    .word CR              |           |
80001044:   8000032c    .word JUMP <----------'           |
80001048:   80001004    .word 0x80001004 -----------------'

You can do this, take your time, take pen and paper, and think through it. I will just give you the high level overview of what is going onif adds branch_on_zero and leaves a placeholder address on the control flow stack to be patched later, else adds jump and creates another placeholder to be patched by then, it also patches the if's placeholder to point at here, so if if fails we jump into the else code, after that then patches the placeholder with here. And so if you follow the bytecode, if the user presses 'q' which has ascii 113, EQUAL pushes -1 to the top of the stack, BRANCH_ON_ZERO pops it and does not jump, because it is not zero, then we execute the code inside the if, in the end of the code you see we have the JUMP left there by else, which jumps over the code we had in the else block . cr in our case. However if the user presses any other key than 'q', EQUAL will push 0 to the stack and BRANCH_ON_ZERO will jump over the if block into 8000103c, where we have the else code . cr. And we still have the begin again jump that jumps back to the top.

One thing you might notice, is what happens if we call if not in compile mode?

3 0 = if bye then

What would this do? Well, worse than nothing really, it will add junk to wherever HERE is pointing to, at the end of the last defined word. Cnomplete Forth implementations will warn when calling certain words while not in compile mode.

If you try this code in gforth you will see the warning.

3 0 = if bye then
*terminal*:1:7: warning: IF is compile-only
*terminal*:1:7: warning: Compiling outside a definition

Our purpose is not to make a complete Forth implementation, but to understand the very core of symbol manipulation and programming languages, so we will leave ours as simple as possible.

Lets add loops.

  # ( limit index -- R: limit index)
DO_DO:
  POP t0            # t0 = index
  POP t1            # t1 = limit
  RPUSH t1          # limit
  RPUSH t0          # index
  j NEXT


  # ( R: limit index -- R: limit index )
DO_LOOP:
  RPOP t0    # pop index
  RPOP t1    # pop limit
  addi t0, t0, 1
  blt t0, t1, .L_do_loop_jump # if limit < index

  # skip over the jump address
  addi :IP, :IP, 4
  j NEXT
.L_do_loop_jump:
  # push them back on Rstack if still looping
  RPUSH t1   # push limit
  RPUSH t0   # push index
  # read the jump address from IP (the next cell in the thread)
  lw :IP, 0(:IP)
  j NEXT

LOOP_I:
  lw t0, 0(:RSP)
  PUSH t0
  j NEXT

LOOP_J:
  lw t0, 8(:RSP)
  PUSH t0
  j NEXT

LOOP_K:
  lw t0, 16(:RSP)
  PUSH t0
  j NEXT

...

word_do_do:
  .word ...
  .word 4
  .ascii "(do)"
  .word 0
  .word DO_DO

word_do_loop:
  .word word_do_do
  .word 6
  .ascii "(loo"
  .word 0
  .word DO_LOOP

word_i:
  .word word_do_loop
  .word 1
  .ascii "i\0\0\0"
  .word 0
  .word LOOP_I

word_j:
  .word word_i
  .word 1
  .ascii "j\0\0\0"
  .word 0
  .word LOOP_J


word_k:
  .word word_j
  .word 1
  .ascii "k\0\0\0"
  .word 0
  .word LOOP_K

And we need to add support to put (do) and (loop) into the compiled word.

: do
  postpone (do)
  here
  >cf
; immediate

: loop
  postpone (loop)
  cf>
  ,
; immediate


: test-simple-loop
  10 0 do
    i . cr
  loop
;
test-simple-loop

The syntax for loops in Forth is LIMIT INDEX DO ... LOOP, the code ... will be executed until INDEX is less than LIMIT, so 10 0 means start from 0 and go up to 9, 10 3 means start from 3 and go to 10, and so on -30 -40 means start from -40 and go up to -31, or 0 -5 start from -5 and go up to -1.

Inside the loop you have access to the word i, if you have nested loop you get j and k. First I will explain how do loops work and then will discuss how ijk works.

Again, when you want to understand something, start from itself, how does test-simple-loop look in memory:

80001000:   80000534    .word DOCOL
80001004:   800000b0    .word LIT
80001008:   00000000    .word 10
8000100c:   800000b0    .word LIT
80001010:   0000000a    .word 0
80001014:   80000714    .word DO_DO
80001018:   80000734    .word LOOP_I  <-------. `here` at time of :do
8000101c:   80000134    .word EMIT            |
80001020:   800000f0    .word CR              |
80001024:   800000b4    .word DO_LOOP         |
80001028:   80001018    .word 0x80001018 -----'
8000102c:   800004f8    .word EXIT

do will embed DO_DO into the thread of the compiled word. DO_DO will push the values from the data stack into the return stack, pretty clear, it will also push here to the control flow stack so later DO_LOOP can jump back if we have not reached the LIMIT. loop will embed DO_LOOP into the thread, pop the address stored at do from the control flow stack and write it in the thread, DO_LOOP uses this address to jump to while the loop is still going, when the loop is done it simply does IP += 4 and continues.

DO_LOOP and DO_DO are using the return stack to update the index value, so if we have 3 nested loops our return stack looks like this:

+----------------+ <-- RSP
| innermost idx  | <-- i (offset 0)
| innermost lim  |
+----------------+
| middle idx     | <-- j (offset 8)
| middle lim     |
+----------------+
| outermost idx  | <-- k (offset 16)
| outermost lim  |
+----------------+

i, j and k are simply the values at memory[RSP], memory[RSP-8] and memory[RSP-16]. Depending where you use them they change meaning, for example

: test-simple-loop
  10 0 do
    i . cr
    20 15 do
      i . cr
    loop
  loop
;
test-simple-loop

Both i will look at memory[RSP], so inside each loop it always refers to itself, if you want to reach the index of the outer loop from the inner loop you need to use j, but if you try to use j in the outer loop it makes no sense.

That is not how it actually works in real Forth, but I think its OK since we know its limitation.

Now everything together:

: test-loop
   begin
    10 5 do
     35 30 do
      53 50 do
       i 52 = if
        i 2 + i do
         999999 . cr
        loop
       else
        i 2 + i do
         5555 . cr
        loop
       then
       i . cr
       j . cr
       k . cr
      loop
     loop
    loop
   again
;
test-loop

Few things are missing, but I think most importantly and the ability exit early from words and the ability to create variables and arrays, we are also missing a lot of quality of life improvements, like comments and strings.

I think this is great time for you to pause and think how would you implement those things, if the book ends right here. How would you do it? What would your Forth look like? By now you know there are infinite many ways to create something. My mind works in certain way, I like certain patterns and structures, sometimes I am willing to sacrifice beauty for performance, or for education, sometimes I sacrifice performance for what I think is elegance. That means nothing.

Look again at our tic tac toe program.

create board 9 allot

: board[] board + ;

: reset-board ( -- )
  9 0 do
    '-' i board[] c!
  loop
;

: print ( -- )
  3 0 do   \ j
    3 0 do \ i
      j 3 * i + board[] c@ emit
    loop
    cr
  loop
;

: check-line ( a b c -- flag )
  board[] c@ rot board[] c@ rot board[] c@
  dup '-' = if
    drop drop drop 0
  else
    over    \ a b c -> a b c b
    =       \ a b c==b
    rot rot \ c==b a b
    =       \ c==b a==b
    and     \ c==b && a==b
  then
;

: check-win ( -- )
  0 1 2 check-line if 1 exit then
  3 4 5 check-line if 1 exit then
  6 7 8 check-line if 1 exit then
  0 3 6 check-line if 1 exit then
  1 4 7 check-line if 1 exit then
  2 5 8 check-line if 1 exit then
  0 4 8 check-line if 1 exit then
  2 4 6 check-line if 1 exit then
  0
;

: play ( -- )
  'X' 'O'
  begin
    over emit ." 's turn" cr
    print
    over key '0' - board[] c!
    swap
    1 check-win = if
        print cr emit ."  wins" cr
        exit
    then
  again
;

reset-board play bye

The first time you saw it, it must'have been like seing alien language, now, you can see through it, you can understand even the symbols you have never seen, and imagine how would they work, what would ." or create board 9 allot do. This ability to say: "If I was creating this, how would I make it", requires both to deeply believe in yourself, to understand what you dont understand, to listen to yourself and to have courage to dive into your doubt. The shadow of doubt stretches long through the graph of knowledge. You can not swim in the sea of doubt if you do not believe you will get through it, and I promise you, you will get through, you just have to listen carefully, as the doubt only whispers.

Ignorance is required for understanding, as ignorance allows you to do impossible things, and understanding is impossible at first, anything that you have understood seem simple, but it seemed impossible before. Just look at this line over = rot rot = and, looks like absoloute nonsense, I guarantee you that if you saw it before reading this chapter you would've said it is impossible to understand this alien technology. Curiocity is required as well, otherwise over = rot rot = and will stay just meaningless string of characters, unless you get uncontrollable desire to demistify it, which is what curiocity is, desire to understand.

When you are creating your language, or any program, you know the machine, you know yourself, you know what is important for you, and what is important for the machine. This is not how modern programmers think, they are thought in school about design patterns and separation of concerns, SOLID principles and so on, how to make composable, maintainable software, how to manage complexity, how to work on the same project with a thousand other people chainging the same code, or a million other people. I think this has nothing to do with programming computers. To use a computer means to program it to do what you want. If someone else makes a program for you, it does not matter if 1 programmer made it or 1 million programmers worked on it, it will always be incomplete, as they have to guess what you want from the computer, but only you know. It is the same when you make a chair, or a bed, or a spoon, or a cup of coffee.

Ignorance will allow you to make a better cup of coffee for you.

When you drink coffee you have two choices, you can say 'those peole are experts, they make coffee all their life, they have read research from scientists, they know everything that is to know about coffee, this must be the best coffee that humanity will ever make', or you could be ignorant, and curious, and say 'It must be possible to make a better cup of coffee, I will try to make one'.

Lets go back to our Forth, I will take some shortcuts, as I am also learning how to write Forth as I am writing this book, and by now the chapter is too long. First I will take advantage that we have 11 save registers in RISCV, the s registers are normal general purpose registers but the convention is if you use them in your function you have to save and restore them, like we do with ra, so I will use more registers to track more stacks, instead of having the return stack holding both the loop limit/index it also holds the return address for the word, and now if we want to do early exit, we must unroll all the loops to get to the return address to jump to,

Imagine we change the program : test 10 0 do i . cr loop ; to : test 10 0 do i dup . cr 5 = if exit then loop ; so we return from test early if i becomes 5.

: test
   10 0 do
      i dup . cr
      5 = if
         drop exit
      then
   loop
;

80001000:   80000534    .word DOCOL
80001004:   800000b0    .word LIT
80001008:   00000000    .word 10
8000100c:   800000b0    .word LIT
80001010:   0000000a    .word 0
80001014:   80000714    .word DO_DO
80001018:   80000734    .word LOOP_I  <-------.
                        .word DUP             |
8000101c:   80000134    .word EMIT            |
80001020:   800000f0    .word CR              |
                        .word LIT             |
                        .word 5               |
                        .word BRANCH_ON_ZERO  |
                        .word 0x80001024---------.
                        .word DROP            |  |
                        .word EXIT            |  |
80001024:   800000b4    .word DO_LOOP <-------|--'
80001028:   80001018    .word 0x80001018 -----'
8000102c:   800004f8    .word EXIT

EXIT just pops IP from the return stack, but now we have do-loop's limit and index there as well, so we cant just pop once, we need to pop pop pop to get to the actual value, but how do we know how many do loops we have on top of each other? This of course can be done if we store more information on the return stack, like instead of just limit/index we store 'limit, index, 7' and when we jump into a subroutine we dont just store the Instruction Pointer but we store some sort of tag about what the value means, e.g. 'IP, 9', 9 means subroutine return address, 7 means do-loop loop data, and then if we want to EXIT we just start popping until we see the first 9 and then we know we have reached the closest exit address. You can do everything with cleaver stack manipulations.

We can also just split the stacks, so the subroutine stack is separate and it only contains return addresses, since our computer has many registers things will be fast, BTW, I am using registers, but you can do it with fewer registers and just storing the top of the stack on some memory location, just every stack push and pop will require more instructions to do.

We will call this the exit stack.

[ FIXME ]

It is not often that we know exactly what we want, in this case I want to be able to exit early from a word, I know the code I have written so far, I know the machine, I know I could do it in many ways, but now, I feel like adding few more stacks, so each stack has its own purpose, this is not what common Forth interpreters do, but, I am not making a "common" forth interpreter am I? I am writing code to learn about Forth, and in the same time to teach you about it. At this very moment, I dont know if this is a good idea or not to add more stacks, it seems fine, but in few chapters I might need registers to store something and I wont have any available, and that is OK, its OK to not know if a decision is good or bad, just listen to your intuition, try to have as much foresight as you can, and then go, if things become unmaintainable or the price you pay for a bad decision is too high, you must promise yourself that you will go back and fix all the broken things. This allows you to not overthink, few people are prophets of complexity, some can see much further than others, like Rob Pike, or Ken Thompson, but most people like me can see a bit further from their nose. Make mistakes, go back and fix them, then make more mistakes. This is the way. Some mistakes require you to start from scratch, and you have to allow yourself to do so, do not overestimate the work needed to start from scratch. In life this is not the case, almost always the future holds infinite possibilities, and the past is closed, but when you design systems, every decision constrains the future possibilities, but the past remains open.

Can you imagine, after writing million lines of code, thinking "NOOO, I need one more register, I could've had it if 5 years ago I didnt split the stacks", and I will be honest, this happens more often than you think, you find a work around, and move on, but deep down you know, you have made a grave mistake that will haunt you forever while you are adding code to this project. But next time, for the next project, your foresight will extend just a bit further than before, and then again, and again, mistake after mistake. This is how we grow.

Lets add the new exit stack, :ESP will be s8.

  .macro EPUSH reg
  addi :ESP, :ESP, -4
  sw \reg, 0(:ESP)
  .endm

  .macro EPOP reg
  lw \reg, 0(:ESP)
  addi :ESP, :ESP, 4
  .endm

forth:
  ...
  la :ESP, EXIT_STACK_END
  ...

...
CONTROL_FLOW_STACK_END:
  .space 2048
EXIT_STACK_END:

Changing DOCOL and EXIT to use the exit stack instead of the return stack. We also add unloop in case you want to exit from a loop you want to clear the return stack, otherwise it will contain left over garbage.

DOCOL:
  EPUSH :IP
  mv :IP, :XT
  j NEXT

EXIT:
  EPOP :IP
  j NEXT

UNLOOP:
  RPOP zero
  RPOP zero
  j NEXT

Add the words exit and unloop to the dictionary

word_exit:
  .word ...
  .word 4
  .ascii "exit"
  .word 0
  .word EXIT
word_unloop:
  .word word_exit
  .word 6
  .ascii "unlo"
  .word 0
  .word LOOP_UNLOOP

That was easy, exit works now, and we can exit early from words.

: wait
  begin
    key dup 113 = if
      drop exit
    else
      . cr
    then
  again
;

wait bye

If you want to exit from within a loop you have to unloop it:

: wait
  10 0 do
    key dup 113 = if
      unloop drop exit
    else
      . cr
    then
  loop
;

wait bye

For our purposes that is enough, we just need to add the ability to create arrays and be able to manipulate bytes in memory.

We will add CREATE word that we can use for variables, we want when we use the word to get the address of its data field, we dont need to push :IP and we dont need DOCOL and EXIT, we just need to push the data field address to the stack. For that we will generate a bit different jit code than what we do in COLON. For the jit we need to add support for sw and addi.

Then we will add support for ALLOT which just moves :HERE some amount, so we can allocate memory space in the current word, for example 9 allot will just do :HERE = :HERE + 9, this creates a small problem because now the next word we create wont be aligned to be at address multiple by 4, which is a problem since we jump into jitted code of some words, so we will patch do_create to make sure it always increments :HERE up to the closest multiple of 4.

And we need to add byte level AT and BANG, called c@ and c!, which is the same as @ and ! but it uses lbu and sb instead of lw and sw, we also need few helper functions like the ability to print characters, and to do AND to check for multiple flags. We will expose ROT and OVER, we had them before but we only used them in the inner interpreter, now we will add them to the dictionary.

# update do_create to always create words at 4 byte boundary
do_create:
  addi sp, sp, -4
  sw ra, 0(sp)

  jal do_next_token
  jal do_parse_token
  beqz a1, .L_create_error

  # align to closest multiple of 4
  addi t0, :HERE, 3
  li t1, -4
  and :HERE, t0, t1
  ...
  # rest of do_create remains the same

# sw ( a0: rs1, a1: rs2 source -- a0: opcode_sw )
do_sw:
  # bits [31:25] = 0         (imm[11:5] = 0)
  # bits [24:20] = rs2       (source register to store)
  # bits [19:15] = rs1       (base address register)
  # bits [14:12] = 0x2       (funct3 for SW)
  # bits [11:7]  = 0         (imm[4:0] = 0)
  # bits [6:0]   = 0x23      (opcode for store)
  
  li a4, 0x23   # opcode
  li t0, 0x2000 # 2 << 12
  or a4, a4, t0 

  slli t0, a0, 15
  or a4, a4, t0

  slli t0, a1, 20
  or a4, a4, t0

  mv a0, a4
  ret

# addi ( a0: rd, a1: rs1, a2: imm -- a0: opcode_addi )
do_addi:
  # ADDI instruction format:
  # bits [31:20] = immediate
  # bits [19:15] = rs1       (source register)
  # bits [14:12] = 0x0       (funct3)
  # bits [11:7]  = rd        (destination register)
  # bits [6:0]   = 0x13      (opcode)

  li t0, 0x13            # ADDI opcode

  slli t1, a0, 7         # Shift rd to position [11:7]
  or t0, t0, t1

  slli t1, a1, 15        # Shift rs1 to position [19:15]
  or t0, t0, t1

  li t1, 0xfff
  and t2, a2, t1         # Mask to 12 bits
  slli t2, t2, 20        # Shift immediate to position [31:20]
  or t0, t0, t2

  mv a0, t0
  ret


CREATE:
  jal do_create

  # point the execution token to the machine code, same as COLON
  addi t0, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4

  # create foo
  #
  # 80000880: w_foo:
  # 80000880:       80000..# link
  # 80000884:       3      # size
  # 80000888:       "foo\0"# token
  # 8000088c:       0      # flags
  # 80000890:       80000894 # CODE FIELD >--------.
  # 80000894:       lui t0, HIGH(HERE+24)  <-------'   >-.
  # 80000898:       addi t0, t0, LOW(HERE+24) >-----------.
  # 8000089c:       addi SP, SP, -4                       |
  # 800008a0:       sw t0, 0(SP)                          |
  # 800008a4:       lui t0, HIGH(NEXT)                    |
  # 800008a8:       addi t0, t0, LOW(NEXT)                |
  # 800008ac:       jr t0                                 |
  # 800008b0:       <data field...>    <------------------'

  # li t0, :HERE
  # addi :SP, :SP, -4
  # sw t0, 0(SP)
  # la t0, NEXT
  # jr t0

  addi t1, :HERE, 28 # HERE + 28, 7 instructions 4 bytes each

  # li t0, value of :HERE + 28
  li a0, 5  # t0 is x5
  mv a1, t1 # HERE + 28
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4

  # addi :SP, :SP, -4
  li a0, 9  # :SP is s1, x9
  li a1, 9  # :SP is s1, x9
  li a2, -4
  call do_addi
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4

  # sw t0, 0(:SP)
  li a0, 9  # :SP is s1, x9
  li a1, 5  # t0 is x5
  call do_sw
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4

  # la t0, NEXT
  li a0, 5 # t0 is x5
  la a1, NEXT
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4

  # jr t0
  li a0, 5 # t0 is x5
  jal do_jr
  sw a0, 0(:HERE) # jr
  addi :HERE, :HERE, 4

  j NEXT

  # ( n -- )
ALLOT:
  POP t0
  mv a0, t0
  add :HERE, :HERE, t0
  j NEXT

  # ( addr -- value )
C_AT:
  POP t0
  lbu t0, 0(t0)
  PUSH t0
  j NEXT

  # ( value addr -- )
C_BANG:
  POP t0 # address
  POP t1 # value
  sb t1, 0(t0)
  j NEXT

# ( x1 x2 -- flag )
AND:
  POP t0
  POP t1
  # Check if either value is zero
  beqz t0, .L_false
  beqz t1, .L_false
  # Both non-zero, return TRUE (-1)
  li t0, -1
  PUSH t0
  j NEXT
.L_false:
  # At least one zero, return FALSE (0)
  mv t0, zero
  PUSH t0
  j NEXT

  # ( n -- )
EMIT_CHAR:
  POP a0
  jal putc
  j NEXT

  # ( a b -- c )
MINUS:
  POP t0
  POP t1
  sub t0, t1, t0
  PUSH t0
  j NEXT

...

word_create:
  .word ...
  .word 6
  .ascii "crea"
  .word 0
  .word CREATE

word_allot:
  .word word_create
  .word 5
  .ascii "allo"
  .word 0
  .word ALLOT

word_c_bang:
  .word word_allot
  .word 2
  .ascii "c!\0\0"
  .word 0
  .word C_BANG

word_c_at:
  .word word_c_bang
  .word 2
  .ascii "c@\0\0"
  .word 0
  .word C_AT

word_emit_char:
  .word word_c_at
  .word 4
  .ascii "emit"
  .word 0
  .word EMIT_CHAR

word_rot:
  .word word_emit_char
  .word 3
  .ascii "rot\0"
  .word 0
  .word ROT

word_over:
  .word word_rot
  .word 4
  .ascii "over"
  .word 0
  .word OVER

word_and:
  .word word_over
  .word 3
  .ascii "and\0"
  .word 0
  .word AND

word_minus:
  .word word_minus
  .word 1
  .ascii "-\0\0\0"
  .word 0
  .word MINUS

And this is a slightly modified tic-tac-toe, since we still dont support comments or strings, but it is close enough:

: begin
   here
   >cf
; immediate

: again
   postpone jump
   cf>
   ,
; immediate

: until
   postpone 0branch
   cf>
   ,
; immediate

: if
   postpone 0branch
   here
   0
   ,
   >cf
; immediate

: then
   here
   cf>
   !
; immediate

: else
   postpone jump
   here
   0
   ,
   here
   cf>
   !
   >cf
; immediate

: do
  postpone (do)
  here
  >cf
; immediate


: loop
  postpone (loop)
  cf>
  ,
; immediate

create board 9 allot

: board[] board + ;
: reset-board
  9 0 do
    45 i board[] c!
  loop
;

: print
  3 0 do
    3 0 do
      j 3 * i + board[] c@ emit
    loop
    cr
  loop
;

: check-line
  board[] c@ rot board[] c@ rot board[] c@
  dup 45 = if
    drop drop drop 0
  else
    over
    =
    rot rot
    =
    and
  then
;

: check-win
  0 1 2 check-line if 1 exit then
  3 4 5 check-line if 1 exit then
  6 7 8 check-line if 1 exit then
  0 3 6 check-line if 1 exit then
  1 4 7 check-line if 1 exit then
  2 5 8 check-line if 1 exit then
  0 4 8 check-line if 1 exit then
  2 4 6 check-line if 1 exit then
  0
;


: play
  88 79
  begin
    over emit  cr
    print
    over key 48 - board[] c!
    swap
    1 check-win = if
        print cr emit cr
        exit
    then
  again
;

reset-board play bye

And our game works!

X
---
---
---
O
X--
---
---
...

What a journey! What joy! We actually made a language from scratch! Now its time to make an operating system.