The Interpreter
Interpreted languages are executed indirectly by the machine, there is a program which reads your source code, and then executes it, but your program is never translated into machine code. In contrast with compiled languages which take your source code and make machine code out of it, then the program is loaded into RAM and the CPU jumps to it and starts executing instruction by instruction.
There are Forth compilers, and even Forth computers where the machine code is basically Forth, but we will make a lightweight Forth interpreter, as close to the metal as possible.
As with everything we will start small and build up, we need to execute this
program: 2 3 + 4 + . cr bye
. You havent seen the word .
so far in Forth it
means pop a value from the stack and show it on screen. In our SUBLEQ computer
we didnt have a screen, but you can imagine how we can create a circuit with a
grid of LEDs to and maybe few 28at64c eeproms to control them via their I/O pins
and a register that controls the eeproms's address lines, so we just set the
register to a value which will then set some address to the eeproms and then
they will enable or disable specific LEDs.
If we have 8x8 grid of leds, We could create the number 2 by enabling the right rows and disabling the right columns (to drive the LEDs to ground).
---**---
--*--*--
-*----*-
------*-
-----*--
----*---
--*-----
--*****-
The screen itself is beyond the scope of this book, though I encourage you to
look up the various ways to show pixels, from huge led arrays to oled screens,
eink, 7 segment displays, liquid crystal displays and so on. What is more
important for me is how does the CPU "talk" to a complicated circuit like a
screen. Or a keyboard or mouse for that matter. If you have enough wires between
the two components so that you can fit all the information in one go, you just
set them up, HIGH
, LOW
.. HIGH
.. whatever the information is, pulse a clock
so the other circuit knows to latch or use them however it sees fit, and it is
done, but if we want to send 'hello world' to a screen, and each character is 8
bits, we will need 88 wires plus 1 for the clock, so 89 wires to send it in one
go. Not that its not impossible to have that many wires, its just impractical.
We could build a circuit which expects the data to come piece by piece, so we
send it 'h', 'e', 'l', 'l', 'o', one by one, each time the clock pulses, the
screen will append the character to an internal buffer, maybe small RAM or few
registers depending on the size, and then display it. We might have few bit
patterns that tell the screen to clear the buffer, or maybe move the cursor so
that the next character will be displayed on a specific position. This is a
communication protocol
. A protocol
sounds a bit scary, but you know a lot of
social protocols, for example when you meet somebody you say 'hello', this is
expected of you, and you expect the other person to say 'hello' back. If they
dont, the social protocol is not followed, and there are some consequences and
the communication is broken (not always, but you see my point). A protocol is
just a series of expectations. Some protocols have extreme consequences and very
strict rules, for example, you must pay for what you buy from the store, or you
will go to jail. The circuit designer wants to make it as easy as possible for
us to use their circuit at its maximum potential, in the same time they have
certain limitations, cost of manufacturing for example, why do you think the
74LS181's outputs are inverted, I doubt it is just to annoy us. So the circuit
designer says 'ok, if you send this bit pattern, the circuit will do this, and
this is what you should expect, this is how long the clock pulse should be..'
and so on. If we follow the expected protocol, and the circuit is not damaged,
we should be able to display the information we want. And we have never met the
manufacturer, nor the designer. There could be hundreds, maybe thousands, of
people working on the parts of that circuit, and we never met any of them, we
just read few pages of text they wrote explaining the communication protocol and
bam! we could use their circuit. The fact that this happens just blows my mind.
Lets look at an example of an imaginary protocol for our 8x8 LED display. Imagine we have 8 data wires (D0-D7), and 2 control wires: CLK (clock) and CMD (command mode). When CMD is HIGH, the data is interpreted as a command, when LOW it's interpreted as regular data.
Command format (CMD = HIGH):
0000 0001: Clear display
0000 0010: Home cursor
0000 0100: Move cursor right
0000 1000: Move cursor left
Data format (CMD = LOW):
Just send ASCII character codes. For example:
0110 1000: 'h'
0110 0101: 'e'
0110 1100: 'l'
0110 1100: 'l'
0110 1111: 'o'
To send "hello" and clear screen:
1. Set CMD HIGH, send 0000 0001 (clear)
2. Pulse CLK
3. Set CMD LOW
4. Send 'h' (0110 1000)
5. Pulse CLK
6. Send 'e' (0110 0101)
7. Pulse CLK
...and so on.
This is a very simple protocol and it will not handle the "real world" properly, for example what if there is noise in the wires? or what how do we know if the data is done sending? how do we know if the display is done showing the data?
Real World protocols like i2c, SPI, UART, USB, PCIe, and etc handle tremendous amount of edge cases and have various tradeoffs with speed and complexity. The important thing is that a protocol is is just an agreed upon sequence of actions or signals.
So how would our .
pop from stack and display on screen word work? We will use
a virtual computer. We built our SUBLEQ computer with flipflops and wires, we
could simulate it - create a program that pretends to be those chips.
QEMU is a machine emulator - it pretends to be a computer. When you run QEMU, it creates a virtual CPU, virtual RAM, virtual devices, all living inside your real computer's memory. Just as we can write a program that simulates our SUBLEQ computer's ALU and RAM, QEMU simulates entire processors like RISC-V or x86.
When the virtual CPU executes an instruction like addi x5, x0, 42
, QEMU
calculates what would happen if a real CPU executed that instruction - which
registers would change, how the flags would be set, what memory would be
accessed. The virtual CPU doesn't know it's not real. Our programs running
inside QEMU don't know they're running in a simulation.
The magic of QEMU is that it can also simulate devices like screens, keyboards,
hard drives and so on..entire computers. We can crash it as many times we want,
or corrupt it, and most importantly, we can pause it and debug it, step through
each instruction and see what is the state of the registers and the memory. When
you are programming you must execute the instructions in your head, and think
what would the computer would do, when you make a mistake what you have in your
head is not what the computer state is, and you must look at the computer's
memory and try to understand where things went wrong. Why it is the way it is?
Being able to debug your program step by step is very very powerful. Of course
you can do that with any program on your computer, there is no need to use QEMU
for that, you can just break into a program with gdb
(a debugger program) and
execute it instruction by instruction. Our goal is however to make an operating
system for an actual physical computer (either esp32c3 or Pico 2350), and
starting with a virtual computer will make the development much much.. much
easier.
There are few things you need to install, QEMU, RISC-V GNU Compiler Toolchain, GDB: The GNU Project Debugger, an editor like Visual Studio Code, or Emacs, and GNU Make. Depending on your operating system they will require different steps, I suggest you ask ChatGPT or Sonnet how to do it.
- QEMU https://www.qemu.org/
- RISC-V GNU Compiler Toolchain https://github.com/riscv-collab/riscv-gnu-toolchain
- GNU Make https://www.gnu.org/software/make/
- GDB https://www.sourceware.org/gdb/
- Emacs https://www.gnu.org/software/emacs/
- Visual Studio Code https://code.visualstudio.com/
Make sure you enable support for RISC-V 32bit.
Create a directory where we will put the files, I will call mine part1
, and we
will start by making a simple RISCV assembly program that will print 'hello
world' on the virtual screen of QEMU, just so that we make sure all the tooling
is working. I use linux and macos, but if you are using windows, you can ask
Sonnet to translate the commands and make it work.
Create a file called boot.s
and type this code in it, as you are typing it,
try to think about it, and its totally OK to be confused. This was quite common
in the 80s btw, havint pages and pages of code in a magazine and you have to
type it in. I was too young at the time to experience it, I got my first
computer in 1997 or so, but I just love the paper code medium.
You can also take a picture with your phone and copy the text from there.
.section .text
.globl _start
_start:
li a0, 'h'
call putc
li a0, 101
call putc
li a0, 'l'
call putc
li a0, 108
call putc
li a0, 'o'
call putc
li a0, 32
call putc
li a0, 'w'
call putc
li a0, 111
call putc
li a0, 'r'
call putc
li a0, 108
call putc
li a0, 'd'
call putc
li a0, 10
call putc
wait_for_q:
call getch
li t1, 'q'
beq t1, a0, exit_qemu
call putc
j wait_for_q
unreachable:
j unreachable
####
# Subroutine: getch
# Reads a character from UART
# Returns: a0 - the character read
getch:
li t0, 0x10000000 # t0 = 0x10000000, this is UART's base address Load UART base address into t0
1:
lbu t1, 5(t0) # t1 = mem[t0 + 5], base + 5 is UART status register
andi t1, t1, 0x01 # t1 = t1 & 0x01, use only the last bit
beqz t1, 1b # If no data ready, keep polling until the bit is 1
lbu a0, 0(t0) # a0 = mem[t0], base + 0 is the dasta register
ret
####
# Subroutine: putc
# Writes a character to UART
# Parameters: a0 - the character to write
putc:
li t0, 0x10000000 # t0 = 0x10000000, again t0 = UART base address
1:
lbu t1, 5(t0) # t1 = mem[t0 + 5], load 1 byte from the UART status register
andi t1, t1, 0x20 # t1 = t1 & 0x20, 0x20 is 00100000, check if this bit is 1
beqz t1, 1b # if not, we are not ready to transmit, try again
sb a0, 0(t0) # mem[t0] = a0, store a0 character to UART data register
ret
exit_qemu:
li t0, 0x100000 # t0 = 0x100000, QEMU exit device address
li t1, 0x5555 # t1 = 0x5555, success exit code
sw t1, 0(t0) # mem[t0] = t1, store exit code to QEMU exit device
j . # infinite loop until QEMU exits
.end
Now we need another file linker.ld
:
OUTPUT_ARCH( "riscv" )
ENTRY( _start )
MEMORY
{
RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 128M
}
SECTIONS
{
.text :
{
*(.text.init)
*(.text)
} > RAM
.rodata :
{
*(.rodata)
} > RAM
.data :
{
*(.data)
} > RAM
.bss :
{
*(.bss)
. = ALIGN(8);
} > RAM
_bss_end = .;
_stack_top = ORIGIN(RAM) + LENGTH(RAM);
_ram_end = ORIGIN(RAM) + LENGTH(RAM);
_end = .;
}
If you execute the following commands now:
riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments \
-m elf32lriscv boot.o -o boot.elf
This will create a file boot.elf
which is our machine code executable program, we could ask QEMU to run it:
qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf
And you should see 'hello world' printed, if you press any character you will
see it echoed at the terminal, if you press 'q' then qemu will
exit. riscv64-unknown-elf-as
is the assembler, it takes the source code and
creates machine code object file, it contains just relative addresses and might
even reference unresolved symbols (e.g. we might want to call a subroutine from
another file, which is not even compiled yet, and even if it is we dont know
where it will sit in RAM, so how can we jump to it?). The linker however has all
the information, in our linker file we say RAM starts at address 0x80000000,
then various sections are in this order, first .text section then .rodata then
.data then .bss, and then we have few symbols where does bss end, where does the
ram end, where would we like to put the top of our stack, in this case the stack
"grows" downwards, the program is in the start of the RAM, and the stack will
start from the end of the ram and grow down.
The sections:
- .text: the program itself, the machine code
- .rodata: read only data, like constants we would like to have
- .data: initialized variables, can be modified during execution,
- .bss: uninitialized variables, but this does not actually take size in the executable, we can just say we want 10kb array of bytes, and the executable wont increase with 10kb, as opposed of the other sections.
Don't worry about those for now, we will get back to them later. In the
assembler we use .section and .end to define a section and we can specify where
would it live in RAM. ENTRY( _start )
this specifies where should the computer
jump to when the program is loaded.
When the linker creates the .elf file, inside of it it will put all this information, plus the machine code itself. ELF means Executable and Linkablke Format, it is very common format for executables. We have not spoken about files yet, but a file is just an array of bytes on non volatile storage, it has a name or some way for you to find it. How you would interptet the bytes inside of it is up to you. Whatever program reads .elf file it will have expectation that the ELF format is followed.
You can examine .elf files with the readelf
program. The options -h
is to
view the header, a header is a piece of structured information in the beginning
of a byte sequence, -S
is to view the section headers, -l
to view the
program headers.
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x80000000
Start of program headers: 52 (bytes into file)
Start of section headers: 5476 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 2
Size of section headers: 40 (bytes)
Number of section headers: 12
Section header string table index: 11
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 80000000 001000 0000c0 00 AX 0 0 4
[ 2] .bss NOBITS 800000c0 0010c0 000000 00 WA 0 0 1
[ 3] .riscv.attributes RISCV_ATTRIBUTE 00000000 0010c0 00004c 00 0 0 1
[ 4] .debug_line PROGBITS 00000000 00110c 000156 00 0 0 1
[ 5] .debug_info PROGBITS 00000000 001262 000026 00 0 0 1
[ 6] .debug_abbrev PROGBITS 00000000 001288 000014 00 0 0 1
[ 7] .debug_aranges PROGBITS 00000000 0012a0 000020 00 0 0 8
[ 8] .debug_str PROGBITS 00000000 0012c0 000043 01 MS 0 0 1
[ 9] .symtab SYMTAB 00000000 001304 000150 10 10 16 4
[10] .strtab STRTAB 00000000 001454 000095 00 0 0 1
[11] .shstrtab STRTAB 00000000 0014e9 000078 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), p (processor specific)
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
RISCV_ATTRIBUT 0x0010c0 0x00000000 0x00000000 0x0004c 0x00000 R 0x1
LOAD 0x001000 0x80000000 0x80000000 0x000c0 0x000c0 RWE 0x1000
Section to Segment mapping:
Segment Sections...
00 .riscv.attributes
01 .text
You see the ELF file starts with 7f 45 4c 46
, in decimal that is 127 69 76 70
which is 7f then the ascii for E, L and F. Every ELF file starts with this 4
bytes, but not every file that starts with them is an ELF file.
The linker will create the right ELF file from our machine code. I wont go deeper into it, but there are amazing guides online to explain the ELF format, and since it is very well defined and documented format the language models know a lot about it, so asking if you are confused just ask ChatGPT or Sonnet.
...
li a0, 'h'
call putc
li a0, 101
call putc
...
Back to our assembly, you see I used letters and numbers to put a value in a0
, they are of course the same thing. We have decided which letter is which number and defined it in a standard called ASCII, which stands for 'American Standard Code for Information Interchange'. It was defined in 1972. The letter 'A' is 65, 'B' is 66.. and so on. This is the whole table:
.-------------------------- ASCII Table ------------------------------------.
| |
| Dec Hex Char Dec Hex Char Dec Hex Char Dec Hex Char |
| ---------------- ---------------- ---------------- -------------- |
| 0 00 NUL 32 20 space 64 40 @ 96 60 ` |
| 1 01 SOH 33 21 ! 65 41 A 97 61 a |
| 2 02 STX 34 22 " 66 42 B 98 62 b |
| 3 03 ETX 35 23 # 67 43 C 99 63 c |
| 4 04 EOT 36 24 $ 68 44 D 100 64 d |
| 5 05 ENQ 37 25 % 69 45 E 101 65 e |
| 6 06 ACK 38 26 & 70 46 F 102 66 f |
| 7 07 BEL 39 27 ' 71 47 G 103 67 g |
| 8 08 BS 40 28 ( 72 48 H 104 68 h |
| 9 09 TAB 41 29 ) 73 49 I 105 69 i |
| 10 0A LF 42 2A * 74 4A J 106 6A j |
| 11 0B VT 43 2B + 75 4B K 107 6B k |
| 12 0C FF 44 2C , 76 4C L 108 6C l |
| 13 0D CR 45 2D - 77 4D M 109 6D m |
| 14 0E SO 46 2E . 78 4E N 110 6E n |
| 15 0F SI 47 2F / 79 4F O 111 6F o |
| 16 10 DLE 48 30 0 80 50 P 112 70 p |
| 17 11 DC1 49 31 1 81 51 Q 113 71 q |
| 18 12 DC2 50 32 2 82 52 R 114 72 r |
| 19 13 DC3 51 33 3 83 53 S 115 73 s |
| 20 14 DC4 52 34 4 84 54 T 116 74 t |
| 21 15 NAK 53 35 5 85 55 U 117 75 u |
| 22 16 SYN 54 36 6 86 56 V 118 76 v |
| 23 17 ETB 55 37 7 87 57 W 119 77 w |
| 24 18 CAN 56 38 8 88 58 X 120 78 x |
| 25 19 EM 57 39 9 89 59 Y 121 79 y |
| 26 1A SUB 58 3A : 90 5A Z 122 7A z |
| 27 1B ESC 59 3B ; 91 5B [ 123 7B { |
| 28 1C FS 60 3C < 92 5C \ 124 7C | |
| 29 1D GS 61 3D = 93 5D ] 125 7D } |
| 30 1E RS 62 3E > 94 5E ^ 126 7E ~ |
| 31 1F US 63 3F ? 95 5F _ 127 7F DEL |
| |
|------------------------ Control Characters -------------------------------|
| |
| NUL Null SO Shift Out FS File Separator |
| SOH Start of Header SI Shift In GS Group Separator |
| STX Start of Text DLE Data Link Escape RS Record Separator |
| ETX End of Text DC1 Device Control 1 US Unit Separator |
| EOT End of Trans. DC2 Device Control 2 SP Space |
| ENQ Enquiry DC3 Device Control 3 DEL Delete |
| ACK Acknowledge DC4 Device Control 4 |
| BEL Bell NAK Negative Ack. |
| BS Backspace SYN Synchronous Idle |
| TAB Horizontal Tab ETB End of Trans. Blk |
| LF Line Feed CAN Cancel |
| VT Vertical Tab EM End of Medium |
| FF Form Feed SUB Substitute |
| CR Carriage Return ESC Escape |
| |
'---------------------------------------------------------------------------'
When we use 'x'
with single quotes it literally means take the ascii code of
that character and substitute it, in the case for x
its the number 120, or
0x78 in hex. li a0, 'h'
is the same as li a0, 104
.
You will also notice I am using the mnemonic name a0 name instead of x10, it is just easier to read the code that way, a* is for arguments and return values, but its just because we use it that way, we could pass a parameter using t1(x6) or whatever, anything but zero(x0).
I used call
instead of jal
, call is a pseudo instruction, since jal
has
relative offset but the offset can only be 20 bits, 1 of which is sign bit, so
we cant jump more than 524287 bytes away, we need to use jalr
, and we need to
use auipc
to put the upper 20 bits in the register where we store the offset
to. basically call
is rewritten to :
auipc x6, offset[31:12] # Upper 20 bits of offset, PC-relative
jalr x1, offset[11:0](x6) # Lower 12 bits of offset
It could also be rewritten to:
auipc x1, offset[31:12] # Upper 20 bits of offset, PC-relative
jalr x1, offset[11:0](x1) # Lower 12 bits of offset
In some scenarios gcc uses t1
.
li a0, 'h'
call putc
This is clear, we put 104 into a0 and then jump to putc while putting pc+4 into ra(x1).
putc:
li t0, 0x10000000 # t0 = 0x10000000, again t0 = UART base address
1:
lbu t1, 5(t0) # t1 = mem[t0 + 5], load 1 byte from the UART status register
andi t1, t1, 0x20 # t1 = t1 & 0x20, 0x20 is 00100000, check if this bit is 1
beqz t1, 1b # if not, we are not ready to transmit, try again
sb a0, 0(t0) # mem[t0] = a0, store a0 character to UART data register
ret
lbu
means 'load byte unsigned' which just means it will load 1 byte from a specific memory address, in our case address 0x10000005, then the next instruction is andi t1, t1, 0x20
, which will do binary and
operation. You already know the and
truth table, you apply the AND logic bit by bit and write the result, for example:
01010101
AND 00001111
----------
00000101
Only if both bits are 1 then the output bit is 1. QEMU's UART status register
will put 1 on bit 5, so if we and
with 00100000
then the result will not be
zero only if the 5th bit is one, otherwise we will get zero in the result. then
we have beqz t1, 1b
means if t1 is zero jump to the label 1
backwards, it is
just a handy way to use temporary labels without us naming them, and this will
just read again the status register. This pattern is very common, it is called a
busy wait
, you keep checking something over and over. It is also called
'polling', but usually when people say poll they mean 'check every second' or
'every millisecond' or some time interval, in a busy wait
we use 100% of the
cpu resources until the status changes.
If the 5th bit is 1 and t1 is not zero, it means that the UART is ready for us
to write to it, you can think from the UART's point of view, it has some buffer,
and when the buffer is full, because it might be printing slower than your
writing speed, you will have to wait. Then we just write a0, which is the
character we passed as parameter, into the UART data register, which for QEMU is
at address 0x10000000
. Then we do ret
which is just jalr zero, 0(ra)
, it
will jump to the value of ra
which is pc+4
of wherever we called the call
pseudo instruction.
This is how we print a character using QEMU's UART.
We keep printing 'e', 'l', 'l'. .. and so on '10' is ASCII for new line, and then we have a getch loop.
wait_for_q:
call getch
li t1, 'q'
beq t1, a0, exit_qemu
call putc
j wait_for_q
again we call getch
which is a subroutine like putc, putc was writing the a0
parameter, getch returns che character that the user typed into a0, then we
compare it with the letter 'q' and if its equal we jump to exit_qemu, if not we
will call putc which will read from a0 and send it to the UART, so the character
you typed will appear on the terminal, and then we jump again to wait for 'q' to
appear.
getch:
li t0, 0x10000000 # t0 = 0x10000000, this is UART's base address
1:
lbu t1, 5(t0) # t1 = mem[t0 + 5], base + 5 is UART status register
andi t1, t1, 0x01 # t1 = t1 & 0x01, use only the last bit
beqz t1, 1b # If no data ready, keep polling until the bit is 1
lbu a0, 0(t0) # a0 = mem[t0], base + 0 is the data register
ret
getch is very similar to putc, it checks a status register, but it checks for the last bit instead of the 5th bit like putc, then keeps busy looping until this bit is 1, which QEMU's UART system will set once there is something in the input buffer, which happens when the user types a character on the keyboard. If the bit is set, we read from the data register and put the value of a0 and we return back.
qemu_exit just writes a specific value to specific address specified by QEMU so that we can tell it to shut down the virtual computer and exit.
exit_qemu:
li t0, 0x100000 # t0 = 0x100000, QEMU exit device address
li t1, 0x5555 # t1 = 0x5555, success exit code
sw t1, 0(t0) # mem[t0] = t1, store exit code to QEMU exit device
j . # infinite loop until QEMU exits
When we call getch and putc I call them subroutine, but they are actually
functions, subroutines dont take or return anything, they are just a sequence of
instructions, functions take inputs and produce outputs. From now on I will use
the term function, and this is similar to functions you learn in math, for
example y = 3x + 2
is a function that has one parameter and returns one
value. The return value of the function depends on the parameter. You can also
see it as a map from input to output.
input | output
-------------
0 | 2
1 | 5
2 | 8
3 | 11
...
Calling a sequence of instructions that take some parameter and return some
output a function is as good as calling a 32 bit value an integer. It is an
integer, but it can not fit the whole number line, the math variables have no
limit x
can be infinity, can be zero, can be infinitely precise fraction, can
even be infinite irrational number like π. The funciton y = 3x + 2
works fine,
in our 32 bit computer however our output will only approximate the abstract
function. There are a lot of symbols in programming that are kind of like math but not quite, the x = y
equal
symbol in math means equality
whatever is on the left is the same as whatever
is on the right, in programming languages x = y
typically means store copy whatever the value of y
is in memory into wherever x
is in memory.
y = 3x + 2
y - 2 = 3x
(y - 2)/3 = x
x = (y - 2)/3
Those are all equivalent in math, but make no sense in almost all programming languages.
That does not stop us from saying x = 3
or li a5, 7
being a5 = 7
, but it
is more of a 'set' operation, of course after the operation is executed a5 will
be 7, but 7 is not a5, as in 7 = a5
doesnt even make sense when you think of
the wires.
It is similar with functions, even with lambda calculus and functional languages, things are not "quite" alright, and thats totally OK, you have to understand abstract operations, but you also have to understand the limit of the machine, and then you can get the best of both.
OK obviously nobody writes characters instruction by instruction, we can just
put the string 'hello world' somewhere in memory and make a function puts
that
will take its address as parameter and print each character in a loop.
This is again the whole program, first type it in, and then we will discuss it, just replace boot.s with this code:
.section .text
.globl _start
_start:
la sp, _stack_top
la a0, message # Load address of message into a0
call puts # Call our new puts function
la a0, messageb
call puts
la a0, messaged
call puts
la a0, messageh
call puts
wait_for_q:
call getch
li t1, 'q'
beq t1, a0, exit_qemu
call putc
j wait_for_q
unreachable:
j unreachable
####
# Subroutine: puts
# Prints a null-terminated string
# Parameters: a0 - address of string to print
puts:
addi sp, sp, -8 # Allocate stack space
sw ra, 0(sp) # Save return address
sw s0, 4(sp) # Save s0 (we'll use it as our string pointer)
mv s0, a0 # Copy string address to s0
puts_loop:
lbu a0, 0(s0) # Load byte from string
beqz a0, puts_done # If byte is 0, we're done
call putc # Print the character
addi s0, s0, 1 # Move to next character
j puts_loop # Repeat
puts_done:
lw ra, 0(sp) # Restore return address
lw s0, 4(sp) # Restore s0
addi sp, sp, 8 # Deallocate stack space
ret
getch:
li t0, 0x10000000
1:
lbu t1, 5(t0)
andi t1, t1, 0x01
beqz t1, 1b
lbu a0, 0(t0)
ret
putc:
li t0, 0x10000000
1:
lbu t1, 5(t0)
andi t1, t1, 0x20
beqz t1, 1b
sb a0, 0(t0)
ret
exit_qemu:
li t0, 0x100000
li t1, 0x5555
sw t1, 0(t0)
j .
messaged:
.byte 104 # h
.byte 101 # e
.byte 108 # l
.byte 108 # l
.byte 111 # o
.byte 32 # space
.byte 119 # w
.byte 111 # o
.byte 114 # r
.byte 108 # l
.byte 100 # d
.byte 10 # newline
.byte 0 # null terminator
messageb:
.byte 0b01101000 # h (104 or 0x68)
.byte 0b01100101 # e (101 or 0x65)
.byte 0b01101100 # l (108 or 0x6C)
.byte 0b01101100 # l (108 or 0x6C)
.byte 0b01101111 # o (111 or 0x6F)
.byte 0b00100000 # space (32 or 0x20)
.byte 0b01110111 # w (119 or 0x77)
.byte 0b01101111 # o (111 or 0x6F)
.byte 0b01110010 # r (114 or 0x72)
.byte 0b01101100 # l (108 or 0x6C)
.byte 0b01100100 # d (100 or 0x64)
.byte 0b00001010 # newline (10 or 0x0A)
.byte 0 # null terminator
messageh:
.byte 0x68 # h
.byte 0x65 # e
.byte 0x6c # l
.byte 0x6c # l
.byte 0x6f # o
.byte 0x20 # space
.byte 0x77 # w
.byte 0x6f # o
.byte 0x72 # r
.byte 0x6c # l
.byte 0x64 # d
.byte 0x0a # newline
.byte 0x00 # null terminator
message:
.asciz "hello world\n" # .asciz adds null terminator automatically
.end
You see how in the .data section I predefined some bytes, when the assembler
makes the machine code it will put those specific bytes in the start of the
.data segment, which is just after .rodata which is after the .text
segment. Immediately after you will see bytes 104, 101, 108, 108.. la a0, messageb
la is a pseudo instruction means load address, it is similar to li,
but might use auipc
which is Add Upper Immediate to PC, auipc rd, immediate
means rd = pc + immediate << 12
, immediate shifted left 12 bits, so we can use
it for relative offsets and then we could add to the lower 12 bits with
addi. Anyway, la a0, messageb
will just put in a0 the address of wherever the
label messageb is in memory.
I used messageb messageh messaged and message, all are exactly the same in
memory. A sequence of characters is a string
, a null terminated string is a
sequence of characters that ends with 0. This means we dont need to know the
length of the string we just print until we reach zero byte. This simple
convenience, you will later find out, is the root cause of billions of dollars
lost due to bugs, memory corruption, security exploits, and all kinds of pain
and suffering.
There is one other big change, in start we do la sp, _stack_top
and you can
see in linker.ld we set _stack_top
to be at the end of RAM, so now the
register sp(x2) will be set to the very end of our RAM address space.
_start:
la sp, _stack_top
la a0, message
call puts
...
puts:
addi sp, sp, -8 # Allocate stack space
sw ra, 0(sp) # Save return address
sw s0, 4(sp) # Save s0 (we'll use it as our string pointer)
mv s0, a0 # Copy string address to s0
puts_loop:
lbu a0, 0(s0) # Load byte from string
beqz a0, puts_done # If byte is 0, we're done
call putc # Print the character
addi s0, s0, 1 # Move to next character
j puts_loop # Repeat
puts_done:
lw ra, 0(sp) # Restore return address
lw s0, 4(sp) # Restore s0
addi sp, sp, 8 # Deallocate stack space
ret
...
The puts function takes one argument in a0, which is a pointer to the null
terminated string we will print. We do call puts
which will set ra
to pc+4, but inside of puts
we need to call putc
now the second call will also set ra
to pc+4, then if we do ret
from puts, which again is just jal zero, 0(ra)
, it will actually jump to the wrong place.
_start:
la sp, _stack_top
la a0, message
jal ra, puts # call puts
... <------------------------------------------+
|
puts: |
addi sp, sp, -4 |
sw s0, 0(sp) |
mv s0, a0 we want to jump
back there
puts_loop: |
lbu a0, 0(s0) |
beqz a0, puts_done |
jal ra, putc # call putc |
addi s0, s0, 1 <-------+ |
j puts_loop | it will actually |
| jump here |
puts_done: | as ra was overwritten |
lw s0, 0(sp) | |
addi sp, sp, 4 | |
jalr zero, 0(ra) ------+ # ret /---------------+
...
So we need to store ra
somewhere and take it back, before we return. For that
we will use the system's stack, we use sp
(x2) to keep track of where the top
of the stack is. When we call a function that is going to call another function,
it must store the return address on the stack, and then take it out. The stack
is also used for all kinds of local variables, we can allocate as much space as
we need by moving sp
down, and then we move it back up. There is a convention
that the s*
registers are also saved by the callee if they are going to use
them, in our case we use s0 to keep track of the index that we are printing at
the moment, if we call a function that also uses s*
they will also store it on
the stack and then make sure its restored, the same way we do.
This is what this code does, it allocates 8 bytes of stack space
RAM BASE: (0x80000000)
_stack_top: (0x80000000 + 128M)
address | value
sp -> _stack_top |
|
|
|
|
|
|
data & program | xx
data & program | xx
data & program | xx
data & program | xx
RAM BASE |
after executing:
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
address | value
_stack_top |
4(sp) | s0
sp -> 0(sp) | ra
|
|
...
|
|
data & program | xx
data & program | xx
data & program | xx
data & program | xx
RAM BASE |
sw ra, 0(sp)
is memory[sp + 0] = ra
and sw s0, 4(sp)
is memory[sp + 4] = s0
This is called the function prologue, the stack preparation, storing the s* registers, preparing local variables and so on. And restoring the stack is called function epilogue.
puts:
# prologue
addi sp, sp, -8 # Allocate stack space
sw ra, 0(sp) # Save return address
sw s0, 4(sp) # Save s0 (we'll use it as our string pointer)
...
# epilogue
lw ra, 0(sp) # Restore return address
lw s0, 4(sp) # Restore s0
addi sp, sp, 8 # Deallocate stack space
ret
You see after we return from puts, we are fetching the value for ra
from where
we stored it at 0(sp)
and the value for s0 from 4(sp)
. This way when we do
ret
it will jump back to where it is supposed to.
Those two things combined, the fact that we store the return address on the stack lead to a whole generation of exploits, if you just find a bug that allows you to write on the stack, you can make the program jump wherever you want, you can overwrite the program itself even, if true can become if false, as the program is just data. There are all kinds of protections in place to prevent this from happening, but, it seems like people find ways around them.
OK now we are ready to discuss the actual meat of the puts function.
...
mv s0, a0 # Copy string address to s0
puts_loop:
lbu a0, 0(s0) # Load byte from string
beqz a0, puts_done # If byte is 0, we're done
call putc # Print the character
addi s0, s0, 1 # Move to next character
j puts_loop # Repeat
...
First we store a0 into s0 (s0 = a0), thats what mv s0, a0
does, its the same as addi a0, s0, 0
, so we start from position 0, we load the value at s0 + 0, if its
zero then we have reached the null termination and we jump to done, if not we
call putc, as we already have the proper character in a0, and putc uses a0 as
its argument, so that works out nicely, then we want to move to the next
character, so we increment s0 += 1, and we jump back to the loop, which again
loads from s0+0 but now this is pointing to the next character, and so on until
we get to the 0 byte.
PHEW! now we can print more than one character, and also know how to call functions that call functions, we know about the system stack and about prologues and epilogues.
We are ready to write a forth interpreter program that parses and executes our tictactoe program, but of course we will start small, with the very core of Forth.
.section .text
.globl _start
_start:
la sp, _stack_top
la s1, FORTH_STACK_END # SP
la s0, bytecode # IP
# start the program
j NEXT
# the program should terminate by itself,
# in case it doesnt, we will print Z as a
# debug message and exit
li a0, 'Z'
call putc
j qemu_exit
##########################
NEXT:
lw t0, 0(s0) # IP
addi s0, s0, 4 # IP
jr t0
PLUS:
# POP t0
lw t0, 0(s1) # SP
addi s1, s1, 4 # SP
# POP t1
lw t1, 0(s1) # SP
addi s1, s1, 4 # SP
add t0, t0, t1
# PUSH t0
addi s1, s1, -4 # SP
sw t0, 0(s1)
j NEXT
CR:
li a0, '\n'
call putc
j NEXT
LITERAL:
lw t0, 0(s0) # IP
addi s0, s0, 4 # IP
# PUSH t0
addi s1, s1, -4 # SP
sw t0, 0(s1) # SP
j NEXT
EMIT:
# POP a0
lw a0, 0(s1) # SP
addi s1, s1, 4 # SP
add a0, a0, '0'
call putc
j NEXT
BYE:
j qemu_exit
##########################
putc:
li t0, 0x10000000
1:
lbu t1, 5(t0)
andi t1, t1, 0x20
beqz t1, 1b
sb a0, 0(t0)
ret
getch:
li t0, 0x10000000
1:
lbu t1, 5(t0)
andi t1, t1, 0x01
beqz t1, 1b
lbu a0, 0(t0)
ret
qemu_exit:
li t0, 0x100000
li t1, 0x5555
sw t1, 0(t0)
j .
bytecode:
# our program writting in our new language
# "2 3 + 4 + . cr bye"
.word LITERAL
.word 2
.word LITERAL
.word 3
.word PLUS
.word LITERAL
.word 4
.word PLUS
.word EMIT
.word CR
.word BYE
# allocate 1024 zero bytes for the FORTH Stack
.space 1024
FORTH_STACK_END:
.end
Save this in place of boot.s assemble it and run it
riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments -m elf32lriscv boot.o -o boot.elf
qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf
You should see then number 9 printed and then qemu will exit. First we will make
a quality of life improvement, it must be annoying to write those 3 commands all
the time, so we will create a Makefile which will just execute them when we type
the command make
, Makefiles are just a recipe of steps, it can be very complicated, and honestly I hate it, as I think it is very complicated, but we will use just a small part of the Make language to describe our recipe. Create a file in the same directory as boot.s and call it Makefile
, inside of it write those instructions:
.RECIPEPREFIX = >
all:
> riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
> riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments \
-m elf32lriscv boot.o -o boot.elf
run:
> qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf
It can also be written with <tab>
as prefix, the tab character is usually displayed as 8 spaces, depending it has ASCII code of 9, but some editors display it as 2 or as 4, depending on their configuration, and of course in some editors when you press the tab
button it will insert spaces instead the single ascii character 9, but when the make
program is processing the Makefile it expects ASCII 9 instead of a 32,32,32,32,32,32,32,32 8 spaces. In newer GNU Make versions we can change the prefix with .RECIPEPREFIX = >.
all:
riscv64-unknown-elf-as -g -march=rv32g -mabi=ilp32 boot.s -o boot.o
riscv64-unknown-elf-ld -T linker.ld --no-warn-rwx-segments \
-m elf32lriscv boot.o -o boot.elf
run:
qemu-system-riscv32 -nographic -machine virt -bios none -kernel boot.elf
If you have no issues with <tab>
use it, its much easier to read.
Now if you type make
in the directory it will run the assembler and linker and
get boot.elf, if you type make run
it will run qemu. We will later build a
more complicated Makefile that will allow us to work with more assembly file and
help us run the debugger.
Now, lets discuss our program.
# "2 3 + 4 + . cr bye"
.word LITERAL
.word 2
.word LITERAL
.word 3
.word PLUS
.word LITERAL
.word 4
.word PLUS
.word EMIT
.word CR
.word BYE
.word means 4 bytes, there is also .byte we use those directives to put specific data in the binary. This .word LITERAL .word 2 .. sequence is the same as writing the bytes 0x80000058, 0x00000002, 0x80000058, 0x00000003, 0x8000002c, 0x80000058, 0x00000004, 0x8000002c, 0x8000006c, 0x8000004c, 0x80000080, as you will see in a bit.
Once the whole binary is compiled into an .elf file, you can use objdump to see
its dissassembled machine code, dissassembly is the process of taking bytes and
converting them to mnemonic insturctions, for example 00008067
is jalr zero, 0(ra)
. Because in the linker we say that our program will be loaded at address
0x80000000, which is where QEMU's RAM starts, in real hardware you know by now
that those addressess are just enabled or disabled wires, in the .elf file the
address 0x80000000 is specified as Entry Point Address, it is also specified
that the program should be loaded at this address. When it makes the machine
code it knows very well where every instruction will be. So at address
0x80000000 we have auipc sp,0x8000
and then immediately after we have
0x80000004 mv sp, sp
which is the same as addi sp, sp, 0
, those two
instructions are the result of the expanded pseudo instruction la sp, _stack_top
, auipc means Add Upper Immediate to PC, our pc
is at 0x80000000,
we will add 0x8000 to the upper 20 bits, which means 0x8000 << 12, or 0x8000000,
and this of course is 134217728 in decimal, or 128MB and in our linker we have
defined that _stack_top is _stack_top = ORIGIN(RAM) + LENGTH(RAM), and sp(x2)
will be set at 0x80000000+0x8000000 0x88000000.
The next 2 instructions will come from la s1, FORTH_STACK_END
, now this is
more interesting, you can see the label FORTH_STACK_END in the end of our .data
section, before it we have said .space 1024, the assembler will create 1024
bytes empty spaces and then know where FORTH_STACK_END exactly is going to
be. auipc s1, 0x0
will put pc
in s1
, then we have s1 = s1 + 1268, you have
to be careful when reading objdump, if some numbers are decimal some are
hexadecimal, the addressess on the left and the machinecode are hexadecimal, and
they dont start with 0x, but the arguments to instructions are decimal, so s1
will be 0x800004fc, and then we have la s0, bytecode
, which will put
0x800000d0 in s0
.
And then the magic happens, we have j 80000020, which is the code for our NEXT function.
$ riscv64-unknown-elf-objdump -D boot.elf
boot.elf: file format elf32-littleriscv
Disassembly of section .text:
80000000 <_start>:
80000000: 08000117 auipc sp,0x8000
80000004: 00010113 mv sp,sp
80000008: 00000497 auipc s1,0x0
8000000c: 4f448493 addi s1,s1,1268 # 800004fc <FORTH_STACK_END>
80000010: 00000417 auipc s0,0x0
80000014: 0c040413 addi s0,s0,192 # 800000d0 <bytecode>
80000018: 0080006f j 80000020 <NEXT>
8000001c: 0980006f j 800000b4 <qemu_exit>
80000020 <NEXT>:
80000020: 00042283 lw t0,0(s0)
80000024: 00440413 addi s0,s0,4
80000028: 00028067 jr t0
8000002c <PLUS>:
8000002c: 0004a283 lw t0,0(s1)
80000030: 00448493 addi s1,s1,4
80000034: 0004a303 lw t1,0(s1)
80000038: 00448493 addi s1,s1,4
8000003c: 006282b3 add t0,t0,t1
80000040: ffc48493 addi s1,s1,-4
80000044: 0054a023 sw t0,0(s1)
80000048: fd9ff06f j 80000020 <NEXT>
8000004c <CR>:
8000004c: 00a00513 li a0,10
80000050: 034000ef jal 80000084 <putc>
80000054: fcdff06f j 80000020 <NEXT>
80000058 <LITERAL>:
80000058: 00042283 lw t0,0(s0)
8000005c: 00440413 addi s0,s0,4
80000060: ffc48493 addi s1,s1,-4
80000064: 0054a023 sw t0,0(s1)
80000068: fb9ff06f j 80000020 <NEXT>
8000006c <EMIT>:
8000006c: 0004a503 lw a0,0(s1)
80000070: 00448493 addi s1,s1,4
80000074: 03050513 addi a0,a0,48
80000078: 00c000ef jal 80000084 <putc>
8000007c: fa5ff06f j 80000020 <NEXT>
80000080 <BYE>:
80000080: 0340006f j 800000b4 <qemu_exit>
80000084 <putc>:
80000084: 100002b7 lui t0,0x10000
80000088: 0052c303 lbu t1,5(t0) # 10000005 <_start-0x6ffffffb>
8000008c: 02037313 andi t1,t1,32
80000090: fe030ce3 beqz t1,80000088 <putc+0x4>
80000094: 00a28023 sb a0,0(t0)
80000098: 00008067 ret
8000009c <getch>:
8000009c: 100002b7 lui t0,0x10000
800000a0: 0052c303 lbu t1,5(t0) # 10000005 <_start-0x6ffffffb>
800000a4: 00137313 andi t1,t1,1
800000a8: fe030ce3 beqz t1,800000a0 <getch+0x4>
800000ac: 0002c503 lbu a0,0(t0)
800000b0: 00008067 ret
800000b4 <qemu_exit>:
800000b4: 001002b7 lui t0,0x100
800000b8: 00005337 lui t1,0x5
800000bc: 55530313 addi t1,t1,1365 # 5555 <_start-0x7fffaaab>
800000c0: 0062a023 sw t1,0(t0) # 100000 <_start-0x7ff00000>
800000c4: 0000006f j 800000c4 <qemu_exit+0x10>
Disassembly of section .data:
800000d0 <bytecode>:
800000d0: 0058 .insn 2, 0x0058
800000d2: 8000 .insn 2, 0x8000
800000d4: 0002 .insn 2, 0x0002
800000d6: 0000 .insn 2, 0x
800000d8: 0058 .insn 2, 0x0058
800000da: 8000 .insn 2, 0x8000
800000dc: 00000003 lb zero,0(zero) # 0 <_start-0x80000000>
800000e0: 002c .insn 2, 0x002c
800000e2: 8000 .insn 2, 0x8000
800000e4: 0058 .insn 2, 0x0058
800000e6: 8000 .insn 2, 0x8000
800000e8: 0004 .insn 2, 0x0004
800000ea: 0000 .insn 2, 0x
800000ec: 002c .insn 2, 0x002c
800000ee: 8000 .insn 2, 0x8000
800000f0: 006c .insn 2, 0x006c
800000f2: 8000 .insn 2, 0x8000
800000f4: 004c .insn 2, 0x004c
800000f6: 8000 .insn 2, 0x8000
800000f8: 0080 .insn 2, 0x0080
800000fa: 8000 .insn 2, 0x8000
...
NEXT
NEXT:
lw t0, 0(s0) # IP
addi s0, s0, 4 # IP
jr t0
NEXT loads 4 bytes from memory at address s0 into t0, then increments s0 with 4 and jumps to t0. the value of s0 is 0x800000d0, and the value at memory[0x800000d0] is 0x800000058.
800000d0: 0058 .insn 2, 0x0058
800000d2: 8000 .insn 2, 0x8000
You can see it here, but it is written backwards, 0058 8000. How we print numbers and how we use them depend on which bytes we think are first. There are two ways, Big-endian and Little-endian.
In our case we compiling the code for a Little Endian RISC-V processor.
Memory Address | Byte Value
--------------------------
800000d0 | 58 (least significant byte)
800000d1 | 00
800000d2 | 00
800000d3 | 80 (most significant byte)
The term "endian" comes from Gulliver's Travels where two groups fought if they should break eggs at the big end or little end.
Objdump is showing the data 2 bytes at a time for memory contents, and thats why it looks backwards.
Honestly this endianness thing always annoys me, I wish it was only one, but we are where we are.
s0: 0x800000d0 # Forth Instruction Pointer
s1: 0x800004fc # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
s0 -> 800000d0: 0x80000058 | LITERAL
800000d4: 0x00000002 | 2
800000d8: 0x80000058 | LITERAL
800000dc: 0x00000003 | 3
800000e0: 0x8000002c | PLUS
800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
800004f4: 0x00000000 |
800004f8: 0x00000000 |
s1 -> 800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
Anyway, if you look up you will see that at address 0x80000058 we have our
LITERAL function, so NEXT
will jump to LITERAL. We will follow the value of s0
through the process.
LITERAL
NEXT added 4 to s0 before jumping, so it is at 0x800000d4 when we come into LITERAL
s0: 0x800000d4 # Forth Instruction Pointer
s1: 0x800004fc # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
800000d0: 0x80000058 | LITERAL
s0 -> 800000d4: 0x00000002 | 2
800000d8: 0x80000058 | LITERAL
800000dc: 0x00000003 | 3
800000e0: 0x8000002c | PLUS
800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
800004f4: 0x00000000 |
800004f8: 0x00000000 |
s1 -> 800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
Literal will load the value at memory[s0], in this case you can see its the
value 2
, then it will add 4 to s0, and push it on the forth stack, we use s1
to keep track of it. Our stack grows upwards, meaning it starts at a high
address and we just decreas its value, it is all relative this upwards downwards
thing, I call it upwards because I have the low addressess on top when I write,
so the stack grows up, but if you draw the memory the other way it will grow
down. Anyway, we decrease the value of s1.
LITERAL:
lw t0, 0(s0) # IP
addi s0, s0, 4 # IP
# PUSH t0
addi s1, s1, -4 # SP
sw t0, 0(s1) # SP
j NEXT
After LITERAL is done we will have 2 on the Forth stack, and then we jump to NEXT.
s0: 0x800000d8 # Forth Instruction Pointer
s1: 0x800004f8 # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
800000d0: 0x80000058 | LITERAL
800000d4: 0x00000002 | 2
s0 -> 800000d8: 0x80000058 | LITERAL
800000dc: 0x00000003 | 3
800000e0: 0x8000002c | PLUS
800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
800004f4: 0x00000000 |
s1 -> 800004f8: 0x00000002 | 2
800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
NEXT
NEXT again will load the value at memory[s0] into t0, in this case memory[0x800000d8], which is again 0x80000058, it will increment s0 with 4 and jump to t0.
This will be the memory state after NEXT.
s0: 0x800000dc # Forth Instruction Pointer
s1: 0x800004f8 # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
800000d0: 0x80000058 | LITERAL
800000d4: 0x00000002 | 2
800000d8: 0x80000058 | LITERAL
s0 -> 800000dc: 0x00000003 | 3
800000e0: 0x8000002c | PLUS
800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
800004f4: 0x00000000 |
s1 -> 800004f8: 0x00000002 | 2
800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
LITERAL
Again literal will load memory[s0] which is 3 and push it on the forth stack by also decrementing s1 with 4 and incrementing s0 with 4.
s0: 0x800000e0 # Forth Instruction Pointer
s1: 0x800004f4 # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
800000d0: 0x80000058 | LITERAL
800000d4: 0x00000002 | 2
800000d8: 0x80000058 | LITERAL
800000dc: 0x00000003 | 3
s0 -> 800000e0: 0x8000002c | PLUS
800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
s1 -> 800004f4: 0x00000003 | 3
800004f8: 0x00000002 | 2
800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
NEXT
Same story, load memory[s0] into t0, memory[0x800000e0] is 0x8000002c, and that is the address of our PLUS function, add 4 to s0 and jump to t0,
s0: 0x800000e4 # Forth Instruction Pointer
s1: 0x800004f4 # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
800000d0: 0x80000058 | LITERAL
800000d4: 0x00000002 | 2
800000d8: 0x80000058 | LITERAL
800000dc: 0x00000003 | 3
800000e0: 0x8000002c | PLUS
s0 -> 800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
s1 -> 800004f4: 0x00000003 | 3
800004f8: 0x00000002 | 2
800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
PLUS
Plus will pop two values from the stack, add them and push back to the stack, lets follow the stack.
8000002c <PLUS>:
8000002c: 0004a283 lw t0,0(s1)
80000030: 00448493 addi s1,s1,4
80000034: 0004a303 lw t1,0(s1)
80000038: 00448493 addi s1,s1,4
8000003c: 006282b3 add t0,t0,t1
80000040: ffc48493 addi s1,s1,-4
80000044: 0054a023 sw t0,0(s1)
80000048: fd9ff06f j 80000020 <NEXT>
lw t0, 0(s1), memory[800004f4] is 3
t0 is set to 3
---------------------------------------------
s1 -> 800004f4: 0x00000003 | 3
800004f8: 0x00000002 | 2
800004fc: 0x00000000 | 0
---------------------------------------------
addi s1,s1,4
---------------------------------------------
800004f4: 0x00000003 | 3
s1 -> 800004f8: 0x00000002 | 2
800004fc: 0x00000000 | 0
---------------------------------------------
lw t1, 0(s1), memory[800004f8] is 2
t1 is set to 2
---------------------------------------------
800004f4: 0x00000003 | 3
s1 -> 800004f8: 0x00000002 | 2
800004fc: 0x00000000 | 0
---------------------------------------------
addi s1,s1,4
---------------------------------------------
800004f4: 0x00000003 | 3
800004f8: 0x00000002 | 2
s1 -> 800004fc: 0x00000000 | 0
---------------------------------------------
add t0,t0,t1, t0 = t0 + t1
t0 is set to 5
---------------------------------------------
800004f4: 0x00000003 | 3
800004f8: 0x00000002 | 2
s1 -> 800004fc: 0x00000000 | 0
---------------------------------------------
addi s1,s1,-4
---------------------------------------------
800004f4: 0x00000003 | 3
s1 -> 800004f8: 0x00000002 | 2
800004fc: 0x00000000 | 0
---------------------------------------------
sw t0, 0(s1), t0 is 5,
memory[800004f8] is set to 5
---------------------------------------------
800004f4: 0x00000003 | 3
s1 -> 800004f8: 0x00000005 | 5
800004fc: 0x00000000 | 0
---------------------------------------------
After the PLUS function you see the top of the stack has value 5, we have "consumed" 2 and 3 and inserted "5" in their place, 3 is still left in memory but it is just a garbage value, we wont bother cleaning it up, next time we add something to the stack it will be overwritten. And when its done it will jump to NEXT.
This is how the memory looks after PLUS
s0: 0x800000e4 # Forth Instruction Pointer
s1: 0x800004f8 # Forth Stack Pointer
Address Value (big-endian) | Meaning
----------------------------------------------
800000d0: 0x80000058 | LITERAL
800000d4: 0x00000002 | 2
800000d8: 0x80000058 | LITERAL
800000dc: 0x00000003 | 3
800000e0: 0x8000002c | PLUS
s0 -> 800000e4: 0x80000058 | LITERAL
800000e8: 0x00000004 | 4
800000ec: 0x8000002c | PLUS
800000f0: 0x8000006c | EMIT
800000f4: 0x8000004c | CR
800000f8: 0x80000080 | BYE
800000fc: 0x00000000 |
... |
800004f4: 0x00000003 | 3
s1 -> 800004f8: 0x00000005 | 5
800004fc: 0x00000000 | Top of stack
80000500: 0x00000000 | unused memory
... | unused memory
NEXT
Same old next, doing the same thing, load memory[s0] into t0, add 4 to s0, jump to t0. so we go to LITERAL again, which will put 4 on the stack, then again we go to PLUS, which will pop 4 and pop 5 and add them and push 9., then we go to EMIT. EMIT pops 9 from the stack and adds 48 to it, puts the result in a0 and calls putc to print the character on screen (48 is the ascii for '0' and 48 + 9 is the ascii for 9). after EMIT is done it jumps to NEXT, then NEXT jumps to CR, which prints a new line, and jumps to NEXT again, and then we get to BYE which exits qemu.
--
You see we have a language inside assembly, it weaving like a thread, function -> next -> function -> next -> function next. So tiny and nice, it took us only few lines of code. Just like a silk thread weaving through memory.
Imagine WRITE function that pops two values from the stack, one a memory address, and one a value. Almost like PLUS but instead of pushing to the stack, we will write to the specific value to the specified address.
WRITE:
# POP t0, address
lw t0, 0(s1)
addi s1, s1, 4
# POP t1, value
lw t1, 0(s1)
addi s1, s1, 4
sw t1, 0(t0)
j NEXT
We could write this program that writes 7 to address 0x800000fc
.word LITERAL
.word 7
.word LITERAL
.word 0x800000fc
.word WRITE
We put the value 8 on the stack with .word LITERAL .word 7, and then we put 0x800000fc on the stack .word LITERAL .word 0x800000fc, then LITERAL's NEXT will jump into WRITE, which will pop the two values from the stack, the first pop is the address into t0, then it will pop the value 7 into t1 and finally it will write 7 into memory[t0], or memory[0x800000fc], now imagine if the program itself is there at address 0x800000fc.
I wrote this small program, the addressess are different than the ones we had so far because I addedd the WRITE code which will move everyuthing by 6 instructions, each instruciton is 4 bytes, so everything will be off by 24 bytes, but anyway, I just want to illustrate the point:
bytecode:
.word LITERAL
.word 0x800000a0
.word LITERAL
.word 0x80000104
.word WRITE
bytecode is at 0x800000f0 and ends at 0x80000104, so with this small program we write the value 0x800000a0 at address 0x80000104 and the value 0x800000a0 happens to be the address of BYE. I could've written it using labels:
bytecode:
.word LITERAL
.word BYE
.word LITERAL
.word bytecode+20 # 4 * 5
.word WRITE
# -> we want to write here
Or we could create a bytecode_end label that we can use.
bytecode:
.word LITERAL
.word BYE
.word LITERAL
.word bytecode_end
.word WRITE
bytecode_end:
The assembler knows where everything will be in memory, it knows that bytecode
will be at address X then each .word is 4 bytes, it knows bytecode_end is going
to be at bytecode + 20 bytes. .word bytecode_end
will be replaced with the
apropriate value. The labeling in modern assembler is really cool! And what is
even more cool, is that we wrote our program using our small bytecode language,
tha modified the memory where it lives. Such power!
Think for a second, what would this do.
.word LITERAL
.word LITERAL
.word LITERAL
.word LITERAL
.word WRITE
We will do few quality of life improvements to allow us to write code easier, for example I am constantly confused by stacks growing direction, I often forget to do -4 or +4 and that leads to a lot of pain and suffering and hours of debugging and then facepalming.
We will use MACROs, a macro is just a piece of code that gets executed before the program is compiled, it is like a program that the compiler executes on the source code itself.
.macro PUSH reg
addi s1, s1, -4
sw \reg, 0(s1)
.endm
.macro POP reg
lw \reg, 0(s1)
addi s1, s1, 4
.endm
then PLUS becomes:
PLUS:
POP t0
POP t1
add t0, t0, t1
PUSH t0
j NEXT
see its much clearer, POP reg
will be expanded into "lw reg, 0(s1); addi s1,
s1, 4" and PUSH reg
will be expanded in "addi s1, s1, -4; sw reg, 0(s1)". For
example POP t0
will expand to "lw t0, 0(s1); addi s1, s1, 4".
You can see that macros allow you to extend the language. Some programming languages have extremely flexible macro system that is a language in itself, and the best languages have a macro system that is the language itself (like LISP). In our case we will just use macros to help us not repeat the same few lines of code over and over again.
We could of course create a POP function and a PUSH function, and call into it, but this will at least double the amount of instructions per operation, its really not worth it.
We will create another quality of life improvement, instead of using s0 and s1 we will use :IP and :SP, IP as in instruction pointer, it is the same as program counter, just a register we will use to point where are we in the program, our index finger if you will, and :SP we will use for a stack pointer in the Forth stack. Sadly it is not possible to do that with a macro or in any other way in RISC-V assembly, so we will use an external program to replace :SP to s1 and :IP to s0 before we give the source code to the assembler to make machine code.
We will use the sed
command to replace all occurances. Now we are entering a
bit more complicated territory because the project is going to grow, we could
put everything in boot.s
but its going to be really hard to read, also we will
need a way to debug if there is an issue and be able to execute instructions
step by step.
This is a new version of the Makefile that allows us to have many .s files and in the end they are linked into one .elf file, it also creates a build/ directory and puts there all the object file (unlinked machine code files) and the elf file. It also uses sed to replace :IP, :SP and other registers we would use later into their corresponding s0, s1, s2 registers.
Its beyond the scope of the book to dig deeper into (.. I was going to use the word delve here, but now people will think that chatgpt wrote this if I do) the GNU Make language, I also dont think its worth spending time on it, just ask chatgpt to copy the code from the page and explain it.
# Compiler and linker
AS = riscv64-unknown-elf-as
LD = riscv64-unknown-elf-ld
GDB = riscv64-unknown-elf-gdb
# Flags
ASFLAGS = -g -march=rv32g -mabi=ilp32
LDFLAGS = -T linker.ld --no-warn-rwx-segments -m elf32lriscv
# QEMU command
QEMU = qemu-system-riscv32
QEMU_FLAGS = -nographic -machine virt -bios none
# Directories
SRC_DIR = .
BUILD_DIR = build
OBJ_DIR = $(BUILD_DIR)/obj
# Source files
SRC_FILES = $(wildcard $(SRC_DIR)/*.s)
OBJ_FILES = $(patsubst $(SRC_DIR)/%.s,$(OBJ_DIR)/%.o,$(SRC_FILES))
# Target executable
TARGET = $(BUILD_DIR)/boot.elf
# GDB script
GDB_SCRIPT = $(BUILD_DIR)/gdb_commands.gdb
# Default target
all: directories $(TARGET) $(GDB_SCRIPT)
# Create necessary directories
directories:
@mkdir -p $(OBJ_DIR)
# Compile .s files to object files
$(OBJ_DIR)/%.o: $(SRC_DIR)/%.s
@sed -e 's/:IP/s0/g' \
-e 's/:SP/s1/g' \
-e 's/:RSP/s2/g' \
-e 's/:CSP/s3/g' \
-e 's/:HERE/s4/g' \
-e 's/:XT/s5/g' \
-e 's/:LATEST/s6/g' \
-e 's/:MODE/s7/g' \
-e 's/:ESP/s8/g' $< > $@.pre.s
$(AS) $(ASFLAGS) $@.pre.s -o $@
# Link object files to create the executable
$(TARGET): directories $(OBJ_FILES)
$(LD) $(LDFLAGS) $(OBJ_FILES) -o $@
# Create GDB script
$(GDB_SCRIPT):
@echo "target remote localhost:1234" > $@
@echo "tui enable" >> $@
@echo "tui layout reg" >> $@
@echo "file $(TARGET)" >> $@
@echo "break _start" >> $@
@echo "continue" >> $@
# Clean up
clean:
rm -rf $(BUILD_DIR)
# Run the program in QEMU
run: $(TARGET)
$(QEMU) $(QEMU_FLAGS) -kernel $(TARGET)
# Run QEMU with GDB server enabled
qemu-gdb: $(TARGET)
reset ; $(QEMU) $(QEMU_FLAGS) -kernel $(TARGET) -S -s
# Run GDB and connect to QEMU
gdb: $(TARGET) $(GDB_SCRIPT)
$(GDB) -x $(GDB_SCRIPT)
kill:
killall -9 qemu-system-riscv32
objdump:
riscv64-unknown-elf-objdump -D build/boot.elf
objdump-data:
riscv64-unknown-elf-objdump -s -j .data build/boot.elf
.PHONY: all clean run run-gdb-server run-gdb debug directories objdump kill
When you replace your Makefile with this version you will have the commands
make, make run, make clean, make gdb, make qemu-gdb, make kill, make objdump
if you want to debug the program you need to run make qemu-gdb which starts QEMU
waiting for gdb to hook into it, and does not execute any instruction until its
connected to gdb, then in another terminal you must run make gdb which will
start gdb with the right parameters to connect to QEMU. Then you can run 'si'
which means 'step into' and it will run one instruction at a time. You can also
add breakpoints and pause the program at various places.
OK, now we can rewrite our code with all the quality of life improvements, we will split it into 3 files, boot.s where we will just do super basic preparation and jump into the forth interpreter, qemu.s where we will have all the qemu dependent code, like putc, getch, qemu_exit, and forth.s where we will keep the forth stuff.
# boot.s
.section .text
.globl _start
_start:
la sp, _stack_top
j forth
li a0, 'Z'
call putc
j qemu_exit
.end
# qemu.s
.section .text
.globl putc
.globl getch
.globl qemu_exit
putc:
li t0, 0x10000000
1:
lbu t1, 5(t0)
andi t1, t1, 0x20
beqz t1, 1b
sb a0, 0(t0)
ret
getch:
li t0, 0x10000000
1:
lbu t1, 5(t0)
andi t1, t1, 0x01
beqz t1, 1b
lbu a0, 0(t0)
ret
qemu_exit:
li t0, 0x100000
li t1, 0x5555
sw t1, 0(t0)
j .
.end
# forth.s
.section .text
.globl forth
.macro PUSH reg
addi :SP, :SP, -4
sw \reg, 0(:SP)
.endm
.macro POP reg
lw \reg, 0(:SP)
addi :SP, :SP, 4
.endm
forth:
la :SP, FORTH_STACK_END
la :IP, bytecode
# start the program
j NEXT
NEXT:
lw t0, 0(:IP)
addi :IP, :IP, 4
jr t0
PLUS:
POP t0
POP t1
add t0, t0, t1
PUSH t0
j NEXT
LITERAL:
lw t0, 0(:IP)
addi :IP, :IP, 4
PUSH t0
j NEXT
EMIT:
POP a0
add a0, a0, '0'
call putc
j NEXT
WRITE:
POP t0 # address
POP t1 # value
sw t1, 0(t0)
j NEXT
BYE:
j qemu_exit
CR:
li a0, '\n'
call putc
j NEXT
bytecode:
# "2 3 + 4 + . cr bye"
.word LITERAL
.word 2
.word LITERAL
.word 3
.word PLUS
.word LITERAL
.word 4
.word PLUS
.word EMIT
.word CR
.word BYE
.space 1024
FORTH_STACK_END:
.end
You will notice we use this .globl directive, which tells the assembler that this symbol (e.g getch) will be accessible from other object files.
Just like a silk thread weaving through memory.
I want to be able to write the text "2 3 + 4 + . cr bye" somewhere in memory instead of writing the bytecode by hand.
...
program:
.asciz "2 3 + 4 + . cr bye"
.asciz means null terminated ascii string, it will write the bytes 50 32 51 32 43 32 52 32 43 32 46 32 99 114 32 98 121 101 0 in the binary, which then will be loaded in memory on wherever the program: label happens to fall at.
You have experienced many levels of programming languages so far, from the microcode and the EEPROM wires, to SUBLEQ, to assembly, and now the mini forth bytecode thread jumping language. All of them allow you to program the computer. If you made a language on top of the wires lets call it W, and then a language on top of this language lets call it A, then whatever A can do W can do, as ultimately, A is executed by W. Why do we keep building higher and higher level languages, further and further from the wires? Those languages for sure can program the machine, but, in order for you express your thoughts into it, you have to think in the language you are using, and those very low level languages are much harder to think in, you cant keep track of 9548 wires and if they are on or off and what is going to happen next, its just not possible. But you can think about higher concepts, like remembering what is in a stack of values, you know the number 2 and 3 and you want to add them, this is how you think, the programming language has to be good both for you AND the machine to think in.
The problem is that everyone of us thinks differently, and certain things are easy for one and hard for another, as I said, if I were to make the perfect chair for me, it will be a torture device for you. Keep that in mind when studying programming, the languages we have are a compromise between how most people think and how the machines we made think. Do not worry if you struggle to express yourself, it takes time. It is not the same as learning another man made language, like knowing Dutch and learning English, those are languages made by people for people, they dont change faster than we change.
We will now build up from our bytecode language to the ascii "human" like language, but you know by now, it is just wires all the way down.
The first step we must do is to be able to know where a symbol starts and where it ends, for example 2 and + are 1 character long, cr is 2, bye is 3. Looking at the program we can just split the symbols by space, and things will work out. This is the very first step in any programming languages, tokenizing the program. Tokenization is the process of splitting something into chunks that you would work with, for example in language this is usually words, but you could also make character tokens, or you can make bigrams (twowords) or trigrams triwordstogether, or character ngrams li ke th is, for us we want to create a token out of each symbol, word, digit etc.
When I am working with something in memory I always imagine it in some random address, in this case I will think that our text program will be located at address 0x80001000.
Memory Address ASCII Hex Dec
-----------------------------------
0x80001000 '2' 0x32 50
0x80001001 ' ' 0x20 32
0x80001002 '3' 0x33 51
0x80001003 ' ' 0x20 32
0x80001004 '+' 0x2B 43
0x80001005 ' ' 0x20 32
0x80001006 '4' 0x34 52
0x80001007 ' ' 0x20 32
0x80001008 '+' 0x2B 43
0x80001009 ' ' 0x20 32
0x8000100A '.' 0x2E 46
0x8000100B ' ' 0x20 32
0x8000100C 'c' 0x63 99
0x8000100D 'r' 0x72 114
0x8000100E ' ' 0x20 32
0x8000100F 'b' 0x62 98
0x80001010 'y' 0x79 121
0x80001011 'e' 0x65 101
0x80001012 '\0' 0x00 0
We will make a function that takes a memory address and then returns the address of the next token and how big it is.
If we give it address 0x80001000 it should return 0x80001000 and length 1, if we give it 0x80001001, it should return 0x80001002 and 1. It will skip the leading spaces, then count the bytes of the token, if it reaches the null termination 0 or space it will stop.
Create a new file string.s and write the following code:
# string.s
.section .text
.globl token
# input:
# a0 address an ascii string
# output:
# a0 start token address
# a1 token length
token:
mv t3, a0 # t3 = initial address
li a1, 0 # length = 0
li t1, '!' # ascii 33, space is 32
# Skip leading spaces
.L_skip_spaces:
lbu t0, 0(t3) # load byte at current position
beqz t0, .L_done # if null termination, done
bge t1, t0, .L_count_token # if char >= 33 start counting
addi t3, t3, 1 # increment address
j .L_skip_spaces
# Count token length until space or null
.L_count_token:
mv a0, t3 # a0 is the start of the token
.L_count_token_next:
lbu t0, 0(t3) # load byte
blt t0, t1, .L_done # it char < 33 (including 0) done
addi a1, a1, 1 # increment length
addi t3, t3, 1 # increment address
j .L_count_token_next
.L_done:
ret
.end
There is a convention in RISC-V assembly to use .L_ for local labels, nothing will stop you to jump to them from anywhere, but at least its clear that its not intended to be jumped into from random places.
This code skips more than space, it skips anything in the ascii table below 33 (!), which includes new line, tab and other weird characters.
Our language will be the inverse of python. WHITESPACE FREEDOOOMM!
2
3 +
4
+
. cr
bye
We will use string.s to write other string functions we need, like is_number, atoi (ascii to integer), puts and print integer. As you might have guessed, we need to know if a token is a number or not, so we know if we should make put LITERAL, 3 or PLUS in the bytecode.
I will just show a bunch of code now, nothing you havent seen, but just more of it. There are a bunch of helper functions to help us manipulate the stack, to make from 1 2 3 -> 3 1 2, or 1 2 -> 1 2 1 2 and few more, if you read the code you will see how they work, its just pops and pushes.
Ther are also few helper functions to allow us to do memcompare, print integers, convert strings to integers etc. Again I wont go into a lot of detail, I have asked chatgpt to redo the comments so they are clearer, at least I found them clearer than the ones I wrote.
# string.s
# I actually wrote this but o1 pro styled it, it makes such beautiful and clear comments.
# After I confirmed they are correct, I couldnt resist using its version
#=====================================================================
# RISC-V Assembly Utilities
#
# This file provides:
# - token : Extract the next token (non-whitespace) from a string
# - is_number : Check if a substring is purely decimal digits
# - atoi : Convert a decimal string to an integer
# - puts : Print a null-terminated string
# - puts_len : Print a string up to a given length
# - print_int : Print an integer in decimal format
# - memcmp : Compare two memory arrays
# - print_unsigned_hex : Print integer in hex format (useful for address print)
#=====================================================================
.section .text
.globl token
.globl is_number
.globl atoi
.globl puts
.globl puts_len
.globl print_int
.globl memcmp
.globl print_unsigned_hex
#---------------------------------------------------------------------
# token
#
# Input:
# a0 = address of a null-terminated ASCII string
#
# Output:
# a0 = start of the next token
# a1 = length of that token
#
# Description:
# 1) Skips leading whitespace (ASCII < 33).
# 2) Returns the address at which the non-whitespace data begins.
# 3) Counts characters until the next whitespace or null terminator.
#---------------------------------------------------------------------
token:
mv t3, a0 # t3 = current pointer in string
li a1, 0 # a1 = token length = 0
li t1, 33 # ASCII 33 = '!' (first non-space, e.g. ' ' = 32)
#--- Skip leading spaces
.L_skip_spaces:
lbu t0, 0(t3) # load byte
beqz t0, .L_done_token # if null terminator -> done (empty token)
bge t0, t1, .L_count_token
addi t3, t3, 1 # else skip this whitespace char
j .L_skip_spaces
#--- Count token length
.L_count_token:
mv a0, t3 # a0 = start of token
.L_count_token_next:
lbu t0, 0(t3) # load byte
blt t0, t1, .L_done_token
addi a1, a1, 1 # increment token length
addi t3, t3, 1 # move to next character
j .L_count_token_next
.L_done_token:
ret
#---------------------------------------------------------------------
# is_number
#
# Input:
# a0 = address of the substring
# a1 = length of the substring
#
# Output:
# a0 = -1 if the substring is a valid integer (negative or positive)
# a0 = 0 if not
#
# Notes:
# - A leading minus sign is optional.
# - A lone minus sign ("-") is invalid.
# - Any non-digit character immediately disqualifies the string.
#---------------------------------------------------------------------
is_number:
beqz a1, .L_not_number # if length == 0, not a number
mv t0, a0 # t0 = current string pointer
mv t1, a1 # t1 = remaining length
# Check for optional leading minus sign
lbu t2, 0(t0) # look at first character
li t3, '-'
beq t2, t3, .L_handle_minus # if '-', skip it
#---------------------------------------------------------------------
# .L_check_digit_loop:
# Check each character must be '0'..'9'.
#---------------------------------------------------------------------
.L_check_digit_loop:
lbu t2, 0(t0) # load current character
li t3, '0' # ASCII '0' (48)
li t4, '9' # ASCII '9' (57)
blt t2, t3, .L_not_number # if char < '0' -> not number
bgt t2, t4, .L_not_number # if char > '9' -> not number
# Move to next character
addi t0, t0, 1
addi t1, t1, -1
bnez t1, .L_check_digit_loop # keep checking until length=0
# If we exit the loop normally, all checked chars are digits
li a0, -1 # indicate "valid number"
ret
#---------------------------------------------------------------------
# .L_handle_minus:
# Skip the minus sign and then check digits.
#---------------------------------------------------------------------
.L_handle_minus:
addi t0, t0, 1 # skip '-'
addi t1, t1, -1
beqz t1, .L_not_number # if no chars after '-', not number
j .L_check_digit_loop
#---------------------------------------------------------------------
# .L_not_number:
# If anything fails above, return 0.
#---------------------------------------------------------------------
.L_not_number:
li a0, 0
ret
#---------------------------------------------------------------------
# atoi (ASCII to Integer)
#
# Input:
# a0 = address of decimal string (may start with '-', followed by digits)
# a1 = length of the string
#
# Output:
# a0 = integer value of that string
#
# Description:
# - If the first character is '-', then parse the rest as digits
# and return the negative of that value.
# - Otherwise, treat all characters as digits ('0'..'9').
#
# Assumptions:
# - The string is valid and contains only an optional '-' plus digits,
# or the function’s caller already ensures validity.
#---------------------------------------------------------------------
atoi:
# Prologue: save RA and s-registers
addi sp, sp, -20
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
mv s0, a0 # s0 = pointer to string
mv s1, a1 # s1 = remaining length
li s2, 0 # s2 = accumulator (result)
li t0, 10 # t0 = base (10)
li s3, 0 # s3 = sign flag (0 = positive, 1 = negative)
# If string is empty, result stays 0
beqz s1, .L_atoi_done
# Check for optional leading '-'
lbu t1, 0(s0) # load first character
li t2, '-'
bne t1, t2, .L_parse_digits # if not '-', skip sign logic
# If '-' is found, set sign flag to negative
li s3, 1
addi s0, s0, 1 # skip the '-'
addi s1, s1, -1 # adjust the remaining length
.L_parse_digits:
# Loop over remaining digits
.L_atoi_loop:
beqz s1, .L_atoi_done # stop if no characters left
# result = result * 10
mul s2, s2, t0
# add current digit
lbu t1, 0(s0) # load ASCII digit
addi t1, t1, -48 # convert '0'..'9' to 0..9
add s2, s2, t1
# advance pointers
addi s0, s0, 1
addi s1, s1, -1
j .L_atoi_loop
.L_atoi_done:
# If negative flag was set, flip the sign
beqz s3, .L_return_result
neg s2, s2
.L_return_result:
mv a0, s2
# Epilogue: restore RA and s-registers
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
addi sp, sp, 20
ret
#---------------------------------------------------------------------
# puts
#
# Input:
# a0 = address of a null-terminated string
#
# Description:
# Prints characters one at a time until it hits a null terminator.
# Assumes an external function putc is available to print a single char.
#---------------------------------------------------------------------
puts:
# Prologue
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
mv s0, a0
.L_puts_loop:
lbu a0, 0(s0) # load current char
beqz a0, .L_puts_done # if '\0', stop
call putc # print char
addi s0, s0, 1 # next char
j .L_puts_loop
.L_puts_done:
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
addi sp, sp, 8
ret
#---------------------------------------------------------------------
# puts_len
#
# Input:
# a0 = address of string
# a1 = length
#
# Description:
# Prints exactly 'length' characters from the given address.
# Calls an external function putc to print a single char.
#---------------------------------------------------------------------
puts_len:
# Prologue
addi sp, sp, -12
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
mv s0, a0 # string address
mv s1, a1 # length
.L_puts_len_loop:
beqz s1, .L_puts_len_done # if length == 0, done
lbu a0, 0(s0) # load current char
call putc # print char
addi s0, s0, 1
addi s1, s1, -1
j .L_puts_len_loop
.L_puts_len_done:
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
addi sp, sp, 12
ret
#---------------------------------------------------------------------
# print_unsigned_hex
#
# Input:
# a0 = unsigned integer to print in hexadecimal format
#
# Description:
# 1) Prints "0x" prefix
# 2) Extracts each 4-bit nibble from most to least significant
# 3) Converts each nibble to its ASCII hex digit ('0'-'9', 'a'-'f')
# 4) Skips leading zeros but always prints at least one digit
#
# Notes:
# - Uses putc to print individual characters
# - Prints lowercase hex digits (a-f) for values 10-15
# - Always includes "0x" prefix for clarity
#---------------------------------------------------------------------
print_unsigned_hex:
# Prologue
addi sp, sp, -20
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
mv s0, a0 # s0 = number to print
li s1, 28 # s1 = current shift amount (7 nibbles * 4)
li s2, 0 # s2 = leading zeros flag (0 = still skipping)
# Print "0x" prefix
li a0, '0'
call putc
li a0, 'x'
call putc
.L_print_hex_loop:
# Extract current nibble
mv t0, s0
srl t0, t0, s1 # shift right to get current nibble
andi t0, t0, 0xf # mask to get just the nibble
# Skip this digit if it's a leading zero (unless it's the last digit)
bnez t0, .L_print_digit # if non-zero, must print it
bnez s2, .L_print_digit # if already printed something, must continue
beqz s1, .L_print_digit # if it's the last digit, must print even if zero
# This is a leading zero we can skip
j .L_next_nibble
.L_print_digit:
li s2, 1 # mark that we're now printing digits
# Convert to ASCII
li t1, 10
blt t0, t1, .L_numeric # if < 10, use '0'-'9'
# Handle a-f (value 10-15)
addi t0, t0, 'a' - 10
j .L_print_char
.L_numeric:
# Handle 0-9
addi t0, t0, '0'
.L_print_char:
mv a0, t0
call putc
.L_next_nibble:
addi s1, s1, -4 # move to next nibble
bgez s1, .L_print_hex_loop
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
addi sp, sp, 20
ret
#---------------------------------------------------------------------
# print_int
#
# Input:
# a0 = integer to print
#
# Description:
# 1) Checks if the number is 0; prints '0' if so.
# 2) If negative, print a '-', then flip it positive.
# 3) Continuously take remainder by 10, push ASCII digit onto stack,
# then pop them off in reverse order to print.
#---------------------------------------------------------------------
print_int:
# Prologue
addi sp, sp, -16
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
mv s0, a0 # s0 = integer to print
mv s1, sp # s1 = stack pointer for pushing digits
li s2, 10 # divisor = 10
# Handle zero as special case
bnez s0, .L_pi_convert
li a0, '0'
call putc
j .L_pi_done
.L_pi_convert:
# Handle negative numbers
bgez s0, .L_pi_digits
li a0, '-'
call putc
neg s0, s0
.L_pi_digits:
# Repeatedly divide s0 by 10, push remainder digit onto stack
beqz s0, .L_pi_print
rem t0, s0, s2 # remainder
addi t0, t0, 48 # + '0'
addi s1, s1, -4
sw t0, 0(s1)
div s0, s0, s2
j .L_pi_digits
.L_pi_print:
# Pop digits and print
beq s1, sp, .L_pi_done
lw a0, 0(s1)
call putc
addi s1, s1, 4
j .L_pi_print
.L_pi_done:
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
addi sp, sp, 16
ret
#---------------------------------------------------------------------
# memcmp
#
# Inputs:
# a0 = ptr1 (start address of first buffer)
# a1 = len1 (number of bytes in first buffer)
# a2 = ptr2 (start address of second buffer)
# a3 = len2 (number of bytes in second buffer)
#
# Output:
# a0 = -1 if buffers have same length and contents
# a0 = 0 otherwise (length mismatch or byte mismatch)
#---------------------------------------------------------------------
memcmp:
# First, check if lengths are equal
bne a1, a3, .L_not_equal # lengths differ => not equal
# If length is 0 and they are both the same size, they're "equal" (both empty)
beqz a1, .L_equal
.L_compare_loop:
lbu t0, 0(a0) # load byte from first buffer
lbu t1, 0(a2) # load byte from second buffer
bne t0, t1, .L_not_equal # mismatch => not equal
addi a0, a0, 1 # advance ptr1
addi a2, a2, 1 # advance ptr2
addi a1, a1, -1 # decrement length
bnez a1, .L_compare_loop # if more bytes to compare, continue
.L_equal:
li a0, -1 # indicate "equal"
ret
.L_not_equal:
li a0, 0 # indicate "not equal"
ret
.end
# forth.s
.section .text
.globl forth
.macro PUSH reg
addi :SP, :SP, -4
sw \reg, 0(:SP)
.endm
.macro POP reg
lw \reg, 0(:SP)
addi :SP, :SP, 4
.endm
forth:
la :SP, FORTH_STACK_END
la :IP, bytecode
# start the program
j NEXT
NEXT:
lw t0, 0(:IP)
addi :IP, :IP, 4
jr t0
# ( a b -- c )
PLUS:
POP t0
POP t1
add t0, t0, t1
PUSH t0
j NEXT
# ( -- n )
LITERAL:
lw t0, 0(:IP)
addi :IP, :IP, 4
PUSH t0
j NEXT
# ( n -- )
EMIT:
POP a0
jal print_int
j NEXT
# ( value addr -- )
WRITE:
POP t0 # address
POP t1 # value
sw t1, 0(t0)
j NEXT
# ( -- )
BYE:
j qemu_exit
# ( -- )
CR:
li a0, '\n'
jal putc
j NEXT
# ( addr -- len addr )
PARSE_TOKEN:
POP a0
jal token
PUSH a1 # length
PUSH a0 # token address
j NEXT
# ( len addr -- n )
ATOI:
POP a0 # address
POP a1 # length
jal atoi
PUSH a0
j NEXT
# ( len addr -- f )
IS_NUMBER:
POP a0 # address
POP a1 # length
jal is_number
PUSH a0
j NEXT
# ( a -- a a )
DUP:
POP t0
PUSH t0
PUSH t0
j NEXT
# ( a b -- b a )
SWAP:
POP t0 # b
POP t1 # a
PUSH t0
PUSH t1
j NEXT
# ( a -- )
DROP:
POP zero
j NEXT
# ( a b -- )
TWODROP:
POP zero
POP zero
j NEXT
# ( a b -- a b a b )
TWODUP:
POP t0 # b
POP t1 # a
PUSH t1 # a
PUSH t0 # b
PUSH t1 # a
PUSH t0 # b
j NEXT
# ( n1 n2 -- n1 n2 n1 )
OVER:
POP t0 # n2
POP t1 # n1
PUSH t1 # n1
PUSH t0 # n2
PUSH t1 # n1
j NEXT
# (x1 x2 x3 x4 -- x3 x4 x1 x2)
TWOSWAP:
POP t0 # x4
POP t1 # x3
POP t2 # x2
POP t3 # x1
PUSH t1
PUSH t0
PUSH t3
PUSH t2
j NEXT
# (x1 x2 x3 -- x2 x3 x1 )
ROT:
POP t0 # x3
POP t1 # x2
POP t2 # x1
PUSH t1 # x2
PUSH t0 # x3
PUSH t2 # x1
j NEXT
# (x1 x2 x3 -- x3 x1 x2)
NROT:
POP t0 # x3
POP t1 # x2
POP t2 # x1
PUSH t0 # x3
PUSH t2 # x1
PUSH t1 # x2
j NEXT
# ( a b -- f)
EQUAL:
POP t0
POP t1
beq t0, t1, .L_equal
li t0, 0
PUSH t0
j NEXT
.L_equal:
li t0, -1
PUSH t0
j NEXT
# ( len1 addr1 len2 addr2 -- flag)
MEMCMP:
POP a2
POP a3
POP a0
POP a1
call memcmp
PUSH a0
j NEXT
# ( f -- )
BRANCH_ON_ZERO:
POP t0
beqz t0, .L_do_branch
addi :IP, :IP, 4
j NEXT
.L_do_branch:
lw :IP, 0(:IP)
j NEXT
# ( -- )
JUMP:
lw :IP, 0(:IP)
j NEXT
# just a debug function to print the whole stack
# print debugging.. some people hate it some people love it
# i both hate it and love it
DEBUG_STACK:
addi sp, sp, -12
sw ra, 0(sp)
sw s8, 4(sp)
sw s9, 8(sp)
li a0, '<'
call putc
li a0, '>'
call putc
li a0, ' '
call putc
mv s9, :SP
add s9, s9, -4
la s8, FORTH_STACK_END
add s8, s8, -4
.L_debug_stack_loop:
beq s8, s9, .L_debug_stack_loop_end
lw a0, 0(s8)
call print_unsigned_hex
li a0, ' '
call putc
addi s8, s8, -4
j .L_debug_stack_loop
.L_debug_stack_loop_end:
li a0, '\n'
call putc
lw ra, 0(sp)
lw s8, 4(sp)
lw s9, 8(sp)
addi sp, sp, 12
j NEXT
human_program:
.asciz "842 31 + 721 + 3 + . bye"
# This bytecode says:
# 1) Push address of human_program onto stack.
# 2) Go parse tokens from that string.
# 3) Decide if each token is a number or a known word (+, ., bye).
# 4) Execute the corresponding Forth logic.
# lets assume human_program is at address 1000
bytecode:
.word LITERAL
.word human_program # 1000
# parse the token, and check if we have reached end of string
next_token:
.word PARSE_TOKEN # ( addr -- len addr)
# 3 1000 for the first token 842, length is 3, address is 1000
# 2 1004 for the second token: 31, length is 2 address is 1004
.word OVER # ( n1 n2 -- n1 n2 n1 )
# 3 1000 3
.word LITERAL
.word 0 # push 0 to the stack
# 3 1000 3 0
# we want to compare if the token's length is 0
# so we push 0 and call equal
.word EQUAL # ( n1 n2 -- flag )
# push -1 if n1 == n2, 0 otherwise
# 3 1000 -1/0
.word BRANCH_ON_ZERO # ( flag -- )
# pop the flag and if 0, jump to the next instruction, if not continue
.word check_is_number # if we have a token (flag is -1) check if its a number
.word BYE # no token left, quit qemu
# check if the token is a number, and if it is convert it to integer and push it to the stack
check_is_number:
# when we come here, the stacks is: len addr of the token
.word TWODUP # (n1 n2 -- n1 n2 n1 n2)
# duplicate the token len and addr because IS_NUMBER
# will pop len addr and return a flag if the token is number, and
# we still want to use the actual token after that
.word IS_NUMBER # ( len addr -- flag )
# 3 1000 3 1000 -> 3 1000 -1/0
.word LITERAL
.word -1 # push -1, stack becomes: len addr flag -1
# we want to compare IS_NUMBER with -1 (true), so we push -1
# and call equal
.word EQUAL # ( n1 n2 -- flag)
.word BRANCH_ON_ZERO # ( flag -- )
.word not_a_number # if the result of equal is zero, means the token is not a number
.word TWODUP # otherwise it is a number
# duplicate the len addr so we can convert it from string to a 4 byte number
# stack is now 3 1000 3 1000
.word ATOI # ( len addr -- value )
# stack: 3 1000 842
# now the token is properly converted to a number and is on top of the stack
.word NROT # ( n1 n2 n3 -- n3 n1 n2 )
# stack: 842 3 1000
# we want to -rot the stack so that the token length and address are on top
# we want to add the length to the address and go parse the next token
.word PLUS # ( n1 n2 -- n )
# stack 852 1003
.word JUMP # jump to next token
.word next_token
# if its not a number, check if its a dot . for EMIT
not_a_number:
.word TWODUP # when we come here the stack is: ... len addr
# duplicate the token to be compared with "."
.word LITERAL
.word 1 # length of "."
.word LITERAL
.word string_dot # address of the string "."
.word MEMCMP # ( len1 addr1 len2 addr2 -- flag)
.word BRANCH_ON_ZERO # ( flag -- )
.word not_a_dot # if memcmp pushes 0 to the stack, then the token is not "."
# otherwise prepare the stack to call EMIT to print it
.word ROT # ( x1 x2 x3 -- x2 x3 x1 )
# rotate the stack so we get the value that shoud've been pushed
# to the stack before we come here, so the stack is len addr value
#
.word EMIT # ( v -- )
# print the top of the stack, after it becomes len addr of the token
.word PLUS # add the token addr and its length, and go to the next token
.word JUMP # jump to next token
.word next_token #
not_a_dot:
.word TWODUP
.word LITERAL
.word 1
.word LITERAL
.word string_plus
.word MEMCMP
.word BRANCH_ON_ZERO
.word not_a_plus
.word TWOSWAP
.word PLUS
.word NROT
.word PLUS
.word JUMP
.word next_token
not_a_plus:
.word TWODUP
.word LITERAL
.word 3
.word LITERAL
.word string_bye
.word MEMCMP
.word BRANCH_ON_ZERO
.word do_next_token
.word BYE
do_next_token:
.word PLUS
.word JUMP
.word next_token
string_dot:
.ascii "."
.zero 3
string_plus:
.ascii "+"
.zero 3
string_bye:
.ascii "bye"
.zero 1
.space 1024
FORTH_STACK_END:
.end
First relax, thats a lot of code.
This is the compiled machine code. The program is to be loaded at address 0x80000000. Lets look at its purest form, where nothing is hidden, no secrets, no pseudo instructions, no macros, no words, no comments. As close as we can get to the wires. And yet, there are no wires, our QEMU computer is a computer within a computer.
80000000 <_start>:
80000000: 08000117 auipc sp,0x8000
80000004: 00010113 addi sp,sp,0 # 88000000 <_ram_end>
80000008: 0040006f jal zero,8000000c <forth>
8000000c <forth>:
8000000c: 00001497 auipc s1,0x1
80000010: 87148493 addi s1,s1,-1935 # 8000087d <FORTH_STACK_END>
80000014: 00000417 auipc s0,0x0
80000018: 36540413 addi s0,s0,869 # 80000379 <bytecode>
8000001c: 0040006f jal zero,80000020 <NEXT>
80000020 <NEXT>:
80000020: 00042283 lw t0,0(s0)
80000024: 00440413 addi s0,s0,4
80000028: 00028067 jalr zero,0(t0)
8000002c <PLUS>:
8000002c: 0004a283 lw t0,0(s1)
80000030: 00448493 addi s1,s1,4
80000034: 0004a303 lw t1,0(s1)
80000038: 00448493 addi s1,s1,4
8000003c: 006282b3 add t0,t0,t1
80000040: ffc48493 addi s1,s1,-4
80000044: 0054a023 sw t0,0(s1)
80000048: fd9ff06f jal zero,80000020 <NEXT>
8000004c <LITERAL>:
8000004c: 00042283 lw t0,0(s0)
80000050: 00440413 addi s0,s0,4
80000054: ffc48493 addi s1,s1,-4
80000058: 0054a023 sw t0,0(s1)
8000005c: fc5ff06f jal zero,80000020 <NEXT>
80000060 <EMIT>:
80000060: 0004a503 lw a0,0(s1)
80000064: 00448493 addi s1,s1,4
80000068: 28d000ef jal ra,80000af4 <print_int>
8000006c: fb5ff06f jal zero,80000020 <NEXT>
80000070 <WRITE>:
80000070: 0004a283 lw t0,0(s1)
80000074: 00448493 addi s1,s1,4
80000078: 0004a303 lw t1,0(s1)
8000007c: 00448493 addi s1,s1,4
80000080: 0062a023 sw t1,0(t0)
80000084: f9dff06f jal zero,80000020 <NEXT>
80000088 <BYE>:
80000088: 0290006f jal zero,800008b0 <qemu_exit>
8000008c <CR>:
8000008c: 00a00513 addi a0,zero,10
80000090: 7f0000ef jal ra,80000880 <putc>
80000094: f8dff06f jal zero,80000020 <NEXT>
80000098 <PARSE_TOKEN>:
80000098: 0004a503 lw a0,0(s1)
8000009c: 00448493 addi s1,s1,4
800000a0: 025000ef jal ra,800008c4 <token>
800000a4: ffc48493 addi s1,s1,-4
800000a8: 00b4a023 sw a1,0(s1)
800000ac: ffc48493 addi s1,s1,-4
800000b0: 00a4a023 sw a0,0(s1)
800000b4: f6dff06f jal zero,80000020 <NEXT>
800000b8 <ATOI>:
800000b8: 0004a503 lw a0,0(s1)
800000bc: 00448493 addi s1,s1,4
800000c0: 0004a583 lw a1,0(s1)
800000c4: 00448493 addi s1,s1,4
800000c8: 091000ef jal ra,80000958 <atoi>
800000cc: ffc48493 addi s1,s1,-4
800000d0: 00a4a023 sw a0,0(s1)
800000d4: f4dff06f jal zero,80000020 <NEXT>
800000d8 <IS_NUMBER>:
800000d8: 0004a503 lw a0,0(s1)
800000dc: 00448493 addi s1,s1,4
800000e0: 0004a583 lw a1,0(s1)
800000e4: 00448493 addi s1,s1,4
800000e8: 019000ef jal ra,80000900 <is_number>
800000ec: ffc48493 addi s1,s1,-4
800000f0: 00a4a023 sw a0,0(s1)
800000f4: f2dff06f jal zero,80000020 <NEXT>
800000f8 <DUP>:
800000f8: 0004a283 lw t0,0(s1)
800000fc: 00448493 addi s1,s1,4
80000100: ffc48493 addi s1,s1,-4
80000104: 0054a023 sw t0,0(s1)
80000108: ffc48493 addi s1,s1,-4
8000010c: 0054a023 sw t0,0(s1)
80000110: f11ff06f jal zero,80000020 <NEXT>
80000114 <SWAP>:
80000114: 0004a283 lw t0,0(s1)
80000118: 00448493 addi s1,s1,4
8000011c: 0004a303 lw t1,0(s1)
80000120: 00448493 addi s1,s1,4
80000124: ffc48493 addi s1,s1,-4
80000128: 0054a023 sw t0,0(s1)
8000012c: ffc48493 addi s1,s1,-4
80000130: 0064a023 sw t1,0(s1)
80000134: eedff06f jal zero,80000020 <NEXT>
80000138 <DROP>:
80000138: 0004a003 lw zero,0(s1)
8000013c: 00448493 addi s1,s1,4
80000140: ee1ff06f jal zero,80000020 <NEXT>
80000144 <TWODROP>:
80000144: 0004a003 lw zero,0(s1)
80000148: 00448493 addi s1,s1,4
8000014c: 0004a003 lw zero,0(s1)
80000150: 00448493 addi s1,s1,4
80000154: ecdff06f jal zero,80000020 <NEXT>
80000158 <TWODUP>:
80000158: 0004a283 lw t0,0(s1)
8000015c: 00448493 addi s1,s1,4
80000160: 0004a303 lw t1,0(s1)
80000164: 00448493 addi s1,s1,4
80000168: ffc48493 addi s1,s1,-4
8000016c: 0064a023 sw t1,0(s1)
80000170: ffc48493 addi s1,s1,-4
80000174: 0054a023 sw t0,0(s1)
80000178: ffc48493 addi s1,s1,-4
8000017c: 0064a023 sw t1,0(s1)
80000180: ffc48493 addi s1,s1,-4
80000184: 0054a023 sw t0,0(s1)
80000188: e99ff06f jal zero,80000020 <NEXT>
8000018c <OVER>:
8000018c: 0004a283 lw t0,0(s1)
80000190: 00448493 addi s1,s1,4
80000194: 0004a303 lw t1,0(s1)
80000198: 00448493 addi s1,s1,4
8000019c: ffc48493 addi s1,s1,-4
800001a0: 0064a023 sw t1,0(s1)
800001a4: ffc48493 addi s1,s1,-4
800001a8: 0054a023 sw t0,0(s1)
800001ac: ffc48493 addi s1,s1,-4
800001b0: 0064a023 sw t1,0(s1)
800001b4: e6dff06f jal zero,80000020 <NEXT>
800001b8 <TWOSWAP>:
800001b8: 0004a283 lw t0,0(s1)
800001bc: 00448493 addi s1,s1,4
800001c0: 0004a303 lw t1,0(s1)
800001c4: 00448493 addi s1,s1,4
800001c8: 0004a383 lw t2,0(s1)
800001cc: 00448493 addi s1,s1,4
800001d0: 0004ae03 lw t3,0(s1)
800001d4: 00448493 addi s1,s1,4
800001d8: ffc48493 addi s1,s1,-4
800001dc: 0064a023 sw t1,0(s1)
800001e0: ffc48493 addi s1,s1,-4
800001e4: 0054a023 sw t0,0(s1)
800001e8: ffc48493 addi s1,s1,-4
800001ec: 01c4a023 sw t3,0(s1)
800001f0: ffc48493 addi s1,s1,-4
800001f4: 0074a023 sw t2,0(s1)
800001f8: e29ff06f jal zero,80000020 <NEXT>
800001fc <ROT>:
800001fc: 0004a283 lw t0,0(s1)
80000200: 00448493 addi s1,s1,4
80000204: 0004a303 lw t1,0(s1)
80000208: 00448493 addi s1,s1,4
8000020c: 0004a383 lw t2,0(s1)
80000210: 00448493 addi s1,s1,4
80000214: ffc48493 addi s1,s1,-4
80000218: 0064a023 sw t1,0(s1)
8000021c: ffc48493 addi s1,s1,-4
80000220: 0054a023 sw t0,0(s1)
80000224: ffc48493 addi s1,s1,-4
80000228: 0074a023 sw t2,0(s1)
8000022c: df5ff06f jal zero,80000020 <NEXT>
80000230 <NROT>:
80000230: 0004a283 lw t0,0(s1)
80000234: 00448493 addi s1,s1,4
80000238: 0004a303 lw t1,0(s1)
8000023c: 00448493 addi s1,s1,4
80000240: 0004a383 lw t2,0(s1)
80000244: 00448493 addi s1,s1,4
80000248: ffc48493 addi s1,s1,-4
8000024c: 0054a023 sw t0,0(s1)
80000250: ffc48493 addi s1,s1,-4
80000254: 0074a023 sw t2,0(s1)
80000258: ffc48493 addi s1,s1,-4
8000025c: 0064a023 sw t1,0(s1)
80000260: dc1ff06f jal zero,80000020 <NEXT>
80000264 <EQUAL>:
80000264: 0004a283 lw t0,0(s1)
80000268: 00448493 addi s1,s1,4
8000026c: 0004a303 lw t1,0(s1)
80000270: 00448493 addi s1,s1,4
80000274: 00628a63 beq t0,t1,80000288 <EQUAL+0x24>
80000278: 00000293 addi t0,zero,0
8000027c: ffc48493 addi s1,s1,-4
80000280: 0054a023 sw t0,0(s1)
80000284: d9dff06f jal zero,80000020 <NEXT>
80000288: fff00293 addi t0,zero,-1
8000028c: ffc48493 addi s1,s1,-4
80000290: 0054a023 sw t0,0(s1)
80000294: d8dff06f jal zero,80000020 <NEXT>
80000298 <MEMCMP>:
80000298: 0004a603 lw a2,0(s1)
8000029c: 00448493 addi s1,s1,4
800002a0: 0004a683 lw a3,0(s1)
800002a4: 00448493 addi s1,s1,4
800002a8: 0004a503 lw a0,0(s1)
800002ac: 00448493 addi s1,s1,4
800002b0: 0004a583 lw a1,0(s1)
800002b4: 00448493 addi s1,s1,4
800002b8: 0c5000ef jal ra,80000b7c <memcmp>
800002bc: ffc48493 addi s1,s1,-4
800002c0: 00a4a023 sw a0,0(s1)
800002c4: d5dff06f jal zero,80000020 <NEXT>
800002c8 <BRANCH_ON_ZERO>:
800002c8: 0004a283 lw t0,0(s1)
800002cc: 00448493 addi s1,s1,4
800002d0: 00028663 beq t0,zero,800002dc <BRANCH_ON_ZERO+0x14>
800002d4: 00440413 addi s0,s0,4
800002d8: d49ff06f jal zero,80000020 <NEXT>
800002dc: 00042403 lw s0,0(s0)
800002e0: d41ff06f jal zero,80000020 <NEXT>
800002e4 <JUMP>:
800002e4: 00042403 lw s0,0(s0)
800002e8: d39ff06f jal zero,80000020 <NEXT>
800002ec <DEBUG_STACK>:
800002ec: ff410113 addi sp,sp,-12
800002f0: 00112023 sw ra,0(sp)
800002f4: 01812223 sw s8,4(sp)
800002f8: 01912423 sw s9,8(sp)
800002fc: 03c00513 addi a0,zero,60
80000300: 580000ef jal ra,80000880 <putc>
80000304: 03e00513 addi a0,zero,62
80000308: 578000ef jal ra,80000880 <putc>
8000030c: 02000513 addi a0,zero,32
80000310: 570000ef jal ra,80000880 <putc>
80000314: 00048c93 addi s9,s1,0
80000318: ffcc8c93 addi s9,s9,-4
8000031c: 00000c17 auipc s8,0x0
80000320: 561c0c13 addi s8,s8,1377 # 8000087d <FORTH_STACK_END>
80000324: ffcc0c13 addi s8,s8,-4
80000328: 019c0e63 beq s8,s9,80000344 <DEBUG_STACK+0x58>
8000032c: 000c2503 lw a0,0(s8)
80000330: 730000ef jal ra,80000a60 <print_unsigned_hex>
80000334: 02000513 addi a0,zero,32
80000338: 548000ef jal ra,80000880 <putc>
8000033c: ffcc0c13 addi s8,s8,-4
80000340: fe9ff06f jal zero,80000328 <DEBUG_STACK+0x3c>
80000344: 00a00513 addi a0,zero,10
80000348: 538000ef jal ra,80000880 <putc>
8000034c: 00012083 lw ra,0(sp)
80000350: 00412c03 lw s8,4(sp)
80000354: 00812c83 lw s9,8(sp)
80000358: 00c10113 addi sp,sp,12
8000035c: cc5ff06f jal zero,80000020 <NEXT>
80000360 <human_program>:
80000360: 20323438 .word 0x20323438
80000364: 2b203133 .word 0x2b203133
80000368: 31323720 .word 0x31323720
8000036c: 33202b20 .word 0x33202b20
80000370: 2e202b20 .word 0x2e202b20
80000374: 65796220 .word 0x65796220
...
80000379 <bytecode>:
80000379: 8000004c .word 0x8000004c
8000037d: 80000360 .word 0x80000360
80000381 <next_token>:
80000381: 80000098 .word 0x80000098
80000385: 8000018c .word 0x8000018c
80000389: 8000004c .word 0x8000004c
8000038d: 00000000 .word 0x00000000
80000391: 80000264 .word 0x80000264
80000395: 800002c8 .word 0x800002c8
80000399: 800003a1 .word 0x800003a1
8000039d: 80000088 .word 0x80000088
800003a1 <check_is_number>:
800003a1: 80000158 .word 0x80000158
800003a5: 800000d8 .word 0x800000d8
800003a9: 8000004c .word 0x8000004c
800003ad: ffffffff .word 0xffffffff
800003b1: 80000264 .word 0x80000264
800003b5: 800002c8 .word 0x800002c8
800003b9: 800003d5 .word 0x800003d5
800003bd: 80000158 .word 0x80000158
800003c1: 800000b8 .word 0x800000b8
800003c5: 80000230 .word 0x80000230
800003c9: 8000002c .word 0x8000002c
800003cd: 800002e4 .word 0x800002e4
800003d1: 80000381 .word 0x80000381
800003d5 <not_a_number>:
800003d5: 80000158 .word 0x80000158
800003d9: 8000004c .word 0x8000004c
800003dd: 00000001 .word 0x00000001
800003e1: 8000004c .word 0x8000004c
800003e5: 80000471 .word 0x80000471
800003e9: 80000298 .word 0x80000298
800003ed: 800002c8 .word 0x800002c8
800003f1: 80000409 .word 0x80000409
800003f5: 800001fc .word 0x800001fc
800003f9: 80000060 .word 0x80000060
800003fd: 8000002c .word 0x8000002c
80000401: 800002e4 .word 0x800002e4
80000405: 80000381 .word 0x80000381
80000409 <not_a_dot>:
80000409: 80000158 .word 0x80000158
8000040d: 8000004c .word 0x8000004c
80000411: 00000001 .word 0x00000001
80000415: 8000004c .word 0x8000004c
80000419: 80000475 .word 0x80000475
8000041d: 80000298 .word 0x80000298
80000421: 800002c8 .word 0x800002c8
80000425: 80000441 .word 0x80000441
80000429: 800001b8 .word 0x800001b8
8000042d: 8000002c .word 0x8000002c
80000431: 80000230 .word 0x80000230
80000435: 8000002c .word 0x8000002c
80000439: 800002e4 .word 0x800002e4
8000043d: 80000381 .word 0x80000381
80000441 <not_a_plus>:
80000441: 80000158 .word 0x80000158
80000445: 8000004c .word 0x8000004c
80000449: 00000003 .word 0x00000003
8000044d: 8000004c .word 0x8000004c
80000451: 80000479 .word 0x80000479
80000455: 80000298 .word 0x80000298
80000459: 800002c8 .word 0x800002c8
8000045d: 80000465 .word 0x80000465
80000461: 80000088 .word 0x80000088
80000465 <do_next_token>:
80000465: 8000002c .word 0x8000002c
80000469: 800002e4 .word 0x800002e4
8000046d: 80000381 .word 0x80000381
80000471 <string_dot>:
80000471: 0000002e .word 0x0000002e
80000475 <string_plus>:
80000475: 0000002b .word 0x0000002b
80000479 <string_bye>:
80000479: 00657962 .word 0x00657962
...
8000087d <FORTH_STACK_END>:
8000087d: 0000 .insn 2, 0x
...
80000880 <putc>:
80000880: 100002b7 lui t0,0x10000
80000884: 0052c303 lbu t1,5(t0) # 10000005 <_start-0x6ffffffb>
80000888: 02037313 andi t1,t1,32
8000088c: fe030ce3 beq t1,zero,80000884 <putc+0x4>
80000890: 00a28023 sb a0,0(t0)
80000894: 00008067 jalr zero,0(ra)
80000898 <getch>:
80000898: 100002b7 lui t0,0x10000
8000089c: 0052c303 lbu t1,5(t0) # 10000005 <_start-0x6ffffffb>
800008a0: 00137313 andi t1,t1,1
800008a4: fe030ce3 beq t1,zero,8000089c <getch+0x4>
800008a8: 0002c503 lbu a0,0(t0)
800008ac: 00008067 jalr zero,0(ra)
800008b0 <qemu_exit>:
800008b0: 001002b7 lui t0,0x100
800008b4: 00005337 lui t1,0x5
800008b8: 55530313 addi t1,t1,1365 # 5555 <_start-0x7fffaaab>
800008bc: 0062a023 sw t1,0(t0) # 100000 <_start-0x7ff00000>
800008c0: 0000006f jal zero,800008c0 <qemu_exit+0x10>
800008c4 <token>:
800008c4: 00050e13 addi t3,a0,0
800008c8: 00000593 addi a1,zero,0
800008cc: 02100313 addi t1,zero,33
800008d0: 000e4283 lbu t0,0(t3)
800008d4: 02028463 beq t0,zero,800008fc <token+0x38>
800008d8: 0062d663 bge t0,t1,800008e4 <token+0x20>
800008dc: 001e0e13 addi t3,t3,1
800008e0: ff1ff06f jal zero,800008d0 <token+0xc>
800008e4: 000e0513 addi a0,t3,0
800008e8: 000e4283 lbu t0,0(t3)
800008ec: 0062c863 blt t0,t1,800008fc <token+0x38>
800008f0: 00158593 addi a1,a1,1
800008f4: 001e0e13 addi t3,t3,1
800008f8: ff1ff06f jal zero,800008e8 <token+0x24>
800008fc: 00008067 jalr zero,0(ra)
80000900 <is_number>:
80000900: 04058863 beq a1,zero,80000950 <is_number+0x50>
80000904: 00050293 addi t0,a0,0
80000908: 00058313 addi t1,a1,0
8000090c: 0002c383 lbu t2,0(t0)
80000910: 02d00e13 addi t3,zero,45
80000914: 03c38663 beq t2,t3,80000940 <is_number+0x40>
80000918: 0002c383 lbu t2,0(t0)
8000091c: 03000e13 addi t3,zero,48
80000920: 03900e93 addi t4,zero,57
80000924: 03c3c663 blt t2,t3,80000950 <is_number+0x50>
80000928: 027ec463 blt t4,t2,80000950 <is_number+0x50>
8000092c: 00128293 addi t0,t0,1
80000930: fff30313 addi t1,t1,-1
80000934: fe0312e3 bne t1,zero,80000918 <is_number+0x18>
80000938: fff00513 addi a0,zero,-1
8000093c: 00008067 jalr zero,0(ra)
80000940: 00128293 addi t0,t0,1
80000944: fff30313 addi t1,t1,-1
80000948: 00030463 beq t1,zero,80000950 <is_number+0x50>
8000094c: fcdff06f jal zero,80000918 <is_number+0x18>
80000950: 00000513 addi a0,zero,0
80000954: 00008067 jalr zero,0(ra)
80000958 <atoi>:
80000958: fec10113 addi sp,sp,-20
8000095c: 00112023 sw ra,0(sp)
80000960: 00812223 sw s0,4(sp)
80000964: 00912423 sw s1,8(sp)
80000968: 01212623 sw s2,12(sp)
8000096c: 01312823 sw s3,16(sp)
80000970: 00050413 addi s0,a0,0
80000974: 00058493 addi s1,a1,0
80000978: 00000913 addi s2,zero,0
8000097c: 00a00293 addi t0,zero,10
80000980: 00000993 addi s3,zero,0
80000984: 02048e63 beq s1,zero,800009c0 <atoi+0x68>
80000988: 00044303 lbu t1,0(s0)
8000098c: 02d00393 addi t2,zero,45
80000990: 00731863 bne t1,t2,800009a0 <atoi+0x48>
80000994: 00100993 addi s3,zero,1
80000998: 00140413 addi s0,s0,1
8000099c: fff48493 addi s1,s1,-1
800009a0: 02048063 beq s1,zero,800009c0 <atoi+0x68>
800009a4: 02590933 mul s2,s2,t0
800009a8: 00044303 lbu t1,0(s0)
800009ac: fd030313 addi t1,t1,-48
800009b0: 00690933 add s2,s2,t1
800009b4: 00140413 addi s0,s0,1
800009b8: fff48493 addi s1,s1,-1
800009bc: fe5ff06f jal zero,800009a0 <atoi+0x48>
800009c0: 00098463 beq s3,zero,800009c8 <atoi+0x70>
800009c4: 41200933 sub s2,zero,s2
800009c8: 00090513 addi a0,s2,0
800009cc: 00012083 lw ra,0(sp)
800009d0: 00412403 lw s0,4(sp)
800009d4: 00812483 lw s1,8(sp)
800009d8: 00c12903 lw s2,12(sp)
800009dc: 01012983 lw s3,16(sp)
800009e0: 01410113 addi sp,sp,20
800009e4: 00008067 jalr zero,0(ra)
800009e8 <puts>:
800009e8: ff810113 addi sp,sp,-8
800009ec: 00112023 sw ra,0(sp)
800009f0: 00812223 sw s0,4(sp)
800009f4: 00050413 addi s0,a0,0
800009f8: 00044503 lbu a0,0(s0)
800009fc: 00050863 beq a0,zero,80000a0c <puts+0x24>
80000a00: e81ff0ef jal ra,80000880 <putc>
80000a04: 00140413 addi s0,s0,1
80000a08: ff1ff06f jal zero,800009f8 <puts+0x10>
80000a0c: 00012083 lw ra,0(sp)
80000a10: 00412403 lw s0,4(sp)
80000a14: 00810113 addi sp,sp,8
80000a18: 00008067 jalr zero,0(ra)
80000a1c <puts_len>:
80000a1c: ff410113 addi sp,sp,-12
80000a20: 00112023 sw ra,0(sp)
80000a24: 00812223 sw s0,4(sp)
80000a28: 00912423 sw s1,8(sp)
80000a2c: 00050413 addi s0,a0,0
80000a30: 00058493 addi s1,a1,0
80000a34: 00048c63 beq s1,zero,80000a4c <puts_len+0x30>
80000a38: 00044503 lbu a0,0(s0)
80000a3c: e45ff0ef jal ra,80000880 <putc>
80000a40: 00140413 addi s0,s0,1
80000a44: fff48493 addi s1,s1,-1
80000a48: fedff06f jal zero,80000a34 <puts_len+0x18>
80000a4c: 00012083 lw ra,0(sp)
80000a50: 00412403 lw s0,4(sp)
80000a54: 00812483 lw s1,8(sp)
80000a58: 00c10113 addi sp,sp,12
80000a5c: 00008067 jalr zero,0(ra)
80000a60 <print_unsigned_hex>:
80000a60: fec10113 addi sp,sp,-20
80000a64: 00112023 sw ra,0(sp)
80000a68: 00812223 sw s0,4(sp)
80000a6c: 00912423 sw s1,8(sp)
80000a70: 01212623 sw s2,12(sp)
80000a74: 01312823 sw s3,16(sp)
80000a78: 00050413 addi s0,a0,0
80000a7c: 01c00493 addi s1,zero,28
80000a80: 00000913 addi s2,zero,0
80000a84: 03000513 addi a0,zero,48
80000a88: df9ff0ef jal ra,80000880 <putc>
80000a8c: 07800513 addi a0,zero,120
80000a90: df1ff0ef jal ra,80000880 <putc>
80000a94: 00040293 addi t0,s0,0
80000a98: 0092d2b3 srl t0,t0,s1
80000a9c: 00f2f293 andi t0,t0,15
80000aa0: 00029863 bne t0,zero,80000ab0 <print_unsigned_hex+0x50>
80000aa4: 00091663 bne s2,zero,80000ab0 <print_unsigned_hex+0x50>
80000aa8: 00048463 beq s1,zero,80000ab0 <print_unsigned_hex+0x50>
80000aac: 0240006f jal zero,80000ad0 <print_unsigned_hex+0x70>
80000ab0: 00100913 addi s2,zero,1
80000ab4: 00a00313 addi t1,zero,10
80000ab8: 0062c663 blt t0,t1,80000ac4 <print_unsigned_hex+0x64>
80000abc: 05728293 addi t0,t0,87
80000ac0: 0080006f jal zero,80000ac8 <print_unsigned_hex+0x68>
80000ac4: 03028293 addi t0,t0,48
80000ac8: 00028513 addi a0,t0,0
80000acc: db5ff0ef jal ra,80000880 <putc>
80000ad0: ffc48493 addi s1,s1,-4
80000ad4: fc04d0e3 bge s1,zero,80000a94 <print_unsigned_hex+0x34>
80000ad8: 00012083 lw ra,0(sp)
80000adc: 00412403 lw s0,4(sp)
80000ae0: 00812483 lw s1,8(sp)
80000ae4: 00c12903 lw s2,12(sp)
80000ae8: 01012983 lw s3,16(sp)
80000aec: 01410113 addi sp,sp,20
80000af0: 00008067 jalr zero,0(ra)
80000af4 <print_int>:
80000af4: ff010113 addi sp,sp,-16
80000af8: 00112023 sw ra,0(sp)
80000afc: 00812223 sw s0,4(sp)
80000b00: 00912423 sw s1,8(sp)
80000b04: 01212623 sw s2,12(sp)
80000b08: 00050413 addi s0,a0,0
80000b0c: 00010493 addi s1,sp,0
80000b10: 00a00913 addi s2,zero,10
80000b14: 00041863 bne s0,zero,80000b24 <print_int+0x30>
80000b18: 03000513 addi a0,zero,48
80000b1c: d65ff0ef jal ra,80000880 <putc>
80000b20: 0440006f jal zero,80000b64 <print_int+0x70>
80000b24: 00045863 bge s0,zero,80000b34 <print_int+0x40>
80000b28: 02d00513 addi a0,zero,45
80000b2c: d55ff0ef jal ra,80000880 <putc>
80000b30: 40800433 sub s0,zero,s0
80000b34: 00040e63 beq s0,zero,80000b50 <print_int+0x5c>
80000b38: 032462b3 rem t0,s0,s2
80000b3c: 03028293 addi t0,t0,48
80000b40: ffc48493 addi s1,s1,-4
80000b44: 0054a023 sw t0,0(s1)
80000b48: 03244433 div s0,s0,s2
80000b4c: fe9ff06f jal zero,80000b34 <print_int+0x40>
80000b50: 00248a63 beq s1,sp,80000b64 <print_int+0x70>
80000b54: 0004a503 lw a0,0(s1)
80000b58: d29ff0ef jal ra,80000880 <putc>
80000b5c: 00448493 addi s1,s1,4
80000b60: ff1ff06f jal zero,80000b50 <print_int+0x5c>
80000b64: 00012083 lw ra,0(sp)
80000b68: 00412403 lw s0,4(sp)
80000b6c: 00812483 lw s1,8(sp)
80000b70: 00c12903 lw s2,12(sp)
80000b74: 01010113 addi sp,sp,16
80000b78: 00008067 jalr zero,0(ra)
80000b7c <memcmp>:
80000b7c: 02d59663 bne a1,a3,80000ba8 <memcmp+0x2c>
80000b80: 02058063 beq a1,zero,80000ba0 <memcmp+0x24>
80000b84: 00054283 lbu t0,0(a0)
80000b88: 00064303 lbu t1,0(a2)
80000b8c: 00629e63 bne t0,t1,80000ba8 <memcmp+0x2c>
80000b90: 00150513 addi a0,a0,1
80000b94: 00160613 addi a2,a2,1
80000b98: fff58593 addi a1,a1,-1
80000b9c: fe0594e3 bne a1,zero,80000b84 <memcmp+0x8>
80000ba0: fff00513 addi a0,zero,-1
80000ba4: 00008067 jalr zero,0(ra)
80000ba8: 00000513 addi a0,zero,0
80000bac: 00008067 jalr zero,0(ra)
Amazing. 800003a9 ffffffff
is .word LITERAL .word -1
, you can see it in
address 800003ab where we have the check_number code.
800003a1: 80000158 .word 0x80000158
800003a5: 800000d8 .word 0x800000d8
800003a9: 8000004c .word 0x8000004c
800003ad: ffffffff .word 0xffffffff
800003b1: 80000264 .word 0x80000264
800003b5: 800002c8 .word 0x800002c8
800003b9: 800003d5 .word 0x800003d5
800003bd: 80000158 .word 0x80000158
800003c1: 800000b8 .word 0x800000b8
800003c5: 80000230 .word 0x80000230
800003c9: 8000002c .word 0x8000002c
800003cd: 800002e4 .word 0x800002e4
800003d1: 80000381 .word 0x80000381
At the end of IS_NUMBER
, it calls NEXT, by jumping to 80000020, s0 will be
800003a9, and then it NEXT will move s0 to 800003ad and then jump into LITERAL at
8000004c. LITERAL will load the value from memory[800003ad] and push it into the
stack, then it will move s0 to 800003b1, and call NEXT again by jumping to 80000020.
8000004c: 00042283 lw t0,0(s0)
80000050: 00440413 addi s0,s0,4
80000054: ffc48493 addi s1,s1,-4
80000058: 0054a023 sw t0,0(s1)
8000005c: fc5ff06f jal zero,80000020 <NEXT>
See again, how the thread is woven. From NEXT to NEXT to NEXT..
Examine address 8000005c containing the machine code fc5ff06f. It is 'jal x0, -60', as we discussed in RISCV jumps are relative to the jal
instruction itself. And 0x8000005c - 60
is.. you guessed it 0x80000020 :)
In binary 0xfc5ff06f is 11111100010111111111000001101111. The most right bits
are the jal instruction iself, then 5 bits are for rd
the destination register
where pc+4
will be stored, in this case it is the zero register, and then,
the instruction's immediate value, the pc relative offset, in somewhat strange encoding, first we take the
20th bit, then the bits 10 to 1 then bit 11 then 19 to 12 to construct the actual value
This is how it is actually decoded:
signed Bits<21> imm = sext({$encoding[31], $encoding[19:12], $encoding[20], $encoding[30:21], 1'd0});
Sext means sign extension,if its 1 then it is a 2s complement number. and the sign must be preserved. For example if we had 4 bit number -3 1101, and want to extend it to 8 bits we must make it 11111101 not 00001101. That is what Sign Extension mean, because we want to convert 20 bit number into 32 bit number.
1'd0 means one bit of value 0 added to the end. This multiplies the result by 2, and guarantees that the address we are jumping to is multiple of 2. So if the actual immediate value is 10 we will jump to pc + 20.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 1 1 1
So our immediate value is:
sext(1, 11111111, 1, 1111100010, 0)
, or in 32 sign extended bits
11111111111111111111111111000100
, to convert it from twos complement, we have
to invert the bits and add 1 to it, 00000000000000000000000000111011 + 1 is
00000000000000000000000000111100 which in decimal is 60, so
11111111111111111111111111000100
two's complement is -60 in decimal
I wanted to show you the manual decoding of fc5ff06f because we jump around so
much, the whole Forth inner interpreter is about jumping, so understanding the
jal
instruction seemed appropriate, however you can do fine without
understanding the bits of it. You can "forget" how the bits are, and just know
that it will jump to where you want, you don't even need to know that its a
relative jump rather than absolute, you can see even objdump shows absolute
values for the jumps, but then they are compiled to relative offsets in the
machine code. At some point however you might want to jump further than 20 bits away, and then
it will error out, relocation truncated to fit: R_RISCV_JAL against 'xyz'
, you
will google it, someone will say, 'just use call instead of jal' and you will go
on with your life. A tiny specticle of confusion will be left in your soul. It makes no sense why cant you just jump to a label? You know it is somewhere in memory, you know the instruction for jumping, why cant it jump? This kind of questions happen more often than you think, and you how deep difference is between 'when I see jal error I must use call' and 'oh my jal offset is more than 20 bits away, I should use call'. You can code in assembly all their life, and be satisfied with 'I must use call', you wont even code more or less bugs, you wont be more productive. But you will be incomplete. Knowledge and understanding grows, like a huge interconnected graph, inside of this graph there are nodes of doubt and confusion that spread their tenticles. For most you are not even aware until you reach them. It is a rare and valuable opportunity to turn a doubt node into a light node, do not miss it. Cherish the moments when things make no sense, because you are about to grow. The more confused you are the better.
Lets go back to our bytecode, There are few important concepts that you should
pay attention to, one is BRANCH_ON_ZERO
, JUMP
and EQUAL
. We will discuss
them in detail.
...
NEXT:
lw t0, 0(:IP)
addi :IP, :IP, 4
jr t0
...
# ( f -- )
BRANCH_ON_ZERO:
POP t0
beqz t0, .L_do_branch
addi :IP, :IP, 4
j NEXT
.L_do_branch:
lw :IP, 0(:IP)
j NEXT
# ( -- )
JUMP:
lw :IP, 0(:IP)
j NEXT
# ( a b -- f)
EQUAL:
POP t0
POP t1
beq t0, t1, .L_equal
li t0, 0
PUSH t0
j NEXT
.L_equal:
li t0, -1
PUSH t0
j NEXT
NEXT reads the current word from wherever :IP points, then adds 4 to it, and then jumps to the original value. So every time we call NEXT wherever we jump to, IP is pointing to the next instruction.
JUMP is the easiest, coming to it from NEXT you it will just read the current value of memory[:IP] and set :IP to there, so NEXT will jump to there.
Memory layout at 0x80000400:
Address | Value | Meaning
------------------------------------
0x80000400 | LITERAL | Push number onto stack
0x80000404 | 42 | The number to push
0x80000408 | JUMP | Jump instruction
0x8000040C | 0x80000418 | Jump target address
0x80000410 | LITERAL | (skipped)
0x80000414 | 99 | (skipped)
0x80000418 | EMIT | Print top of stack
0x8000041C | BYE | Exit
Step by step execution:
1. Initial state:
:IP = 0x80000400
:SP = FORTH_STACK_END
2. Execute NEXT:
- Load t0 = memory[0x80000400] = LITERAL
- :IP += 4 (now 0x80000404)
- Jump to LITERAL
3. Execute LITERAL:
- Load value from memory[:IP] = 42
- Push 42 onto stack
- :IP += 4 (now 0x80000408)
- Jump to NEXT
4. Execute NEXT:
- Load t0 = memory[0x80000408] = JUMP
- :IP += 4 (now 0x8000040C)
- Jump to JUMP
5. Execute JUMP:
- Load new_ip = memory[:IP] = 0x80000418
- Set :IP = 0x80000418
- Jump to NEXT
(Notice we skip over addresses 0x80000410-0x80000414)
6. Execute NEXT:
- Load t0 = memory[0x80000418] = EMIT
- :IP += 4 (now 0x8000041C)
- Jump to EMIT
7. Execute EMIT:
- Pop 42 from stack
- Print it
- Jump to NEXT
8. Execute NEXT:
- Load t0 = memory[0x8000041C] = BYE
- :IP += 4 (now 0x80000420)
- Jump to BYE
An example of infinite loop, jump that jumps to itself:
Address | Value
--------------------------
0x80000408 | JUMP
0x8000040C | 0x80000408
I hope you understand the unconditional JUMP, it is pretty much the same as the jump we did in our SUBLEQ computer, we just set PC to some value and thats where the next instruction is loaded from.
The conditional jump BRANCH_ON_ZERO
is very similar but we decide if we should
jump to the argument or not depending if the top of the stack is 0.
EQUAL
is quite straight forward, it pops 2 elements from the stack, if they
are equal it pushes -1 otherwise it pushes 0. So for example if the stack is 1 2
after equal it will be 0
, if it was 3 3
it will be -1
, if you examine
the code for BRANCH_ON_ZERO
, any non zero value is true for us, 745762 is just
as true as 1 and as -1 and as -487327, anything but 0. In Forth it is convention
to use -1, I am not sure why, could be because as 32 bit two's complement value
its 11111111111111111111111111111111.
Let's look at a simple example that checks if two numbers are equal
and branches based on that:
Memory layout at 0x80000400:
Address | Value | Meaning
------------------------------------
0x80000400 | LITERAL | Push first number
0x80000404 | 42 | Value 42
0x80000408 | LITERAL | Push second number
0x8000040C | 42 | Value 42
0x80000410 | EQUAL | Compare numbers
0x80000414 | BRANCH_ON_ZERO | Branch if not equal
0x80000418 | 0x80000428 | Branch target (skip to BYE)
0x8000041C | LITERAL | Push success number
0x80000420 | 7 | Success value
0x80000424 | EMIT | Print it
0x80000428 | BYE | Exit
Step by step execution (when numbers are equal):
1. Start with empty stack
:IP = 0x80000400
2. After first LITERAL: stack = [42]
:IP = 0x80000408
3. After second LITERAL: stack = [42, 42]
:IP = 0x80000410
4. After EQUAL: stack = [-1] (because 42 == 42)
:IP = 0x80000414
5. BRANCH_ON_ZERO sees -1:
- Since top of stack is not zero, don't branch
- :IP += 4 (moves to 0x80000418)
6. LITERAL pushes 7: stack = [7]
:IP = 0x80000420
7. EMIT prints 7
:IP = 0x80000424
8. BYE exits
If we changed the second LITERAL to push 43 instead:
- EQUAL would push 0 (because 42 != 43)
- BRANCH_ON_ZERO would see 0 and jump to 0x80000428
- We would skip the LITERAL 7 and EMIT
- Program would exit immediately
The key insight is that BRANCH_ON_ZERO makes a decision based on the stack's top value:
- If top of stack is 0: jump to the target address
- If top of stack is anything else: continue to next instruction
Now you can read our mini bytecode interpreter again, it can compile the program "2 2 + 3 + 5 + . bye"
.word LITERAL
.word human_program
next_token:
.word PARSE_TOKEN
.word OVER
.word LITERAL
.word 0
.word EQUAL
.word BRANCH_ON_ZERO
.word check_is_number
.word BYE
check_is_number:
.word TWODUP
.word IS_NUMBER
.word LITERAL
.word -1
.word EQUAL
.word BRANCH_ON_ZERO
.word not_a_number
.word TWODUP
.word ATOI
.word NROT
.word PLUS
.word JUMP
.word next_token
not_a_number:
.word TWODUP
.word LITERAL
.word 1
.word LITERAL
.word string_dot
.word MEMCMP
.word BRANCH_ON_ZERO
.word not_a_dot
.word ROT
.word EMIT
.word PLUS
.word JUMP
.word next_token
not_a_dot:
.word TWODUP
.word LITERAL
.word 1
.word LITERAL
.word string_plus
.word MEMCMP
.word BRANCH_ON_ZERO
.word not_a_plus
.word TWOSWAP
.word PLUS
.word NROT
.word PLUS
.word JUMP
.word next_token
not_a_plus:
.word TWODUP
.word LITERAL
.word 3
.word LITERAL
.word string_bye
.word MEMCMP
.word BRANCH_ON_ZERO
.word do_next_token
.word BYE
do_next_token:
.word PLUS
.word JUMP
.word next_token
If I just expand the interpreter to support WRITE then I could write a program that writes a program "800003a9 0x80000420 WRITE 4 0x8000042d WRITE ... ", you can see its not difficult to do this expansion, as WRITE is no different than + or bye. However my forth program will be completely unportable between computers, because on some other computer it wont be compiled for address 80000000, and LITERAL wont be at 800003a9. If we could only know where LITERAL is then our Forth program wont have actual hardcoded memory values. Not only that but also expending the program with hardcoded write instructions is similar to writing a program in machine code, it requires great dedication and possibly desparation, pen paper and confiedence beyond my abilities.
Forth solves this problem by having a dictionary of words, each word has a link to the previous word in the dictionary, and you can search for words.
dictionary:
word_bye:
.word 0 # link
.word 3 # token length
.ascii "bye\0" # first 4 characters of token
.word BYE # address of execution token
word_plus:
.word word_bye
.word 1
.ascii "+\0\0\0"
.word PLUS
word_write:
.word word_plus
.word 5
.ascii "writ"
.word WRITE
word_dup:
.word word_write
.word 3
.ascii "dup\0"
.word DUP
Each entry has at least 4 values, link, token len, first 4 characters of the
token, execution address. Usually the token is actually variable length, but for
simplicity I decided to use fixed size of 4 bytes, wo WRITE
and WRITZ
will
actually find the same forth word, both are 5 letters and the first 4 are WRIT
, but
that is ok for our version.
The first value is very important, the link, it is the address of the previous dictionary entry. If our example dictionary starts at address 8000087d this is how our memory would look like:
8000002c <PLUS>:
8000002c: 0004a283 lw t0,0(s1) <-.
80000030: 00448493 addi s1,s1,4 |
80000034: 0004a303 lw t1,0(s1) |
80000038: 00448493 addi s1,s1,4 |
8000003c: 006282b3 add t0,t0,t1 |
80000040: ffc48493 addi s1,s1,-4 |
80000044: 0054a023 sw t0,0(s1) |
80000048: fd9ff06f jal zero,80000020 |
... |
80000070 <WRITE>: |
80000070: 0004a283 lw t0,0(s1) | <-.
80000074: 00448493 addi s1,s1,4 | |
80000078: 0004a303 lw t1,0(s1) | |
8000007c: 00448493 addi s1,s1,4 | |
80000080: 0062a023 sw t1,0(t0) | |
80000084: f9dff06f jal zero,80000020 | |
... | |
80000088 <BYE>: | |
80000088: 0290006f jal zero,800008b0 <-. | |
... | | |
800000f8 <DUP>: | | |
800000f8: 0004a283 lw t0,0(s1) | | | <-.
800000fc: 00448493 addi s1,s1,4 | | | |
80000100: ffc48493 addi s1,s1,-4 | | | |
80000104: 0054a023 sw t0,0(s1) | | | |
80000108: ffc48493 addi s1,s1,-4 | | | |
8000010c: 0054a023 sw t0,0(s1) | | | |
80000110: f11ff06f jal zero,80000020 | | | |
... | | | |
8000087d: 00000000 0 <--. | | | |
80000881: 00000003 3 | | | | |
80000885: 65796200 bye | | | | |
80000889: 80000088 BYE -+------------------------------' | | |
8000088d: 8000087d -----' <-. | | |
80000891: 00000001 1 | | | |
80000895: 2b000000 + | | | |
80000899: 8000002c PLUS------+----------------------------' | |
8000089d: 8000088d ----------' <-. | |
800008a1: 00000005 5 | | |
800008a5: 74697277 writ | | |
800008a9: 80000070 WRITE----------+---------------------------' |
800008ad: 8000089d ---------------' |
800008b1: 00000003 3 |
800008b5: 70756400 dup |
800008b9: 800000f8 DUP--------------------------------------------'
This data structure where one entry points to another is called a linked list, it is incrediblu useful and powerful, just as the stack data structure is powerful. I wont spend much time on it, but its power is in allowing variable size entries to reference each other even if they are in different places in memory. You only need to know where the last element is and you can keep adding to the chain of entries, If you know where the first element is you can add from the head (thats how the first element is called) or from the tail (thats how we call the last element). You can also remove any element without having to copy anything, as you traverse it you just make the parent's link point to the link of the element you want to remove, and it just vanishes. Anyway there are also doubly linked lists and skip lists and so on, all have different powers. For now it is safe to think of it as a chain of things. In our case it is a chain of forth words.
A pseudo code for a FIND function looks something like this:
find(tok)
entry = last entry
while true:
if entry == 0
break
compare entry's length with tok length
if not equal
entry = entry's link
continue
compare first 4 characters of entry and tok
if not equal
entry = entry's link
continue
both the length and first 4 characters are equal
this is our token
return entry's execution token address
return not found
This pattern is very common while scanning a linked list, you start from the tail, and go backwards element by element. (or the head, depending if the link is backwards or forwards).
We will do one more change, it is very annoying to keep the token on the stack,
because we have to keep rotating things to get it back on top so we can
calculate the next address, but we dont know how much stack would the word
use. So we will just move the token length and address in a global
variables. And we will modify the PARSE_TOKEN
function to read them and update them,
plus we will add NEXT_TOKEN
that moves the the address to address + length so we
can read the next token next time PARSE_TOKEN
is called.
This is the modified code, plus the FIND function, and the global variables, and the refactored interpreter.
# ...
# ( -- )
NEXT_TOKEN:
la a0, cur_token_address
lw t0, 0(a0)
la t1, cur_token_len
lw t1, 0(t1)
add t0, t0, t1 # len + addr
sw t0, 0(a0)
j NEXT
# ( -- len addr )
PARSE_TOKEN:
# load the variables
la a0, cur_token_address
lw a0, 0(a0)
la a1, cur_token_len
lw a1, 0(a1)
jal token
PUSH a1 # length
PUSH a0 # token address
# store the new values
la t0, cur_token_address
sw a0, 0(t0)
la t1, cur_token_len
sw a1, 0(t1)
j NEXT
# Input:
# a0: token address
# a1: token length
# Output:
# a0: execution token address (or 0 if not found)
do_find:
li t1, 0
mv t3, a1
# The shananigans here are so we can build little endian version of the token
# in 4 bytes dont be intimidated by them, I just made the tokens in the
# dictionary as "bye\0" instead of "\0eyb" to be easier to read
beqz t3, .L_not_found # zero length token
lbu t1, 0(a0)
addi t3, t3, -1
beqz t3, .L_find_start
lbu t2, 1(a0)
sll t2, t2, 8
or t1, t1, t2
addi t3, t3, -1
beqz t3, .L_find_start
lbu t2, 2(a0)
sll t2, t2, 16
or t1, t1, t2
addi t3, t3, -1
beqz t3, .L_find_start
lbu t2, 3(a0)
sll t2, t2, 24
or t1, t1, t2
# t1: has the input token as 4 byte number
# a1: is the length of the input token
# t0: pointer to the entry, we will start at the end
.L_find_start:
la t0, dictionary_end # t0 = last dictionary entry
.L_find_loop:
beqz t0, .L_not_found # if the entry is 0, means we didnt find a match
lw t2, 4(t0) # load the length of the entry
bne t2, a1, .L_next_entry # compare lengths
lw t2, 8(t0) # load entry name
bne t2, t1, .L_next_entry # compare names
lw a0, 12(t0) # load the actual execution token
ret # return the execution token
.L_next_entry:
lw t0, 0(t0) # follow link to next entry
j .L_find_loop
.L_not_found:
li a0, 0 # return 0 for not found
ret
# ( len addr -- xt )
FIND_WORD:
POP a0 # token address
POP a1 # token length
call FIND
PUSH a0 # push execution token or 0
j NEXT
# ...
human_program:
.asciz "842 31 + 721 + 3 + . bye"
cur_token_address:
.word human_program
cur_token_len:
.word 0
bytecode:
next_token:
.word NEXT_TOKEN
.word PARSE_TOKEN
.word OVER
.word LITERAL
.word 0
.word EQUAL
.word BRANCH_ON_ZERO
.word check_is_number
check_is_number:
.word TWODUP
.word IS_NUMBER
.word LITERAL
.word -1
.word EQUAL
.word BRANCH_ON_ZERO
.word not_a_number
.word ATOI
.word JUMP
.word next_token
not_a_number:
.word FIND_WORD
.word DUP # we want a copy otherwise EQUAL will pop the word we need
.word LITERAL
.word 0
.word EQUAL
.word BRANCH_ON_ZERO
.word forth_word_found # find word is not zero, meaning we found something
.word BYE # word not found, just exit
forth_word_found:
.word LITERAL
.word execute_placeholder # we want to write the execution token there
.word WRITE # ( value addr -- )
# value is the execution token address (XT)
# returned from FIND_WORD and is on the stack
# address is execute_placeholder
execute_placeholder:
.word 0 # <-- magic! WRITE will write at this location, and then NEXT will jump to it
.word JUMP
.word next_token
dictionary:
word_bye:
.word 0 # link
.word 3 # token length
.ascii "bye\0" # first 4 characters of token
.word BYE # address of execution token
word_plus:
.word word_bye
.word 1
.ascii "+\0\0\0"
.word PLUS
word_write:
.word word_plus
.word 5
.ascii "writ"
.word WRITE
word_dup:
.word word_write
.word 3
.ascii "dup\0"
.word DUP
word_emit:
dictionary_end:
.word word_dup
.word 1
.ascii ".\0\0\0"
.word EMIT
FIND_WORD and FIND are cool, I think you will understand them on your own, but I am not sure you will appreciate the beauty of WRITE.
.word LITERAL
.word execute_placeholder
.word WRITE
execute_placeholder:
.word 0
.word JUMP
.word next_token
When assembled looks like this
...
800004ad: 8000004c LITERAL
800004b1: 800004b9 execute_placeholder
800004b5: 80000070 WRITE
800004b9: 00000000 [ execute placeholder value ]
800004bd: 80000330 JUMP
800004c1: 80000445 next_token
...
At this point the stack is the address of the execution token, for example PLUS
8000002c, so the stack is 842 31 8000002c
, then we have LITERAL 800004b9, the
stack becomes 842 31 8000002c 800004b9
and then LITERAL calls NEXT which
executes WRITE. WRITE will write the value 8000002c at location 800004b9, so
memory[800004b9] = 8000002c
or memory[execute_placeholder] = PLUS
, then it will
call NEXT, and lo and behold, instead of 00000000, which was going to be
executed have not been for our WRITE, we have 8000002c and we will execute PLUS.
The next token again will modify this location, and again it will be executed.
The program changes itself in order to execute itself. How cool is that.
There we have it, now we can trivially add words to our dictionary and expand our language. The only thing is missing is the power to easilly expand the dictionary form the program itself. We can kind of do it now with WRITE, but, it will be beyond painfull, and will require carefull planning and patience that I dont have.
Few things are needed to expand the dictionary, first now we know where it ends by using the dictionary_end label, which has to be dynamic, and we need some helper functions to make it easy to create new words. We also need 4 more bytes per dictionary entry for flags, as you will see some words will be different than others.
Imagine this program : square dup * ; 3 square .
it will create the word
square
, when we jump into it it will execute dup and then multiplication. :
is
also a word of course and ;
as well. but when we get to square
, we should not
try to find it, but we should create it, and then dup *
should not be executed,
but we have to store their bytecode into the square
dictionary entry to be
executed when square is invoked. We will just have a MODE variable that defines
if we are in compilation mode (where we are creating a dictionary entry) or
evaluation mode where we are exeucting the words. Some words however will have
to be immidiately executed even in compilation mode, you will see later why, so
we need flags per word to know if its immediate word or not.
After we compile the square word into the dictionary it could look something like this:
...
word_emit:
.word word_dup
.word 1
.ascii ".\0\0\0"
.word 0 # flag
.word EMIT
word_square:
.word word_emit
.word 6
.ascii "squa"
.word 0 # flag
.word DUP
.word MULTIPLY
The issue with this structure is that once we get to execute the execution token
of square
, then we must have our :IP jump to there, as if we create a new
thread, we we have to break out of the thread we were on, and go there to
execute DUP and MULTIPLY and then somehow we have to go back.
word_square:
.word word_emit
.word 6
.ascii "squa"
.word 0
IP->.word DUP
.word MULTIPLY
This is very similar issue to how we jal
, we need to store where are we coming
back. We will use another stack for that purpose, we will store :IP there before
jumping, and then before returning we will jump back.
word_square:
.word word_emit
.word 6
.ascii "squa"
.word 0
.word DOCOL <- push :IP in the return stack, and set it to our thread and call NEXT
.word DUP
.word MULTIPLY
.word EXIT <- pop :IP from the return stack and call NEXT
Remember how NEXT works, it first loads the value from memory[:IP] then it increments it, so when it jumps somewhere, we have the intended NEXT :IP before it jumped to us.
NEXT:
lw t0, 0(:IP)
addi :IP, :IP, 4
jr t0
Which means if I have a thread like this, when PLUS calls NEXT, the IP we will get is the address of SQUARE.
.word LITERAL
.word 8
.word LITERAL
.word 7
.word PLUS
.word SQUARE
.word EMIT
.word BYE
in DOCOL we will capture this value, add 4 to it, and push it in the return stack, because when we return from SQUARE we want to get to EMIT.
We will have two threads to weave. 8 7 + square . bye
.word LITERAL
.word 8
.word LITERAL
.word 7
.word PLUS
.word SQUARE >------.
\
`
.word DOCOL
.word DUP
.word MULTIPLY
.word EXIT
.
/
.word EMIT <---------'
.word BYE
And of course you can imagine SQUARE being more complicated, lets make
gigantize
which does : double dup + ; : gigantize double dup * ; 7 gigantize . bye
, so gigantize will double the stack value and then square it.
Main Thread GIGANTIZE Thread DOUBLE Thread
------------- ----------------- --------------
LITERAL 7
GIGANTIZE ---------------> DOCOL
DOUBLE --------------------> DOCOL
DUP
PLUS
EXIT
<---------------------'
DUP
MULTIPLY
EXIT
EMIT <----------------'
BYE
Stack evolution:
7 # After LITERAL 7
7 # Enter GIGANTIZE
7 7 # Enter DOUBLE, DUP
14 # PLUS
14 # Return to GIGANTIZE
14 14 # DUP
196 # MULTIPLY
196 # Return to main, EMIT
The term thread is used in Forth to mean this idea of a silk thread of instructions. In modern programming the term thread means something else, and yet somehow similar, a thread of execution in modern programming is a lightweight process (runnign program) that shares memory with the main process allowing threads to communicate and execute instructions independently of each other. You can see it is quite different than the Forth thread, but you can also see how the weaving metaphor works spot on, and thats how they call it in the Forth magazines and books, so I will stick to that word. Also the term "forth word" has nothing to do with the assembly notation ".word" in our case .word just means declaration of a 4 byte value, a forth word now you see is a entry in the dictionary.
In the code so far in do_find
we do lw a0, 12(t0)
which will load the actual
machine code address, which I loosely call execution token. It is where NEXT
will jump to.
NEXT:
lw t0, 0(:IP)
addi :IP, :IP, 4
jr t0
Our words at the moment have actual machine code pointer at memory[:IP], e.g. If the address of DUP is 800000f8, then the value at memory[:IP] will be 800000f8 when we are about to execute DUP, it wont be the address of the word definition of DUP in the dictionary.
This is a major decision when making a Forth interpreter, do you point to the machine code or to the word. And in general the question of "how do you actually execute words from the dictionary". In our case we will point to the machine code.
There is a slight problem with the current explanation. Making our word gigantize use the word double would look like this:
word_double:
.word word_square
.word 6
.ascii "doub"
.word 0
.word DOCOL
.word DUP
.word PLUYS
.word EXIT
word_gigantize:
.word word_double
.word 9
.ascii "giga"
.word 0
.word DOCOL
.word DOCOL # this DOCOL is double's execution token
.word DUP
.word PLUS
.word EXIT
When we execute DOCOL for gigantize it will properly store the :IP in the return stack, but, then how are we going to move :IP inside gigantize's thread? That is the first problem. We we have EXECUTE code that at the moment writes the execution token from FIND in memory and then NEXT jumps to it, so for "dup", FIND_WORD will put the machine code of DUP then NEXT will jump to it. So far our :IP has been within the interpreter thread, jumping up and down through the bytecode, once we were executing a word when we call NEXT from the DUP machine code it will move IP onw down to JUMP in the interpreter, and we go again.
...
.word BRANCH_ON_ZERO
.word forth_word_found
.word BYE
forth_word_found:
.word LITERAL
.word execute_placeholder
.word WRITE
execute_placeholder:
IP -> .word 0
.word JUMP
.word next_token
...
The question is how do we make the IP jump within the word's thread?
word_gigantize:
.word word_double
.word 9
.ascii "giga"
.word 0
IP -> .word DOCOL
.word DOCOL
.word DUP
.word PLUS
.word EXIT
We do know at FIND's time the actual address of the thread, as we have found the
word, so we just have to change find to not dereference it (dereference is just
a fancy name of follow the pointer), if we replace this line lw a0, 12(t0)
with addi a0, t0, 12
so that we return the pointer not the dereferenced value
we can then put the address in a register (usually called W or XT), and then in
DOCOL we can do IP = XT + 4 (since we want to jump over the DOCOL) to start
executing the thread. This will work if your word does not call other words, as
you can see in the gigantize example, we just have DOCOL and then DOCOL again,
so we we will lose the XT value. This is a big annoying, it can be solved in
many ways, I wont go into details, but we will solve it in the coolest way. At
the time when we create word, we will create machine code instructions with the
value of the current address and we will set XT to this value from there and
then we will jump to DOCOL.
This is how a word will look in memory:
# : square dup * ;
#
# ...
# DOCOL:
# 80000534: RPUSH :IP <-----------------.
# 80000538: |
# 8000053c: mv :IP, :XT |
# 80000540: j NEXT |
# ... |
# 80000148 <DUP>: |
# 80000148: lw t0, 0(:SP) |
# 8000014c: PUSH t0 |
# ... |
# 80000880: w_square: |
# 80000880: 80000..# link |
# 80000884: 6 # size |
# 80000888: "squa" # token |
# 8000088c: 0 # flags |
# 80000890: 80000894 # CODE FIELD >--------|---.
# 80000894: lui :XT, 0x80001 >---. | <-'
# 80000898: addi :XT, :XT, 0x8a8 >--. |
# 8000089c: lui t0, 0x80000 >---. | |
# 800008a0: addi t0, t0, 0x534 >----|------'
# 800008a4: jr t0 |
# 800008a8: 80000148 # DUP <--------'
# 800008ac: 80000... # MUL
# 800008b0: 80000... # EXIT
# ...
One more change we need is to add flags field, which is going to be used to tell
us if a word is going to be executed in compile mode or not. In forth :
is the
symbol for 'create a new subroutine word', it puts the interpreter into compiler
mode. For example : square
creates a word square
that will be put in the
dictionary, then all the words after are going to be compiled into square's
thread, and when ;
is seen the word is complete. And the interpreter changes
back to interpret mode.
Generating machine code on the fly is called just in time compilation, we are not doing exactly what modern jit compilers do, and in the context of forth words it means something slightly different, but it is just as cool. To be able to put instructions by hand in memory and jump to them is the ultimate expression of the man machine interraction. There is nothing more beautiful than that.
GCC for example takes C code and generates machine code, this is called ahead of time compilation, compilers are more complicated than assemblers, they understand more about the higher level semantics of the program and can make executive decisions about the generated code, for example:
a = 5
b = 4
a = b
An optimizing compiler can see that a=5 is irrelevant, it wont even generate the machine code for li t0, 5; sw t0, 40(sp) (if a is on the function stack)
.
if (0) {
a = 6
b = 8
c = a + b
}
It will know that this branch will never be taken, so no code will be generated.
The assembler is much simpler than that, it tries to map one to one what you
wrote into machine code. An interpreter is a program that evaluates your
program. Different than compilers and assemblers, interpreters themselves are
compiled to machine code and they will execute the program, in our case our
interpreter has bytecode that goes through and finds tokens and so on. However
now we will have a compile mode which can compile new bytecode, and even more we
will have a on the fly machine code generation which will assemble machine code.
So it is safe to say we have everything, an interpreter, a compiler and an
assembler. We actually have two interpreters, inner one, the one that is j NEXT
that jumps through the memory threads, and the outer one which in our case
is written in the bytecode of the inner interpreter, and now we will have a
compuler and also the ability to generate machine code, and of course forth
bytecode, from inside our forth program and execute it.
All is the one.

First, lets see the code, boot.s and string.s are the same, I am putting the whole code here even the code for PLUS, EMIT, etc even though it didnt change, but I think its easier to read that way, it will give you some anchor to the things you are familiar with.
Take a deep breath, and just read it, it is code made by a human, to be read by other humans. It might seem frightening, some parts of it are easy, some make no sense and that's OK.
.section .text
.globl forth
.globl NEXT
.macro PUSH reg
addi :SP, :SP, -4
sw \reg, 0(:SP)
.endm
.macro POP reg
lw \reg, 0(:SP)
addi :SP, :SP, 4
.endm
.macro RPUSH reg
addi :RSP, :RSP, -4
sw \reg, 0(:RSP)
.endm
.macro RPOP reg
lw \reg, 0(:RSP)
addi :RSP, :RSP, 4
.endm
forth:
la :SP, FORTH_STACK_END
la :RSP, RETURN_STACK_END
mv :IP, zero
mv :XT, zero
la :HERE, dictionary_end
la :LATEST, dictionary_end - 5*4
li :MODE, 0
la t1, human_program
la t0, cur_token_address
sw t1, 0(t0)
la t0, cur_token_len
sw zero, 0(t0)
la :IP, interpreter_bytecode
la :XT, interpreter_bytecode
j NEXT
NEXT:
lw t0, 0(:IP) # load the actual code address from [IP]
addi :IP, :IP, 4 # move IP to next cell
jr t0 # jump
# ( a b -- c )
PLUS:
POP t0
POP t1
add t0, t0, t1
PUSH t0
j NEXT
# ( a b -- c )
MUL:
POP t0
POP t1
mul t0, t0, t1
PUSH t0
j NEXT
# ( -- n )
LIT:
lw t0, 0(:IP)
addi :IP, :IP, 4
lw t1, 0(:IP)
PUSH t0
j NEXT
# ( n -- )
EMIT:
POP a0
jal print_int
j NEXT
# ( value addr -- )
BANG:
POP t0 # address
POP t1 # value
sw t1, 0(t0)
j NEXT
# ( -- )
BYE:
j qemu_exit
# ( -- )
CR:
li a0, '\n'
jal putc
j NEXT
# ( len addr -- n )
ATOI:
POP a0 # address
POP a1 # length
jal atoi
PUSH a0
j NEXT
# ( len addr -- f )
IS_NUMBER:
POP a0 # address
POP a1 # length
jal is_number
PUSH a0
j NEXT
# ( a -- a a )
DUP:
POP t0
PUSH t0
PUSH t0
j NEXT
# ( a b -- b a )
SWAP:
POP t0 # b
POP t1 # a
PUSH t0
PUSH t1
j NEXT
# ( a -- )
DROP:
POP zero
j NEXT
# ( a b -- )
TWODROP:
POP zero
POP zero
j NEXT
# ( a b -- a b a b )
TWODUP:
POP t0 # b
POP t1 # a
PUSH t1 # a
PUSH t0 # b
PUSH t1 # a
PUSH t0 # b
j NEXT
# ( n1 n2 -- n1 n2 n1 )
OVER:
POP t0 # n2
POP t1 # n1
PUSH t1 # n1
PUSH t0 # n2
PUSH t1 # n1
j NEXT
# (x1 x2 x3 x4 -- x3 x4 x1 x2)
TWOSWAP:
POP t0 # x4
POP t1 # x3
POP t2 # x2
POP t3 # x1
PUSH t1
PUSH t0
PUSH t3
PUSH t2
j NEXT
# (x1 x2 x3 -- x2 x3 x1 )
ROT:
POP t0 # x3
POP t1 # x2
POP t2 # x1
PUSH t1 # x2
PUSH t0 # x3
PUSH t2 # x1
j NEXT
# (x1 x2 x3 -- x3 x1 x2)
NROT:
POP t0 # x3
POP t1 # x2
POP t2 # x1
PUSH t0 # x3
PUSH t2 # x1
PUSH t1 # x2
j NEXT
# ( a b -- f)
EQUAL:
POP t0
POP t1
beq t0, t1, .L_equal
li t0, 0
PUSH t0
j NEXT
.L_equal:
li t0, -1
PUSH t0
j NEXT
# ( len1 addr1 len2 addr2 -- flag)
MEMCMP:
POP a2
POP a3
POP a0
POP a1
call memcmp
PUSH a0
j NEXT
# ( f -- )
BRANCH_ON_ZERO:
POP t0
beqz t0, .L_do_branch
addi :IP, :IP, 4
j NEXT
.L_do_branch:
lw :IP, 0(:IP)
j NEXT
# ( -- )
JUMP:
lw :IP, 0(:IP)
j NEXT
# just a debug function to print the whole stack
# print debugging.. some people hate it some people love it
# I both hate it and love it
DEBUG_STACK:
addi sp, sp, -12
sw ra, 0(sp)
sw s8, 4(sp)
sw s9, 8(sp)
li a0, '<'
call putc
li a0, '>'
call putc
li a0, ' '
call putc
mv s9, :SP
add s9, s9, -4
la s8, FORTH_STACK_END
add s8, s8, -4
.L_debug_stack_loop:
beq s8, s9, .L_debug_stack_loop_end
lw a0, 0(s8)
call print_unsigned_hex
li a0, ' '
call putc
addi s8, s8, -4
j .L_debug_stack_loop
.L_debug_stack_loop_end:
li a0, '\n'
call putc
lw ra, 0(sp)
lw s8, 4(sp)
lw s9, 8(sp)
addi sp, sp, 12
j NEXT
do_next_token:
la a0, cur_token_address
lw t0, 0(a0)
la t1, cur_token_len
lw t1, 0(t1)
add t0, t0, t1 # len + addr
sw t0, 0(a0)
ret
# ( -- )
NEXT_TOKEN:
jal do_next_token
j NEXT
do_parse_token:
addi sp, sp, -4
sw ra, 0(sp)
# load the variables
la a0, cur_token_address
lw a0, 0(a0)
la a1, cur_token_len
lw a1, 0(a1)
jal token # parse the token
# store the new values
la t0, cur_token_address
sw a0, 0(t0)
la t1, cur_token_len
sw a1, 0(t1)
lw ra, 0(sp)
addi sp, sp, 4
# return a0 a1 from token
ret
# ( -- len addr )
PARSE_TOKEN:
jal do_parse_token
PUSH a1 # length
PUSH a0 # token address
j NEXT
# Input:
# a0: token address
# a1: token length
# Output:
# a0: execution token address (or 0 if not found)
do_find:
li t1, 0
mv t3, a1
# The shananigans here are so we can build little endian version of the token
# in 4 bytes dont be intimidated by them, I just made the tokens in the
# dictionary as "bye\0" instead of "\0eyb" to be easier to read
beqz t3, .L_not_found # zero length token
lbu t1, 0(a0)
addi t3, t3, -1
beqz t3, .L_find_start
lbu t2, 1(a0)
sll t2,t2, 8
or t1, t1, t2
addi t3, t3, -1
beqz t3, .L_find_start
lbu t2, 2(a0)
sll t2, t2, 16
or t1, t1, t2
addi t3, t3, -1
beqz t3, .L_find_start
lbu t2, 3(a0)
sll t2, t2, 24
or t1, t1, t2
# t1: has the input token as 4 byte number
# a1: is the length of the input token
# t0: pointer to the entry, we will start at the end
.L_find_start:
mv t0, :LATEST
.L_find_loop:
beqz t0, .L_not_found # if the entry is 0, means we didnt find a match
lw t2, 4(t0) # load the length of the entry
bne t2, a1, .L_next_entry # compare lengths
lw t2, 8(t0) # load entry name
bne t2, t1, .L_next_entry # compare names
add a0, t0, 16 # return the code address
ret
.L_next_entry:
lw t0, 0(t0) # follow link to next entry
j .L_find_loop
.L_not_found:
li a0, 0 # return 0 for not found
ret
# ( len addr -- xt )
FIND_WORD:
POP a0 # token address
POP a1 # token length
jal do_find
PUSH a0
j NEXT
DOCOL:
RPUSH :IP
mv :IP, :XT
j NEXT
EXIT:
RPOP :IP
j NEXT
COLON:
li :MODE, -1 # enter compile mode
jal do_create
# we want to achieve this, creating a new word
#
# : square dup * ;
#
# ...
# DOCOL:
# 80000534: RPUSH :IP <-----------------.
# 80000538: |
# 8000053c: mv :IP, :XT |
# 80000540: j NEXT |
# ... |
# 80000148 <DUP>: |
# 80000148: lw t0, 0(:SP) |
# 8000014c: PUSH t0 |
# ... |
# 80000880: w_square: |
# 80000880: 80000..# link |
# 80000884: 6 # size |
# 80000888: "squa" # token |
# 8000088c: 0 # flags |
# 80000890: 80000894 # CODE FIELD >--------|---.
# 80000894: lui :XT, 0x80001 >---. | <-'
# 80000898: addi :XT, :XT, 0x8a8 >--. |
# 8000089c: lui t0, 0x80000 >---. | |
# 800008a0: addi t0, t0, 0x534 >----|------'
# 800008a4: jr t0 |
# 800008a8: 80000148 # DUP <--------'
# 800008ac: 80000... # MUL
# 800008b0: 80000... # EXIT
# ...
# 1. EXECUTION CODE FIELD point to HERE + 4, where we will
# put the machine code: memory[HERE] = HERE+4
mv t0, :HERE
add t0, t0, 4
sw t0, 0(:HERE)
addi :HERE, :HERE, 4
# 2. Generate absolute address for where we want DOCOL to jump, in our case we want HERE+20
mv t0, :HERE
addi t0, t0, 20
# 3. Generate the machine code
# li :XT, value of :HERE + 20
# la t0, DOCOL
# jr t0
# and expanded
# lui :XT, value << 12
# addi :XT, :XT, value << 20 >> 20
# lui t0, value << 12
# addi t0, t0, value << 20 >> 20
# jr t0
# 3.1 Generate machine code for XT = HERE + 20 at time of compilation
li a0, 21 # XT is s5, which is register x21
mv a1, t0
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4
# 3.1 Generate machine code for la t0, DOCOL
li a0, 5 # t0 is x5
la a1, DOCOL
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4
# 3.2 Generate machine code for jr t0
li a0, 5 # t0 is x5
jal do_jr
sw a0, 0(:HERE) # jr
addi :HERE, :HERE, 4
j NEXT
# ( -- )
SEMICOLON:
mv :MODE, zero # exit compile mode
la t0, EXIT
sw t0, 0(:HERE)
addi :HERE, :HERE, 4
j NEXT
# ( x -- )
COMMA:
POP t0
sw t0, 0(:HERE)
addi :HERE, :HERE, 4
j NEXT
# ( -- flag )
MODE:
PUSH :MODE
j NEXT
do_create:
addi sp, sp, -4
sw ra, 0(sp)
jal do_next_token
jal do_parse_token
beqz a1, .L_create_error
# link field (4 bytes)
sw :LATEST, 0(:HERE)
# length field (4 bytes)
sw a1, 4(:HERE)
# token field (4 bytes)
li t1, 0
mv t3, a1
.L_create_build_token:
lbu t1, 0(a0)
addi t3, t3, -1
beqz t3, .L_create_write_token
lbu t2, 1(a0)
sll t2, t2, 8
or t1, t1, t2
addi t3, t3, -1
beqz t3, .L_create_write_token
lbu t2, 2(a0)
sll t2, t2, 16
or t1, t1, t2
addi t3, t3, -1
beqz t3, .L_create_write_token
lbu t2, 3(a0)
sll t2, t2, 24
or t1, t1, t2
.L_create_write_token:
sw t1, 8(:HERE)
# flags field
sw zero, 12(:HERE)
# move the dictionary end
mv :LATEST, :HERE
# update HERE to point to the end of the word
addi :HERE, :HERE, 16
lw ra, 0(sp)
addi sp, sp, 4
ret
.L_create_error:
la a0, err_create_error
j panic
panic:
jal puts
jal getch
j qemu_exit
# ( xt -- f )
SHOULD_COMPILE_WORD:
POP t0
beqz :MODE, .L_dont_compile
# if we are in compile mode, check the flag
lw t0, -4(t0) # flag value
bnez t0, .L_dont_compile # flag is immediate, execute it
li t1, -1
PUSH t1
j NEXT
.L_dont_compile:
PUSH zero
j NEXT
# ( addr -- value )
AT:
POP t0
lw t0, 0(t0)
PUSH t0
j NEXT
# ( xt -- )
EXECUTE:
POP t0 # xt
lw t0, 0(t0) # load code pointer
jr t0
# ( -- c )
KEY:
jal getch
PUSH a0
j NEXT
# ( -- addr )
PUSH_HERE:
PUSH :HERE
j NEXT
# Li ( a0: reg, a1: imm -- a0: opcode_lui a1: opcode_addi )
do_li:
# Extract upper immediate
# compensating for sign extension if needed
srli t0, a1, 12 # First get upper 20 bits
li t3, 0x800
and t1, a1, t3 # Check bit 11
beqz t1, no_adjust
addi t0, t0, 1 # Adjust for sign extension
no_adjust:
# LUI
#
# bits [31:12] = immediate
# bits [11:7] = rd
# bits [6:0] = 0x37 (opcode)
#
li a2, 0x37 # LUI opcode
slli t2, t0, 12 # upper immediate
or a2, a2, t2
slli t2, a0, 7 # rd
or a2, a2, t2
# ADDI
#
# bits [31:20] = immediate
# bits [19:15] = rs1
# bits [14:12] = 0 (funct3)
# bits [11:7] = rd
# bits [6:0] = 0x13 (opcode)
#
li a3, 0x13 # ADDI opcode
li t1, 0xfff
and t0, a1, t1 # lower 12 bits
slli t2, t0, 20 # immediate
or a3, a3, t2
slli t2, a0, 15 # rs1
or a3, a3, t2
slli t2, a0, 7 # rd
or a3, a3, t2
mv a0, a2
mv a1, a3
ret
# ( reg imm -- lui addi )
LI:
POP a1 # imm
POP a0 # reg
call do_li
PUSH a0 # lui
PUSH a1 # addi
j NEXT
# JR ( a0: reg -- a0: opcode_jr )
do_jr:
mv t0, a0
# bits [31:20] = 0 for imm=0
# bits [19:15] = reg
# bits [14:12] = 0 (funct3=0)
# bits [11:7] = x0 => 0
# bits [6:0] = 0x67 (opcode for JALR)
#
# So the entire instruction is:
# (reg << 15) | 0x67
slli t1, t0, 15 # reg << 15
li t2, 0x67 # opcode JALR
or t1, t1, t2 # final 32-bit instruction
mv a0, t1
ret
# JR ( reg -- opcode_jr )
JR:
POP a0
call do_jr
PUSH a0
j NEXT
dictionary:
word_bye:
.word 0 # link
.word 3 # token length
.ascii "bye\0" # first 4 characters of token
.word 0 # flags
.word BYE # address of execution token
word_plus:
.word word_bye
.word 1
.ascii "+\0\0\0"
.word 0
.word PLUS
word_mul:
.word word_plus
.word 1
.ascii "*\0\0\0"
.word 0
.word MUL
word_bang:
.word word_mul
.word 1
.ascii "!\0\0\0"
.word 0
.word BANG
word_at:
.word word_bang
.word 1
.ascii "@\0\0\0"
.word 0
.word AT
word_dup:
.word word_at
.word 3
.ascii "dup\0"
.word 0
.word DUP
word_emit:
.word word_dup
.word 1
.ascii ".\0\0\0"
.word 0
.word EMIT
word_cr:
.word word_emit
.word 2
.ascii "cr\0\0"
.word 0
.word CR
word_debug_stack:
.word word_cr
.word 2
.ascii ".s\0\0"
.word 0
.word DEBUG_STACK
word_colon:
.word word_debug_stack
.word 1
.ascii ":\0\0\0"
.word 0
.word COLON
word_semicolon:
.word word_colon
.word 1
.ascii ";\0\0\0"
.word 1 # immediate
.word SEMICOLON
word_li:
.word word_semicolon
.word 2
.ascii "li\0\0"
.word 0
.word LI
word_jr:
.word word_li
.word 2
.ascii "jr\0\0"
.word 0
.word JR
word_key:
.word word_jr
.word 3
.ascii "key\0"
.word 0
.word KEY
word_here:
.word word_key
.word 4
.ascii "here"
.word 1
.word PUSH_HERE
word_comma:
.word word_here
.word 1
.ascii ",\0\0\0"
.word 1
.word COMMA
dictionary_end:
# forth stack
.space 2048
FORTH_STACK_END:
# forth return stack
.space 2048
RETURN_STACK_END:
# token variables
cur_token_address:
.word 0
cur_token_len:
.word 0
# the outer interpreter
interpreter_bytecode:
next_token:
.word NEXT_TOKEN
.word PARSE_TOKEN
.word OVER
.word BRANCH_ON_ZERO
.word exit
check_is_number:
.word TWODUP
.word IS_NUMBER
.word BRANCH_ON_ZERO
.word not_a_number
.word ATOI # the number is on the stack
.word MODE
.word BRANCH_ON_ZERO
.word next_token # we are in eval mode
.word LIT
.word LIT
.word COMMA
.word COMMA
.word JUMP
.word next_token
not_a_number:
.word FIND_WORD
.word DUP
.word BRANCH_ON_ZERO
.word exit # word not found, just exit for now
forth_word_found:
.word DUP
.word SHOULD_COMPILE_WORD
.word BRANCH_ON_ZERO
.word execute_word # we are in eval mode, execute the word
.word AT # we are in compile mode, dereference the execution token
.word COMMA # write the code address in the thread
.word JUMP
.word next_token
execute_word:
.word EXECUTE
.word JUMP
.word next_token
exit:
.word BYE
# error messages
err_create_error:
.asciz "\nerror: create missing name, usage: create [name]\n"
# our actual human readable program
human_program:
.asciz "
: plus3 3 + ; 2 plus3 + . cr
: square dup + ;
: double dup * ;
: gigantize square double ;
3 gigantize . cr
bye
"
.end
There are few important things, oen I renamed WRITE to BANG which is a synonym
for !
, thats how Forth calls it, and I added AT which is just reading a value
from memory and pushes it to the stack. I changed FIND to return the exeuction
token instead of dereferencing it, and added few helper functtions. I added the
flag field in the dictionary entries, and I added 5 new registers, :RSP (s2).
:HERE (s4), :XT (s5), :MODE (s6), and :LATEST (s7), will explain them in a bit.
Renamed LITERAL to LIT, we wiull add LITERAL as a different kind of word that
will use LIT, and added new macros RPUSH and RPOP that push and pop values from
the Forth Return Stack. Changed do_find to use :LATEST as the end of the
dictionary, since we will modify it, we need to know where it ends in order to
add to it and to search it.
This is how the interpreter bytecode looks in memory.
address value label
--------------------------------------------------------
80001988 <next_token>:
80001988: 800003cc # NEXT_TOKEN
8000198c: 8000041c # PARSE_TOKEN
80001990: 800001d4 # OVER
80001994: 80000310 # BRANCH_ON_ZERO
80001998: 80001a44 # exit
8000199c <check_is_number>:
8000199c: 800001a0 # TWODUP
800019a0: 80000120 # IS_NUMBER
800019a4: 80000310 # BRANCH_ON_ZERO
800019a8: 80001a04 # not_a_number
800019ac: 80000100 # ATOI
800019b0: 80000590 # MODE
800019b4: 80000310 # BRANCH_ON_ZERO
800019b8: 80001988 # next_token
800019bc: 800000b0 # LIT
800019c0: 800000b0 # LIT
800019c4: 8000057c # COMMA
800019c8: 8000057c # COMMA
800019cc: 8000032c # JUMP
800019d0: 80001988 # next_token
80001a04 <not_a_number>:
80001a04: 800004b0 # FIND_WORD
80001a08: 80000140 # DUP
80001a0c: 80000334 # BRANCH_ON_ZERO
80001a10: 80001a44 # exit
80001a18 <forth_word_found>:
80001a18: 80000140 # DUP
80001a1c: 80000634 # SHOULD_COMPILE_WORD
80001a20: 80000310 # BRANCH_ON_ZERO
80001a24: 80001a38 # execute_word
80001a28: 80000678 # AT
80001a2c: 8000057c # COMMA
80001a30: 8000032c # JUMP
80001a34: 80001988 # next_token
80001a38 <execute_word>:
80001a38: 80000690 # EXECUTE
80001a3c: 8000032c # JUMP
80001a40: 80001988 # next_token
80001a44 <exit>:
80001a44: 800000f0 # BYE
Lets start by looking at what the bytecode does, step by step.
next_token:
.word NEXT_TOKEN
.word PARSE_TOKEN
.word OVER
.word BRANCH_ON_ZERO
.word exit
NEXT_TOKEN
moves just addst the current token's length to the current token's
pointer, hence it advances the token pointer by len. PARSE_TOKEN
will then
skip whitespaces until it finds the next token and it will push its length and
address on the stack, OVER
will copy the length and push it on top, and then
BRANCH_ON_ZERO
will not branch to BYE
if the length is zero, if not we
continue. You can see i removed a bunch of EQUAL and LITERAL 0 and so on, they
of course are not needed, I was using them in order to exercise your ability to
step through the bytecode and think what it is doing.
check_is_number:
.word TWODUP
.word IS_NUMBER
.word BRANCH_ON_ZERO
.word not_a_number
.word ATOI # the number is on the stack
.word MODE
.word BRANCH_ON_ZERO
.word next_token # we are in eval mode
.word LIT
.word LIT
.word COMMA
.word COMMA
.word JUMP
.word next_token
When we enter check_is_number
the stack is again length, address
as
BRANCH_ON_ZERO
popped OVER
's copy of the length. TWODUP
will copy the top
2 elements of the stack, so it will become length, address, length, address
,
we need to duplicate them so we can give a copy to IS_NUMBER
which will
consume top 2 elements and return a flag if the string is a number or not, then
BRANCH_ON_ZERO
will jump to not_a_number
if IS_NUMBER returned 0, otherwise
we continue, at the point of ATOI the stack is again length, address
, and ATOI
will take those and convert the string of ASCII symbols into a single 4 byte
integer and will push it to the stack, after that we will use the MODE
keyword
which just pushes the value of the :MODE
register to the stack, in our case
thats s7, and we use it to keep track if we are in compile mode or in
interpreter mode. If we are in interpreter mode at this point we are good to go,
and just jump to the next token, as the number is properly on the stack, if not
we must compile the vauke of the number into the dictionary of the word we are
creating. For example lets say we have this definition : plus3 3 + ;
we could
use it like this 2 plus3 + .
which will push 2 to the stack then we jump to
plus3 which pushes 3 on the stack, calls plus which pushes the result on the
stack and then we print the number top of the stack, which will be 5. The plus3
word should inside of it have code that pushes the number 3 on the stack, so it
should have .word LIT .word 3 .word PLUS
inside its thread. To do that we must
write the address of LIT and the value from the stack into the thread..
The stack at this point is just the number, e.g. 3, LIT LIT will push the
address of LIT on top, so it will become 3 800000b8
(3 is the number, and
800000b8 is the address of LIT), COMMA takes the top of the stack and writes it
wherever :HERE is, and increments :HERE += 4, :HERE is a register (s4) where we
keep the value of where we are writing now in the dictionary.
The first COMMA call will take the top of the stack, which now is the address of LIT and write it wherever HERE is pointing, and set HERE += 4 , then the second COMMA will write the number 3 at HERE (which is now HERE+4).
This writes exactly what we want in the dictionary entry .word LIT .word 3
.
Then it will just jump to next_token.
not_a_number:
.word FIND_WORD
.word DUP
.word BRANCH_ON_ZERO
.word exit # word not found, just exit for now
Reaching not_a_number
we have the length, address
, we are coming here all
the way up from the IS_NUMBER BRANCH_ON_ZERO
check, we call FIND_WORD
which
is going to return the execution token, or 0 if not found, we dup it so we can
check if its zero, meaning word is not found, BRANCH_ON_ZERO
will jump to
BYE
Otherwise we continue to process the word.
forth_word_found:
.word DUP
.word SHOULD_COMPILE_WORD
.word BRANCH_ON_ZERO
.word execute_word # we are in eval mode, execute the word
.word AT # we are in compile mode, dereference the execution token
.word COMMA # write the code address in the thread
.word JUMP
.word next_token
Again we copy the execution token, and check if we should compile it or not, if SHOULD_COMPILE_WORD
returns zero means we should execute it, SHOULD_COMPILE_WORD
will return 0 either if we are in compile mode, meaning we are writing bytecode into a thread, or the word is marked as immediate in which case it will be executed in immediate mode. If the word is supposed to be compiled, we continue to AT, COMMA
, The top of the stack is still the execution token, AT reads the value at specific address, which means it will read the value at the address of the execution token and dereference it, as we will get the pointer to the actual machine code, then COMMA
will write it at HERE
.
execute_word:
.word EXECUTE
.word JUMP
.word next_token
At this point again we have the execution token at the top of the stack, we will
jump to EXECUTE
which will dereference it and jump to the machine code. At this
point IP is pointing to JUMP so when the word is executed we will come back to
our JUMP
and go to the start again.
exit:
.word BYE
This is quite self explanatory, just exit qemu by NEXT jumping to BYE.
You can guess by now, maybe SHOULD_COMPILE_WORD
was a big enough hint, that
this whole interpreter can be also written in few lines of assembly, there is
zero reason to write it in the inner interpreter's bytecode, but I thought this
way is more fun, and I think we should have more fun with computers. Make them
do things, the more bizzare the better. Language that writes itself in itself
while overwriting itself with machine code of the machine that is running, whats
better than that.
Stepping through this forth program : square dup * ; 5 square . cr bye
. First
:
will set the interpreter in compile mode then create the word square in the
dictionary, inside with the thread of dup *
then ;
as it is immediate word,
will be executed in compile mode and will set the interpreter back in evaluation
mode, then 5 will push 5 to the stack and square will execute the word square,
it will move IP to its thread and execute dup and then +, dup will dup the top
of the stack which is 5, so now the stack will be 5 5, then * will multiply the
top 2 elements and push the result, after that we will exit from square and go
to the main thread, . will print the top of the stack which is noe 25, cr will
print a new line and bye will finally exit.
I want to talk speicfically in how we move IP from the main thread into the
word's thread, as I think its really cool, and for that we will have to dig into
:
.
Imagine we are the 'square' word. Empathize with it, think as if you are it, and the other words will interract with you and from somewhere, you dont know where, they will jump into your execution address.
Reminder of how the of the dictionary word looks like:
LINK : points to the previous word
LENGHT : the token length, e.g `begin` is 5 letters
TOKEN : first 4 characters of the token, begi in case of begin
FLAGS : is the word going to be executed in compile time or not
EXEC : where to jump to when the word is executed
Lets look at the dictionary around DUP
800001ac <DUP>:
800001ac: 0004a283 <-----. lw t0,0(s1)
800001b0: 00448493 | addi s1,s1,4
800001b4: ffc48493 | addi s1,s1,-4
800001b8: 0054a023 | sw t0,0(s1)
800001bc: ffc48493 | addi s1,s1,-4
800001c0: 0054a023 | sw t0,0(s1)
800001c4: eedff06f | jal zero,800000b0 <NEXT>
|
... |
|
|
80000aa8 <word_at>: |
80000aa8: 80000a94 <---. |
80000aac: 00000001 | |
80000ab0: 00000040 | |
80000ab4: 00000000 | |
80000ab8: 800006dc | |
| |
80000abc <word_dup>: | |
80000abc: 80000aa8 ----' | points to previous word at 80000aa8
80000ac0: 00000003 | lenght 3
80000ac4: 00707564 | the ascii for d u p
80000ac8: 00000000 | flags are 0
80000acc: 800001ac ------' address of the DUP function
80000ad0 <word_emit>: |
80000ad0: 80000abc ---' points to previous word at 80000abc
80000ad4: 00000001
80000ad8: 0000002e
80000adc: 00000000
80000ae0: 80000134
...
when we have the code 3 dup
our interpreter will first push 3 on the stack,
then it will find the word rup, and we will call EXECUTE, which will load the
value at the code field and jump to it, in our case at address 80000acc, and the
value is 800001ac, and so it will jump to 800001ac, where we have the machine
code for DUP, we will execute the machine code which pops the value from the
stack and pushes it twoce and it will then jump to NEXT.
Now thats all OK because NEXT will jump to the value of the :IP and then do :IP + 4, and our :IP is in the interpreter thread, so all good, NEXT will jump back to the interpreter. For user defined words however we need to make IP inside the word's thread. As we discussed we will create a tiny bit of machine code at the time we are creating the word that stores the location of its thread inside the machine code, so later when we jump to it the machine code has the correct value.
OK time to imagine you are the word 'square', someone jumps to you, first you want to jump to the machine code you have prepared, lets say you are at address 80000880 in the dictionary, and your thread starts at 800008a8 you want to do this
li :XT, 0x800008a8
la t0, DOCOL
jr t0
DOCOL will push the current :IP, which in our case will be somewhere in the interpreter's thread, to the return stack. and then move the :IP thread to :XT which our tiny machine code would've set to 800008a8.
DOCOL:
RPUSH :IP
mv :IP, :XT
j NEXT
As our word is being compiled inside COLON:
we know exactly where are writing
in memory, we keep moving the :HERE register to the right location. You know
when square
is to be executed someone will jump to your code field's value,
your execution token, so we will use that, we will write our machine code just
below it, and make it point to our machine code. Then inside the machine code,
as we know exactly how many instructions we need for it, we will put :XT to just
after the machine code itself, then DOCOL will do the rest and jump after. We
could ofcourse write the machine code for DOCOL itself, but this way seemed more
fun for me.
li
and la
are pseudo instructions, both are broken into lui
and addi
;
lui
loads the upper 20 bits of the value, and addi
the lower 12 bits. So our
machine code is exactly 5 instructions, or 20 bytes.
This is what we want for : square dup * ;
DOCOL:
80000534: RPUSH :IP <-----------------.
80000538: |
8000053c: mv :IP, :XT |
80000540: j NEXT |
... |
80000148 <DUP>: |
80000148: lw t0, 0(:SP) |
8000014c: PUSH t0 |
... |
80000880: w_square: |
80000880: 80000..# link |
80000884: 6 # size |
80000888: "squa" # token |
8000088c: 0 # flags |
80000890: 80000894 # CODE FIELD >--------|---.
80000894: lui :XT, 0x80001 >---. | <-'
80000898: addi :XT, :XT, 0x8a8 >--. |
8000089c: lui t0, 0x80000 >---. | |
800008a0: addi t0, t0, 0x534 >----|------'
800008a4: jr t0 |
800008a8: 80000148 # DUP <--------'
800008ac: 80000... # MUL
800008b0: 80000... # EXIT
Thats a lot of arrows, but I hope you get the idea, our execution token is just below our codefield, if our code field is at 80000890 then the exeuction token will be 80000894, so when someone finds our word in the dictionary, they will dereference is as in they will load the avalue at address 80000890 and jump to the value, which will be 80000894, and thats where our machine code lives, then the machine code in the end will jump to DOCOL which will make NEXT jump to our actual thread, in our case DUP and MUL.
Then we have EXIT which will pop :IP from the return stack and call NEXT to go back wherever we were called from.
Now lets discuss now would we make lui, addi and jr as machine code. Imagine we
want to write the instruction li :XT, 0x80000534
. for us :XT is s5, s5 is
register 21.
| x0 | zero |
| x1 | ra |
| x2 | sp |
| x3 | gp |
| x4 | tp |
| x5 | t0 |
| x6 | t1 |
| x7 | t2 |
| x8 | s0/fp |
| x9 | s1 |
| x10 | a0 |
| x11 | a1 |
| x12 | a2 |
| x13 | a3 |
| x14 | a4 |
| x15 | a5 |
| x16 | a6 |
| x17 | a7 |
| x18 | s2 |
| x19 | s3 |
| x20 | s4 |
| x21 | s5 |
| x22 | s6 |
| x23 | s7 |
| x24 | s8 |
| x25 | s9 |
| x26 | s10 |
| x27 | s11 |
| x28 | t3 |
| x29 | t4 |
| x30 | t5 |
| x31 | t6 |
This li
is going to be split into two instructions, liu x21, 0x80000
and
addi x21, x21, 0x534
, if you take the number 0x80000, 10000000000000000000 in
binary, and shift it to the left 12 bits, it becomes
10000000000000000000000000000000, or 2147483648 in decimal or 0x80000000 in hex,
and when you add 0x534 to it, or 10100110100 in binary, or 1332 in decimal, you
0x80000000 + 0x534 = 0x80000534, which is what we wanted to do.
In 32 bit RISCV there is no one instruction which can move 32 bits to a register, and you might have guessed why, the instructions themselves are 32 bits, and they have parameters, as in we need few bits to know which register destination we will use, and what is the instruction itself so we can execute the right sequence of micro instructions on the wires, enable this on the bus, disable that on the bus..
The machine code for liu x21, 0x80000 is 80000ab7
. 10000000000000000000101010110111 in binary.
For addi addi x21, x21, 0x534 is 534a8a93
, 01010011010010101000101010010011 in binary.
addi x21, x21, 0x534
means x21 = x21 + 0x534
and lui
before that put
0x80000000
into x21 so we get 0x80000000 + 0x534.
You see addi has two registers as paramters, rd and rs1, the format is addi rd, rs1, 12 bit value
, in out case both rd and rs are the same, 21, or in binary
10101, you can see those in the machine code. The
left most 12 bits of the instructions are the actual value we will add to rs1
and the result will be stored in rd. You can see 000
those are also part of the instruction, actually 0010011
just means integer
instruction, 000 means addi, 111 means andi 110 ori and etc, its just different
kind of integer operations, if you remember 74LS181 how you can control what
exact operation it does with S 0 1 2 3
, so I think thats why they decided to
put integer instructions closer together, so you can decode the fact that its
integer operation and then route the operation kind to the wires.
OK now, we have to come up with a function that when give the parameter 21 and
0x80000534, it produces the numbers 80000ab7
and 534a8a93
.
The recipe is quite straight forward, but there is a slight complication with the sign extention.
This is the snippet of the code, with ridiculous amount of comments.
# Input:
# a0 = destination register number (e.g., 21 for x21/:XT)
# a1 = immediate value we want to load (e.g., 0x80000534)
# Output:
# a0 = LUI instruction machine code
# a1 = ADDI instruction machine code
do_li:
# For example, for li x21, 0x80000534:
# 0x80000534 = 1000 0000 0000 0000 0000 0101 0011 0100
# First, handle the upper bits for LUI
srli t0, a1, 12 # Shift right by 12 to get upper 20 bits
# 0x80000534 >> 12 = 0x80000
# 1000 0000 0000 0000 0000
# Check if we need to adjust for sign extension
# This is needed because ADDI sign-extends its 12-bit immediate
li t3, 0x800 # 0x800 = 1000 0000 0000
and t1, a1, t3 # Check bit 11 of original value
# If bit 11 is 1, ADDI will sign-extend negatively
# So we need to add 1 to upper bits to compensate
beqz t1, no_adjust # If bit 11 is 0, no adjustment needed
addi t0, t0, 1 # Add 1 to upper bits to compensate for sign extension
no_adjust:
# Build LUI instruction: lui rd, imm
# Format: [imm[31:12]] [rd] [0110111]
# [20 bits ] [5 ] [7 bits ]
li a2, 0x37 # 0x37 = 0110111 = LUI opcode
slli t2, t0, 12 # Shift immediate to bits 31:12
or a2, a2, t2 # Combine with opcode
slli t2, a0, 7 # Shift rd (dest reg) to bits 11:7
or a2, a2, t2 # Combine with prev result
# Example for x21, 0x80000534:
# LUI x21, 0x80000 becomes:
# imm=10000000000000000000 rd=10101 opcode=0110111
# = 1000 0000 0000 0000 0000 1010 1011 0111 = 0x80000ab7
# Build ADDI instruction: addi rd, rs1, imm
# Format: [imm[11:0]] [rs1] [000] [rd] [0010011]
# [12 bits ] [5 ] [3 ] [5 ] [7 bits ]
li a3, 0x13 # 0x13 = 0010011 = ADDI opcode
li t1, 0xfff # Mask for lower 12 bits
and t0, a1, t1 # Get lower 12 bits of immediate
slli t2, t0, 20 # Shift immediate to bits 31:20
or a3, a3, t2 # Combine with opcode
slli t2, a0, 15 # Shift rs1 (source reg) to bits 19:15
or a3, a3, t2 # Combine with prev result
slli t2, a0, 7 # Shift rd (dest reg) to bits 11:7
or a3, a3, t2 # Combine with prev result
# Example for x21, 0x80000534:
# ADDI x21, x21, 0x534 becomes:
# imm=010100110100 rs1=10101 f3=000 rd=10101 opcode=0010011
# = 0101 0011 0100 1010 1000 1010 1001 0011 = 0x534a8a93
mv a0, a2 # Return LUI instruction in a0
mv a1, a3 # Return ADDI instruction in a1
We call the function like so:
li a0, 21
li a1, 0x80000534
jal do_li
# a0 contains LUI
# a1 contains ADDI
We will use this function to do both li :XT, HERE + 20
and la t0, DOCOL
COLON first creates the base of the word, where we have its link, length, token, and flags. :HERE is a register that points to the last value we added to the dictionary, we keep moving it as we add more and more values.
COLON:
...
# word is created, HERE points to just after the flags
mv t0, :HERE
add t0, t0, 4 # t0 = HERE + 4
sw t0, 0(:HERE)
addi :HERE, :HERE, 4
Then we store here + 4 at memory[here] and increment here +4 for the next write.
Compute HERE + 20, that is the address that we want to put in :XT so DOCOL moves :IP to it, and then generate the lui and addi machine code for it.
mv t0, :HERE
addi t0, t0, 20 # t0 = HERE + 20
# 3.1 Generate machine code for XT = HERE + 20 at time of compilation
li a0, 21 # XT is s5, which is register x21
mv a1, t0
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4
After that we do the same but we want to put DOCOL's address in t0
li a0, 5 # t0 is x5
la a1, DOCOL
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4
Now we have written li :XT, HERE+20
and la t0, DOCOL
, next we want to write
the machine code for jalr zero, 0(t0)
jalr zero, 0(t0) is 0x28067
or 00000000000000101000000001100111.
For our purposes this is actually a constant value, as none of the parameters change, t0 is always the 5th register or 00101, zero or x0 is always 00000, and the offset is alwas 000000000000, we dont have to recompute it, it will always be 0x28067, but we will do it anyway.
# Input:
# a0 = register number to jump to (e.g., 5 for t0)
# Output:
# a0 = JALR instruction machine code to jump to that register
do_jr:
mv t0, a0 # Save register number in t0
# We want to generate: jalr x0, reg, 0
# This means: jump to address in 'reg', don't save return address
#
# JALR instruction format:
# [imm[11:0]] [rs1] [000] [rd] [1100111]
# [12 bits ] [5 ] [3 ] [5 ] [7 bits ]
#
# For jr, we want:
# - imm = 0 (no offset to add to jump address)
# - rs1 = input register (where to jump to)
# - funct3 = 000 (JALR variant)
# - rd = x0 (don't save return address)
# - opcode = 1100111 (0x67) (JALR opcode)
#
# Example for jr t0 (x5):
# imm=000000000000 rs1=00101 000 rd=00000 1100111
# = 0000 0000 0000 0010 1000 0000 0110 0111
# = 0x00028067
slli t1, t0, 15 # Shift register number to rs1 position (bits 19:15)
# e.g., 5 << 15 = 0x00028000
li t2, 0x67 # Load JALR opcode (0x67 = 1100111)
or t1, t1, t2 # Combine register bits with opcode
# e.g., 0x00028000 | 0x67 = 0x00028067
# The middle zeros are:
# - imm[11:0] = 0 (bits 31:20)
# - funct3 = 0 (bits 14:12)
# - rd = 0 (bits 11:7)
mv a0, t1 # Return final instruction
ret
In COLON we use it like this:
# 3.2 Generate machine code for jr t0
li a0, 5 # t0 is x5
jal do_jr
sw a0, 0(:HERE) # jr
addi :HERE, :HERE, 4
Now when COLON finishes, :HERE points just after the jr, so the execution toknes
will be added just below, whatever we want DUP MUL etc, as we are parsing the
tokens since we are in compile mode, we will keep adding execution toknes to the
thread, until ;
is executed, and you see ;
is immediate word, which means it
is executed in compile mode, and what it does is it adds the execution token
EXIT
to the end of the word, it actually just adds it to wherever :HERE is,
which is at the end of the current word, and moves back intro evaluation mode.
SEMICOLON:
mv :MODE, zero # exit compile mode
la t0, EXIT
sw t0, 0(:HERE)
addi :HERE, :HERE, 4
j NEXT
You might notice HERE is a bit like a stack for the dictionary memory, we just
keep pushing. We can do soooo much with it, for example as we are creating a new
word, we can store here
on the data stack, then we can use it as a parameter
to the jump word. or we can use a placeholder and go back to patch with a
value. This is how we will create all kinds of control flow logic in Forth, from
if to loops.
Again there are many many ways to make a Forth, what is the absolute minimum
needed to build a complete and expressive language, which words are fundamental?
I am actually quite new to Forth, learned about it few months ago, but I got
excited by this exact question. In math for example I can say a = a
, which
means a thing must be equal to itself, 5 = 5, 3.2 = 3.2, there cant be anything
more fundamental than it right? But what about a - a = 0
, as 5 - 5 = 0, if we
say that a thing subtracted from itself gives nothing, it follows that a thing
is equal to itself, so which one is more fundamental? Are they not the same? The
symbolic manipulation, when evaluated, must be evaluated in context of the
evaluator. For example there could be a system where a = a
is broken, imagine the
evaluator is evaluating the expression symbol by symbol, and there is some
temporal nature to a
as in a
changes with time, how would it know that it is
the same a
? How much time would it take for the expression to be evaluated,
and by the time it is done, how would we know that a
is the same as it was? So
we abstract it away, we pretend that the evaluation is instant, and there are
things in our universe that are like that. For example gravity is instant, it
seems, once object moves its field moves with it, electricity however is not
instant, once you move the electron, the electric field takes time realize,
wait.. my electron moved, I have to move. Quite strange that we have both
instant and non instant evaluations of the fundamental forces. Anyway.. a = a
needs some contex when you are evaluating it, you must know its surroundings, it
itself is nothing, but it plus its evaluator together, they are something!
You see in this language, symbols exist in so many layers, for example I can
expose the most primitive words PUSH and POP, that all use the same temporary
storage, t0 for example,, and then make DUP a word : dup pop push push ;
, push
and pop are few machine code instructions, which are then few sequential micro
instructions, voltage or no voltage on some wires driven by a clock. How related
are the wires to our dup word? We could make a dup word from biological cells,
or from dominos, we can make it with water and valves as well. It seems that
programming languages live somewhere else, not exactly in the machine, not
exactly in their syntax, not exactly in their grammar, and not exactly in the
programmer, what a weird place must that be.
In our Forth we have the words jr and li, they take values from the stack and
push the assembled instructions into the stack. We also have the word here
that pushes the current location of the word we are compiling, and of course we
have !
that can write to any memory location.
# JR ( reg -- opcode_jr )
JR:
POP a0
call do_jr
PUSH a0
j NEXT
# ( reg imm -- lui addi )
LI:
POP a1 # imm
POP a0 # reg
call do_li
PUSH a0 # lui
PUSH a1 # addi
j NEXT
We could write machine code from forth itself with cleaver manipulations something like here 20 + dup dup 5 12345 li rot ! swap 4 + !
would write li t0, 12345
, lui at here + 20 and and addi at here + 24. So now it is even harder to say what is the language, and what is the machine. As Ada Lovelance said, the limit is in us, what we can think of, the possibilities are endless.
We will build few more forth functions that allow us to manipulate the return stack and we will add one more stack, called control flow stack to save jump addressess for if and else, and we will add the ability for a word to write bytecode in the word that is compiling it, you will see how powerful that is.
We will add a macto to PUSH and POP from the control flow stack, we will use register s3 for :CSP, and we will setup some stack space after the return stack.
.macro CFPUSH reg
addi :CSP, :CSP, -4
sw \reg, 0(:CSP)
.endm
.macro CFPOP reg
lw \reg, 0(:CSP)
addi :CSP, :CSP, 4
.endm
forth:
la :CSP, CONTROL_FLOW_STACK_END
...
...
.space 2048
FORTH_STACK_END:
# forth return stack
.space 2048
RETURN_STACK_END:
# forth control flow stack
.space 2048
CONTROL_FLOW_STACK_END:
Few words are needed to copy data from the return stack into the data stack and
from vice versa. r>
pops from the return stack and pushes to data stack, >r
pops from the data stack and pushes to the return stack and r@
copies the top
element from the return stack and pushes it to the data stack, leaving the
return stack unchainged. We have the same for the control flow stack, cf> >cf cf@
# ( x -- ) (R: -- x)
TO_R:
POP t0
RPUSH t0
j NEXT
# ( -- x ) (R: x -- )
FROM_R:
RPOP t0
PUSH t0
j NEXT
# ( -- x ) (R: x -- x)
R_FETCH:
lw t0, 0(:RSP)
PUSH t0
j NEXT
# ( x -- ) (CF: -- x)
TO_CF:
POP t0
CFPUSH t0
j NEXT
# ( -- x ) (CF: x -- )
FROM_CF:
CFPOP t0
PUSH t0
j NEXT
# ( -- x ) (CF: x -- x)
CF_FETCH:
lw t0, 0(:CSP)
PUSH t0
j NEXT
...
word_to_r:
.word ...
.word 2
.ascii ">r\0\0"
.word 0
.word TO_R
word_from_r:
.word word_to_r
.word 2
.ascii "r>\0\0"
.word 0
.word FROM_R
word_r_fetch:
.word word_from_r
.word 2
.ascii "r@\0\0"
.word 0
.word R_FETCH
word_to_cf:
.word word_r_fetch
.word 3
.ascii ">cf\0"
.word 0
.word TO_CF
word_from_cf:
.word word_to_cf
.word 3
.ascii "cf>\0"
.word 0
.word FROM_CF
word_cf_fetch:
.word word_from_cf
.word 3
.ascii "cf@\0"
.word 0
.word CF_FETCH
The other two very important words we will add are postpone
and immediate
# ( -- )
IMMEDIATE:
li t1, 1
sw t1, 12(:LATEST) # flag value
j NEXT
POSTPONE:
jal do_next_token
jal do_parse_token
jal do_find
beqz a0, .L_word_not_found
la t1, LIT
sw t1, 0(:HERE)
addi :HERE, :HERE, 4
lw a0, 0(a0) # dereference
sw a0, 0(:HERE)
addi :HERE, :HERE, 4
la t1, COMMA
sw t1, 0(:HERE)
addi :HERE, :HERE, 4
j NEXT
.L_word_not_found:
la a0, err_word_not_found
j panic
...
word_immediate:
.word ...
.word 9
.ascii "imme"
.word 0
.word IMMEDIATE
word_postpone:
.word word_immediate
.word 8
.ascii "post"
.word 1 # immediate
.word POSTPONE
immediate
sets the flag of the latest word to 1, so the interpreter will
execute it in compile time instead of embedding its execution token in the
thread of the word being compiled. Its pretty straight forward, once we create a
word with do_create
we update the :LATEST register, so it always points to the
right place, and LATEST + 12
is the exact place of the flag field.
postpone
however is a bit more subtle, I mean it is easy when you read it, it
compiles LIT, execution token, COMMA
into the compiled word's thread.
: begin
here
>cf
; immediate
: again
postpone jump
cf>
,
; immediate
: forever
begin
1 . cr
again
;
forever
begin-again loops are infinite loops in forth, there is no way to exit them, we
build begin
and again
just with here, jump , and postpone
, I am using the
control flow stack instead of the return stack, because of the way I made EXIT
work, and begin's EXIT will pop the wrong value from the return stack, so we use
the return stack only for subroutines and do-loops. Lets see what gets compiled in the
threads.
After compilation begin's bytecode looks like this in memory:
80000e40 <word_begin>:
80000e40: 80000e10 .word 0x80000e10 # link to previous word
80000e44: 00000005 .word 0x5 # length
80000e48: 69676562 .ascii "begi" # token
80000e4c: 00000001 .word 0x1 # immediate flag
80000e50: 80000e54 .word 0x80000e54 # code field
80000e54: 80001537 lui a0,0x80001 # jit code
80000e58: 0f450513 addi a0,a0,244
80000e5c: 800002b7 lui t0,0x80000
80000e60: 53428293 addi t0,t0,1332
80000e64: 00028067 jr t0
80000e68: 80000534 .word DOCOL
80000e6c: 80000678 .word PUSH_HERE
80000e70: 80000690 .word TO_CF
80000e74: 800004f8 .word EXIT
Those addressess are just some plausible numbers, but this is very effective method to think like the computer, just pick some number and imagine values there, on those addresses. I usually pick small numbers, like 1042 or something, but now I want to make numbers kind of consistend with what you would see from objdump.
Again's code will look a bit weird at first, but thats OK.
80000e78 <word_again>:
80000e78: 80000e40 .word 0x80000e40 # link to previous word
80000e7c: 00000005 .word 0x5 # length
80000e80: 69616761 .ascii "agai" # token
80000e84: 00000001 .word 0x1 # immediate flag
80000e88: 80000e8c .word 0x80000e8c # code field
80000e8c: 80001537 lui a0,0x80001 # jit code
80000e90: 0f450513 addi a0,a0,244
80000e94: 800002b7 lui t0,0x80000
80000e98: 53428293 addi t0,t0,1332
80000e9c: 00028067 jr t0
80000ea0: 80000534 .word DOCOL
80000ea4: 800000b0 .word LIT
80000ea8: 8000032c .word JUMP # address of JUMP
80000eac: 8000057c .word COMMA
80000eb0: 80000698 .word FROM_CF
80000eb4: 8000057c .word COMMA
80000eb8: 800004f8 .word EXIT
This is what postpone
does, it will add LIT, X, COMMA
to the bytecode, LIT X
will put the value X on the stack, and COMMA will write the value on the stack to memory at location :HERE, and will move :HERE += 4, so LIT,X,COMMA
is the same as memory[:HERE] = X
.
Pretend you are the outer interpreter, in compile mode, compiling the word
forever
, go step by step. First :
creates a dictionary word for the next
token, which is forever
, it creates just the basic word, link, length, token,
flags, :HERE points to the end of it. Next token is begin
, you lookup the
word, find it in the dictionary, see its flag is immediate, you jump to it to
execute it, it executes here
which will put the value of :HERE on the stack,
and then >cf
, will pop the data stack and push to the control flow stack. In
the end of executing begin nothing new has been added to the forever
word's
thread, then we have 1 . cr
which gets compiled to
80000f00 <word_forever>:
80000f00: 80000e78 .word 0x80000e78 # link to previous word
80000f04: 00000008 .word 0x8 # length
80000f08: 65726f66 .ascii "fore" # token
80000f0c: 00000000 .word 0x0 # flag
80000f10: 80000f14 .word 0x80000f14 # code field
80000f14: 80001537 lui a0,0x80001 # jit code
80000f18: 0f450513 addi a0,a0,244
80000f1c: 800002b7 lui t0,0x80000
80000f20: 53428293 addi t0,t0,1332
80000f24: 00028067 jr t0
80000f28: 80000534 .word DOCOL
80000f2c: 800000b0 .word LIT # <- HERE when begin was executed
80000f30: 00000001 .word 0x1
80000f34: 80000134 .word EMIT
80000f38: 800000f0 .word CR
80000f48: 00000000 .word ____ # <- HERE before again is executed
Finding the execution tokens one by one in the dictionary. Now we reach
again
. and HERE points to the end of the current definition, so again
has
LIT, ADDR OF JUMP, COMMA
which will write the address of JUMP to the location
of HERE. And our forever
word will look like this:
80000f28: 80000534 .word DOCOL
80000f2c: 800000b0 .word LIT
80000f30: 00000001 .word 0x1
80000f34: 80000134 .word EMIT
80000f38: 800000f0 .word CR
80000f3c: 8000032c .word JUMP
80000f48: 00000000 .word ____ <--- HERE
We continue to execute again
, next we have FROM_CF
and COMMA
, FROM_CF
will pop from the value begin stored in the control flow stack and push to the
data stack, then comma will write it to the location at HERE.
This is how the forever word would look after again
is executed.
80000f00 <word_forever>:
80000f00: 80000e78 .word 0x80000e78 # link to previous word
80000f04: 00000008 .word 0x8 # length
80000f08: 65726f66 .ascii "fore" # token
80000f0c: 00000000 .word 0x0 # flag
80000f10: 80000f14 .word 0x80000f14 # code field
80000f14: 80001537 lui a0,0x80001 # jit code
80000f18: 0f450513 addi a0,a0,244
80000f1c: 800002b7 lui t0,0x80000
80000f20: 53428293 addi t0,t0,1332
80000f24: 00028067 jr t0
80000f28: 80000534 .word DOCOL
80000f2c: 800000b0 .word LIT # <--------------------.
80000f30: 00000001 .word 0x1 |
80000f34: 80000134 .word EMIT |
80000f38: 800000f0 .word CR |
80000f3c: 8000032c .word JUMP |
80000f40: 80000f2c .word 0x80000f2c # jumps back to LIT ---'
80000f44: 800004f8 .word EXIT
80000f48: 00000000 .word ____ <--- HERE
Pretty cool right?
: until
postpone 0branch
cf>
,
; immediate
Using almost the same code we can implement begin-until, at the end of the loop we check if the stack has 0 and if it does we break.
: test-until
begin
key
dup . cr
113 =
until
;
test-until
This code for example will exit the begin loop if you press 'q' (ascii
113). Everything is the same as begin again
but here we use BRANCH_ON_ZERO
,
so we will jump back only if there is 0 on the stack.
80001000: 80000534 .word DOCOL
80001004: 800006a8 .word KEY <-------.
80001008: 80000140 .word DUP |
8000100c: 80000134 .word EMIT |
80001010: 800000f0 .word CR |
80001014: 800000b0 .word LIT |
80001018: 00000071 .word 113 |
8000101c: 80000264 .word EQUAL |
80001020: 80000310 .word BRANCH_ON_ZERO |
80001024: 80001004 .word 0x80001004 -------'
80001028: 800004f8 .word EXIT
We will use similar methods to create for loops, if, else, while, until etc, and then we will have quite expressive language, that is build from itself.
The code that is compiled is quite efficient, we did a lot of work during the compilation but the actual bytecode of the word is just the things needed for the BEGIN AGAIN to work, there is no control flow stack shananigans there, just JUMP to specific address.
We will make a program that waits forever for the key 'q' to be pressed and if it is it quits the program.
: if
postpone 0branch
here
0
,
>cf
; immediate
: then
here
cf>
!
; immediate
: forever
begin
key dup 113 = if
bye
then
. cr
again
;
forever
First we need if and then, it is quite similar to begin again, but instead of jump we need to use BRANCH_ON_ZERO
Start by thinking how forever
's bytecode should look, we have the old
unconditionkal jump to the top, but inside we have a branch that jumps over the
if content if the top of the stack is 0.
80001000: 80000534 .word DOCOL
80001004: 800006a8 .word KEY <-----------.
80001008: 80000140 .word DUP |
8000100c: 800000b0 .word LIT |
80001010: 00000071 .word 113 |
80001014: 80000264 .word EQUAL |
80001018: 80000310 .word BRANCH_ON_ZERO |
8000101c: 80001024 .word 0x80001024 ---. |
80001020: 800000f0 .word BYE | |
80001024: 80000134 .word EMIT <--------' |
80001028: 800000f0 .word CR |
8000102c: 8000032c .word JUMP |
80001030: 80001004 .word 0x80001004 ------'
Now how would we construct that at compile time, if
must store the location of
the 0branch's jump argument, 8000101c in this case somewhere, then leave an
empty placeholder cell, when then
is compiled it it has to write the value of
HERE at the placeholder, so that BRANCH_ON_ZERO
will jump over the if block.
: if
postpone 0branch \ put BRANCH_ON_ZERO in the word that is being compiled
here \ push the current end of the word's bytecode
0 , \ write a placeholder with value 0 and move HERE + 4
>cf \ store the previous HERE location in CF stack
; immediate
: then
here \ push the current end of the word's bytecode
cf> \ pop the placeholder location from CF stack
! \ write the value of here into the placeholder
; immediate
So again, if
prepares a placeholder for BRANCH_ON_ZERO
and stores the
placeholder's address on CF, then
loads the placeholder's address and inside
of it writes the last bytecode address in it.
else
is a bit more involved.
: else
postpone jump
here
0
,
here
cf>
!
>cf
; immediate
: forever
begin
key dup 113 = if
drop
0 . cr
bye
else
. cr
then
again
;
forever
Now we will change the program a bit, it will wait for 'q' and exit but before that it will drop the top of the stack and print 0, otherwise we will print the ascii of the key.
80001000: 80000534 .word DOCOL
80001004: 800006a8 .word KEY <-----------------------.
80001008: 80000140 .word DUP |
8000100c: 800000b0 .word LIT |
80001010: 00000071 .word 113 |
80001014: 80000264 .word EQUAL |
80001018: 80000310 .word BRANCH_ON_ZERO |
8000101c: 80001034 .word 0x8000103c ----------. |
80001020: 80000138 .word DROP | |
80001024: 800000b0 .word LIT | |
80001028: 00000000 .word 0 | |
8000102c: 80000134 .word EMIT | |
80001030: 800000f0 .word CR | |
80001032: 800000f0 .word BYE | |
80001034: 8000032c .word JUMP | |
80001038: 80001044 .word 0x80001044 -----. | |
8000103c: 80000134 .word EMIT <----------|----' |
80001040: 800000f0 .word CR | |
80001044: 8000032c .word JUMP <----------' |
80001048: 80001004 .word 0x80001004 -----------------'
You can do this, take your time, take pen and paper, and think through it. I
will just give you the high level overview of what is going onif
adds
branch_on_zero
and leaves a placeholder address on the control flow stack to
be patched later, else
adds jump
and creates another placeholder to be
patched by then
, it also patches the if
's placeholder to point at here
, so
if if
fails we jump into the else code, after that then
patches the
placeholder with here
. And so if you follow the bytecode, if the user presses
'q' which has ascii 113, EQUAL
pushes -1 to the top of the stack,
BRANCH_ON_ZERO
pops it and does not jump, because it is not zero, then we
execute the code inside the if, in the end of the code you see we have the
JUMP
left there by else, which jumps over the code we had in the else block
. cr
in our case. However if the user presses any other key than 'q', EQUAL
will push 0 to the stack and BRANCH_ON_ZERO
will jump over the if block into
8000103c, where we have the else code . cr
. And we still have the begin
again
jump that jumps back to the top.
One thing you might notice, is what happens if we call if
not in compile mode?
3 0 = if bye then
What would this do? Well, worse than nothing really, it will add junk to wherever HERE is pointing to, at the end of the last defined word. Cnomplete Forth implementations will warn when calling certain words while not in compile mode.
If you try this code in gforth
you will see the warning.
3 0 = if bye then
*terminal*:1:7: warning: IF is compile-only
*terminal*:1:7: warning: Compiling outside a definition
Our purpose is not to make a complete Forth implementation, but to understand the very core of symbol manipulation and programming languages, so we will leave ours as simple as possible.
Lets add loops.
# ( limit index -- R: limit index)
DO_DO:
POP t0 # t0 = index
POP t1 # t1 = limit
RPUSH t1 # limit
RPUSH t0 # index
j NEXT
# ( R: limit index -- R: limit index )
DO_LOOP:
RPOP t0 # pop index
RPOP t1 # pop limit
addi t0, t0, 1
blt t0, t1, .L_do_loop_jump # if limit < index
# skip over the jump address
addi :IP, :IP, 4
j NEXT
.L_do_loop_jump:
# push them back on Rstack if still looping
RPUSH t1 # push limit
RPUSH t0 # push index
# read the jump address from IP (the next cell in the thread)
lw :IP, 0(:IP)
j NEXT
LOOP_I:
lw t0, 0(:RSP)
PUSH t0
j NEXT
LOOP_J:
lw t0, 8(:RSP)
PUSH t0
j NEXT
LOOP_K:
lw t0, 16(:RSP)
PUSH t0
j NEXT
...
word_do_do:
.word ...
.word 4
.ascii "(do)"
.word 0
.word DO_DO
word_do_loop:
.word word_do_do
.word 6
.ascii "(loo"
.word 0
.word DO_LOOP
word_i:
.word word_do_loop
.word 1
.ascii "i\0\0\0"
.word 0
.word LOOP_I
word_j:
.word word_i
.word 1
.ascii "j\0\0\0"
.word 0
.word LOOP_J
word_k:
.word word_j
.word 1
.ascii "k\0\0\0"
.word 0
.word LOOP_K
And we need to add support to put (do) and (loop) into the compiled word.
: do
postpone (do)
here
>cf
; immediate
: loop
postpone (loop)
cf>
,
; immediate
: test-simple-loop
10 0 do
i . cr
loop
;
test-simple-loop
The syntax for loops in Forth is LIMIT INDEX DO ... LOOP, the code ... will be executed until INDEX is less than LIMIT, so 10 0 means start from 0 and go up to 9, 10 3 means start from 3 and go to 10, and so on -30 -40 means start from -40 and go up to -31, or 0 -5 start from -5 and go up to -1.
Inside the loop you have access to the word i
, if you have nested loop you get
j
and k
. First I will explain how do loops work and then will discuss how
ijk works.
Again, when you want to understand something, start from itself, how does test-simple-loop look in memory:
80001000: 80000534 .word DOCOL
80001004: 800000b0 .word LIT
80001008: 00000000 .word 10
8000100c: 800000b0 .word LIT
80001010: 0000000a .word 0
80001014: 80000714 .word DO_DO
80001018: 80000734 .word LOOP_I <-------. `here` at time of :do
8000101c: 80000134 .word EMIT |
80001020: 800000f0 .word CR |
80001024: 800000b4 .word DO_LOOP |
80001028: 80001018 .word 0x80001018 -----'
8000102c: 800004f8 .word EXIT
do
will embed DO_DO
into the thread of the compiled word. DO_DO
will push
the values from the data stack into the return stack, pretty clear, it will also
push here
to the control flow stack so later DO_LOOP
can jump back if we
have not reached the LIMIT
. loop
will embed DO_LOOP
into the thread, pop
the address stored at do
from the control flow stack and write it in the
thread, DO_LOOP
uses this address to jump to while the loop is still going,
when the loop is done it simply does IP += 4
and continues.
DO_LOOP
and DO_DO
are using the return stack to update the index value, so if we have 3 nested loops our return stack looks like this:
+----------------+ <-- RSP
| innermost idx | <-- i (offset 0)
| innermost lim |
+----------------+
| middle idx | <-- j (offset 8)
| middle lim |
+----------------+
| outermost idx | <-- k (offset 16)
| outermost lim |
+----------------+
i
, j
and k
are simply the values at memory[RSP], memory[RSP-8] and memory[RSP-16]. Depending where you use them they change meaning, for example
: test-simple-loop
10 0 do
i . cr
20 15 do
i . cr
loop
loop
;
test-simple-loop
Both i
will look at memory[RSP], so inside each loop it always refers to
itself, if you want to reach the index of the outer loop from the inner loop you
need to use j
, but if you try to use j
in the outer loop it makes no sense.
That is not how it actually works in real Forth, but I think its OK since we know its limitation.
Now everything together:
: test-loop
begin
10 5 do
35 30 do
53 50 do
i 52 = if
i 2 + i do
999999 . cr
loop
else
i 2 + i do
5555 . cr
loop
then
i . cr
j . cr
k . cr
loop
loop
loop
again
;
test-loop
Few things are missing, but I think most importantly and the ability exit
early from words and the ability to create variables and arrays, we are also
missing a lot of quality of life improvements, like comments and strings.
I think this is great time for you to pause and think how would you implement those things, if the book ends right here. How would you do it? What would your Forth look like? By now you know there are infinite many ways to create something. My mind works in certain way, I like certain patterns and structures, sometimes I am willing to sacrifice beauty for performance, or for education, sometimes I sacrifice performance for what I think is elegance. That means nothing.
Look again at our tic tac toe program.
create board 9 allot
: board[] board + ;
: reset-board ( -- )
9 0 do
'-' i board[] c!
loop
;
: print ( -- )
3 0 do \ j
3 0 do \ i
j 3 * i + board[] c@ emit
loop
cr
loop
;
: check-line ( a b c -- flag )
board[] c@ rot board[] c@ rot board[] c@
dup '-' = if
drop drop drop 0
else
over \ a b c -> a b c b
= \ a b c==b
rot rot \ c==b a b
= \ c==b a==b
and \ c==b && a==b
then
;
: check-win ( -- )
0 1 2 check-line if 1 exit then
3 4 5 check-line if 1 exit then
6 7 8 check-line if 1 exit then
0 3 6 check-line if 1 exit then
1 4 7 check-line if 1 exit then
2 5 8 check-line if 1 exit then
0 4 8 check-line if 1 exit then
2 4 6 check-line if 1 exit then
0
;
: play ( -- )
'X' 'O'
begin
over emit ." 's turn" cr
print
over key '0' - board[] c!
swap
1 check-win = if
print cr emit ." wins" cr
exit
then
again
;
reset-board play bye
The first time you saw it, it must'have been like seing alien language, now, you
can see through it, you can understand even the symbols you have never seen, and
imagine how would they work, what would ."
or create board 9 allot
do. This
ability to say: "If I was creating this, how would I make it", requires both
to deeply believe in yourself, to understand what you dont understand, to listen
to yourself and to have courage to dive into your doubt. The shadow of doubt
stretches long through the graph of knowledge. You can not swim in the sea of
doubt if you do not believe you will get through it, and I promise you, you will
get through, you just have to listen carefully, as the doubt only whispers.
Ignorance is required for understanding, as ignorance allows you to do
impossible things, and understanding is impossible at first, anything that you
have understood seem simple, but it seemed impossible before. Just look at this
line over = rot rot = and
, looks like absoloute nonsense, I guarantee you that
if you saw it before reading this chapter you would've said it is impossible to
understand this alien technology. Curiocity is required as well, otherwise over = rot rot = and
will stay just meaningless string of characters, unless you get uncontrollable desire to demistify it, which is what curiocity is, desire to understand.
When you are creating your language, or any program, you know the machine, you know yourself, you know what is important for you, and what is important for the machine. This is not how modern programmers think, they are thought in school about design patterns and separation of concerns, SOLID principles and so on, how to make composable, maintainable software, how to manage complexity, how to work on the same project with a thousand other people chainging the same code, or a million other people. I think this has nothing to do with programming computers. To use a computer means to program it to do what you want. If someone else makes a program for you, it does not matter if 1 programmer made it or 1 million programmers worked on it, it will always be incomplete, as they have to guess what you want from the computer, but only you know. It is the same when you make a chair, or a bed, or a spoon, or a cup of coffee.
Ignorance will allow you to make a better cup of coffee for you.
When you drink coffee you have two choices, you can say 'those peole are experts, they make coffee all their life, they have read research from scientists, they know everything that is to know about coffee, this must be the best coffee that humanity will ever make', or you could be ignorant, and curious, and say 'It must be possible to make a better cup of coffee, I will try to make one'.
Lets go back to our Forth, I will take some shortcuts, as I am also learning how to write Forth as I am writing this book, and by now the chapter is too long. First I will take advantage that we have 11 save registers in RISCV, the s registers are normal general purpose registers but the convention is if you use them in your function you have to save and restore them, like we do with ra
, so I will use more registers to track more stacks, instead of having the return stack holding both the loop limit/index it also holds the return address for the word, and now if we want to do early exit, we must unroll all the loops to get to the return address to jump to,
Imagine we change the program : test 10 0 do i . cr loop ;
to : test 10 0 do i dup . cr 5 = if exit then loop ;
so we return from test
early if i becomes 5.
: test
10 0 do
i dup . cr
5 = if
drop exit
then
loop
;
80001000: 80000534 .word DOCOL
80001004: 800000b0 .word LIT
80001008: 00000000 .word 10
8000100c: 800000b0 .word LIT
80001010: 0000000a .word 0
80001014: 80000714 .word DO_DO
80001018: 80000734 .word LOOP_I <-------.
.word DUP |
8000101c: 80000134 .word EMIT |
80001020: 800000f0 .word CR |
.word LIT |
.word 5 |
.word BRANCH_ON_ZERO |
.word 0x80001024---------.
.word DROP | |
.word EXIT | |
80001024: 800000b4 .word DO_LOOP <-------|--'
80001028: 80001018 .word 0x80001018 -----'
8000102c: 800004f8 .word EXIT
EXIT
just pops IP from the return stack, but now we have do-loop's limit and
index there as well, so we cant just pop once, we need to pop pop pop to get to
the actual value, but how do we know how many do loops we have on top of each
other? This of course can be done if we store more information on the return
stack, like instead of just limit/index we store 'limit, index, 7' and when we
jump into a subroutine we dont just store the Instruction Pointer but we store
some sort of tag about what the value means, e.g. 'IP, 9', 9 means subroutine
return address, 7 means do-loop loop data, and then if we want to EXIT
we just
start popping until we see the first 9
and then we know we have reached the
closest exit address. You can do everything with cleaver stack manipulations.
We can also just split the stacks, so the subroutine stack is separate and it only contains return addresses, since our computer has many registers things will be fast, BTW, I am using registers, but you can do it with fewer registers and just storing the top of the stack on some memory location, just every stack push and pop will require more instructions to do.
We will call this the exit stack
.
[ FIXME ]
It is not often that we know exactly what we want, in this case I want to be able to exit early from a word, I know the code I have written so far, I know the machine, I know I could do it in many ways, but now, I feel like adding few more stacks, so each stack has its own purpose, this is not what common Forth interpreters do, but, I am not making a "common" forth interpreter am I? I am writing code to learn about Forth, and in the same time to teach you about it. At this very moment, I dont know if this is a good idea or not to add more stacks, it seems fine, but in few chapters I might need registers to store something and I wont have any available, and that is OK, its OK to not know if a decision is good or bad, just listen to your intuition, try to have as much foresight as you can, and then go, if things become unmaintainable or the price you pay for a bad decision is too high, you must promise yourself that you will go back and fix all the broken things. This allows you to not overthink, few people are prophets of complexity, some can see much further than others, like Rob Pike, or Ken Thompson, but most people like me can see a bit further from their nose. Make mistakes, go back and fix them, then make more mistakes. This is the way. Some mistakes require you to start from scratch, and you have to allow yourself to do so, do not overestimate the work needed to start from scratch. In life this is not the case, almost always the future holds infinite possibilities, and the past is closed, but when you design systems, every decision constrains the future possibilities, but the past remains open.
Can you imagine, after writing million lines of code, thinking "NOOO, I need one more register, I could've had it if 5 years ago I didnt split the stacks", and I will be honest, this happens more often than you think, you find a work around, and move on, but deep down you know, you have made a grave mistake that will haunt you forever while you are adding code to this project. But next time, for the next project, your foresight will extend just a bit further than before, and then again, and again, mistake after mistake. This is how we grow.
Lets add the new exit stack, :ESP will be s8.
.macro EPUSH reg
addi :ESP, :ESP, -4
sw \reg, 0(:ESP)
.endm
.macro EPOP reg
lw \reg, 0(:ESP)
addi :ESP, :ESP, 4
.endm
forth:
...
la :ESP, EXIT_STACK_END
...
...
CONTROL_FLOW_STACK_END:
.space 2048
EXIT_STACK_END:
Changing DOCOL and EXIT to use the exit stack instead of the return stack. We also add unloop in case you want to exit from a loop you want to clear the return stack, otherwise it will contain left over garbage.
DOCOL:
EPUSH :IP
mv :IP, :XT
j NEXT
EXIT:
EPOP :IP
j NEXT
UNLOOP:
RPOP zero
RPOP zero
j NEXT
Add the words exit
and unloop
to the dictionary
word_exit:
.word ...
.word 4
.ascii "exit"
.word 0
.word EXIT
word_unloop:
.word word_exit
.word 6
.ascii "unlo"
.word 0
.word LOOP_UNLOOP
That was easy, exit works now, and we can exit early from words.
: wait
begin
key dup 113 = if
drop exit
else
. cr
then
again
;
wait bye
If you want to exit from within a loop you have to unloop it:
: wait
10 0 do
key dup 113 = if
unloop drop exit
else
. cr
then
loop
;
wait bye
For our purposes that is enough, we just need to add the ability to create arrays and be able to manipulate bytes in memory.
We will add CREATE
word that we can use for variables, we want when we use the
word to get the address of its data field, we dont need to push :IP and we dont
need DOCOL
and EXIT
, we just need to push the data field address to the
stack. For that we will generate a bit different jit code than what we do in
COLON
. For the jit we need to add support for sw
and addi
.
Then we will add support for ALLOT
which just moves :HERE
some amount, so we
can allocate memory space in the current word, for example 9 allot
will just
do :HERE = :HERE + 9
, this creates a small problem because now the next word
we create wont be aligned to be at address multiple by 4, which is a problem
since we jump into jitted code of some words, so we will patch do_create
to
make sure it always increments :HERE
up to the closest multiple of 4.
And we need to add byte level AT and BANG, called c@
and c!
, which is the
same as @
and !
but it uses lbu
and sb
instead of lw
and sw
, we also
need few helper functions like the ability to print characters, and to do AND
to check for multiple flags. We will expose ROT
and OVER
, we had them before
but we only used them in the inner interpreter, now we will add them to the
dictionary.
# update do_create to always create words at 4 byte boundary
do_create:
addi sp, sp, -4
sw ra, 0(sp)
jal do_next_token
jal do_parse_token
beqz a1, .L_create_error
# align to closest multiple of 4
addi t0, :HERE, 3
li t1, -4
and :HERE, t0, t1
...
# rest of do_create remains the same
# sw ( a0: rs1, a1: rs2 source -- a0: opcode_sw )
do_sw:
# bits [31:25] = 0 (imm[11:5] = 0)
# bits [24:20] = rs2 (source register to store)
# bits [19:15] = rs1 (base address register)
# bits [14:12] = 0x2 (funct3 for SW)
# bits [11:7] = 0 (imm[4:0] = 0)
# bits [6:0] = 0x23 (opcode for store)
li a4, 0x23 # opcode
li t0, 0x2000 # 2 << 12
or a4, a4, t0
slli t0, a0, 15
or a4, a4, t0
slli t0, a1, 20
or a4, a4, t0
mv a0, a4
ret
# addi ( a0: rd, a1: rs1, a2: imm -- a0: opcode_addi )
do_addi:
# ADDI instruction format:
# bits [31:20] = immediate
# bits [19:15] = rs1 (source register)
# bits [14:12] = 0x0 (funct3)
# bits [11:7] = rd (destination register)
# bits [6:0] = 0x13 (opcode)
li t0, 0x13 # ADDI opcode
slli t1, a0, 7 # Shift rd to position [11:7]
or t0, t0, t1
slli t1, a1, 15 # Shift rs1 to position [19:15]
or t0, t0, t1
li t1, 0xfff
and t2, a2, t1 # Mask to 12 bits
slli t2, t2, 20 # Shift immediate to position [31:20]
or t0, t0, t2
mv a0, t0
ret
CREATE:
jal do_create
# point the execution token to the machine code, same as COLON
addi t0, :HERE, 4
sw t0, 0(:HERE)
addi :HERE, :HERE, 4
# create foo
#
# 80000880: w_foo:
# 80000880: 80000..# link
# 80000884: 3 # size
# 80000888: "foo\0"# token
# 8000088c: 0 # flags
# 80000890: 80000894 # CODE FIELD >--------.
# 80000894: lui t0, HIGH(HERE+24) <-------' >-.
# 80000898: addi t0, t0, LOW(HERE+24) >-----------.
# 8000089c: addi SP, SP, -4 |
# 800008a0: sw t0, 0(SP) |
# 800008a4: lui t0, HIGH(NEXT) |
# 800008a8: addi t0, t0, LOW(NEXT) |
# 800008ac: jr t0 |
# 800008b0: <data field...> <------------------'
# li t0, :HERE
# addi :SP, :SP, -4
# sw t0, 0(SP)
# la t0, NEXT
# jr t0
addi t1, :HERE, 28 # HERE + 28, 7 instructions 4 bytes each
# li t0, value of :HERE + 28
li a0, 5 # t0 is x5
mv a1, t1 # HERE + 28
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4
# addi :SP, :SP, -4
li a0, 9 # :SP is s1, x9
li a1, 9 # :SP is s1, x9
li a2, -4
call do_addi
sw a0, 0(:HERE)
addi :HERE, :HERE, 4
# sw t0, 0(:SP)
li a0, 9 # :SP is s1, x9
li a1, 5 # t0 is x5
call do_sw
sw a0, 0(:HERE)
addi :HERE, :HERE, 4
# la t0, NEXT
li a0, 5 # t0 is x5
la a1, NEXT
jal do_li
sw a0, 0(:HERE) # lui
addi :HERE, :HERE, 4
sw a1, 0(:HERE) # addi
addi :HERE, :HERE, 4
# jr t0
li a0, 5 # t0 is x5
jal do_jr
sw a0, 0(:HERE) # jr
addi :HERE, :HERE, 4
j NEXT
# ( n -- )
ALLOT:
POP t0
mv a0, t0
add :HERE, :HERE, t0
j NEXT
# ( addr -- value )
C_AT:
POP t0
lbu t0, 0(t0)
PUSH t0
j NEXT
# ( value addr -- )
C_BANG:
POP t0 # address
POP t1 # value
sb t1, 0(t0)
j NEXT
# ( x1 x2 -- flag )
AND:
POP t0
POP t1
# Check if either value is zero
beqz t0, .L_false
beqz t1, .L_false
# Both non-zero, return TRUE (-1)
li t0, -1
PUSH t0
j NEXT
.L_false:
# At least one zero, return FALSE (0)
mv t0, zero
PUSH t0
j NEXT
# ( n -- )
EMIT_CHAR:
POP a0
jal putc
j NEXT
# ( a b -- c )
MINUS:
POP t0
POP t1
sub t0, t1, t0
PUSH t0
j NEXT
...
word_create:
.word ...
.word 6
.ascii "crea"
.word 0
.word CREATE
word_allot:
.word word_create
.word 5
.ascii "allo"
.word 0
.word ALLOT
word_c_bang:
.word word_allot
.word 2
.ascii "c!\0\0"
.word 0
.word C_BANG
word_c_at:
.word word_c_bang
.word 2
.ascii "c@\0\0"
.word 0
.word C_AT
word_emit_char:
.word word_c_at
.word 4
.ascii "emit"
.word 0
.word EMIT_CHAR
word_rot:
.word word_emit_char
.word 3
.ascii "rot\0"
.word 0
.word ROT
word_over:
.word word_rot
.word 4
.ascii "over"
.word 0
.word OVER
word_and:
.word word_over
.word 3
.ascii "and\0"
.word 0
.word AND
word_minus:
.word word_minus
.word 1
.ascii "-\0\0\0"
.word 0
.word MINUS
And this is a slightly modified tic-tac-toe, since we still dont support comments or strings, but it is close enough:
: begin
here
>cf
; immediate
: again
postpone jump
cf>
,
; immediate
: until
postpone 0branch
cf>
,
; immediate
: if
postpone 0branch
here
0
,
>cf
; immediate
: then
here
cf>
!
; immediate
: else
postpone jump
here
0
,
here
cf>
!
>cf
; immediate
: do
postpone (do)
here
>cf
; immediate
: loop
postpone (loop)
cf>
,
; immediate
create board 9 allot
: board[] board + ;
: reset-board
9 0 do
45 i board[] c!
loop
;
: print
3 0 do
3 0 do
j 3 * i + board[] c@ emit
loop
cr
loop
;
: check-line
board[] c@ rot board[] c@ rot board[] c@
dup 45 = if
drop drop drop 0
else
over
=
rot rot
=
and
then
;
: check-win
0 1 2 check-line if 1 exit then
3 4 5 check-line if 1 exit then
6 7 8 check-line if 1 exit then
0 3 6 check-line if 1 exit then
1 4 7 check-line if 1 exit then
2 5 8 check-line if 1 exit then
0 4 8 check-line if 1 exit then
2 4 6 check-line if 1 exit then
0
;
: play
88 79
begin
over emit cr
print
over key 48 - board[] c!
swap
1 check-win = if
print cr emit cr
exit
then
again
;
reset-board play bye
And our game works!
X
---
---
---
O
X--
---
---
...
What a journey! What joy! We actually made a language from scratch! Now its time to make an operating system.