Operating System
An Operating System (OS) is just a program, its a program with the only purpose of allowing the user of the computer to use the computer to its fullest potential, be able to run other programs and be able to access the hardware.
There are many operating systems, the most used are Linux, Windows, MacOS, iOS, Android (which is on top of Linux), FreeBSD, Minix, and FreeRTOS, but there are many others, BeOS, Haiku, CollapseOS, and so on.
Pretty much all devices you use have some operating system, even the oven. The TV usually runs Android, your phone is with iOS, your laptop Windows. There are operating systems even in parts of your computer, for example in most Intel CPUs itself there is small Minix that handles certain security features and power management. There might be a small operating system in your bluetooth headphones. even in your keyboard.
Again, its just a program, it does not have to be the first program the computer
starts, in our SUBLEQ computer the first program would be a circuit that copies
the actual SUBLEQ program from non volatile storage to RAM and then disconnects
itself, set the program counter to 0, and start the normal execution. Then we
could have an operating system written in SUBLEQ (if we had a bigger version of
the computer, its not much we can do in 16 memory locations. After the computer boots, and on some computers this is quite complicated process, somehow at some at point the operating system will be jumped into, like in our Forth, we have j forth, and from then on, it is in charge.
A lot of complexity goes into it because most popular operating systems try to
be very generic, for example Windows has to work on AMD processors and on Intel
processors, and even though they both have x86 instruction set, there are subtle
hardware bugs, certain functionalities exist only in one not in the other.
Windows also has to work on all kinds of motherboards, with all kinds of RAM,
and network cards, and USB controllers, all kinds of graphic cards, and so on.
When you write a program for Windows, you don't know at all what kind of
computer it will run on, you compile putc(65), and know that it will somehow
print 'A' on the screen, and you do not care how this is going to happen.
In our QEMU computer, the operating system would have to check if the UART is
ready to write by busy polling address 0x10000005 and when its available we
write the character to address 0x10000000
putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret
Now imagine we have different QEMU computers, one has the UART register at address 0x10000000, the other at 0x20000000, we could write two versions of the OS, compile them and let the users choose which one they need.
putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret
and
putc:
  li t0, 0x20000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret
There are multiple problems with this, first the user would have to very carefully install the right operating system, and if there are 200 kinds of computers, chances are they will try to install the wrong one, the user does not care, they just want to display text on the screen. Also the combinations of devices is ridiculous, you have 50 kinds of RAM, 200 kinds of wifi controllers, maybe families 150 network adapters, .. how many variants of the operating systems we have to compile to support all of the combinations?
The hardware manufacturers want their device to be used, so they make them
compatible with another device that is already supported by some operating
system, or sometimes we have standards that attempt to guarantee compatibility,
for example the USB HID specification, or the RISC-V specification. If I make a
RISC-V CPU and it adheres to the specification, than I can say my CPU is RISC-V
compatible, any program written for another RISC-V should be able to be executed
on my CPU, but, the CPU is not the whole computer, even if my CPU executes the
same sb instruction, as the other ones, writing to address 0x10000000 could
mean a very different thing on each computer. Some say that software
compatibiility is the biggest technological advancement since the
transistor. Everybody works really hard for things to be compatible. I think
this will change soon, as the main reason for this compatibility is because it
was hard to write software, and almost nobody can program computers, but now,
the wind of change is here, I am quite excited what it will bring. There is
price we must pay for compatibility, for example your Intel x86 CPU, still boots
in "Real Mode" which emulates Intel's 1978 8086 processor until it is switched
to protected mode. I often wonder, what world would it be if we knew how to
program computers, and are free from the chains of the past. Your bluetooth
headphones have to adhere to a standard definition that is more than 10,000
pages, why cant you just make some headphones and describe how their wires work
and someone else can write their program to use them? We keep adding standards
and layers of indirection everywhere, so that its easier to do things, but, I am
not sure anything is easier. In Commodore64 BASIC, in 1984 or so, people just
did peek and poke and were able to read and write memory, now computers do
not even come with a programming environment installed, not to mention you cant
read any memory you want (we will discuss this more in the security
section). The entire software industry has given up on teaching people how to
program, it has lost all trust in us. It is too complicated they say.. but is
it? Is it really complicated? Just tell us how it works, wires are wires, bits
are bits. It is too dangerous they say, we have to secure the computer for you,
otherwise someone might hack into it, well.. what if I want to hack into it, its
my computer!
Anyway, in order to make it easy to add support for more kinds of hardware, and
for hardware manufacturers to also add code to the operating systems, because
after all they know best how their hardware actually works, we "hide" hide how
putc works under the hood, we add a layer of indirection between the user and
the hardware.
boot:
  li t0, 2 # somehow check what kind of QEMU are we running on
  li t1, 2 # if the QEMU kind is 2, load the correct putc variant
  beq t0, t1, .L_qemu_2
  
.L_qemu_1:
  la t0, putc
  la t1, putc_qemu_1
  j .L_boot_done
.L_qemu_2:
  la t0, putc
  la t1, putc_qemu_2
  j .L_boot_done
  
.L_boot_done:
  j main
putc:
  .word putc_qemu_1
putc_qemu_1:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret
putc_qemu_2:
  li t0, 0x20000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret
syscall:
  addi sp, sp, -4
  sw ra, 0(sp)
  lw t0,0(a5)
  jalr t0
  lw ra, 0(sp)
  addi sp, sp, 4
and use it like this
    ...
    li a0, 65
    la a5, putc
    call syscall
    ...
This is an example of doing so, adding a layer of indirection between the
programmer and the UART, so they only say "system please print a character on
screen", we call syscall, with the function we want, in this case putc, it
will load the value of putc and jump to wherever it points to. We wont even care
if its written on the UART or displayed on a screen, the operating system will
hide that from us. This also allows us to actually change during the runtime of
the program, what putc does, for example imagine I plugin some different
display, the operating system somehow detects that, and now putc() could write
to the display, instead of UART.
This is what a device driver is BTW, it is just a bit of code that enables
certain hardware functionality, like putc support. Our display driver could
put the machine code that controls the leds somewhere in memory and change
putc to point to that location, so we have added support for displaying
characters on the screen. You might have heard 'you need to install a driver for
it' or 'there is new version of the NVIDIA driver, you have to update'
This is one aspect of the operating system, to abstract the hardware functionality, print, display pixel, get keyboard input, read from memory, write to memory, etc.
There is another aspect of it, which is to create an environment that allow the user to use the computer, write programs and execute programs, to run multiple programs in the same time. The most basic use could be to be able to edit a file, compile it and execute it. This is what we will focus on with our operating system.
At the moment our program is somewhere in memory at the label "human_program",
and we can only execute that one program, also our getch function is busy
looping, it is consuming all the cpu resources just to check if the UART is
ready to read, if our cpu runs at 100 mhz, and we type one character per second,
it will execute 100,000,000 useless instructions. Most CPUs have a way to go
into 'wait' mode, where they consume very little power until an interrupt
happens, for example UART interrupt or timer interrupt. Our code at the moment
does not take advantage of it. Our interpreter is also quite ridgit, we can only
start it once. We dont have the concept of a process, or inter process
communication.
It seems like a lot to build, but its not that much, and we will start small, first we will create the operating system scheduler, which is a tiny bit of code that runs at regular intervals, e.g. every millisecond, and just changes the running process, it snapshots all the registers as they are at the moment of the interrupt, and loads the registers of another process.
Imagine the following code:
process_a:
    li a0, 'a'
    jal putc
    j process_a
process_b:
    li a0, 'b'
    jal putc
    j process_b
If we jump into process_a we will be forever stuck there. We will inline putc into the code, so that we can see through everything:
process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)
    j process_a
process_b:
    li a0, 'b'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)
    j process_b
Zooming in even more, imagine you have jumped into process_a, and you are executing the instructions one by one.
800019a4:       06100513   li      a0,97
800019a8:       100002b7   lui     t0,0x10000
800019ac:       0052c303   lbu     t1,5(t0) 
800019b0:       02037313   andi    t1,t1,32
800019b4:       fe030ce3   beqz    t1,800019ac <--- YOU ARE HERE
800019b8:       00a28023   sb      a0,0(t0)
800019bc:       fe9ff06f   j       800019a4
So pc is at address 800008a0, there we have the instruction beqz t1, 1b,
1b means the label 1 backwards, which is actually a pseudo instruction that
expands to beq x0, x6, -8, or the machine code 0xfe030ce3. Even though our
disassembler outputs beqz t1,800019ac, the offset in the machine code is
actually relative, and 800019ac - 800019b4 is -8.
Just for practice lets look at 0xfe030ce3 in binary it is 11111110000000110000110011100011 which corresponds to beq x6, x0, -8.
So you are the program counter, and at the moment you are at address 800019b4, at this point the program has set the value 97 in a0, the value 0x10000000 in t0 and it has loaded the value from memory address 0x10000005 into register t1, lets imagine the value at that point in time was 0, so the whole state of the our program is:
    a0: 67
    t0: 0x10000000
    t1: 0
    pc: 0x800019b4
What if I turn off the computer, for 100 years, and then I start the computer with those values preloaded into the registers? Would the program execution know in any way that it was paused for one century?
What if our universe pauses for 1 billion years every nanosecond, would we know?
Our computer has no way of knowing if it just started in a certain configuration, or reached that configuration while executing some program. Because our computer is a deterministic machine, meaning if it starts from a known configuration and follows some program, if we run it 100 times, every time the output will be the same for the same input, setting a0 to 97 will always set it to 97, well almost always, as there could be errors in the wires, or the universe itself, but this is beyond the scope of this book.
So, if we can just store the registers process_a we can pause it, and later we
can just set them to their values and continue the process. You see there is no
such thing as a process its just code with a dedicated piece of memory where
we can store its registers when we switch it out. This way we can run multiple
programs in the same CPU, this is called time-sharing, every program gets a
chance to run for a little bit, and then its swapped, the scheduler tries to
give equal time to all running programs.
The piece of memory where we hold the process information is called Process Control Block (PCB), or Task/Thread Control Block. It usually hold the registers, some information like why was the process switched out, was it because it is waiting for something, or sleeping, or the scheduler just decided that its time to give chance to another process. PCB also holds the process identifier or PID, and accounting information like how much cpu was it using, which files it has open so when the process is shut down the files can be closed, some security infromation, priority level, what its status is (running, blocked, etc) and whatever else you need in the operating system in order to manage the running proceses.
So this is all fine, we store the registers, load another set of registers and jump to the new pc, but what does it mean to 'interrupt', how do we interrupt a program?
In our SUBLEQ computer we had a flag register and its output was controlling the microcode on one of the eeproms. In our microcode that is how we execute different micro instructions depending if the result was less than zero or not.
 OR(A==B, MSB==1)
   |
   D0        OE ---------------.
   |         |                 |
.------------------.           |
| 74ls173 FlagReg  |           |
'------------------'           |
   |         |                 |
   Q0        C                 |
   |          `----------------|--.
   '--------.                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
    0 1 2 3 4 5 address lines  |  |
    | | | | | | ..             |  |
.------------------.           |  |
|  28at64c eeprom  |           |  |
'------------------'           |  |
    | | | | | | ..             |  |
    0 1 2 3 4 5 I/O lines      |  |
          |   |                |  |
          '--------------------'  |
              |                   |
               `------------------'
Weirdly enough, we control the output of the flag register from the I/O lines, and its output controls the eeprom's address line which of course controls its output. If you were to remember one thing from this book, I want it to be this feedback loop. The interplay between the input and the output of a system, and how to build structures on top of it.
So if the value of register A is the same as register B, OR the most significant bit of the ALU output is 1 (meaning the number is negative), AND we have latched the value in the flag register, once we enable it, it will control the I/O lines by whatever we have as value on the new address.
We could use a similar technique, we can create a small timer circuit, that every 1 milliosecond enables an address line, which would allow us to execute different microcode, that could turn off the timer circuit when done and continue normal execution. Once we go in interrupt mode we could store the program counter in some temporary register, and jump to a predefined location, where we will have our interrupt handler (just a piece of code to be executed once interrupt occurs). Now the interrupt handler can fetch the value of the old program counter from the temporary register, and jump there if it wants, or store it somewhere and load some other value. This is what it means to interrupt a program, the processor is forced to jump to the interrupt handler code.
There are many kinds of interrups, there are timer interrupts, or for example uart interrupts, or exception interrupts, and so on, what if you get an interrupt while you are in the interrupt handler? There are certain interrupts you can ignore, they are called maskable interrupts, e.g. timer or soft interrupts, others you cant and you must handle, those are non-maskable interrupts, e.g. hardware errors, watchdog timer.
When I was young, I was quite terrified by interrupts, I just couldn't imagine how can the computer jump somewhere on its own, it was freaking me out a bit.
In QEMU we can use the UART hardware interrupts and the timer interrupts to make
our operating system scheduler. To learn a new thing, you should always start
with the absolute minimum. Make a new empty folder, just copy the Makefile and
we start again from boot.s and nothing else, later once the scheduler is done
we will bring back our Forth.b
  # boot.s
  .section .text
  .globl _start
_start:
  la sp, _stack_top
  j process_a
  j process_b # never reached
process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)
    j process_a
process_b:
    li a0, 'b'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)
    j process_b
  .end
Now we get infinite stream of 'a', which is a good place to start, you understand every single line in this code, and we are on solid ground. The next step would be to create an interrupt handler code and to create a timer interrupt, and we will just print 'c' from the interrupt handler.
There are many ways to break down something you think is impossible or unachievable, one common way is to split it into known unknowns and unknown unknowns, basically first do the things you are not sure are going to work, jump in the deep end. There is another way, which is do the smallest step in the direction you want to go, we are trying to experience interrupt handling, for that we need interrupts.. and handling, so we can just do those. And of course there is my favorite way, explore, because when you start with something, you think you know where you want to go, but as you are on the journey, you see it was not the right destination. Allow yourself to explore, try things out, be ready to delete your code, start from scratch, do not be afraid of the code that you will write, is it going to be stupid or embarassing, just play. It is hard to show you this in the book, but I don't know Forth nor RISC-V assembly, nor I have written a scheduler before, or an operating system, didnt know anything about the 74LS TTL chips before I made the 4 bit SUBLEQ computer, I didnt even know about SUBLEQ itself. I just try things out, and I am sure there are infinitely better ways to do what I did, but so what? In the beginning you have a path, and it seems quite straight bath, you plan your stops, 1) I will make a chapter about AND gates, 2) I will write about z80, 3) I will make small z80 program, never wrote z80 assembly, but why not.. and so on, midway however I realize, wait I can actually make my computer, that would be cool, why not, then I went on this sidequest for like 1 month, and it was was so much fun! My advice is, don't be afraid to experiment, only the blind know where they are going, so close your eyes and see.
Don't value your code that much, be happy when you delete it, it means you have outgrown it.
Its time to make some timers interrupts and their handlers.
In qemu first we need to setup the time at which we want to get the interrupt, now + 5000 clock ticks for example, we do that by the tick value is 64 bit value and our registers hold 32 bits, but this is fine because want to trigger the interrupt after 5000 ticks.
li t1, 0x0200BFF8 # address of mtime register
lw t0, 0(t1)      # load the lower 32 bits
li t1, 5000
add t0, t0, t1    # add 5000 to it
li t1, 0x02004000 # address of time compare register mtimecmp
sw t0, 0(t1)      # store the new value in it
mtime is continously incrementing hardware time register, well hardware is somewhat of a soft term in qemu, but if you are using a physical computer there will be some way to set up timers, and it will probably be similar.
mtimecmp is a compare register, once mtime >= mtimecmp it triggers machine timer interrupt.
In our assembly we first load the value of mtime, then we add 5000 to it and set it in the mtimecmp, now we only need to tell qemu that we are interested in receiving timer interrupt and we have to tell it how we are ghoing to handle them, adn the RISC-V specification defines how mtimecmp and mtime are used.
_start:
  la sp, _stack_top
  jal setup_timer
process_a:
  ...
process_b:
  ...
setup_timer:
  la   t0, interrupt_handler
  csrw mtvec, t0
  li t1, 0x0200BFF8 # address of mtime register
  lw t0, 0(t1)      # load the lower 32 bits
  li t1, 5000
  add t0, t0, t1    # add 5000 to it
  li t1, 0x02004000 # address of time compare register mtimecmp
  sw t0, 0(t1)      # store the new value in it
  
  li   t0, (1 << 7) # MTIE bit in mie
  csrw mie, t0
  csrr t1, mstatus
  li t0, (1 << 3)   # MIE bit in mstatus
  or t1, t1, t0
  csrw mstatus, t1
  # initialize current process pointer
  la t0, current_process
  la t1, process_a
  sw t1, 0(t0)
  j process_a
interrupt_handler:
  # setup the next timer interrupt
  li t1, 0x0200BFF8    # mtime address
  lw t0, 0(t1)         # get current time
  li t1, 5000          # timer interval
  add t0, t0, t1       # next interrupt time
  li t1, 0x02004000    # mtimecmp address
  sw t0, 0(t1)         # set next interrupt
  la t1, current
  lw t2, 0(t1)
  beqz t2, set_to_one  # if current == 0, jump
set_to_zero:
  li t2, 0
  sw t2, 0(t1)
  la t0, process_a
  j set_mepc
set_to_one:
  li t2, 1
  sw t2, 0(t1)
  la t0, process_b
  # set mepc (machine exception program counter)
set_mepc:
  csrw mepc, t0
  mret
current:
  .word 0
.end
In setup_interrupt we also have to enable timer interrupt MTIE (bit 7), in the mie register (_M_achine _I_nterrupt _E_nable). After that we actually need to enable MIE interrupts in mstatus, so we read the current value and or it with 00000000 00000000 00000000 00000100 (3rd bit), and then write it back, bit 3 is the MIE bit in mstatus.
# enable MTIE
li   t0, (1 << 7) # MTIE bit in mie
csrw mie, t0
# enable MIE
csrr t1, mstatus
li t0, (1 << 3)   # MIE bit in mstatus
or t1, t1, t0
csrw mstatus, t1
You have noticed that we dont use the normal li or lw and sw for mie, mstatus, mtvec, they are 'Control and Status Registers', or CSRs, a special
kind of registers, csrr means read, csrr t1, mstatus is read the value of
mstatus into t1, and csrw means write, csrw mie, t0 means write the value of
t0 into mie.
When interrupt happens, the processor has to jump somewhere, in the mtvec csr
register, Machine Trap Vector, it has the base address of where the CPU will
jump when machine mode interrupt or exception occurs.
la   t0, interrupt_handler
csrw mtvec, t0
All is well so far, once the next interrupt happens, it will jump to our
interrupt_handler code, there we have to set a new mtimecmp value so that we
get another interrupt next time, and we just check the value of the current
variable if its the address of process_a we switch it to process_b and later
we set mepc (Machine Exception Program Counter) to the new value, mepc holds
the value of pc at the time of the interrupt. At the end we call mret
instead of ret, you know there is no such thing as ret, it is jalr zero, 0(ra) or 0x00008067, so ret just jumps to wherever ra was pointing to, but
mret jumps to wherever mepc is, so by setting mepc we control where we go
back to.
If you had nice 80 column monitor you would see something like this:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
In pseudocode the program looks like this
start:
   setup_timer()
process_a:
   print('a')
   goto process_a
process_b:
   print('b')
   goto process_b
setup_timer:
   set mtvec to interrupt_handler
   interrupt after 5000 ticks
   enable timer interrupts
   enable machine interrupts
   process_a()
interrupt_handler:
   interrupt after 5000 ticks
   
   if current == 0:
      t0 = process_b
      current = 1
   else:
      t0 = process_a
      current = 0
   mepc = t0
   mret
So now we switch between two procesees, but we dont actually restore their state
at all, we just start process_a and process_b from scratch every time, we
jump to the very beginning of their machine code at every interrupt.
Think again, at the time of the interrupt it is as if the world is frozen from
the point of view of process_a, when the cpu jumps into interrupt_handler
all the registers are just as they were, again, try to imagine as if you
were executing those instructions on paper, and the world freezes.
process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0) <--- FREEZE
    andi t1, t1, 0x20 
    beqz t1, 1b
    sb a0, 0(t0)
    j process_a
What would the values be. What would the value of pc be at this point? The instruction is not partially executed, it must be executed, so pc points to th enext instruction, t1 is set with the value from memory address t0+5, a0 is 97 and t0 is 0x10000000.
We want to "continue" the process when its its turn again, so it can never know that something happened.
process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0) 
    andi t1, t1, 0x20 <--- CONTINUE
    beqz t1, 1b
    sb a0, 0(t0)
    j process_a
Our current example is quite trivial, so its not of much difference if we continue or start from the beginning, but anything code that changes memory will be extremely confused if we keep restarting it, or maybe if the process is more complicated and it cant finish the work in 5000 ticks then it will never return a result as we will keep starting it over and over again.
We will save the registers of the running process in memory (PCB) and then next
time its its turn we will restore them. We will just unfreeze its world. This is
quite strange if you think about it, that the program can never know if it was
stopped by itself, it must look at some external reference to know. There are a
lot of things in our universe like that, for example the principle of
relativity, if you are in a space ship with constant velocity, and there are no
stars around you, you can not know if you are moving, even more, if another ship
is coming towards you, you dont know if it is coming to you, or you are coming
to it. How bizzare it is that constant velocity is undetectable from inside the
ship? How is that not magic, stop and think for a second, as this is quite
disturbing. How can it be? We can freeze cells, then we unfreeze them and they
go as if nothing happened. The frog rana sylvatica actually freezes during
winter, they have no heart beat, no breathing and no blood circulation, nor
brain activity while frozen, but when it gets warm they get back to life. When
it wakes up, maybe one winter passed, maybe 10 winters passed, how would it
know?
When frozen, wood frogs have no detectable vital signs: no heartbeat, breathing, blood circulation, muscle movement, or detectable brain activity. Wood frogs in natural hibernation remain frozen for 193 +/- 11 consecutive days and reached an average (October–May) temperature of −6.3 °C (20.7 °F) and an average minimum temperature of −14.6 ± 2.8 °C (5.7 ± 5.0 °F). The wood frog has evolved various physiological adaptations that allow it to tolerate the freezing of 65–70% of its total body water. When water freezes, ice crystals form in cells and break up the structure, so that when the ice thaws the cells are damaged. Frozen frogs also need to endure the interruption of oxygen delivery to their tissues as well as strong dehydration and shrinkage of their cells when water is drawn out of cells to freeze.
-- https://en.wikipedia.org/wiki/Wood_frog
-- https://www.youtube.com/watch?v=UvCdOXG2rPo
Anyway, lets get back to our registers and process control blocks, this is how saving and restoring registers would look through time:
Time flows this way ------>
Process A executing:       Interrupt Handler:        Process B executing:
.-------------------.     .-------------------.     .-------------------.
| a0: 'a' (97)      |     | Save A's state    |     | a0: 'b' (98)      |
| t0: 0x10000000    | ->  | Set up new timer  | ->  | t0: 0x10000000    |
| t1: 0             |     | Load B's state    |     | t1: 0             |
| pc: 0x800019b4    |     | Jump to B (mret)  |     | pc: 0x80001bc4    |
'-------------------'     '-------------------'     '-------------------'
At the moment the way process_a is written it always jumps to its start, so the whole point is kind of moot, we will complicate it a bit in order to see that we actually do store and restore the registers, we do freeze and unfreeze it.
process_a:
  li a0, 'a'
  li a1, 0
  mv a2, a0
  li t0, 0x10000000
  li t2, 26
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  addi a1, a1, 1     # a1++
  rem a1, a1, t2     # a1 = a1 % 26
  add a0, a2, a1
  j 1b
process_b:
  li a0, 'A'
  li a1, 0
  mv a2, a0
  li t0, 0x10000000
  li t2, 26
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  addi a1, a1, 1     # a1++
  rem a1, a1, t2     # a1 = a1 % 26
  add a0, a2, a1
  j 1b
this change will make process 'a' print 'abcdefghijklmnopqrstuvwxyz' and process
'b' prints 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', instead of jumping to their start
labels they now jump to an inner loop where they both keep incrementing a1 = a1 % 26 and then make a0 = a2 + a1, a2 is the base character, and a1 = a1 % 26 will cycle between 0 and 25, so a0 = a2 + a1 will cycle from '97 + 0, 97 +
1, 97 + 2, 97 + 3, 97 + 4..  97 + 25', 97 is ascii for lowercase 'a' and 122
(97 + 25) is lowercase 'z', and for process 'b' we get the same but starting
from upper case 'A' which is 65 and ending at 90 (65 + 25) which is 'Z'. Now it
will be very visible if we don't do a good job restoring the registers.
Examine the code first, then we will discuss it.
setup_timer:
  la t0, interrupt_handler
  csrw mtvec, t0
  # PCB A init
  la t0, pcb_a
  la t1, process_a
  sw t1, 124(t0)       # set initial PC
  la t1, stack_a_top
  sw t1, 4(t0)         # set stack top
  # PCB B init
  la t0, pcb_b
  la t1, process_b
  sw t1, 124(t0)       # set initial PC
  la t1, stack_b_top
  sw t1, 4(t0)         # set stack top
  # set up initial current process
  la t0, current_pcb
  la t1, pcb_a
  sw t1, 0(t0)
  # set up initial timer interrupt
  li t1, 0x0200BFF8    # mtime address
  lw t0, 0(t1)         # get current time
  li t1, 5000          # timer interval
  add t0, t0, t1       # next interrupt time
  li t1, 0x02004000    # mtimecmp address
  sw t0, 0(t1)         # set next interrupt
  # enable timer interrupts (set MTIE bit in mie)
  li t0, (1 << 7)      
  csrw mie, t0
  
  # enable global interrupts (set MIE bit in mstatus)
  csrr t1, mstatus
  li t0, (1 << 3)      
  or t1, t1, t0
  csrw mstatus, t1
  # start with process A
  j process_a
interrupt_handler:
  # temporary save t6 on top of the current process's stack
  # we just use it as scratch memory
  addi sp, sp, -4
  sw t6, 0(sp)        # save t6, temporary
  add sp, sp, 4       # we will use -4(sp) to take it back
  
  # now we can use t6 safely
  la t6, current_pcb
  lw t6, 0(t6)        # get current PCB address
  # FREEZE!
  sw ra, 0(t6)        # x1
  sw sp, 4(t6)        # x2
  sw gp, 8(t6)        # x3
  sw tp, 12(t6)       # x4
  sw t0, 16(t6)       # x5
  sw t1, 20(t6)       # x6
  sw t2, 24(t6)       # x7
  sw s0, 28(t6)       # x8
  sw s1, 32(t6)       # x9
  sw a0, 36(t6)       # x10
  sw a1, 40(t6)       # x11
  sw a2, 44(t6)       # x12
  sw a3, 48(t6)       # x13
  sw a4, 52(t6)       # x14
  sw a5, 56(t6)       # x15
  sw a6, 60(t6)       # x16
  sw a7, 64(t6)       # x17
  sw s2, 68(t6)       # x18
  sw s3, 72(t6)       # x19
  sw s4, 76(t6)       # x20
  sw s5, 80(t6)       # x21
  sw s6, 84(t6)       # x22
  sw s7, 88(t6)       # x23
  sw s8, 92(t6)       # x24
  sw s9, 96(t6)       # x25
  sw s10, 100(t6)     # x26
  sw s11, 104(t6)     # x27
  sw t3, 108(t6)      # x28
  sw t4, 112(t6)      # x29
  sw t5, 116(t6)      # x30
  lw t5, -4(sp)       # NB: Get original t6 value from -4(sp)
  sw t5, 120(t6)      # Save original t6 (x31)
  csrr t0, mepc
  sw t0, 124(t6)      # Save mepc to PCB
  # set up next timer interrupt
  li t0, 0x0200BFF8    # mtime address
  lw t1, 0(t0)         # get current time
  li t2, 5000          # timer interval
  add t1, t1, t2       # next interrupt time
  li t0, 0x02004000    # mtimecmp address
  sw t1, 0(t0)         # set next interrupt
  # switch to other PCB
  la t6, pcb_a            # load PCB A address
  lw t5, current_pcb      # get current PCB
  bne t5, t6, switch_done # if current != A, switch to A
  la t6, pcb_b            # otherwise switch to B
switch_done:
  # store new PCB pointer
  la t5, current_pcb
  sw t6, 0(t5)
  # UNFREEZE!
  lw t5, 124(t6)       # get saved mepc
  csrw mepc, t5        # restore it
  lw ra, 0(t6)
  lw sp, 4(t6)
  lw gp, 8(t6)
  lw tp, 12(t6)
  lw t0, 16(t6)
  lw t1, 20(t6)
  lw t2, 24(t6)
  lw s0, 28(t6)
  lw s1, 32(t6)
  lw a0, 36(t6)
  lw a1, 40(t6)
  lw a2, 44(t6)
  lw a3, 48(t6)
  lw a4, 52(t6)
  lw a5, 56(t6)
  lw a6, 60(t6)
  lw a7, 64(t6)
  lw s2, 68(t6)
  lw s3, 72(t6)
  lw s4, 76(t6)
  lw s5, 80(t6)
  lw s6, 84(t6)
  lw s7, 88(t6)
  lw s8, 92(t6)
  lw s9, 96(t6)
  lw s10, 100(t6)
  lw s11, 104(t6)
  lw t3, 108(t6)
  lw t4, 112(t6)
  lw t5, 116(t6)
  lw t6, 120(t6)
  mret
current_pcb:
  .word 0
pcb_a:
  .space 128           # 32 registers * 4 bytes each
pcb_b:
  .space 128
stack_a:
  .space 4096         # 4KB stack for process A
stack_a_top:
stack_b:
  .space 4096         # 4KB stack for process B
stack_b_top:
First lets check out the output:
...
RSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTwxyzabcdefghijklmnopqrst
uvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstu
vwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuv
wxyzabcdefUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJK
LMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKL
MNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCghijklmnopqrstuvwxyzabcdefghijklmnop
qrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopq
rstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
ZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLxyzabcdefghijklmnopqrstuvwxyzabcdefghijkl
mnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklm
nopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmn
opqrstuMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEF
GHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFG
HIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZAvwxyzabcdefghijklmnopqrstuvwxyzab
cdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc
defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd
efghijkBCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTU
VWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUV
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMlmnopqrstuvwxyzabcdefghijklmnopqrstu
vwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuv
wxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw
...
When you are debugging something, you must find anchors, and by that I mean, you must find pillars of truth, pullars that are unequivocally true, based on those truths you will walk the path. If one of them is a lie, it will collapse everything, or even worse, it will rot.
Looking at this output, without even thinking about the programs, can you think of how the programs would have to work in order to produce it?
Look for patterns.
XYZABCghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwDEFGHI
You see how capitalized text is interrupted by lower case text, and then
continues. This is what I would expect to happen when process B is interrupted
and it gets unfrozen from where it was, this output at least guarantees that we
properly save a0.
OK, diving into the code, there are few important parts, first is we will create
separate stacks for each process, and we will also allocate space for each
process's PCB. The stack of a process is wherever we point sp at before we
jump into it, there is nothing special, its the same memory, the same wires.
Allocating the space for pcbs, stack and the current_pcb variable.
current_pcb:
  .word 0
pcb_a:
  .space 128
pcb_b:
  .space 128
stack_a:
  .space 4096
stack_a_top:
stack_b:
  .space 4096
stack_b_top:
In our setup_timer function we will setup the pcb structs
  # PCB A init
  la t0, pcb_a
  la t1, process_a
  sw t1, 124(t0)       # set initial PC
  la t1, stack_a_top
  sw t1, 4(t0)         # set stack top
On offset 124 we will have pc, and on offset 4 we have sp, We write the
address of process_a at memory[pcb_a + 124] and the address of stack_a_top
at memory[pcb_a + 4]. Similarly we setup pcb_b, and we also set current_pcb to pcb_a.
This is how the memory will look when we are setting up pcb_a.
process_a    8000000c:   06100513 <------.   addi    a0,zero,97
             80000010:   00000593        |   addi    a1,zero,0
             80000014:   00050613        |   addi    a2,a0,0
             80000018:   100002b7        |   lui     t0,0x10000
             8000001c:   01a00393        |   addi    t2,zero,26
             ...                         |
                                         |
                                         |
current_pcb  8000027c:   80000280 --.    |
             ...                    |    |
                                    |    |
                                    |    |
pcb_a        80000280:   00000000 <-'    |
pcb_a+4      80000284:   80001380        |
             ...                         |
pcb_a+124    8000037c:   8000000c -------'
After the pcb setup we setup the timer, enable interrupts and jump to
process_a. Few ticks later, depending on how we setup the timer, interrupt
will trigger and the cpu will jump into interrupt_handler, where the magic happens.
We want to save all the registers, basically we want to do something like
    pcb = current_pcb
    pcb[0] = x1
    pcb[1] = x2
    pcb[2] = x3
    pcb[3] = x4
    ...
Sadly at the time of the interrupt we do not know where where current_pcb points
to, and even if we did, sw requires a resister with the destination address to
write to, so how can we save the registers when we need a register to save the
registers?
Well we will just use a tiny bit of scratch memory at the current process's
stack, so we will save the value of t6 at sp - 4 , and then we will use t6
to fetch the address of the current pcb we want to save the registers at. It
might feel "unsafe" that we are doing this, but it is perfectly fine, this
memory should not be used by anyone, and if it is, its a major problem for both
of us, the only way it could happen is if they have a bug and mangle their
stack, write where they are not supposed to.
interrupt_handler:
  addi sp, sp, -4
  sw t6, 0(sp)        # save t6, temporary
  add sp, sp, 4       # we will use -4(sp) to take it back
  
  # now we can use t6 safely
  la t6, current_pcb  
  lw t6, 0(t6)        # get current PCB address
  # FREEZE!
  sw ra, 0(t6)        # x1
  sw sp, 4(t6)        # x2
  sw gp, 8(t6)        # x3
  sw tp, 12(t6)       # x4
  ...
BTW, I can also rewrite it like this
  sw t6, -4(sp)
  lw t6, current_pcb  
lw with a label will do la and then lw, and I don't need to do -4, sw 0(sp), +4 for sp, i can just sw with offset -4.
So now we actually have t6 saved at memory location sp - 4, and then we have t6 with the value of the pcb, so we can actually proceed saving all the registers.
  sw ra, 0(t6)        # x1
  sw sp, 4(t6)        # x2
  sw gp, 8(t6)        # x3
  sw tp, 12(t6)       # x4
  sw t0, 16(t6)       # x5
  sw t1, 20(t6)       # x6
  sw t2, 24(t6)       # x7
  sw s0, 28(t6)       # x8
  sw s1, 32(t6)       # x9
  sw a0, 36(t6)       # x10
  sw a1, 40(t6)       # x11
  sw a2, 44(t6)       # x12
  sw a3, 48(t6)       # x13
  sw a4, 52(t6)       # x14
  sw a5, 56(t6)       # x15
  sw a6, 60(t6)       # x16
  sw a7, 64(t6)       # x17
  sw s2, 68(t6)       # x18
  sw s3, 72(t6)       # x19
  sw s4, 76(t6)       # x20
  sw s5, 80(t6)       # x21
  sw s6, 84(t6)       # x22
  sw s7, 88(t6)       # x23
  sw s8, 92(t6)       # x24
  sw s9, 96(t6)       # x25
  sw s10, 100(t6)     # x26
  sw s11, 104(t6)     # x27
  sw t3, 108(t6)      # x28
  sw t4, 112(t6)      # x29
  sw t5, 116(t6)      # x30
  lw t5, -4(sp)       # NB: Get original t6 value from -4(sp)
  sw t5, 120(t6)      # Save original t6 (x31)
  csrr t0, mepc
  sw t0, 124(t6)      # Save mepc to PCB
Even though process_a used very few registers, but we dont actually know that,
we dont even know that process_a is running, we have to load the pcb value
from current_pcb. At this point our interrupt handler does not care about
process_a or process_b it just wants to save the registers somewhere.
Then in our code we setup the new timer and we swap the pcb. if current_pcb was pcb_a we set it to pcb_b, thats it.
  # set up next timer interrupt
  li t0, 0x0200BFF8    # mtime address
  lw t1, 0(t0)         # get current time
  li t2, 5000          # timer interval
  add t1, t1, t2       # next interrupt time
  li t0, 0x02004000    # mtimecmp address
  sw t1, 0(t0)         # set next interrupt
  # switch to other PCB
  la t6, pcb_a            # load PCB A address
  lw t5, current_pcb      # get current PCB
  bne t5, t6, switch_done # if current != A, switch to A
  la t6, pcb_b            # otherwise switch to B
switch_done:
  # store new PCB pointer
  la t5, current_pcb
  sw t6, 0(t5)
After that we will unfreeze the registers from the new pcb. When saving you see
we saved mepc at pcb+124, so now we just load it from there, we stored ra
at pcb+0, so we load it from there, and so on, even though this is actually
the other pcb, it does not matter, all pcbs look the same.
  lw t5, 124(t6)       # get saved mepc
  csrw mepc, t5        # restore it
  lw ra, 0(t6)
  lw sp, 4(t6)
  lw gp, 8(t6)
  lw tp, 12(t6)
  lw t0, 16(t6)
  lw t1, 20(t6)
  lw t2, 24(t6)
  lw s0, 28(t6)
  lw s1, 32(t6)
  lw a0, 36(t6)
  lw a1, 40(t6)
  lw a2, 44(t6)
  lw a3, 48(t6)
  lw a4, 52(t6)
  lw a5, 56(t6)
  lw a6, 60(t6)
  lw a7, 64(t6)
  lw s2, 68(t6)
  lw s3, 72(t6)
  lw s4, 76(t6)
  lw s5, 80(t6)
  lw s6, 84(t6)
  lw s7, 88(t6)
  lw s8, 92(t6)
  lw s9, 96(t6)
  lw s10, 100(t6)
  lw s11, 104(t6)
  lw t3, 108(t6)
  lw t4, 112(t6)
  lw t5, 116(t6)
  lw t6, 120(t6)
  mret
In the end mret returns to wherever mepc points to.
And here we go, our scheduler, toggles between two processees, smooth like butter. This is super tiny toy example, but it is very important to understand what we did here, we are using one cpu to run multiple programs, it is a big deal.
Now we just have to worry about getting interrupt inside the interrupt handler.
interrupt_handler:
  sw t6, -4(sp)
  la t6, current_pcb  
  la t6, 0(t6)
  sw ra, 0(t6)        
  sw sp, 4(t6)        
  ...
  ...
  ...
  ...
  ...
  lw t5, 124(t6)       
  csrw mepc, t5        
  lw ra, 0(t6)
  lw sp, 4(t6)
  lw gp, 8(t6)
  lw tp, 12(t6)
  lw t0, 16(t6)
  lw t1, 20(t6)
  lw t2, 24(t6)
  lw s0, 28(t6)
  lw s1, 32(t6)
  lw a0, 36(t6)
  lw a1, 40(t6)
  lw a2, 44(t6)
  lw a3, 48(t6) <-------- INTERRUPT
  lw a4, 52(t6)
  lw a5, 56(t6)
  lw a6, 60(t6)
  lw a7, 64(t6)
  lw s2, 68(t6)
  lw s3, 72(t6)
  lw s4, 76(t6)
  lw s5, 80(t6)
  lw s6, 84(t6)
  lw s7, 88(t6)
  lw s8, 92(t6)
  lw s9, 96(t6)
  lw s10, 100(t6)
  lw s11, 104(t6)
  lw t3, 108(t6)
  lw t4, 112(t6)
  lw t5, 116(t6)
  lw t6, 120(t6)
  mret
Walk through the code and think what would happen if we get randomly interrupted
mid-way of restoring the registers. mepc will be overwritten at this very
moment, it will point to the next instruction so that the interrupt handler can
go back to it..but we just set it from within the interrupt handler itself.
This is quite a nightmare-ish scenario, to be interrupted while handling an
interrupt, and it is quite nuanced to deal with it, in RISCV there are very
strict rules around what exactly happens on trap entry (when interrupt happens),
and on mret. On each architecture there are subtleties you must know.
This is called double-trap, and to make a function that can be interrupted mid way and then resumed is called 'reentrant' function. So you must either understand the architecture or you must make your trap handler reentrant.
Reentrancy is a programming concept where a function or subroutine can be interrupted and then resumed before it finishes executing. This means that the function can be called again before it completes its previous execution. Reentrant code is designed to be safe and predictable when multiple instances of the same function are called simultaneously or in quick succession
There are also functions you can apply multiple to the same input, they will always produce the same result, as if they were called only once, this is called idempotence. Reentrancy is not the same as idempotence.
Let me give you few examples of reentrant and non reentrant functions.
reentrant_sum(array):
  total = 0
  for each element in array:
    total += element
  return total
---
global_total = 0
non_reentrant_sum(array):
  global_total = 0
  for each element in array:
    global_total += element
  return global_total
If reentrant_sum is interrupted and called gain, the result will not change,
in this case because it has the total sum in a local variable, however
non_reentrant_sum if interrupted and called again before the first run
finishes, the global_total value will be wrong. Imagine we are summing the array
[1,2,3,4,5], and as we are on the number 2 we get interrupted, then some other
process jumps again into us, as one version of the function is paused, the other
version will set global_total to 0, and then start adding again, and lets say
it gets interrupted as it is at number 4, so global_total is now 10 (1 + 2 +
3 + 4), and we go back to the first version, but there we were frozen at the
second element, so we continue to add to global_total 12 (3 + 4 + 5) and we
return 22, which is obviously not 1+2+3+4+5, of course the second run will also
be wrong as it will return
It is not always trivial to make reentrant functions, especially if you stricly
depend on things to happen in certain order, and you need multiple instructions
to be executed untorn, like in our case even if we move setting mepc to the very
last place lw t5, 124(t6) ; csrw mepc, t5; mret, setting mepc and mret is 3
instructions, we can always be interrupted mid-way. even if it was 2
instructions, we can still be torn. Imagine having some magical instruction
mepc=t5; mret we get interrupted right after mepc=t5, interrupt occurs, then
it overwrites mepc with mepc+4 and when we come back we go to mret which will
then jump back to mepc=t5.
To help us out, RISCV makes it really hard for double traps to occur, In RISCV this is what roughly happens during an interrupt (amongst a bunch of other things, like MPP set and MDT and etc):
On trap entry:
- MPIE(Machine Previous Interrupt Enable) gets the current value of MIE(Machine Interrupt Enable)
 - MIE is set to 0 (disabling interrupts)
 - MEPC is set to PC + 4
 
On mret:
- MIE gets the value from MPIE (restoring previous interrupt state)
 - MPIE is set to 1
 - Jump to MEPC
 
This way interrupts are disabled until we execute mret, and we can not get
into double trap easilly. Those interrupts are 'maskable interrupts'. The non
maskable interrupts can still occur, and they require carefull code to be
handled. For now we will not care if exception occurs in our interrupt handler,
but if you are making a real system, you will care.
The real question is, what happens with the actual interrupts while you are doing the interrupt handling, e.g. a keyboard is pressed, or a network packet comes in? On some systems they will be pending and once interrupts are enabled again the trap handler will be called, on others they will be lost. RISCV has interrupt priority levels M-mode (machine level interrupts), and S-mode (supervisor level), and it does have MIP (Machine Interrupt Pending) register.
This scheduler is just a toy example, but enough to spark your curiocity, and to have a mental model of what happens during a context switch, and kind of what happens when you press a key on your keyaboard.
Interrupts are the very heart of our computers. They are fundamental for any system, in order to change itself, it allows it to change its direction. For example, now chatgpt can think, it produces its thoughts as text.
Check out this thinking text from deepseek when I asked it the question "pick a random number from 0 to 10"
thinking..Okay, let's see. The user wants me to pick a random number between 0 and 10. Hmm, how do I approach this? Well, first, I need to make sure I understand the range correctly. The user said "from 0 to 10." So does that include both 0 and 10? I think it does. Usually, when someone says from X to Y without specifying exclusivity, it's inclusive. So numbers 0, 1, 2, ..., up to 10. That's 11 numbers in total.
Now, how do I generate a random number here? Since I'm an AI and don't have a physical way to randomize, I need to simulate randomness. But in programming terms, generating a random integer in a range is straightforward. For example, in Python, you can use the random module's randint function, which includes both endpoints. So random.randint(0, 10) would give me an integer between 0 and 10 inclusive.
Wait, but maybe the user expects a more human-like approach. If I were a person asked to pick a random number between 0 and 10, I might just think of a number off the top of my head. But since I'm a machine, I have to rely on algorithms. However, to keep it simple, I can just state the number as if it were a quick pick.
I should also consider if there's any bias. If I just pick a number without using a proper randomization method, it might not be truly random. But given the constraints, using a pseudorandom method should suffice for this purpose. The user probably just wants a quick answer without needing cryptographic security levels of randomness.
Another thing to check: sometimes people might have a preference for certain numbers. For example, 7 is often considered a "lucky" number. Should I account for that? Probably not, since the user asked for a random selection. It's better to stick with an unbiased approach.
Let me also think about possible edge cases. If the user is using this number for something specific, like a game or a decision-making process, the randomness is important. But again, without more context, providing a straightforward random number is the way to go.
So, to summarize: the task is to pick a number between 0 and 10 inclusive. The method would be to use a pseudorandom number generator. Implementing that in code would give the result, but since I can't execute code here, I'll simulate the outcome. Let me just mentally select one. Alternatively, I can use a timestamp or some other method to seed a number, but that's complicating it. Maybe the simplest way is to just pick a number based on the current state of my internal processes, which are designed to simulate randomness.
Alright, time to decide. Let's go with the number 7. Wait, but that's a common choice. Maybe I should pick a different one to avoid bias. How about 3? Or 5? No, better to let the system handle it. Alternatively, using modulo operation on some arbitrary number. For example, if I take the number of letters in the user's request, which is "pick a random number from 0 to 10" – that's 28 characters. 28 modulo 11 (since there are 11 possible numbers) would be 28 divided by 11 is 2 with a remainder of 6. So 6. But that's deterministic based on the input. Hmm, maybe that's not truly random. Alternatively, using the current time in seconds. Let's see, if I take the current seconds value when processing this, say it's 30 seconds past the minute. 30 modulo 11 is 8. So 8. But I don't have access to real-time data. So that's not feasible.
Alternatively, since I can't actually generate a random number, perhaps I should just list all numbers from 0 to 10 and pick one. But how to choose without bias? Maybe alternate between numbers each time. But if multiple users ask, the same number might repeat. However, since each request is isolated, it's okay. But given that I can't track previous choices, the best I can do is pick a number that feels arbitrary.
Wait, but the user might expect a different answer each time. If I hardcode a number, that's not helpful. However, in the context of a single response, I just need to provide one number. Since I can't generate a new one each time, perhaps the best approach is to pick a number like 5, which is in the middle. But that's not random. Alternatively, pick 7, as it's a common "random" choice. But again, that's not truly random.
Alternatively, I could use a mathematical formula. For example, take the sum of the ASCII values of the letters in the user's message modulo 11. Let's try that. The message is "pick a random number from 0 to 10". Let's calculate the sum:
p (112) + i (105) + c (99) + k (107) + space (32) + a (97) + space (32) + r (114) + a (97) + n (110) + d (100) + o (111) + m (109) + space (32) + n (110) + u (117) + m (109) + b (98) + e (101) + r (114) + space (32) + f (102) + r (114) + o (111) + m (109) + space (32) + 0 (48) + space (32) + t (116) + o (111) + space (32) + 1 (49) + 0 (48).
Let me add these up step by step:
112 + 105 = 217
217 + 99 = 316
316 + 107 = 423
423 + 32 = 455
455 + 97 = 552
552 + 32 = 584
584 + 114 = 698
698 + 97 = 795
795 + 110 = 905
905 + 100 = 1005
1005 + 111 = 1116
1116 + 109 = 1225
1225 + 32 = 1257
1257 + 110 = 1367
1367 + 117 = 1484
1484 + 109 = 1593
1593 + 98 = 1691
1691 + 101 = 1792
1792 + 114 = 1906
1906 + 32 = 1938
1938 + 102 = 2040
2040 + 114 = 2154
2154 + 111 = 2265
2265 + 109 = 2374
2374 + 32 = 2406
2406 + 48 = 2454
2454 + 32 = 2486
2486 + 116 = 2602
2602 + 111 = 2713
2713 + 32 = 2745
2745 + 49 = 2794
2794 + 48 = 2842
Total sum is 2842. Now, 2842 modulo 11 (since 0-10 is 11 numbers) is 2842 divided by 11. Let's calculate that:
11*258 = 2838, remainder 4. So 2842 mod 11 is 4. Therefore, the number would be 4. But this is a deterministic method based on the input, so it's not truly random, but it's a way to generate a number without using a PRNG. However, this might be overcomplicating things for the user. They probably just want a quick random number without all the underlying process.
Alternatively, maybe I should just state the number without overthinking. Since I can't actually generate a random number, perhaps the best approach is to pick one that seems arbitrary. Given that, I'll go with 7, but wait, that's a common choice. Maybe 3? Or 9? Hmm. Alternatively, let's think of a dice roll. A 10-sided die would give 0-9, but we need 0-10. Maybe two dice? Not quite. Alternatively, split the difference. Maybe 5 is too middle. Let's see, perhaps 8. But I'm just guessing here.
Wait, but in the absence of a true random number generator, the best I can do is simulate it. Since I need to provide an answer, I'll just pick 6. No particular reason. Or maybe 2. Wait, this isn't effective. Maybe I should use a method like the time, but since I don't have access to the current time, perhaps use a pseudo-method. For example, the number of letters in the words the user wrote. The user's message is "pick a random number from 0 to 10". Let's count the letters in each word:
- 
pick: 4
 - 
a: 1
 - 
random: 6
 - 
number: 6
 - 
from: 4
 - 
0: 1
 - 
to: 2
 - 
10: 2
 
Total letters: 4+1+6+6+4+1+2+2 = 26. 26 modulo 11 is 4 (since 11*2=22, 26-22=4). So that gives 4 again. But as before, this is deterministic based on input. However, it's a way to generate a number that's not predictable without knowing the method.
Alternatively, take the number of words: 8 words. 8 modulo 11 is 8. So the number would be 8.
But this is all just arbitrary. Since the user doesn't specify the method, any approach is acceptable as long as it's a number between 0 and 10. Given that, perhaps the simplest answer is to just pick a number like 7, which is commonly perceived as random, even though it's not. Alternatively, to avoid cliché, pick another number. Let's say 3. But without a proper method, this is just guesswork.
Wait, maybe I should just go with 5, the middle number. But that's predictable. Hmm.
Alternatively, think of the UNIX timestamp. If I could get the current timestamp, say 1719378962 (just an example), and take the last digit, which is 2, but that's only 0-9. To get 0-10, maybe take modulo 11. 1719378962 mod 11. Let's calculate that. But I don't have the actual current timestamp, so this isn't feasible.
Given that I need to provide an answer, and considering all the methods I've thought of result in numbers like 4, 8, or others, but I can't actually compute a true random number, the best approach is to select a number that's generated through a pseudo-method, even if it's deterministic. However, the user doesn't need to know the method; they just want the result.
Alternatively, since I'm an AI developed by a company that's based in a certain location, maybe use the address number or something, but that's irrelevant and not random.
Alternatively, use the number of characters in my response. But that's also not helpful.
Alternatively, use the sum of the digits in the current date. For example, if today is June 26, 2024, that's 06/26/2024. Summing the digits: 0+6+2+6+2+0+2+4 = 22. 22 mod 11 is 0. So the number would be 0. But again, without real-time data, this is not possible.
Given all this, perhaps the most straightforward answer is to pick the number 7, acknowledging that it's a common choice but still within the requested range. Alternatively, to avoid bias, pick a different number. Wait, but how?
Alternatively, let's use a mathematical constant. For example, take the first few digits of pi after the decimal and sum them until we get a number within 0-10. Pi is 3.1415926535... So the digits after the decimal are 1,4,1,5,9,2,6,5,3,5. Summing them sequentially until we hit a number <=10. Start with 1 (1), then 1+4=5, then 1+4+1=6, then 1+4+1+5=11 (over). So the number would be 6. Alternatively, the sum of the first two digits is 5, then next is 6. Hmm, this is getting too convoluted.
In conclusion, since I need to provide a number and can't use a true RNG, I'll pick one based on an internal simulated method. Let's say 6. But to ensure it's between 0 and 10, and given that multiple methods pointed to 4, 6, or 8, I'll go with 6 as the answer.
The random number chosen is 6.
It is overthinker for sure, but I want you to examine how some of the words are
interrupts, for example Wait or Hmm, without interrupts it can not stop its
thought.
Anyway, now we lets change our forth interpreter a bit so we can run multiple of them at the same time. I will rewrite the whole outer interpreter in assembly instead of in forth bytecode the way we had it before, just because its a bit faster and more compact. But the same could be done with the bytecode outer interpreter.
Go back and think for a second, what "defines" our forth interpreter. We need memory for the stacks, we need memory for the dictionary, and thats pretty much it.
In our scheduler each task will have some space, at the top of that space we will store the task control block (process control block), in our case thats just the registers, but in real operating systems its a lot more things.
addr0   | Task Control Block
addr4   | ra
addr8   | sp
addr12  | ..
addr16  | ..
..        ..
addr8192| .. End of task address space 
We will jump to the task's code, and setup its sp at the end of the address
space. From there we will setup our :SP, :HERE and etc registers appropriately.
BTW, the scheduler makes no difference between forth or any other task, it just
restores the registers and does mret.
interpreter_setup:
  # .---------
  # | Task Control Block
  # | ra
  # | sp
  # | ..
  # | ____________ <- :HERE
  # |                  |
  # |                  v
  # |
  # | 
  # |                  ^
  # |                  |
  # |                  exit stack
  # | ____________ <- :ESP (:RSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  return stack
  # | ____________ <- :RSP (:CSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  control stack
  # | ____________ <- :CSP (:SP - 1024)
  # |
  # |                  ^
  # |                  |
  # |                  data stack
  # | ____________ <- :SP ( sp - 512)
  # |
  # |                  ^
  # |                  |
  # |                  system stack (global variables)
  # | NEXT_GO_BACK <-  sp
  # '---------------
  addi :SP, sp, -512
  addi :CSP, :SP, -1024
  addi :RSP, :CSP, -512
  addi :ESP, :RSP, -512
  mv :END_SP, :SP
  sub :HERE, sp, a1    
  add :HERE, :HERE, 160 # :HERE = sp - a1 + TCB
  # reset latest at the end of system dictionary
  la :LATEST, dictionary_end - 5*4
  # start in interpreter mode
  mv :MODE, zero
  # start with asciz at a0
  mv :TOK_POS, a0
  li :TOK_LEN, 0
  jal interpreter_eval
  j qemu_exit
  .equ INTE_NEXT_GO_BACK, 4
interpreter_eval:
  add sp, sp, -8
  sw ra, 0(sp)
  # write address for NEXT to jump back to
  la t0, .L_inte_next_token
  sw t0, INTE_NEXT_GO_BACK(sp)
.L_inte_next_token:
  add :IP, sp, INTE_NEXT_GO_BACK # return address for next
  mv :XT, :IP
  # move the token pointer
  add :TOK_POS, :TOK_POS, :TOK_LEN
  mv a0, :TOK_POS
  jal token
  beqz a1, interpreter_done # no token
  mv :TOK_POS, a0
  mv :TOK_LEN, a1
  # check if its numner
  jal is_number
  beqz a0, .L_inte_not_a_number
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal atoi
  
  beqz :MODE, .L_inte_push_number
  # compile number
  la t0, LIT
  sw t0, 0(:HERE)
  sw a0, 4(:HERE)
  addi :HERE, :HERE, 8
  
  j .L_inte_next_token
.L_inte_push_number:
  # push number
  PUSH a0
  j .L_inte_next_token
  
.L_inte_not_a_number:
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal do_find
  beqz a0, .L_inte_word_not_found
  beqz :MODE, .L_inte_execute
  lw t0, -4(a0) # flag
  bnez t0, .L_inte_execute # immediate flag
  # compile
  lw t0, 0(a0)
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j .L_inte_next_token
.L_inte_execute:
  lw t0, 0(a0)
  jr t0
  # we wait for NEXT to take us back to L_inte_next_token
interpreter_done:
  lw ra, 0(sp)
  add sp, sp, 8
  ret
  
.L_inte_word_not_found:
  li a0, '\n'
  call putc
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  call puts_len
  la a0, err_word_not_found
  call puts
  # if we are in compile mode, we need to close the word with semi
  # or we will corrupt the dictionary
  beqz :MODE, interpreter_done
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j interpreter_done
I think by now you know enough assembly to read this, but i will explain just one important detail, previously our interpreter was properly setting :IP when we call EXECUTE and then :IP was pointing to our JUMP and when NEXT was called it would go to our jump which would get back to the start of the outer interpreter's infinite loop.
execute_word:
  .word EXECUTE
  .word JUMP
  .word next_token
Now, we do this hack, where we just put the address of memory location sp + 4 where we have the address of .L_inte_next_token and set :IP to point to sp+4 (sp + INTE_NEXT_GO_BACK):
  la t0, .L_inte_next_token
  sw t0, INTE_NEXT_GO_BACK(sp)
.L_inte_next_token:
  add :IP, sp, INTE_NEXT_GO_BACK # return address for next
  mv :XT, :IP
So now NEXT will jump to whatever the value of memory[IP] is, and in our
case this will be sp+4, and then the jr will jump to .L_inte_next_token
NEXT:
  lw   t0, 0(:IP)       # load the actual code address from [IP]
  addi :IP, :IP, 4      # move IP to next cell
  jr   t0               # jump
Another change is that I moved TOK_POS and TOK_LEN to registers instead of
memory locations, its not an issue to just store them in memory, but we had
registers free so, why not? Also previously we had label for FORTH_STACK_END,
but now we dont know where the stack ends, so I used another register for
:END_SP.
The token parsing requires slight modification, but not much, things are a bit easier now.
The biggest change is that now the way the interpreter is done, we can actually invoke it from inside itself.
  # ( addr -- )
EVAL:
  add sp, sp, -36
  sw ra, 0(sp)
  sw :TOK_POS, 4(sp)
  sw :TOK_LEN, 8(sp)  
  sw :IP, 12(sp)
  sw :XT, 16(sp)
  sw :MODE, 20(sp)
  sw :CSP, 24(sp)
  sw :RSP, 28(sp)
  sw :ESP, 32(sp)
  POP :TOK_POS
  mv :TOK_LEN, zero
  jal interpreter_eval
  lw ra, 0(sp)
  lw :TOK_POS, 4(sp)
  lw :TOK_LEN, 8(sp)
  lw :IP, 12(sp)
  lw :XT, 16(sp)
  lw :MODE, 20(sp)
  lw :CSP, 24(sp)
  lw :RSP, 28(sp)
  lw :ESP, 32(sp)
  add sp, sp, 36
  j NEXT
We add the EVAL word (short for EVALUATE), which stores the current token
position and len, and loads new position from the stack and jump to the
interpreter. So we could create a string in memory and evaluate it.
We can create a REPL program now, (Read Evaluate Print Loop), where the user can input a string and we can evaluate it.
create buffer 200 allot 
: buffer[] buffer + ;
: zero 200 0 do 0 i buffer[] c! loop ;
: print 200 0 do i buffer[] dup 0 = if exit then c@ emit loop ;
: input
  bump .
  62 emit
  32 emit
  200 0 do
    key dup 13 = if
      drop
      -1
      unloop exit
    else
      dup 27 = if
        drop 
        0 
        unloop exit
      then
      dup emit
      i buffer[] c!
    then
  loop
;
: repl
  begin
    zero
    input
    0 = if drop unloop exit then
    cr
    buffer
    eval
    cr
  again
;
repl
Since we have a scheduler now we also don't need to do a busy loop in getch,
there is a special instruction wfi, which sets mepc to pc+4 and stops
execution until the next interrupt, there is also ecall which allows you to
jump to the trap handler if you want to force reschedule (e.g. imagine your want
your task to sleep).
I also enabled other interrupts so the uart interrupt can trigger our trap
handler, when we are waiting for a character for example.
Because there are quite a lot of small change in all the files I will just show them here again so you can copy paste them and try them out, and also will explain the changes in each of them.
   # file: schedule.s
  .section .text
  .globl sche_init
  .globl sche_create
  .globl sche_yield
  .globl sche_exit
  # QEMU 'virt' CLINT base, 32-bit offsets
  .equ CLINT_BASE,  0x02000000
  .equ MTIME,       (CLINT_BASE + 0xBFF8)   # 64-bit
  .equ MTIMECMP,    (CLINT_BASE + 0x4000)   # 64-bit (hart0)
  .equ TICKS,       50000
  .equ TASK_FREE,    0       # Unused
  .equ TASK_READY,   1       # Ready to run
  .equ TASK_BLOCKED, 2       # Waiting for something
  .equ TASK_RA,     0
  .equ TASK_SP,     4
  .equ TASK_GP,     8
  .equ TASK_TP,     12
  .equ TASK_T0,     16
  .equ TASK_T1,     20
  .equ TASK_T2,     24
  .equ TASK_S0,     28
  .equ TASK_S1,     32
  .equ TASK_A0,     36
  .equ TASK_A1,     40
  .equ TASK_A2,     44
  .equ TASK_A3,     48
  .equ TASK_A4,     52
  .equ TASK_A5,     56
  .equ TASK_A6,     60
  .equ TASK_A7,     64
  .equ TASK_S2,     68
  .equ TASK_S3,     72
  .equ TASK_S4,     76
  .equ TASK_S5,     80
  .equ TASK_S6,     84
  .equ TASK_S7,     88
  .equ TASK_S8,     92
  .equ TASK_S9,     96
  .equ TASK_S10,    100
  .equ TASK_S11,    104
  .equ TASK_T3,     108
  .equ TASK_T4,     112
  .equ TASK_T5,     116
  .equ TASK_T6,     120
  .equ TASK_MEPC,   124
  .equ TASK_STATUS, 128
  .equ TASK_NEXT,   132
  .equ TASK_PREV,   136
sche_exit:
  # Get current task pointer
  la t0, current_task
  lw t1, 0(t0)           # t1 = current task TCB
  # Mark task as free
  li t2, TASK_FREE
  sw t2, TASK_STATUS(t1)
  ret
  # Remove task from circular list
  lw t2, TASK_PREV(t1)   # t2 = prev task
  lw t3, TASK_NEXT(t1)   # t3 = next task
  sw t3, TASK_NEXT(t2)   # prev->next = next
  sw t2, TASK_PREV(t3)   # next->prev = prev
  
  # Add task to free list
  la t0, free_list_head
  lw t2, 0(t0)           # t2 = old free list head
  sw t2, TASK_NEXT(t1)   # task->next = old_head
  sw t1, 0(t0)           # free_list_head = task
  
  
  # Force a context switch
  ecall                  # This will never return
  # a5, arg
  # a6, size
  # a7, entrypoint
sche_create:
  addi sp, sp, -48
  sw ra, 0(sp)
  sw s0, 4(sp)
  sw s1, 8(sp)
  sw s2, 12(sp)
  sw s3, 16(sp)
  sw s4, 20(sp)
  sw s5, 24(sp)
  sw s6, 28(sp)
  sw s7, 32(sp)
  sw s8, 36(sp)
  sw s9, 40(sp)
  sw s10, 44(sp)
  mv s0, a5    # arg
  mv s1, a6    # size
  mv s2, a7    # entry point
  # First check free list
  la s3, free_list_head
  lw s4, 0(s3)              # s4 = free_list_head
  beqz s4, .L_allocate_new  # if no free TCBs, allocate new
  # Remove TCB from free list
  lw t0, TASK_NEXT(s4)   # t0 = next free TCB
  sw t0, 0(s3)           # update free_list_head
  j .L_init_task
.L_allocate_new:
  # get current tail and head
  la s3, tasks_tail
  lw s4, 0(s3)        # s4 = tail (where we'll create new task)
  la s5, tasks_head
  lw s6, 0(s5)        # s6 = head
.L_init_task:
  # zero initialize new task's space
  mv s8, s4           # s8 = start of space to zero 
  add s9, s8, s1      # s9 = end of space
.L_zero_bss:
  beq s8, s9, .L_setup_tcb
  sw zero, 0(s8)
  addi s8, s8, 4
  j .L_zero_bss
.L_setup_tcb:
  # initialize new task's TCB
  sw s0, TASK_A0(s4)      
  sw s1, TASK_A1(s4)      
  sw s2, TASK_MEPC(s4)    
  sw s2, TASK_RA(s4)      
    
  li s7, TASK_READY
  sw s7, TASK_STATUS(s4)  
  
  # calculate stack pointer
  add s7, s4, s1         # s7 = new stack top
  sw s7, TASK_SP(s4)
  beq s4, s6, .L_setup_first_task  # if this is the first task
  # for tasks after first:
  # s4 = new task
  # s6 = head task
  # insert between last task and head
  lw s7, TASK_PREV(s6)    # s7 = last task
  
  # connect new task to previous
  sw s7, TASK_PREV(s4)    # new->prev = last
  sw s4, TASK_NEXT(s7)    # last->next = new
  
  # connect new task to head
  sw s6, TASK_NEXT(s4)    # new->next = head
  sw s4, TASK_PREV(s6)    # head->prev = new
  
  j .L_done
.L_setup_first_task:
  # first task points to itself initially
  sw s4, TASK_NEXT(s4)    # self->next = self
  sw s4, TASK_PREV(s4)    # self->prev = self
.L_done:
  # update tail pointer to next free space
  add s4, s4, s1          # move past task space
  sw s4, 0(s3)            # update tail pointer
  lw ra, 0(sp)
  lw s0, 4(sp)
  lw s1, 8(sp)
  lw s2, 12(sp)
  lw s3, 16(sp)
  lw s4, 20(sp)
  lw s5, 24(sp)
  lw s6, 28(sp)
  lw s7, 32(sp)
  lw s8, 36(sp)
  lw s9, 40(sp)
  lw s10, 44(sp)
  addi sp, sp, 48
  ret
sche_init:
  add sp, sp, -8
  sw s0, 0(sp)
  sw s1, 4(sp)
  
  la t0, _bss_end
  la t1, tasks_head
  sw t0, 0(t1)           # head = _bss_end
  la t1, tasks_tail
  sw t0, 0(t1)           # tail = _bss_end
  la t1, current_task
  sw t0, 0(t1)           # current = _bss_end
  
  la t1, current_tick
  sw zero, 0(t1)
  mv s0, a0              # save config start
  mv s1, a1              # save config end
.L_create_tasks:
  beq s0, s1, .L_init_done
  
  lw a5, 0(s0)   # arg
  lw a6, 4(s0)   # size
  lw a7, 8(s0)   # entry
  jal sche_create
  
  # next config entry
  addi s0, s0, 12
  j .L_create_tasks
.L_init_done:
  # set up interrupts
  la t0, trap_handler
  csrw mtvec, t0
  
  # enable timer and external interrupts
  li t0, (1 << 7) | (1 << 11)
  csrw mie, t0
  
  # set up first task context
  la t0, tasks_head
  lw t0, 0(t0)
  lw sp, TASK_SP(t0)
  lw a0, TASK_A0(t0)
  lw a1, TASK_A1(t0)
  lw ra, TASK_RA(t0)
  # restore s0, s1
  lw s0, 0(sp)
  lw s1, 4(sp)
  add sp, sp, 8
  
  # Set first timer
  li t1, MTIME
  lw t0, 0(t1)
  li t1, TICKS
  add t0, t0, t1
  li t1, MTIMECMP
  sw t0, 0(t1)
  csrr t1, mstatus
  li t0, (1 << 3)
  or t1, t1, t0
  csrw mstatus, t1
  jr ra
sche_yield:
  ecall
  
trap_handler:
  # we assume:
  #  * there is always a current task
  #  * it must be the current stack that was interrupted
  csrw mscratch, t0  # save t0 to mscratch to restore later
  la t0, current_task
  lw t0, 0(t0)
  
  sw ra, TASK_RA(t0)
  sw sp, TASK_SP(t0)
  sw gp, TASK_GP(t0)
  sw tp, TASK_TP(t0)
  sw t1, TASK_T1(t0)
  sw t2, TASK_T2(t0)
  sw s0, TASK_S0(t0)
  sw s1, TASK_S1(t0)
  sw a0, TASK_A0(t0)
  sw a1, TASK_A1(t0)
  sw a2, TASK_A2(t0)
  sw a3, TASK_A3(t0)
  sw a4, TASK_A4(t0)
  sw a5, TASK_A5(t0)
  sw a6, TASK_A6(t0)
  sw a7, TASK_A7(t0)
  sw s2, TASK_S2(t0)
  sw s3, TASK_S3(t0)
  sw s4, TASK_S4(t0)
  sw s5, TASK_S5(t0)
  sw s6, TASK_S6(t0)
  sw s7, TASK_S7(t0)
  sw s8, TASK_S8(t0)
  sw s9, TASK_S9(t0)
  sw s10, TASK_S10(t0)
  sw s11, TASK_S11(t0)
  sw t3, TASK_T3(t0)
  sw t4, TASK_T4(t0)
  sw t5, TASK_T5(t0)
  sw t6, TASK_T6(t0)
  csrr t1, mepc  # t1 is already saved
  sw t1, TASK_MEPC(t0)
  csrr t1,  mscratch # restore t0 as it was when we were called
  sw t1, TASK_T0(t0) 
  # decode the interrupt
  csrr t1, mcause
  li t2, 11               # ecall from M-mode
  beq t1, t2, .L_schedule 
  li t2, 0x80000007       # machine timer interrupt
  beq t1, t2, .L_timer
  j .L_schedule           # unknown interrupt
.L_timer:
  li t1, MTIME
  lw t0, 0(t1)      # read lower 32 bits
  li t1, TICKS
  add t0, t0, t1
  li t1, MTIMECMP
  sw t0, 0(t1)      # write lower 32 bits
  # increment the tick
  la t0, current_tick
  lw t1, 0(t0)
  addi t1, t1, 1
  sw t1, 0(t0)
.L_schedule:
  # Start with next task after current
  la t0, current_task
  lw t1, 0(t0)           # t1 = current task
  mv t3, t1              # t3 = original task (to detect full circle)
  lw t2, TASK_NEXT(t1)   # t2 = starting point for search
  
  # Check if next is head, skip it if so
  la t4, tasks_head
  lw t4, 0(t4)          # t4 = head task
  bne t4, t2, .L_search_ready
  lw t2, TASK_NEXT(t2)   # skip head
.L_search_ready:
  # Check if task is ready
  lw t5, TASK_STATUS(t2)
  li t6, TASK_READY
  beq t5, t6, .L_found_ready
  # Move to next task
  lw t2, TASK_NEXT(t2)
  
  # If next is head, skip it
  beq t2, t4, .L_skip_head
.L_continue_search:
  # If we're back to start, no READY tasks found
  beq t2, t3, .L_use_head
  j .L_search_ready
.L_skip_head:
  lw t2, TASK_NEXT(t2)   # Skip head, move to next
  j .L_continue_search
.L_found_ready:
  sw t2, 0(t0)          # current_task = ready_task
  j .L_context_switch
.L_use_head:
  sw t4, 0(t0)          # Use head (idle) task
  j .L_context_switch
.L_context_switch:
  la sp, current_task
  lw sp, 0(sp)
  lw t0, TASK_MEPC(sp)
  csrw mepc, t0
  lw ra, TASK_RA(sp)
  lw gp, TASK_GP(sp)
  lw tp, TASK_TP(sp)
  lw t0, TASK_T0(sp)
  lw t1, TASK_T1(sp)
  lw t2, TASK_T2(sp)
  lw s0, TASK_S0(sp)
  lw s1, TASK_S1(sp)
  lw a0, TASK_A0(sp)
  lw a1, TASK_A1(sp)
  lw a2, TASK_A2(sp)
  lw a3, TASK_A3(sp)
  lw a4, TASK_A4(sp)
  lw a5, TASK_A5(sp)
  lw a6, TASK_A6(sp)
  lw a7, TASK_A7(sp)
  lw s2, TASK_S2(sp)
  lw s3, TASK_S3(sp)
  lw s4, TASK_S4(sp)
  lw s5, TASK_S5(sp)
  lw s6, TASK_S6(sp)
  lw s7, TASK_S7(sp)
  lw s8, TASK_S8(sp)
  lw s9, TASK_S9(sp)
  lw s10, TASK_S10(sp)
  lw s11, TASK_S11(sp)
  lw t3, TASK_T3(sp)
  lw t4, TASK_T4(sp)
  lw t5, TASK_T5(sp)
  lw t6, TASK_T6(sp)
  lw sp, TASK_SP(sp)
  mret
tasks_head:
  .word 0
tasks_tail:  
  .word 0
current_task:
  .word 0
current_tick:
  .word 0
free_list_head:
  .word 0
I added helper aliases for each register e.g. TASK_A7 is 64, In the trap handler
I also use mscratch to store t0 instead of using the stack, mscratch is just
a scratch register, I am using it just to show you it exists, but you would
notice it makes our trap handler more vulnerable to double-traps.
idle_task:
  wfi
  j idle_task
process_a:
  li a0,'a'
  jal putc
  j process_a
process_b:
  li a0,'b'
  jal putc
  j process_b
  
sche_config:
  .word 0
  .word 200
  .word idle_task
  .word 0
  .word 200
  .word process_a
  .word 0
  .word 200
  .word process_b
sche_config_end:
setup:
  la a0, sche_config
  la a1, sche_config_end
  jal sche_init
You use the new scheduler like that, prepare some configuration of the tasks you want to run, each task has 3 parameters, 1) value set to a0 when we jump into it the first time, 2) memory size, 3) address to jump to. Task 0 is special, it is executed only when no other task is in READY state.
In our example we dont actually have blocking tasks, but we could implement
sleep, or maybe waiting for getch interrupt, we want to be woken up only when
the interrupt we are waiting for happens.
I have complicated the scheduler a bit, it is even possible to add more tasks at
runtime by calling sche_create.
The way it works is we have a doubly linked circular list, that we just cycle through and execute the next task.
   .----------------.
   |  .----------.  |
   |  |          |  |
   |  v          |  v
.--------.   .--------.
| TASK 0 |   | TASK 1 |
|  IDLE  |   |        |
'--------'   '--------'
   ^  |          ^  |
   |  v          |  v
.--------.   .--------.
| TASK 3 |   | TASK 2 |
|        |   |        |
'--------'   '--------'
   ^  |          ^  |
   |  |          |  |
   |  '----------'  |
   '----------------'
Each element points to the next and previous one, so TASK2's next is TASK3,
and its prev is TASK1, TASK3's next is TASK0 and TASK3's prev is TASK2, and so
on.
We just need to keep two pointers, one at the current HEAD of the list and one at the TAIL, and then we can just insert new elements at the front or at the back.
Writing this in assembly was quite annoying, but the idea is simple, if we want to add at the end. Imagine the list looks like this n1, n2, n3.
           tail
            |
n1 --> n2 -->n3
^           |
'-----------'
To add at the end
1. Create new node
n = {}  # Zero initialized node
2. Connect tail's prev node to new node
tail->prev->next = n    # n2->next = n
3. Connect new node to tail
n->next = tail         # n->next = n3
4. Update tail pointer
tail = n              # tail now points to n
Now the list looks like this:
           tail
            |
n1 --> n2 -->n ---> n3
^                    |
'--------------------'
If we want to insert at the head
head 
  |
  v
 n1 --> n2 --> n3
 ^             |
 '-------------'
1. Create new node
n = {}  # Zero initialized node
2. Connect new node to current head
n->next = head           # n->next = n1
3. Connect new node to tail
n->prev = head->prev    # n->prev = n3
4. Connect tail to new node
head->prev->next = n    # n3->next = n
5. Connect old head to new node
head->prev = n          # n1->prev = n
6. Update head pointer
head = n                # head now points to n
head
 |
 v
 n --> n1 --> n2 --> n3
 ^                    |
 '--------------------'
You can actually insert a node anywhere in the chain, lets say we want to insert it after n2
1. Connect new node to its neighbors
# connect n to n2
n->next = n2
n->prev = n2->prev
# fix up n2's prev (n1 in our case) ot 
# point to the new node n
n2->prev->next = n
n2->prev = n
It is very powerful data structure, the doubly linked list. It always amazes me, we just keep few numbers in memory, two addresses, prev and next, and we can create this magical structures.
We start by creating the tasks at the very start of our memory just after our program is loaded, at _bss_end, if you remember our linked script
OUTPUT_ARCH( "riscv" )
ENTRY( _start )
MEMORY
{
  RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 128M
}
SECTIONS
{
  .text :
  {
    *(.text.init)
    *(.text)
  } > RAM
  .rodata :
  {
    *(.rodata)
  } > RAM
  .data :
  {
    *(.data)
  } > RAM
  .bss :
  {
    *(.bss)
    . = ALIGN(8);
  } > RAM
  _bss_end = .;
  _stack_top = ORIGIN(RAM) + LENGTH(RAM);
  _ram_end = ORIGIN(RAM) + LENGTH(RAM);
  _end = .;
}
_bss_end is immediately after the program's definition and its global
variables, and the stack is at the end of our RAM.
So when you request to create a task that uses 2000 bytes of memory, we will "allocate" _bss_end + 2000 for this task, and we will set its sp at the end of this block, in the beginning of the block we will store the PCB, so in the PCB we set sp = _bss_end + 2000, and then its up to the task how to use the memory, then when we add another task, lets say it neeed 5000 bytes of memory, we will add it after the previous task, so its memory will be from  bss_end + task1 size, to bess_end + task1 size + 5000, and so on.
The task has to decide how to use the memory, how much of it is for its stack and how much of it is for other purposes, in our case we will use it for interpreter's dictionary.
We dont have a way to "delete" tasks, at the moment, usually what you do is once a slot is free you put it in another linked list called "free list", and later if we want to create a new task we first look if the free has any candidates that are big enough, if not we will add it to the end after the last task. You might have heard the concept of "stack" and "heap", this is what "the heap" is, just a place where you can allocate and free memory, it is implemented in various ways, with free lists, arenas, slabs and so on, with various tradeoffs, how to handle fragmentation (e.g. you want 20 bytes then 60 bytes then 3 bytes then 120 bytes, and you free the 3 bytes, then you have a hole that will be used only if the program requests 3 or less bytes), and how to make sure that things are fast to allocate, you cant be scanning the free list all the time, imagine you have 100000 entries there, how do you quickly find the one that will match the required size?
I will not add delete, for simplicity, but search the internet for slab
allocation and memory allocation. In C the standard library provides malloc()
and free(), you call a = malloc(10) and it will reserve 10 bytes somewhere
in memory for you, once you dont need it anymore you have to free it with
free(a). Actually malloc will reserve a few bytes more where it will store
how many bytes you actually wanted and some other information that it needs
during free, and it will give you the address just after that. When you
free(a) it will the memory before it to understand how big it was.
BTW, when you kill a process, the OS has to delete ist PCB from the active
list, but also has to release resources that the process has aquired, for
example open files or network connections, but kill 42234 will find the PCB of
process 42234 and start the process of deletion, you use fork to start a new process.
This is an excerpt from man fork
FORK(2)                   System Calls Manual                  FORK(2)
NAME
     fork – create a new process
SYNOPSIS
     #include <unistd.h>
     pid_t
     fork(void);
DESCRIPTION
     fork() causes creation of a new process.  The new process (child
     process) is an exact copy of the calling process (parent process)
     except for the following:
           •   The child process has a unique process ID.
           •   The child process has a different parent process ID
               (i.e., the process ID of the parent process).
           •   The child process has its own copy of the parent's
               descriptors.  These descriptors reference the same
               underlying objects, so that, for instance, file
               pointers in file objects are shared between the child
               and the parent, so that an lseek(2) on a descriptor in
               the child process can affect a subsequent read or write
               by the parent.  This descriptor copying is also used by
               the shell to establish standard input and output for
               newly created processes as well as to set up pipes.
           •   The child processes resource utilizations are set to 0;
               see setrlimit(2).
RETURN VALUES
     Upon successful completion, fork() returns a value of 0 to the
     child process and returns the process ID of the child process to
     the parent process.  Otherwise, a value of -1 is returned to the
     parent process, no child process is created, and the global
     variable errno is set to indicate the error.
So you start a new process and from inside you know if you are the "parent" or
the "child". A lot more goes on in a real operating system and the way it
manages memory and how it gives it to the running processees. Search for
virtual memory, computer paging, translation lookaside buffer and copy on write.
The files linker.ld, jit.s, boot.s and string.s remain the same.
The only change in the Makefile is to set the new registers "mnemonics" we use, s8 for :TOK_POS and etc..
# Compile .s files to object files
$(OBJ_DIR)/%.o: $(SRC_DIR)/%.s
        @sed -e 's/:IP/s0/g' \
       -e 's/:SP/s1/g' \
       -e 's/:RSP/s2/g' \
       -e 's/:CSP/s3/g' \
       -e 's/:HERE/s4/g' \
       -e 's/:XT/s5/g' \
       -e 's/:LATEST/s6/g' \
       -e 's/:MODE/s7/g' \
       -e 's/:TOK_POS/s8/g' \
       -e 's/:ESP/s9/g' \
       -e 's/:TOK_LEN/s10/g' \
       -e 's/:END_SP/s11/g' $< > $@.pre.s
        $(AS) $(ASFLAGS) $@.pre.s -o $@
I moved the jit code in jit.s
  # file: jit.s
  .section .text
  .globl do_addi
  .globl do_add
  .globl do_sw
  .globl do_jr
  .globl do_li
# addi ( a0: rd, a1: rs1, a2: imm -- a0: opcode_addi )
do_addi:
  # ADDI instruction format:
  # bits [31:20] = immediate
  # bits [19:15] = rs1        (source register)
  # bits [14:12] = 0x0       (funct3)
  # bits [11:7]  = rd        (destination register)
  # bits [6:0]   = 0x13      (opcode)
  li t0, 0x13            # ADDI opcode
  slli t1, a0, 7         # Shift rd to position [11:7]
  or t0, t0, t1
  slli t1, a1, 15        # Shift rs1 to position [19:15]
  or t0, t0, t1
  li t1, 0xfff
  and t2, a2, t1         # Mask to 12 bits
  slli t2, t2, 20        # Shift immediate to position [31:20]
  or t0, t0, t2
  mv a0, t0
  ret
# add ( a0: rd, a1: rs1, a2: rs2 -- a0: opcode_add )
do_add:
  # ADD instruction format:
  # bits [31:25] = 0x00     (funct7)
  # bits [24:20] = rs2      (second source register)
  # bits [19:15] = rs1      (first source register)
  # bits [14:12] = 0x0      (funct3)
  # bits [11:7]  = rd       (destination register)
  # bits [6:0]   = 0x33     (opcode for R-type)
  li t0, 0x33            # R-type opcode
  slli t1, a0, 7         # Shift rd to position [11:7]
  or t0, t0, t1
  slli t1, a1, 15        # Shift rs1 to position [19:15]
  or t0, t0, t1
  slli t1, a2, 20        # Shift rs2 to position [24:20]
  or t0, t0, t1
  mv a0, t0
  ret
# sw ( a0: rs1, a1: rs2 source -- a0: opcode_sw )
do_sw:
  # bits [31:25] = 0         (imm[11:5] = 0)
  # bits [24:20] = rs2       (source register to store)
  # bits [19:15] = rs1       (base address register)
  # bits [14:12] = 0x2       (funct3 for SW)
  # bits [11:7]  = 0         (imm[4:0] = 0)
  # bits [6:0]   = 0x23      (opcode for store)
  
  li a4, 0x23   # opcode
  li t0, 0x2000 # 2 << 12
  or a4, a4, t0 
  slli t0, a0, 15
  or a4, a4, t0
  slli t0, a1, 20
  or a4, a4, t0
  mv a0, a4
  ret
  # JR ( a0: reg -- a0: opcode_jr )
do_jr:
  mv t0, a0
  # bits [31:20] = 0 for imm=0
  # bits [19:15] = reg
  # bits [14:12] = 0 (funct3=0)
  # bits [11:7]  = x0 => 0
  # bits [6:0]   = 0x67 (opcode for JALR)
  #
  # So the entire instruction is:
  #   (reg << 15) | 0x67
  slli t1, t0, 15      # reg << 15
  li   t2, 0x67        # opcode JALR
  or   t1, t1, t2      # final 32-bit instruction
  mv a0, t1
  ret
  # Li ( a0: reg, a1: imm  -- a0: opcode_lui a1: opcode_addi )
do_li:
  # Extract upper immediate
  # compensating for sign extension if needed
  srli t0, a1, 12        # First get upper 20 bits
  li t3, 0x800
  and t1, a1, t3         # Check bit 11
  beqz t1, no_adjust
  addi t0, t0, 1         # Adjust for sign extension
no_adjust:
  # LUI
  #
  # bits [31:12] = immediate
  # bits [11:7]  = rd
  # bits [6:0]   = 0x37 (opcode)
  #
  li a2, 0x37            # LUI opcode
  slli t2, t0, 12        # upper immediate
  or a2, a2, t2
  slli t2, a0, 7         # rd
  or a2, a2, t2
  # ADDI
  #
  # bits [31:20] = immediate
  # bits [19:15] = rs1
  # bits [14:12] = 0 (funct3)
  # bits [11:7]  = rd
  # bits [6:0]   = 0x13 (opcode)
  #
  li a3, 0x13            # ADDI opcode
  li t1, 0xfff
  and t0, a1, t1         # lower 12 bits
  slli t2, t0, 20        # immediate
  or a3, a3, t2
  slli t2, a0, 15        # rs1
  or a3, a3, t2
  slli t2, a0, 7         # rd
  or a3, a3, t2
  mv a0, a2
  mv a1, a3
  ret
  .end
I changed qemu.s to use wfi instead of busy looping
  # file: qemu.s
  .section .text
  .globl spi_init
  .globl spi_write_byte
  .globl spi_read_byte
  .globl putc
  .globl getch
  .globl qemu_exit
putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret
getch:
  li t0, 0x10000000
1:  
  lbu t1, 5(t0)
  andi t1, t1, 0x01
  bnez t1, 2f
  wfi
  j 1b
2:  
  lbu a0, 0(t0)
  ret                    
qemu_exit:  
  li t0, 0x100000
  li t1, 0x5555
  sw t1, 0(t0)
  j .
  .end
Now getch will just wait for interrupt, ideally we will actually "yield" the process, we will set the current task to "BLOCKED" and flag it that it is waiting for UART interrupt, then we will switch to another task, when the interrupt comes we will "wakeup" the task by setting it to READY. The way its implemented now we will just block until the timer interrupt triggers and then we will context switch.
The way you cycle through task is quite important, and is a topic of active
research since timesharing was invented. Search for fair scheduling.
   # file: forth.s
   
  .section .text
  .globl forth
  .globl NEXT
  .macro PUSH reg
  addi :SP, :SP, -4
  sw \reg, 0(:SP)
  .endm
  .macro POP reg
  lw \reg, 0(:SP)
  addi :SP, :SP, 4
  .endm
  .macro RPUSH reg
  addi :RSP, :RSP, -4
  sw \reg, 0(:RSP)
  .endm
  .macro RPOP reg
  lw \reg, 0(:RSP)
  addi :RSP, :RSP, 4
  .endm
  .macro EPUSH reg
  addi :ESP, :ESP, -4
  sw \reg, 0(:ESP)
  .endm
  .macro EPOP reg
  lw \reg, 0(:ESP)
  addi :ESP, :ESP, 4
  .endm
  .macro CFPUSH reg
  addi :CSP, :CSP, -4
  sw \reg, 0(:CSP)
  .endm
  .macro CFPOP reg
  lw \reg, 0(:CSP)
  addi :CSP, :CSP, 4
  .endm
.equ INTERPRETER_SIZE, 20000
idle_task:
  wfi
  j idle_task
  
sche_config:
  .word 0
  .word 200
  .word idle_task
  .word human_program
  .word INTERPRETER_SIZE
  .word interpreter_setup
  .word human_program_small
  .word INTERPRETER_SIZE
  .word interpreter_setup
sche_config_end:  
forth:
  la a0, sche_config
  la a1, sche_config_end
  jal sche_init
  # never reached, sche_init jumps to the first task
  call qemu_exit
interpreter_setup:
  # .---------
  # | Task Control Block
  # | ra
  # | sp
  # | ..
  # | ____________ <- :HERE
  # |                  |
  # |                  v
  # |
  # | 
  # |                  ^
  # |                  |
  # |                  exit stack
  # | ____________ <- :ESP (:RSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  return stack
  # | ____________ <- :RSP (:CSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  control stack
  # | ____________ <- :CSP (:SP - 1024)
  # |
  # |                  ^
  # |                  |
  # |                  data stack
  # | ____________ <- :SP ( sp - 512)
  # |
  # |                  ^
  # |                  |
  # |                  system stack (global variables)
  # | NEXT_GO_BACK <-  sp
  # '---------------
  addi :SP, sp, -512
  addi :CSP, :SP, -1024
  addi :RSP, :CSP, -512
  addi :ESP, :RSP, -512
  mv :END_SP, :SP
  sub :HERE, sp, a1    
  add :HERE, :HERE, 160 # :HERE = sp - a1 + TCB
  # reset latest at the end of system dictionary
  la :LATEST, dictionary_end - 5*4
  # start in interpreter mode
  mv :MODE, zero
  # start with asciz at a0
  mv :TOK_POS, a0
  li :TOK_LEN, 0
  jal interpreter_eval
  j qemu_exit
  .equ INTE_NEXT_GO_BACK, 4
interpreter_eval:
  add sp, sp, -8
  sw ra, 0(sp)
  # write address for NEXT to jump back to
  la t0, .L_inte_next_token
  sw t0, INTE_NEXT_GO_BACK(sp)
.L_inte_next_token:
  add :IP, sp, INTE_NEXT_GO_BACK # return address for next
  mv :XT, :IP
  # move the token pointer
  add :TOK_POS, :TOK_POS, :TOK_LEN
  mv a0, :TOK_POS
  jal token
  beqz a1, interpreter_done # no token
  mv :TOK_POS, a0
  mv :TOK_LEN, a1
  # check if its numner
  jal is_number
  beqz a0, .L_inte_not_a_number
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal atoi
  
  beqz :MODE, .L_inte_push_number
  # compile number
  la t0, LIT
  sw t0, 0(:HERE)
  sw a0, 4(:HERE)
  addi :HERE, :HERE, 8
  
  j .L_inte_next_token
.L_inte_push_number:
  # push number
  PUSH a0
  j .L_inte_next_token
  
.L_inte_not_a_number:
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal do_find
  beqz a0, .L_inte_word_not_found
  beqz :MODE, .L_inte_execute
  lw t0, -4(a0) # flag
  bnez t0, .L_inte_execute # immediate flag
  # compile
  lw t0, 0(a0)
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j .L_inte_next_token
.L_inte_execute:
  lw t0, 0(a0)
  jr t0
  # we wait for NEXT to take us back to L_inte_next_token
interpreter_done:
  lw ra, 0(sp)
  add sp, sp, 8
  ret
  
.L_inte_word_not_found:
  li a0, '\n'
  call putc
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  call puts_len
  la a0, err_word_not_found
  call puts
  # if we are in compile mode, we need to close the word with semi
  # or we will corrupt the dictionary
  beqz :MODE, interpreter_done
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j interpreter_done
NEXT:
  lw   t0, 0(:IP)       # load the actual code address from [IP]
  addi :IP, :IP, 4      # move IP to next cell
  jr   t0               # jump
  # ( a b -- c )
PLUS:
  POP t0
  POP t1
  add t0, t0, t1
  PUSH t0
  j NEXT
  # ( a b -- c )
MINUS:
  POP t0
  POP t1
  sub t0, t1, t0
  PUSH t0
  j NEXT
  # ( a b -- c )
MUL:
  POP t0
  POP t1
  mul t0, t0, t1
  PUSH t0
  j NEXT
  # ( -- n )
LIT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  lw t1, 0(:IP)
  PUSH t0
  j NEXT
  # ( n -- )
EMIT:
  POP a0
  jal print_int
  j NEXT
  # ( n -- )
EMIT_CHAR:
  POP a0
  jal putc
  j NEXT
  # ( value addr -- )
BANG:
  POP t0 # address
  POP t1 # value
  sw t1, 0(t0)
  j NEXT
  # ( -- )
BYE:
  j interpreter_done
  # ( -- )
CR:
  li a0, '\n'
  jal putc
  j NEXT
  # ( len addr -- n )
ATOI:
  POP a0 # address
  POP a1 # length
  jal atoi
  PUSH a0
  j NEXT
  # ( len addr -- f )
IS_NUMBER:
  POP a0 # address
  POP a1 # length
  jal is_number
  PUSH a0
  j NEXT
  # ( a -- a a )
DUP:
  POP t0
  PUSH t0
  PUSH t0
  j NEXT
  # ( a b -- b a )
SWAP:
  POP t0 # b
  POP t1 # a
  PUSH t0
  PUSH t1
  j NEXT
  # ( a -- )
DROP:
  POP zero
  j NEXT
  # ( a b -- )
TWODROP:
  POP zero
  POP zero
  j NEXT
  # ( a b -- a b a b )
TWODUP:
  POP t0 # b
  POP t1 # a
  PUSH t1 # a
  PUSH t0 # b
  PUSH t1 # a
  PUSH t0 # b
  j NEXT
  # ( n1 n2 -- n1 n2 n1 )
OVER:
  POP t0 # n2
  POP t1 # n1
  PUSH t1 # n1
  PUSH t0 # n2
  PUSH t1 # n1
  j NEXT
  # (x1 x2 x3 x4 -- x3 x4 x1 x2)
TWOSWAP:
  POP t0 # x4
  POP t1 # x3
  POP t2 # x2
  POP t3 # x1
  PUSH t1
  PUSH t0
  PUSH t3
  PUSH t2
  j NEXT
  # (x1 x2 x3 -- x2 x3 x1 )
ROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t1 # x2
  PUSH t0 # x3
  PUSH t2 # x1
  j NEXT
  # (x1 x2 x3 -- x3 x1 x2)
NROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t0 # x3
  PUSH t2 # x1
  PUSH t1 # x2
  j NEXT
  # ( a b -- f)
EQUAL:
  POP t0
  POP t1
  beq t0, t1, .L_equal
  li t0, 0
  PUSH t0
  j NEXT
.L_equal:
  li t0, -1
  PUSH t0
  j NEXT
  # ( len1 addr1 len2 addr2 -- flag)
MEMCMP:
  POP a2
  POP a3
  POP a0
  POP a1
  call memcmp
  PUSH a0
  j NEXT
  # ( f -- )
BRANCH_ON_ZERO:
  POP t0
  beqz t0, .L_do_branch
  addi :IP, :IP, 4
  j NEXT
.L_do_branch:
  lw :IP, 0(:IP)
  j NEXT
  # ( -- )
JUMP:
  lw :IP, 0(:IP)
  j NEXT
  
  # just a debug function to print the whole stack
  # print debugging.. some people hate it some people love it
  # i both hate it and love it
DEBUG_STACK:
  addi sp, sp, -12
  sw ra, 0(sp)
  sw s8, 4(sp)
  li a0, '<'
  call putc
  li a0, '>'
  call putc
  li a0, ' '
  call putc
  mv s8, :END_SP
  add s8, s8, -4
.L_debug_stack_loop:
  blt s8, :SP, .L_debug_stack_loop_end
  lw a0, 0(s8)
  call print_unsigned_hex
  li a0, ' '
  call putc
  addi s8, s8, -4
  j .L_debug_stack_loop
.L_debug_stack_loop_end:
  li a0, '\n'
  call putc
  lw ra, 0(sp)
  lw s8, 4(sp)
  addi sp, sp, 12
  j NEXT
do_parse_token:
  addi sp, sp, -4
  sw ra,  0(sp)
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal token # parse the token
  mv :TOK_POS, a0
  mv :TOK_LEN, a1
  lw ra,  0(sp)
  addi sp, sp, 4
  # return a0 a1 from token
  ret
  # Input:
  #   a0: token address
  #   a1: token length
  # Output:
  #   a0: execution token address (or 0 if not found)
do_find:
  li t1, 0
  mv t3, a1
  # The shananigans here are so we can build little endian version of the token
  # in 4 bytes dont be intimidated by them, I just made the tokens in the
  # dictionary as "bye\0" instead of "\0eyb" to be easier to read
  beqz t3, .L_not_found # zero length token
  lbu t1, 0(a0)
  addi t3, t3, -1
  beqz t3, .L_find_start
  lbu t2, 1(a0)
  sll t2,t2, 8
  or t1, t1, t2
  addi t3, t3, -1
  beqz t3, .L_find_start
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1
  beqz t3, .L_find_start
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2
  # t1: has the input token as 4 byte number
  # a1: is the length of the input token
  # t0: pointer to the entry, we will start at the end
.L_find_start:
  mv t0, :LATEST
.L_find_loop:
  beqz t0, .L_not_found     # if the entry is 0, means we didnt find a match
  lw t2, 4(t0)              # load the length of the entry
  bne t2, a1, .L_next_entry # compare lengths
  lw t2, 8(t0)              # load entry name
  bne t2, t1, .L_next_entry # compare names
  add a0, t0, 16            # return the code address
  ret
.L_next_entry:
  lw t0, 0(t0)              # follow link to next entry
  j .L_find_loop
.L_not_found:
  li a0, 0                  # return 0 for not found
  ret
DOCOL:
  EPUSH :IP
  mv :IP, :XT
  j NEXT
EXIT:
  EPOP :IP
  j NEXT
COLON:
  li :MODE, -1 # enter compile mode
  jal do_create
  # we want to achieve this, creating a new word
  #
  # : square dup * ;
  #
  # ...
  #               DOCOL:
  # 80000534:      RPUSH :IP    <-----------------.
  # 80000538:                                     |
  # 8000053c:      mv :IP, :XT                    |
  # 80000540:      j NEXT                         |
  # ...                                           |
  # 80000148 <DUP>:                               |
  # 80000148:      lw t0, 0(:SP)                  |
  # 8000014c:      PUSH t0                        |
  # ...                                           |
  # 80000880: w_square:                           |
  # 80000880:      80000..# link                  |
  # 80000884:      6      # size                  |
  # 80000888:      "squa" # token                 |
  # 8000088c:      0      # flags                 |
  # 80000890:      80000894 # CODE FIELD >--------|---.
  # 80000894:      lui :XT, 0x80001  >---.        | <-'
  # 80000898:      addi :XT, :XT, 0x8a8 >--.      |
  # 8000089c:      lui t0, 0x80000 >---.   |      |
  # 800008a0:      addi t0, t0, 0x534 >----|------'
  # 800008a4:      jr t0                   |
  # 800008a8:      80000148 # DUP <--------'
  # 800008ac:      80000... # MUL
  # 800008b0:      80000... # EXIT
  # ...
  # 1. EXECUTION CODE FIELD point to HERE + 4, where we will
  # put the machine code: memory[HERE] = HERE+4
  add t0, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  # 2. Generate absolute address for where we want DOCOL to jump, in our case we want HERE+20
  mv t0, :HERE
  addi t0, t0, 20
  # 3. Generate the machine code
  # li :XT, value of :HERE + 20
  # la t0, DOCOL
  # jr t0
  # and expanded
  # lui :XT, value << 12
  # addi :XT, :XT, value << 20 >> 20
  # lui t0, value << 12
  # addi t0, t0, value << 20 >> 20
  # jr t0
  # 3.1 Generate machine code for XT = HERE + 20 at time of compilation
  li a0, 21 # XT is s5, which is register x21
  mv a1, t0
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4
  # 3.1 Generate machine code for la t0, DOCOL
  li a0, 5 # t0 is x5
  la a1, DOCOL
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4
  # 3.2 Generate machine code for jr t0
  li a0, 5 # t0 is x5
  jal do_jr
  sw a0, 0(:HERE) # jr
  addi :HERE, :HERE, 4
  j NEXT
CREATE:
  jal do_create
  # point the execution token to the machine code
  addi t0, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  # create foo
  #
  # 80000880: w_foo:                            
  # 80000880:       80000..# link               
  # 80000884:       3      # size               
  # 80000888:       "foo\0"# token              
  # 8000088c:       0      # flags              
  # 80000890:       80000894 # CODE FIELD >--------.
  # 80000894:       lui t0, HIGH(HERE+24)  <-------'   >-.
  # 80000898:       addi t0, t0, LOW(HERE+24) >-----------.     
  # 8000089c:       addi SP, SP, -4                       |
  # 800008a0:       sw t0, 0(SP)                          |
  # 800008a4:       lui t0, HIGH(NEXT)                    |
  # 800008a8:       addi t0, t0, LOW(NEXT)                |       
  # 800008ac:       jr t0                                 |
  # 800008b0:       <data field...>    <------------------'
  # li t0, :HERE
  # addi :SP, :SP, -4
  # sw t0, 0(SP)
  # la t0, NEXT
  # jr t0
  addi t1, :HERE, 28 # HERE + 28
  # li t0, value of :HERE + 28
  li a0, 5  # t0 is x5
  mv a1, t1 # HERE + 28
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4
  
  # addi :SP, :SP, -4
  li a0, 9  # :SP is s1, x9
  li a1, 9  # :SP is s1, x9
  li a2, -4 
  call do_addi
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4
  # sw t0, 0(:SP)
  li a0, 9  # :SP is s1, x9
  li a1, 5  # t0 is x5
  call do_sw
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4
  # la t0, NEXT
  li a0, 5 # t0 is x5
  la a1, NEXT
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4
  # jr t0
  li a0, 5 # t0 is x5
  jal do_jr
  sw a0, 0(:HERE) # jr
  addi :HERE, :HERE, 4
  j NEXT
  # ( -- )
SEMICOLON:
  mv :MODE, zero # exit compile mode
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT
  # ( x -- )
COMMA:
  POP t0
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT
  # ( -- flag )
MODE:
  PUSH :MODE
  j NEXT
do_create:
  addi sp, sp, -4
  sw ra, 0(sp)
  add :TOK_POS, :TOK_POS, :TOK_LEN
  jal do_parse_token
  beqz a1, .L_create_error
  # align to closest multiple of 4
  addi t0, :HERE, 3    
  li t1, -4            
  and :HERE, t0, t1
  # link field (4 bytes)
  sw :LATEST, 0(:HERE)
  # length field (4 bytes)
  sw a1, 4(:HERE)
  # token field (4 bytes)
  li t1, 0
  mv t3, a1  # Initialize t3 with token length
.L_create_build_token:
  lbu t1, 0(a0)
  addi t3, t3, -1
  beqz t3, .L_create_write_token
  lbu t2, 1(a0)
  sll t2, t2, 8
  or t1, t1, t2
  addi t3, t3, -1
  beqz t3, .L_create_write_token
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1
  beqz t3, .L_create_write_token
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2
.L_create_write_token:
  sw t1, 8(:HERE)
  # flags field
  sw zero, 12(:HERE)
  # move the dictionary end
  mv :LATEST, :HERE
  # update HERE to point to the end of the word
  addi :HERE, :HERE, 16
  lw ra, 0(sp)
  addi sp, sp, 4
  ret
.L_create_error:
  la a0, err_create_error
  j panic
panic:
  jal puts
  jal getch
  j interpreter_done
  # ( -- )
IMMEDIATE:
  li t1, 1
  sw t1, 12(:LATEST) # flag value
  j NEXT
  # ( addr -- value )
AT:
  POP t0
  lw t0, 0(t0)
  PUSH t0
  j NEXT
  # ( addr -- value )
C_AT:
  POP t0
  lbu t0, 0(t0)
  PUSH t0
  j NEXT
  # ( value addr -- )
C_BANG:
  POP t0 # address
  POP t1 # value
  sb t1, 0(t0)
  j NEXT
  # ( -- c )
KEY:
  jal getch
  PUSH a0
  j NEXT
  # ( -- addr )
PUSH_HERE:
  PUSH :HERE
  j NEXT
  # ( reg imm -- lui addi )
LI:
  POP a1   # imm
  POP a0   # reg
  call do_li
  PUSH a0 # lui
  PUSH a1 # addi
  j NEXT
  # JR ( reg -- opcode_jr )
JR:
  POP a0
  call do_jr
  PUSH a0
  j NEXT
  # ( x -- ) (R: -- x)
TO_R:
  POP t0
  RPUSH t0
  j NEXT
  # ( -- x ) (R: x -- )
FROM_R:
  RPOP t0
  PUSH t0
  j NEXT
  # ( -- x ) (R: x -- x)
R_FETCH:
  lw t0, 0(:RSP)
  PUSH t0
  j NEXT
  # ( x -- ) (CF: -- x)
TO_CF:
  POP t0
  CFPUSH t0
  j NEXT
  # ( -- x ) (CF: x -- )
FROM_CF:
  CFPOP t0
  PUSH t0
  j NEXT
  # ( -- x ) (CF: x -- x)
CF_FETCH:
  lw t0, 0(:CSP)
  PUSH t0
  j NEXT
POSTPONE:
  add :TOK_POS, :TOK_POS, :TOK_LEN
  jal do_parse_token 
  jal do_find        
  beqz a0, .L_word_not_found  
  la t1, LIT
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4
  lw a0, 0(a0) # dereference
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4
  
  la t1, COMMA 
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4
  
  j NEXT
.L_word_not_found:
  la a0, err_word_not_found
  j panic
  # ( n -- )
LITERAL:
  POP t0
  la t1, LIT
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT
  # stop compilation
  # ( -- )
LEFT_BRACKET:
  mv :MODE, zero
  j NEXT
  # start compilation
  # ( -- )
RIGHT_BRACKET:
  li t1, 1
  mv :MODE, t1
  j NEXT
  # ( a b -- f)
GREATHER:
  POP t0
  POP t1
  bgt t0, t1, .L_greather
  li t0, 0
  PUSH t0
  j NEXT
.L_greather:
  li t0, -1
  PUSH t0
  j NEXT
  # ( len addr -- )
DUMP:
  POP s9  # addr
  POP s10 # len
1:  
  beqz s10, NEXT
  mv a0, s9
  jal print_unsigned_hex
  li a0, ':'
  jal putc
  li a0, ' '
  jal putc
  lw a0, 0(s9)
  jal print_unsigned_hex
  li a0, ' '
  jal putc
  lb a0, 0(s9)
  jal putc_ascii
  lb a0, 1(s9)
  jal putc_ascii
  lb a0, 2(s9)
  jal putc_ascii
  lb a0, 3(s9)
  jal putc_ascii
  li a0, '\n'
  jal putc
  addi s9, s9, 4
  addi s10, s10, -1
  j 1b
putc_ascii:
  addi sp, sp, -4
  sw ra, 0(sp)
  li t0, 32
  blt a0, t0, .L_putc_dot
  li t0, 127
  bge a0, t0, .L_putc_dot
  j .L_putc_char
  
.L_putc_dot:  
  li a0, '.'
.L_putc_char:
  call putc
  lw ra, 0(sp)
  addi sp, sp, 4
  ret
  # ( limit index -- R: limit index)
DO_DO:
  POP t0            # t0 = index
  POP t1            # t1 = limit
  RPUSH t1          # limit
  RPUSH t0          # index
  j NEXT
  # ( R: limit index -- R: limit index ) ; (L: -- ) when done
DO_LOOP:
  RPOP t0    # pop index
  RPOP t1    # pop limit
  addi t0, t0, 1
  blt t0, t1, .L_do_loop_jump # if limit < index
  # skip over the jump address
  addi :IP, :IP, 4
  j NEXT
.L_do_loop_jump:
  # push them back on Rstack if still looping
  RPUSH t1   # push limit
  RPUSH t0   # push index
  # read the jump address from IP (the next cell in the thread)
  lw :IP, 0(:IP)
  j NEXT
  # ( -- ) (R: limit index -- ) (L: addr -- )
UNLOOP:
  RPOP zero   # index
  RPOP zero   # limit
  j NEXT
LOOP_I:
  lw t0, 0(:RSP)
  PUSH t0
  j NEXT
LOOP_J:
  lw t0, 8(:RSP)
  PUSH t0
  j NEXT
LOOP_K:
  lw t0, 16(:RSP)
  PUSH t0
  j NEXT
ALLOT:
  POP t0
  mv a0, t0
  add :HERE, :HERE, t0
  j NEXT
# ( x1 x2 -- flag )
AND:
  POP t0
  POP t1
  # Check if either value is zero
  beqz t0, .L_false
  beqz t1, .L_false
  # Both non-zero, return TRUE (-1)
  li t0, -1
  PUSH t0
  j NEXT
.L_false:
  # At least one zero, return FALSE (0)
  mv t0, zero
  PUSH t0
  j NEXT
WFI:
  wfi
  j NEXT
  # ( addr -- )
EVAL:
  add sp, sp, -36
  sw ra, 0(sp)
  sw :TOK_POS, 4(sp)
  sw :TOK_LEN, 8(sp)  
  sw :IP, 12(sp)
  sw :XT, 16(sp)
  sw :MODE, 20(sp)
  sw :CSP, 24(sp)
  sw :RSP, 28(sp)
  sw :ESP, 32(sp)
  POP :TOK_POS
  mv :TOK_LEN, zero
  jal interpreter_eval
  lw ra, 0(sp)
  lw :TOK_POS, 4(sp)
  lw :TOK_LEN, 8(sp)
  lw :IP, 12(sp)
  lw :XT, 16(sp)
  lw :MODE, 20(sp)
  lw :CSP, 24(sp)
  lw :RSP, 28(sp)
  lw :ESP, 32(sp)
  add sp, sp, 36
  j NEXT
BUMP:
  la t0, bump
  lw t1, 0(t0)
  addi t1, t1, 1
  PUSH t1
  sw t1, 0(t0)
  j NEXT
dictionary:
word_bye:
  .word 0         # link
  .word 3         # token length
  .ascii "bye\0"  # first 4 characters of token
  .word 0         # flags
  .word BYE       # address of execution token
word_plus:
  .word word_bye
  .word 1
  .ascii "+\0\0\0"
  .word 0
  .word PLUS
word_minus:
  .word word_plus
  .word 1
  .ascii "-\0\0\0"
  .word 0
  .word MINUS
word_mul:
  .word word_minus
  .word 1
  .ascii "*\0\0\0"
  .word 0
  .word MUL
word_bang:
  .word word_mul
  .word 1
  .ascii "!\0\0\0"
  .word 0
  .word BANG
word_at:
  .word word_bang
  .word 1
  .ascii "@\0\0\0"
  .word 0
  .word AT
word_dup:
  .word word_at
  .word 3
  .ascii "dup\0"
  .word 0
  .word DUP
word_emit:
  .word word_dup
  .word 1
  .ascii ".\0\0\0"
  .word 0
  .word EMIT
word_cr:
  .word word_emit
  .word 2
  .ascii "cr\0\0"
  .word 0
  .word CR
word_debug_stack:
  .word word_cr
  .word 2
  .ascii ".s\0\0"
  .word 0
  .word DEBUG_STACK
word_debug_stack_compile:
  .word word_debug_stack
  .word 4
  .ascii "[.s]"
  .word 1
  .word DEBUG_STACK
word_colon:
  .word word_debug_stack_compile
  .word 1
  .ascii ":\0\0\0"
  .word 0
  .word COLON
word_semicolon:
  .word word_colon
  .word 1
  .ascii ";\0\0\0"
  .word 1 # immediate
  .word SEMICOLON
word_li:
  .word word_semicolon
  .word 2
  .ascii "li\0\0"
  .word 0
  .word LI
word_jr:
  .word word_li
  .word 2
  .ascii "jr\0\0"
  .word 0
  .word JR
word_key:
  .word word_jr
  .word 3
  .ascii "key\0"
  .word 0
  .word KEY
word_here:
  .word word_key
  .word 4
  .ascii "here"
  .word 0
  .word PUSH_HERE
word_comma:
  .word word_here
  .word 1
  .ascii ",\0\0\0"
  .word 0
  .word COMMA
word_create:
  .word word_comma
  .word 6
  .ascii "crea"
  .word 0
  .word CREATE
word_branch0:
  .word word_create
  .word 7
  .ascii "0bra"
  .word 0
  .word BRANCH_ON_ZERO
word_jump:
  .word word_branch0
  .word 4
  .ascii "jump"
  .word 0
  .word JUMP
word_literal:
  .word word_jump
  .word 7
  .ascii "lite"
  .word 1 # immediate
  .word LITERAL
word_to_r:
  .word word_literal
  .word 2
  .ascii ">r\0\0"
  .word 0
  .word TO_R
word_from_r:
  .word word_to_r
  .word 2
  .ascii "r>\0\0"
  .word 0
  .word FROM_R
word_r_fetch:
  .word word_from_r
  .word 2
  .ascii "r@\0\0"
  .word 0
  .word R_FETCH
word_to_cf:
  .word word_r_fetch
  .word 3
  .ascii ">cf\0"
  .word 0
  .word TO_CF
word_from_cf:
  .word word_to_cf
  .word 3
  .ascii "cf>\0"
  .word 0
  .word FROM_CF
word_cf_fetch:
  .word word_from_cf
  .word 3
  .ascii "cf@\0"
  .word 0
  .word CF_FETCH
word_immediate:
  .word word_cf_fetch
  .word 9
  .ascii "imme"
  .word 0
  .word IMMEDIATE
word_postpone:
  .word word_immediate
  .word 8
  .ascii "post"
  .word 1 # immediate
  .word POSTPONE
word_drop:
  .word word_postpone
  .word 4
  .ascii "drop"
  .word 0
  .word DROP
word_dump:
  .word word_drop
  .word 4       
  .ascii "dump" 
  .word 0       
  .word DUMP    
word_left_bracket:
  .word word_dump
  .word 1       
  .ascii "[\0\0\0" 
  .word 1       
  .word LEFT_BRACKET
word_right_bracket:
  .word word_left_bracket
  .word 1       
  .ascii "]\0\0\0" 
  .word 1       
  .word RIGHT_BRACKET
word_swap:
  .word word_right_bracket
  .word 4       
  .ascii "swap" 
  .word 0       
  .word SWAP
word_equal:
  .word word_swap
  .word 1       
  .ascii "=\0\0\0" 
  .word 0       
  .word EQUAL
word_greather:
  .word word_equal
  .word 1       
  .ascii ">\0\0\0" 
  .word 0       
  .word GREATHER
word_do_do:
  .word word_greather      
  .word 4                  
  .ascii "(do)"            
  .word 0                  
  .word DO_DO              
word_do_loop:
  .word word_do_do
  .word 6
  .ascii "(loo"
  .word 0
  .word DO_LOOP
word_unloop:
  .word word_do_loop
  .word 6
  .ascii "unlo"
  .word 0
  .word UNLOOP
word_i:
  .word word_unloop
  .word 1
  .ascii "i\0\0\0"
  .word 0
  .word LOOP_I
word_j:
  .word word_i
  .word 1
  .ascii "j\0\0\0"
  .word 0
  .word LOOP_J
word_k:
  .word word_j
  .word 1
  .ascii "k\0\0\0"
  .word 0
  .word LOOP_K
word_allot:
  .word word_k
  .word 5
  .ascii "allo"
  .word 0
  .word ALLOT
word_exit:
  .word word_allot
  .word 4
  .ascii "exit"
  .word 0
  .word EXIT
word_c_bang:
  .word word_exit
  .word 2
  .ascii "c!\0\0"
  .word 0
  .word C_BANG
word_c_at:
  .word word_c_bang
  .word 2
  .ascii "c@\0\0"
  .word 0
  .word C_AT
word_emit_char:
  .word word_c_at
  .word 4
  .ascii "emit"
  .word 0
  .word EMIT_CHAR
word_rot:
  .word word_emit_char
  .word 3
  .ascii "rot\0"
  .word 0
  .word ROT
word_over:
  .word word_rot
  .word 4
  .ascii "over"
  .word 0
  .word OVER
word_and:
  .word word_over
  .word 3
  .ascii "and\0"
  .word 0
  .word AND
word_qexit:
  .word word_and
  .word 5
  .ascii "qexi"
  .word 0
  .word qemu_exit
word_wfi:
  .word word_qexit
  .word 3
  .ascii "wfi\0"
  .word 0
  .word WFI
word_eval:
  .word word_wfi
  .word 4
  .ascii "eval"
  .word 0
  .word EVAL
word_bump:
  .word word_eval
  .word 4
  .ascii "bump"
  .word 0
  .word BUMP
dictionary_end:
bump:
  .word 0
err_create_error:
  .asciz "\nerror: create missing name, usage: create [name]\n"
err_word_not_found:
  .asciz ": word not found\n"
  .align 2
  # our actual human readable program
human_program_small:
  .asciz "
: begin 
   here
   >cf
; immediate
: again 
   postpone jump 
   cf>
  ,
; immediate
: if
   postpone 0branch
   here
   0
   ,
   >cf
; immediate  
: then
   here       
   cf>        
   !
; immediate  
: nothing
  begin
    1 +
    dup 1000000 = if
      97 emit
      drop
      1
    then 
  again
;
1 nothing
  "
  
human_program:
  .asciz "
: begin 
   here
   >cf
; immediate
: again 
   postpone jump 
   cf>
  ,
; immediate
: until
   postpone 0branch
   cf>             
   ,               
; immediate
: if
   postpone 0branch
   here
   0
   ,
   >cf
; immediate  
: then
   here       
   cf>        
   !
; immediate  
: else
   postpone jump    
   here             
   0                
   ,                
   here            
   cf>             
   !               
   >cf             
; immediate
: loop
  postpone (loop)
  cf>
  ,
; immediate
: do
  postpone (do)
  here 
  >cf
; immediate
create buffer 200 allot 
: buffer[] buffer + ;
: zero 200 0 do 0 i buffer[] c! loop ;
: print 200 0 do i buffer[] dup 0 = if exit then c@ emit loop ;
: input
  bump .
  62 emit
  32 emit
  200 0 do
    key dup 13 = if
      drop
      -1
      unloop exit
    else
      dup 27 = if
        drop 
        0 
        unloop exit
      then
      dup emit
      i buffer[] c!
    then
  loop
;
: repl
  begin
    zero
    input
    0 = if drop unloop exit then
    cr
    buffer
    eval
    cr
  again
;
repl
qexit  
"
  .end
We have two interpreters, one loads the human_program_small program, which
just prints the letter a from time to time, the other loads human_program
which is our repl where you can write forth code and execute it.
One thing to note is that you see there is zero restriction or protection for one task to completely corrupt the other task's memory. In modern hardware and operating systems there is massive amount of work to make that impossible, or at least harder, and in fact it is an active area of research. We will briefly discuss it in the Security chapter.
Now what we are missing is loading and saving files, and communicating with other computers. I think for those we have to go to a real computer, because qemu's disk and networking are too involved.
It is important to remember there is very little difference between any of those systems, be it showing pixels on screen or writing to disk or receiving network packets, its all the same, there are protocols that you need to comply to, write to some registers to control the hardware in a specific way, and then send or read data. Some times the hardware will trigger an interrupt, sometimes you will have to poll and check if there is data. For example the USB keyboard interface never pushes data, when I press a key on the keyboard it gets into an internal buffer in the keyboard itself, and the operating system will ask the keyboard every 5 millisecons: give me the pressed keys. In the past we had PS2 keyboars which triggered interrupts, so when you press a key, the operating system's interrupt handler is called and your press is handled immediately. 5 milliseconds is not a lot of time, its 200 times per second, so you wont notice, but you see it is fundamental difference of control. The keyboard has to be more complicated now, because it must have a buffer of pressed keys, otherwise with PS2 you just send as soon as the key is pressed.
PULL versus PUSH is the most fundamental property of system interractions,
and multiple books can be written to examine the emergent properties of systems
depending on how they use pull/push between their components, where are their
buffers and how are they drained.
This will conclude this chapter, and we will continue in the next one where we move to a real Raspberry Pi Pico 2350 computer, it has two ARM cores and two RISCV cores, you can use either one, and we will use the RISCV cores, I have not said much about Symmetric Multi Processing, where we have multiple CPU cores that access the same memory and hardware, and the challenges it has, and for now we will just use 1 of the cores, and pretend its only 1, to keep thing simpler. It is hard enough to have multiple programs using the same CPU to communicate, e.g. pressing a character gets it in a buffer where multiple programs might read it, but with 1 CPU, only one program runs at a time, which makes things considerably easier. Even ths simplest things are difficult when two entities are using the same resources, imagine a deck of cards, and two people reach to get the top card in the very same time, you remember how in our 74LS chips there are nanosecond intervals in which the voltage is read, well.. thats quite a lot of time for things to go wrong. A whole new set of primitives has to be used to guarantee order of operations, we won't get into it for now, but if you are interested search for "SMP" or "Symmetric MultiProcessing".
When you are exploring those concepts, use chatgpt or some other AI to help you, you can trully ask it the silliest questions, even though it might be wrong, and mislead you, just think about the wires, and listen to your doubt. Look at the source code of the Linux kernel version 1, paste it in chatgpt and ask it how it works. The recent kernel versions are very complicated as they must support tens of thousands of different devices and systems and subsystems, but you can also ask chatgpt about it. Look into preemtive scheduling, cooperative scheduling, fair scheduling, real time operating systems, processess and threads, push and pull, the dining philosophers problem, the barber problem, the byzantine generals problem. clockless computers, tickless kernels, micro kernels, synchronization primitives, semaphores, spinlocks, mutexes and futexes. Don't stress, it seems like a lot, but everything is similar, you just have to understand the reason for its existence.