Operating System

An Operating System (OS) is just a program, its a program with the only purpose of allowing the user of the computer to use the computer to its fullest potential, be able to run other programs and be able to access the hardware.

There are many operating systems, the most used are Linux, Windows, MacOS, iOS, Android (which is on top of Linux), FreeBSD, Minix, and FreeRTOS, but there are many others, BeOS, Haiku, CollapseOS, and so on.

Pretty much all devices you use have some operating system, even the oven. The TV usually runs Android, your phone is with iOS, your laptop Windows. There are operating systems even in parts of your computer, for example in most Intel CPUs itself there is small Minix that handles certain security features and power management. There might be a small operating system in your bluetooth headphones. even in your keyboard.

Again, its just a program, it does not have to be the first program the computer starts, in our SUBLEQ computer the first program would be a circuit that copies the actual SUBLEQ program from non volatile storage to RAM and then disconnects itself, set the program counter to 0, and start the normal execution. Then we could have an operating system written in SUBLEQ (if we had a bigger version of the computer, its not much we can do in 16 memory locations. After the computer boots, and on some computers this is quite complicated process, somehow at some at point the operating system will be jumped into, like in our Forth, we have j forth, and from then on, it is in charge.

A lot of complexity goes into it because most popular operating systems try to be very generic, for example Windows has to work on AMD processors and on Intel processors, and even though they both have x86 instruction set, there are subtle hardware bugs, certain functionalities exist only in one not in the other. Windows also has to work on all kinds of motherboards, with all kinds of RAM, and network cards, and USB controllers, all kinds of graphic cards, and so on. When you write a program for Windows, you don't know at all what kind of computer it will run on, you compile putc(65), and know that it will somehow print 'A' on the screen, and you do not care how this is going to happen.

In our QEMU computer, the operating system would have to check if the UART is ready to write by busy polling address 0x10000005 and when its available we write the character to address 0x10000000

putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

Now imagine we have different QEMU computers, one has the UART register at address 0x10000000, the other at 0x20000000, we could write two versions of the OS, compile them and let the users choose which one they need.

putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

and

putc:
  li t0, 0x20000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

There are multiple problems with this, first the user would have to very carefully install the right operating system, and if there are 200 kinds of computers, chances are they will try to install the wrong one, the user does not care, they just want to display text on the screen. Also the combinations of devices is ridiculous, you have 50 kinds of RAM, 200 kinds of wifi controllers, maybe families 150 network adapters, .. how many variants of the operating systems we have to compile to support all of the combinations?

The hardware manufacturers want their device to be used, so they make them compatible with another device that is already supported by some operating system, or sometimes we have standards that attempt to guarantee compatibility, for example the USB HID specification, or the RISC-V specification. If I make a RISC-V CPU and it adheres to the specification, than I can say my CPU is RISC-V compatible, any program written for another RISC-V should be able to be executed on my CPU, but, the CPU is not the whole computer, even if my CPU executes the same sb instruction, as the other ones, writing to address 0x10000000 could mean a very different thing on each computer. Some say that software compatibiility is the biggest technological advancement since the transistor. Everybody works really hard for things to be compatible. I think this will change soon, as the main reason for this compatibility is because it was hard to write software, and almost nobody can program computers, but now, the wind of change is here, I am quite excited what it will bring. There is price we must pay for compatibility, for example your Intel x86 CPU, still boots in "Real Mode" which emulates Intel's 1978 8086 processor until it is switched to protected mode. I often wonder, what world would it be if we knew how to program computers, and are free from the chains of the past. Your bluetooth headphones have to adhere to a standard definition that is more than 10,000 pages, why cant you just make some headphones and describe how their wires work and someone else can write their program to use them? We keep adding standards and layers of indirection everywhere, so that its easier to do things, but, I am not sure anything is easier. In Commodore64 BASIC, in 1984 or so, people just did peek and poke and were able to read and write memory, now computers do not even come with a programming environment installed, not to mention you cant read any memory you want (we will discuss this more in the security section). The entire software industry has given up on teaching people how to program, it has lost all trust in us. It is too complicated they say.. but is it? Is it really complicated? Just tell us how it works, wires are wires, bits are bits. It is too dangerous they say, we have to secure the computer for you, otherwise someone might hack into it, well.. what if I want to hack into it, its my computer!

Anyway, in order to make it easy to add support for more kinds of hardware, and for hardware manufacturers to also add code to the operating systems, because after all they know best how their hardware actually works, we "hide" hide how putc works under the hood, we add a layer of indirection between the user and the hardware.


boot:
  li t0, 2 # somehow check what kind of QEMU are we running on
  li t1, 2 # if the QEMU kind is 2, load the correct putc variant
  beq t0, t1, .L_qemu_2
  
.L_qemu_1:
  la t0, putc
  la t1, putc_qemu_1
  j .L_boot_done

.L_qemu_2:
  la t0, putc
  la t1, putc_qemu_2
  j .L_boot_done
  
.L_boot_done:
  j main

putc:
  .word putc_qemu_1

putc_qemu_1:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

putc_qemu_2:
  li t0, 0x20000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

syscall:
  addi sp, sp, -4
  sw ra, 0(sp)

  lw t0,0(a5)
  jalr t0

  lw ra, 0(sp)
  addi sp, sp, 4

and use it like this

    ...
    li a0, 65
    la a5, putc
    call syscall
    ...

This is an example of doing so, adding a layer of indirection between the programmer and the UART, so they only say "system please print a character on screen", we call syscall, with the function we want, in this case putc, it will load the value of putc and jump to wherever it points to. We wont even care if its written on the UART or displayed on a screen, the operating system will hide that from us. This also allows us to actually change during the runtime of the program, what putc does, for example imagine I plugin some different display, the operating system somehow detects that, and now putc() could write to the display, instead of UART.

This is what a device driver is BTW, it is just a bit of code that enables certain hardware functionality, like putc support. Our display driver could put the machine code that controls the leds somewhere in memory and change putc to point to that location, so we have added support for displaying characters on the screen. You might have heard 'you need to install a driver for it' or 'there is new version of the NVIDIA driver, you have to update'

This is one aspect of the operating system, to abstract the hardware functionality, print, display pixel, get keyboard input, read from memory, write to memory, etc.

There is another aspect of it, which is to create an environment that allow the user to use the computer, write programs and execute programs, to run multiple programs in the same time. The most basic use could be to be able to edit a file, compile it and execute it. This is what we will focus on with our operating system.

At the moment our program is somewhere in memory at the label "human_program", and we can only execute that one program, also our getch function is busy looping, it is consuming all the cpu resources just to check if the UART is ready to read, if our cpu runs at 100 mhz, and we type one character per second, it will execute 100,000,000 useless instructions. Most CPUs have a way to go into 'wait' mode, where they consume very little power until an interrupt happens, for example UART interrupt or timer interrupt. Our code at the moment does not take advantage of it. Our interpreter is also quite ridgit, we can only start it once. We dont have the concept of a process, or inter process communication.

It seems like a lot to build, but its not that much, and we will start small, first we will create the operating system scheduler, which is a tiny bit of code that runs at regular intervals, e.g. every millisecond, and just changes the running process, it snapshots all the registers as they are at the moment of the interrupt, and loads the registers of another process.

Imagine the following code:

process_a:
    li a0, 'a'
    jal putc
    j process_a

process_b:
    li a0, 'b'
    jal putc
    j process_b

If we jump into process_a we will be forever stuck there. We will inline putc into the code, so that we can see through everything:

process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)

    j process_a

process_b:
    li a0, 'b'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)

    j process_b

Zooming in even more, imagine you have jumped into process_a, and you are executing the instructions one by one.

800019a4:       06100513   li      a0,97
800019a8:       100002b7   lui     t0,0x10000
800019ac:       0052c303   lbu     t1,5(t0) 
800019b0:       02037313   andi    t1,t1,32
800019b4:       fe030ce3   beqz    t1,800019ac <--- YOU ARE HERE
800019b8:       00a28023   sb      a0,0(t0)
800019bc:       fe9ff06f   j       800019a4

So pc is at address 800008a0, there we have the instruction beqz t1, 1b, 1b means the label 1 backwards, which is actually a pseudo instruction that expands to beq x0, x6, -8, or the machine code 0xfe030ce3. Even though our disassembler outputs beqz t1,800019ac, the offset in the machine code is actually relative, and 800019ac - 800019b4 is -8.

Just for practice lets look at 0xfe030ce3 in binary it is 11111110000000110000110011100011 which corresponds to beq x6, x0, -8.

So you are the program counter, and at the moment you are at address 800019b4, at this point the program has set the value 97 in a0, the value 0x10000000 in t0 and it has loaded the value from memory address 0x10000005 into register t1, lets imagine the value at that point in time was 0, so the whole state of the our program is:

    a0: 67
    t0: 0x10000000
    t1: 0
    pc: 0x800019b4

What if I turn off the computer, for 100 years, and then I start the computer with those values preloaded into the registers? Would the program execution know in any way that it was paused for one century?

What if our universe pauses for 1 billion years every nanosecond, would we know?

Our computer has no way of knowing if it just started in a certain configuration, or reached that configuration while executing some program. Because our computer is a deterministic machine, meaning if it starts from a known configuration and follows some program, if we run it 100 times, every time the output will be the same for the same input, setting a0 to 97 will always set it to 97, well almost always, as there could be errors in the wires, or the universe itself, but this is beyond the scope of this book.

So, if we can just store the registers process_a we can pause it, and later we can just set them to their values and continue the process. You see there is no such thing as a process its just code with a dedicated piece of memory where we can store its registers when we switch it out. This way we can run multiple programs in the same CPU, this is called time-sharing, every program gets a chance to run for a little bit, and then its swapped, the scheduler tries to give equal time to all running programs.

The piece of memory where we hold the process information is called Process Control Block (PCB), or Task/Thread Control Block. It usually hold the registers, some information like why was the process switched out, was it because it is waiting for something, or sleeping, or the scheduler just decided that its time to give chance to another process. PCB also holds the process identifier or PID, and accounting information like how much cpu was it using, which files it has open so when the process is shut down the files can be closed, some security infromation, priority level, what its status is (running, blocked, etc) and whatever else you need in the operating system in order to manage the running proceses.

So this is all fine, we store the registers, load another set of registers and jump to the new pc, but what does it mean to 'interrupt', how do we interrupt a program?

In our SUBLEQ computer we had a flag register and its output was controlling the microcode on one of the eeproms. In our microcode that is how we execute different micro instructions depending if the result was less than zero or not.

 OR(A==B, MSB==1)
   |
   D0        OE ---------------.
   |         |                 |
.------------------.           |
| 74ls173 FlagReg  |           |
'------------------'           |
   |         |                 |
   Q0        C                 |
   |          `----------------|--.
   '--------.                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
            |                  |  |
    0 1 2 3 4 5 address lines  |  |
    | | | | | | ..             |  |
.------------------.           |  |
|  28at64c eeprom  |           |  |
'------------------'           |  |
    | | | | | | ..             |  |
    0 1 2 3 4 5 I/O lines      |  |
          |   |                |  |
          '--------------------'  |
              |                   |
               `------------------'

Weirdly enough, we control the output of the flag register from the I/O lines, and its output controls the eeprom's address line which of course controls its output. If you were to remember one thing from this book, I want it to be this feedback loop. The interplay between the input and the output of a system, and how to build structures on top of it.

So if the value of register A is the same as register B, OR the most significant bit of the ALU output is 1 (meaning the number is negative), AND we have latched the value in the flag register, once we enable it, it will control the I/O lines by whatever we have as value on the new address.

We could use a similar technique, we can create a small timer circuit, that every 1 milliosecond enables an address line, which would allow us to execute different microcode, that could turn off the timer circuit when done and continue normal execution. Once we go in interrupt mode we could store the program counter in some temporary register, and jump to a predefined location, where we will have our interrupt handler (just a piece of code to be executed once interrupt occurs). Now the interrupt handler can fetch the value of the old program counter from the temporary register, and jump there if it wants, or store it somewhere and load some other value. This is what it means to interrupt a program, the processor is forced to jump to the interrupt handler code.

There are many kinds of interrups, there are timer interrupts, or for example uart interrupts, or exception interrupts, and so on, what if you get an interrupt while you are in the interrupt handler? There are certain interrupts you can ignore, they are called maskable interrupts, e.g. timer or soft interrupts, others you cant and you must handle, those are non-maskable interrupts, e.g. hardware errors, watchdog timer.

When I was young, I was quite terrified by interrupts, I just couldn't imagine how can the computer jump somewhere on its own, it was freaking me out a bit.

In QEMU we can use the UART hardware interrupts and the timer interrupts to make our operating system scheduler. To learn a new thing, you should always start with the absolute minimum. Make a new empty folder, just copy the Makefile and we start again from boot.s and nothing else, later once the scheduler is done we will bring back our Forth.b

  # boot.s
  .section .text
  .globl _start

_start:
  la sp, _stack_top

  j process_a
  j process_b # never reached

process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)

    j process_a

process_b:
    li a0, 'b'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0)
    andi t1, t1, 0x20
    beqz t1, 1b
    sb a0, 0(t0)

    j process_b

  .end

Now we get infinite stream of 'a', which is a good place to start, you understand every single line in this code, and we are on solid ground. The next step would be to create an interrupt handler code and to create a timer interrupt, and we will just print 'c' from the interrupt handler.

There are many ways to break down something you think is impossible or unachievable, one common way is to split it into known unknowns and unknown unknowns, basically first do the things you are not sure are going to work, jump in the deep end. There is another way, which is do the smallest step in the direction you want to go, we are trying to experience interrupt handling, for that we need interrupts.. and handling, so we can just do those. And of course there is my favorite way, explore, because when you start with something, you think you know where you want to go, but as you are on the journey, you see it was not the right destination. Allow yourself to explore, try things out, be ready to delete your code, start from scratch, do not be afraid of the code that you will write, is it going to be stupid or embarassing, just play. It is hard to show you this in the book, but I don't know Forth nor RISC-V assembly, nor I have written a scheduler before, or an operating system, didnt know anything about the 74LS TTL chips before I made the 4 bit SUBLEQ computer, I didnt even know about SUBLEQ itself. I just try things out, and I am sure there are infinitely better ways to do what I did, but so what? In the beginning you have a path, and it seems quite straight bath, you plan your stops, 1) I will make a chapter about AND gates, 2) I will write about z80, 3) I will make small z80 program, never wrote z80 assembly, but why not.. and so on, midway however I realize, wait I can actually make my computer, that would be cool, why not, then I went on this sidequest for like 1 month, and it was was so much fun! My advice is, don't be afraid to experiment, only the blind know where they are going, so close your eyes and see.

Don't value your code that much, be happy when you delete it, it means you have outgrown it.

Its time to make some timers interrupts and their handlers.

In qemu first we need to setup the time at which we want to get the interrupt, now + 5000 clock ticks for example, we do that by the tick value is 64 bit value and our registers hold 32 bits, but this is fine because want to trigger the interrupt after 5000 ticks.

li t1, 0x0200BFF8 # address of mtime register
lw t0, 0(t1)      # load the lower 32 bits
li t1, 5000
add t0, t0, t1    # add 5000 to it
li t1, 0x02004000 # address of time compare register mtimecmp
sw t0, 0(t1)      # store the new value in it

mtime is continously incrementing hardware time register, well hardware is somewhat of a soft term in qemu, but if you are using a physical computer there will be some way to set up timers, and it will probably be similar.

mtimecmp is a compare register, once mtime >= mtimecmp it triggers machine timer interrupt.

In our assembly we first load the value of mtime, then we add 5000 to it and set it in the mtimecmp, now we only need to tell qemu that we are interested in receiving timer interrupt and we have to tell it how we are ghoing to handle them, adn the RISC-V specification defines how mtimecmp and mtime are used.

_start:
  la sp, _stack_top
  jal setup_timer

process_a:
  ...
process_b:
  ...

setup_timer:
  la   t0, interrupt_handler
  csrw mtvec, t0

  li t1, 0x0200BFF8 # address of mtime register
  lw t0, 0(t1)      # load the lower 32 bits
  li t1, 5000
  add t0, t0, t1    # add 5000 to it
  li t1, 0x02004000 # address of time compare register mtimecmp
  sw t0, 0(t1)      # store the new value in it
  
  li   t0, (1 << 7) # MTIE bit in mie
  csrw mie, t0

  csrr t1, mstatus
  li t0, (1 << 3)   # MIE bit in mstatus
  or t1, t1, t0
  csrw mstatus, t1

  # initialize current process pointer
  la t0, current_process
  la t1, process_a
  sw t1, 0(t0)

  j process_a

interrupt_handler:
  # setup the next timer interrupt
  li t1, 0x0200BFF8    # mtime address
  lw t0, 0(t1)         # get current time
  li t1, 5000          # timer interval
  add t0, t0, t1       # next interrupt time
  li t1, 0x02004000    # mtimecmp address
  sw t0, 0(t1)         # set next interrupt

  la t1, current
  lw t2, 0(t1)

  beqz t2, set_to_one  # if current == 0, jump
set_to_zero:
  li t2, 0
  sw t2, 0(t1)
  la t0, process_a
  j set_mepc

set_to_one:
  li t2, 1
  sw t2, 0(t1)
  la t0, process_b

  # set mepc (machine exception program counter)
set_mepc:
  csrw mepc, t0
  mret

current:
  .word 0
.end

In setup_interrupt we also have to enable timer interrupt MTIE (bit 7), in the mie register (_M_achine _I_nterrupt _E_nable). After that we actually need to enable MIE interrupts in mstatus, so we read the current value and or it with 00000000 00000000 00000000 00000100 (3rd bit), and then write it back, bit 3 is the MIE bit in mstatus.

# enable MTIE
li   t0, (1 << 7) # MTIE bit in mie
csrw mie, t0

# enable MIE
csrr t1, mstatus
li t0, (1 << 3)   # MIE bit in mstatus
or t1, t1, t0
csrw mstatus, t1

You have noticed that we dont use the normal li or lw and sw for mie, mstatus, mtvec, they are 'Control and Status Registers', or CSRs, a special kind of registers, csrr means read, csrr t1, mstatus is read the value of mstatus into t1, and csrw means write, csrw mie, t0 means write the value of t0 into mie.

When interrupt happens, the processor has to jump somewhere, in the mtvec csr register, Machine Trap Vector, it has the base address of where the CPU will jump when machine mode interrupt or exception occurs.

la   t0, interrupt_handler
csrw mtvec, t0

All is well so far, once the next interrupt happens, it will jump to our interrupt_handler code, there we have to set a new mtimecmp value so that we get another interrupt next time, and we just check the value of the current variable if its the address of process_a we switch it to process_b and later we set mepc (Machine Exception Program Counter) to the new value, mepc holds the value of pc at the time of the interrupt. At the end we call mret instead of ret, you know there is no such thing as ret, it is jalr zero, 0(ra) or 0x00008067, so ret just jumps to wherever ra was pointing to, but mret jumps to wherever mepc is, so by setting mepc we control where we go back to.

If you had nice 80 column monitor you would see something like this:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

In pseudocode the program looks like this

start:
   setup_timer()

process_a:
   print('a')
   goto process_a

process_b:
   print('b')
   goto process_b


setup_timer:
   set mtvec to interrupt_handler
   interrupt after 5000 ticks
   enable timer interrupts
   enable machine interrupts
   process_a()

interrupt_handler:
   interrupt after 5000 ticks
   
   if current == 0:
      t0 = process_b
      current = 1
   else:
      t0 = process_a
      current = 0

   mepc = t0
   mret

So now we switch between two procesees, but we dont actually restore their state at all, we just start process_a and process_b from scratch every time, we jump to the very beginning of their machine code at every interrupt.

Think again, at the time of the interrupt it is as if the world is frozen from the point of view of process_a, when the cpu jumps into interrupt_handler all the registers are just as they were, again, try to imagine as if you were executing those instructions on paper, and the world freezes.

process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0) <--- FREEZE
    andi t1, t1, 0x20 
    beqz t1, 1b
    sb a0, 0(t0)

    j process_a

What would the values be. What would the value of pc be at this point? The instruction is not partially executed, it must be executed, so pc points to th enext instruction, t1 is set with the value from memory address t0+5, a0 is 97 and t0 is 0x10000000.

We want to "continue" the process when its its turn again, so it can never know that something happened.

process_a:
    li a0, 'a'
    li t0, 0x10000000
  1:
    lbu t1, 5(t0) 
    andi t1, t1, 0x20 <--- CONTINUE
    beqz t1, 1b
    sb a0, 0(t0)

    j process_a

Our current example is quite trivial, so its not of much difference if we continue or start from the beginning, but anything code that changes memory will be extremely confused if we keep restarting it, or maybe if the process is more complicated and it cant finish the work in 5000 ticks then it will never return a result as we will keep starting it over and over again.

We will save the registers of the running process in memory (PCB) and then next time its its turn we will restore them. We will just unfreeze its world. This is quite strange if you think about it, that the program can never know if it was stopped by itself, it must look at some external reference to know. There are a lot of things in our universe like that, for example the principle of relativity, if you are in a space ship with constant velocity, and there are no stars around you, you can not know if you are moving, even more, if another ship is coming towards you, you dont know if it is coming to you, or you are coming to it. How bizzare it is that constant velocity is undetectable from inside the ship? How is that not magic, stop and think for a second, as this is quite disturbing. How can it be? We can freeze cells, then we unfreeze them and they go as if nothing happened. The frog rana sylvatica actually freezes during winter, they have no heart beat, no breathing and no blood circulation, nor brain activity while frozen, but when it gets warm they get back to life. When it wakes up, maybe one winter passed, maybe 10 winters passed, how would it know?

When frozen, wood frogs have no detectable vital signs: no heartbeat, breathing, blood circulation, muscle movement, or detectable brain activity. Wood frogs in natural hibernation remain frozen for 193 +/- 11 consecutive days and reached an average (October–May) temperature of −6.3 °C (20.7 °F) and an average minimum temperature of −14.6 ± 2.8 °C (5.7 ± 5.0 °F). The wood frog has evolved various physiological adaptations that allow it to tolerate the freezing of 65–70% of its total body water. When water freezes, ice crystals form in cells and break up the structure, so that when the ice thaws the cells are damaged. Frozen frogs also need to endure the interruption of oxygen delivery to their tissues as well as strong dehydration and shrinkage of their cells when water is drawn out of cells to freeze.

-- https://en.wikipedia.org/wiki/Wood_frog

-- https://www.youtube.com/watch?v=UvCdOXG2rPo

Anyway, lets get back to our registers and process control blocks, this is how saving and restoring registers would look through time:

Time flows this way ------>

Process A executing:       Interrupt Handler:        Process B executing:
.-------------------.     .-------------------.     .-------------------.
| a0: 'a' (97)      |     | Save A's state    |     | a0: 'b' (98)      |
| t0: 0x10000000    | ->  | Set up new timer  | ->  | t0: 0x10000000    |
| t1: 0             |     | Load B's state    |     | t1: 0             |
| pc: 0x800019b4    |     | Jump to B (mret)  |     | pc: 0x80001bc4    |
'-------------------'     '-------------------'     '-------------------'

At the moment the way process_a is written it always jumps to its start, so the whole point is kind of moot, we will complicate it a bit in order to see that we actually do store and restore the registers, we do freeze and unfreeze it.

process_a:
  li a0, 'a'
  li a1, 0
  mv a2, a0
  li t0, 0x10000000
  li t2, 26
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  addi a1, a1, 1     # a1++
  rem a1, a1, t2     # a1 = a1 % 26
  add a0, a2, a1
  j 1b


process_b:
  li a0, 'A'
  li a1, 0
  mv a2, a0
  li t0, 0x10000000
  li t2, 26
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  addi a1, a1, 1     # a1++
  rem a1, a1, t2     # a1 = a1 % 26
  add a0, a2, a1
  j 1b

this change will make process 'a' print 'abcdefghijklmnopqrstuvwxyz' and process 'b' prints 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', instead of jumping to their start labels they now jump to an inner loop where they both keep incrementing a1 = a1 % 26 and then make a0 = a2 + a1, a2 is the base character, and a1 = a1 % 26 will cycle between 0 and 25, so a0 = a2 + a1 will cycle from '97 + 0, 97 + 1, 97 + 2, 97 + 3, 97 + 4.. 97 + 25', 97 is ascii for lowercase 'a' and 122 (97 + 25) is lowercase 'z', and for process 'b' we get the same but starting from upper case 'A' which is 65 and ending at 90 (65 + 25) which is 'Z'. Now it will be very visible if we don't do a good job restoring the registers.

Examine the code first, then we will discuss it.

setup_timer:
  la t0, interrupt_handler
  csrw mtvec, t0

  # PCB A init
  la t0, pcb_a
  la t1, process_a
  sw t1, 124(t0)       # set initial PC
  la t1, stack_a_top
  sw t1, 4(t0)         # set stack top

  # PCB B init
  la t0, pcb_b
  la t1, process_b
  sw t1, 124(t0)       # set initial PC
  la t1, stack_b_top
  sw t1, 4(t0)         # set stack top

  # set up initial current process
  la t0, current_pcb
  la t1, pcb_a
  sw t1, 0(t0)

  # set up initial timer interrupt
  li t1, 0x0200BFF8    # mtime address
  lw t0, 0(t1)         # get current time
  li t1, 5000          # timer interval
  add t0, t0, t1       # next interrupt time
  li t1, 0x02004000    # mtimecmp address
  sw t0, 0(t1)         # set next interrupt

  # enable timer interrupts (set MTIE bit in mie)
  li t0, (1 << 7)      
  csrw mie, t0
  
  # enable global interrupts (set MIE bit in mstatus)
  csrr t1, mstatus
  li t0, (1 << 3)      
  or t1, t1, t0
  csrw mstatus, t1

  # start with process A
  j process_a

interrupt_handler:
  # temporary save t6 on top of the current process's stack
  # we just use it as scratch memory
  addi sp, sp, -4
  sw t6, 0(sp)        # save t6, temporary
  add sp, sp, 4       # we will use -4(sp) to take it back
  
  # now we can use t6 safely
  la t6, current_pcb
  lw t6, 0(t6)        # get current PCB address

  # FREEZE!
  sw ra, 0(t6)        # x1
  sw sp, 4(t6)        # x2
  sw gp, 8(t6)        # x3
  sw tp, 12(t6)       # x4
  sw t0, 16(t6)       # x5
  sw t1, 20(t6)       # x6
  sw t2, 24(t6)       # x7
  sw s0, 28(t6)       # x8
  sw s1, 32(t6)       # x9
  sw a0, 36(t6)       # x10
  sw a1, 40(t6)       # x11
  sw a2, 44(t6)       # x12
  sw a3, 48(t6)       # x13
  sw a4, 52(t6)       # x14
  sw a5, 56(t6)       # x15
  sw a6, 60(t6)       # x16
  sw a7, 64(t6)       # x17
  sw s2, 68(t6)       # x18
  sw s3, 72(t6)       # x19
  sw s4, 76(t6)       # x20
  sw s5, 80(t6)       # x21
  sw s6, 84(t6)       # x22
  sw s7, 88(t6)       # x23
  sw s8, 92(t6)       # x24
  sw s9, 96(t6)       # x25
  sw s10, 100(t6)     # x26
  sw s11, 104(t6)     # x27
  sw t3, 108(t6)      # x28
  sw t4, 112(t6)      # x29
  sw t5, 116(t6)      # x30
  lw t5, -4(sp)       # NB: Get original t6 value from -4(sp)
  sw t5, 120(t6)      # Save original t6 (x31)
  csrr t0, mepc
  sw t0, 124(t6)      # Save mepc to PCB

  # set up next timer interrupt
  li t0, 0x0200BFF8    # mtime address
  lw t1, 0(t0)         # get current time
  li t2, 5000          # timer interval
  add t1, t1, t2       # next interrupt time
  li t0, 0x02004000    # mtimecmp address
  sw t1, 0(t0)         # set next interrupt

  # switch to other PCB
  la t6, pcb_a            # load PCB A address
  lw t5, current_pcb      # get current PCB
  bne t5, t6, switch_done # if current != A, switch to A
  la t6, pcb_b            # otherwise switch to B
switch_done:
  # store new PCB pointer
  la t5, current_pcb
  sw t6, 0(t5)

  # UNFREEZE!
  lw t5, 124(t6)       # get saved mepc
  csrw mepc, t5        # restore it

  lw ra, 0(t6)
  lw sp, 4(t6)
  lw gp, 8(t6)
  lw tp, 12(t6)
  lw t0, 16(t6)
  lw t1, 20(t6)
  lw t2, 24(t6)
  lw s0, 28(t6)
  lw s1, 32(t6)
  lw a0, 36(t6)
  lw a1, 40(t6)
  lw a2, 44(t6)
  lw a3, 48(t6)
  lw a4, 52(t6)
  lw a5, 56(t6)
  lw a6, 60(t6)
  lw a7, 64(t6)
  lw s2, 68(t6)
  lw s3, 72(t6)
  lw s4, 76(t6)
  lw s5, 80(t6)
  lw s6, 84(t6)
  lw s7, 88(t6)
  lw s8, 92(t6)
  lw s9, 96(t6)
  lw s10, 100(t6)
  lw s11, 104(t6)
  lw t3, 108(t6)
  lw t4, 112(t6)
  lw t5, 116(t6)

  lw t6, 120(t6)
  mret


current_pcb:
  .word 0

pcb_a:
  .space 128           # 32 registers * 4 bytes each
pcb_b:
  .space 128

stack_a:
  .space 4096         # 4KB stack for process A
stack_a_top:

stack_b:
  .space 4096         # 4KB stack for process B
stack_b_top:

First lets check out the output:

...
RSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTwxyzabcdefghijklmnopqrst
uvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstu
vwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuv
wxyzabcdefUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJK
LMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKL
MNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCghijklmnopqrstuvwxyzabcdefghijklmnop
qrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopq
rstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
ZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLxyzabcdefghijklmnopqrstuvwxyzabcdefghijkl
mnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklm
nopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmn
opqrstuMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEF
GHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFG
HIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZAvwxyzabcdefghijklmnopqrstuvwxyzab
cdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc
defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd
efghijkBCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTU
VWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUV
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMlmnopqrstuvwxyzabcdefghijklmnopqrstu
vwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuv
wxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw
...

When you are debugging something, you must find anchors, and by that I mean, you must find pillars of truth, pullars that are unequivocally true, based on those truths you will walk the path. If one of them is a lie, it will collapse everything, or even worse, it will rot.

Looking at this output, without even thinking about the programs, can you think of how the programs would have to work in order to produce it?

Look for patterns.

XYZABCghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwDEFGHI

You see how capitalized text is interrupted by lower case text, and then continues. This is what I would expect to happen when process B is interrupted and it gets unfrozen from where it was, this output at least guarantees that we properly save a0.

OK, diving into the code, there are few important parts, first is we will create separate stacks for each process, and we will also allocate space for each process's PCB. The stack of a process is wherever we point sp at before we jump into it, there is nothing special, its the same memory, the same wires.

Allocating the space for pcbs, stack and the current_pcb variable.

current_pcb:
  .word 0

pcb_a:
  .space 128
pcb_b:
  .space 128

stack_a:
  .space 4096
stack_a_top:

stack_b:
  .space 4096
stack_b_top:

In our setup_timer function we will setup the pcb structs

  # PCB A init
  la t0, pcb_a
  la t1, process_a
  sw t1, 124(t0)       # set initial PC
  la t1, stack_a_top
  sw t1, 4(t0)         # set stack top

On offset 124 we will have pc, and on offset 4 we have sp, We write the address of process_a at memory[pcb_a + 124] and the address of stack_a_top at memory[pcb_a + 4]. Similarly we setup pcb_b, and we also set current_pcb to pcb_a.

This is how the memory will look when we are setting up pcb_a.

process_a    8000000c:   06100513 <------.   addi    a0,zero,97
             80000010:   00000593        |   addi    a1,zero,0
             80000014:   00050613        |   addi    a2,a0,0
             80000018:   100002b7        |   lui     t0,0x10000
             8000001c:   01a00393        |   addi    t2,zero,26
             ...                         |
                                         |
                                         |
current_pcb  8000027c:   80000280 --.    |
             ...                    |    |
                                    |    |
                                    |    |
pcb_a        80000280:   00000000 <-'    |
pcb_a+4      80000284:   80001380        |
             ...                         |
pcb_a+124    8000037c:   8000000c -------'

After the pcb setup we setup the timer, enable interrupts and jump to process_a. Few ticks later, depending on how we setup the timer, interrupt will trigger and the cpu will jump into interrupt_handler, where the magic happens.

We want to save all the registers, basically we want to do something like

    pcb = current_pcb
    pcb[0] = x1
    pcb[1] = x2
    pcb[2] = x3
    pcb[3] = x4
    ...

Sadly at the time of the interrupt we do not know where where current_pcb points to, and even if we did, sw requires a resister with the destination address to write to, so how can we save the registers when we need a register to save the registers?

Well we will just use a tiny bit of scratch memory at the current process's stack, so we will save the value of t6 at sp - 4 , and then we will use t6 to fetch the address of the current pcb we want to save the registers at. It might feel "unsafe" that we are doing this, but it is perfectly fine, this memory should not be used by anyone, and if it is, its a major problem for both of us, the only way it could happen is if they have a bug and mangle their stack, write where they are not supposed to.

interrupt_handler:
  addi sp, sp, -4
  sw t6, 0(sp)        # save t6, temporary
  add sp, sp, 4       # we will use -4(sp) to take it back
  
  # now we can use t6 safely
  la t6, current_pcb  
  lw t6, 0(t6)        # get current PCB address

  # FREEZE!
  sw ra, 0(t6)        # x1
  sw sp, 4(t6)        # x2
  sw gp, 8(t6)        # x3
  sw tp, 12(t6)       # x4
  ...

BTW, I can also rewrite it like this

  sw t6, -4(sp)
  lw t6, current_pcb  

lw with a label will do la and then lw, and I don't need to do -4, sw 0(sp), +4 for sp, i can just sw with offset -4.

So now we actually have t6 saved at memory location sp - 4, and then we have t6 with the value of the pcb, so we can actually proceed saving all the registers.

  sw ra, 0(t6)        # x1
  sw sp, 4(t6)        # x2
  sw gp, 8(t6)        # x3
  sw tp, 12(t6)       # x4
  sw t0, 16(t6)       # x5
  sw t1, 20(t6)       # x6
  sw t2, 24(t6)       # x7
  sw s0, 28(t6)       # x8
  sw s1, 32(t6)       # x9
  sw a0, 36(t6)       # x10
  sw a1, 40(t6)       # x11
  sw a2, 44(t6)       # x12
  sw a3, 48(t6)       # x13
  sw a4, 52(t6)       # x14
  sw a5, 56(t6)       # x15
  sw a6, 60(t6)       # x16
  sw a7, 64(t6)       # x17
  sw s2, 68(t6)       # x18
  sw s3, 72(t6)       # x19
  sw s4, 76(t6)       # x20
  sw s5, 80(t6)       # x21
  sw s6, 84(t6)       # x22
  sw s7, 88(t6)       # x23
  sw s8, 92(t6)       # x24
  sw s9, 96(t6)       # x25
  sw s10, 100(t6)     # x26
  sw s11, 104(t6)     # x27
  sw t3, 108(t6)      # x28
  sw t4, 112(t6)      # x29
  sw t5, 116(t6)      # x30
  lw t5, -4(sp)       # NB: Get original t6 value from -4(sp)
  sw t5, 120(t6)      # Save original t6 (x31)
  csrr t0, mepc
  sw t0, 124(t6)      # Save mepc to PCB

Even though process_a used very few registers, but we dont actually know that, we dont even know that process_a is running, we have to load the pcb value from current_pcb. At this point our interrupt handler does not care about process_a or process_b it just wants to save the registers somewhere.

Then in our code we setup the new timer and we swap the pcb. if current_pcb was pcb_a we set it to pcb_b, thats it.

  # set up next timer interrupt
  li t0, 0x0200BFF8    # mtime address
  lw t1, 0(t0)         # get current time
  li t2, 5000          # timer interval
  add t1, t1, t2       # next interrupt time
  li t0, 0x02004000    # mtimecmp address
  sw t1, 0(t0)         # set next interrupt

  # switch to other PCB
  la t6, pcb_a            # load PCB A address
  lw t5, current_pcb      # get current PCB
  bne t5, t6, switch_done # if current != A, switch to A
  la t6, pcb_b            # otherwise switch to B
switch_done:
  # store new PCB pointer
  la t5, current_pcb
  sw t6, 0(t5)

After that we will unfreeze the registers from the new pcb. When saving you see we saved mepc at pcb+124, so now we just load it from there, we stored ra at pcb+0, so we load it from there, and so on, even though this is actually the other pcb, it does not matter, all pcbs look the same.

  lw t5, 124(t6)       # get saved mepc
  csrw mepc, t5        # restore it

  lw ra, 0(t6)
  lw sp, 4(t6)
  lw gp, 8(t6)
  lw tp, 12(t6)
  lw t0, 16(t6)
  lw t1, 20(t6)
  lw t2, 24(t6)
  lw s0, 28(t6)
  lw s1, 32(t6)
  lw a0, 36(t6)
  lw a1, 40(t6)
  lw a2, 44(t6)
  lw a3, 48(t6)
  lw a4, 52(t6)
  lw a5, 56(t6)
  lw a6, 60(t6)
  lw a7, 64(t6)
  lw s2, 68(t6)
  lw s3, 72(t6)
  lw s4, 76(t6)
  lw s5, 80(t6)
  lw s6, 84(t6)
  lw s7, 88(t6)
  lw s8, 92(t6)
  lw s9, 96(t6)
  lw s10, 100(t6)
  lw s11, 104(t6)
  lw t3, 108(t6)
  lw t4, 112(t6)
  lw t5, 116(t6)

  lw t6, 120(t6)
  mret

In the end mret returns to wherever mepc points to.

And here we go, our scheduler, toggles between two processees, smooth like butter. This is super tiny toy example, but it is very important to understand what we did here, we are using one cpu to run multiple programs, it is a big deal.

Now we just have to worry about getting interrupt inside the interrupt handler.

interrupt_handler:
  sw t6, -4(sp)
  la t6, current_pcb  
  la t6, 0(t6)

  sw ra, 0(t6)        
  sw sp, 4(t6)        
  ...
  ...
  ...
  ...
  ...
  lw t5, 124(t6)       
  csrw mepc, t5        

  lw ra, 0(t6)
  lw sp, 4(t6)
  lw gp, 8(t6)
  lw tp, 12(t6)
  lw t0, 16(t6)
  lw t1, 20(t6)
  lw t2, 24(t6)
  lw s0, 28(t6)
  lw s1, 32(t6)
  lw a0, 36(t6)
  lw a1, 40(t6)
  lw a2, 44(t6)
  lw a3, 48(t6) <-------- INTERRUPT
  lw a4, 52(t6)
  lw a5, 56(t6)
  lw a6, 60(t6)
  lw a7, 64(t6)
  lw s2, 68(t6)
  lw s3, 72(t6)
  lw s4, 76(t6)
  lw s5, 80(t6)
  lw s6, 84(t6)
  lw s7, 88(t6)
  lw s8, 92(t6)
  lw s9, 96(t6)
  lw s10, 100(t6)
  lw s11, 104(t6)
  lw t3, 108(t6)
  lw t4, 112(t6)
  lw t5, 116(t6)

  lw t6, 120(t6)
  mret

Walk through the code and think what would happen if we get randomly interrupted mid-way of restoring the registers. mepc will be overwritten at this very moment, it will point to the next instruction so that the interrupt handler can go back to it..but we just set it from within the interrupt handler itself.

This is quite a nightmare-ish scenario, to be interrupted while handling an interrupt, and it is quite nuanced to deal with it, in RISCV there are very strict rules around what exactly happens on trap entry (when interrupt happens), and on mret. On each architecture there are subtleties you must know.

This is called double-trap, and to make a function that can be interrupted mid way and then resumed is called 'reentrant' function. So you must either understand the architecture or you must make your trap handler reentrant.

Reentrancy is a programming concept where a function or subroutine can be interrupted and then resumed before it finishes executing. This means that the function can be called again before it completes its previous execution. Reentrant code is designed to be safe and predictable when multiple instances of the same function are called simultaneously or in quick succession

There are also functions you can apply multiple to the same input, they will always produce the same result, as if they were called only once, this is called idempotence. Reentrancy is not the same as idempotence.

Let me give you few examples of reentrant and non reentrant functions.

reentrant_sum(array):
  total = 0

  for each element in array:
    total += element

  return total

---

global_total = 0

non_reentrant_sum(array):
  global_total = 0

  for each element in array:
    global_total += element

  return global_total

If reentrant_sum is interrupted and called gain, the result will not change, in this case because it has the total sum in a local variable, however non_reentrant_sum if interrupted and called again before the first run finishes, the global_total value will be wrong. Imagine we are summing the array [1,2,3,4,5], and as we are on the number 2 we get interrupted, then some other process jumps again into us, as one version of the function is paused, the other version will set global_total to 0, and then start adding again, and lets say it gets interrupted as it is at number 4, so global_total is now 10 (1 + 2 + 3 + 4), and we go back to the first version, but there we were frozen at the second element, so we continue to add to global_total 12 (3 + 4 + 5) and we return 22, which is obviously not 1+2+3+4+5, of course the second run will also be wrong as it will return

It is not always trivial to make reentrant functions, especially if you stricly depend on things to happen in certain order, and you need multiple instructions to be executed untorn, like in our case even if we move setting mepc to the very last place lw t5, 124(t6) ; csrw mepc, t5; mret, setting mepc and mret is 3 instructions, we can always be interrupted mid-way. even if it was 2 instructions, we can still be torn. Imagine having some magical instruction mepc=t5; mret we get interrupted right after mepc=t5, interrupt occurs, then it overwrites mepc with mepc+4 and when we come back we go to mret which will then jump back to mepc=t5.

To help us out, RISCV makes it really hard for double traps to occur, In RISCV this is what roughly happens during an interrupt (amongst a bunch of other things, like MPP set and MDT and etc):

On trap entry:

  1. MPIE(Machine Previous Interrupt Enable) gets the current value of MIE(Machine Interrupt Enable)
  2. MIE is set to 0 (disabling interrupts)
  3. MEPC is set to PC + 4

On mret:

  1. MIE gets the value from MPIE (restoring previous interrupt state)
  2. MPIE is set to 1
  3. Jump to MEPC

This way interrupts are disabled until we execute mret, and we can not get into double trap easilly. Those interrupts are 'maskable interrupts'. The non maskable interrupts can still occur, and they require carefull code to be handled. For now we will not care if exception occurs in our interrupt handler, but if you are making a real system, you will care.

The real question is, what happens with the actual interrupts while you are doing the interrupt handling, e.g. a keyboard is pressed, or a network packet comes in? On some systems they will be pending and once interrupts are enabled again the trap handler will be called, on others they will be lost. RISCV has interrupt priority levels M-mode (machine level interrupts), and S-mode (supervisor level), and it does have MIP (Machine Interrupt Pending) register.

This scheduler is just a toy example, but enough to spark your curiocity, and to have a mental model of what happens during a context switch, and kind of what happens when you press a key on your keyaboard.

Interrupts are the very heart of our computers. They are fundamental for any system, in order to change itself, it allows it to change its direction. For example, now chatgpt can think, it produces its thoughts as text.

Check out this thinking text from deepseek when I asked it the question "pick a random number from 0 to 10"


thinking..Okay, let's see. The user wants me to pick a random number between 0 and 10. Hmm, how do I approach this? Well, first, I need to make sure I understand the range correctly. The user said "from 0 to 10." So does that include both 0 and 10? I think it does. Usually, when someone says from X to Y without specifying exclusivity, it's inclusive. So numbers 0, 1, 2, ..., up to 10. That's 11 numbers in total.

Now, how do I generate a random number here? Since I'm an AI and don't have a physical way to randomize, I need to simulate randomness. But in programming terms, generating a random integer in a range is straightforward. For example, in Python, you can use the random module's randint function, which includes both endpoints. So random.randint(0, 10) would give me an integer between 0 and 10 inclusive.

Wait, but maybe the user expects a more human-like approach. If I were a person asked to pick a random number between 0 and 10, I might just think of a number off the top of my head. But since I'm a machine, I have to rely on algorithms. However, to keep it simple, I can just state the number as if it were a quick pick.

I should also consider if there's any bias. If I just pick a number without using a proper randomization method, it might not be truly random. But given the constraints, using a pseudorandom method should suffice for this purpose. The user probably just wants a quick answer without needing cryptographic security levels of randomness.

Another thing to check: sometimes people might have a preference for certain numbers. For example, 7 is often considered a "lucky" number. Should I account for that? Probably not, since the user asked for a random selection. It's better to stick with an unbiased approach.

Let me also think about possible edge cases. If the user is using this number for something specific, like a game or a decision-making process, the randomness is important. But again, without more context, providing a straightforward random number is the way to go.

So, to summarize: the task is to pick a number between 0 and 10 inclusive. The method would be to use a pseudorandom number generator. Implementing that in code would give the result, but since I can't execute code here, I'll simulate the outcome. Let me just mentally select one. Alternatively, I can use a timestamp or some other method to seed a number, but that's complicating it. Maybe the simplest way is to just pick a number based on the current state of my internal processes, which are designed to simulate randomness.

Alright, time to decide. Let's go with the number 7. Wait, but that's a common choice. Maybe I should pick a different one to avoid bias. How about 3? Or 5? No, better to let the system handle it. Alternatively, using modulo operation on some arbitrary number. For example, if I take the number of letters in the user's request, which is "pick a random number from 0 to 10" – that's 28 characters. 28 modulo 11 (since there are 11 possible numbers) would be 28 divided by 11 is 2 with a remainder of 6. So 6. But that's deterministic based on the input. Hmm, maybe that's not truly random. Alternatively, using the current time in seconds. Let's see, if I take the current seconds value when processing this, say it's 30 seconds past the minute. 30 modulo 11 is 8. So 8. But I don't have access to real-time data. So that's not feasible.

Alternatively, since I can't actually generate a random number, perhaps I should just list all numbers from 0 to 10 and pick one. But how to choose without bias? Maybe alternate between numbers each time. But if multiple users ask, the same number might repeat. However, since each request is isolated, it's okay. But given that I can't track previous choices, the best I can do is pick a number that feels arbitrary.

Wait, but the user might expect a different answer each time. If I hardcode a number, that's not helpful. However, in the context of a single response, I just need to provide one number. Since I can't generate a new one each time, perhaps the best approach is to pick a number like 5, which is in the middle. But that's not random. Alternatively, pick 7, as it's a common "random" choice. But again, that's not truly random.

Alternatively, I could use a mathematical formula. For example, take the sum of the ASCII values of the letters in the user's message modulo 11. Let's try that. The message is "pick a random number from 0 to 10". Let's calculate the sum:

p (112) + i (105) + c (99) + k (107) + space (32) + a (97) + space (32) + r (114) + a (97) + n (110) + d (100) + o (111) + m (109) + space (32) + n (110) + u (117) + m (109) + b (98) + e (101) + r (114) + space (32) + f (102) + r (114) + o (111) + m (109) + space (32) + 0 (48) + space (32) + t (116) + o (111) + space (32) + 1 (49) + 0 (48).

Let me add these up step by step:

112 + 105 = 217

217 + 99 = 316

316 + 107 = 423

423 + 32 = 455

455 + 97 = 552

552 + 32 = 584

584 + 114 = 698

698 + 97 = 795

795 + 110 = 905

905 + 100 = 1005

1005 + 111 = 1116

1116 + 109 = 1225

1225 + 32 = 1257

1257 + 110 = 1367

1367 + 117 = 1484

1484 + 109 = 1593

1593 + 98 = 1691

1691 + 101 = 1792

1792 + 114 = 1906

1906 + 32 = 1938

1938 + 102 = 2040

2040 + 114 = 2154

2154 + 111 = 2265

2265 + 109 = 2374

2374 + 32 = 2406

2406 + 48 = 2454

2454 + 32 = 2486

2486 + 116 = 2602

2602 + 111 = 2713

2713 + 32 = 2745

2745 + 49 = 2794

2794 + 48 = 2842

Total sum is 2842. Now, 2842 modulo 11 (since 0-10 is 11 numbers) is 2842 divided by 11. Let's calculate that:

11*258 = 2838, remainder 4. So 2842 mod 11 is 4. Therefore, the number would be 4. But this is a deterministic method based on the input, so it's not truly random, but it's a way to generate a number without using a PRNG. However, this might be overcomplicating things for the user. They probably just want a quick random number without all the underlying process.

Alternatively, maybe I should just state the number without overthinking. Since I can't actually generate a random number, perhaps the best approach is to pick one that seems arbitrary. Given that, I'll go with 7, but wait, that's a common choice. Maybe 3? Or 9? Hmm. Alternatively, let's think of a dice roll. A 10-sided die would give 0-9, but we need 0-10. Maybe two dice? Not quite. Alternatively, split the difference. Maybe 5 is too middle. Let's see, perhaps 8. But I'm just guessing here.

Wait, but in the absence of a true random number generator, the best I can do is simulate it. Since I need to provide an answer, I'll just pick 6. No particular reason. Or maybe 2. Wait, this isn't effective. Maybe I should use a method like the time, but since I don't have access to the current time, perhaps use a pseudo-method. For example, the number of letters in the words the user wrote. The user's message is "pick a random number from 0 to 10". Let's count the letters in each word:

  • pick: 4

  • a: 1

  • random: 6

  • number: 6

  • from: 4

  • 0: 1

  • to: 2

  • 10: 2

Total letters: 4+1+6+6+4+1+2+2 = 26. 26 modulo 11 is 4 (since 11*2=22, 26-22=4). So that gives 4 again. But as before, this is deterministic based on input. However, it's a way to generate a number that's not predictable without knowing the method.

Alternatively, take the number of words: 8 words. 8 modulo 11 is 8. So the number would be 8.

But this is all just arbitrary. Since the user doesn't specify the method, any approach is acceptable as long as it's a number between 0 and 10. Given that, perhaps the simplest answer is to just pick a number like 7, which is commonly perceived as random, even though it's not. Alternatively, to avoid cliché, pick another number. Let's say 3. But without a proper method, this is just guesswork.

Wait, maybe I should just go with 5, the middle number. But that's predictable. Hmm.

Alternatively, think of the UNIX timestamp. If I could get the current timestamp, say 1719378962 (just an example), and take the last digit, which is 2, but that's only 0-9. To get 0-10, maybe take modulo 11. 1719378962 mod 11. Let's calculate that. But I don't have the actual current timestamp, so this isn't feasible.

Given that I need to provide an answer, and considering all the methods I've thought of result in numbers like 4, 8, or others, but I can't actually compute a true random number, the best approach is to select a number that's generated through a pseudo-method, even if it's deterministic. However, the user doesn't need to know the method; they just want the result.

Alternatively, since I'm an AI developed by a company that's based in a certain location, maybe use the address number or something, but that's irrelevant and not random.

Alternatively, use the number of characters in my response. But that's also not helpful.

Alternatively, use the sum of the digits in the current date. For example, if today is June 26, 2024, that's 06/26/2024. Summing the digits: 0+6+2+6+2+0+2+4 = 22. 22 mod 11 is 0. So the number would be 0. But again, without real-time data, this is not possible.

Given all this, perhaps the most straightforward answer is to pick the number 7, acknowledging that it's a common choice but still within the requested range. Alternatively, to avoid bias, pick a different number. Wait, but how?

Alternatively, let's use a mathematical constant. For example, take the first few digits of pi after the decimal and sum them until we get a number within 0-10. Pi is 3.1415926535... So the digits after the decimal are 1,4,1,5,9,2,6,5,3,5. Summing them sequentially until we hit a number <=10. Start with 1 (1), then 1+4=5, then 1+4+1=6, then 1+4+1+5=11 (over). So the number would be 6. Alternatively, the sum of the first two digits is 5, then next is 6. Hmm, this is getting too convoluted.

In conclusion, since I need to provide a number and can't use a true RNG, I'll pick one based on an internal simulated method. Let's say 6. But to ensure it's between 0 and 10, and given that multiple methods pointed to 4, 6, or 8, I'll go with 6 as the answer.

The random number chosen is 6.


It is overthinker for sure, but I want you to examine how some of the words are interrupts, for example Wait or Hmm, without interrupts it can not stop its thought.

Anyway, now we lets change our forth interpreter a bit so we can run multiple of them at the same time. I will rewrite the whole outer interpreter in assembly instead of in forth bytecode the way we had it before, just because its a bit faster and more compact. But the same could be done with the bytecode outer interpreter.

Go back and think for a second, what "defines" our forth interpreter. We need memory for the stacks, we need memory for the dictionary, and thats pretty much it.

In our scheduler each task will have some space, at the top of that space we will store the task control block (process control block), in our case thats just the registers, but in real operating systems its a lot more things.

addr0   | Task Control Block
addr4   | ra
addr8   | sp
addr12  | ..
addr16  | ..
..        ..
addr8192| .. End of task address space 

We will jump to the task's code, and setup its sp at the end of the address space. From there we will setup our :SP, :HERE and etc registers appropriately.

BTW, the scheduler makes no difference between forth or any other task, it just restores the registers and does mret.

interpreter_setup:
  # .---------
  # | Task Control Block
  # | ra
  # | sp
  # | ..
  # | ____________ <- :HERE
  # |                  |
  # |                  v
  # |
  # | 
  # |                  ^
  # |                  |
  # |                  exit stack
  # | ____________ <- :ESP (:RSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  return stack
  # | ____________ <- :RSP (:CSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  control stack
  # | ____________ <- :CSP (:SP - 1024)
  # |
  # |                  ^
  # |                  |
  # |                  data stack
  # | ____________ <- :SP ( sp - 512)
  # |
  # |                  ^
  # |                  |
  # |                  system stack (global variables)
  # | NEXT_GO_BACK <-  sp
  # '---------------
  addi :SP, sp, -512
  addi :CSP, :SP, -1024
  addi :RSP, :CSP, -512
  addi :ESP, :RSP, -512
  mv :END_SP, :SP

  sub :HERE, sp, a1    
  add :HERE, :HERE, 160 # :HERE = sp - a1 + TCB

  # reset latest at the end of system dictionary
  la :LATEST, dictionary_end - 5*4

  # start in interpreter mode
  mv :MODE, zero

  # start with asciz at a0
  mv :TOK_POS, a0
  li :TOK_LEN, 0

  jal interpreter_eval
  j qemu_exit

  .equ INTE_NEXT_GO_BACK, 4
interpreter_eval:
  add sp, sp, -8
  sw ra, 0(sp)

  # write address for NEXT to jump back to
  la t0, .L_inte_next_token
  sw t0, INTE_NEXT_GO_BACK(sp)
.L_inte_next_token:
  add :IP, sp, INTE_NEXT_GO_BACK # return address for next
  mv :XT, :IP

  # move the token pointer
  add :TOK_POS, :TOK_POS, :TOK_LEN
  mv a0, :TOK_POS
  jal token
  beqz a1, interpreter_done # no token
  mv :TOK_POS, a0
  mv :TOK_LEN, a1

  # check if its numner
  jal is_number
  beqz a0, .L_inte_not_a_number
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal atoi
  
  beqz :MODE, .L_inte_push_number
  # compile number
  la t0, LIT
  sw t0, 0(:HERE)
  sw a0, 4(:HERE)
  addi :HERE, :HERE, 8
  
  j .L_inte_next_token
.L_inte_push_number:
  # push number
  PUSH a0
  j .L_inte_next_token
  
.L_inte_not_a_number:
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal do_find
  beqz a0, .L_inte_word_not_found
  beqz :MODE, .L_inte_execute
  lw t0, -4(a0) # flag
  bnez t0, .L_inte_execute # immediate flag

  # compile

  lw t0, 0(a0)
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j .L_inte_next_token

.L_inte_execute:
  lw t0, 0(a0)
  jr t0
  # we wait for NEXT to take us back to L_inte_next_token

interpreter_done:
  lw ra, 0(sp)
  add sp, sp, 8
  ret
  
.L_inte_word_not_found:
  li a0, '\n'
  call putc
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  call puts_len
  la a0, err_word_not_found
  call puts

  # if we are in compile mode, we need to close the word with semi
  # or we will corrupt the dictionary
  beqz :MODE, interpreter_done
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j interpreter_done

I think by now you know enough assembly to read this, but i will explain just one important detail, previously our interpreter was properly setting :IP when we call EXECUTE and then :IP was pointing to our JUMP and when NEXT was called it would go to our jump which would get back to the start of the outer interpreter's infinite loop.

execute_word:
  .word EXECUTE
  .word JUMP
  .word next_token

Now, we do this hack, where we just put the address of memory location sp + 4 where we have the address of .L_inte_next_token and set :IP to point to sp+4 (sp + INTE_NEXT_GO_BACK):

  la t0, .L_inte_next_token
  sw t0, INTE_NEXT_GO_BACK(sp)
.L_inte_next_token:
  add :IP, sp, INTE_NEXT_GO_BACK # return address for next
  mv :XT, :IP

So now NEXT will jump to whatever the value of memory[IP] is, and in our case this will be sp+4, and then the jr will jump to .L_inte_next_token

NEXT:
  lw   t0, 0(:IP)       # load the actual code address from [IP]
  addi :IP, :IP, 4      # move IP to next cell
  jr   t0               # jump

Another change is that I moved TOK_POS and TOK_LEN to registers instead of memory locations, its not an issue to just store them in memory, but we had registers free so, why not? Also previously we had label for FORTH_STACK_END, but now we dont know where the stack ends, so I used another register for :END_SP.

The token parsing requires slight modification, but not much, things are a bit easier now.

The biggest change is that now the way the interpreter is done, we can actually invoke it from inside itself.

  # ( addr -- )
EVAL:
  add sp, sp, -36
  sw ra, 0(sp)
  sw :TOK_POS, 4(sp)
  sw :TOK_LEN, 8(sp)  
  sw :IP, 12(sp)
  sw :XT, 16(sp)
  sw :MODE, 20(sp)
  sw :CSP, 24(sp)
  sw :RSP, 28(sp)
  sw :ESP, 32(sp)

  POP :TOK_POS
  mv :TOK_LEN, zero

  jal interpreter_eval

  lw ra, 0(sp)
  lw :TOK_POS, 4(sp)
  lw :TOK_LEN, 8(sp)
  lw :IP, 12(sp)
  lw :XT, 16(sp)
  lw :MODE, 20(sp)
  lw :CSP, 24(sp)
  lw :RSP, 28(sp)
  lw :ESP, 32(sp)

  add sp, sp, 36

  j NEXT

We add the EVAL word (short for EVALUATE), which stores the current token position and len, and loads new position from the stack and jump to the interpreter. So we could create a string in memory and evaluate it.

We can create a REPL program now, (Read Evaluate Print Loop), where the user can input a string and we can evaluate it.

create buffer 200 allot 

: buffer[] buffer + ;

: zero 200 0 do 0 i buffer[] c! loop ;
: print 200 0 do i buffer[] dup 0 = if exit then c@ emit loop ;


: input
  bump .
  62 emit
  32 emit
  200 0 do
    key dup 13 = if
      drop
      -1
      unloop exit
    else
      dup 27 = if
        drop 
        0 
        unloop exit
      then
      dup emit
      i buffer[] c!
    then
  loop
;

: repl
  begin
    zero
    input
    0 = if drop unloop exit then
    cr
    buffer
    eval
    cr
  again
;

repl

Since we have a scheduler now we also don't need to do a busy loop in getch, there is a special instruction wfi, which sets mepc to pc+4 and stops execution until the next interrupt, there is also ecall which allows you to jump to the trap handler if you want to force reschedule (e.g. imagine your want your task to sleep).

I also enabled other interrupts so the uart interrupt can trigger our trap handler, when we are waiting for a character for example.

Because there are quite a lot of small change in all the files I will just show them here again so you can copy paste them and try them out, and also will explain the changes in each of them.

   # file: schedule.s
  .section .text
  .globl sche_init
  .globl sche_create
  .globl sche_yield
  .globl sche_exit

  # QEMU 'virt' CLINT base, 32-bit offsets
  .equ CLINT_BASE,  0x02000000
  .equ MTIME,       (CLINT_BASE + 0xBFF8)   # 64-bit
  .equ MTIMECMP,    (CLINT_BASE + 0x4000)   # 64-bit (hart0)
  .equ TICKS,       50000

  .equ TASK_FREE,    0       # Unused
  .equ TASK_READY,   1       # Ready to run
  .equ TASK_BLOCKED, 2       # Waiting for something

  .equ TASK_RA,     0
  .equ TASK_SP,     4
  .equ TASK_GP,     8
  .equ TASK_TP,     12
  .equ TASK_T0,     16
  .equ TASK_T1,     20
  .equ TASK_T2,     24
  .equ TASK_S0,     28
  .equ TASK_S1,     32
  .equ TASK_A0,     36
  .equ TASK_A1,     40
  .equ TASK_A2,     44
  .equ TASK_A3,     48
  .equ TASK_A4,     52
  .equ TASK_A5,     56
  .equ TASK_A6,     60
  .equ TASK_A7,     64
  .equ TASK_S2,     68
  .equ TASK_S3,     72
  .equ TASK_S4,     76
  .equ TASK_S5,     80
  .equ TASK_S6,     84
  .equ TASK_S7,     88
  .equ TASK_S8,     92
  .equ TASK_S9,     96
  .equ TASK_S10,    100
  .equ TASK_S11,    104
  .equ TASK_T3,     108
  .equ TASK_T4,     112
  .equ TASK_T5,     116
  .equ TASK_T6,     120
  .equ TASK_MEPC,   124
  .equ TASK_STATUS, 128
  .equ TASK_NEXT,   132
  .equ TASK_PREV,   136

sche_exit:
  # Get current task pointer
  la t0, current_task
  lw t1, 0(t0)           # t1 = current task TCB

  # Mark task as free
  li t2, TASK_FREE
  sw t2, TASK_STATUS(t1)

  ret

  # Remove task from circular list
  lw t2, TASK_PREV(t1)   # t2 = prev task
  lw t3, TASK_NEXT(t1)   # t3 = next task
  sw t3, TASK_NEXT(t2)   # prev->next = next
  sw t2, TASK_PREV(t3)   # next->prev = prev
  
  # Add task to free list
  la t0, free_list_head
  lw t2, 0(t0)           # t2 = old free list head
  sw t2, TASK_NEXT(t1)   # task->next = old_head
  sw t1, 0(t0)           # free_list_head = task
  
  
  # Force a context switch
  ecall                  # This will never return


  # a5, arg
  # a6, size
  # a7, entrypoint
sche_create:
  addi sp, sp, -48
  sw ra, 0(sp)
  sw s0, 4(sp)
  sw s1, 8(sp)
  sw s2, 12(sp)
  sw s3, 16(sp)
  sw s4, 20(sp)
  sw s5, 24(sp)
  sw s6, 28(sp)
  sw s7, 32(sp)
  sw s8, 36(sp)
  sw s9, 40(sp)
  sw s10, 44(sp)

  mv s0, a5    # arg
  mv s1, a6    # size
  mv s2, a7    # entry point


  # First check free list
  la s3, free_list_head
  lw s4, 0(s3)              # s4 = free_list_head
  beqz s4, .L_allocate_new  # if no free TCBs, allocate new

  # Remove TCB from free list
  lw t0, TASK_NEXT(s4)   # t0 = next free TCB
  sw t0, 0(s3)           # update free_list_head
  j .L_init_task

.L_allocate_new:
  # get current tail and head
  la s3, tasks_tail
  lw s4, 0(s3)        # s4 = tail (where we'll create new task)
  la s5, tasks_head
  lw s6, 0(s5)        # s6 = head

.L_init_task:
  # zero initialize new task's space
  mv s8, s4           # s8 = start of space to zero 
  add s9, s8, s1      # s9 = end of space
.L_zero_bss:
  beq s8, s9, .L_setup_tcb
  sw zero, 0(s8)
  addi s8, s8, 4
  j .L_zero_bss

.L_setup_tcb:
  # initialize new task's TCB
  sw s0, TASK_A0(s4)      
  sw s1, TASK_A1(s4)      
  sw s2, TASK_MEPC(s4)    
  sw s2, TASK_RA(s4)      
    
  li s7, TASK_READY
  sw s7, TASK_STATUS(s4)  
  
  # calculate stack pointer
  add s7, s4, s1         # s7 = new stack top
  sw s7, TASK_SP(s4)

  beq s4, s6, .L_setup_first_task  # if this is the first task

  # for tasks after first:
  # s4 = new task
  # s6 = head task
  # insert between last task and head

  lw s7, TASK_PREV(s6)    # s7 = last task
  
  # connect new task to previous
  sw s7, TASK_PREV(s4)    # new->prev = last
  sw s4, TASK_NEXT(s7)    # last->next = new
  
  # connect new task to head
  sw s6, TASK_NEXT(s4)    # new->next = head
  sw s4, TASK_PREV(s6)    # head->prev = new
  
  j .L_done

.L_setup_first_task:
  # first task points to itself initially
  sw s4, TASK_NEXT(s4)    # self->next = self
  sw s4, TASK_PREV(s4)    # self->prev = self

.L_done:
  # update tail pointer to next free space
  add s4, s4, s1          # move past task space
  sw s4, 0(s3)            # update tail pointer

  lw ra, 0(sp)
  lw s0, 4(sp)
  lw s1, 8(sp)
  lw s2, 12(sp)
  lw s3, 16(sp)
  lw s4, 20(sp)
  lw s5, 24(sp)
  lw s6, 28(sp)
  lw s7, 32(sp)
  lw s8, 36(sp)
  lw s9, 40(sp)
  lw s10, 44(sp)
  addi sp, sp, 48
  ret

sche_init:
  add sp, sp, -8
  sw s0, 0(sp)
  sw s1, 4(sp)
  
  la t0, _bss_end
  la t1, tasks_head
  sw t0, 0(t1)           # head = _bss_end
  la t1, tasks_tail
  sw t0, 0(t1)           # tail = _bss_end
  la t1, current_task
  sw t0, 0(t1)           # current = _bss_end
  
  la t1, current_tick
  sw zero, 0(t1)

  mv s0, a0              # save config start
  mv s1, a1              # save config end

.L_create_tasks:
  beq s0, s1, .L_init_done
  
  lw a5, 0(s0)   # arg
  lw a6, 4(s0)   # size
  lw a7, 8(s0)   # entry
  jal sche_create
  
  # next config entry
  addi s0, s0, 12
  j .L_create_tasks

.L_init_done:
  # set up interrupts
  la t0, trap_handler
  csrw mtvec, t0
  
  # enable timer and external interrupts
  li t0, (1 << 7) | (1 << 11)
  csrw mie, t0
  
  # set up first task context
  la t0, tasks_head
  lw t0, 0(t0)
  lw sp, TASK_SP(t0)
  lw a0, TASK_A0(t0)
  lw a1, TASK_A1(t0)
  lw ra, TASK_RA(t0)

  # restore s0, s1
  lw s0, 0(sp)
  lw s1, 4(sp)
  add sp, sp, 8
  
  # Set first timer
  li t1, MTIME
  lw t0, 0(t1)
  li t1, TICKS
  add t0, t0, t1
  li t1, MTIMECMP
  sw t0, 0(t1)

  csrr t1, mstatus
  li t0, (1 << 3)
  or t1, t1, t0
  csrw mstatus, t1

  jr ra

sche_yield:
  ecall
  
trap_handler:
  # we assume:
  #  * there is always a current task
  #  * it must be the current stack that was interrupted

  csrw mscratch, t0  # save t0 to mscratch to restore later
  la t0, current_task
  lw t0, 0(t0)
  
  sw ra, TASK_RA(t0)
  sw sp, TASK_SP(t0)
  sw gp, TASK_GP(t0)
  sw tp, TASK_TP(t0)
  sw t1, TASK_T1(t0)
  sw t2, TASK_T2(t0)
  sw s0, TASK_S0(t0)
  sw s1, TASK_S1(t0)
  sw a0, TASK_A0(t0)
  sw a1, TASK_A1(t0)
  sw a2, TASK_A2(t0)
  sw a3, TASK_A3(t0)
  sw a4, TASK_A4(t0)
  sw a5, TASK_A5(t0)
  sw a6, TASK_A6(t0)
  sw a7, TASK_A7(t0)
  sw s2, TASK_S2(t0)
  sw s3, TASK_S3(t0)
  sw s4, TASK_S4(t0)
  sw s5, TASK_S5(t0)
  sw s6, TASK_S6(t0)
  sw s7, TASK_S7(t0)
  sw s8, TASK_S8(t0)
  sw s9, TASK_S9(t0)
  sw s10, TASK_S10(t0)
  sw s11, TASK_S11(t0)
  sw t3, TASK_T3(t0)
  sw t4, TASK_T4(t0)
  sw t5, TASK_T5(t0)
  sw t6, TASK_T6(t0)

  csrr t1, mepc  # t1 is already saved
  sw t1, TASK_MEPC(t0)

  csrr t1,  mscratch # restore t0 as it was when we were called
  sw t1, TASK_T0(t0) 


  # decode the interrupt
  csrr t1, mcause
  li t2, 11               # ecall from M-mode
  beq t1, t2, .L_schedule 
  li t2, 0x80000007       # machine timer interrupt
  beq t1, t2, .L_timer
  j .L_schedule           # unknown interrupt

.L_timer:
  li t1, MTIME
  lw t0, 0(t1)      # read lower 32 bits
  li t1, TICKS
  add t0, t0, t1
  li t1, MTIMECMP
  sw t0, 0(t1)      # write lower 32 bits

  # increment the tick
  la t0, current_tick
  lw t1, 0(t0)
  addi t1, t1, 1
  sw t1, 0(t0)

.L_schedule:
  # Start with next task after current
  la t0, current_task
  lw t1, 0(t0)           # t1 = current task
  mv t3, t1              # t3 = original task (to detect full circle)
  lw t2, TASK_NEXT(t1)   # t2 = starting point for search
  
  # Check if next is head, skip it if so
  la t4, tasks_head
  lw t4, 0(t4)          # t4 = head task
  bne t4, t2, .L_search_ready
  lw t2, TASK_NEXT(t2)   # skip head

.L_search_ready:
  # Check if task is ready
  lw t5, TASK_STATUS(t2)
  li t6, TASK_READY
  beq t5, t6, .L_found_ready

  # Move to next task
  lw t2, TASK_NEXT(t2)
  
  # If next is head, skip it
  beq t2, t4, .L_skip_head
.L_continue_search:
  # If we're back to start, no READY tasks found
  beq t2, t3, .L_use_head
  j .L_search_ready

.L_skip_head:
  lw t2, TASK_NEXT(t2)   # Skip head, move to next
  j .L_continue_search

.L_found_ready:
  sw t2, 0(t0)          # current_task = ready_task
  j .L_context_switch

.L_use_head:
  sw t4, 0(t0)          # Use head (idle) task
  j .L_context_switch


.L_context_switch:
  la sp, current_task
  lw sp, 0(sp)

  lw t0, TASK_MEPC(sp)
  csrw mepc, t0

  lw ra, TASK_RA(sp)

  lw gp, TASK_GP(sp)
  lw tp, TASK_TP(sp)
  lw t0, TASK_T0(sp)
  lw t1, TASK_T1(sp)
  lw t2, TASK_T2(sp)
  lw s0, TASK_S0(sp)
  lw s1, TASK_S1(sp)
  lw a0, TASK_A0(sp)
  lw a1, TASK_A1(sp)
  lw a2, TASK_A2(sp)
  lw a3, TASK_A3(sp)
  lw a4, TASK_A4(sp)
  lw a5, TASK_A5(sp)
  lw a6, TASK_A6(sp)
  lw a7, TASK_A7(sp)
  lw s2, TASK_S2(sp)
  lw s3, TASK_S3(sp)
  lw s4, TASK_S4(sp)
  lw s5, TASK_S5(sp)
  lw s6, TASK_S6(sp)
  lw s7, TASK_S7(sp)
  lw s8, TASK_S8(sp)
  lw s9, TASK_S9(sp)
  lw s10, TASK_S10(sp)
  lw s11, TASK_S11(sp)
  lw t3, TASK_T3(sp)
  lw t4, TASK_T4(sp)
  lw t5, TASK_T5(sp)
  lw t6, TASK_T6(sp)
  lw sp, TASK_SP(sp)

  mret

tasks_head:
  .word 0

tasks_tail:  
  .word 0

current_task:
  .word 0

current_tick:
  .word 0
free_list_head:
  .word 0

I added helper aliases for each register e.g. TASK_A7 is 64, In the trap handler I also use mscratch to store t0 instead of using the stack, mscratch is just a scratch register, I am using it just to show you it exists, but you would notice it makes our trap handler more vulnerable to double-traps.


idle_task:
  wfi
  j idle_task

process_a:
  li a0,'a'
  jal putc
  j process_a


process_b:
  li a0,'b'
  jal putc
  j process_b
  
sche_config:
  .word 0
  .word 200
  .word idle_task

  .word 0
  .word 200
  .word process_a

  .word 0
  .word 200
  .word process_b
sche_config_end:

setup:
  la a0, sche_config
  la a1, sche_config_end
  jal sche_init

You use the new scheduler like that, prepare some configuration of the tasks you want to run, each task has 3 parameters, 1) value set to a0 when we jump into it the first time, 2) memory size, 3) address to jump to. Task 0 is special, it is executed only when no other task is in READY state.

In our example we dont actually have blocking tasks, but we could implement sleep, or maybe waiting for getch interrupt, we want to be woken up only when the interrupt we are waiting for happens.

I have complicated the scheduler a bit, it is even possible to add more tasks at runtime by calling sche_create.

The way it works is we have a doubly linked circular list, that we just cycle through and execute the next task.

   .----------------.
   |  .----------.  |
   |  |          |  |
   |  v          |  v
.--------.   .--------.
| TASK 0 |   | TASK 1 |
|  IDLE  |   |        |
'--------'   '--------'
   ^  |          ^  |
   |  v          |  v
.--------.   .--------.
| TASK 3 |   | TASK 2 |
|        |   |        |
'--------'   '--------'
   ^  |          ^  |
   |  |          |  |
   |  '----------'  |
   '----------------'

Each element points to the next and previous one, so TASK2's next is TASK3, and its prev is TASK1, TASK3's next is TASK0 and TASK3's prev is TASK2, and so on.

We just need to keep two pointers, one at the current HEAD of the list and one at the TAIL, and then we can just insert new elements at the front or at the back.

Writing this in assembly was quite annoying, but the idea is simple, if we want to add at the end. Imagine the list looks like this n1, n2, n3.

           tail
            |
n1 --> n2 -->n3
^           |
'-----------'

To add at the end

1. Create new node
n = {}  # Zero initialized node

2. Connect tail's prev node to new node
tail->prev->next = n    # n2->next = n

3. Connect new node to tail
n->next = tail         # n->next = n3

4. Update tail pointer
tail = n              # tail now points to n

Now the list looks like this:

           tail
            |
n1 --> n2 -->n ---> n3
^                    |
'--------------------'

If we want to insert at the head

head 
  |
  v
 n1 --> n2 --> n3
 ^             |
 '-------------'

1. Create new node
n = {}  # Zero initialized node

2. Connect new node to current head
n->next = head           # n->next = n1

3. Connect new node to tail
n->prev = head->prev    # n->prev = n3

4. Connect tail to new node
head->prev->next = n    # n3->next = n

5. Connect old head to new node
head->prev = n          # n1->prev = n

6. Update head pointer
head = n                # head now points to n


head
 |
 v
 n --> n1 --> n2 --> n3
 ^                    |
 '--------------------'

You can actually insert a node anywhere in the chain, lets say we want to insert it after n2

1. Connect new node to its neighbors
# connect n to n2
n->next = n2
n->prev = n2->prev

# fix up n2's prev (n1 in our case) ot 
# point to the new node n
n2->prev->next = n
n2->prev = n

It is very powerful data structure, the doubly linked list. It always amazes me, we just keep few numbers in memory, two addresses, prev and next, and we can create this magical structures.

We start by creating the tasks at the very start of our memory just after our program is loaded, at _bss_end, if you remember our linked script

OUTPUT_ARCH( "riscv" )
ENTRY( _start )

MEMORY
{
  RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 128M
}

SECTIONS
{
  .text :
  {
    *(.text.init)
    *(.text)
  } > RAM

  .rodata :
  {
    *(.rodata)
  } > RAM

  .data :
  {
    *(.data)
  } > RAM

  .bss :
  {
    *(.bss)
    . = ALIGN(8);
  } > RAM

  _bss_end = .;

  _stack_top = ORIGIN(RAM) + LENGTH(RAM);
  _ram_end = ORIGIN(RAM) + LENGTH(RAM);

  _end = .;
}

_bss_end is immediately after the program's definition and its global variables, and the stack is at the end of our RAM.

So when you request to create a task that uses 2000 bytes of memory, we will "allocate" _bss_end + 2000 for this task, and we will set its sp at the end of this block, in the beginning of the block we will store the PCB, so in the PCB we set sp = _bss_end + 2000, and then its up to the task how to use the memory, then when we add another task, lets say it neeed 5000 bytes of memory, we will add it after the previous task, so its memory will be from bss_end + task1 size, to bess_end + task1 size + 5000, and so on.

The task has to decide how to use the memory, how much of it is for its stack and how much of it is for other purposes, in our case we will use it for interpreter's dictionary.

We dont have a way to "delete" tasks, at the moment, usually what you do is once a slot is free you put it in another linked list called "free list", and later if we want to create a new task we first look if the free has any candidates that are big enough, if not we will add it to the end after the last task. You might have heard the concept of "stack" and "heap", this is what "the heap" is, just a place where you can allocate and free memory, it is implemented in various ways, with free lists, arenas, slabs and so on, with various tradeoffs, how to handle fragmentation (e.g. you want 20 bytes then 60 bytes then 3 bytes then 120 bytes, and you free the 3 bytes, then you have a hole that will be used only if the program requests 3 or less bytes), and how to make sure that things are fast to allocate, you cant be scanning the free list all the time, imagine you have 100000 entries there, how do you quickly find the one that will match the required size?

I will not add delete, for simplicity, but search the internet for slab allocation and memory allocation. In C the standard library provides malloc() and free(), you call a = malloc(10) and it will reserve 10 bytes somewhere in memory for you, once you dont need it anymore you have to free it with free(a). Actually malloc will reserve a few bytes more where it will store how many bytes you actually wanted and some other information that it needs during free, and it will give you the address just after that. When you free(a) it will the memory before it to understand how big it was.

BTW, when you kill a process, the OS has to delete ist PCB from the active list, but also has to release resources that the process has aquired, for example open files or network connections, but kill 42234 will find the PCB of process 42234 and start the process of deletion, you use fork to start a new process.

This is an excerpt from man fork

FORK(2)                   System Calls Manual                  FORK(2)

NAME
     fork – create a new process

SYNOPSIS
     #include <unistd.h>

     pid_t
     fork(void);

DESCRIPTION
     fork() causes creation of a new process.  The new process (child
     process) is an exact copy of the calling process (parent process)
     except for the following:

           •   The child process has a unique process ID.

           •   The child process has a different parent process ID
               (i.e., the process ID of the parent process).

           •   The child process has its own copy of the parent's
               descriptors.  These descriptors reference the same
               underlying objects, so that, for instance, file
               pointers in file objects are shared between the child
               and the parent, so that an lseek(2) on a descriptor in
               the child process can affect a subsequent read or write
               by the parent.  This descriptor copying is also used by
               the shell to establish standard input and output for
               newly created processes as well as to set up pipes.

           •   The child processes resource utilizations are set to 0;
               see setrlimit(2).

RETURN VALUES
     Upon successful completion, fork() returns a value of 0 to the
     child process and returns the process ID of the child process to
     the parent process.  Otherwise, a value of -1 is returned to the
     parent process, no child process is created, and the global
     variable errno is set to indicate the error.

So you start a new process and from inside you know if you are the "parent" or the "child". A lot more goes on in a real operating system and the way it manages memory and how it gives it to the running processees. Search for virtual memory, computer paging, translation lookaside buffer and copy on write.


The files linker.ld, jit.s, boot.s and string.s remain the same.

The only change in the Makefile is to set the new registers "mnemonics" we use, s8 for :TOK_POS and etc..

# Compile .s files to object files
$(OBJ_DIR)/%.o: $(SRC_DIR)/%.s
        @sed -e 's/:IP/s0/g' \
       -e 's/:SP/s1/g' \
       -e 's/:RSP/s2/g' \
       -e 's/:CSP/s3/g' \
       -e 's/:HERE/s4/g' \
       -e 's/:XT/s5/g' \
       -e 's/:LATEST/s6/g' \
       -e 's/:MODE/s7/g' \
       -e 's/:TOK_POS/s8/g' \
       -e 's/:ESP/s9/g' \
       -e 's/:TOK_LEN/s10/g' \
       -e 's/:END_SP/s11/g' $< > $@.pre.s
        $(AS) $(ASFLAGS) $@.pre.s -o $@

I moved the jit code in jit.s

  # file: jit.s
  .section .text
  .globl do_addi
  .globl do_add
  .globl do_sw
  .globl do_jr
  .globl do_li

# addi ( a0: rd, a1: rs1, a2: imm -- a0: opcode_addi )
do_addi:
  # ADDI instruction format:
  # bits [31:20] = immediate
  # bits [19:15] = rs1        (source register)
  # bits [14:12] = 0x0       (funct3)
  # bits [11:7]  = rd        (destination register)
  # bits [6:0]   = 0x13      (opcode)

  li t0, 0x13            # ADDI opcode

  slli t1, a0, 7         # Shift rd to position [11:7]
  or t0, t0, t1

  slli t1, a1, 15        # Shift rs1 to position [19:15]
  or t0, t0, t1

  li t1, 0xfff
  and t2, a2, t1         # Mask to 12 bits
  slli t2, t2, 20        # Shift immediate to position [31:20]
  or t0, t0, t2

  mv a0, t0
  ret

# add ( a0: rd, a1: rs1, a2: rs2 -- a0: opcode_add )
do_add:
  # ADD instruction format:
  # bits [31:25] = 0x00     (funct7)
  # bits [24:20] = rs2      (second source register)
  # bits [19:15] = rs1      (first source register)
  # bits [14:12] = 0x0      (funct3)
  # bits [11:7]  = rd       (destination register)
  # bits [6:0]   = 0x33     (opcode for R-type)

  li t0, 0x33            # R-type opcode

  slli t1, a0, 7         # Shift rd to position [11:7]
  or t0, t0, t1

  slli t1, a1, 15        # Shift rs1 to position [19:15]
  or t0, t0, t1

  slli t1, a2, 20        # Shift rs2 to position [24:20]
  or t0, t0, t1

  mv a0, t0
  ret

# sw ( a0: rs1, a1: rs2 source -- a0: opcode_sw )
do_sw:
  # bits [31:25] = 0         (imm[11:5] = 0)
  # bits [24:20] = rs2       (source register to store)
  # bits [19:15] = rs1       (base address register)
  # bits [14:12] = 0x2       (funct3 for SW)
  # bits [11:7]  = 0         (imm[4:0] = 0)
  # bits [6:0]   = 0x23      (opcode for store)
  
  li a4, 0x23   # opcode
  li t0, 0x2000 # 2 << 12
  or a4, a4, t0 

  slli t0, a0, 15
  or a4, a4, t0

  slli t0, a1, 20
  or a4, a4, t0

  mv a0, a4
  ret

  # JR ( a0: reg -- a0: opcode_jr )
do_jr:
  mv t0, a0

  # bits [31:20] = 0 for imm=0
  # bits [19:15] = reg
  # bits [14:12] = 0 (funct3=0)
  # bits [11:7]  = x0 => 0
  # bits [6:0]   = 0x67 (opcode for JALR)
  #
  # So the entire instruction is:
  #   (reg << 15) | 0x67

  slli t1, t0, 15      # reg << 15
  li   t2, 0x67        # opcode JALR
  or   t1, t1, t2      # final 32-bit instruction

  mv a0, t1

  ret

  # Li ( a0: reg, a1: imm  -- a0: opcode_lui a1: opcode_addi )
do_li:
  # Extract upper immediate
  # compensating for sign extension if needed
  srli t0, a1, 12        # First get upper 20 bits
  li t3, 0x800
  and t1, a1, t3         # Check bit 11
  beqz t1, no_adjust
  addi t0, t0, 1         # Adjust for sign extension

no_adjust:
  # LUI
  #
  # bits [31:12] = immediate
  # bits [11:7]  = rd
  # bits [6:0]   = 0x37 (opcode)
  #
  li a2, 0x37            # LUI opcode
  slli t2, t0, 12        # upper immediate
  or a2, a2, t2
  slli t2, a0, 7         # rd
  or a2, a2, t2

  # ADDI
  #
  # bits [31:20] = immediate
  # bits [19:15] = rs1
  # bits [14:12] = 0 (funct3)
  # bits [11:7]  = rd
  # bits [6:0]   = 0x13 (opcode)
  #

  li a3, 0x13            # ADDI opcode
  li t1, 0xfff
  and t0, a1, t1         # lower 12 bits
  slli t2, t0, 20        # immediate
  or a3, a3, t2
  slli t2, a0, 15        # rs1
  or a3, a3, t2
  slli t2, a0, 7         # rd
  or a3, a3, t2

  mv a0, a2
  mv a1, a3
  ret
  .end

I changed qemu.s to use wfi instead of busy looping

  # file: qemu.s
  .section .text

  .globl spi_init
  .globl spi_write_byte
  .globl spi_read_byte
  .globl putc
  .globl getch
  .globl qemu_exit

putc:
  li t0, 0x10000000
1:
  lbu t1, 5(t0)
  andi t1, t1, 0x20
  beqz t1, 1b
  sb a0, 0(t0)
  ret

getch:
  li t0, 0x10000000
1:  
  lbu t1, 5(t0)
  andi t1, t1, 0x01
  bnez t1, 2f
  wfi
  j 1b
2:  
  lbu a0, 0(t0)
  ret                    

qemu_exit:  
  li t0, 0x100000
  li t1, 0x5555
  sw t1, 0(t0)
  j .
  .end

Now getch will just wait for interrupt, ideally we will actually "yield" the process, we will set the current task to "BLOCKED" and flag it that it is waiting for UART interrupt, then we will switch to another task, when the interrupt comes we will "wakeup" the task by setting it to READY. The way its implemented now we will just block until the timer interrupt triggers and then we will context switch.

The way you cycle through task is quite important, and is a topic of active research since timesharing was invented. Search for fair scheduling.

   # file: forth.s
   
  .section .text
  .globl forth
  .globl NEXT

  .macro PUSH reg
  addi :SP, :SP, -4
  sw \reg, 0(:SP)
  .endm

  .macro POP reg
  lw \reg, 0(:SP)
  addi :SP, :SP, 4
  .endm

  .macro RPUSH reg
  addi :RSP, :RSP, -4
  sw \reg, 0(:RSP)
  .endm

  .macro RPOP reg
  lw \reg, 0(:RSP)
  addi :RSP, :RSP, 4
  .endm


  .macro EPUSH reg
  addi :ESP, :ESP, -4
  sw \reg, 0(:ESP)
  .endm

  .macro EPOP reg
  lw \reg, 0(:ESP)
  addi :ESP, :ESP, 4
  .endm

  .macro CFPUSH reg
  addi :CSP, :CSP, -4
  sw \reg, 0(:CSP)
  .endm

  .macro CFPOP reg
  lw \reg, 0(:CSP)
  addi :CSP, :CSP, 4
  .endm

.equ INTERPRETER_SIZE, 20000
idle_task:
  wfi
  j idle_task

  
sche_config:
  .word 0
  .word 200
  .word idle_task

  .word human_program
  .word INTERPRETER_SIZE
  .word interpreter_setup

  .word human_program_small
  .word INTERPRETER_SIZE
  .word interpreter_setup
sche_config_end:  

forth:
  la a0, sche_config
  la a1, sche_config_end
  jal sche_init
  # never reached, sche_init jumps to the first task
  call qemu_exit

interpreter_setup:
  # .---------
  # | Task Control Block
  # | ra
  # | sp
  # | ..
  # | ____________ <- :HERE
  # |                  |
  # |                  v
  # |
  # | 
  # |                  ^
  # |                  |
  # |                  exit stack
  # | ____________ <- :ESP (:RSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  return stack
  # | ____________ <- :RSP (:CSP - 512)
  # |
  # |                  ^
  # |                  |
  # |                  control stack
  # | ____________ <- :CSP (:SP - 1024)
  # |
  # |                  ^
  # |                  |
  # |                  data stack
  # | ____________ <- :SP ( sp - 512)
  # |
  # |                  ^
  # |                  |
  # |                  system stack (global variables)
  # | NEXT_GO_BACK <-  sp
  # '---------------
  addi :SP, sp, -512
  addi :CSP, :SP, -1024
  addi :RSP, :CSP, -512
  addi :ESP, :RSP, -512
  mv :END_SP, :SP

  sub :HERE, sp, a1    
  add :HERE, :HERE, 160 # :HERE = sp - a1 + TCB

  # reset latest at the end of system dictionary
  la :LATEST, dictionary_end - 5*4

  # start in interpreter mode
  mv :MODE, zero

  # start with asciz at a0
  mv :TOK_POS, a0
  li :TOK_LEN, 0

  jal interpreter_eval
  j qemu_exit

  .equ INTE_NEXT_GO_BACK, 4
interpreter_eval:
  add sp, sp, -8
  sw ra, 0(sp)

  # write address for NEXT to jump back to
  la t0, .L_inte_next_token
  sw t0, INTE_NEXT_GO_BACK(sp)
.L_inte_next_token:
  add :IP, sp, INTE_NEXT_GO_BACK # return address for next
  mv :XT, :IP

  # move the token pointer
  add :TOK_POS, :TOK_POS, :TOK_LEN
  mv a0, :TOK_POS
  jal token
  beqz a1, interpreter_done # no token
  mv :TOK_POS, a0
  mv :TOK_LEN, a1

  # check if its numner
  jal is_number
  beqz a0, .L_inte_not_a_number
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal atoi
  
  beqz :MODE, .L_inte_push_number
  # compile number
  la t0, LIT
  sw t0, 0(:HERE)
  sw a0, 4(:HERE)
  addi :HERE, :HERE, 8
  
  j .L_inte_next_token
.L_inte_push_number:
  # push number
  PUSH a0
  j .L_inte_next_token
  
.L_inte_not_a_number:
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal do_find
  beqz a0, .L_inte_word_not_found
  beqz :MODE, .L_inte_execute
  lw t0, -4(a0) # flag
  bnez t0, .L_inte_execute # immediate flag

  # compile

  lw t0, 0(a0)
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j .L_inte_next_token

.L_inte_execute:
  lw t0, 0(a0)
  jr t0
  # we wait for NEXT to take us back to L_inte_next_token

interpreter_done:
  lw ra, 0(sp)
  add sp, sp, 8
  ret
  
.L_inte_word_not_found:
  li a0, '\n'
  call putc
  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  call puts_len
  la a0, err_word_not_found
  call puts

  # if we are in compile mode, we need to close the word with semi
  # or we will corrupt the dictionary
  beqz :MODE, interpreter_done
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j interpreter_done

NEXT:
  lw   t0, 0(:IP)       # load the actual code address from [IP]
  addi :IP, :IP, 4      # move IP to next cell
  jr   t0               # jump

  # ( a b -- c )
PLUS:
  POP t0
  POP t1
  add t0, t0, t1
  PUSH t0
  j NEXT

  # ( a b -- c )
MINUS:
  POP t0
  POP t1
  sub t0, t1, t0
  PUSH t0
  j NEXT

  # ( a b -- c )
MUL:
  POP t0
  POP t1
  mul t0, t0, t1
  PUSH t0
  j NEXT


  # ( -- n )
LIT:
  lw t0, 0(:IP)
  addi :IP, :IP, 4
  lw t1, 0(:IP)
  PUSH t0
  j NEXT

  # ( n -- )
EMIT:
  POP a0
  jal print_int
  j NEXT

  # ( n -- )
EMIT_CHAR:
  POP a0
  jal putc
  j NEXT


  # ( value addr -- )
BANG:
  POP t0 # address
  POP t1 # value
  sw t1, 0(t0)
  j NEXT


  # ( -- )
BYE:
  j interpreter_done

  # ( -- )
CR:
  li a0, '\n'
  jal putc
  j NEXT

  # ( len addr -- n )
ATOI:
  POP a0 # address
  POP a1 # length
  jal atoi
  PUSH a0
  j NEXT

  # ( len addr -- f )
IS_NUMBER:
  POP a0 # address
  POP a1 # length
  jal is_number
  PUSH a0
  j NEXT

  # ( a -- a a )
DUP:
  POP t0
  PUSH t0
  PUSH t0
  j NEXT

  # ( a b -- b a )
SWAP:
  POP t0 # b
  POP t1 # a
  PUSH t0
  PUSH t1
  j NEXT

  # ( a -- )
DROP:
  POP zero
  j NEXT

  # ( a b -- )
TWODROP:
  POP zero
  POP zero
  j NEXT

  # ( a b -- a b a b )
TWODUP:
  POP t0 # b
  POP t1 # a
  PUSH t1 # a
  PUSH t0 # b
  PUSH t1 # a
  PUSH t0 # b
  j NEXT

  # ( n1 n2 -- n1 n2 n1 )
OVER:
  POP t0 # n2
  POP t1 # n1
  PUSH t1 # n1
  PUSH t0 # n2
  PUSH t1 # n1
  j NEXT


  # (x1 x2 x3 x4 -- x3 x4 x1 x2)
TWOSWAP:
  POP t0 # x4
  POP t1 # x3
  POP t2 # x2
  POP t3 # x1

  PUSH t1
  PUSH t0
  PUSH t3
  PUSH t2
  j NEXT

  # (x1 x2 x3 -- x2 x3 x1 )
ROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t1 # x2
  PUSH t0 # x3
  PUSH t2 # x1
  j NEXT

  # (x1 x2 x3 -- x3 x1 x2)
NROT:
  POP t0 # x3
  POP t1 # x2
  POP t2 # x1
  PUSH t0 # x3
  PUSH t2 # x1
  PUSH t1 # x2
  j NEXT

  # ( a b -- f)
EQUAL:
  POP t0
  POP t1
  beq t0, t1, .L_equal
  li t0, 0
  PUSH t0
  j NEXT
.L_equal:
  li t0, -1
  PUSH t0
  j NEXT

  # ( len1 addr1 len2 addr2 -- flag)
MEMCMP:
  POP a2
  POP a3
  POP a0
  POP a1
  call memcmp
  PUSH a0
  j NEXT

  # ( f -- )
BRANCH_ON_ZERO:
  POP t0
  beqz t0, .L_do_branch
  addi :IP, :IP, 4
  j NEXT
.L_do_branch:
  lw :IP, 0(:IP)
  j NEXT

  # ( -- )
JUMP:
  lw :IP, 0(:IP)
  j NEXT
  
  # just a debug function to print the whole stack
  # print debugging.. some people hate it some people love it
  # i both hate it and love it
DEBUG_STACK:
  addi sp, sp, -12
  sw ra, 0(sp)
  sw s8, 4(sp)

  li a0, '<'
  call putc
  li a0, '>'
  call putc
  li a0, ' '
  call putc

  mv s8, :END_SP
  add s8, s8, -4
.L_debug_stack_loop:
  blt s8, :SP, .L_debug_stack_loop_end
  lw a0, 0(s8)
  call print_unsigned_hex
  li a0, ' '
  call putc
  addi s8, s8, -4
  j .L_debug_stack_loop

.L_debug_stack_loop_end:
  li a0, '\n'
  call putc
  lw ra, 0(sp)
  lw s8, 4(sp)
  addi sp, sp, 12
  j NEXT


do_parse_token:
  addi sp, sp, -4
  sw ra,  0(sp)

  mv a0, :TOK_POS
  mv a1, :TOK_LEN
  jal token # parse the token
  mv :TOK_POS, a0
  mv :TOK_LEN, a1
  lw ra,  0(sp)
  addi sp, sp, 4
  # return a0 a1 from token
  ret


  # Input:
  #   a0: token address
  #   a1: token length
  # Output:
  #   a0: execution token address (or 0 if not found)
do_find:
  li t1, 0
  mv t3, a1

  # The shananigans here are so we can build little endian version of the token
  # in 4 bytes dont be intimidated by them, I just made the tokens in the
  # dictionary as "bye\0" instead of "\0eyb" to be easier to read
  beqz t3, .L_not_found # zero length token
  lbu t1, 0(a0)
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 1(a0)
  sll t2,t2, 8
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_find_start
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2


  # t1: has the input token as 4 byte number
  # a1: is the length of the input token
  # t0: pointer to the entry, we will start at the end
.L_find_start:
  mv t0, :LATEST

.L_find_loop:
  beqz t0, .L_not_found     # if the entry is 0, means we didnt find a match
  lw t2, 4(t0)              # load the length of the entry
  bne t2, a1, .L_next_entry # compare lengths
  lw t2, 8(t0)              # load entry name
  bne t2, t1, .L_next_entry # compare names
  add a0, t0, 16            # return the code address
  ret

.L_next_entry:
  lw t0, 0(t0)              # follow link to next entry
  j .L_find_loop

.L_not_found:
  li a0, 0                  # return 0 for not found
  ret

DOCOL:
  EPUSH :IP
  mv :IP, :XT
  j NEXT

EXIT:
  EPOP :IP
  j NEXT

COLON:
  li :MODE, -1 # enter compile mode

  jal do_create

  # we want to achieve this, creating a new word
  #
  # : square dup * ;
  #
  # ...
  #               DOCOL:
  # 80000534:      RPUSH :IP    <-----------------.
  # 80000538:                                     |
  # 8000053c:      mv :IP, :XT                    |
  # 80000540:      j NEXT                         |
  # ...                                           |
  # 80000148 <DUP>:                               |
  # 80000148:      lw t0, 0(:SP)                  |
  # 8000014c:      PUSH t0                        |
  # ...                                           |
  # 80000880: w_square:                           |
  # 80000880:      80000..# link                  |
  # 80000884:      6      # size                  |
  # 80000888:      "squa" # token                 |
  # 8000088c:      0      # flags                 |
  # 80000890:      80000894 # CODE FIELD >--------|---.
  # 80000894:      lui :XT, 0x80001  >---.        | <-'
  # 80000898:      addi :XT, :XT, 0x8a8 >--.      |
  # 8000089c:      lui t0, 0x80000 >---.   |      |
  # 800008a0:      addi t0, t0, 0x534 >----|------'
  # 800008a4:      jr t0                   |
  # 800008a8:      80000148 # DUP <--------'
  # 800008ac:      80000... # MUL
  # 800008b0:      80000... # EXIT
  # ...

  # 1. EXECUTION CODE FIELD point to HERE + 4, where we will
  # put the machine code: memory[HERE] = HERE+4
  add t0, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4


  # 2. Generate absolute address for where we want DOCOL to jump, in our case we want HERE+20
  mv t0, :HERE
  addi t0, t0, 20

  # 3. Generate the machine code
  # li :XT, value of :HERE + 20
  # la t0, DOCOL
  # jr t0

  # and expanded
  # lui :XT, value << 12
  # addi :XT, :XT, value << 20 >> 20
  # lui t0, value << 12
  # addi t0, t0, value << 20 >> 20
  # jr t0


  # 3.1 Generate machine code for XT = HERE + 20 at time of compilation
  li a0, 21 # XT is s5, which is register x21
  mv a1, t0
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4


  # 3.1 Generate machine code for la t0, DOCOL
  li a0, 5 # t0 is x5
  la a1, DOCOL
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4


  # 3.2 Generate machine code for jr t0
  li a0, 5 # t0 is x5
  jal do_jr
  sw a0, 0(:HERE) # jr
  addi :HERE, :HERE, 4

  j NEXT


CREATE:
  jal do_create

  # point the execution token to the machine code
  addi t0, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4

  # create foo
  #
  # 80000880: w_foo:                            
  # 80000880:       80000..# link               
  # 80000884:       3      # size               
  # 80000888:       "foo\0"# token              
  # 8000088c:       0      # flags              
  # 80000890:       80000894 # CODE FIELD >--------.
  # 80000894:       lui t0, HIGH(HERE+24)  <-------'   >-.
  # 80000898:       addi t0, t0, LOW(HERE+24) >-----------.     
  # 8000089c:       addi SP, SP, -4                       |
  # 800008a0:       sw t0, 0(SP)                          |
  # 800008a4:       lui t0, HIGH(NEXT)                    |
  # 800008a8:       addi t0, t0, LOW(NEXT)                |       
  # 800008ac:       jr t0                                 |
  # 800008b0:       <data field...>    <------------------'

  # li t0, :HERE
  # addi :SP, :SP, -4
  # sw t0, 0(SP)
  # la t0, NEXT
  # jr t0

  addi t1, :HERE, 28 # HERE + 28

  # li t0, value of :HERE + 28
  li a0, 5  # t0 is x5
  mv a1, t1 # HERE + 28
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4
  
  # addi :SP, :SP, -4
  li a0, 9  # :SP is s1, x9
  li a1, 9  # :SP is s1, x9
  li a2, -4 
  call do_addi
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4

  # sw t0, 0(:SP)
  li a0, 9  # :SP is s1, x9
  li a1, 5  # t0 is x5
  call do_sw
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4

  # la t0, NEXT
  li a0, 5 # t0 is x5
  la a1, NEXT
  jal do_li
  sw a0, 0(:HERE) # lui
  addi :HERE, :HERE, 4
  sw a1, 0(:HERE) # addi
  addi :HERE, :HERE, 4

  # jr t0
  li a0, 5 # t0 is x5
  jal do_jr
  sw a0, 0(:HERE) # jr
  addi :HERE, :HERE, 4

  j NEXT

  # ( -- )
SEMICOLON:
  mv :MODE, zero # exit compile mode
  la t0, EXIT
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT

  # ( x -- )
COMMA:
  POP t0
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT

  # ( -- flag )
MODE:
  PUSH :MODE
  j NEXT

do_create:
  addi sp, sp, -4
  sw ra, 0(sp)

  add :TOK_POS, :TOK_POS, :TOK_LEN
  jal do_parse_token
  beqz a1, .L_create_error

  # align to closest multiple of 4
  addi t0, :HERE, 3    
  li t1, -4            
  and :HERE, t0, t1

  # link field (4 bytes)
  sw :LATEST, 0(:HERE)

  # length field (4 bytes)
  sw a1, 4(:HERE)

  # token field (4 bytes)
  li t1, 0
  mv t3, a1  # Initialize t3 with token length

.L_create_build_token:
  lbu t1, 0(a0)
  addi t3, t3, -1

  beqz t3, .L_create_write_token
  lbu t2, 1(a0)
  sll t2, t2, 8
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_create_write_token
  lbu t2, 2(a0)
  sll t2, t2, 16
  or t1, t1, t2
  addi t3, t3, -1

  beqz t3, .L_create_write_token
  lbu t2, 3(a0)
  sll t2, t2, 24
  or t1, t1, t2

.L_create_write_token:
  sw t1, 8(:HERE)

  # flags field
  sw zero, 12(:HERE)

  # move the dictionary end
  mv :LATEST, :HERE

  # update HERE to point to the end of the word
  addi :HERE, :HERE, 16

  lw ra, 0(sp)
  addi sp, sp, 4
  ret

.L_create_error:
  la a0, err_create_error
  j panic

panic:
  jal puts
  jal getch
  j interpreter_done

  # ( -- )
IMMEDIATE:
  li t1, 1
  sw t1, 12(:LATEST) # flag value
  j NEXT


  # ( addr -- value )
AT:
  POP t0
  lw t0, 0(t0)
  PUSH t0
  j NEXT

  # ( addr -- value )
C_AT:
  POP t0
  lbu t0, 0(t0)
  PUSH t0
  j NEXT

  # ( value addr -- )
C_BANG:
  POP t0 # address
  POP t1 # value
  sb t1, 0(t0)
  j NEXT

  # ( -- c )
KEY:
  jal getch
  PUSH a0
  j NEXT


  # ( -- addr )
PUSH_HERE:
  PUSH :HERE
  j NEXT

  # ( reg imm -- lui addi )
LI:
  POP a1   # imm
  POP a0   # reg

  call do_li

  PUSH a0 # lui
  PUSH a1 # addi

  j NEXT

  # JR ( reg -- opcode_jr )
JR:
  POP a0
  call do_jr
  PUSH a0
  j NEXT

  # ( x -- ) (R: -- x)
TO_R:
  POP t0
  RPUSH t0
  j NEXT

  # ( -- x ) (R: x -- )
FROM_R:
  RPOP t0
  PUSH t0
  j NEXT

  # ( -- x ) (R: x -- x)
R_FETCH:
  lw t0, 0(:RSP)
  PUSH t0
  j NEXT


  # ( x -- ) (CF: -- x)
TO_CF:
  POP t0
  CFPUSH t0
  j NEXT

  # ( -- x ) (CF: x -- )
FROM_CF:
  CFPOP t0
  PUSH t0
  j NEXT

  # ( -- x ) (CF: x -- x)
CF_FETCH:
  lw t0, 0(:CSP)
  PUSH t0
  j NEXT

POSTPONE:
  add :TOK_POS, :TOK_POS, :TOK_LEN
  jal do_parse_token 
  jal do_find        
  beqz a0, .L_word_not_found  

  la t1, LIT
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4

  lw a0, 0(a0) # dereference
  sw a0, 0(:HERE)
  addi :HERE, :HERE, 4
  
  la t1, COMMA 
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4
  
  j NEXT

.L_word_not_found:
  la a0, err_word_not_found
  j panic


  # ( n -- )
LITERAL:
  POP t0
  la t1, LIT
  sw t1, 0(:HERE)
  addi :HERE, :HERE, 4
  sw t0, 0(:HERE)
  addi :HERE, :HERE, 4
  j NEXT

  # stop compilation
  # ( -- )
LEFT_BRACKET:
  mv :MODE, zero
  j NEXT

  # start compilation
  # ( -- )
RIGHT_BRACKET:
  li t1, 1
  mv :MODE, t1
  j NEXT

  # ( a b -- f)
GREATHER:
  POP t0
  POP t1
  bgt t0, t1, .L_greather
  li t0, 0
  PUSH t0
  j NEXT
.L_greather:
  li t0, -1
  PUSH t0
  j NEXT


  # ( len addr -- )
DUMP:
  POP s9  # addr
  POP s10 # len

1:  
  beqz s10, NEXT

  mv a0, s9
  jal print_unsigned_hex
  li a0, ':'
  jal putc
  li a0, ' '
  jal putc
  lw a0, 0(s9)
  jal print_unsigned_hex

  li a0, ' '
  jal putc

  lb a0, 0(s9)
  jal putc_ascii

  lb a0, 1(s9)
  jal putc_ascii

  lb a0, 2(s9)
  jal putc_ascii

  lb a0, 3(s9)
  jal putc_ascii

  li a0, '\n'
  jal putc
  addi s9, s9, 4
  addi s10, s10, -1
  j 1b

putc_ascii:
  addi sp, sp, -4
  sw ra, 0(sp)

  li t0, 32
  blt a0, t0, .L_putc_dot
  li t0, 127
  bge a0, t0, .L_putc_dot
  j .L_putc_char
  
.L_putc_dot:  
  li a0, '.'

.L_putc_char:
  call putc

  lw ra, 0(sp)
  addi sp, sp, 4
  ret

  # ( limit index -- R: limit index)
DO_DO:
  POP t0            # t0 = index
  POP t1            # t1 = limit
  RPUSH t1          # limit
  RPUSH t0          # index
  j NEXT

  # ( R: limit index -- R: limit index ) ; (L: -- ) when done
DO_LOOP:
  RPOP t0    # pop index
  RPOP t1    # pop limit
  addi t0, t0, 1
  blt t0, t1, .L_do_loop_jump # if limit < index

  # skip over the jump address
  addi :IP, :IP, 4
  j NEXT
.L_do_loop_jump:
  # push them back on Rstack if still looping
  RPUSH t1   # push limit
  RPUSH t0   # push index
  # read the jump address from IP (the next cell in the thread)
  lw :IP, 0(:IP)
  j NEXT

  # ( -- ) (R: limit index -- ) (L: addr -- )
UNLOOP:
  RPOP zero   # index
  RPOP zero   # limit
  j NEXT

LOOP_I:
  lw t0, 0(:RSP)
  PUSH t0
  j NEXT

LOOP_J:
  lw t0, 8(:RSP)
  PUSH t0
  j NEXT

LOOP_K:
  lw t0, 16(:RSP)
  PUSH t0
  j NEXT

ALLOT:
  POP t0
  mv a0, t0
  add :HERE, :HERE, t0
  j NEXT

# ( x1 x2 -- flag )
AND:
  POP t0
  POP t1
  # Check if either value is zero
  beqz t0, .L_false
  beqz t1, .L_false
  # Both non-zero, return TRUE (-1)
  li t0, -1
  PUSH t0
  j NEXT
.L_false:
  # At least one zero, return FALSE (0)
  mv t0, zero
  PUSH t0
  j NEXT

WFI:
  wfi
  j NEXT

  # ( addr -- )
EVAL:
  add sp, sp, -36
  sw ra, 0(sp)
  sw :TOK_POS, 4(sp)
  sw :TOK_LEN, 8(sp)  
  sw :IP, 12(sp)
  sw :XT, 16(sp)
  sw :MODE, 20(sp)
  sw :CSP, 24(sp)
  sw :RSP, 28(sp)
  sw :ESP, 32(sp)

  POP :TOK_POS
  mv :TOK_LEN, zero

  jal interpreter_eval

  lw ra, 0(sp)
  lw :TOK_POS, 4(sp)
  lw :TOK_LEN, 8(sp)
  lw :IP, 12(sp)
  lw :XT, 16(sp)
  lw :MODE, 20(sp)
  lw :CSP, 24(sp)
  lw :RSP, 28(sp)
  lw :ESP, 32(sp)

  add sp, sp, 36

  j NEXT

BUMP:
  la t0, bump
  lw t1, 0(t0)
  addi t1, t1, 1
  PUSH t1
  sw t1, 0(t0)
  j NEXT

dictionary:
word_bye:
  .word 0         # link
  .word 3         # token length
  .ascii "bye\0"  # first 4 characters of token
  .word 0         # flags
  .word BYE       # address of execution token

word_plus:
  .word word_bye
  .word 1
  .ascii "+\0\0\0"
  .word 0
  .word PLUS

word_minus:
  .word word_plus
  .word 1
  .ascii "-\0\0\0"
  .word 0
  .word MINUS

word_mul:
  .word word_minus
  .word 1
  .ascii "*\0\0\0"
  .word 0
  .word MUL

word_bang:
  .word word_mul
  .word 1
  .ascii "!\0\0\0"
  .word 0
  .word BANG

word_at:
  .word word_bang
  .word 1
  .ascii "@\0\0\0"
  .word 0
  .word AT

word_dup:
  .word word_at
  .word 3
  .ascii "dup\0"
  .word 0
  .word DUP

word_emit:
  .word word_dup
  .word 1
  .ascii ".\0\0\0"
  .word 0
  .word EMIT

word_cr:
  .word word_emit
  .word 2
  .ascii "cr\0\0"
  .word 0
  .word CR

word_debug_stack:
  .word word_cr
  .word 2
  .ascii ".s\0\0"
  .word 0
  .word DEBUG_STACK

word_debug_stack_compile:
  .word word_debug_stack
  .word 4
  .ascii "[.s]"
  .word 1
  .word DEBUG_STACK

word_colon:
  .word word_debug_stack_compile
  .word 1
  .ascii ":\0\0\0"
  .word 0
  .word COLON

word_semicolon:
  .word word_colon
  .word 1
  .ascii ";\0\0\0"
  .word 1 # immediate
  .word SEMICOLON

word_li:
  .word word_semicolon
  .word 2
  .ascii "li\0\0"
  .word 0
  .word LI

word_jr:
  .word word_li
  .word 2
  .ascii "jr\0\0"
  .word 0
  .word JR

word_key:
  .word word_jr
  .word 3
  .ascii "key\0"
  .word 0
  .word KEY

word_here:
  .word word_key
  .word 4
  .ascii "here"
  .word 0
  .word PUSH_HERE

word_comma:
  .word word_here
  .word 1
  .ascii ",\0\0\0"
  .word 0
  .word COMMA

word_create:
  .word word_comma
  .word 6
  .ascii "crea"
  .word 0
  .word CREATE

word_branch0:
  .word word_create
  .word 7
  .ascii "0bra"
  .word 0
  .word BRANCH_ON_ZERO

word_jump:
  .word word_branch0
  .word 4
  .ascii "jump"
  .word 0
  .word JUMP

word_literal:
  .word word_jump
  .word 7
  .ascii "lite"
  .word 1 # immediate
  .word LITERAL

word_to_r:
  .word word_literal
  .word 2
  .ascii ">r\0\0"
  .word 0
  .word TO_R

word_from_r:
  .word word_to_r
  .word 2
  .ascii "r>\0\0"
  .word 0
  .word FROM_R

word_r_fetch:
  .word word_from_r
  .word 2
  .ascii "r@\0\0"
  .word 0
  .word R_FETCH


word_to_cf:
  .word word_r_fetch
  .word 3
  .ascii ">cf\0"
  .word 0
  .word TO_CF

word_from_cf:
  .word word_to_cf
  .word 3
  .ascii "cf>\0"
  .word 0
  .word FROM_CF

word_cf_fetch:
  .word word_from_cf
  .word 3
  .ascii "cf@\0"
  .word 0
  .word CF_FETCH

word_immediate:
  .word word_cf_fetch
  .word 9
  .ascii "imme"
  .word 0
  .word IMMEDIATE

word_postpone:
  .word word_immediate
  .word 8
  .ascii "post"
  .word 1 # immediate
  .word POSTPONE

word_drop:
  .word word_postpone
  .word 4
  .ascii "drop"
  .word 0
  .word DROP

word_dump:
  .word word_drop
  .word 4       
  .ascii "dump" 
  .word 0       
  .word DUMP    

word_left_bracket:
  .word word_dump
  .word 1       
  .ascii "[\0\0\0" 
  .word 1       
  .word LEFT_BRACKET

word_right_bracket:
  .word word_left_bracket
  .word 1       
  .ascii "]\0\0\0" 
  .word 1       
  .word RIGHT_BRACKET

word_swap:
  .word word_right_bracket
  .word 4       
  .ascii "swap" 
  .word 0       
  .word SWAP

word_equal:
  .word word_swap
  .word 1       
  .ascii "=\0\0\0" 
  .word 0       
  .word EQUAL

word_greather:
  .word word_equal
  .word 1       
  .ascii ">\0\0\0" 
  .word 0       
  .word GREATHER

word_do_do:
  .word word_greather      
  .word 4                  
  .ascii "(do)"            
  .word 0                  
  .word DO_DO              

word_do_loop:
  .word word_do_do
  .word 6
  .ascii "(loo"
  .word 0
  .word DO_LOOP

word_unloop:
  .word word_do_loop
  .word 6
  .ascii "unlo"
  .word 0
  .word UNLOOP

word_i:
  .word word_unloop
  .word 1
  .ascii "i\0\0\0"
  .word 0
  .word LOOP_I

word_j:
  .word word_i
  .word 1
  .ascii "j\0\0\0"
  .word 0
  .word LOOP_J


word_k:
  .word word_j
  .word 1
  .ascii "k\0\0\0"
  .word 0
  .word LOOP_K

word_allot:
  .word word_k
  .word 5
  .ascii "allo"
  .word 0
  .word ALLOT

word_exit:
  .word word_allot
  .word 4
  .ascii "exit"
  .word 0
  .word EXIT

word_c_bang:
  .word word_exit
  .word 2
  .ascii "c!\0\0"
  .word 0
  .word C_BANG

word_c_at:
  .word word_c_bang
  .word 2
  .ascii "c@\0\0"
  .word 0
  .word C_AT

word_emit_char:
  .word word_c_at
  .word 4
  .ascii "emit"
  .word 0
  .word EMIT_CHAR

word_rot:
  .word word_emit_char
  .word 3
  .ascii "rot\0"
  .word 0
  .word ROT

word_over:
  .word word_rot
  .word 4
  .ascii "over"
  .word 0
  .word OVER

word_and:
  .word word_over
  .word 3
  .ascii "and\0"
  .word 0
  .word AND

word_qexit:
  .word word_and
  .word 5
  .ascii "qexi"
  .word 0
  .word qemu_exit

word_wfi:
  .word word_qexit
  .word 3
  .ascii "wfi\0"
  .word 0
  .word WFI

word_eval:
  .word word_wfi
  .word 4
  .ascii "eval"
  .word 0
  .word EVAL

word_bump:
  .word word_eval
  .word 4
  .ascii "bump"
  .word 0
  .word BUMP
dictionary_end:

bump:
  .word 0

err_create_error:
  .asciz "\nerror: create missing name, usage: create [name]\n"

err_word_not_found:
  .asciz ": word not found\n"

  .align 2
  # our actual human readable program

human_program_small:
  .asciz "
: begin 
   here
   >cf
; immediate

: again 
   postpone jump 
   cf>
  ,
; immediate

: if
   postpone 0branch
   here
   0
   ,
   >cf
; immediate  

: then
   here       
   cf>        
   !
; immediate  


: nothing
  begin
    1 +
    dup 1000000 = if
      97 emit
      drop
      1
    then 
  again
;

1 nothing
  "
  
human_program:
  .asciz "

: begin 
   here
   >cf
; immediate

: again 
   postpone jump 
   cf>
  ,
; immediate

: until
   postpone 0branch
   cf>             
   ,               
; immediate

: if
   postpone 0branch
   here
   0
   ,
   >cf
; immediate  

: then
   here       
   cf>        
   !
; immediate  

: else
   postpone jump    
   here             
   0                
   ,                
   here            
   cf>             
   !               
   >cf             
; immediate

: loop
  postpone (loop)
  cf>
  ,
; immediate

: do
  postpone (do)
  here 
  >cf
; immediate

create buffer 200 allot 

: buffer[] buffer + ;

: zero 200 0 do 0 i buffer[] c! loop ;
: print 200 0 do i buffer[] dup 0 = if exit then c@ emit loop ;


: input
  bump .
  62 emit
  32 emit
  200 0 do
    key dup 13 = if
      drop
      -1
      unloop exit
    else
      dup 27 = if
        drop 
        0 
        unloop exit
      then
      dup emit
      i buffer[] c!
    then
  loop
;

: repl
  begin
    zero
    input
    0 = if drop unloop exit then
    cr
    buffer
    eval
    cr
  again
;

repl
qexit  
"
  .end

We have two interpreters, one loads the human_program_small program, which just prints the letter a from time to time, the other loads human_program which is our repl where you can write forth code and execute it.

One thing to note is that you see there is zero restriction or protection for one task to completely corrupt the other task's memory. In modern hardware and operating systems there is massive amount of work to make that impossible, or at least harder, and in fact it is an active area of research. We will briefly discuss it in the Security chapter.

Now what we are missing is loading and saving files, and communicating with other computers. I think for those we have to go to a real computer, because qemu's disk and networking are too involved.

It is important to remember there is very little difference between any of those systems, be it showing pixels on screen or writing to disk or receiving network packets, its all the same, there are protocols that you need to comply to, write to some registers to control the hardware in a specific way, and then send or read data. Some times the hardware will trigger an interrupt, sometimes you will have to poll and check if there is data. For example the USB keyboard interface never pushes data, when I press a key on the keyboard it gets into an internal buffer in the keyboard itself, and the operating system will ask the keyboard every 5 millisecons: give me the pressed keys. In the past we had PS2 keyboars which triggered interrupts, so when you press a key, the operating system's interrupt handler is called and your press is handled immediately. 5 milliseconds is not a lot of time, its 200 times per second, so you wont notice, but you see it is fundamental difference of control. The keyboard has to be more complicated now, because it must have a buffer of pressed keys, otherwise with PS2 you just send as soon as the key is pressed.

PULL versus PUSH is the most fundamental property of system interractions, and multiple books can be written to examine the emergent properties of systems depending on how they use pull/push between their components, where are their buffers and how are they drained.

This will conclude this chapter, and we will continue in the next one where we move to a real Raspberry Pi Pico 2350 computer, it has two ARM cores and two RISCV cores, you can use either one, and we will use the RISCV cores, I have not said much about Symmetric Multi Processing, where we have multiple CPU cores that access the same memory and hardware, and the challenges it has, and for now we will just use 1 of the cores, and pretend its only 1, to keep thing simpler. It is hard enough to have multiple programs using the same CPU to communicate, e.g. pressing a character gets it in a buffer where multiple programs might read it, but with 1 CPU, only one program runs at a time, which makes things considerably easier. Even ths simplest things are difficult when two entities are using the same resources, imagine a deck of cards, and two people reach to get the top card in the very same time, you remember how in our 74LS chips there are nanosecond intervals in which the voltage is read, well.. thats quite a lot of time for things to go wrong. A whole new set of primitives has to be used to guarantee order of operations, we won't get into it for now, but if you are interested search for "SMP" or "Symmetric MultiProcessing".

When you are exploring those concepts, use chatgpt or some other AI to help you, you can trully ask it the silliest questions, even though it might be wrong, and mislead you, just think about the wires, and listen to your doubt. Look at the source code of the Linux kernel version 1, paste it in chatgpt and ask it how it works. The recent kernel versions are very complicated as they must support tens of thousands of different devices and systems and subsystems, but you can also ask chatgpt about it. Look into preemtive scheduling, cooperative scheduling, fair scheduling, real time operating systems, processess and threads, push and pull, the dining philosophers problem, the barber problem, the byzantine generals problem. clockless computers, tickless kernels, micro kernels, synchronization primitives, semaphores, spinlocks, mutexes and futexes. Don't stress, it seems like a lot, but everything is similar, you just have to understand the reason for its existence.