Forth

Forth or FORTH is a stack based programming language, made in the 70s by Chuck Moore, it is incredibly compact and expressive language, but most of all, it is beautiful and elegant. And we must always strive towards beauty.

Stack

A stack is an abstract thing, it is pretty much what you are thinking of when you see the word stack, it is a bunch of things on top of each other, like a deck of cards. You can add(push) one more card on top, or you can take (pop) the top card in your hand. Those two operations define a stack. We call them push and pop, instead add and take. But we can have a stack of cards, or a stack of books, or a stack of pancakes. For all of them you can do push and pop, you can add one more pancake on top, and take the top pancake.

Anything that can do push and pop efficiently can be used as a stack, so when you think of what you can do with stack, this thing can do it. When you add a pancake on top of the stack of pancakes, it takes no time, you just add it on top, you dont have to do anything else. When you take the top one its the same, no other work, just take it. Imagine however you want to take the middle pancake, then you have a problem, you have to move multiple pancakes from the top, take the pancake, then put them back.

This is what defines the stack abstract datastructure. A data structure is just a way to organize data so that we can access it and modify it, each data structure have different properties, like the stack makes it easy to push and pop, but had to modify the middle, it is also difficult to lookup values in it, e.g. if you want to know if the value 3 exists in a stack of numbers, you have to go through the whole stack to check one by one. There are others, where its to lookup, like sets, but its hard to have a concept of 'top'. Some make it easy to add, some easy to delete, otheres easy to scan, or to seach and so on.

The data structures are more general than computers, you can see them in nature, self organizing trees, like if you have seen ducks flying together, they form this V shape. Or self sorting organizations like the cells in our bodies. In our computers however they must live in memory, the same electrons and flipflops you already know about.

Our addressable memory allows us to implement a stack in a very very efficient way. It almost comes out for free. We just keep track of where the top of the stack is, lets say our stack will work with just 4 byte values, then a push would mean memory[top] = value; top += 4 then a pop would be top -= 4; value = memory[top], thats pretty much it. top is just a variable which we can store at some memory address, or we can keep it in a special purpose register.

When I talk about memory I always imagine this

address | value
      0 | 0
      4 | 0
      8 | 0
     12 | 0
     16 | 0
     20 | 0
     24 | 0
     28 | 0
     32 | 0
     36 | 0
     40 | 0
     44 | 0
     48 | 0
     52 | 0
     58 | 0
     62 | 0
       ...

Now when I think of a variable, lets say in our case the variable top, I just imagine it at some random address, in our case address 248, and we want our stack to start at address 256 (again, just a number I picked). So you see the value at address 248 is 256, or top = 256

address | value
       ...
    240 | 0
    244 | 0
    248 | 256 <-- top
    252 | 0
    256 | 0
    260 | 0
    264 | 0
    268 | 0
    272 | 0
    276 | 0
    280 | 0
       ...

Lets push the value 3 to the stack, first we will do memory[top] = 3

address | value
       ...
    240 | 0
    244 | 0
    248 | 256 <-- top
    252 | 0
    256 | 3   <-- memory[top] = 3
    260 | 0
    264 | 0
    268 | 0
    272 | 0
    276 | 0
    280 | 0
       ...

Then we want to move the top of the stack by doing top += 4, and 256 + 4 is 260

address | value
       ...
    240 | 0
    244 | 0
    248 | 260 <-- top
    252 | 0
    256 | 3   <-- memory[top] = 3
    260 | 0
    264 | 0
    268 | 0
    272 | 0
    276 | 0
    280 | 0
       ...

Lets push few more values, 3 5 6, which will get our top to 272:

address | value
       ...
    240 | 0
    244 | 0
    248 | 272 <-- top
    252 | 0
    256 | 3
    260 | 4
    264 | 5
    268 | 6
    272 | 0
    276 | 0
    280 | 0
       ...

Now lets do a pop, and lets store the result in some variable, we will call it v (people are quite upset when single character variable names are used, but, they dont mind when i is used).

First we do top -= 4, 272 - 4 is 268

address | value
       ...
    240 | 0   <-- v (just a random address I picked)
    244 | 0
    248 | 268 <-- top
    252 | 0
    256 | 3
    260 | 4
    264 | 5
    268 | 6
    272 | 0
    276 | 0
    280 | 0
       ...

Then we do v = memory[top]

address | value
       ...
    240 | 6   <-- v 
    244 | 0
    248 | 268 <-- top
    252 | 0
    256 | 3
    260 | 4
    264 | 5
    268 | 6
    272 | 0
    276 | 0
    280 | 0
       ...

Lets pop again, and again into v

address | value
       ...
    240 | 5   <-- v 
    244 | 0
    248 | 264 <-- top
    252 | 0
    256 | 3
    260 | 4
    264 | 5
    268 | 6
    272 | 0
    276 | 0
    280 | 0
       ...

Thats it. we did push and pop, we have a stack, you see because of the way our digital computer with addressable memory works, the operations are really fast, since we know the address of top and the address of v we can update and read them. top is a normal 4 byte integer, but you can see we use it to lookup another address memory[top] this is called dereferencing because top is actually a pointer to the actual place we are interested in.

Lets implement push and pop examples in RISCV assembly, and we will discuss it line by line. It will seem frightening all at once, but remember that nothing is as complicated as water.

addi x6, x0, 256 # x6 = 256
addi x5, x0, 248 # x5 = 248 (top)
sw x6, 0(x5)     # memory[x5] = x6

addi x5, x0, 240 # x5 = 240 (v)
sw x0, 0(x5)     # memory[x5] = 0


jal x1, push_3
jal x1, push_4
jal x1, push_5
jal x1, pop_into_v

end:
    jal x0, end

push_3:

    # memory[top] = 3
    addi x5, x0, 248 # x5 = 248 (top)
    lw x5, 0(x5)     # x5 = memory[x5]
    addi x6, x0, 3   # x6 = 3
    sw x6, 0(x5)     # memory[x5] = x6


    # top += 4
    addi x5, x0, 248 # x5 = 248 (top)
    lw x6, 0(x5)     # x6 = memory[x5]
    addi x6, x6, 4   # x6 += 4
    sw x6, 0(x5)     # memory[x5] = x6

    jalr x0, 0(x1)


push_4:

    # memory[top] = 4
    addi x5, x0, 248 # x5 = 248 (top)
    lw x5, 0(x5)     # x5 = memory[x5]
    addi x6, x0, 4   # x6 = 4
    sw x6, 0(x5)     # memory[x5] = x6


    # top += 4
    addi x5, x0, 248 # x5 = 248 (top)
    lw x6, 0(x5)     # x6 = memory[x5]
    addi x6, x6, 4   # x6 += 4
    sw x6, 0(x5)     # memory[x5] = x6

    jalr x0, 0(x1)

push_5:

    # memory[top] = 5
    addi x5, x0, 248 # x5 = 248 (top)
    lw x5, 0(x5)     # x5 = memory[x5]
    addi x6, x0, 5   # x6 = 5
    sw x6, 0(x5)     # memory[x5] = x6


    # top += 4
    addi x5, x0, 248 # x5 = 248 (top)
    lw x6, 0(x5)     # x6 = memory[x5]
    addi x6, x6, 4   # x6 += 4
    sw x6, 0(x5)     # memory[x5] = x6

    jalr x0, 0(x1)

pop_into_v:

    # top -= 4
    addi x5, x0, 248 # x5 = 248 (top)
    lw x6, 0(x5)     # x6 = memory[x5]
    addi x6, x6, -4  # x6 -= 4
    sw x6, 0(x5)     # memory[x5] = x6

    # v = memory[top]
    addi x5, x0, 248 # x5 = 248 (top)
    lw x5, 0(x5)     # x5 = memory[x5]
    addi x6, x0, 240 # x6 = 240 (v)
    lw x5, 0(x5)     # x5 = memory[x5]
    sw x5, 0(x6)     # memory[x6] = x5

    jalr x0, 0(x1)

Everything after # is a comment, the assembler just ignores it.

We made few subroutines: push_3, push_4, push_5, pop_into_v, a subroutine is just a bunch of reusable code we can jump to, Lets say our assembler prepares our program to be executed at address 0, this is the machine code produced:0x10000313 0x0f800293 0x0062a023 0x0f000293 0x0002a023 0x014000ef 0x034000ef 0x054000ef 0x074000ef 0x0000006f 0x0f800293 0x0002a283 0x00300313 0x0062a023 0x0f800293 0x0002a303 0x00430313 0x0062a023 0x00008067 0x0f800293 0x0002a283 0x00400313 0x0062a023 0x0f800293 0x0002a303 0x00430313 0x0062a023 0x00008067 0x0f800293 0x0002a283 0x00500313 0x0062a023 0x0f800293 0x0002a303 0x00430313 0x0062a023 0x00008067 0x0f800293 0x0002a303 0xffc30313 0x0062a023 0x0f800293 0x0002a283 0x0f000313 0x0002a283 0x00532023 0x00008067.

Quite intense, but each number will map almost exactly to our assembly code.

Zooming into the first instruction addi x6,x0, 256, the instruction is 0x10000313, or in decimal 268436243. in binary: 00010000000000000000001100010011. You can see that it has 3 parameters, x6 (register destination: rd), x0, (register source: rs), 256 the immediate value, and of course the fact that it is the addi instruction. so somehow in the number 268436243 all this information is encoded. I will color code which part of the number is which part of the instruction.

addi x6, x0, 256

00010000000000000000001100010011

From the official documentation you can see how the instruction is defined:

In our example 100000000 is 256, which it is, rs is 0, which is x0, rd is 110, which is x6. So if we change 256 to 3, or 000100000000 to 000000000011, we get the number 00000000001100000000001100010011 or 0x00300313 in hex. And if you look at our program, 0x00300313 is addi x6,x0,3! Success! We can write actual RISCV machine code.

You can imagine how the instruction is decoded, once you know which instruction is about to be executed, then you have special logic to extract the parameters and do the appropriate things, like in our SUBLEQ example.

So addi has only 12 bits for the number you want to use, and the first bit is actually the sign bit, is it + or -, So the biggest number you can addi is 011111111111 or 2047, and the smallest number is 111111111111 or -2048. You can see how addi x6, x6, -4 is translated to the machine code 0xffc30313, when you decode it you see the first 12 bits are 111111111100 which is the Two's complement for -4. In the code below it is shown as addi x6,x6,0xfffffffc, and 0xfffffffc is 32 bit number, but this is just a convention, only 12 bits are actually in the machine code. What do you do then, if you want to set a value to 4828327 for example? You must use 2 instructions to do that, lui Load Upper Immediate, which can put 20 bits in the upper bits of a register, and then do addi for the lower 12 bits. Or you can use a pseudo instruction, meaning we write li x5, 4828327 which the assembler will translate into lui x5, 0x49b; addi x5, x5, 0xca7.

This is the same program but shown which machine code goes to which memory address and also human readable format of the instruction, plus the acutal line in our source code.

Address     Code        Basic                        Line Source

0x00000000  0x10000313  addi x6,x0,0x00000100        1    addi x6, x0, 256 # x6 = 256
0x00000004  0x0f800293  addi x5,x0,0x000000f8        2    addi x5, x0, 248 # x5 = 248 (top)
0x00000008  0x0062a023  sw x6,0(x5)                  3    sw x6, 0(x5)     # memory[x5] = x6
0x0000000c  0x0f000293  addi x5,x0,0x000000f0        5    addi x5, x0, 240 # x5 = 240 (v)
0x00000010  0x0002a023  sw x0,0(x5)                  6    sw x0, 0(x5)     # memory[x5] = 0
0x00000014  0x014000ef  jal x1,0x00000014            9    jal x1, push_3
0x00000018  0x034000ef  jal x1,0x00000034            10   jal x1, push_4
0x0000001c  0x054000ef  jal x1,0x00000054            11   jal x1, push_5
0x00000020  0x074000ef  jal x1,0x00000074            12   jal x1, pop_into_v
0x00000024  0x0000006f  jal x0,0x00000000            15   jal x0, end
0x00000028  0x0f800293  addi x5,x0,0x000000f8        20   addi x5, x0, 248 # x5 = 248 (top)
0x0000002c  0x0002a283  lw x5,0(x5)                  21   lw x5, 0(x5)     # x5 = memory[x5]
0x00000030  0x00300313  addi x6,x0,3                 22   addi x6, x0, 3   # x6 = 3
0x00000034  0x0062a023  sw x6,0(x5)                  23   sw x6, 0(x5)     # memory[x5] = x6
0x00000038  0x0f800293  addi x5,x0,0x000000f8        27   addi x5, x0, 248 # x5 = 248 (top)
0x0000003c  0x0002a303  lw x6,0(x5)                  28   lw x6, 0(x5)     # x6 = memory[x5]
0x00000040  0x00430313  addi x6,x6,4                 29   addi x6, x6, 4   # x6 += 4
0x00000044  0x0062a023  sw x6,0(x5)                  30   sw x6, 0(x5)     # memory[x5] = x6
0x00000048  0x00008067  jalr x0,x1,0                 32   jalr x0, 0(x1)
0x0000004c  0x0f800293  addi x5,x0,0x000000f8        38   addi x5, x0, 248 # x5 = 248 (top)
0x00000050  0x0002a283  lw x5,0(x5)                  39   lw x5, 0(x5)     # x5 = memory[x5]
0x00000054  0x00400313  addi x6,x0,4                 40   addi x6, x0, 4   # x6 = 4
0x00000058  0x0062a023  sw x6,0(x5)                  41   sw x6, 0(x5)     # memory[x5] = x6
0x0000005c  0x0f800293  addi x5,x0,0x000000f8        45   addi x5, x0, 248 # x5 = 248 (top)
0x00000060  0x0002a303  lw x6,0(x5)                  46   lw x6, 0(x5)     # x6 = memory[x5]
0x00000064  0x00430313  addi x6,x6,4                 47   addi x6, x6, 4   # x6 += 4
0x00000068  0x0062a023  sw x6,0(x5)                  48   sw x6, 0(x5)     # memory[x5] = x6
0x0000006c  0x00008067  jalr x0,x1,0                 50   jalr x0, 0(x1)
0x00000070  0x0f800293  addi x5,x0,0x000000f8        55   addi x5, x0, 248 # x5 = 248 (top)
0x00000074  0x0002a283  lw x5,0(x5)                  56   lw x5, 0(x5)     # x5 = memory[x5]
0x00000078  0x00500313  addi x6,x0,5                 57   addi x6, x0, 5   # x6 = 5
0x0000007c  0x0062a023  sw x6,0(x5)                  58   sw x6, 0(x5)     # memory[x5] = x6
0x00000080  0x0f800293  addi x5,x0,0x000000f8        62   addi x5, x0, 248 # x5 = 248 (top)
0x00000084  0x0002a303  lw x6,0(x5)                  63   lw x6, 0(x5)     # x6 = memory[x5]
0x00000088  0x00430313  addi x6,x6,4                 64   addi x6, x6, 4   # x6 += 4
0x0000008c  0x0062a023  sw x6,0(x5)                  65   sw x6, 0(x5)     # memory[x5] = x6
0x00000090  0x00008067  jalr x0,x1,0                 67   jalr x0, 0(x1)
0x00000094  0x0f800293  addi x5,x0,0x000000f8        72   addi x5, x0, 248 # x5 = 248 (top)
0x00000098  0x0002a303  lw x6,0(x5)                  73   lw x6, 0(x5)     # x6 = memory[x5]
0x0000009c  0xffc30313  addi x6,x6,0xfffffffc        74   addi x6, x6, -4  # x6 -= 4
0x000000a0  0x0062a023  sw x6,0(x5)                  75   sw x6, 0(x5)     # memory[x5] = x6
0x000000a4  0x0f800293  addi x5,x0,0x000000f8        78   addi x5, x0, 248 # x5 = 248 (top)
0x000000a8  0x0002a283  lw x5,0(x5)                  79   lw x5, 0(x5)     # x5 = memory[x5]
0x000000ac  0x0f000313  addi x6,x0,0x000000f0        80   addi x6, x0, 240 # x6 = 240 (v)
0x000000b0  0x0002a283  lw x5,0(x5)                  81   lw x5, 0(x5)     # x5 = memory[x5]
0x000000b4  0x00532023  sw x5,0(x6)                  82   sw x5, 0(x6)     # memory[x6] = x5
0x000000b8  0x00008067  jalr x0,x1,0                 84   jalr x0, 0(x1)

Now back to the subroutines, the very interesting calls are jal x1,0x00000014 and jalr x0,x1,0. As I said before, JAL is Jump And Link. It has 2 parameters, register destination rd and immediate value, it stores regurn address pc+4 into rc, pc is the program counter register, and its value is the address of current instruction being executed, which is the jal instruction itself, so pc+4 is the next instruction where we want to come back to in order to continue from where we left of before we jumped into the subroutine. The imediate value is relative offset from pc, once we link, we set pc += immediate value and the next instruction is going to be executed there.

Address     Code        Basic                        Line Source

...
0x00000014  0x014000ef  jal x1,0x00000014            9    jal x1, push_3
0x00000018  0x034000ef  jal x1,0x00000034            10   jal x1, push_4
0x0000001c  0x054000ef  jal x1,0x00000054            11   jal x1, push_5
0x00000020  0x074000ef  jal x1,0x00000074            12   jal x1, pop_into_v
0x00000024  0x0000006f  jal x0,0x00000000            15   jal x0, end
0x00000028  0x0f800293  addi x5,x0,0x000000f8        20   addi x5, x0, 248 # x5 = 248 (top)
...
0x00000048  0x00008067  jalr x0,x1,0                 32   jalr x0, 0(x1)
...

We want to execute the push_3 subroutine, we know it is in address 0x00000028, and we know we are at address 0x00000014, so if we add 0x14 (20 in decimal) to pc we will go right where we want. jal x1, 0x14 will do x1 = pc+4; pc += 0x14. in this case pc is 0x00000014 and pc+4 is 0x00000018, so x1 = pc+4; pc += 0x14 is x1 = 0x18; pc += 0x14 (you see sometimes I leave the leading zeroes in front to remind you that the address is just a 32 bit number, but sometimes I remove them for brievery). We then start executing instructions from address 0x28, one by one. 0x0f800293 bing, 0x0002a283 bang, 0x00300313 ting, 0x0062a023 tang.. and so on, until we reach 0x00008067, ah the famous 0x8067, my favorite instruction. jalr x0, 0(x1). JALR means Jump And Link Register, it has 3 parameters rd, rs, and immediate value, it sets rd to pc+4 and then sets pc to rs+immediate value so you can jump relative to rs. rd = pc + 4; pc = rs + immediate. Now in our case rs is x1, the immediate value is 0, and rd is x0 which is the zero register, so x0 = pc + 4; pc = x1 + 0 the write to x0 will be ignored, this is its purpose after all, the zero register, but after that, magic happens, previosly when we jumped to the subroutine we stored the return address 0x18 in x1, which means that x0 = pc + 4; pc = x1 + 0 becomes pc = 0x18 and BANG we are back to where we were going to be before we executed the subroutine call. And then we will execute the instruction at address 0x18 which is a jump to push_4, then we will be back again and execute jump to push_5 and so on until we execute a halt instruction, or in our case the infinite loop of jal x0, 0 or 0x6f my other favorite instruction. Jump to itself, x0 = pc + 4; pc = pc + 0.

This is a bit weird having push_3 and push_4 and push_5, the code is exactly the same, but the only difference is in the addi parameter, is it 3 or 4 or 5. We could use a register to just pass a parameters to the subroutine.

Rewriting the

# top = 256
li x6, 256       # x6 = 256
li x5, 248       # x5 = 248
sw x6, 0(x5)     # memory[x5] = x6

# v = 0
li x5, 240       # x5 = 240 (v)
sw x0, 0(x5)     # memory[x5] = 0


# push 3
li x10, 256
li x11, 3
jal x1, push


# push 4
li x10, 248
li x11, 4
jal x1, push

# push 5
li x10, 248
li x11, 5
jal x1, push


# pop into v
li x10, 248
li x11, 240
jal x1, pop

end:
    jal x0, end

push:
    # x10: address of top
    # x11: value

    # memoxy[x10] = x11
    lw x5, 0(x10)    # x5 = memory[x10]
    sw x11, 0(x5)    # memory[x5] = x11

    # top += 4
    lw x5, 0(x10)    # x5 = memory[x10]
    addi x5, x5, 4   # x5 += 4
    sw x5, 0(x10)    # memory[x10] = x5

    jalr x0, 0(x1)   # return

pop:
    # x10: address of top
    # x11: address of v

    # top -= 4
    lw x5, 0(x10)    # x5 = memory[x10]
    addi x5, x5, -4  # x5 -= 4
    sw x5, 0(x10)    # memory[x10] = x5

    # v = memory[top]
    lw x5, 0(x10)    # x5 = memory[x10]
    lw x5, 0(x5)     # x5 = memory[x5]
    sw x5, 0(x11)    # memory[x11] = x5

    jalr x0, 0(x1)   # return

I have been using the addresses in their raw form x0, x1, x2 and so on, but there are conventions and mnemonics for us to make the code easier to write and read, x1 is the ra return address, x10 is a0 the argument 0 register. We also have a bunch of other pseudo instructions for example jal push will expand to jalr x1, push or ret will expand to jalr x0, 0(x1), j 0 will expand to jal x0, 0, and many more.

x0/zero: Hardwired zero
x1/ra: Return address
x2/sp: Stack pointer
x3/gp: Global pointer
x4/tp: Thread pointer
x5-x7/t0-t2: Temporary registers
x8/s0/fp: Saved register/Frame pointer
x9/s1: Saved register
x10-x11/a0-a1: Function arguments/return values
x12-x17/a2-a7: Function arguments
x18-x27/s2-s11: Saved registers
x28-x31/t3-t6: Temporary registers

Rewriting the program again to use the mnemonics and the pseudo instructions

# top = 256
li t0, 248          # t0 = 248
li t1, 256          # t1 = 256
sw t1, 0(t0)        # memory[t0] = t1

# v = 0
li t0, 240          # t0 = 240 (v)
sw zero, 0(t0)      # memory[t0] = 0

# push 3
li a0, 256          # First argument: address of top
li a1, 3            # Second argument: value to push
jal push

# push 4
li a0, 248          # First argument: address of top
li a1, 4            # Second argument: value to push
jal push

# push 5
li a0, 248          # First argument: address of top
li a1, 5            # Second argument: value to push
jal push

# pop into v
li a0, 248          # First argument: address of top
li a1, 240          # Second argument: address of v
jal pop

end:
    j end

push:
    # a0: address of top
    # a1: value
    # memory[a0] = a1
    lw t0, 0(a0)     # t0 = memory[a0]
    sw a1, 0(t0)     # memory[t0] = a1
    
    # top += 4
    lw t0, 0(a0)     # t0 = memory[a0]
    addi t0, t0, 4   # t0 += 4
    sw t0, 0(a0)     # memory[a0] = t0
    ret

pop:
    # a0: address of top
    # a1: address of v
    # top -= 4
    lw t0, 0(a0)     # t0 = memory[a0]
    addi t0, t0, -4  # t0 -= 4
    sw t0, 0(a0)     # memory[a0] = t0
    
    # v = memory[top]
    lw t0, 0(a0)     # t0 = memory[a0]
    lw t0, 0(t0)     # t0 = memory[t0]
    sw t0, 0(a1)     # memory[a1] = t0
    ret

OK now we have usable push and pop that we can call as much as we want. Almost all modern systems use a stack to keep the temporary variables for their subroutines, also to be able to store data if i call a subroutine that calls the subroutine, it will mangle the return address in x1 (ra), so we need to preserve it in memory and later get it out from there, x2 is the sp register that is used specifically designated for that, and we will use it later when our program gets more complicated, but the idea is exactly the same as our stack, but instead of having the top address in system memory, we have it in the register x2 (sp).

We are slowly building up, we started with wires, and electrons, up to control logic, up to instruction decoding, to instruction parameters, to pseudo instructions, and now we have our abstract concept of a stack and a pointer. We are so far away from the electrons, its almost as if they dont exist, and yet when you open a file and inside of it you write those 4 bytes: 0x0000006f, you can imagine, what would the machine do. Even if you dont know how it is wired, you can pretend. You will have certain expectations, like when you have a sequence of instructions, they will be executed in the order you wrote them.

# top = 256
li t0, 248          # t0 = 248
li t1, 256          # t1 = 256
sw t1, 0(t0)        # memory[t0] = t1

And, I will now break everything you have built. The order of instructions is not guaranteed in the way you think. In the name of speed, the wiring might fetch multiple instructions in the same time, and execute them in parallel, or in different order if it decides that it is better. In the example above, it li t0, 248 and li t1, 256 are completely independent, so we could exploit that fact, we just have to make sure both are done before sw t1, 0(t0) is executed. Modern processors are so complicated, they inside are whole distributed systems. There are all kinds of syncronous and asyncronous procesees going on, message passing, pipelining, out of order execution, register renaming (Apple's M1 for example has 600 registers, and it uses them to store and read temporary values, in order to be able to run more instruction in parallel), branch prediction, speculative execution..

Depending on how much you want to think like the machine, how much you want to extract out of it, you have to understand it to different depth, some people stop at 'I understand basic assembly, I dont want to know anything lower', others have to go to the electorns, and I am a bit in between, I have a simplified model of wires and flipflops and few instructions, but don't understand the sophisticated complexity of the modern processor, I just "guess" how it works, unless I need to do something very performant, and then I need to know how much SRAM it has, how big is the write line, how far are things in memory, what is the memory organization and so on. Others dont want to know anything about it, they are just interested in its abstract operations, "it can add, it can store data" or even higher "I can push and pop data from a stack". Or even higher they just think about how objects interract through messages, and what kinds of relationships and structure they can build through this interraction.

You will have to find out what works for you. I am just trying to show you, that it is not so scary to go closer to the electrons, and it will allow you to have some empathy for it.

Forth is simple. Forth is complicated. Forth is extremely powerful. Forth is extremely minimal.

-- Everyone who has written a Forth

Forth, Again

A stack language is exactly what you imagine, every symbol in the language either pushes or pops from a stack. For example + will pop the top 2 elements, add them together and then push the result back. 1 will push the number 1 to the stack, 2 will push 2 and so on.

1
2
+
4 
+
bye

This forth program will first push 1, then push 2, then evaluate + which will pop 2 and pop 1, and push 3 to the stack, then 4 will be pushed, and then + again will pop 4, pop 3 and push 7 to the stack, so after executing it the stack will have just the value 7 in it.

Is kind of the same as this pseudo assembly code (pseudo code is just a mock of code, it wont compile, its goal is just to illustrate an idea):

push 1
push 2
jal plus
push 4
jal plus
jal bye

plus:
  a = pop
  b = pop
  c = a + b
  push c
  ret

bye:
  j bye

Even in this simple program 1 2 + 4 + bye we already have a language. We have symbols, we have semantic rules of how to interpret and evaluate them, we have syntax.

Syntax (grammar rules):

Each symbol is separated by whitespace
The program is read from left to right

Semantics (meaning)

Numbers are pushed to the stack
Words
- plus(+): Pops two values from the stack, adds them together, pushes the result back
- bye: Stops the program, in real Forth it exists the program, but in our pseudocode we just go into infinite loop.

Operational Semantics (how to process/understand the symbols)

The program is evaluated symbol by sumbol, left to right, each symbol is evaluated according with their semantic properties.

The language lives in a different plane from the wires, whoever writes Forth does not need to know about how our assembly will implement the + operation, or how the control logic will manipulate the circuits, or how exactly it will use the ALU. They know, when the + is executed, it will do what it is supposed to do. In the same time they expect that '+' is fast and does not depend on the values, imagine if 3 + 5 was doing 3 + 1 + 1 + 1 + 1 + 1 under the hood. So the way machine works does leak a bit into the language, and into the programmer's thoughts. Other things matter as well, like how much RAM the machine has so that you know how stack you could use.

This is the eternal tension, between us and and the machine.

Programming languages must take advantage of what the machine can do, and what our minds can think. A language that ignores this principle is doomed to fail, regardless of how powerfull or beautiful it is. In the same time, we keep writing code like we still use computers from 1979 with 64kb of RAM and 5 registers beating with 1mhz clocks, now we have 600 registers, instruction parallelism, 5ghz and 64GB of RAM. The machines have grown million fold, but we haven't. Some people say, we keep writing dead programs. Until recently I felt we have not made a real phase transition, you know when you boil water, it just keeps getting hotter and hotter until it reaches 100 degrees, and then from fluid it becomes gas, a true change, a new material, a new phase. But today, I am so excited. I read a lot of old computer books from 80s and a lot of new books, I write code in old languages and in new languages, in order to understand, both myself and the machine. As Kirkegaard says: "life can only be understood backwards, but must be lived forwards". To understand the new computers, the new phase, we must understand the old, but they should not keep us hostage. The soul of the new machine must be explored.

A language, you see, is meaningless, it can not do anything, just like the symbol '7' does not do anything. The wires however, can do things. In this world of ours, where by some miracle, the physical law was gracious enough to reveal some of its mysteries, and we have learned how to ask electrons politely to go through the wire. Who is really evaluating the symbolic language then? Is it our machine or the physical law? When the electrons go through the feed forward gates of the ALU, who is doing the addition?

Wires, assembly and Forth are possibly the best way to study the machine, language and expression. You might ask why not C, and it is a good question, but C is almost assembly, once you get used to it you can almost compile it in your head, it is an amazing language, and allow you to build incredible structures and organizations, it hides almost nothing, it tries to give you all the power over the machine, at least in the 80s that was the case, now the underlying hardware is so complicated that even gcc doesnt know how the instructions will be executed. But to explore language, it is not a great tool. LISP and Forth are better, and I have picked Forth because I think it is cool and not appreciated enough.

They say: you understand Forth once you implement Forth. So lets implement it. We will start with this tic-tac-toe Forth program, and slowly implement a Forth interpreter that will be able to execute it.

create board 9 allot

: board[] board + ;

: reset-board ( -- )
  9 0 do
    '-' i board[] c!
  loop
;

: print ( -- )
  3 0 do   \ j
    3 0 do \ i
      j 3 * i + board[] c@ emit
    loop
    cr
  loop
;

: check-line ( a b c -- flag )
  board[] c@ rot board[] c@ rot board[] c@
  dup '-' = if
    drop drop drop 0
  else
    over    \ a b c -> a b c b
    =       \ a b c==b
    rot rot \ c==b a b
    =       \ c==b a==b
    and     \ c==b && a==b
  then
;

: check-win ( -- )
  0 1 2 check-line if 1 exit then
  3 4 5 check-line if 1 exit then
  6 7 8 check-line if 1 exit then
  0 3 6 check-line if 1 exit then
  1 4 7 check-line if 1 exit then
  2 5 8 check-line if 1 exit then
  0 4 8 check-line if 1 exit then
  2 4 6 check-line if 1 exit then
  0
;

: play ( -- )
  'X' 'O'
  begin
    over emit ." 's turn" cr
    print
    over key '0' - board[] c!
    swap
    1 check-win = if
        print cr emit ."  wins" cr
        exit
    then
  again
;

reset-board play bye