Understanding the fundamental concepts behind the APIs?

So even though I came to Rust to improve my programming chops, I have been blown away by how much skill and understanding I have been picking up. This journey has been wild, and for that I am so grateful!

But a lot of programming tutorials focus on explaining an API and/or assume you already know what's really happening underneath. Perhaps I struggle simply because I am entirely self-taught.

I need the dummies guide to computer science, likely independent of any programming language.

For example, here are some areas I feel quite lost in. Perhaps you can read them and either reply directly to any of them, or suggest appropriate learning material of any kind:

  • What is serialization and deserialization?

  • How do HTTP servers actually 'make webpages'? I.e., how is it possible to code a web app entirely in Rust (e.g. Rocket, Actix)? I guess this means understanding HTTP, TCP/IP, sockets, etc.?

  • What is actually happening inside a virtual machine? If a virtual machine is just an intermediary step between input and output, isn't any program technically a 'virtual machine'?

  • How does one even go about creating a programming language? Is creating a language effectively just complex string parsing/manipulation? Are there clearly laid-out steps common to every programming language? (I've heard terms like lexer, tokenisation, and parsing thrown around.)

  • How does concurrency work? And what concepts do I need to understand before learning things like Tokio and async/await? I've also heard terms like atomics and mutexes.

Thank you for your time!


Serialization is the act of taking some value in your program and turning it into a sequence of bytes you can write to e.g. a file or network connection. Deserialization is the opposite.
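
For a tiny, concrete example of both directions, here is a sketch with a single integer using only the standard library; the choice of big-endian is arbitrary, just a fixed convention both sides must agree on:

fn main() {
    let n: u32 = 305_419_896; // 0x12345678
    // Serialization: turn the value into a fixed, well-defined byte sequence.
    let bytes: [u8; 4] = n.to_be_bytes();
    assert_eq!(bytes, [0x12, 0x34, 0x56, 0x78]);
    // Deserialization: reconstruct the value from those bytes.
    let back = u32::from_be_bytes(bytes);
    assert_eq!(back, n);
}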


A web server involves many components, but ultimately it boils down to the client opening a TCP connection to the server, and writing something similar to this to the socket:

GET /some/path HTTP/1.1
Host: www.example.com

The server then responds with something like this:

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 155
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
  <head>
    <title>An Example Page</title>
  </head>
  <body>
    <p>Hello World, this is a very simple HTML document.</p>
  </body>
</html>

So writing a webserver involves:

  1. Some sort of loop receiving new connections.
  2. Some code that reads data from a new connection.
  3. Some code that can deserialize the request data into an object the server understands.
  4. Some code that can build up the resulting html (e.g. look up the path and read the file).
  5. Some code that can write the resulting data (like in the example above) to the socket.

The step where it builds the html can involve many pieces, e.g. maybe it talks to a database, or all sorts of other crazy stuff. A web server probably also involves some sort of concurrency so it can handle multiple connections at the same time.
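
To make those steps concrete, here is a minimal sketch using only the standard library. The address, buffer size, and hard-coded response are placeholder assumptions; a real server would actually parse the request, handle errors, and serve more than one page:

use std::io::{Read, Write};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Step 1: a loop receiving new connections.
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // Step 2: read the raw request bytes. Step 3 (deserializing them
        // into a request object) is skipped here; we answer everything.
        let mut buf = [0u8; 1024];
        let _ = stream.read(&mut buf)?;
        // Step 4: build the resulting HTML.
        let body = "<html><body><p>Hello World</p></body></html>";
        // Step 5: write the response to the socket.
        let response = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
            body.len(),
            body
        );
        stream.write_all(response.as_bytes())?;
    }
    Ok(())
}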


A virtual machine is a program that lets you run one operating system inside another. I don't know too many details on how they work internally, but there's usually some way to make it more efficient than simulating an entire computer inside the computer.


A compiler is in some sense just a complex string parser and manipulator, as you say. There are some common pieces, e.g. you might first tokenize the source file into a sequence of larger pieces that are easier to work with, e.g.: [keyword fn, identifier foo, token (, token ), token ->, identifier String, token {] and so on.

You have many operations like this. Eventually, after many such conversions, you have converted it to assembly and then to a sequence of bytes that you can write to a file.
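
As a rough sketch, a tokenizer for the fragment above might look like this; the token names and structure are invented for illustration, and a real lexer would also handle comments, string literals, numbers, and error reporting:

#[derive(Debug)]
enum Token {
    Keyword(String),
    Ident(String),
    OpenParen,
    CloseParen,
    Arrow,
    OpenBrace,
}

fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            // Letters start a word: collect it, then decide keyword vs identifier.
            c if c.is_alphabetic() => {
                let mut word = String::new();
                while let Some(&c) = chars.peek() {
                    if c.is_alphanumeric() || c == '_' {
                        word.push(c);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(if word == "fn" { Token::Keyword(word) } else { Token::Ident(word) });
            }
            '(' => { tokens.push(Token::OpenParen); chars.next(); }
            ')' => { tokens.push(Token::CloseParen); chars.next(); }
            '{' => { tokens.push(Token::OpenBrace); chars.next(); }
            '-' => {
                chars.next(); // consume '-'
                if chars.peek() == Some(&'>') {
                    chars.next(); // consume '>'
                    tokens.push(Token::Arrow);
                }
            }
            _ => { chars.next(); } // skip whitespace and anything unhandled
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("fn foo() -> String {"));
    // [Keyword("fn"), Ident("foo"), OpenParen, CloseParen, Arrow, Ident("String"), OpenBrace]
}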


Concurrency just means "at the same time". There are many facets of this, and it's a very large topic.

One side of it is parallelism. This is about using multiple CPU cores to make your programs faster. E.g. maybe you wrote a ray-tracing program, and each pixel can be computed independently of any other, so you split it up into several smaller tasks that are computed in parallel on different CPUs. This is the world of rayon.

Another side is the world of async/await. This is more about how you can run thousands of tasks at the same time, when each task spends most of its time waiting (e.g. network IO). This is where Tokio and such come into the picture. If this interests you, I recommend the Tokio tutorial.
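
As a tiny illustration of the parallelism side, here is a sketch using plain OS threads from the standard library (the workload is an arbitrary made-up sum); rayon essentially automates this kind of splitting for you:

use std::thread;

fn main() {
    // Split an (arbitrary) computation into halves running on two threads,
    // which the OS can schedule onto different CPU cores.
    let left = thread::spawn(|| (0u64..1_000_000).sum::<u64>());
    let right = thread::spawn(|| (1_000_000u64..2_000_000).sum::<u64>());
    // join() waits for each thread to finish and collects its result.
    let total = left.join().unwrap() + right.join().unwrap();
    println!("{total}");
}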


As @alice said, serialization is taking some value and turning it into a sequence of bytes (literally putting the bytes "in order".) Think of it like writing; in your head, a sentence is a sort of abstract thing, but you can write letters (and other symbols) in sequence on paper, hand that off to someone else, then they can read that sequence and form the same abstract sentence in their head. The writing is serialization, and the reading deserialization.

A virtual machine is, at a high level, a program that takes some stream of input that represents commands and implements those commands. You could view pretty much anything that takes input as a virtual machine with a very specific instruction set, but in general a virtual machine mimics a real machine in some ways (a dictionary mapping variable names to values in place of raw memory, a program "counter" pointing to the current virtual instruction), and the term is mostly used for programs that take a very generic instruction language (sometimes bytecode, where each instruction is something like "load variable foo onto the stack" and "add these two values and store the result here", and sometimes a higher-level language parsed into an abstract syntax tree) and execute that language. With a virtual machine, it would be theoretically possible to create a piece of hardware that could execute those instructions without anything needing to be virtual.
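
A toy version of such a bytecode machine might look like this in Rust. The instruction set is invented for illustration, but the shape (a loop dispatching on instructions, with a stack standing in for machine state) is the essence:

// A made-up bytecode and the loop that executes it.
enum Op {
    Push(i64), // load a constant onto the stack
    Add,       // pop two values, push their sum
}

fn run(program: &[Op]) -> Option<i64> {
    let mut stack = Vec::new();
    for op in program {
        match op {
            Op::Push(n) => stack.push(*n),
            Op::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
        }
    }
    stack.pop()
}

fn main() {
    // Computes 2 + 3 by interpreting the instruction stream.
    let result = run(&[Op::Push(2), Op::Push(3), Op::Add]);
    println!("{result:?}"); // Some(5)
}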

Creating a language is really two things: creating the language specification, which is the abstract model of the language (stuff like "numbers are values"), and creating an implementation, which is the thing that compiles or runs the language (often this is a virtual machine).

In an implementation, you have to turn the human-readable source code into something understandable by a machine. This is often done through a two-step process: tokenization using a lexer (this breaks a continuous source file into individual "words", handling things like whitespace), and then a parser, which takes the list of tokens and turns it into an abstract syntax tree, which represents the structure of what was written in a way the computer can handle. So if you had a source file

fn main() {
    println!("Hello, world!");
}

you'd put it through the tokenizer and get

[FN, ident(main), OPEN_PAREN, CLOSE_PAREN, OPEN_BRACKET,
 ident(println), EXCLAMATION_MARK, OPEN_PAREN,
 string(Hello, world!), CLOSE_PAREN, 
 SEMICOLON, CLOSE_BRACKET]

which the parser turns into the internal form

function {
    name: "main",
    args: [],
    body: macro_call {
        name: "println",
        args: ["Hello, world!"],
    }
}

Then, the compiler has to figure out what that tree of syntax actually needs to do. This can be fairly complicated or very simple, depending on the language. Often a compiler is a multi-phase thing: one step takes this syntax tree and turns it into a flatter tree, the next step desugars things to reduce the number of core features the language needs to implement, and so on, until eventually machine code comes out. Or you could just write a function

fn exec(ast: AST) { ... }

that just "does" whatever the ast represents, as a scripting language does.


Virtual Machine:

All programs process input and output, but virtual machine programs are designed to mimic the abilities of an entire computer. The virtual machine is free to run whatever software is supported by the mimicked hardware, including entire operating systems.

Usually this mimicry is assisted by specialized hardware capabilities called virtualization technologies, enabling far better performance than software can manage alone. In this case, the VM's software is executed directly by the host machine's CPU rather than by the virtual machine software, which would be many orders of magnitude slower.

If you have a virtual machine without specialized hardware support, what you have is an emulator. Emulators are slower by nature but suffer fewer restrictions due to being implemented entirely in software. Example: hardware virtualization will let you run an x64-based virtual machine on an x64-based host, but will not let you run an ARM-based virtual machine on that same host. Emulation is used to fill in the gaps in hardware support, giving broader but slower support.

Even in a hardware-assisted virtual machine, some bits of hardware may be emulated while others are virtualized. Network adapters and disk drives are often virtualized, but other hardware is often emulated.


Despite a lifetime of programming, creating a language, and more importantly a compiler for a language, is still some deep black magic to me.

However, have a read of "Let's Build a Compiler" by Jack Crenshaw: https://compilers.iecc.com/crenshaw/

That is an amazing series of articles aimed at those who can at least program something. It's a great discussion of the choices one has to make when designing a language, with some historical background, and then a super simple discussion of how to write a compiler that will get you from the input source code of your language to executable instructions, in perhaps the simplest way it can be done.

Simple enough that I managed to create a compiler in a few hundred lines of code for a simple C/Pascal style language that generated code for Intel x86 and the Propeller micro-controller from Parallax Inc.

All his example code is in Pascal, but it's clear enough that I wrote mine in C. And now I'm starting to think I should have another go in Rust. Anyway, the idea is not to slavishly copy his code but to use the ideas for your own little language.

Of course that is missing a ton of compiler technology, all that abstract syntax tree stuff and optimization etc. But still, there are useful techniques to be learned there, and it's a real buzz to see one's first compiled code run!


Sometimes the term virtual machine is used to describe an abstract machine that is not identical to any particular CPU. An example is the Java Virtual Machine (JVM). The Java compiler produces byte code, which is something like machine instructions. A second step converts the byte code to machine instructions for a specific CPU. You only have to compile your Java code once since the byte code can be converted to machine instructions for different kinds of CPU.

Or my favorite abstract virtual machine: The Z-machine

Wow, thanks all for the feedback, especially the links to further learning.

So as the server is listening for messages, it receives a blob of binary data from a client (with some kind of termination mechanism: either the length is fixed, or it is terminated by some sequence of byte values). Deserialization is the parsing of that binary blob... understanding that it is a GET request (or whatever)?

So if I have some random struct in my program:

struct Person<'a> {
    name: &'a str,
    age: i32,
    eye_color: Color,
}

It already exists as bytes somewhere in memory. So serialization is copying/sending those bytes over to some other destination? And deserialization is parsing the incoming bytes and understanding that they are an instance of Person?

Reading the binary blob over the network is not part of deserialization; only the conversion from byte array to object is. Similarly, serialization is merely the conversion from object to byte array, and what you do with that byte array (e.g. write it to a file or send it over the network) is not considered part of the serialization.

However, it's worth noting that many serialization libraries let you give them an IO resource rather than a byte array, so they can do their thing without reading the entire input into memory. This mixes things up a bit, but still: the reading/writing part is the reading/writing part, and the understanding/creation of bytes is the deserialization/serialization part.

In the case of a GET request, it is terminated by two consecutive newlines (the blank line after the headers).

Unlike those bytes in memory, the serialized data must be understandable to any computer with any type of CPU, so you can't "just" copy the bytes over. An i32 looks different on big-endian and little-endian machines, but to communicate, they must choose one specific way to represent the integer as bytes and insert conversions as necessary.

Also, a &str is not stored in the same place as the rest of the Person. It's a pointer to somewhere else. The serialized byte array has to store all of it in a single byte array. E.g. maybe it starts with four bytes containing the length of the string, followed by that number of bytes, then followed by the age in big-endian, finally followed by the eye color in some format.

I just stumbled on this very compact and well put together video on writing a Lisp interpreter in JavaScript (the concepts are valid across languages), if anyone in the future is interested.

Thanks, I will probably check that out.

But in some deep way that is not the thing about writing a compiler that fascinates me.

The "magic" in a compiler for me is that it produces the real binary instructions that some real processor can execute in the absence of any other software, no operating system, no interpreter, nothing but the hardware.

Further, despite the fact that I might write such a compiler in C or Pascal or whatever, at least in concept it could be written in assembler, compiled/assembled by hand with pencil and paper and entered into that machine in binary.

The fascination then is: how does one bootstrap the whole software world, OS, compilers, interpreters, etc., starting from the hardware alone with no software in sight?

Using an interpreted language like JS to interpret some other language like Lisp is far removed from that problem.

On a Nova 9 circa 1976, we used toggle switches on the front panel to enter the bits that told the computer to start loading more code from tape. (I think it was tape.) These days that bootstrap code is in ROM.

Oh boy, reminiscence time....

Back in 1980 we built a single-board computer using the new Motorola 6809 microprocessor. We had no compiler or assembler or any tools for it.

Getting code into that machine went as follows:

  1. Write what you wanted to do in pseudo code, kind of ALGOL-like, on paper with a pencil. (Pencil is best because you can erase mistakes and start over.)

  2. "Compile" that pseudo code to assembly language, by hand with pencil an paper.

  3. "Assemble" that assembly code to hexadecimal memory content. With pencil an paper.

  4. Enter that into an EPROM programmer via a paper tape punch. Blow the EPROM chip with it. Put the EPROM chip in the new computer board and see if it ran.

After some weeks we had a boot loader and a debug monitor running on that board. We could load and save code to cassette tape. Run it, inspect memory, set break points, all from a VT100 serial terminal.

Magic.

It always fascinated me as to where we could go with that. The next step would have been to write an assembler for that machine, then some operating system, then some high level language. All created from nothing.

Of course we never did. But Ken Thompson and Dennis Ritchie had done so ten years before in creating Unix.
