Yesterday, I asked myself if I should add a very short and simplified definition of basic terms to the Appendix. I am still unsure if that makes sense, but on the other hand it is always good to have such a short text available. So I started writing a first draft fully manually, currently without any grammar fixes. I think it is not that bad as a starting point. Later I could always replace it with definitions from Wikipedia or AI.
Does the text contain serious errors? For Cache, Alignment, and Endianness I am not that sure, as I have never cared that much about these topics. And are there more terms that would need a definition for Rust beginners?
Common Terms of Systems Programming
Values and Variables
Most data values in the real world are built from a number and a unit (dimension), like 60s or 1.2m for a time period and a length.
When we process data with a computer, we might use units as well when we read in values from the terminal or from files, but
when we actually process the data, we generally just assume that some standard unit is implicitly applied, like seconds or meters
for time intervals or lengths, and use only the numeric part for arithmetic operations. However, for a strongly and statically typed
language like Rust, we have to decide which data type we use to store the numeric value, e.g. u8, i32, or f64. The chosen data type
determines which values are available. For example, if we decide to store the age of our employees in a u8 (one byte, using the binary number format), we have a numeric range from 0 to 255 available,
and we could store the age of a 35-year-old person in the bit pattern 00100011.
To actually store values in the form of bit patterns we use variables, which are named memory slots, typically on the stack,
with a fixed address and a size determined by the data type of the value.
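In Rust, such a variable definition could look like this tiny sketch (the name age is of course arbitrary):

    fn main() {
        let age: u8 = 35; // one byte, bit pattern 00100011
        println!("age = {}, binary = {:08b}", age, age);
    }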
The fact that all our numeric values are just plain numbers without a physical unit or dimension leads to the
problem that invalid mathematical operations cannot be prevented by the compiler, and that completely wrong values could be passed to
functions. For example, we could add a numeric instance of a time and a length value, or we could pass a time value to a
function that expects a length. The compiler would only check if the data types match, but has no other way to detect
the invalid code. To prevent such errors in critical code, most modern programming languages offer some support.
For Rust, one option is to use the newtype idiom, which means that we embed the numeric value in a struct type like

    struct Length(f64);

This way, it is ensured that this type is not compatible with any other numeric type. We can
define our own operations and conversions for this type, e.g. we might implement the + operator to allow the addition
of two lengths. By defining more methods and traits, we could allow operations like the division of a length by a time,
resulting in another newtype called Speed. The book's chapter about the struct type will present a few examples of this strategy.
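A minimal sketch of this idea could look like the following; the types Length, Time, and Speed and their implied units are just made-up examples:

    use std::ops::{Add, Div};

    #[derive(Debug, Clone, Copy)]
    struct Length(f64); // meters

    #[derive(Debug, Clone, Copy)]
    struct Time(f64); // seconds

    #[derive(Debug, Clone, Copy)]
    struct Speed(f64); // meters per second

    impl Add for Length {
        type Output = Length;
        fn add(self, other: Length) -> Length {
            Length(self.0 + other.0)
        }
    }

    impl Div<Time> for Length {
        type Output = Speed;
        fn div(self, t: Time) -> Speed {
            Speed(self.0 / t.0)
        }
    }

    fn main() {
        let total = Length(1.2) + Length(0.8); // adding two lengths is allowed
        let speed = total / Time(4.0);         // a length divided by a time gives a speed
        println!("{:?} {:?}", total, speed);
        // let wrong = total + Time(4.0);      // this line would not compile
    }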
Pointers, Memory Addresses, and References
A special kind of variable is a pointer, which stores not some actual numeric data, but
a memory address. This way, pointers create a form of indirection. A pointer
variable can store a memory address, and this address can contain arbitrary data, including
again a pointer, which then might store another memory address, and so on.
Accessing the actual item to which a pointer points is called dereferencing the pointer.
Pointers often contain addresses of a special memory region that is called the heap.
The heap is a memory region that is dynamically managed by the operating system. We will give a short definition
of the stack and the heap in the next section.
But pointers are in no way restricted to heap addresses; they can point to
any memory location, including other variables on the stack or static text located in a data segment.
With pointers we can build interesting data structures, e.g. binary trees, where each node contains two pointers, which can each link to other nodes.
We can also have a set of pointers that all point to the same memory region, that is, to the same entity. This way we can construct
"many to one" data structures. For example, the pieces on a chess board might be represented by a struct that not only has fields
for its color and position, but also a pointer to the whole board.
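In safe Rust such links are usually expressed with smart pointer types like Box rather than raw pointers; a minimal sketch of a binary tree node could look like this:

    // A binary tree node; the links are expressed as Option<Box<...>>,
    // where None plays the role of a NULL pointer.
    struct Node {
        value: i32,
        left: Option<Box<Node>>,  // pointer to the left child, or None
        right: Option<Box<Node>>, // pointer to the right child, or None
    }

    fn main() {
        let leaf = Node { value: 1, left: None, right: None };
        let root = Node { value: 2, left: Some(Box::new(leaf)), right: None };
        println!("{}", root.left.as_ref().unwrap().value);
    }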
An important property of pointers is that they can be invalid, which means they do not contain the address of an actual target entity.
A common convention is that a special pointer value called NULL indicates the absence of an actual target value.
NULL is often just a word with all bits cleared, but any other well-defined bit pattern could be used as a marker as well.
Pointers are typically initialized to NULL before an actual value is assigned. Using a NULL pointer to actually access
a target item is an invalid operation; doing so is a serious programming bug. However, using pointers
has one more risk: the item to which a non-NULL pointer points can become invalid. This happens when
items allocated on the heap are freed, or when items on the stack become invalid because a scope or a function ends.
Continuing to use a pointer after its target has become invalid is a serious programming error.
Rust allows dereferencing raw pointers only in code sections that are explicitly marked with the unsafe keyword.
Regular Rust code uses references, which are a safe variant of pointers. References are guaranteed to always point
to valid data, and this is ensured by the Rust compiler through the borrow checker, which verifies that a reference
can only be used as long as the data it points to is actually available.
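A minimal sketch of a reference, and of the kind of code the borrow checker rejects, could look like this:

    fn main() {
        let x: i32 = 42;
        let r: &i32 = &x;   // r borrows x; the compiler ensures x lives long enough
        println!("{}", *r); // dereferencing a reference is always safe

        // The borrow checker rejects code where a reference would outlive its target:
        // let dangling: &i32;
        // {
        //     let y = 7;
        //     dangling = &y; // error: `y` does not live long enough
        // }
        // println!("{}", *dangling);
    }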
L-Values and R-Values
In computer science we sometimes distinguish between l-values and r-values. L-values can be used on the left side of an assignment operator and
can store a value. R-values are used on the right side of the assignment operator; they include ordinary variables, but can also be expressions or constants
that cannot be used on the left side of an assignment operator.
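A tiny Rust example:

    fn main() {
        let mut x = 1; // x names a memory location, so it can be used as an l-value
        x = x + 2;     // left side: the l-value x; right side: the r-value expression x + 2
        // x + 2 = 5;  // invalid: an expression like x + 2 cannot stand on the left side
        println!("{}", x);
    }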
Storage
A computer's memory is typically an addressable storage with arbitrary read and write access, often called RAM for random access memory.
This main memory of a computer is typically logically divided into a stack and a heap area. While this storage is physically identical, the way it is used
and organized makes the difference. Other parts of the computer's memory are the areas where the program code itself and related static data like
text segments reside. Most computers also have some non-volatile memory to permanently store basic configuration data, and all of a computer's software
and data is now typically stored on another form of non-volatile storage, an SSD, instead of on traditional hard disks.
Stack vs. Heap
Stack
The stack is a memory region that is used in a very simple, linear way: an entity called the stack pointer
is initialized to some start address, and
whenever a variable has to be stored, it is put at that location, while the stack pointer is increased at the same time by the byte size of the stored
variable. Actually, the whole stack area can be filled from bottom to top or from top to bottom -- for our explanations we just assume the first variant. Storing variables this way is easy and fast, and as variable names in our source code directly correspond to addresses on the stack, we can
access all the values in arbitrary order. The downside of using the stack in this way is that we cannot insert values at arbitrary positions: even if we moved
the existing values to insert new data, this would confuse all the bookkeeping, invalidating all existing mappings of variable names to memory locations.
Removing unused values also has some restrictions: we can decrease the stack pointer by some amount to delete the variables that we pushed on the stack last,
but we cannot delete arbitrary variables. The most popular real-life example of this last-in, first-out behavior is a stack of plates in a canteen:
the service personnel puts cleaned plates on the top, and customers remove them from the top.
For storing function parameters and local variables, the stack is an efficient data structure, as long as each item has a fixed size.
To call a function, we only have to push the arguments on the stack and increase the stack pointer. And when execution of that function starts, the
stack pointer is again increased by a value determined by the size of all the local variables of that function.
When calling another function, or the same function recursively, this process is repeated. Finally, when the function ends or returns early due to a return statement, the
function's local variables are all freed at once by a single adjustment of the stack pointer. The actual behavior is a bit more complicated: before
calling a function, the return address is pushed on the stack as well, so that the function has the data available to jump back to the call site.
And when the function returns some results, that data may be placed on the stack as well before the function returns. Another point is that
some function parameters or local variables might be stored temporarily in CPU registers instead of on the stack, which might improve performance.
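As a small illustration, every call of the recursive function below gets its own stack frame with its own copy of the parameter and the local variable, and each frame is freed again when that call returns:

    fn factorial(n: u64) -> u64 {
        if n <= 1 {
            return 1;
        }
        let result = n * factorial(n - 1); // the recursive call pushes a new stack frame
        result                             // returning pops this frame again
    }

    fn main() {
        println!("{}", factorial(5)); // prints 120
    }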
Heap
The heap is a memory region that is managed dynamically, typically by a memory allocator that in turn requests larger blocks from the OS. Microcontrollers and embedded systems that do not have an OS typically avoid using heap memory.
The allocator does careful bookkeeping for the heap. Whenever storage for data is needed, it has to be requested by calling functions like alloc(),
and it has to be released later by calling functions like free() or dealloc() to make that storage available for other data again.
This is some overhead, but the advantage is that the memory is managed dynamically, allowing allocations, re-allocations, and de-allocations in
arbitrary order.
Pointer variables located on the stack and storing the start address of some data are typically used to access memory areas in the heap, but heap data might contain
addresses as well, which allows us to construct all kinds of data structures like linked lists or tree-shaped structures.
Some of Rust's data types,
like String or Vec, consist of a struct with three fields: two numeric fields of type usize storing the capacity and the current length
of the data, and a pointer to the actual heap-allocated data segment. When the capacity of a string or vector is not sufficient for storing the intended data,
a new heap segment has to be allocated, and the existing data has to be copied from the old heap region to the newly allocated block before the pointer in the struct is updated and
the now unneeded old memory block can be released.
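A small sketch showing this growth for a Vec might look like the following; the initial capacity of 4 is an arbitrary choice, and the exact growth strategy is an implementation detail:

    fn main() {
        let mut v: Vec<u8> = Vec::with_capacity(4);
        println!("len = {}, capacity = {}, ptr = {:p}", v.len(), v.capacity(), v.as_ptr());

        for i in 0..5u8 {
            v.push(i); // the fifth push exceeds the capacity and triggers a re-allocation
        }
        println!("len = {}, capacity = {}, ptr = {:p}", v.len(), v.capacity(), v.as_ptr());
        // The pointer value usually changes after the re-allocation; the new capacity
        // depends on the allocator's growth strategy.
    }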
Cache
Modern computers have the problem that ordinary RAM storage is relatively slow compared to CPU instructions working on processor registers.
The limited memory bandwidth of ordinary RAM becomes even more of a problem when many CPU cores have to access the RAM simultaneously over the same data bus.
The cache is a faster type of memory that serves as fast storage for small to medium sized data and as a buffer for the ordinary RAM.
Modern hardware often uses multiple cache levels with different sizes and speeds, called level 1, 2, and 3 cache, often labeled L1, L2, and L3.
In some computer architectures each CPU core has its own cache, in other architectures the cores share a cache. The large L3 cache is typically shared between cores.
The level 1 cache is very fast, with an access latency of only a few CPU cycles, but is typically only a few dozen kilobytes in size. The level 2 cache is larger, its size is typically a few
hundred kilobytes up to a few megabytes, and its performance is between the L1 and L3 caches. Finally, the L3 cache is typically several megabytes to a few dozen megabytes in size, but it is the slowest of the three cache levels.
Caches are used both for data and for program code. The data cache has the effect that working with small data sets, or traversing larger data sequentially,
is much faster than working on large data in a non-sequential order. And the cache for program code (the instruction cache) can have the effect that small loops, which fit completely
into that cache, run faster than very large loops that do not.
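To get a feeling for the data cache effect, one might compare a sequential and a strided traversal of the same buffer; the buffer size and the stride below are arbitrary choices, the timings depend entirely on the actual hardware, and the program should be compiled with --release:

    use std::time::Instant;

    fn main() {
        let n = 64 * 1024 * 1024; // 64 Mi elements, much larger than any CPU cache
        let data = vec![1u8; n];

        let start = Instant::now();
        let mut sum: u64 = 0;
        for i in 0..n {
            sum += data[i] as u64; // sequential access, cache friendly
        }
        println!("sequential: sum = {}, time = {:?}", sum, start.elapsed());

        let start = Instant::now();
        let mut sum: u64 = 0;
        let stride = 4096;
        for offset in 0..stride {
            let mut i = offset;
            while i < n {
                sum += data[i] as u64; // strided access, many cache misses
                i += stride;
            }
        }
        println!("strided:    sum = {}, time = {:?}", sum, start.elapsed());
    }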
Data Alignment
Modern computer hardware does not transfer single bytes between the CPU and the RAM, but whole words. The word size is determined by the data bus width and
is typically 64 bits or 8 bytes, and the transferred data words are located at addresses that are multiples of the word size, e.g. multiples of 8.
To optimize the data transfer, compilers typically align data: instances of a struct, or other entities that are at least 8 bytes in size, start
at addresses that are multiples of 8. This ensures that for transferring n words, only n memory accesses are needed. Without alignment, one additional
memory access could be needed, potentially with additional CPU instructions to move the data into the CPU registers. Data that is four or two bytes in size is also aligned,
typically so that the start address is a multiple of its own size. Alignment typically does not have a large impact on the way we write our code. However,
one result is that a struct's total size is rounded up to a multiple of its alignment, e.g. a struct with one field of type u64 and one field of type u8 has a total size of 16 bytes and contains
7 bytes of unused padding. Another consequence is that fields of struct types might be reordered by the compiler, or that padding bytes are inserted.
This can lead to surprising results when data is exchanged with other programming languages, which might use different padding and alignment strategies.
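A small sketch to inspect this from Rust might use size_of and align_of from the standard library; the struct names and fields below are made up for illustration:

    use std::mem::{align_of, size_of};

    // The field order of a plain Rust struct is not guaranteed; #[repr(C)]
    // requests a C-compatible layout with the declared field order instead.
    #[allow(dead_code)]
    struct Mixed {
        a: u8,
        b: u64,
        c: u8,
    }

    #[allow(dead_code)]
    #[repr(C)]
    struct MixedC {
        a: u8,
        b: u64,
        c: u8,
    }

    fn main() {
        println!("u8:     size = {}, align = {}", size_of::<u8>(), align_of::<u8>());
        println!("u64:    size = {}, align = {}", size_of::<u64>(), align_of::<u64>());
        println!("Mixed:  size = {}, align = {}", size_of::<Mixed>(), align_of::<Mixed>());
        println!("MixedC: size = {}, align = {}", size_of::<MixedC>(), align_of::<MixedC>());
        // With #[repr(C)] the declared field order is kept, so MixedC needs 24 bytes
        // (including 14 bytes of padding), while the compiler is free to reorder the
        // fields of Mixed and typically gets away with 16 bytes.
    }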
Endianness
Endianness defines how data larger than a single byte is stored in memory. An 8-byte integer with the value 1 consists of one byte with the lowest bit set and
7 bytes with all bits cleared. When this value is stored in memory, the single non-zero byte might be located at the lowest or at the highest address. These two
possibilities are called little-endian and big-endian encoding. Typically we do not have to care about this, but when we exchange data over the Internet,
or store data on one computer and load it on another, the endianness is important.
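In Rust we can look at both byte orders with the standard conversion methods, for example:

    fn main() {
        let value: u64 = 1;
        println!("little endian: {:?}", value.to_le_bytes()); // [1, 0, 0, 0, 0, 0, 0, 0]
        println!("big endian:    {:?}", value.to_be_bytes()); // [0, 0, 0, 0, 0, 0, 0, 1]
        println!("native order:  {:?}", value.to_ne_bytes()); // whatever the CPU uses
    }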