How can I make this function as fast as possible while taking up as little memory as possible?

In my kernel I have a very small heap (1 MB). I can make this larger, but I could never make it large enough for what the function I'm trying to optimize does.
My virtual memory manager accepts memory information from my bootloader. Since this information isn't Clone or Copy (it's behind a pointer), I manually duplicate it and store it in a static:

// For now I use a stack-allocated Vec -- my memory manager is kinda broken at the moment.
static MMAP: Once<Vec<MemoryRegion, 1024>> = Once::new();
// ...
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
struct MemoryRegion {
    pub start: u64,
    pub end: u64,
    pub kind: MemoryRegionKind,
}

/// Initializes the internal memory map.
#[cold]
pub fn init_memory_map(map: &'static MemoryRegions, rsdpaddr: u64) {
    info!(
        "Loading free memory region list from memory map at {:p}",
        map
    );
    MMAP.call_once(|| {
        let mut mmap: Vec<MemoryRegion, 1024> = Vec::new();
        map.iter().for_each(|region| {
            mmap.push(MemoryRegion {
                start: region.start,
                end: region.end,
                kind: region.kind,
            })
            .unwrap();
            STOTAL.fetch_add(region.end - region.start, Ordering::Relaxed);
        });
        mmap
    });
    info!("Discovered {} bytes of RAM", STOTAL.load(Ordering::Relaxed));
    info!("RSDP at {:X}", rsdpaddr);
    RSDP.swap(rsdpaddr, Ordering::Relaxed);
}

This init_memory_map function is only called once, hence the use of the cold attribute. The function that's (potentially) on the really, really hot path is this portion of code:

// For context only
static FRAME_ALLOCATOR: Lazy<TicketMutex<Option<GlobalFrameAllocator>>> =
    Lazy::new(|| TicketMutex::new(None));
// Position of frame that we want to get
static FPOS: AtomicUsize = AtomicUsize::new(0);
// ...
#[derive(Debug, Copy, Clone)]
struct GlobalFrameAllocator;

impl GlobalFrameAllocator {
    /// Initializes the global frame allocator
    #[cold]
    pub fn init() -> Self {
        FPOS.store(0, Ordering::Relaxed);
        GlobalFrameAllocator {}
    }
}

unsafe impl FrameAllocator<Size4KiB> for GlobalFrameAllocator {
    #[must_use]
    fn allocate_frame(&mut self) -> Option<PhysFrame> {
        if MMAP.is_completed() {
            FPOS.fetch_add(1, Ordering::SeqCst);
            return MMAP
                .get()
                .unwrap()
                .iter()
                .map(|r| r.start..r.end)
                .flat_map(|r| r.step_by(4096))
                .map(|addr| PhysFrame::containing_address(PhysAddr::new(addr)))
                .nth(FPOS.load(Ordering::Relaxed));
        } else {
            warn!("Memory allocation attempted when MMAP was not ready");
            warn!("Waiting for memory map to be ready...");
            FPOS.fetch_add(1, Ordering::SeqCst);
            return MMAP
                .wait()
                .iter()
                .map(|r| r.start..r.end)
                .flat_map(|r| r.step_by(4096))
                .map(|addr| PhysFrame::containing_address(PhysAddr::new(addr)))
                .nth(FPOS.load(Ordering::Relaxed));
        }
    }
}

impl FrameDeallocator<Size4KiB> for GlobalFrameAllocator {
    unsafe fn deallocate_frame(&mut self, _: PhysFrame) {
        FPOS.fetch_sub(1, Ordering::SeqCst);
    }
}

The reason this code is on the really hot path (the boiling path?) is that every time we want a new frame of memory for a page we're allocating, we call allocate_frame; and every time we want to free a frame (or, in this case, just step back one position to "free" it), we call deallocate_frame. (In the future I hope to have a "proper" frame deallocator, but this suffices for now.)
As it currently stands, the frame allocation routine is quite complicated. From what I understand of iterators, these particular ones are not zero-cost. The top-level conditional branch is necessary: we don't want the frame allocator handing out frames before the memory map is ready. I could eliminate the explicit check by just calling wait() on MMAP unconditionally, but that wouldn't really get rid of the branch (the check would just move inside wait()) and would only complicate the code even more.
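For reference, the wait()-only variant I'm describing would look something like this (untested sketch; the readiness check doesn't disappear, it just moves inside wait()):

fn allocate_frame(&mut self) -> Option<PhysFrame> {
    FPOS.fetch_add(1, Ordering::SeqCst);
    // `wait()` spins until `init_memory_map` has populated MMAP,
    // replacing the explicit `is_completed()` branch.
    MMAP.wait()
        .iter()
        .map(|r| r.start..r.end)
        .flat_map(|r| r.step_by(4096))
        .map(|addr| PhysFrame::containing_address(PhysAddr::new(addr)))
        .nth(FPOS.load(Ordering::Relaxed))
}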
I love iterators, I really do. My entire codebase is riddled with them because I enjoy using them and they make programming quite pleasant. (The only place I don't use them is in PCI enumeration where I need to recursively call async functions, and async closures aren't stable yet.) How could I optimize this code and when would an iterator be a bad thing? I understand that the above allocate_frame function is sort-of equivalent to loops and conditional statements in other languages.
I'd appreciate your thoughts and ideas.

Without seeing the details of the rest of the code, this doesn't strike me as a place where you'd want to use iterators. An iterator walks the sequence by calling next for every element, but since you call nth at the end, you only ever need the final value; every intermediate next call is wasted work. It seems like you could skip most of those calls by doing the arithmetic directly to reach the desired position.

It's possible the compiler could make those optimizations for you, but the other issue that may prevent that is the map call before nth: the compiler may interpret it as needing to construct a value for each addr in case there are side effects. (The compiler can sometimes be smart enough to avoid that, but it's not guaranteed.) So to help the compiler out while still keeping the iterators, I would swap the order of the last map call and the nth call.
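Concretely, with the same names as in your snippet, that reordering would look something like this (untested):

FPOS.fetch_add(1, Ordering::SeqCst);
return MMAP
    .get()
    .unwrap()
    .iter()
    .map(|r| r.start..r.end)
    .flat_map(|r| r.step_by(4096))
    // `nth` now selects the raw address first...
    .nth(FPOS.load(Ordering::Relaxed))
    // ...and only the one selected address is wrapped in a PhysFrame.
    .map(|addr| PhysFrame::containing_address(PhysAddr::new(addr)));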

That said, I think you'd be more likely to get optimal performance by doing the arithmetic manually in a loop, which guarantees that you skip over the ranges you don't need to consider.

Apologies; I should've linked to the code originally. You can find that code here. I hope that helps. I'm not sure how I'd go about rewriting it to use classical loop constructs.

Assuming r.end - r.start is always a multiple of 4096, you could turn

return MMAP
    .get()
    .unwrap()
    .iter()
    .map(|r| r.start..r.end)
    .flat_map(|r| r.step_by(4096))
    .map(|addr| PhysFrame::containing_address(PhysAddr::new(addr)))
    .nth(FPOS.load(Ordering::Relaxed));

into:

let mut offset = 4096 * FPOS.load(Ordering::Relaxed) as u64;
for region in MMAP.get().unwrap().iter() {
    if region.start + offset < region.end {
        let addr = region.start + offset;
        return Some(PhysFrame::containing_address(PhysAddr::new(addr)));
    } else {
        offset -= region.end - region.start;
    }
}
return None; // we ran past the end of the memory map

And if r.end - r.start is also constant, say it's RSIZE, you can avoid the region iteration too:

let offset = 4096 * FPOS.load(Ordering::Relaxed) as u64;
let index = offset / RSIZE;
let offset = offset % RSIZE;
// or: `let offset = offset - RSIZE * index`
// but the above two expressions are likely to be compiled down to a
// single "divide and mod" operation
let addr = MMAP.get().unwrap()[index as usize].start + offset;
return Some(PhysFrame::containing_address(PhysAddr::new(addr)));

r.start and r.end are not actually constant. And they may not be multiples of 4096 either -- at least I don't think they are (the page size is 4096, but a memory region can be of variable width). Should I just check whether the start and end are multiples of 4096 and, if not, fall back to the old clunky method?

I assumed r.end - r.start (not necessarily the values themselves) was a multiple of 4096, because otherwise don't you have a partial page at the end? Anyway, I believe the first version could be fixed up to still match the iteration by doing something like (untested):

-        offset -= region.end - region.start;
+        let size = region.end - region.start;
+        // round up to the next multiple of 4096: `step_by(4096)` yields
+        // a frame for the partial page at the end of the region too
+        let remove = match size % 4096 {
+            0 => size,
+            rem => size + 4096 - rem,
+        };
+        offset -= remove;

Once the checks start to pile up, I think you're definitely in "measure to see" territory. As @drewkett mentioned, though, in the iterator version you can at least swap the order of the last map and the nth no matter what.
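Putting that patch together with the earlier loop, the whole thing would be roughly (still untested, same assumptions as before):

let mut offset = 4096 * FPOS.load(Ordering::Relaxed) as u64;
for region in MMAP.get().unwrap().iter() {
    if region.start + offset < region.end {
        let addr = region.start + offset;
        return Some(PhysFrame::containing_address(PhysAddr::new(addr)));
    }
    let size = region.end - region.start;
    // a partial page at the end of a region still counts as one frame
    let remove = match size % 4096 {
        0 => size,
        rem => size + 4096 - rem,
    };
    offset -= remove;
}
return None;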
