Profiling signage application on a Raspberry Pi

I'm working on a digital signage engine that lets the user design a custom signage application in JavaScript using Boa. My goal is to have the engine run a reasonably complex application (like the one shown in the Example 2 demo on the readme) at 60 fps in full-screen on a Raspberry Pi 3. While the engine runs cross-platform (and I'm developing on Linux), the RPi3's performance is very limited. I chose Speedy2D in the hope of getting the most out of the Raspberry Pi, and I've split my JavaScript and graphics drawing logic into two separate threads that can (mostly) run in parallel. However, I'm falling just barely short of 60 fps.

I've used cargo-flamegraph in the past to benchmark Rust programs on the desktop, but the RPi is so resource-constrained that I'm unable to get it to compile. Does anyone have any tips on how to find the bottlenecks in a Rust app when developing on something as resource-constrained as the RPi3? To add to the difficulty, just compiling my signage engine on the RPi3 takes about an hour, so trying to succeed via guess-and-check is prohibitive.

Thanks! I'm open to any ideas anyone has to speed up this application in general.

You're doing a few copies of the graphics calls per frame. If the Vec of graphics calls is large enough on average, getting rid of those copies could have a noticeable performance impact.

One conceptually simple way to do that would be to use two channels: one to send filled buffers to the renderer, and one to send the "used" buffers back to the JS thread to be filled again. As long as you have at least two Vecs in circulation, the JS thread should be able to write to one while the renderer uses the other. You avoid the copy that the JS thread does from its buffer into the renderer's, and the renderer gets to skip the clone of the Vec, which involves an allocation and copying the whole buffer again.

I did at least try making changes in that vein to make sure I could get it to compile and run, but I can't say much about it beyond "it doesn't seem completely broken."

Channel buffer exchange in `window_handler.rs`:
use std::cell::RefCell;
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::rc::Rc;
use std::sync::atomic::AtomicBool;
use std::sync::mpsc::{Receiver, Sender};
use std::sync::{atomic, mpsc, Arc, Mutex, RwLock};
use std::thread;
use std::time::{Duration, Instant};

use speedy2d::color::Color;
use speedy2d::dimen::Vec2;
use speedy2d::image::{ImageHandle, ImageSmoothingMode};
use speedy2d::window::{
    MouseButton, WindowFullscreenMode, WindowHandler, WindowHelper, WindowStartupInfo,
};
use speedy2d::Graphics2D;
use thiserror::Error;
use tracing::info;

use crate::js_env::{GraphicsCalls, JsEnv};
use crate::perf::Perf;

#[derive(Error, Debug)]
enum SignError {
    // #[error("EvalAltError: {0}")]
    // EvalAltError(#[from] ScriptError),
}

enum JsThreadMsg {
    RunFrame(f32, Vec<GraphicsCalls>),
    TerminateThread,
}

// #[derive(Clone)]
// enum UserEvents {
//     Index,
// }

pub struct SignWindowHandler {
    graphics_calls_rx: Receiver<Vec<GraphicsCalls>>,
    previous_buffer: Vec<GraphicsCalls>,
    js_thread_tx: Sender<JsThreadMsg>,
    last_frame_time: Instant,
    last_mouse_down_time: Option<Instant>,
    pub is_fullscreen: Arc<Mutex<bool>>,
    draw_offset_stack: Vec<Vec2>,
    draw_offset: Vec2,
    pub root_path: Arc<Mutex<PathBuf>>,
    image_handles: Rc<RefCell<HashMap<String, ImageHandle>>>,
    draw_perf: Perf,
    server_port: u16,
}

impl WindowHandler<String> for SignWindowHandler {
    fn on_start(&mut self, helper: &mut WindowHelper<String>, _info: WindowStartupInfo) {
        let sender = helper.create_user_event_sender();
        crate::server::start_server(self, Mutex::new(sender), self.server_port);
    }

    fn on_draw(&mut self, helper: &mut WindowHelper<String>, graphics: &mut Graphics2D) {
        let dt = self.last_frame_time.elapsed().as_secs_f32();
        self.last_frame_time = Instant::now();

        // Wait for the JS thread to notify us that it has a buffer ready.
        let mut graphics_calls = self
            .graphics_calls_rx
            .recv()
            .expect("No remaining graphics call senders");

        self.js_thread_tx
            .send(JsThreadMsg::RunFrame(
                dt,
                // Send the previous buffer to the JS thread. On the first draw this will be the `Vec` created in `new`.
                // Note: `take` replaces the `Vec` with an empty one, which doesn't allocate anything initially. We
                // overwrite that default `Vec` with the buffer received at the end of rendering, so the default `Vec`
                // should never end up getting sent back to the JS thread.
                std::mem::take(&mut self.previous_buffer),
            ))
            .unwrap();

        self.draw_perf.start();

        // Perform queued graphic calls
        self.draw_offset_stack.clear();
        self.draw_offset = (0., 0.).into();

        for call in graphics_calls.iter() {
            match call {
                GraphicsCalls::ClearScreenBlack => {
                    graphics.clear_screen(Color::BLACK);
                }
                GraphicsCalls::ClearScreen(c) => graphics.clear_screen(*c),
                GraphicsCalls::DrawRectangle(r, c) => {
                    graphics.draw_rectangle(r.with_offset(self.draw_offset), *c)
                }
                GraphicsCalls::DrawRectangleImageTinted(r, path_string, c) => {
                    let image_handle = self.get_image_handle(path_string, graphics);
                    graphics.draw_rectangle_image_tinted(
                        r.with_offset(self.draw_offset),
                        *c,
                        &image_handle,
                    );
                }
                GraphicsCalls::DrawText(pos, c, block) => {
                    graphics.draw_text(pos + self.draw_offset, *c, block);
                }
                GraphicsCalls::DrawImage(pos, path_string) => {
                    let image_handle = self.get_image_handle(path_string, graphics);
                    graphics.draw_image(pos + self.draw_offset, &image_handle);
                }
                GraphicsCalls::PushOffset(vec2) => {
                    self.draw_offset += *vec2;
                    self.draw_offset_stack.push(*vec2);
                }
                GraphicsCalls::PopOffset => {
                    self.draw_offset -= self.draw_offset_stack.pop().unwrap_or(Vec2::ZERO);
                }
                GraphicsCalls::SetResolution(uvec2) => {
                    graphics.set_resolution(*uvec2);
                    helper.set_size_pixels(uvec2);
                }
                GraphicsCalls::ImageFileUpdate(pathbuf) => {
                    self.update_image_handle(pathbuf, graphics)
                }
            }
        }

        // Clear the old buffer and store it in `previous_buffer` to allow it to be reused on the next draw call.
        graphics_calls.clear();
        self.previous_buffer = graphics_calls;

        self.draw_perf.stop();
        self.draw_perf.report_after(Duration::from_secs(1));

        helper.request_redraw();
    }

    fn on_mouse_button_down(&mut self, helper: &mut WindowHelper<String>, _button: MouseButton) {
        let double_click_timeout = Duration::from_millis(500);
        let now = Instant::now();

        if let Some(prev_down) = self.last_mouse_down_time {
            if now - prev_down < double_click_timeout {
                self.toggle_fullscreen(helper);
            }
        }

        self.last_mouse_down_time = Some(now);
    }

    fn on_fullscreen_status_changed(
        &mut self,
        _helper: &mut WindowHelper<String>,
        fullscreen: bool,
    ) {
        *self.is_fullscreen.lock().unwrap() = fullscreen;
    }

    fn on_user_event(&mut self, _helper: &mut WindowHelper<String>, user_event: String) {
        println!("{}", user_event);
    }
}

impl Drop for SignWindowHandler {
    fn drop(&mut self) {
        self.js_thread_tx
            .send(JsThreadMsg::TerminateThread)
            .unwrap();
    }
}

fn js_thread(
    app_root: PathBuf,
    ready: Arc<AtomicBool>,
    graphics_calls_tx: Sender<Vec<GraphicsCalls>>,
    js_thread_rx: Receiver<JsThreadMsg>,
) {
    thread::spawn(move || {
        let mut js_frame_perf = Perf::new("JS frame");

        let mut script_env = JsEnv::new(&app_root);
        if let Err(err) = script_env.call_init() {
            dbg!(err);
        }

        ready.store(true, atomic::Ordering::SeqCst);

        loop {
            match js_thread_rx
                .recv()
                .expect("No remaining JS thread senders!")
            {
                JsThreadMsg::RunFrame(dt, buffer) => {
                    js_frame_perf.start();

                    script_env.handle_file_changes();

                    // Use the buffer we just got back from the renderer to avoid needing copies.
                    *script_env.graphics_calls().borrow_mut() = buffer;
                    if let Err(err) = script_env.call_draw(dt) {
                        dbg!(err);
                    }

                    // Send the buffer back to the renderer, replacing the script_env buffer with an empty one.
                    graphics_calls_tx
                        .send(std::mem::take(
                            &mut *script_env.graphics_calls().borrow_mut(),
                        ))
                        .expect("No remaining graphics calls receivers");
                    js_frame_perf.stop();
                    js_frame_perf.report_after(Duration::from_secs(1));
                }
                JsThreadMsg::TerminateThread => return,
            }
        }
    });
}

impl SignWindowHandler {
    fn toggle_fullscreen(&mut self, helper: &mut WindowHelper<String>) {
        if *self.is_fullscreen.lock().unwrap() {
            helper.set_fullscreen_mode(WindowFullscreenMode::Windowed);
        } else {
            helper.set_fullscreen_mode(WindowFullscreenMode::FullscreenBorderless);
        }
    }

    pub fn new<P: AsRef<Path>>(app_root: P, server_port: u16) -> Self {
        let (js_thread_tx, js_thread_rx) = mpsc::channel();
        let (graphics_calls_tx, graphics_calls_rx) = mpsc::channel();
        let js_ready = Arc::new(AtomicBool::new(false));

        // The lock-based version had the renderer see an empty buffer on the first draw call.
        // Without this initial send, the renderer would just hang, since it receives before
        // sending the next RunFrame message.
        graphics_calls_tx.send(Vec::new()).unwrap();

        js_thread(
            app_root.as_ref().to_owned(),
            js_ready.clone(),
            graphics_calls_tx,
            js_thread_rx,
        );

        print!("Waiting for JS environment to start...");
        while !js_ready.load(atomic::Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(10));
        }
        println!("done.");

        SignWindowHandler {
            graphics_calls_rx,
            previous_buffer: Vec::new(),
            js_thread_tx,
            last_frame_time: Instant::now(),
            last_mouse_down_time: None,
            is_fullscreen: Arc::new(Mutex::new(false)),
            draw_offset: Vec2::ZERO,
            draw_offset_stack: vec![],
            root_path: Arc::new(Mutex::new(app_root.as_ref().to_path_buf())),
            image_handles: Rc::new(RefCell::new(HashMap::new())),
            draw_perf: Perf::new("Graphics draw"),
            server_port,
        }
    }

    fn get_image_handle(&mut self, path_string: &str, graphics: &mut Graphics2D) -> ImageHandle {
        if let Some(image_handle) = self.image_handles.borrow_mut().get_mut(path_string) {
            return image_handle.clone();
        }

        // The path_string wasn't found in the image_handles map, so we need to create it.
        let mut path = self.root_path.lock().unwrap().clone();
        path.push(path_string);
        let image_handle = graphics
            .create_image_from_file_path(None, ImageSmoothingMode::Linear, path)
            .unwrap();
        self.image_handles
            .borrow_mut()
            .insert(path_string.to_owned(), image_handle.clone());
        image_handle
    }

    fn update_image_handle(&mut self, path: &Path, graphics: &mut Graphics2D) {
        let image_handle = graphics
            .create_image_from_file_path(None, ImageSmoothingMode::Linear, path)
            .unwrap();
        let root_path = self.root_path.lock().unwrap().clone();
        let key = path
            .strip_prefix(root_path)
            .unwrap()
            .to_str()
            .unwrap()
            .to_owned();
        self.image_handles.borrow_mut().insert(key, image_handle);
    }
}

Have you tried looking at the performance of the program on a normal PC? Since you aren't scaling based on available cores or anything like that, some performance problems might be observable even on very different hardware from the Pi.

Yep, I've been trying to profile it on a Linux PC (flamegraph SVG of the profile posted on Dropbox). For a while I was able to optimize on the desktop, but after I added the text cache it easily runs at 60 fps on my Ryzen 5 desktop, so the remaining bottlenecks don't really show up there.

I like the idea of avoiding the clone of the Vec<GraphicsCalls>, and really appreciate you taking a look at the code! Is it necessary to use a channel though? I think the following is something like how the JS and on_draw processing happens in time:

  JS Thread                      OnDraw
|                              |  Takes buffer
|                             <-- Sends RunFrame cmd 
| Locks buffer                 |  Drawing last buffer
| Executing JS                 |  ...
| ...                          |  ...
| ...                          |  ...
| ...                          |  Waiting for locked buffer
| ...                          |  
| Unlocks buffer               |
|                              |  Takes buffer
|                             <-- Sends RunFrame cmd 
| Locks buffer                 |  Drawing last buffer
| Executing JS                 |  ...
| ...                          |  ...
| ...                          |  ...
| ...                          |  Waiting for locked buffer
| ...                          |  
| Unlocks buffer               |

Since the JS and drawing threads have to run in lock-step (you don't expect two JS thread runs for one OnDraw run), is it as simple as taking the vector? I'm trying to figure out the right way to do that, but I'm fighting the borrow checker.

Edit: Actually... it looks like I can take the memory, but it feels like I'm cheating. The commented-out version is what I was previously doing, and the take version is below it.

    fn on_draw(&mut self, helper: &mut WindowHelper<String>, graphics: &mut Graphics2D) {
        let dt = self.last_frame_time.elapsed().as_secs_f32();
        self.last_frame_time = Instant::now();

        // Wait for graphics_calls access when the JsEnv thread finishes
        // let graphics_calls = self.graphics_calls.read().unwrap().clone();
        let graphics_calls = std::mem::take(&mut *self.graphics_calls.write().unwrap());
        self.js_thread_tx.send(JsThreadMsg::RunFrame(dt)).unwrap();        
        
        self.draw_perf.start();
        ...

Using a really hacky timing on my desktop, there's a decent difference in the amount of time:

This takes about 7.5 µs in debug and 1.8 µs in release:

let graphics_calls = self.graphics_calls.read().unwrap().clone();

This takes about 800 ns in debug and 90 ns in release:

let graphics_calls = std::mem::take(&mut *self.graphics_calls.write().unwrap());

It definitely seems like a measurable improvement, though neither approach accounts for a significant share of frame time on my desktop. Unfortunately, I don't have enough time to try compiling on the RPi3 tonight, but maybe it's a more significant bottleneck on that hardware?

You don't need to use a channel, no. That was just to demonstrate what I was suggesting.

Just taking the buffer will avoid at least one copy, but the JS thread will start with no allocation in its buffer again and have to re-allocate as the buffer fills up. Having a second buffer in rotation avoids that cost.

Is it possible that allocating the vector has a large cost? Or maybe it's more likely that on the RPi something is interrupting the application in a big way. My very unscientific benchmark numbers, taken with `Instant` timings, are contradictory:

  • Using clone: 30 µs to 11 ms, all over the place
  • Using take: 3 to 20 ms, usually around 10 ms

I'm going to need to figure out how to get flamegraph working on the Pi.

A single allocation isn't that expensive, but the Vec has to grow as you add elements to it, which involves reallocating and then copying the old contents into the new allocation. Generally the performance hit from that doesn't matter (data structures that grow this way are usually tuned to make reallocation relatively rare), but when you care about consistent latency it can certainly matter.