App is 100 times slower when NOT using cargo-flamegraph

I was running into some serious performance issues with an app I was developing and tried using cargo-flamegraph to profile it. The thing is, if I run it under cargo-flamegraph it actually runs about 100× faster than my release and debug builds, at least according to a couple of timestamps I took in the code, and I do believe the measurement is accurate because the output went from visible lag to being very smooth.

Here is the function that seems to be most affected: 60 milliseconds in release vs 500 microseconds with cargo-flamegraph.

```rust
pub fn render_ui_objects<S: Screen>(
    query: Query<(
        &ComputedNode,
        &UiGlobalTransform,
        Option<&BackgroundColor>,
        Option<&TextRenderBuffer>,
        Option<&CalculatedClip>,
    )>,
    stack: Res<UiStack>,
    mut screen: NonSendMut<S>,
) {
    let start = Instant::now();
    // Row stride of the framebuffer in pixels (one row per screen width).
    let stride = screen.screen_width() as usize;
    let mut frame = screen.get_frame();

    let pixels = frame.deref_mut();
    for (computed, transform, color, text_buffer, calculated_clip) in
        query.iter_many(stack.uinodes.iter().copied())
    {
        let region = Rect::from_center_size(transform.translation, computed.size());

        let clipped = if let Some(c_c) = calculated_clip {
            let clipped = region.intersect(c_c.clip);
            if clipped.is_empty() {
                continue;
            }
            clipped
        } else {
            region
        };
        let y_offset = (clipped.min.y - region.min.y) as usize;

        let region = Rectangle::from_rect(clipped);
        if let Some(color) = color {
            let u32_color: U32Color = color.0.to_srgba().into();
            for y in region.min_y as usize..region.max_y as usize {
                let row_start = y * stride + region.min_x as usize;
                let row_end = y * stride + region.max_x as usize;
                pixels[row_start..row_end].fill(u32_color);
            }
        }
        if let Some(text) = text_buffer {
            if text.pixels.is_empty() {
                continue;
            }
            let text_stride = text.width;
            let text_pixels = &text.pixels;
            for y in region.min_y as usize..region.max_y as usize {
                for x in region.min_x as usize..region.max_x as usize {
                    pixels[y * stride + x] = U32Color::blend(
                        text_pixels[(y - region.min_y as usize + y_offset) * text_stride
                            + (x - region.min_x as usize)],
                        pixels[y * stride + x],
                    );
                }
            }
        }
    }
    let time = Instant::now() - start;
    println!("render_time:{:?}", time);
}
```

This points to the slowness being algorithmic rather than due to the generated machine code. Have you checked that the loop has equivalent inputs in all settings? I could imagine that running the app as a subprocess of a tracer has an influence on UI events, and, though it's hard to say since you haven't given information on the exact framework you're using, intersect queries are famously non-trivial to do efficiently. You could introduce metrics that track how often each branch is taken per frame and compare them across all settings: the number of affected pixels that are filled, the number of intersect checks being run, or even just the number of iterated uinodes.

That's not the case: both versions are running under identical conditions apart from the profiling.
Both spawn the same 6000 entities in a fixed configuration, render them in a fixed window, and then start scrolling them by a fixed amount per frame. When run with cargo-flamegraph the window is visible and scrolls quite fast, so I am 100% certain that the slowdown on normal builds must be caused by how the code is generated.

As a wild guess, running at full speed causes a “traffic jam” in the lower-level rendering hardware (or whatever is going on down there), but when you go slower the jam doesn’t happen.


It would be easy for you to verify, and we, by comparison, don't have any way of confirming it more directly. But sure, let's run under that assumption. Flamegraph does not do any special codegen: it takes a binary and runs it under a tracing profiler, and I believe cargo-flamegraph quite literally takes the release artifact, so you should be able to run that profiler yourself, too. Assuming the large time is spent in the pixel filling and not in any of the calls, the alternative hypothesis is hardware contention over the memory accesses, i.e. multiple threads competing over writes to the same memory locations (note: racing writes are unsound if not done through atomics). But your post is very sparse on information; I can guess you're using bevy, but it would be helpful to state that explicitly. I don't recognize the Screen trait, so all the pixel filling is just a black box to me about which I can't tell you anything more specific.

It's somewhat unusual to see a manual texture-blitting loop in the per-frame path of a UI nowadays. If you want this to be efficient, such tasks should be composited during rendering or done with a dedicated block operation. Have you considered alternative methods of displaying the text in your UI, or do you specifically want to find the root cause here?


This is the renderer. We are making a software renderer to use bevy ui on an embedded device we have, and this function is almost the entirety of the app for now, tbh. The framebuffer given by Screen is an in-memory buffer that gets sent to DRM at the end of the frame. We are running bevy single-threaded, so multithreading is not related.

On embedded I'd suppose you need to look into the hardware counters. Running perf mem (Linux; see the man page for further details) instead of the shiny but information-sparse CPU latency graphs from flamegraph may reveal more information. The more specialized the deployment context gets, the less you can deduce from small code snippets themselves. Who knows, maybe your system aggressively power-throttles the UI (perhaps via system interrupts that are then serviced) and tracing interrupts your program often enough to keep it within the thermal budget for smooth operation. That'd be funny, but I'm just throwing out guesses with no clear basis here, to make clear that the number of potential causes is much larger than the information available. The only thing I'm somewhat certain about is what it isn't: it's not codegen.

Solved, dumbest cause possible lol: our build script was not forwarding --release to cargo.
