Performance oddity

So I've got a weird performance issue, and I'm not sure I understand it. I have this function, which is computing transitions for a rendering algorithm:

    fn intercepts(&self, y_positions: &[f64], output: &mut [SmallVec<[(EdgeInterceptDirection, f64); 2]>]) {
        self.polyline.intercepts_on_lines(y_positions, output);

        for intercepts in output.iter_mut() {
            for (direction, _) in intercepts.iter_mut() {
                *direction = EdgeInterceptDirection::Toggle;
            }
        }
    }

The bulk of the work here is in intercepts_on_lines, which is filling up the output with 2 entries each. As an afterthought, it's switching the result from following the non-zero winding rule to the even-odd winding rule by updating some of the data. This last step is incredibly slow, and I can't figure out why: the actual rendering operation is much faster than just this step.

Here's the comparison from my benchmark, release build, tested on Intel and ARM systems:

  Scan convert flattened circle (pixel res, even-odd winding rule) (129 lines): 1000 calls made in 1.6s. 1.6ms per call, 10.4 calls per frame (621.6 fps)
  Scan convert flattened circle (pixel res, non-zero winding rule) (129 lines): 1000 calls made in 13.4ms. 13.4µs per call, 1248.3 calls per frame (74895.1 fps)

This is a full-frame circle, and the only difference between the two case is the code that updates the edge direction by iterating over the results. As a full-frame circle, this is 1080 lines with 2 intercepts on each, so output is a 1080-entry slice where every entry has 2 intercepts in it, so there are 2160 intercepts to update in total here. This operation apparently is 120x slower than generating the intercepts in the first place.

So what's going on? This really lead me on a wild-goose chase as the much more involved code in the actual intercepts should outclass this last update by a long way... I guess I did make it faster, all the while puzzling over why it was having no impact on the performance of the algorithm.

The choice of smallvec here is a possible suspect, of course, but these values are read elsewhere with nothing like this issue occurring and in any case the drop in performance is absolutely massive. I've found that usually iterating vectors like this is the fastest way to update them in Rust, so this was a real surprise to me.

Wow, utter headdesk moment from me here. I even held off on posting this because I was trying to figure out if I'd somehow made this happen from outside of the function because the issue seemed so ridiculous.. And only after posting do I almost immediately figure out that the bug was in the benchmarking code. Oof.

The bug was that the profiling code was re-using the output Vec without clearing it, so it was growing huge. I think I'd confused myself partly because it originally did clear the output instead of appending to it, and partly because there was another test of another part of the code that was still doing that earlier on.

3 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.