New Rust PNG decoding library!

mattpiz · August 2, 2019, 3:42pm

Yet another PNG library? ... Yes BUT it has more performant decoding and I learned a lot while doing it ^^. In short, it is decoding-only (not encoding), based on the nom parser (5.0.0) and miniz_oxide for inflate stuff (cf Details on different Rust implementations of deflate?)

Is it ready / published?

No, not yet, still too early, just made all stuff working together today. There are no tests, the API is barely usable, its new territory

What can it decode?

Most PNG files, except those with palettes and interlaced progressive data.

How well does it perform?

Pretty well ;). Here are some benchmark results with a bunch of images (I can provide those images but you can try with yours by replacing the path in the benchmark). Wasm tests are done with GitHub - mpizenberg/wasm-png: Minimal png reading example in WebAssembly (wasm).

Image	this crate	png crate	this (wasm)	png (wasm)
depth.png	4.2 ms	9.1 ms	12 ms	42 ms
eye.png	0.69 ms	0.99 ms	3 ms	8 ms
inkscape.png	10.1 ms	9.8 ms	19 ms	43 ms
rgb.png	7.4 ms	16.1 ms	17 ms	61 ms
screen.png	8.6 ms	10.8 ms	15 ms	43 ms
texture_alpha.png	0.75 ms	1.9 ms	2 ms	8 ms
transparent.png	22.5 ms	17.6 ms	38 ms	81 ms

This code is generally faster than the png crate in native compilation, and always 2 to 4 times faster in wasm! This is mostly because I've put a lot of care in reducing number of allocations. The slower cases (like transparent.png) are when almost all lines are using the Paeth filter (cf implementation).

What can you do if you wish to contribute?

If you are also the kind of person excited by things getting rewritten in Rust, and interested in image processing, if you could try it and give feedback here / in GH issue would be super nice! (beware this is highly unstable API).

There is the PNG front:
- Add features such as supporting PNGs with palette, or with interlaced data, and supporting more chunk types.
- Add tests (this is not tested, I've just used some different PNG at my disposal).
- Add benchmarks.
- Documentation would be nice but might have to wait a bit for API stabilization.

There is the deflate front: (cf Details on different Rust implementations of deflate?)
- If you are the kind of person that would like to try implementing inflate with the nom parser combinator, it would fit perfectly into this crate .

alice · August 2, 2019, 7:17pm

Where did you find the specification for the png format?

mattpiz · August 2, 2019, 10:02pm

You can find the specification on the website of libpng: http://www.libpng.org/pub/png/spec/1.2/png-1.2-pdg.html

I also highly recommend the "definitive guide" from O'Reilly, available online at http://www.libpng.org/pub/png/book/toc.html

Dushistov · August 2, 2019, 10:44pm

What about speed in compare with libpng?

mattpiz · August 3, 2019, 12:22pm

For the time being, the code base is very small, so I consider it an optimization playground. Benchmarking is quite fun! I've not compared it to libpng yet though.

I've read a little the png crate code for the decoder/unfilter. If we put aside the transformation and interlaced stuff, the interesting struct is decoder::Reader and the actual decoding part is the loop inside the next_raw_interlaced_row() method which interleaves inflating and unfiltering scanlines. The unfiltering is done using two buffers, self.prev and self.current for the previous and current line. At the end of each line decoding, the current is copied to the previous buffer, and a reference to self.prev is returned. In the next_frame() method, after each decoded line, this slice is copied to the image buffer.

This code is a bit different. It doesn't interleave inflating and unfiltering, I first inflate with miniz_oxide, then unfilter. The other main difference I could spot is regarding where the data comes from. In png crate unfiltering, the self.current buffer, which holds the inflated data starting at current line, is modified in place. Then partially (the line) copied self.prev, which is finally copied to the image buffer. In my case, I build a Vec<(Filter, &[u8])> containing references to each inflated line (slices of the fully previously inflated data with miniz_oxide). Then I fold over those, keeping the index corresponding to current line, and write directly the unfiltered values to the mutable image buffer. So the inflated data is untouched (not mut) and I use indexes to retrieve data.

Here is the fold over scanlines:

let line_start = 0;
scanlines
    .iter()
    .fold(line_start, |line_start, (filter, line)| match filter {
        Filter::None => decode_none(line, line_start, &mut data),
        Filter::Sub => decode_sub(bpp, line, line_start, &mut data),
        Filter::Up => decode_up(line, line_start, &mut data),
        Filter::Average => decode_average(bpp, line, line_start, &mut data),
        Filter::Paeth => decode_paeth(bpp, line, line_start, &mut data),
    });
data

And here is the decode_paeth() function:

pub fn decode_paeth(bpp: usize, line: &[u8], line_start: usize, data: &mut Vec<u8>) -> usize {
    if line_start == 0 {
        decode_sub(bpp, line, line_start, data)
    } else {
        let previous_line_start = line_start - line.len();
        line.iter().enumerate().take(bpp).for_each(|(i, p)| {
            let up = data[previous_line_start + i];
            data[line_start + i] = p.wrapping_add(up);
        });
        let previous_line_start_bpp = previous_line_start - bpp;
        let line_start_bpp = line_start - bpp;
        line.iter().enumerate().skip(bpp).for_each(|(i, p)| {
            let up_left = data[previous_line_start_bpp + i];
            let up = data[previous_line_start + i];
            let left = data[line_start_bpp + i];
            data[line_start + i] = p.wrapping_add(paeth_predictor(left, up, up_left));
        });
        line_start + line.len()
    }
}

Do you think the slower behavior could be due to some kind of cache miss? Here up_left and up are read far back in data (one line). left is read near where data is also written to. p is the current value in line (inflated). In png crate code, you pay a copy to the self.prev buffer, but then up_left and up are close together in self.prev and left is close to where self.current is written to.

I've tried introducing a previous buffer to mimic the png crate behavior. It improves slightly the performances (2%) but is still quite slower than the png crate for the transparent.png image.

Any other idea of what could be the culprit?

mattpiz · August 11, 2019, 5:07pm

This is now always faster than the png crate

...

So I've added benchmarks to bench specifically the different unfilters and it helped me identifying improving factors. I've implemented an alternative decoding ("bis") modifying in place the inflated buffer and copying the resulting slice to the image data buffer. This approach is similar to what is found in the png crate. It resulted in a great improve of performance for the transparent.png image (mainly paeth filters).

But in doing so, I realized that the indexing of slices in loops also has a huge impact on performances. We should use a slice starting at start instead of indexing in a loop with operations like data[start + i]. So I modified this initial implementation with this in mind. I added a buffer to the previous line that is used to copy the previous line. Apparently, the price of copying one scanline to avoid indices offsets is worth it! In the end, we have the same performances with this than with the "bis" version modifying in place the inflated buffer. PS: I've tried to use data.split_at_mut(line_start) instead of copying the previous slice but it resulted in largely increased running times (15%) ... Benchmarking is weird!

I have updated the wasm benchmarks to loop 100 times in order to have more robust timings. They are thus different from last time. Currently at commit ecfa9fea we have:

Image	this	bis	png crate	this (wasm)	png (wasm)
depth.png	4.2 ms	3.7 ms	9.1 ms	8.9 ms	30.7 ms
eye.png	0.56 ms	0.51 ms	0.96 ms	1.7 ms	5.9 ms
inkscape.png	7.4 ms	7.6 ms	9.6 ms	14.5 ms	30.2 ms
rgb.png	7.0 ms	7.1 ms	16.0 ms	15.0 ms	52.1 ms
screen.png	6.7 ms	7.2 ms	10.2 ms	12.3 ms	29.8 ms
texture_alpha.png	0.67 ms	0.68 ms	1.94 ms	1.8 ms	8.0 ms
transparent.png	16.9 ms	17.8 ms	17.4 ms	31.0 ms	55.8 ms

In order to have an idea of what is taking time in the decoding, here are the timings of each part on the different images ("total" is from benchmark so usually does not coincide exactly):

Image	total	parse chunks	inflate idats	unfilter
depth.png	4.2 ms	7 us	3250 us	890 us
eye.png	0.56 ms	6 us	370 us	320 us
inkscape.png	7.4 ms	8 us	3500 us	4300 us
rgb.png	7.0 ms	9 us	6100 us	1100 us
screen.png	6.7 ms	9 us	2900 us	4030 us
texture_alpha.png	0.67 ms	5 us	630 us	160 us
transparent.png	16.9 ms	9 us	9300 us	8500 us

As we can see, for images containing only Sub filters (depth, rgb, texture_alpha) the decoding cost is dominated by inflating (here done by miniz_oxyde). For other images, containing mainly Paeth filters, inflating and unfiltering costs are roughly similar.

chrisd · August 11, 2019, 7:55pm

Am I to understand you've managed to avoid using unsafe while still getting good performance?

mattpiz · August 11, 2019, 8:39pm

There is no unsafe in this code indeed. Not sure about miniz_oxide though, didn't check. I think that generally unsafe is not needed for performance if we choose the right constructs.

mattpiz · August 11, 2019, 10:55pm

Usage of C++ libpng seems quite complicated from what I've read in the guide. So I've used OpenCV which, to my knowledge, uses libpng for PNG decoding. I've setup a small benchmark program decoding (with imdecode) in a loop 100 times. I don't know exactly what the imdecode function is doing but I hope it is not too far from what we would obtain using directly libpng. Feedback from knowledgeable libpng users welcomed!

Image	this	png crate	OpenCV
depth.png	4.2 ms	9.1 ms	4.0 ms
eye.png	0.56 ms	0.96 ms	0.72 ms
inkscape.png	7.4 ms	9.6 ms	6.6 ms
rgb.png	7.0 ms	16.0 ms	6.5 ms
screen.png	6.7 ms	10.2 ms	6.6 ms
texture_alpha.png	0.67 ms	1.94 ms	0.99 ms
transparent.png	16.9 ms	17.4 ms	13.2 ms

If my OpenCV loop benchmark is correct, we have decoding performance not too far from libpng!

  // Decoding loop being timed.
  for (int i = 0; i < 100; i++) {
    cv::Mat img;
    int flags = cv::IMREAD_ANYDEPTH | cv::IMREAD_UNCHANGED;
    cv::imdecode(input, flags, &img);
  }

mattpiz · August 11, 2019, 11:55pm

Actually I just realized that I could improve the inflating perf by initializing the buffer filled by miniz_oxide with the exact size it will take (height * scanline_length). This brings us even closer to OpenCV in native compilation and improves wasm performances. State of performances at commit 432edfad:

Image	this	bis	png crate	OpenCV	this (wasm)	png (wasm)
depth.png	4.0 ms	3.6 ms	9.1 ms	4.0 ms	8.5 ms	30.7 ms
eye.png	0.48 ms	0.49 ms	0.96 ms	0.72 ms	1.5 ms	5.9 ms
inkscape.png	7.1 ms	7.4 ms	9.6 ms	6.6 ms	13.4 ms	30.2 ms
rgb.png	6.6 ms	6.6 ms	16.0 ms	6.5 ms	13.7 ms	52.1 ms
screen.png	6.5 ms	6.6 ms	10.2 ms	6.6 ms	11.6 ms	29.8 ms
texture_alpha.png	0.68 ms	0.68 ms	1.94 ms	0.99 ms	1.8 ms	8.0 ms
transparent.png	15.2 ms	15.3 ms	17.4 ms	13.2 ms	26.1 ms	55.8 ms

ExpHP · August 12, 2019, 12:12am

I am just curious, why is copying necessary for this? Why doesn't something like let data = &data[start..] work? Is it because of borrow checker issues? If so, you can usually get around this by using split_mut. (unless you need two regions of data that literally overlap each other)

Edit: Whoops, so much for my reading comprehension!

mattpiz · August 12, 2019, 12:16am

That's also what I thought but for unknown reasons ...

oyvindln · August 12, 2019, 3:56pm

We've exised all usage of unsafe in miniz_oxide recently, and added #![forbid(unsafe)].

Used as a back-end for flate2 the decompression part was slightly faster than using the default C library (miniz), while compressing is still slightly slower. There is still some unsafe when used through flate2 as the back-ends go through an unsafe C-api, but I'm working on changing the rust-back end to avoid that.

We found that the unsafe only had a very marginal performance benefit at best, and in many cases the compiler can optimize out bounds checks so there isn't a huge need for it in the deflate compression libraries. One thing that is particularly conveninent is that the window sizes/buffers used have a size that's a power of two, so it's often easy for the compiler to deduce when values can go out of bounds.

system · November 10, 2019, 4:09pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
FLIF (Free Losless Image Format) decoding interface announcements	3	1560	January 12, 2023
Details on different Rust implementations of deflate?	3	1301	November 25, 2019
PNGme: An Intermediate Rust Project tutorials	8	2013	October 30, 2020
Looking for image processing crate recommendation (& introduction) help	11	277	April 14, 2025
What's everyone working on this week (35/2021)? community	3	774	November 29, 2021

New Rust PNG decoding library!

Related topics