BufWrite is extremely slow in debug mode

hi,
I found that std::io::BufWriter is 30 times slower than std::fs::File when writing a 100 MB file. That is unacceptable to me, even though it only happens in debug mode.

here is my test sample code:

Does anyone know why std::io::BufWriter is extremely slow in debug mode? And is there an alternative library to speed up file writing?

I'm not exactly sure, but it looks like your test is writing 4096 bytes at a time, which is basically what File does under the hood, so without optimizations, the only thing BufWriter changes is that it double-buffers the input data.

If you reduce the block size from 4096 to 512, you'll see the BufWriter time increases marginally (from 6 to 8 seconds) but the File::write time is 8 times higher.

Let's start from the documentation of BufWriter:

BufWriter<W> can improve the speed of programs that make small and repeated write calls to the same file or network socket. It does not help when writing very large amounts at once, or writing just one or a few times.

Your example keeps writing 4096 bytes per write call. 4096 is pretty large; it's half the default buffer size of the BufWriter itself.

cliff suggested reducing the size of each write to 512. But it would be even more illustrative to reduce it further, say to 4. The performance of the BufWriter would not change much, while the unbuffered case would slow down significantly.

2 Likes

Thanks for your explanation. What concerns me is why there is such an incredible difference in write speed for BufWriter itself between release mode (0.07 s) and debug mode (3.62 s).

As for std::fs::File, the difference is not so obvious (release: 0.104 s vs debug: 0.109 s).

I understand that there are some performance differences between debug mode and release mode, but for BufWriter the gap seems unacceptable.

File calls go almost straight into the OS, while BufWriter maintains an internal buffer in Rust code. If that Rust code is not optimized, it will run more slowly than if it is.

3 Likes

:rofl: :joy: :sweat_smile:

Unacceptable depends on the use case. I rarely need to debug something by writing 100 MB, and even less often in a tight loop where 10 seconds is going to add up to something substantial. If your use case differs, you can always trade some compile time for speed by enabling more optimizations in debug mode.

hi, cliff

Thanks for your suggestion. The project I'm working on involves file encoding/decoding, so this is a common use case for me. As a Rust beginner coming from C/C++, I have never encountered this kind of problem before (such a big performance gap between release and debug mode). This makes me a little shocked and disappointed about Rust.

A few days ago, I imported the Rust crate sha2 to calculate the SHA-256 of a file. It works, of course, but it is almost 10 times slower than my previous C/C++ version. I cannot figure out why, but I found another library, ring::digest::SHA256, which seems to have "normal" performance. (All test cases were run in debug mode.) At the time I felt it was very strange that two libraries doing the same thing in Rust can vary so much in performance. I'm not sure if these are the same problem; one thing is for sure, both are very disappointing for Rust beginners :upside_down_face:

Rust uses and talks a lot about "zero cost abstractions", but what it actually means is "zero runtime cost abstractions" (but that doesn't quite have the same ring to it).
So there is a non-zero cost and when you compile with optimizations turned on, you move that cost from run-time to compile time. That's a good thing, but it means that without optimizations you end up paying this cost entirely at runtime.

For many use cases, this is okay; it's simply the tradeoff you make for the abstractions Rust provides.
But if it's not, you can turn on some of the optimizations in debug mode for a significant boost.
Add to Cargo.toml:

[profile.dev]
# Required for dev builds to be usable
opt-level = 2

For reference, the release profile uses opt-level = 3, among other settings.
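A middle ground, if optimizing your own crate slows compiles down too much: Cargo also supports per-package profile overrides, so you can optimize only your dependencies (this is the standard override syntax; the `"*"` wildcard matches all dependencies):

```toml
# Keep your own crate at opt-level 0 (fast compiles, easy debugging),
# but build all dependencies — e.g. a hashing crate like sha2 — optimized.
[profile.dev.package."*"]
opt-level = 2
```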

1 Like

Outperforming C and C++ is not a goal for Rust, especially in builds that are intentionally not optimized. The fact that you are more concerned with performance when running tests than when running in production puts you well outside of Rust's target audience. It has already been pointed out that you are not using BufWriter as intended, which is for "small and repeated write calls". It seems acceptable to me that it performs poorly in debug mode for something it is not intended to do. The fact that it performs so well in release mode when used outside of its intended use case is more impressive than anything.

Every project with dependencies has a discovery stage where you look at what's out there and decide which ones to use. That's not Rust, that's just coding. Having options and the tools to select one over another with confidence is not a bad thing.

It seems like you have drawn a conclusion about Rust and are now looking for justifications for that conclusion. If you can't take direction from the standard library documentation, and can't accept that a six-year-old language being updated daily has a compiler with some steeper trade-offs in its "fast" (unoptimized) mode than one that is decades old, then you're not going to like Rust. You can't learn with a closed mind, and you can't write good code when your goal is to write bad code that justifies a fallacious conclusion about the language.

1 Like

@bob Woah, no, that's all wrong. Rust is certainly targeting the same use cases as C and C++, and complaints about poor debug build performance are not something to write off as a closed mind with a goal of writing bad code! That's incredibly off base, wrong, and rude.

Rust should have better debug build performance, and BufWriter shouldn't degrade this much. Any work to address these real shortcomings is completely welcome, and there are already people doing work in that area!

8 Likes

Can somebody translate that into any language I might understand?

Google translate produced : ": rofl:: Joy:: sweat_smile:". So I'm none the wiser.

Even after a whole year of using Rust, and getting Rust into production in that time, I still count myself as a Rust beginner.

During that time I have noticed that debug builds of Rust can be up to 100 times slower than release builds. I have no disappointment about that. Seems quite normal. I'm sure that is true of C++ as well. I am not expecting to be running tests and debugging on huge data sets.

But I have a theory (someone correct me if I am wrong): when you build a C++ program at -O0, you are still linking in the standard library and whatever other libraries were built with -O3 or whatever. So, if your program makes heavy use of such libraries, the fact that your own code is 30 times slower does not show up so much.

But, in Rust world all your dependencies are built with the same optimization level, as far down the stack as possible. And so you might feel the performance hit more. Rebuild your project with --release and the whole stack gets rebuilt to that new optimization level.

I don't see the problem.

Rust doesn't build the standard library at the optimization level of your program; instead it always uses the (optimized) build that comes with the compiler you're using. (However, both languages wind up using the optimization level of your program for generic code from the standard library.)

C++ can occasionally have pretty bad debug build performance (and people often work around it by enabling optimizations in those builds, losing some accuracy in debug symbols), but idiomatic Rust tends to make the problem worse, at least with current compiler implementations.

Like you say, this is fine for a lot of cases. But sometimes you need both debugging tools (symbols, overflow checks, whatever) and performance. This is why some people work around it in both languages! But also like you say, improvements that don't cost too much debug compile time are certainly desirable.
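In Rust, one common form of that workaround (these are standard Cargo profile keys) is to keep release optimizations while retaining debugging aids:

```toml
# Optimized build that still carries debugging aids.
[profile.release]
debug = true             # keep debug symbols for profilers/debuggers
# overflow-checks = true # optionally keep integer overflow checks too
```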

Outperforming perhaps not. But certainly matching C/C++ performance, yes.

Rust has being a "systems programming language" as a primary goal. That means matching C/C++ performance. Which is why garbage collection was thrown out, green threads were thrown out and whatever else before version 1.0.

The realization being that without those goals Rust would be "just another new language" among many; no matter its fuss over program correctness, nobody would take the time to look at it.

However, I don't see that dictates a need to match C/C++ in debug builds.

I see we agree on the "same use cases as C and C++" idea.

However, I don't see how that extends to debug builds. At that point I'm testing my code for logical errors; I don't need it to be super fast. I want overflow checks in place, and so on. More likely I'd want it to build quicker.

Of course, any improvements that come down the pipe in debug build performance would be welcome. Hardly a show stopper without them, though.

Rust could theoretically outperform C/C++ in some cases where Fortran currently does once LLVM fixes noalias and/or we are able to perform optimizations that stacked borrows allows in other ways.

In game development, fast compilation is important, but at the same time the game has to run fast enough that you get more than ~5 fps.

1 Like

That is good to hear.

Hadn't considered the game developers. I guess they want all of everything all the time.

I understand the poor performance in debug mode, as I understand there is a cost to providing a better debugging experience. I would totally accept it if Rust said debug builds are 1000 times slower than release builds.

What I can't figure out is the incredible performance difference in debug mode between different Rust libraries doing almost the same thing.

From my point of view, both BufWriter and std::fs::File are used to write a 100 MB file, but the performance varies enormously (especially since BufWriter is intended to speed up writing through std::fs::File).
I can't agree with those who say I use BufWriter the wrong way, because BufWriter does indeed have a bigger internal buffer (8192 bytes) than the number of bytes written at a time (4096). It is reasonable to expect BufWriter to have better performance than std::fs::File, whether in debug mode or in release mode. If someone says BufWriter is a special case because of its internal implementation in Rust, then why doesn't BufReader behave like BufWriter?

Alright then: if Rust says that the performance of different libraries doing the same thing in debug mode is totally unpredictable, or if Rust says that a library intended to speed something up will instead dramatically degrade it in debug mode by design, and that you must handle everything about performance in release mode, then I'm very sorry for my misunderstanding.

Put like that it is an odd situation. I have no idea.

If you want to make many writes of size 4096, you definitely can't be sure that wrapping the file in a buffered writer with capacity 8192 will be faster than writing directly to the File. The use of BufWriter will (at most) halve the number of times you make the write syscall, but in exchange, you pay the cost of copying all the data into the internal buffer before you write it.

It is not obvious to me at all that this copying is worth it runtime-wise when we are dealing with buffers of these sizes. Copying data is not free, even if it is relatively cheap.
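To make the tradeoff concrete, here is a toy sketch of how a buffered writer works — deliberately simplified, and not the real std::io::BufWriter implementation (the type name and 8192-byte capacity are my choices for illustration):

```rust
use std::io::{self, Write};

/// Toy buffered writer: a simplified sketch of the idea,
/// NOT the real std::io::BufWriter implementation.
struct ToyBufWriter<W: Write> {
    inner: W,
    buf: Vec<u8>,
    cap: usize,
}

impl<W: Write> ToyBufWriter<W> {
    fn new(inner: W) -> Self {
        ToyBufWriter { inner, buf: Vec::with_capacity(8192), cap: 8192 }
    }

    fn flush_buf(&mut self) -> io::Result<()> {
        self.inner.write_all(&self.buf)?; // one big write to the underlying writer
        self.buf.clear();
        Ok(())
    }
}

impl<W: Write> Write for ToyBufWriter<W> {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        if self.buf.len() + data.len() > self.cap {
            self.flush_buf()?; // buffer full: pay one write call now
        }
        if data.len() >= self.cap {
            self.inner.write(data) // oversized writes bypass the buffer
        } else {
            self.buf.extend_from_slice(data); // the extra copy, done in Rust code
            Ok(data.len())
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flush_buf()?;
        self.inner.flush()
    }
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    {
        let mut w = ToyBufWriter::new(&mut out);
        for _ in 0..4 {
            w.write_all(&[0u8; 512])?; // small writes coalesce in memory
        }
        w.flush()?;
    }
    println!("bytes written: {}", out.len()); // prints: bytes written: 2048
    Ok(())
}
```

The `extend_from_slice` line is the copy being discussed: with 4096-byte chunks it runs once per chunk and only halves the number of underlying writes, so the win is small; with 4-byte chunks it replaces ~2000 syscalls per buffer with one.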

As for debug speed, I'm pretty certain that the standard library would accept a PR that improves the performance in debug mode and leaves the release mode performance unchanged.

BufWriter::flush_buf has terrible performance in debug mode, seemingly due to its use of a Vec iterator. Looking at the assembly, there are quite a few stack accesses that the mem2reg optimization pass would clean up.

_ZN4core3ptr7mut_ptr31_$LT$impl$u20$$BP$mut$u20$T$GT$6offset17hed04ffc36d9e1232E():
sub    $0x20,%rsp
mov    %rdi,0x8(%rsp)
mov    %rsi,0x10(%rsp)
add    %rsi,%rdi
mov    %rdi,0x18(%rsp)
mov    0x18(%rsp),%rax
mov    %rax,(%rsp)
mov    (%rsp),%rax
add    $0x20,%rsp
retq

for example could become just:

_ZN4core3ptr7mut_ptr31_$LT$impl$u20$$BP$mut$u20$T$GT$6offset17hed04ffc36d9e1232E():
add    %rsi,%rdi
mov    %rdi,%rax
retq

cg_clif is already avoiding these stack accesses. It produced:

rex    push %rbp
mov    %rsp,%rbp
rex    mov $0x1,%eax
imul   %rax,%rsi
mov    %rdi,%rax
add    %rsi,%rax
rex    pop %rbp
retq

This is still not ideal as it has a prologue and epilogue and an unnecessary multiply instruction. I fixed the unnecessary multiply in https://github.com/bjorn3/rustc_codegen_cranelift/commit/96c4542dc3c7001d5a28b05d067701f2173e9eb4 improving perf by ~1%, but most of the overhead seems to come from the prologue and epilogue emission.

cg_llvm:

File => 71.32ms
BufWriter => 6.30s

cg_clif:

File => 958.81ms
BufWriter => 6.79s

cg_clif is slower in the File case here because, unlike with cg_llvm, the standard library is not optimized.

2 Likes