Why printing to console is expensive?

Hello everyone,

I'm currently developing a network application and working with a buffer vector of approximately 2MB in size. I'm reading data from the network, converting it into a String, and then printing it to the console.

However, I've noticed that the printing operation is taking a significant amount of time even it only prints a few data (around 1 or 2 seconds).

Below is a snippet of the relevant code. I have a few questions regarding this behavior and would appreciate your insights:

    let mut buffer = vec![0; 1024 * 1024 * 2];

    let bytes_cnt = socket.read(&mut buffer).await?;
    if bytes_cnt.eq(&0) {
        return Err(anyhow::anyhow!("Client sent zero length data."));
    }

    // Is this method good for shrinking vector?
    // let buffer = Vec::from(&buffer[0..bytes_cnt]);

    let buffer_str = String::from_utf8(buffer).unwrap_or_default();

    // It is taking too long time when you print 2MB data to terminal.
    tracing::info!(
        "received {} bytes, buffer_str: {}",
        bytes_cnt,
        buffer_str
    );

My questions:

  1. Why printing to terminal is expensive?
  2. What is the best way of shrinking vector?
  3. Is String::from_utf8 function best for converting Vec<u8>?

Thanks to all.

I don't know about how this works in your tracing backend, but in general Unix terminals are not high bandwidth.

Additionally, Rust uses a mutex to protect stdio. This will get locked per print call.You can lock stdout/stdin/stderr manually if you want to do a batch of operations. But again, I have no idea how that relates to tracing: too many layers in between.

Thanks for your answer. Actually this isn't about tracing. When I use log::info function the same thing happens. Actually I understand that the buffer_str length is 2097152. Because buffer size is 2MB. And when I put this string to console it is trying to print all things even there isn't inside of it.

It takes a long time, because you're printing 2 MiB of data.

What do you mean by that? There is 2 MiB of NULL bytes in there:

Those are not displayed by the terminal, but they sure need to be buffered.

Maybe try to remove the NULL bytes, if you don't want those in your resulting string:

It may even be better, to truncate the input directly to the amount of bytes read, if you track those anyway:

3 Likes

Right, but that is my first point: Unix terminals are not high bandwidth. The whole concept was originally made for teletypes printing on paper, connected over serial. Sure we got "glass terminals" soon with CRTs that emulated TTYs. And it has absolutely gotten faster since then, but none of it is optimised for high bandwidth.

You could use a faster terminal emulator, some use OpenGL or even Vulkan to render. But there will be bottlenecks in the kernel TTY subsystem as well for example.

(Windows terminals... well let's not talk about that, it is even worse, with its MSDOS heritage.)

2 Likes

Casey Muratori put together Windows terminal that is a thousand or so times faster, took circumventing some usual Windows pathway some how. How fast should an unoptimized terminal run?

It's significantly better nowadays with the Terminal app and their pseudo console rework, and they've made signals that they intend to improve it further, but who knows what Microsoft is doing these days. It's at least "fast enough" now that I don't have to performance test with a minimized console for even a human readable speed of output :roll_eyes:

Could be, I haven't used Windows in recent years. I doubt it is a high bandwidth link still.

As I understand it, the Windows Terminal is still laughably slow by Linux terminal standards, but the VT parser in the pseudo console system (for console API compatibility) is the real bottleneck. I haven't looked into it much since around that video above, 4 years ago, so things could have changed.

The complaints about wanting to dump gigabytes a second to screen always did seem a bit odd to me though. It's not useful to display that to a human, so it's only really important as an escape valve when a program does something dumb - nice but not the sort of "look how stupid Windows is" thing it's often treated as. Before it was so slow you would noticably slow a program down by printing a screen of text, which really was completely inexcusable. Now you need to work hard to fill your stdout buffer if you're not just cat.

1 Like

Because printing to an actual terminal involves substantial work to do things like inter-process communication, text layout, painting, scrolling, and dealing with whatever terminal-side affordances might be running, and terminal emulators often prioritize latency (for interactive use) over throughput (for bulk data transfer) in those areas.

Converting the entire buffer to a String when your program knows that only the first bytes_cnt bytes of buffer are non-zero seems like it's probably not what you want. How large is bytes_cnt in normal use?

I can understand this. I have pressed ctrl-C and had it take 10+ seconds to take effect (on Linux). In general it would be nice if ctrl-C would take effect immediately and clear any pending buffers. (I have never looked into which layer(s) the buffers exist at, so this may be more or less hard to fix.)

Yes, pretty much. If you are sure the buffer is UTF-8, you could also use from_utf8_unchecked, which is unsafe

Regarding question 1, yeah, print/log is really very slow, in the worse of cases is calling one write() syscall per line.

For inscance mwemu emualtes ASM at 18,000,000 instructions/second but logging every emulated instruction the speed drops to 408,000 approx.

There are better ways to do this but in this case is triggering 1 linux write() syscall for every log!!!

check it with strace, example:

strace -f -e write cargo run -- -f test/elf64lin_syscall64.bin -6 -vv

write(2, "\33[1;33m13282 0x401046: jmp sho"..., 5713282 0x401046: jmp short 000000000040103Ch
) = 57
write(2, "\33[1;36m13283 0x40103c: mov eax"..., 4113283 0x40103c: mov eax,22h
) = 41
write(2, "\33[0;32m13284 0x401041: xor rdi"..., 4113284 0x401041: xor rdi,rdi
) = 41
write(2, "\33[0;31m13285 0x401044: syscall\33["..., 3513285 0x401044: syscall
) = 35
write(2, "\33[1;31m** interrupt 0x80 functio"..., 45** interrupt 0x80 function:pause
) = 45
write(2, "\33[1;33m13286 0x401046: jmp sho"..., 5713286 0x401046: jmp short 000000000040103Ch
) = 57
write(2, "\33[1;36m13287 0x40103c: mov eax"..., 4113287 0x40103c: mov eax,22h
) = 41
write(2, "\33[0;32m13288 0x401041: xor rdi"..., 4113288 0x401041: xor rdi,rdi
) = 41
write(2, "\33[0;31m13289 0x401044: syscall\33["..., 3513289 0x401044: syscall
) = 35
write(2, "\33[1;31m** interrupt 0x80 functio"..., 45** interrupt 0x80 function:pause
) = 45
write(2, "\33[1;33m13290 0x401046: jmp sho"..., 5713290 0x401046: jmp short 000000000040103Ch
) = 57
write(2, "\33[1;36m13291 0x40103c: mov eax"..., 4113291 0x40103c: mov eax,22h
) = 41
write(2, "\33[0;32m13292 0x401041: xor rdi"..., 4113292 0x401041: xor rdi,rdi
) = 41
write(2, "\33[0;31m13293 0x401044: syscall\33["..., 3513293 0x401044: syscall
) = 35
write(2, "\33[1;31m** interrupt 0x80 functio"..., 45** interrupt 0x80 function:pause
) = 45
write(2, "\33[1;33m13294 0x401046: jmp sho"..., 5713294 0x401046: jmp short 000000000040103Ch
) = 57
write(2, "\33[1;36m13295 0x40103c: mov eax"..., 4113295 0x40103c: mov eax,22h
) = 41
write(2, "\33[0;32m13296 0x401041: xor rdi"..., 4113296 0x401041: xor rdi,rdi
) = 41
write(2, "\33[0;31m13297 0x401044: syscall\33["..., 3513297 0x401044: syscall
) = 35
write(2, "\33[1;31m** interrupt 0x80 functio"..., 45** interrupt 0x80 function:pause
) = 45
write(2, "\33[1;33m13298 0x401046: jmp sho"..., 5713298 0x401046: jmp short 000000000040103Ch
) = 57
write(2, "\33[1;36m13299 0x40103c: mov eax"..., 4113299 0x40103c: mov eax,22h
) = 41
write(2, "\33[0;32m13300 0x401041: xor rdi"..., 4113300 0x401041: xor rdi,rdi
) = 41
write(2, "\33[0;31m13301 0x401044: syscall\33["..., 3513301 0x401044: syscall
) = 35
write(2, "\33[1;31m** interrupt 0x80 functio"..., 45** interrupt 0x80 function:pause
) = 45
^C--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
strace: Process 1059038 detached

regards.

1 Like

In your case is just 1 write() syscall but passing a big buffer from user to kernel is still slow.

man splice

with splice you can transfer from the socket to the stdout directly.

regards.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.