How to get a file's real size?

Hi there!

I'm currently writing a program that works on huge lists of files. Aside from some minor performance problems (see "Fast iteration on files' metadata"), it works pretty well, except for one thing.

I need to get the size of each file to compute several things, but I can't seem to get what I need in Rust.

For instance, let's say I create a file with the content Hello World! (12 bytes).

If I use any of the following, Rust will return 4096, which is the physical size used on the disk as the partition is formatted using clusters of 4096 bytes, but it's not the actual size of the file, which is 12 bytes:

file.metadata()?.len() // 4096
file.metadata()?.size() // 4096 (needs std::os::unix::fs::MetadataExt)
file.metadata()?.st_size() // 4096 (needs std::os::linux::fs::MetadataExt)
file.symlink_metadata()?.len() // 4096
file.symlink_metadata()?.size() // 4096

Does anyone know how to get the actual size of the file?

Thanks in advance for your help!

One way is to read the file into a buffer and then look at its length.
Might not be the most efficient way though.

I am guessing you need to use the extended metadata traits for the OS that you're using. For Windows: MetadataExt in std::os::windows::fs. For Linux: MetadataExt in std::os::linux::fs. These expose the size fields/methods reported by the operating system itself.

Files can be several dozen gigabytes, so that's not an option.

The code I'm using already makes use of MetadataExt, but that still gives me 4096 bytes.

Open it for reading, seek to the end, and ask what position you're at.
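
A minimal sketch of that idea, using only the standard library. The helper name size_by_seeking is hypothetical and not something from this thread:

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use std::path::Path;

/// Returns the logical size of the file at `path` by seeking to its end.
/// (`size_by_seeking` is a hypothetical helper name.)
fn size_by_seeking(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    // Seeking 0 bytes relative to the end returns the new offset measured
    // from the start of the file, which is the file's length in bytes.
    file.seek(SeekFrom::End(0))
}
```

This does not read any file content, so it should stay cheap even for very large files, but it still costs one open per file.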

If your drives are fast enough (since you brought up very large files), streaming in chunks with a counter could work (and you won't notice it being too slow). You won't fill your RAM this way. If you are going through large dirs something like Rayon could speed things up quite a bit and you won't have to deal with the complexity of futures/resource exhaustion.
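
For illustration, streaming in chunks with a counter could look roughly like this. The function name and the 64 KiB buffer size are assumptions made for the sketch, and the Rayon part is left out:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Counts the bytes in a file by reading it through a fixed-size buffer,
/// so memory use stays constant regardless of the file size.
/// (`size_by_streaming` and the 64 KiB buffer are illustrative choices.)
fn size_by_streaming(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        match file.read(&mut buf) {
            // A read of 0 bytes means end of file.
            Ok(0) => break,
            Ok(n) => total += n as u64,
            // Retry if the read was interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}
```

Unlike a metadata lookup, this reads every byte, so its cost grows with the total amount of data.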

I would then suggest using the OS specific functions from the nix and winapi crates. In particular, for the stat function in the nix crate, I think it should give the correct size in bytes. At least, that's what I understand from the man page for stat. (refer to the st_size field).

The partition contains multiple terabytes of data split across hundreds of thousands of files, which would result in quite a large performance overhead unfortunately :confused:

I'll look into that, thanks for the suggestion :slight_smile:

But I'm surprised there is no single native method to get such a common value.

Ah makes sense. Thank you for clarifying. That is not a fun problem indeed. If you do end up going the rayon route, try to sort by allocated space so you don't constantly mix large and small files in batches (which would force you to wait for say a single large file, every time).
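
A rough sketch of the sorting step, leaving the Rayon part out. It assumes the logical size from Metadata::len() is a good enough proxy for allocated space, and sort_by_size_desc is a hypothetical helper name:

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Pairs each path with its size and sorts the list largest first,
/// so files of similar size end up next to each other.
/// Assumption: `Metadata::len()` stands in for allocated space here.
fn sort_by_size_desc(paths: Vec<PathBuf>) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut sized = Vec::with_capacity(paths.len());
    for path in paths {
        let len = fs::metadata(&path)?.len();
        sized.push((path, len));
    }
    // Compare `b` against `a` to get descending order.
    sized.sort_by(|a, b| b.1.cmp(&a.1));
    Ok(sized)
}
```

The sorted list can then be split into batches however the rest of the program needs.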

Hopefully those crates end up being fast enough for you!

I tested Metadata::len() on Windows, Linux and macOS, and all three return the correct length. What platform did you test on?

On WSL (Windows 11, latest stable version)

EDIT: I re-tested in another directory and now it seems to return the correct value. But for some other files in another directory (still on the same filesystem), it returns wrong values. This is weird, I'm going to check what's going on.

Ok so the reason it was returning wrong values was because I'm just plain stupid.

Basically I do something like:

WalkDir::new(source).into_iter().map(|item| item.metadata().unwrap().len())

Except I was writing this:

WalkDir::new(source).into_iter().map(|item| source.metadata().unwrap().len())

See the little source instead of item before .metadata()? The code compiled correctly but I was getting the size of the source directory, which is by convention a cluster's size :man_facepalming:
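
For anyone who lands here with the same symptom, here is a std-only sketch of the corrected idea. It uses fs::read_dir instead of WalkDir so that it needs no external crate, and file_sizes is a hypothetical name. The point is that the metadata is queried on each entry, not on the root directory:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collects the size in bytes of every non-directory entry
/// under `dir`. Symlinks are not followed.
/// (`file_sizes` is a hypothetical helper; the thread itself uses WalkDir.)
fn file_sizes(dir: &Path) -> io::Result<Vec<u64>> {
    let mut sizes = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // Query the metadata of the entry itself, not of `dir`.
        let meta = entry.metadata()?;
        if meta.is_dir() {
            sizes.extend(file_sizes(&entry.path())?);
        } else {
            sizes.push(meta.len());
        }
    }
    Ok(sizes)
}
```

The order of the returned sizes follows the directory listing order, which is not guaranteed to be stable.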

So problem solved, thank you all for helping and sorry for wasting your time :pray:

There's no equivalent to the stat function?
No wonder I prefer Perl...

The methods in Metadata and MetadataExt are the equivalent. They're literally populated from stat (or statx).
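
As a Unix-only sketch of that correspondence (stat_sizes is a hypothetical name): len() and size() both report st_size, while blocks() reports st_blocks in 512-byte units, which is the on-disk allocation the original question mixed up with the file length.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only extension trait
use std::path::Path;

/// Returns (len, st_size, allocated bytes) for the file at `path`.
/// (`stat_sizes` is a hypothetical helper; this compiles on Unix only.)
fn stat_sizes(path: &Path) -> io::Result<(u64, u64, u64)> {
    let meta = fs::metadata(path)?;
    // `len()` is the portable accessor; `size()` exposes st_size directly.
    let len = meta.len();
    let size = meta.size();
    // `blocks()` is st_blocks, counted in 512-byte units.
    let allocated = meta.blocks() * 512;
    Ok((len, size, allocated))
}
```

For a 12-byte file the first two values should both be 12. The third depends on the filesystem, so it is not necessarily 4096.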
