Why so many files?

I just ran cargo clean on my fairly modest project (rustweb2). I think the number of crates brought in via dependencies is 152. It says

cargo clean
Removed 93118 files, 2.3GiB total

How the heck can there be that many files? I don't understand how that can be. Ok, I have previously built it, run cargo doc, published it. But it still seems an awful lot. Just wondering.

Edit: if I just build it after clean, and clean again, it says just Removed 1352 files, 875.5MiB total. I think maybe it is cargo doc that makes a lot of files.


I ran cargo doc on your repository and it left 30,923 files in target/ (as counted by tree -a). There was also a failure so more might be generated in your local working copy which might have some fix not pushed to GitHub.

The libc crate contributes 11,437 doc files and openssl_sys contributes 10,830 files. These are binding crates that have a huge number of items.


Well I just found a big culprit... the doc directory for windows_sys has 63,202 files.

You can see roughly why here:

https://docs.rs/windows-sys/0.52.0/windows_sys/all.html

Playing with rustdoc for the first time.

WOW... it generates the documentation for everything... every single dependency...

Well, it makes sense: in a post-apocalyptic world where crates.io becomes inaccessible due to the downfall of civilization... yep. It would be nice to have all the docs on your local computer :slight_smile:

Perhaps what you want is to generate the docs with --no-deps? See the cargo doc page in The Cargo Book.


--no-deps is a bit extreme the other way, I guess just excluding windows_sys would be good, but not sure if that is possible. The way it generates a file for every constant seems a bit extreme, I wonder if it could generate perhaps one file per module and use fragment links more? But anyway, it was just curiosity really. At least I now understand what is going on.
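One middle ground that might work: as far as I know, cargo doc accepts `-p` more than once, and combined with `--no-deps` it documents only the packages you name. That is opting in rather than excluding, but it would keep windows_sys out. A rough sketch of building that invocation from Rust (say, from an xtask-style helper) is below; `doc_args`, `doc_command` and the package names in the usage note are placeholders I made up, not anything from this project's actual setup.

```rust
use std::process::Command;

/// Builds the argument list for `cargo doc --no-deps -p <pkg> -p <pkg> ...`.
/// Only the packages listed here get documented; everything else is skipped.
pub fn doc_args(packages: &[&str]) -> Vec<String> {
    let mut args = vec!["doc".to_string(), "--no-deps".to_string()];
    for pkg in packages {
        args.push("-p".to_string());
        args.push(pkg.to_string());
    }
    args
}

/// Wraps the arguments in a `Command`; the caller decides when to run it,
/// e.g. with `.status()`.
pub fn doc_command(packages: &[&str]) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(doc_args(packages));
    cmd
}
```

For example, `doc_command(&["rustweb2", "some_dep"]).status()` would run the equivalent of `cargo doc --no-deps -p rustweb2 -p some_dep`, where `some_dep` stands in for whichever dependencies you actually want docs for.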

Some nights, I lie awake imagining a world in which some kind of compressed HTML archive format got standardised and was supported by browsers and servers. How documentation generators wouldn't need to output ten billion tiny files, choking filesystems and wasting storage space.

Then I turn over and softly cry myself to sleep...


The windows and windows_sys crates are sort of special cases, because they have so many items, and essentially none of them have any documentation. For a native Rust library, module level consts will generally have enough documentation attached to merit their own documentation page.

Additionally contributing to the large number of items in windows_sys is the use of old-style C enumerations with a type alias and a bunch of consts. In a native Rust library, you wouldn't have the "enum" and all of its variants as separately documented top level items, the variants would be associated items documented on the page for the enum type.
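To make the difference concrete, here is a small illustrative sketch of the two styles. All names are hypothetical and not copied from windows_sys. In the C style, the alias and every constant are separate top-level items, so each gets its own page; in the native style there is one enum item, and its variants are documented on that single page.

```rust
// C-style "enum": one type alias plus one top-level constant per value.
// rustdoc treats these as four separate top-level items.
#[allow(non_camel_case_types)]
pub type SHOW_MODE = u32;
pub const SHOW_MODE_HIDE: SHOW_MODE = 0;
pub const SHOW_MODE_NORMAL: SHOW_MODE = 1;
pub const SHOW_MODE_MAXIMIZED: SHOW_MODE = 3;

/// Native Rust style: a single top-level item whose variants are
/// associated with the enum rather than being items of the module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum ShowMode {
    Hide = 0,
    Normal = 1,
    Maximized = 3,
}

impl ShowMode {
    /// Converts a raw C-style value into the enum, rejecting unknown values.
    pub fn from_raw(raw: SHOW_MODE) -> Option<ShowMode> {
        match raw {
            SHOW_MODE_HIDE => Some(ShowMode::Hide),
            SHOW_MODE_NORMAL => Some(ShowMode::Normal),
            SHOW_MODE_MAXIMIZED => Some(ShowMode::Maximized),
            _ => None,
        }
    }
}
```

Multiply the first pattern by the thousands of enumerations in the Windows API and the file count follows.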


An additional constraint on rustdoc is that it really wants to generate a fully static site that works to open straight from the filesystem in the browser and with javascript disabled. It could do a bit better if it required running a server (which could then serve from some sort of container file) or used a JS driven SPA approach, but it wants to avoid that requirement.

Outside of outliers like the windows[_sys] and winapi crates (which are very much a spiders georg here), it generally works well.

It's unfortunate that this primarily impacts Windows, since Windows is on average worse at handling lots of small files than unixes, just due to how the filesystem works. (Use a Dev Drive if you can: it eliminates a decent amount of incidental overhead at the cost of reduced file system filter compatibility.)


If we eventually get the ability to distinguish between public and private dependencies, it might get a bit better. Namely, in the future, cargo doc should be able to generate only docs for the crates you actually can use items from. Thus you'll only generate docs for windows_sys


Well, if you think about that, this is up to the servers to decide, isn't it? Or even the operating system filesystem? If you look at a server as a black box, I ask "give me x.html", and the server returns an html page.

I know for a fact that the current protocols do support compressed formats, because I see it all the time in the commons httpclient wire logs of my Java programs: sometimes the payload comes in compressed and all you can see is a binary payload.

So, it is possible to provide a pre-compressed html page, that doesn't reside as raw files in the html system, but instead, say, in a static database or binary format.

In fact, it seems that mod_deflate in Apache httpd does just that? (See "Serving pre-compressed files using Apache".)

Although for individual files. Not for a blob of multiple compressed files with an index for simple access.

AND... I'm talking s*t :slight_smile: If we are using the files directly from our computers we don't rely on servers, so, yep, we get a problem here. :slight_smile:

I think you've misunderstood what I was trying to say.

The output of cargo doc can be used in one of two ways: directly opened in a browser, or served by an HTTP server as static files. If cargo doc did write all its output into an archive, both of these use cases are negatively impacted. Desktop users have to either extract the archive (which defeats the purpose), or run a special local server (which is a pain in the backside) to view the contents. Servers also have to either extract the archive to serve its contents, or (possibly) use server-specific plugins to read the contents out rather than just dropping it in a static folder.

So the idea was that if such an archive format existed and was supported by browsers and servers directly, then cargo could output that and everyone would be happy. It would be faster to output, load faster, and take less disk space.

But it doesn't, so it can't, and we're left with target/doc directories filled with eleventeen gazillion files and sadness.

Look, Microsoft has spent over 20 years and two from-scratch filesystems trying to replace NTFS with little to no success. They're not fixing the "lots of small files" issues any time soon. If ever. Probably never. Hell, I've seen people suggest the "solution" is to do dev inside WSL. That's not a solution, that's an admission of complete and utter defeat.

Why is it not a solution? It's easy to set up and you can create a bazillion files in Linux easily (last time I checked, an Android checkout created 3 million files and it's perfectly usable inside WSL).

Yes, it's also an admission of defeat, but so what? It works.

Firefox should be able to directly open zip files using jar:file:///path/to/file.zip!/path/inside/zip.html. I don't know if that allows referencing other resources in the same zip file though.

You may be interested in "NTFS really isn't that bad" https://youtu.be/qbKGw8MQ0i8 - where rustup contributor Robert Collins describes how you can get NTFS competitive with Linux.

TLDR is you do some work to avoid redundant and otherwise poor API usage that is also a good idea to do on Linux. There's basically one outstanding issue where Defender blocks to scan files in CloseHandle synchronously, which you have to work around by pushing it to worker threads.
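As a rough illustration of that workaround (not rustup's actual code), the idea is to write each file on the main thread but hand the open handle to a worker thread, so that the potentially slow close happens off the critical path. This is a minimal sketch with a single closer thread; `write_all_closing_elsewhere` is a made-up name, and a real implementation would likely use a pool of workers and report errors more carefully.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use std::sync::mpsc;
use std::thread;

/// Writes each `(name, contents)` pair into `dir`, but sends the open `File`
/// to a worker thread so that dropping it (which closes the handle) does not
/// block the thread doing the writes.
pub fn write_all_closing_elsewhere(dir: &Path, files: &[(&str, &[u8])]) -> io::Result<()> {
    let (tx, rx) = mpsc::channel::<File>();

    // Worker thread: receiving a File and dropping it is what closes it.
    let closer = thread::spawn(move || {
        for file in rx {
            drop(file);
        }
    });

    let mut result = Ok(());
    for (name, bytes) in files {
        let written = File::create(dir.join(name)).and_then(|mut f| {
            f.write_all(bytes)?;
            Ok(f)
        });
        match written {
            // If the worker is gone, the file is simply closed on this thread.
            Ok(f) => {
                let _ = tx.send(f);
            }
            Err(e) => {
                result = Err(e);
                break;
            }
        }
    }

    // Closing the channel lets the worker finish; wait until every handle is closed.
    drop(tx);
    closer.join().expect("closer thread panicked");
    result
}
```

On Linux this buys little, since close is cheap there, but it should not hurt either, which fits the point that these changes are a good idea on both platforms.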

I'm surprised Microsoft hasn't "simply" added some sort of CloseHandleAsync.


funny how this got accepted as a solution

That phrase doesn't make much sense. How can you compare a filesystem with an OS kernel?

I think you wanted to say that you may get Windows to become competitive with Linux. That's different.

So to make a horsecart competitive with a car, you just need to stop moving boxes whenever you need or want to, and instead group your shipments into batches and deliver them once per day.

That doesn't make a horsecart competitive with a car; it makes it useful.

Sure, if you have to deal with the inefficiency of Windows for one reason or another then it's a valid proposal, but if you have a choice then using something that doesn't require the use of clever tricks is just easier!

BTW one valid way to achieve what you propose is to just install WSL2 on top of Windows and then work with lots of small files there. This sidesteps the issue of the inefficient API nicely while still making it possible to use Windows.

P.S. It's a nice talk, BTW, just the name is misleading: it doesn't discuss deficiencies of NTFS, but deficiencies of Windows, and it doesn't explain how to fix them, but explains how to sidestep them!

On Windows the FS directly gets invoked for every FS operation. On Linux there is a VFS which mediates between the program and the actual FS and which caches the metadata of all recently accessed files and directories (the inode cache) as well as recently accessed file contents (the page cache). So yes, comparing a filesystem with a kernel makes sense in this case as Linux handles a lot of things which NTFS on Windows needs to handle itself.


And where's Windows in what you wrote? NTFS is perfectly usable on Linux or Mac and it wouldn't suffer from the effects that you are describing.

Equally: one may grab a Linux filesystem (e.g. BTRFS) and it would perform equally poorly on Windows.

Nope. It still doesn't make sense. You may compare NTFS to BTRFS on Windows (and BTRFS would lose, BTW). You may compare NTFS to BTRFS on Linux (where it would win). But comparison of NTFS to Linux doesn't make much sense.

What you are comparing is not NTFS and Linux, though, but the quality of the OS kernel and drivers, and it's a very well known fact that neither Apple nor Microsoft invests much in the efficiency of their kernels.

Why call that comparison “NTFS vs Linux” and how that explanation even makes any sense is beyond me.

To forestall this argument: the reason I phrased it that way is because that's how the video framed it (presumably for the same reasons I would do so, but that's beside the point). There's not really much point arguing about it though; surely everyone understands the intent.

[citation needed]

Please avoid being unnecessarily argumentative.

