Are holes in files initialised to zero?

When writing a file, if seek is used before writing, there can be an "uninitialised" "hole" in the file.

To my knowledge, most (all?) file systems will initialise holes with zeros (this may not be true - does anyone know?).

However the Rust documentation doesn't say if this is guaranteed (unless I missed something), which makes it unclear whether this can be relied on. I think it would be useful to resolve this ambiguity one way or the other. I think I would favour resolving it by saying holes are initialised to zeros, recognising what file systems commonly do.


Rust doesn't decide this; the OS does.

On systems you're likely to encounter, the OS fills in notional zero bytes if you seek past the end of the file and then write. Rust, however, only translates your request into a system call, and lets the OS do what it will with it.
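For anyone who wants to see this on their own machine, here is a minimal sketch. The helper name `write_after_gap` and the file path are made up for illustration; the zero-filled gap is the behaviour I'd expect on mainstream OSes, not something std itself promises.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Creates `path`, seeks `gap` bytes past the start of the empty file,
/// writes `payload` there, then reads the whole file back and deletes it.
fn write_after_gap(path: &Path, gap: u64, payload: &[u8]) -> io::Result<Vec<u8>> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;

    // Seeking past the end is allowed; nothing is written by the seek itself.
    file.seek(SeekFrom::Start(gap))?;
    file.write_all(payload)?;

    // Read everything back, including whatever the OS reports for the gap.
    file.seek(SeekFrom::Start(0))?;
    let mut contents = Vec::new();
    file.read_to_end(&mut contents)?;

    drop(file);
    fs::remove_file(path)?;
    Ok(contents)
}
```

Calling it with a gap of 16 and a 4-byte payload should give back 20 bytes; whether the first 16 are zero is exactly the thing the OS, not Rust, decides.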

I think I would favour resolving it by saying holes are initialised to zeros, recognising what file systems commonly do.

std would need to actually implement this behaviour, on OSes which don't do that natively, and it might confuse programmers on those platforms who expect the native behaviour rather than Rust's behaviour.


The problem is that currently it is hard (well, I think impossible) to make use of this common (and extremely useful) behaviour of real filesystems. I would argue that the cost of this potential confusion is far outweighed by the very real benefit of defining the value of "holes".

It may be that there is a crate that does this, but I would argue it is the job of a standard library to provide useful standardised behaviour, and it should be the job of someone implementing the standard library on a rare exotic filesystem that does not zero holes (as yet it seems to be unknown whether examples even exist) to provide the necessary implementation. Still, it is of course a matter of opinion, and it should reflect practical reality (hence my question about the existence of such file systems).

Many OSes by default allow multiple writers to the same file. If a standard library tried to detect and fill holes, it could race with another writer, overwriting its content.


I think the best that std can do, given the design goals expressed in its code today, is to make it clearer to the programmer that this issue exists, and to document common configurations somewhere.

It might also be possible to put seek behind platform-specific features, as is done with many other OS-specific interfaces, even though it is implemented on every OS std targets, so that programmers can specify which seek implementation they expect and receive an error at compile time if that expectation is violated. This would be a breaking change, but it is a possible one; the question is only whether it is a substantial-enough improvement to justify the work.

The approach I see expressed in std in general is that it provides Rust programmers with consistent ways to access OS services, but makes no promises about what those services will do. Providing the kind of consistent cross-platform behaviour you're asking for is largely left to the crates ecosystem. This is in contrast to, for example, Java, where the language's stdlib expressly promises cross-platform similar behaviour in a number of places. Both choices are valid; they lead to different results, but not to bad ones.

Rust expects programmers to make educated guesses about the underlying platform in plenty of ways, and programmers do manage to write effective programs in spite of these guesses. You can call seek today in the blind expectation that the OS will zero the intervening extent of the file, and your program will almost certainly work as you intend on any platform you're likely to encounter. However, if your program relies on there always being zeroes there, and if you cannot tolerate it failing on platforms that don't do that, then your main option is to write zeroes instead of seeking past them, and to accept the resulting increase in file sizes.
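As a rough sketch of that last option (writing the zeroes yourself instead of seeking past them): the function name `write_at_zero_filled` and the 4096-byte chunk size are my own choices, and this assumes a single writer, since it is not safe against another process appending at the same time.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

/// Writes `bytes` at `off`, first filling any gap between the current end
/// of the file and `off` with explicitly written zero bytes.
/// Assumes no other writer touches the file concurrently.
fn write_at_zero_filled(file: &mut File, off: u64, bytes: &[u8]) -> io::Result<()> {
    // Seeking to the end returns the current length and leaves the cursor there.
    let end = file.seek(SeekFrom::End(0))?;
    if off > end {
        // Emit real zero bytes in fixed-size chunks until the gap is covered.
        let zeroes = [0u8; 4096];
        let mut remaining = off - end;
        while remaining > 0 {
            let n = remaining.min(zeroes.len() as u64) as usize;
            file.write_all(&zeroes[..n])?;
            remaining -= n as u64;
        }
    }
    file.seek(SeekFrom::Start(off))?;
    file.write_all(bytes)
}
```

The trade-off is the one already mentioned: the gap now occupies real disk space, because the filesystem sees ordinary data rather than a hole.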

I know of a few filesystems that don't implement seek at all, but they're extremely niche - experiments, half-baked blog posts, and things like that. Rust still provides seek in those configurations, but it'll return the error it gets from the OS indicating that seek is not supported.


I guess that is what I will have to do, and I can use cfg to omit this code in all existing known systems, although finding documentation seems difficult. Even a boolean constant defined in the standard library would be very useful here.

At least anything POSIX-compliant would return zeroes:

https://pubs.opengroup.org/onlinepubs/009696799/functions/lseek.html

The lseek() function shall allow the file offset to be set beyond the end of the existing data in the file. If data is later written at this point, subsequent reads of data in the gap shall return bytes with the value 0 until data is actually written into the gap.

There might be some embedded systems or funky userspace filesystems that violate this but those aren't general-purpose environments.


IMO just write your application under the assumption that holes behave sensibly, with the caveat that writing to a file while you're reading it may result in unspecified behavior.[1] That's basically what a backup program would do, for example.

Not all OS+FS combos support detecting holes at all, so it's beyond Rust or any other library's ability to actually promise that holes are detected and reads within them act a certain way anyway (even if being racy, incurring unnecessary cost, and the FS/OS abstraction boundary weren't concerns).


  1. Since pretty much any file-reading program has this caveat, it normally goes unstated.


Looks like set_len will always make zeroes:

If it is greater than the current file’s size, then the file will be extended to size and have all of the intermediate data filled in with 0s.

That should be efficient on filesystems that support holes. You could make a wrapper that calls set_len whenever something tries to write past the end of a file.
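Something along these lines, as a sketch only: `ZeroExtendFile` and `write_at` are hypothetical names, and it leans entirely on the `set_len` documentation quoted above. It reads the current size from `metadata()` and has the same single-writer assumption as any check-then-write code.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

/// Hypothetical wrapper that extends the file with `set_len` before any
/// write that would start past the current end of the file.
struct ZeroExtendFile {
    file: File,
}

impl ZeroExtendFile {
    /// Writes `bytes` at absolute offset `off`.
    fn write_at(&mut self, off: u64, bytes: &[u8]) -> io::Result<()> {
        let len = self.file.metadata()?.len();
        if off > len {
            // `set_len` is documented to fill the extended range with zeroes.
            self.file.set_len(off)?;
        }
        self.file.seek(SeekFrom::Start(off))?;
        self.file.write_all(bytes)
    }
}
```

Writes that land inside the existing file skip the `set_len` call, so the extra cost is one metadata lookup per write.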


Thanks! This is what I came up with in my function which writes files. It is a bit messy but should work (I think/hope!).

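    // Needs `use std::io::{Seek, SeekFrom, Write};` in scope.
    fn write(&self, off: u64, bytes: &[u8]) {
        let mut f = self.file.lock().unwrap();
        // The list of operating systems which auto-zero is likely longer than this; research is still to do.
        #[cfg(not(any(target_os = "windows", target_os = "linux")))]
        {
            // Seeking to the end returns the current file size.
            let size = f.seek(SeekFrom::End(0)).unwrap();
            if off > size {
                // set_len is documented to fill the extension with zeroes.
                f.set_len(off).unwrap();
            }
        }
        f.seek(SeekFrom::Start(off)).unwrap();
        // write_all keeps going until every byte is written; a bare write may stop early.
        f.write_all(bytes).unwrap();
    }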

( Or here: stg.rs - source )

Hm... I don't know... have you tested whether you even need that length check? I don't have context here, but I have never bothered with file gaps in Java, always counting on them being filled with zeroes. And I did make an application that used a (Java) RandomAccessFile and wrote to an arbitrary file position without even checking the file length.

Never got into any trouble.

If you are this concerned about whether an operating system will write zeroes when the offset is past the end of the file, why not "test the waters"?

Place a Once call and make a very quick test: create a temporary file, write at offset, I don't know, 16, and then check the value at file position zero.
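A sketch of what I mean, using `std::sync::OnceLock` (stable since Rust 1.70) as the "Once". The function names and the temp-file name are invented for the example, and note the big assumption: it only probes the filesystem behind the system temp directory, which may not be the one your data lives on.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::sync::OnceLock;

/// Writes one byte at offset 16 of a fresh temporary file and reports
/// whether the 16 bytes before it read back as zeroes.
fn probe_gap_is_zeroed() -> io::Result<bool> {
    let path = std::env::temp_dir().join(format!("gap_probe_{}.tmp", std::process::id()));
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(&path)?;

    file.seek(SeekFrom::Start(16))?;
    file.write_all(&[0xFF])?;

    // Pre-fill the buffer with a non-zero pattern so stale zeroes can't fool us.
    let mut gap = [0xAAu8; 16];
    file.seek(SeekFrom::Start(0))?;
    file.read_exact(&mut gap)?;

    drop(file);
    fs::remove_file(&path)?;
    Ok(gap.iter().all(|&b| b == 0))
}

/// Runs the probe at most once per process and caches the answer.
/// If the probe itself fails, this conservatively reports `false`.
fn gaps_are_zeroed() -> bool {
    static RESULT: OnceLock<bool> = OnceLock::new();
    *RESULT.get_or_init(|| probe_gap_is_zeroed().unwrap_or(false))
}
```

The write path could then check `gaps_are_zeroed()` and fall back to zeroing the gap explicitly when it returns `false`.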

It is hard to know without an actual system, but this code might fail to do what you want on some hypothetical system where seek extends the length and fills in holes with a non-zero value. You might want to do the set_len first, hoping to benefit from its guarantee. Although I expect that seek and set_len on this hypothetical system likely use the same underlying mechanism.

The call f.seek(SeekFrom::End(0)) simply gets the current size of the file (it certainly is obscure!!). This is necessary because we only want to call set_len if the write offset is greater than the current size.


I think you should not worry.
Most file systems these days are not tape drives. If it is a non-virtual drive, then the data is chopped up and spread around the disk using different methods by different vendors. It is usually not just a sequence of bits in order on one section of disk. Files are intermixed at the hardware level.
If it is virtual, then your virtual disk should have some optimization. Otherwise you would use all your memory just to support a large virtual disk.
Also, a modern OS should support sparse files automatically.

Another perspective is that a database does have requirements for the OS, filesystem and disk devices that are beyond what the Rust stdlib can or should provide. These are normally documented for that database.

The restriction to require holes filled with zeros is probably not an issue in almost all cases, but it is wise to document the requirement just in case non-mainstream file systems might be used, for example. You can't possibly test against all OS and filesystem combinations.

Here's one embedded Rust database where such restrictions are documented. For the big mainstream databases you can also find such restrictions in their documentation. For example, Postgres requires a POSIX-compatible file system and has NFS restrictions.


Another point is that failure to initialise files would constitute a security issue for any general-purpose operating system. In general terms I would expect an operating system to keep all unallocated disk space zeroed, and then when freshly allocated it would therefore already be zeroed.

This definitely doesn't work like that. If you erase a 100GB file it takes a few seconds, far less than the time an HDD would need to zero 100GB of space.

And there are tons of “unerase” programs for all popular OSes. They don't always work, but they can often recover something.

And if you recall how often juicy info was found in unused parts of MS Office files (these files are themselves a filesystem within the filesystem, and they often included garbage grabbed from the HDD in the past), then no, I wouldn't say “all general-purpose operating systems treat it as a security issue”. I'm not even sure all modern ones treat it like that: at least Windows 9X doesn't treat it as a security issue at all.

Of course Rust9X is an unofficial port, not something officially supported, so I have no idea how important it is for you.

True, but the OS should not be leaking the contents of files previously written to arbitrary applications ( which may have low security privilege ).

Many OSes (especially in embedded space) don't even have the notion of low security privilege, let alone support protection of these arbitrary applications from each other.

The OSes that do care about them, today, offer such guarantees, but it would be stupid and pointless to try to ban all other OSes just because some people don't want to support these.

I think the file systems (Edit: or at least some of them) work by keeping track of the holes, and if you read from a hole they don't read from disk they just return zeros. Edit: When this approach is used, they're called sparse files. The allocation of blocks is deferred until you write something in the hole.
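If you're curious whether your filesystem actually defers allocation like this, here is a Unix-only sketch that compares a file's logical length with the space the OS says is allocated. `logical_vs_allocated` is a made-up helper; it relies on `MetadataExt::blocks()`, which reports 512-byte units, and the allocated figure will vary by filesystem, so treat the result as an observation rather than a guarantee.

```rust
use std::fs::{self, File};
use std::io::{self, Seek, SeekFrom, Write};
use std::os::unix::fs::MetadataExt; // Unix-only extension trait
use std::path::Path;

/// Creates a file consisting of a `hole`-byte gap followed by one byte and
/// returns (logical length, bytes the OS reports as allocated), then
/// deletes the file.
fn logical_vs_allocated(path: &Path, hole: u64) -> io::Result<(u64, u64)> {
    let mut file = File::create(path)?;
    file.seek(SeekFrom::Start(hole))?;
    file.write_all(&[1])?;
    file.sync_all()?;

    let meta = file.metadata()?;
    drop(file);
    fs::remove_file(path)?;

    // `blocks()` counts 512-byte units regardless of the filesystem block size.
    Ok((meta.len(), meta.blocks() * 512))
}
```

On a filesystem with sparse-file support I'd expect the allocated figure to be far smaller than the logical length for a large hole; on one without it, the two should be roughly equal.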
