Hello world. How can I remove bytes from a file at any position?
I would probably write a temporary file (see also the tempfile crate), copying all bytes except the ones to be skipped into the new file, then use std::fs::rename to move the temporary file over the original one.
Maintaining file permissions might be tricky though. Maybe there is a better solution?
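A minimal sketch of that temporary-file approach, assuming the byte range to remove is known up front (`remove_range` is a made-up helper name, and a real implementation would copy in chunks and create the temporary file more carefully):

```rust
use std::fs::{self, File};
use std::io::{self, BufReader, BufWriter, Read, Write};

/// Copy `path` to a temporary sibling file, skipping `len` bytes
/// starting at `offset`, then rename the temporary file over the
/// original. Bytewise copying is used here only for clarity.
fn remove_range(path: &str, offset: u64, len: u64) -> io::Result<()> {
    let tmp_path = format!("{path}.tmp");
    let src = BufReader::new(File::open(path)?);
    let mut dst = BufWriter::new(File::create(&tmp_path)?);

    for (i, byte) in src.bytes().enumerate() {
        let i = i as u64;
        // Keep everything outside [offset, offset + len).
        if i < offset || i >= offset + len {
            dst.write_all(&[byte?])?;
        }
    }
    dst.flush()?;
    // Replace the original file (both paths must be on the same filesystem).
    fs::rename(&tmp_path, path)
}
```

Note that the rename itself is atomic on POSIX systems, which is why this pattern is popular despite the full copy.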
If it's okay to leave the file corrupted when your process aborts, you might also modify the file in place: shift the trailing bytes down over the removed range (copying in chunks rather than bytewise, unless you don't care about performance) and then truncate the file with File::set_len to drop the excess bytes at the end.
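The in-place variant could be sketched like this, under the same "okay to corrupt on abort" assumption (`remove_range_in_place` is a made-up name; the chunked copy is the optimization mentioned above):

```rust
use std::fs::OpenOptions;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Shift the bytes after [offset, offset + len) down over the
/// removed range, then truncate the file. Not crash-safe: an
/// abort mid-way leaves the file partially shifted.
fn remove_range_in_place(path: &str, offset: u64, len: u64) -> io::Result<()> {
    let mut f = OpenOptions::new().read(true).write(true).open(path)?;
    let total = f.metadata()?.len();
    let mut read_pos = offset + len; // start of the tail to keep
    let mut write_pos = offset;      // where the tail moves to
    let mut buf = [0u8; 8192];       // chunked copy instead of bytewise

    while read_pos < total {
        f.seek(SeekFrom::Start(read_pos))?;
        let n = f.read(&mut buf)?;
        if n == 0 {
            break;
        }
        f.seek(SeekFrom::Start(write_pos))?;
        f.write_all(&buf[..n])?;
        read_pos += n as u64;
        write_pos += n as u64;
    }
    // Remove the now-duplicated bytes at the end of the file.
    f.set_len(total - len)
}
```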
This generally isn't possible to do in-place because filesystems organize files in blocks or block-aligned segments. Removing a single byte in the middle would lead to the logical byte offsets no longer being aligned with the block offsets.
At least Linux has a non-portable, filesystem-dependent method to remove byte ranges: fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len). But even on supporting filesystems, the offset and length usually have to be block-aligned.
Punching holes in the middle of a file without shortening its total length is somewhat more portable, but still not universal across filesystems. Those byte ranges aren't removed; they're logically replaced with zeroes, although they no longer take up disk space.
So the portable options are:
- recreate the whole file
- design your file format for data to be organized in blocks, so that blocks in the middle can be replaced or marked as deleted in-place
- design your file format to support hole-punching
Thanks for the support. I didn't know that it would be that hard. How can I find the number of bytes per block on the executing machine? And how can I make each item in my data coincide with the beginning of a block?
On Unix systems the block size is available via the MetadataExt trait. It's generally a power of two.
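Querying it could look like this (note that `blksize` reports the st_blksize field from stat(2), the preferred I/O size, which may differ from the filesystem's allocation block size):

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;

/// Preferred I/O block size for the filesystem containing `path`.
fn block_size(path: &str) -> std::io::Result<u64> {
    Ok(fs::metadata(path)?.blksize())
}
```

On a typical Linux filesystem this often returns 4096.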
If your file format's record size is 1024 bytes and the filesystem's block size is 4096 bytes then hole-punching a single record will write zeroes to disk. Hole-punching 4 consecutive (and aligned) records will remove the block from the underlying storage and replace them with a virtual range of zeroes. So this works best if you can delete runs of data.
Another option that doesn't even require hole-punching is doing something akin to Vec's swap_remove, i.e. moving data around to fill holes created by deleted records. But that only works if the data is movable (i.e. offsets don't matter) and the order isn't important.
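A sketch of that swap-remove idea for fixed-size records (the `RECORD` size and the function name are made up for illustration): overwrite the deleted record with the last record, then shrink the file by one record.

```rust
use std::fs::OpenOptions;
use std::io::{self, Read, Seek, SeekFrom, Write};

const RECORD: u64 = 16; // assumed fixed record size in bytes

/// Delete record `idx` by moving the last record into its slot and
/// truncating one record off the end. O(1) I/O, but record order
/// is not preserved.
fn swap_remove_record(path: &str, idx: u64) -> io::Result<()> {
    let mut f = OpenOptions::new().read(true).write(true).open(path)?;
    let len = f.metadata()?.len();
    let last = len / RECORD - 1;
    if idx != last {
        let mut buf = [0u8; RECORD as usize];
        f.seek(SeekFrom::Start(last * RECORD))?;
        f.read_exact(&mut buf)?;
        f.seek(SeekFrom::Start(idx * RECORD))?;
        f.write_all(&buf)?;
    }
    f.set_len(len - RECORD)
}
```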
My data is movable, but I have to come up with a way to do that without disturbing the order of requests to the file. I think I can manage that.
Thanks to all of you for your responses. This DBMS project is really turning out to be the most instructive undertaking of my programming career.
I believe you'd generally want to avoid moving data. A lot of databases have some sort of VACUUM command, which frees any unused space and "removes" the holes. But since this can be an expensive operation, it's done on a scheduled basis rather than constantly during operation.
So whatever you want to achieve, it might be best not to worry about the unused holes, but to provide a mechanism to rewrite the entire database when desired.