Is std::io::Seek O(1) or fs-dependent?

One may expect this to be O(1). However, unless files are contiguous blocks on disk, it is not obvious to me how this can be guaranteed.

1 Like

Aren't you contradicting yourself in this?

It's not obvious to me != it can't be done.

Also, if you use 4 KiB B-tree nodes, log_4096(N) is like 5 for N < 1 PB.
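To put numbers on that (a back-of-the-envelope sketch; `btree_depth` is a made-up helper, not anything from std):

```rust
/// Levels of a B-tree with the given fanout needed to index `n` entries.
/// Back-of-the-envelope only: real trees have per-node overhead.
fn btree_depth(n: f64, fanout: f64) -> u32 {
    n.log(fanout).ceil() as u32
}

fn main() {
    // log_4096(1e15) ≈ 4.15, so about 5 levels for a 1 PB keyspace:
    println!("{}", btree_depth(1e15, 4096.0)); // prints 5
}
```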

A file system is just an abstract interface over various implementations. With an interface like FUSE it can even call into a non-privileged userland process. Such a process can do whatever it wants before returning the result, like calculating digits of pi, spamming DNS requests, or just spinning in an infinite loop.

4 Likes

Even if they are contiguous blocks on HDD, it's often still not O(1) because it depends where you're seeking to, where the read head currently is, how far around the disk needs to spin to get to the data, how long the queue of other things accessing the disk is, etc. Remember, whatever you need might be at the address "most pessimum".

But what are you trying to do with the answer to the question? For most practical purposes, the answer is either "it's O(1)-but-really-slow" or "it's O(√random_access_working_set)".

2 Likes

I failed to state: I am assuming SSD.

Writing a toy KV store where the key space fits in memory, and trying to see how hard it is to do O(1) seek for reads/writes.

That was definitely non-obvious, because the very idea of "contiguous blocks" that you mentioned doesn't really make sense on an SSD. Because of the FTL, any already-sketchy-in-flash-cells notion of contiguousness is completely hidden from the filesystem already, let alone from anything you can see at the application layer.

I would not be surprised to find that seek doesn't actually take any time worth worrying about: that it does nothing apart from recording one's position in the file. Any actual work is done when one subsequently reads or writes at that position.

One way for seek to be O(1) would be if it just changed the position, and didn't start reading new data until the next read call. I think that would be trivial to implement.
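A sketch of that idea, assuming a hypothetical `LazySeek` wrapper over any `Read + Seek` (this is not how `std::fs::File` is implemented, just an illustration that `seek` can be pure bookkeeping):

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper: `seek` is O(1) because it mostly just updates a
/// number; the inner reader is repositioned lazily, on the next `read`.
struct LazySeek<R: Read + Seek> {
    inner: R,
    pos: u64,     // where the caller thinks we are
    synced: bool, // does `inner`'s real cursor match `pos`?
}

impl<R: Read + Seek> LazySeek<R> {
    fn new(inner: R) -> Self {
        // A fresh reader starts at 0, so we begin in sync.
        LazySeek { inner, pos: 0, synced: true }
    }
}

impl<R: Read + Seek> Seek for LazySeek<R> {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        match from {
            // Pure bookkeeping: no I/O at all.
            SeekFrom::Start(p) => {
                self.pos = p;
                self.synced = false;
            }
            SeekFrom::Current(d) => {
                self.pos = self.pos.checked_add_signed(d).ok_or_else(|| {
                    io::Error::new(io::ErrorKind::InvalidInput, "seek before byte 0")
                })?;
                self.synced = false;
            }
            // End needs the real length, so this one does touch `inner`.
            SeekFrom::End(d) => {
                self.pos = self.inner.seek(SeekFrom::End(d))?;
                self.synced = true;
            }
        }
        Ok(self.pos)
    }
}

impl<R: Read + Seek> Read for LazySeek<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if !self.synced {
            // The deferred work happens here, not in `seek`.
            self.inner.seek(SeekFrom::Start(self.pos))?;
            self.synced = true;
        }
        let n = self.inner.read(buf)?;
        self.pos += n as u64;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut f = LazySeek::new(io::Cursor::new(b"hello world".to_vec()));
    f.seek(SeekFrom::Start(6))?; // no underlying I/O yet
    let mut buf = [0u8; 5];
    f.read_exact(&mut buf)?;     // reposition + read both happen here
    assert_eq!(&buf, b"world");
    Ok(())
}
```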

However, I'm not sure why you expect this call to be O(1) in the first place. Even the documentation for Seek states, under "Errors":

Seeking can fail, for example because it might involve flushing a buffer.

Flushing a buffer is definitely not a constant-time operation. I mean, maybe it's technically O(1) in the sense that the time won't depend directly on the input pos, but... it's not constant-time.
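That warning is easy to demonstrate with `std::io::BufWriter`, whose `Seek` impl writes out the internal buffer before seeking. This uses an in-memory `Cursor` so the flush is observable:

```rust
use std::io::{self, BufWriter, Cursor, Seek, SeekFrom, Write};

fn main() -> io::Result<()> {
    let mut w = BufWriter::new(Cursor::new(Vec::new()));
    w.write_all(b"hello")?;                    // sits in BufWriter's buffer
    assert!(w.get_ref().get_ref().is_empty()); // nothing reached the inner writer yet

    // Seeking a BufWriter flushes the buffer first - exactly the kind of
    // "real work" the Seek docs warn can make seek fail:
    w.seek(SeekFrom::Start(0))?;
    assert_eq!(w.get_ref().get_ref().as_slice(), b"hello");
    Ok(())
}
```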

This.

What do you expect seek() to do? It's acting on a File, which is not a thing that hardware knows about at all.

The guarantee it gives is that a subsequent read or write will take from/write to the new logical location in the file. The operating system code that gets executed to provide that behaviour can do whatever it wants - probably it's just going to update an internal position associated with the file resource, but it could issue a seek to the block layer, or maybe decide to speculatively issue a 1M read at that location in case you're going to follow up with a read, or whatever. The OS is in charge here, and filesystems particularly make very, very few guarantees as to their behaviour.

If you want to reason about the performance of your filesystem interactions, there are only the most general principles:

  • Reads are either free (if the OS has successfully guessed that you'll want that data) or slow (if the OS needs to actually read something).
  • Generally, large sequential reads will offer the highest throughput, both because hardware is generally better at large sequential reads and also because the OS is more likely to perform speculative reads for you if you're reading sequentially.
  • Writes are free, unless you're writing too much. “Too much” depends on hardware, operating system policy, other software running at the same time, and the phase of the moon.
  • Writes are free because they're a lie; the OS will (generally) defer writing to the physical hardware as long as possible. Maybe you're going to overwrite that data anyway! If you do, the OS doesn't need to do the first write at all. Win!
  • If you need a write to actually be in persistent storage, you need some form of sync call. The specific details of what calls and in what order are filesystem dependent(!)
  • sync calls are often ruinously expensive; on many filesystems they require not only flushing your write to disc, but also every other write from every process in the system since the last flush.
  • Zalgo. He comes!

The bottom line is: if you need to care about file operation performance or need to care about data integrity, you probably need to learn a bit about the filesystems you're working with. Yes, this sucks.
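For the sync point above, in Rust terms (the helper name and path are made up; beyond `sync_all` mapping to fsync(2), the durability details are OS- and filesystem-dependent):

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

/// Write `data` and block until the OS reports it is on stable storage.
fn write_durably(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(data)?; // usually just lands in the OS page cache
    f.sync_all()?;      // fsync(2): flush data + metadata; often the expensive part
    // f.sync_data() is the cheaper fdatasync(2) variant (may skip some metadata).
    // Note: making the *name* of a newly created file durable can additionally
    // require syncing the parent directory, depending on the filesystem.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("durable_example.dat");
    write_durably(&path, b"must survive a crash")?;
    Ok(())
}
```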

4 Likes

Quoting: Bitcask - Wikipedia

  • Single seek to retrieve any value: Bitcask's in-memory hash-table of keys points directly to the locations on disk where the data resides. Bitcask never needs more than one disk-seek to read a value, and the operating system's file-system caching can obviate the need for disk-seeks entirely for some lookups.
  • Predictable lookup and insert performance: Read operations as well as write operations have fixed, predictable behavior. Write operations require only a seek to the end of the current file that is open for writing and an append to that file.

I am not disagreeing with you. I am trying to reconcile these two viewpoints. To the best of my knowledge, bitcask does not operate directly on block devices, but uses the underlying file system (since it stores values in segments).

In this world, how is Bitcask able to make its read/write/seek guarantees?
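For what it's worth, here is how I read that access pattern as code: a toy, hypothetical `ToyBitcask` over any `Read + Write + Seek`, with none of real Bitcask's record framing, CRCs, or hint files. The claim is only about the *number* of filesystem operations, not their wallclock cost:

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Toy bitcask-style store. All keys live in memory; each maps to
/// (offset, len) of its latest value in an append-only log.
struct ToyBitcask<F: Read + Write + Seek> {
    log: F,
    index: HashMap<Vec<u8>, (u64, u64)>,
}

impl<F: Read + Write + Seek> ToyBitcask<F> {
    fn new(log: F) -> Self {
        ToyBitcask { log, index: HashMap::new() }
    }

    /// One seek to the end + one append: the "fixed, predictable" write.
    fn put(&mut self, key: &[u8], value: &[u8]) -> io::Result<()> {
        let off = self.log.seek(SeekFrom::End(0))?;
        self.log.write_all(value)?;
        self.index.insert(key.to_vec(), (off, value.len() as u64));
        Ok(())
    }

    /// One seek + one read: never more than one "disk-seek" per lookup.
    fn get(&mut self, key: &[u8]) -> io::Result<Option<Vec<u8>>> {
        let Some(&(off, len)) = self.index.get(key) else { return Ok(None) };
        self.log.seek(SeekFrom::Start(off))?;
        let mut buf = vec![0u8; len as usize];
        self.log.read_exact(&mut buf)?;
        Ok(Some(buf))
    }
}

fn main() -> io::Result<()> {
    let mut db = ToyBitcask::new(io::Cursor::new(Vec::new()));
    db.put(b"lang", b"rust")?;
    db.put(b"lang", b"Rust")?; // old value stays in the log; the index moves on
    assert_eq!(db.get(b"lang")?, Some(b"Rust".to_vec()));
    Ok(())
}
```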

I think the way to reconcile this is that you're misinterpreting the guarantees Bitcask is making. I think the behaviour it guarantees to be predictable and fixed is how it interacts with the filesystem - e.g. a write is a seek() to the end, plus an appending write().

Bitcask cannot guarantee wallclock performance - what if I have a highest-priority process writing continuously? Bitcask will submit its predictable seek() + write() to the fs, but the OS's disc scheduler determines when that actually gets sent to the hardware.

1 Like

Although I acknowledge that both the FUSE counterexample and the high-priority-other-process counterexample show that in extreme cases we cannot guarantee anything, I refuse to take this as "we can't guarantee anything."

In particular, I do think we can make interesting / useful low-latency guarantees in the case of Bitcask, and the interesting question then becomes: what realistic assumptions can we make that ensure things like Bitcask are O(1) / low-latency?

I think that one thing we're assuming here is that hardware even knows what "seeking" is, and that it's not a purely software concept. Looking at both the ATA specifications (in particular the ATA8 command set, or ATA8-ACS) and the NVMe NVM command set: ATA/SATA had a historical SEEK command, with various opcodes, but it is no longer defined (so I'm assuming it was eliminated from the specifications before ATA8). NVMe has never had a SEEK command of any kind. In the SCSI standards, seeking is mentioned, but no such operation appears to actually be defined for the OS to control. So for the large majority of disk controllers out there, from what I can tell at least, seeking is more of a "this may or may not happen" kind of thing. As others have said, it really does depend on the OS and underlying FS driver. For all we know, your fancy NVMe drive from Samsung may not seek at all and may simply receive a list of sectors to read from the OS, or there may be a vendor-specific command that the disk can receive to prepare it for reading from a set of sectors. I'm pretty sure, though, that fseek, and the underlying POSIX lseek syscall, is more of a "set position here" kind of thing in software, and very few commands are given to the hardware unless speculative reads/writes are involved.

This isn't to say that seeking is a software-only concept, though. The standards I listed above leave open the possibility of vendor-specific commands (if not entire vendor-specific command sets), and the specifications are extremely abstract in places, only specifying the end results of a command, for example, and not how hardware should actually implement the operations associated with it. At a guess, however, I would say that the answer to the question -- whether seeking is O(1) -- is no. However, the performance of modern disks (even if speculative disk accesses are taken into account and are done by the OS) is so good now that this doesn't matter.

3 Likes

This is a great point. seek may not even make all that much sense on SSDs.

There is still an FS component, e.g. if your file is large on ext3 or fragmented on ext4, it may have to read an indirect or extent-tree block to find the actual data blocks to request (once you actually read or write from/to the seeked position, anyway). I personally doubt it's an optimization worth pursuing; just use a reasonable FS and you'll be good. But if you want to anyway, it's another question where you'd likely find more informed answers in a kernel/FS forum.

I guess it depends on what you mean by “guarantee”. A hard real-time OS can guarantee that - in the absence of hardware failure! - your specific code will never wait more than $X ms before running. If you want that sort of guarantee then you need your underlying foundation to support that, and basically no systems do.

Probably what you can provide is a statistical guarantee - on "an idle system" with "a sensible fs" and certain storage hardware you can expect a 99% latency of $Y ms.

The hard guarantees you can provide are things like “lookup requires exactly one read() of size…”, “insert requires one lookup + one non-overwriting write() of size…”, “the access pattern is CoW friendly as data is never modified after write”, and so on.

You can't provide guarantees stronger than the OS provides to your code, and most filesystems provide no performance guarantees other than “we've tweaked things until we got the best performance on the workloads we care about”.

4 Likes

Seeking is also very expensive on a tape drive as it has to physically rewind the tape to the right location.

1 Like

I guess the other question is: why do you want to make a guarantee? What purpose are you trying to serve?

  • You want to say Bitcask is suitable for hard-real-time use - that you can use it when you have to be able to respond within 10ms, or the machine you're controlling will damage itself. I don't think you can make this guarantee in general, even if you were directly controlling the storage hardware. SSDs have internal operating systems performing various tasks, and I don't believe a random SSD can guarantee that every read completes in <10ms.
  • You want to say Bitcask has low-latency reads and writes. For this you want representative benchmarks, not an operational complexity guarantee.
  • You want to provide information so that people deploying Bitcask can tailor their environment to Bitcask's access patterns for maximum performance. For this you want to provide the pattern of filesystem accesses, because the purpose is precisely to let the user select a FS that performs those accesses well.

I'm not sure, but you might also have some misconceptions about time complexity? O(1) doesn't mean constant time, it means the number of operations taken is constant. Removing an element from a linked list is O(1) - it takes a constant number of memory accesses - but the time taken for those accesses can vary by 6 orders of magnitude depending on system state (L1 cache hit vs pagefault to swap) and 3 orders of magnitude due to the behaviour of your code alone (L1 cache hit vs main memory access). O(1) doesn't imply low latency, or even predictable latency. You need to measure that separately.

4 Likes