Is std::fs::Seek O(1) or fs-dependent?

Although I acknowledge that both the FUSE counterexample and the high-priority-other-process counterexample show that we can't guarantee anything in extreme cases, I refuse to take this as "we can't guarantee anything, period."

In particular, I do think we can make interesting / useful low-latency guarantees in the case of Bitcask, and the interesting question then becomes: what realistic assumptions can we make that ensure things like Bitcask are O(1) / low-latency?

I think that one thing we're assuming here is that the hardware even knows what "seeking" is, and that it's not a purely software concept. Looking at both the ATA specifications (in particular the ATA8 command set, ATA8-ACS) and the NVMe NVM command set: ATA/SATA had a historical SEEK command, with various opcodes, but it is no longer defined (so I'm assuming it was eliminated from the specifications before ATA8). NVMe has never had a SEEK command of any kind. In the SCSI standards, seeking is mentioned, but no such operation appears to actually be defined for the OS to invoke. So for the large majority of disk controllers out there, from what I can tell at least, seeking is more of a "this may or may not happen" kind of thing. As others have said, it really does depend on the OS and the underlying FS driver. For all we know, your fancy NVMe drive from Samsung may not seek at all and may simply receive a list of sectors to read from the OS, or there may be a vendor-specific command the disk can receive to prepare it for reading from a set of sectors. I'm pretty sure, though, that fseek, and the underlying POSIX lseek syscall, is more of a "set the position here" operation in software, and very few commands are given to the hardware unless speculative reads/writes are involved.
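As a quick illustration of that "set the position here" behaviour (a sketch of what you can observe on Linux, not a claim about every kernel): seeking far past EOF succeeds instantly, because lseek(2) only updates the offset stored in the open file description -- no command reaches the storage device until you actually read or write.

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("seek_demo.txt");
    let mut f = File::create(&path)?;
    f.write_all(b"hello")?; // a 5-byte file

    // Seeking a terabyte past EOF succeeds immediately: the kernel
    // just records the new offset. No I/O is issued here.
    let pos = f.seek(SeekFrom::Start(1 << 40))?;
    assert_eq!(pos, 1 << 40);
    println!("offset is now {pos}");

    std::fs::remove_file(&path)?;
    Ok(())
}
```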

This isn't to say that seeking is a software-only concept, though. The standards I listed above leave open the possibility of vendor-specific commands (if not entire vendor-specific command sets), and the specifications are extremely abstract in places, sometimes specifying only the end results of a command and not how hardware should actually implement the operations associated with it. At a guess, however, I would say that the answer to the question -- whether seeking is O(1) -- is no. However, the performance of modern disks (even if speculative disk accesses are taken into account and are done by the OS) is so good now that this doesn't matter.

3 Likes

This is a great point. seek may not even make all that much sense on SSDs.

There is still an FS component: e.g. if your file is large on ext3, or fragmented on ext4, the kernel may have to read an indirect block or extent node to find the actual data blocks to request (once you actually read or write from/to the seeked position, anyway). I personally doubt it's an optimization worth pursuing; just use a reasonable FS and you'll be good. But if you want to anyway, it's another question where you'd likely find more informed answers in a kernel/FS forum.

I guess it depends on what you mean by “guarantee”. A hard real-time OS can guarantee that - in the absence of hardware failure! - your specific code will never wait more than $X ms before running. If you want that sort of guarantee then you need your underlying foundation to support that, and basically no systems do.

Probably what you can provide is a statistical guarantee - on “an idle system” with “a sensible fs” and certain storage hardware you can expect a 99% latency of $Yms.

The hard guarantees you can provide are things like “lookup requires exactly one read() of size…”, “insert requires one lookup + one non-overwriting write() of size…”, “the access pattern is CoW friendly as data is never modified after write”, and so on.
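To make the "exactly one read()" style of guarantee concrete, here's a minimal Unix-only sketch of my own (hypothetical names; not Bitcask's actual code), assuming an in-memory index that maps each key to the offset and length of its value in an append-only log. The store can promise one positioned read per lookup -- it cannot promise how long that read takes.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::os::unix::fs::FileExt; // read_exact_at: positioned read, no seek syscall

// Hypothetical index entry: where one value lives in the append-only log.
struct Entry { offset: u64, len: usize }

// Exactly one contiguous read() per lookup: this is the O(1)-type
// guarantee the store itself can make. The latency of that read is
// up to the OS, the filesystem, and the device.
fn lookup(log: &File, index: &HashMap<String, Entry>, key: &str)
    -> std::io::Result<Option<Vec<u8>>>
{
    let Some(e) = index.get(key) else { return Ok(None) };
    let mut buf = vec![0u8; e.len];
    log.read_exact_at(&mut buf, e.offset)?;
    Ok(Some(buf))
}

fn main() -> std::io::Result<()> {
    use std::io::Write;
    let path = std::env::temp_dir().join("bitcask_sketch.log");
    File::create(&path)?.write_all(b"XXXXvalue1")?; // 4-byte header, then a value
    let log = File::open(&path)?;
    let mut index = HashMap::new();
    index.insert("k".to_string(), Entry { offset: 4, len: 6 });
    assert_eq!(lookup(&log, &index, "k")?, Some(b"value1".to_vec()));
    println!("lookup ok");
    std::fs::remove_file(&path)?;
    Ok(())
}
```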

You can't provide guarantees stronger than the OS provides to your code, and most filesystems provide no performance guarantees other than “we've tweaked things until we got the best performance on the workloads we care about”.

4 Likes

Seeking is also very expensive on a tape drive as it has to physically rewind the tape to the right location.

1 Like

I guess the other question is: why do you want to make a guarantee? What purpose are you trying to serve?

  • You want to say Bitcask is suitable for hard-real-time use - that you can use it when you have to be able to respond within 10ms, or the machine you're controlling will damage itself. I don't think you can make this guarantee in general, even if you were directly controlling the storage hardware. SSDs have internal operating systems performing various tasks, and I don't believe a random SSD can guarantee that every read completes in <10ms.
  • You want to say Bitcask has low-latency reads and writes. For this you want representative benchmarks, not an operational complexity guarantee.
  • You want to provide information so that people deploying Bitcask can tailor their environment to Bitcask's access patterns for maximum performance. For this you want to provide the pattern of filesystem accesses, because the purpose is precisely to let the user select a FS that performs those accesses well.

I'm not sure, but you might also have some misconceptions about time complexity? O(1) doesn't mean constant time; it means the number of operations taken is constant. Removing an element from a linked list is O(1) -- it takes a constant number of memory accesses -- but the time taken for those accesses can vary by 6 orders of magnitude depending on system state (L1 cache hit vs. page fault to swap) and by 3 orders of magnitude due to the behaviour of your code alone (L1 cache hit vs. main memory access). O(1) doesn't imply low latency, or even predictable latency. You need to measure that separately.
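You can see this effect yourself with a rough experiment (an illustration, not a benchmark; the numbers depend entirely on your machine): time the same O(1) operation -- a single indexed load -- when it stays in L1 versus when most accesses miss the caches and TLB.

```rust
use std::time::Instant;

fn main() {
    const BIG: usize = 1 << 28; // 256 MiB buffer, mostly cache-cold
    let small = vec![1u8; 64];  // fits easily in L1
    let big = vec![1u8; BIG];
    let mut sum = 0u64;

    let t = Instant::now();
    for i in 0..1_000_000 {
        sum += small[i % 64] as u64; // same O(1) op, cache-hot
    }
    let hot = t.elapsed();

    let mut idx = 0usize;
    let t = Instant::now();
    for _ in 0..1_000_000 {
        sum += big[idx] as u64;          // same O(1) op, mostly cache/TLB-cold
        idx = (idx + 4097 * 4096) % BIG; // huge stride: each access on a new page
    }
    let cold = t.elapsed();

    // The checksum stops the optimizer from deleting the loops.
    println!("hot: {hot:?}  cold: {cold:?}  checksum: {sum}");
}
```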

4 Likes

Consider the following statements:

  1. there exists a file-system / OS / cpu-config / ssd-brand combo where Bitcask cannot make useful O(1)-time read/write-start guarantees

  2. there exists a file-system / OS / cpu-config / ssd-brand combo where Bitcask can make useful O(1)-time read/write-start guarantees

You have definitely proved #1 by providing a concrete example. It sounds like you are trying to argue that #2 is impossible, which I think is very difficult to do.

AFAIK, #2 is possible, but not practical to pursue in a general-purpose system. It would hold if the system were specifically designed to run only in hard-real-time contexts, but in other cases it just isn't worth the time, and so it would probably not be achieved deliberately.

In order for this to be productive I think you need to be more specific about what a “useful guarantee” is.

It seems to me that Bitcask makes a bunch of useful guarantees - that writes are append-only, that reads require exactly one, contiguous read(), and so on. Those are O(1)-type guarantees. They're not wallclock time or predictability guarantees, but that's not what O(1) is implying anyway.

If you want a wallclock guarantee, that's a property of the system, of which Bitcask is only a small part.

4 Likes

The issue is not the question but that I posted to the wrong forum. In a forum dedicated to low-latency KV stores, the community would be focused on "under what conditions could this work?", but here the focus seems to be "pick extreme outliers to prove it cannot work in general."

The focus here is to read the question literally, I think, without trying to guess the context. Of course, in a KV-store community the context would already be provided.

2 Likes

Proving this in the positive requires only one configuration. Proving this in the negative requires accounting for all possible configurations; yet, for whatever reason, the focus here is on the latter.

Expanding a bit on @Cerber-Ursi 's point: Your original question, as reflected in the title of the thread, is asking whether this can be made to work in general. If instead you had titled it something like "How to achieve bounded disk seeks," I suspect the conversation would have gone differently.

1 Like

Ah, I think I see the issue here now. Somewhere in this thread -- at Is std::fs::Seek O(1) or fs-dependent? - #13 by RAOF in particular -- I saw the original question as having been resolved and the discussion shifting. I guess people skipped that message.

Hmmm... If we want to ensure anything about anything, we cannot rely on assumptions.

I can sense where the confusion lies in the question.

Bottom line is that no application can make guarantees about anything. One has to take into account the underlying operating system and whatever resources it is using.

After all, simply getting from one instruction to the next in any program under most common operating systems has no upper bound on the time it may take. What if an interrupt occurs? What if the process is suspended? What if the machine goes to sleep?
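You can watch this happen with a few lines of code (a quick sketch; the numbers depend entirely on your machine and OS load): record the worst gap between two consecutive loop iterations and compare it with the average one.

```rust
use std::time::Instant;

fn main() {
    const N: u32 = 1_000_000;
    let mut max_ns = 0u128;
    let mut total_ns = 0u128;
    let mut prev = Instant::now();
    for _ in 0..N {
        let now = Instant::now();
        let gap = now.duration_since(prev).as_nanos();
        total_ns += gap;
        if gap > max_ns { max_ns = gap; }
        prev = now;
    }
    // On a desktop OS the worst gap is usually orders of magnitude larger
    // than the average: an interrupt fired, or the scheduler preempted the
    // thread between two "adjacent" iterations.
    println!("avg: {} ns, worst: {} ns", total_ns / N as u128, max_ns);
}
```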

I think the gap between the responses here vs Rust in low latency environments shows that:

  1. it is very hard to engineer stuff

  2. it is very easy to find an edge case to argue about (e.g. it's impossible to make that guarantee because, what if the world goes into WW3 and nukes all the CPU production facilities?)

  1. If the seek is within the currently loaded page in memory, it's just updating a pointer; that's O(1).

  2. Otherwise you go to the filesystem, which looks the block up in the block cache, and if you're in there, it's a pointer on another page. Still pretty much O(1).

  3. If the block cache lookup misses, you need to go to disk to load the next block. This may trigger a page flush to make space in the block cache, in which case you're waiting for dirty pages to be written to disk, and then you need to read the disk to get the new page. That read hits the physical disk cache on the SATA drive, where the new page may or may not be present, so you wait for the disk to cycle its cache.

Assuming an SSD, there's no physical movement to wait for, but the bad-block detection and recovery algorithm is not O(1), it randomizes where data lives on the disk, and it can interrupt the cache refresh in the SSD firmware.

And to address your original query: yes, the filesystem matters as well, as each block on disk has a pointer to the next block on disk. But how contiguous the blocks are, and how well they translate to pages in memory, depends on the type of filesystem. Most modern filesystems are journaling, so they actually have two chains of disk blocks: one for committed writes, and one for dirty blocks that have been written to. Figuring out whether you're reading a dirty or a committed block, and traversing those pointers, is not O(1) in most cases.

Back in the day we might expect seek() to wind a tape forwards or even backwards until it got to the right place. That might take a minute or more. Then subsequent reads would be much faster.

Point is, the moment you call out to the OS you cannot make any guarantees about throughput or latency.

But as I think I said earlier, on modern CPUs and operating systems you cannot even make any guarantees about how long it takes to get from one instruction of your program to the next.

If you want to make real-time assurances you need control of the whole stack from application, through OS to hardware.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.