Confusing day when my disk (NVM device) is unreliable

I got pretty confused today; in the end it seems my NVM device, or the driver for it, is buggy! I was stress-testing my database software when I got a verification error (from a check that each allocated page is referenced once and only once).

Of course, initially I assumed a bug in my software, but it seemed strange. Eventually I noticed that a particular page in my cache had a one-byte discrepancy when it was read back from disk (I put in code to check for any differences), which was odd. At one point I was convinced it was an error in my "diff" function, which compares bytes, so I looked at that carefully and wrote a test, but I couldn't get it to fail and couldn't see anything wrong with it.
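
For reference, a byte-comparison helper of the kind described above can be very small. This is a minimal sketch, not the code from the post: the name `first_difference` and its return shape are assumptions made for illustration.

```rust
/// Returns the offset of the first mismatch together with the expected and
/// actual byte at that offset, or `None` if the two buffers are identical.
/// If one buffer is shorter, the mismatch is reported at the end of the
/// shorter one, with `None` standing in for the missing byte.
pub fn first_difference(
    expected: &[u8],
    actual: &[u8],
) -> Option<(usize, Option<u8>, Option<u8>)> {
    let common = expected.len().min(actual.len());
    for i in 0..common {
        if expected[i] != actual[i] {
            return Some((i, Some(expected[i]), Some(actual[i])));
        }
    }
    if expected.len() != actual.len() {
        // One buffer has extra bytes past the common prefix.
        return Some((
            common,
            expected.get(common).copied(),
            actual.get(common).copied(),
        ));
    }
    None
}
```

Reporting the offset and both byte values, rather than a plain true/false, makes it easier to see whether the same position is affected on every run.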

So... eventually I thought: what if I try an "in-memory" pseudo-file rather than my hard drive (which is actually a rather ancient NVM Express drive)? I was amazed: the error went away. Anyway, it appears that doing a large number of writes to a file, perhaps on odd boundaries or over short ranges (maybe just a single byte), can confuse or overwhelm the driver(s) or the device. Or something...

Anyway, I just thought I would relate this strange experience. I might see if I can reproduce it with a simple test (it was quite repeatable, and I was able to put in all kinds of tracing to narrow it down). I could still be mistaken about exactly what is going on.
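
A simple reproduction attempt along those lines might look like the sketch below. It is only an illustration under stated assumptions: the function name `small_write_stress`, the single-byte write pattern, and the xorshift offset generator are all hypothetical, and the actual write pattern that triggered the problem is not known.

```rust
use std::fs::OpenOptions;
use std::io::{Error, ErrorKind, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Performs `writes` single-byte writes at pseudo-random offsets in a file of
/// `size` bytes, mirrors every write in memory, then reads the file back and
/// returns the offsets where the file and the mirror disagree.
pub fn small_write_stress(path: &Path, size: usize, writes: usize) -> std::io::Result<Vec<usize>> {
    if size == 0 {
        return Ok(Vec::new());
    }
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    let mut mirror = vec![0u8; size];
    file.write_all(&mirror)?;

    // Deterministic xorshift generator, so a failing run can be repeated.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for _ in 0..writes {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let offset = (state % size as u64) as usize;
        let value = (state >> 32) as u8;
        file.seek(SeekFrom::Start(offset as u64))?;
        file.write_all(&[value])?;
        mirror[offset] = value;
    }
    file.sync_all()?;

    // Read the whole file back and compare it with the in-memory mirror.
    let mut on_disk = Vec::with_capacity(size);
    file.seek(SeekFrom::Start(0))?;
    file.read_to_end(&mut on_disk)?;
    if on_disk.len() != size {
        return Err(Error::new(ErrorKind::InvalidData, "file length changed"));
    }
    Ok(mirror
        .iter()
        .zip(on_disk.iter())
        .enumerate()
        .filter(|(_, (a, b))| a != b)
        .map(|(i, _)| i)
        .collect())
}
```

An empty result means the file matched the mirror. Note that reading back through the same handle is likely to be served from the operating system's cache, so a clean run here would not by itself rule out a device or driver problem.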

Just in case it is relevant here are some screen-shots of the device details:
[image: screenshots of the device details]

I've heard that writing database software is a great way to learn many things you did not want to know about filesystems and disks.

Is it always only one byte? Is it always in the same position? If so, where? You aren't doing something silly like writing to the buffer while an overlapped, non-cached write is in progress, are you?

Always the same byte, in the same position. It was coming back as 0x00 rather than the correct value of (IIRC) 0x68. In the middle of a page of data.

The implementation of an "in-memory" pseudo-file is very simple (and performs correctly). The "real" file should behave the same way, and there is nothing complicated about the way I interface to the real file, so it cannot really be a bug in the interfacing (it is just a few lines of code). So it has to be a bug in the operating system or the drivers etc. In view of the age of the device drivers, it isn't THAT surprising, although I wasn't exactly expecting it!
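
To make the comparison concrete, here is a minimal sketch of the general idea: one trait with an in-memory and a file-backed implementation that should behave identically. This is not the actual code under discussion. The names `Storage`, `MemStorage`, and `FileStorage` are hypothetical, and it assumes reads past the end of the data return zeros.

```rust
use std::fs::{File, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical storage interface: random-access reads and writes.
pub trait Storage {
    fn write_at(&mut self, offset: u64, data: &[u8]) -> std::io::Result<()>;
    fn read_at(&mut self, offset: u64, buf: &mut [u8]) -> std::io::Result<()>;
}

/// In-memory pseudo-file backed by a growable byte vector.
#[derive(Default)]
pub struct MemStorage {
    bytes: Vec<u8>,
}

impl Storage for MemStorage {
    fn write_at(&mut self, offset: u64, data: &[u8]) -> std::io::Result<()> {
        let start = offset as usize;
        let end = start + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(())
    }

    fn read_at(&mut self, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
        buf.fill(0); // Assumption: bytes past the end read as zero.
        let start = (offset as usize).min(self.bytes.len());
        let end = (start + buf.len()).min(self.bytes.len());
        buf[..end - start].copy_from_slice(&self.bytes[start..end]);
        Ok(())
    }
}

/// File-backed storage using only `std::fs` and `std::io`.
pub struct FileStorage {
    file: File,
}

impl FileStorage {
    /// Creates (or truncates) the file at `path` for reading and writing.
    pub fn open(path: &Path) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(path)?;
        Ok(FileStorage { file })
    }
}

impl Storage for FileStorage {
    fn write_at(&mut self, offset: u64, data: &[u8]) -> std::io::Result<()> {
        self.file.seek(SeekFrom::Start(offset))?;
        self.file.write_all(data)
    }

    fn read_at(&mut self, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
        buf.fill(0); // Same zero-fill convention as MemStorage.
        self.file.seek(SeekFrom::Start(offset))?;
        let mut filled = 0;
        while filled < buf.len() {
            let n = self.file.read(&mut buf[filled..])?;
            if n == 0 {
                break; // End of file: the rest of `buf` stays zero.
            }
            filled += n;
        }
        Ok(())
    }
}
```

With both behind the same trait, the same sequence of operations can be replayed against each and the results compared byte for byte, which is one way to locate the first operation where the two diverge.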

[ I am using entirely safe Rust ]

How could I do one of those even if I wanted to? I am writing and reading the file using just std::fs::File calls. I suppose some kind of timing issue could arise, but I don't think that's it.

[ The reason why I don't think it is a timing issue is the test I was doing was not really a multi-thread test - although threads were used, there was no real concurrency which could introduce some kind of data race ]

The Mutex you're using for the file should preclude concurrency problems unless you open the same file more than once -- do you?

Is the difference in behavior between SimpleFileStorage and MemFile, or are you also using MultiFileStorage?

I do, but I swapped that out for a version that doesn't, and it didn't help. You can still have a data race with multiple threads and a file locked by a Mutex: it can be as simple as one thread reading the file before another thread writes it, or vice versa. But I don't believe that was the cause. I think it was simply that I did a lot of small writes to a file (maybe just single bytes) very fast, and at some level in the operating system or device drivers it didn't cope properly and something went wrong.

So the version we're looking at does allow multiple opens? Do you have the version that disallows it, just to simplify the hunt?

That's extremely unlikely on Linux with ext3/ext4 at least. In 20 years working with database customers on Linux I've only seen one corruption problem like that, and it seems to have been fixed or gone away over 10 years ago. It was also not a single byte; it was always a sequence of several zeros that appeared after a crash. And it was very rare and not reproducible.

It is Windows not Linux. My personal laptop, which is a re-conditioned cheapo, and very, very old.

[ Incidentally Windows does misbehave a bit as well, the Start menu search is always failing, I have to re-boot quite regularly! ]

You might be able to narrow it down a little by trying to reproduce it on Linux, if you haven't already. If it fails there, you know it is your bug.

If I am right about it being the device driver, I am sure it wouldn't happen on Linux.

Right, that's why if you test on Linux and it does happen, then you know you're wrong about that.

I am 90% certain it was the ancient device driver, so I am not going to spend further time investigating it, as there is little point. I wouldn't use my laptop to host a production database, those are all on Linux machines. I just thought it was a weird thing to happen so would pass it on!
