Race condition in safe code - Weak vs Drop

Here's a problem I've been working on for weeks, and am starting to understand.

I have a caching system for textures. Textures have a UUID and a level of detail (0..7). Textures currently in the cache are in the database, so requests can find them. Texture objects update the cache database on creation and on Drop. Each entry in the HashMap has a small array of possible LODs. If all possible LOD slots are empty, it's time to get rid of the entry in the HashMap.

About 12 threads update this, and about once in 12-24 hours of operation, something gets deleted that's already deleted, and the program panics.

All of this is 100% safe Rust.

The cache entries are a HashMap, and look like this:

pub struct RenderUndupTexture<D, const N: usize>
{
    /// Database of textures.
    db: HashMap<RenderTextureKey, RenderUndupTextureItem<D, N>>,
}


pub struct RenderUndupTextureItem<D, const N: usize>
{
    /// Texture LODs, if present.
    pub items: [std::sync::Weak<D>; N],
}

Note that the entries are Weak references to Arcs. That's because this is a cache only; ownership lies with whatever owns the texture item (which maps to D in the generic). The idea is that when all uses of an item have been dropped, the item is dropped from the cache.
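The cache-as-Weak pattern described above can be sketched minimally (all names here are hypothetical stand-ins, not the real types):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

/// Hypothetical minimal weak cache: owners hold Arc<T>, the cache holds Weak<T>.
struct WeakCache<K, T> {
    db: HashMap<K, Weak<T>>,
}

impl<K: std::hash::Hash + Eq, T> WeakCache<K, T> {
    fn new() -> Self {
        Self { db: HashMap::new() }
    }

    /// Insert a weak handle to an item owned elsewhere.
    fn insert(&mut self, key: K, item: &Arc<T>) {
        self.db.insert(key, Arc::downgrade(item));
    }

    /// Look up an item; succeeds only while some owner still holds an Arc.
    fn get(&self, key: &K) -> Option<Arc<T>> {
        self.db.get(key).and_then(|w| w.upgrade())
    }
}

fn main() {
    let mut cache = WeakCache::new();
    let texture = Arc::new(vec![0u8; 16]); // stand-in for a texture
    cache.insert("uuid-1", &texture);
    assert!(cache.get(&"uuid-1").is_some()); // owner alive: hit
    drop(texture);
    assert!(cache.get(&"uuid-1").is_none()); // owner gone: miss
}
```

The upgrade-on-lookup is what makes the cache non-owning: dropping the last Arc kills the entry logically, even before anything removes it from the map.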

Here's what happens at Drop:

impl<D, const N: usize> RenderUndupTexture<D, N> {
    fn remove_by_drop(&mut self, key1: RenderTextureIdent) {
        let db_key = RenderTextureKey {
            image_type: key1.image_type,
            uuid: key1.uuid_key_range.uuid_key.1,
        };
        match self.db.get_mut(&db_key) {
            Some(item) => {
                //  Known UUID, no new item needed, just update a slot
                let empty = item.remove(key1.uuid_key_range.uuid_key.0);
                if empty {
                    self.db.remove(&db_key);
                }
            }
            None => {
                panic!("Texture database remove of unknown item {:?}", key1);
            }
        }
    }
}
impl Drop for RenderTexture {
    fn drop(&mut self) {
        log::debug!(
            "Drop texture {:?} handle {:?}",
            self.texture_ident,
            self.texture_handle
        );
        {
            let mut locked_db = self.scene_link.base.get_undup_databases_locked();
            locked_db.texture_db.remove_by_drop(self.texture_ident);
        }
    }
}

impl<D, const N: usize> RenderUndupTextureItem<D, N> {
    fn is_empty(&self) -> bool {
        // Empty if no slot still upgrades to a live Arc.
        self.items.iter().all(|i| i.upgrade().is_none())
    }

    /// Remove by lod. Returns true if all entries now empty.
    /// Managed by RenderUndupTexture.
    fn remove(&mut self, lod: TextureLod) -> bool {
        self.items[lod.as_usize()] = std::sync::Weak::new(); // removes the item
        self.is_empty()
    }
}

This almost works, and fails about once a day in testing.

What I think is happening is a race condition where an Arc item ceases to be upgradeable, but Drop hasn't run yet. Drop is not atomic, and in this case, there's a lock on the HashMap which can result in a delay in Drop. The lock protects the HashMap, but can't protect the Arc.

Note the test in fn is_empty(). If nothing can be upgraded, then it is assumed that the entry is now empty and can be removed. But if two Arc countdowns happen in a short period, it may be possible for Drop to run from two threads, and both threads see all Weak entries as empty. So they both try to drop the item from the HashMap. The second attempt panics.

I have logs which show two Drop operations happening for different LODs in the same millisecond.

[2025-02-08T01:59:46.216Z DEBUG libscene::render::renderregistry] Add texture RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } handles ResourceHandle { refcount: 1, idx: 6472 } None
[2025-02-08T01:59:46.216Z INFO  libscene::render::renderregistry] Inserting texture in undup cache: RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture }
[2025-02-08T01:59:46.218Z DEBUG libscene::render::renderregistry] Drop texture RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(3), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } handle ResourceHandle { refcount: 4, idx: 11023 }
[2025-02-08T01:59:46.216Z DEBUG libscene::render::renderregistry] Drop texture RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } handle ResourceHandle { refcount: 1, idx: 6472 }
[2025-02-08T01:59:46.269Z ERROR libcommon::common::commonutils] ===> FATAL ERROR: Texture database remove of unknown item RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } at file libscene/src/render/renderregistry.rs, line 810 in thread Asset fetch #5. <===
[2025-02-08T01:59:51.942Z ERROR libcommon::common::commonutils] =========> Panic Texture database remove of unknown item RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } at file libscene/src/render/renderregistry.rs, line 810 in thread Asset fetch #5.

Note those last two Drop log entries. Same timestamp within a millisecond, same UUID, panic follows. So something close to what I described above is happening.

How can this be done safely?


I'm starting to see a way out of this.

pub items: [std::sync::Weak<D>; N],

needs to be changed to:

pub items: [Option<std::sync::Weak<D>>; N],

At Drop, the item will be considered empty if all Option values are None.
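A sketch of that Option-based version (using the same generic parameters as the original struct; the debug check and a usize LOD index are my additions for illustration):

```rust
use std::sync::Weak;

pub struct RenderUndupTextureItem<D, const N: usize> {
    /// Texture LODs: None means "slot already cleared by its owner's Drop".
    pub items: [Option<Weak<D>>; N],
}

impl<D, const N: usize> RenderUndupTextureItem<D, N> {
    /// Empty only when every slot has been explicitly cleared,
    /// not merely when every Weak fails to upgrade.
    fn is_empty(&self) -> bool {
        self.items.iter().all(|slot| slot.is_none())
    }

    /// Clear one LOD slot. Returns true if the whole entry is now empty.
    fn remove(&mut self, lod: usize) -> bool {
        // Debug check: the Weak being cleared must no longer be upgradeable,
        // since it is the owner's Drop that called us.
        if let Some(w) = &self.items[lod] {
            debug_assert!(w.upgrade().is_none(), "removing a still-live LOD");
        }
        self.items[lod] = None;
        self.is_empty()
    }
}

fn main() {
    let mut item = RenderUndupTextureItem::<u32, 2> {
        items: [Some(Weak::new()), Some(Weak::new())],
    };
    assert!(!item.remove(0)); // one slot still occupied
    assert!(item.remove(1));  // now all slots are None
}
```

The key change: emptiness is now decided by explicit clears, which happen exactly once per slot, rather than by upgradeability, which two racing Drops can observe simultaneously.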

For debug purposes, it's useful to check each Weak link before setting its Option to None. If .upgrade() is Some for the weak link in an Option about to be cleared, that's a bug and deserves a panic. Rust does, I think, guarantee that a weak link becomes non-upgradeable before the destructor runs. (Otherwise we'd have the curse of "re-animation", where objects can come back to life after deletion. Rust, I think, prevents that. Confirm? Microsoft Managed C++ had "re-animation". It was a bad idea.)
However, it is OK for an Option value to contain a Weak link which cannot be upgraded. That means the Arc has counted down but the destructor has not finished execution yet.
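To confirm the guarantee in question: Arc drops the strong count to zero before running the contained value's destructor, so upgrade() inside Drop always fails and re-animation is impossible. A small single-threaded check (the thread-local is only scaffolding to hand the Weak to the Drop impl):

```rust
use std::cell::RefCell;
use std::sync::{Arc, Weak};

thread_local! {
    // Lets the Drop impl see a Weak to itself; illustration only.
    static SELF_WEAK: RefCell<Option<Weak<Probe>>> = RefCell::new(None);
}

struct Probe;

impl Drop for Probe {
    fn drop(&mut self) {
        SELF_WEAK.with(|w| {
            let weak = w.borrow_mut().take().unwrap();
            // The strong count hit zero before this destructor started,
            // so the object cannot be resurrected from the Weak.
            assert!(weak.upgrade().is_none());
            assert_eq!(weak.strong_count(), 0);
        });
    }
}

fn main() {
    let probe = Arc::new(Probe);
    SELF_WEAK.with(|w| *w.borrow_mut() = Some(Arc::downgrade(&probe)));
    drop(probe); // runs Probe::drop; its assertions hold
}
```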

Comments?

Why not remove the panic!()? It should be harmless for the removal code to run twice, no? You understand why it happens, and you previously thought it was impossible to run twice, but now you understand it is possible, so acknowledge that in the code.
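A tolerant version might look like this (a simplified sketch with stand-in types, not the real ones, to show the idempotent shape):

```rust
use std::collections::HashMap;

struct Cache {
    db: HashMap<String, Vec<u32>>, // stand-ins for the real key and item types
}

impl Cache {
    /// Idempotent removal: a second call for the same key is a no-op,
    /// since two concurrent Drops may both observe the entry as empty.
    fn remove_by_drop(&mut self, key: &str) {
        if self.db.remove(key).is_none() {
            // Expected under the race described above; log instead of panicking.
            eprintln!("remove_by_drop: {key:?} already removed (benign race)");
        }
    }
}

fn main() {
    let mut cache = Cache {
        db: HashMap::from([("uuid-1".to_string(), vec![4])]),
    };
    cache.remove_by_drop("uuid-1"); // removes the entry
    cache.remove_by_drop("uuid-1"); // second call is harmless
    assert!(cache.db.is_empty());
}
```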


Where two things have to be in sync, I prefer to be paranoid about checking. But I agree that taking out the panic would probably work.

That's a good principle! But you can also look at this as making the operation idempotent, which is another powerful principle.


I think I found another instance of this class of bug in the Rend3 crate. The pattern where the drop of an Arc-held struct causes deletion from a hash comes up reasonably often in graphics work.

Indeed, the atomicity guarantee that Arc/Weak provide is that the contained object is dropped at most once (and its memory freed at most once). No guarantees about how that is ordered w.r.t. other Arcs.

One possible solution is to move the LOD index into the RenderTextureKey. Then you have something simpler like HashMap<RenderTextureKey, Weak<RenderUndupTextureItem>>. You don't have this race condition, since the Drop for RenderTexture only removes its own key and therefore depends on the liveness of only its own Arc, not on other Arcs whose refcount may concurrently reach zero. Presumably, in the implementation of Drop for RenderTexture, that refcount is definitely zero (assuming you do not have two RenderTexture objects with the same RenderTextureIdent).
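A sketch of that flattened layout (type names hypothetical), where each Drop removes only its own (uuid, lod) key and never inspects sibling slots:

```rust
use std::collections::HashMap;
use std::sync::Weak;

type Uuid = u128; // stand-in for the real UUID type
type Lod = u8;

/// One entry per (uuid, lod) pair instead of an array of slots per uuid.
struct FlatTextureCache<D> {
    db: HashMap<(Uuid, Lod), Weak<D>>,
}

impl<D> FlatTextureCache<D> {
    /// Called from the texture's Drop: removes exactly this texture's key.
    /// No scan over other LODs, so no decision based on other Arcs' liveness.
    fn remove_by_drop(&mut self, uuid: Uuid, lod: Lod) {
        self.db.remove(&(uuid, lod));
    }
}

fn main() {
    let mut cache: FlatTextureCache<Vec<u8>> = FlatTextureCache { db: HashMap::new() };
    cache.db.insert((1, 3), Weak::new());
    cache.db.insert((1, 4), Weak::new());
    cache.remove_by_drop(1, 4); // only (1, 4) goes away
    assert!(cache.db.contains_key(&(1, 3)));
    assert!(!cache.db.contains_key(&(1, 4)));
}
```

The trade-off is losing the grouped per-UUID view; if something needs "all LODs for this UUID", it has to scan or keep a secondary index.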


I coded up the paranoid form earlier today, and it's been running for seven hours now. But the buggy version got that far. Probably need about 48 hours of run time to validate this works.

(This is my metaverse client. So I logged it into Second Life and sat an avatar in one of the little tour cars which drive around the virtual world. The tour runs about an hour or so. I'll leave it running until I hit 48 hours or something crashes.)

I'm wondering:

Does temporarily upgrading a weak reference to a strong reference affect the intended logic?

A more neutral approach might be:

    fn is_empty(&self) -> bool {
        self.items.iter().all(|item| item.strong_count() == 0)
    }

Checking the strong count for other than debug purposes is not a safe operation. See this note.


Hmm, I believe checking if the strong count is nonzero is not a reliable operation.

However, once the strong count reaches zero, it can never increase again, making this check definitive.

The situation looks like this:

if some_weak.strong_count() == 0 {
    // We can be certain that no strong references remain
} else {
    // We have no reliable information about the actual strong count
}

If you look closely, the test in that issue checks for strong_count() == 1, which indeed doesn’t provide meaningful information.
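That one-way property is easy to check (a sketch):

```rust
use std::sync::{Arc, Weak};

fn main() {
    let strong = Arc::new(42u32);
    let weak: Weak<u32> = Arc::downgrade(&strong);

    // While any Arc exists, the count can change under us at any moment,
    // so a nonzero reading tells us nothing durable.
    assert!(weak.strong_count() >= 1);

    drop(strong);

    // Once it hits zero it can never rise again: upgrade() needs an
    // existing Arc to clone from, and none remain.
    assert_eq!(weak.strong_count(), 0);
    assert!(weak.upgrade().is_none());
}
```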


To clarify, the linked issue is discussing a different scenario. The relevant part of the issue is:

while Arc::strong_count(&a) != 1 {}
unsafe { *a.get() += 1; }

Here, the code spins in a loop, waiting for the strong count to reach 1, and then uses that outcome to synchronize data access. As discussed in the issue, this approach is flawed.

Later, the original author of the issue even states:

My use case was something like if strong == 1 { unsynchronized access } else { synchronized access } as a performance optimization

This, of course, cannot work, because the strong count is not a reliable synchronization mechanism for data access.