Here's a problem I've been working on for weeks, and am starting to understand.
I have a caching system for textures. Textures have a UUID and a level of detail (0..7). Textures which are currently in the cache are in the database, so requests can find them. Texture objects update the cache database on new and Drop. Each entry in the HashMap has a small array of possible LODs. If all possible LODs are empty, then it's time to get rid of the entry in the HashMap.
About 12 threads update this, and about once in 12-24 hours of operation, something gets deleted that's already deleted, and the program panics.
All this is is 100% safe Rust.
The cache entries are a HashMap, and look like this:
pub struct RenderUndupTexture<D, const N: usize>
{
/// Database of textures.
db: HashMap<RenderTextureKey, RenderUndupTextureItem<D, N>>,
}
pub struct RenderUndupTextureItem<D, const N: usize>
{
/// Texture LODs, if present.
pub items: [std::sync::Weak<D>; N],
}
Note that the entries are a Weak Arc ref. That's because this is a cache only. Ownership is with whatever owns the RenderTextureItem. (which maps to D in the generic). The idea is that when all uses of the item have been dropped, the item is dropped from the cache.
Here's what happens at Drop:
impl RenderUndupTexture {
fn remove_by_drop(&mut self, key1: RenderTextureIdent) {
let db_key = RenderTextureKey {
image_type: key1.image_type,
uuid: key1.uuid_key_range.uuid_key.1,
};
match self.db.get_mut(&db_key) {
Some(item) => {
// Known UUID, no new item needed, just update a slot
let empty = item.remove(key1.uuid_key_range.uuid_key.0);
if empty {
self.db.remove(&db_key);
}
}
None => {
panic!("Texture database remove of unknown item {:?}", key1);
}
}
}
}
impl Drop for RenderTexture {
fn drop(&mut self) {
log::debug!(
"Drop texture {:?} handle {:?}",
self.texture_ident,
self.texture_handle
);
{
let mut locked_db = self.scene_link.base.get_undup_databases_locked();
locked_db.texture_db.remove_by_drop(self.texture_ident);
}
}
}
impl RenderUndupTextureItem {
fn is_empty(&self) -> bool {
!self.items.iter().any(|i| {
if let Some(item) = i.upgrade() {
true
} else {
false
}
})
}
/// Remove by lod. Returns true if all entries now empty.
/// Managed by RenderUndupTexture.
fn remove(&mut self, lod: TextureLod) -> bool {
self.items[lod.as_usize()] = std::sync::Weak::new(); // removes the item
self.is_empty()
}
}
This almost works, and fails about once a day in testing.
What I think is happening is a race condition where an Arc item ceases to be upgradeable, but Drop hasn't run yet. Drop is not atomic, and in this case, there's a lock on the HashMap which can result in a delay in Drop. The lock protects the HashMap, but can't protect the Arc.
Note the test in fn is_empty()
. If nothing can be upgraded, then it is assumed that the entry is now empty and can be removed. But if two Arc countdowns happen in a short period, it may be possible for Drop to run from two threads, and both threads see all Weak entries as empty. So they both try to drop the item from the HashMap. The second attempt panics.
I have logs which show two Drop operations happening for different LODs in the same millisecond.
[2025-02-08T01:59:46.216Z DEBUG libscene::render::renderregistry] Add texture RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } handles ResourceHandle { refcount: 1, idx: 6472 } None
[2025-02-08T01:59:46.216Z INFO libscene::render::renderregistry] Inserting texture in undup cache: RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture }
[2025-02-08T01:59:46.218Z DEBUG libscene::render::renderregistry] Drop texture RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(3), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } handle ResourceHandle { refcount: 4, idx: 11023 }
[2025-02-08T01:59:46.216Z DEBUG libscene::render::renderregistry] Drop texture RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } handle ResourceHandle { refcount: 1, idx: 6472 }
[2025-02-08T01:59:46.269Z ERROR libcommon::common::commonutils] ===> FATAL ERROR: Texture database remove of unknown item RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } at file libscene/src/render/renderregistry.rs, line 810 in thread Asset fetch #5. <===
[2025-02-08T01:59:51.942Z ERROR libcommon::common::commonutils] =========> Panic Texture database remove of unknown item RenderTextureIdent { uuid_key_range: TextureUuidKeyRange { uuid_key: (TextureLod(4), 1f73892c-925d-d9b3-3ea8-e38dd9d87554), highest_lod_available: TextureLod(0) }, image_type: DiffuseTexture } at file libscene/src/render/renderregistry.rs, line 810 in thread Asset fetch #5.
Note those last two Drop log entries. Same timestamp within a millisecond, same UUID, panic follows. So something close to what I described above is happening.
How can this be done safely?