Keyed BTreeSet?

When dealing with Optimizing weird stuff? we noticed we need a BTreeSet that can use a custom key function, ideally through monomorphization. We have these:

/// A source of data of some specified kind.
pub trait DataSource<T: Kind>: DataSourceBase {
    /// Returns the values associated with this kind.
    fn get_values(&self) -> T::Values;

    /// Returns the value associated with this kind.
    ///
    /// This is the same as [`get_values`] but allows using the singular name,
    /// for kinds with up to only one value.
    fn get_value(&self) -> Option<T> where T: Kind<Values=Option<T>> {
        self.get_values()
    }
}

/// A kind of value that can be retrieved from a data source.
pub trait Kind {
    /// Values of this kind.
    type Values: IntoIterator<Item=Self>;
}

/// A kind of value that can be retrieved from a data source, which can
/// override values of the same kind.
pub trait OverridableKind: Kind {
    /// The key, as in a map key, that gets overridden.
    type Key: Hash + Eq + Ord;

    /// Returns the key for use in a key-value map such as a `BTreeMap` or
    /// a set such as `BTreeSet`.
    fn as_key(&self) -> &Self::Key;
}

and, well, for an EffectiveDataSource we can just do:

        /// Forwards to the inner [`DataSource`] as there's no filtering we
        /// can/need to do here.
        impl trait<K> EffectiveDataSourceHelper<K, BTreeSet<K>>
        where
            K: Kind<Values=BTreeSet<K>> + OverridableKind,
            T: DataSource<K>
        {
            fn get_values_impl(&self) -> BTreeSet<K> {
                self.0.get_values()
            }
        }

but this doesn't work for a CombinedDataSource that takes multiple other DataSources like in a Vec.

fn get_values(&self) -> BTreeSet<K> {
  self.vec_of_data_sources...
  let result: BTreeSet<K> = ...
  // a BTreeSet, deduplicated by K.as_key, such that only the first
  // occurrence of such an as_key (in the order specified by vec_of_data_sources) is included in the set.
}

Currently the only thing we can think of is to collect everything into a Vec, sort it by as_key, dedup it by as_key, and then convert it into a BTreeSet again. And maybe add a debug check in EffectiveDataSource that warns if the BTreeSet contains duplicated keys.

Wouldn't a newtype suffice which defers its Ord implementation to whatever is returned by .as_key()? That function could then be pulled out into a trait and added as a bound on impl Ord for Newtype.

3 Likes

This is what we have for EffectiveDataSource:

        /// Filters the inner [`DataSource`] such that only the first instance
        /// of a given key is returned.
        impl trait<K> EffectiveDataSourceHelper<K, Vec<K>>
        where
            K: Kind<Values=Vec<K>> + OverridableKind,
            T: DataSource<K>
        {
            fn get_values_impl(&self) -> Vec<K> {
                let mut values: Vec<K> = self.0.get_values();
                let mut seen = HashSet::<&K::Key>::new();
                let mut keep = Vec::with_capacity(values.len());
                for value in &values {
                    keep.push(seen.insert(value.as_key()));
                }
                drop(seen);
                let mut keep = keep.into_iter();
                values.retain(|_| keep.next().unwrap());
                values
            }
        }

We would want something like this for CombinedDataSource:

        /// Filters the inner [`DataSource`]s such that only the first instance
        /// of a given key is returned.
        impl trait<K> CombinedDataSourceHelper<K, BTreeSet<K, OverridableKindKeyExtractor>>
        where
            K: Kind<Values=BTreeSet<K, OverridableKindKeyExtractor>> + OverridableKind,
            T: DataSource<K>
        {
            fn get_values_impl(&self) -> BTreeSet<K, OverridableKindKeyExtractor> {
                let result = BTreeSet::new();
                // assuming "extend" discards the duplicates from the argument, not from itself.
                self.data_sources.iter().foreach(|x| result.extend(x.get_values()));
                result
            }
        }

We could use a transparent newtype and unsafe transmute tho. :‌)

        /// Filters the inner [`DataSource`]s such that only the first instance
        /// of a given key is returned.
        impl trait<K> CombinedDataSourceHelper<K, BTreeSet<K>>
        where
            K: Kind<Values=BTreeSet<K>> + OverridableKind,
            T: DataSource<K>
        {
            fn get_values_impl(&self) -> BTreeSet<K> { unsafe {
                let result = BTreeSet::new();
                // assuming "extend" discards the duplicates from the argument, not from itself.
                // this totally doesn't break when x.get_values() has duplicates /s
                self.data_sources.iter().foreach(|x| result.extend(mem::transmute<BTreeSet<Newtype<K>>, _>(x.get_values())));
                mem::transmute(result)
            } }
        }

(Wish you could apply #[fundamental] to user types...)

We managed to find a workaround but it requires HKTs.

This:

/// A wrapper for [`OverridableKind`] that acts like the key for Eq/Hash/Ord.
#[repr(transparent)]
#[derive(Debug)]
pub struct EffectiveKind<T: OverridableKind>(pub T);

impl<T: Kind<Values=BTreeSet<T>> + OverridableKind> Kind for EffectiveKind<T> {
    type Values = BTreeSet<Self>;
}

impl_trait! {
    impl<T: Kind + OverridableKind> EffectiveKind<T> where Self: Kind {
        impl trait OverridableKind {
            type Key = T::Key;

            fn as_key(&self) -> &Self::Key {
                self.0.as_key()
            }
        }
        impl trait PartialEq {
            fn eq(&self, other: &Self) -> bool {
                self.as_key().eq(other.as_key())
            }
            fn ne(&self, other: &Self) -> bool {
                self.as_key().ne(other.as_key())
            }
        }
        impl trait Eq {
        }
        impl trait PartialOrd {
            fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
                Some(self.cmp(other))
            }
        }
        impl trait Ord {
            fn cmp(&self, other: &Self) -> std::cmp::Ordering {
                self.as_key().cmp(other.as_key())
            }
        }
        impl trait std::hash::Hash {
            fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
                self.as_key().hash(state);
            }
        }
    }
}

doesn't work as well as we wanted it to. We also get this error:

error[E0308]: mismatched types
   --> src/data/effective.rs:154:17
    |
153 |             fn get_values_impl(&self) -> BTreeSet<EffectiveKind<K>> {
    |                                          -------------------------- expected `BTreeSet<EffectiveKind<K>>` because of return type
154 |                 self.0.get_values()
    |                 ^^^^^^^^^^^^^^^^^^^ expected struct `BTreeSet`, found associated type
    |
    = note:       expected struct `BTreeSet<EffectiveKind<K>>`
            found associated type `<EffectiveKind<K> as Kind>::Values`
    = help: consider constraining the associated type `<EffectiveKind<K> as Kind>::Values` to `BTreeSet<EffectiveKind<K>>`
    = note: for more information, visit https://doc.rust-lang.org/book/ch19-03-advanced-traits.html

error: aborting due to previous error

related to this:

        /// Forwards to the inner [`DataSource`] as there's no filtering we
        /// can/need to do here.
        impl trait<K> EffectiveDataSourceHelper<K, BTreeSet<EffectiveKind<K>>>
        where
            K: Kind<Values=BTreeSet<EffectiveKind<K>>> + OverridableKind,
            T: DataSource<EffectiveKind<K>>,
            EffectiveKind<K>: OverridableKind,
        {
            fn get_values_impl(&self) -> BTreeSet<EffectiveKind<K>> {
                self.0.get_values()
            }
        }

but we haven't bothered to figure out the bounds here. A key function BTreeSet<K, EffectiveKey> would go a long way here. :‌/

Edit: Fixed the error, this seems to work.

        /// Forwards to the inner [`DataSource`] as there's no filtering we
        /// can/need to do here.
        impl trait<K> EffectiveDataSourceHelper<
            EffectiveKind<K>, 
            BTreeSet<EffectiveKind<K>>,
        >
        where
            K: Kind<Values=BTreeSet<K>> + OverridableKind,
            T: DataSource<EffectiveKind<K>>,
            EffectiveKind<K>: Kind<Values=BTreeSet<EffectiveKind<K>>>,
            EffectiveKind<K>: OverridableKind,
        {
            fn get_values_impl(&self) -> BTreeSet<EffectiveKind<K>> {
                self.0.get_values()
            }
        }

Note that if BTreeSet took a key function (in the sense of slice::sort_by_key) this would just be:

        /// Forwards to the inner [`DataSource`] as there's no filtering we
        /// can/need to do here.
        impl trait<K> EffectiveDataSourceHelper<K, BTreeSet<K, EffectiveKey>>
        where
            K: Kind<Values=BTreeSet<K, EffectiveKey>> + OverridableKind,
            T: DataSource<K>,
        {
            fn get_values_impl(&self) -> BTreeSet<K, EffectiveKey> {
                self.0.get_values()
            }
        }

and none of the EffectiveKind stuff would be needed.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.