Optimizing weird stuff?

We have this... thing

        impl trait<K> EffectiveDataSourceHelper<K, Vec<K>, SealedImpl>
        where
            K: Kind<Values=Vec<K>> + OverridableKind,
            T: DataSource<K>
        {
            fn get_values_impl(&self) -> Vec<K> {
                let mut values: Vec<K> = self.0.get_values();
                let mut seen = HashSet::<&K::Key>::new();
                let mut keep = Vec::with_capacity(values.len());
                for value in &values {
                    keep.push(seen.insert(value.as_key()));
                }
                drop(seen);
                let mut keep = keep.into_iter();
                values.retain(|_| keep.next().unwrap());
                values
            }
        }

There's really not much one can do in safe rust to make this more memory efficient, without sacrificing performance, is there?

If you're allowed to change the order, this could be sort and dedup in place.

1 Like

This is O(n). sort is O(n log n).

It would avoid the HashSet tho...

Sure, but big-O isn't everything -- the constant factor may still weigh in favor of the "slower" solution if there's a reasonable bound on the problem size.

3 Likes

So you're saying something like this?

        /// Filters the inner [`DataSource`] such that only the first instance
        /// of a given key is returned.
        impl trait<K> EffectiveDataSourceHelper<K, Vec<K>, SealedImpl>
        where
            K: Kind<Values=Vec<K>> + OverridableKind,
            T: DataSource<K>
        {
            fn get_values_impl(&self) -> Vec<K> {
                let mut values: Vec<K> = self.0.get_values();
                values.sort_by(|l, r| l.as_key().cmp(r.as_key()));
                values.dedup_by(|l, r| l.as_key() == r.as_key());
                values
            }
        }

Yes, and then measure the performance to be sure! You might find that there's a crossover point in size where it makes more sense to do it one way or the other, as long as the memory use of the HashSet way is still an acceptable option.

You might also want to explore if there's anything self.0.get_values() could do to produce better data in the first place.

1 Like

You don't need the bool vector, right?

fn unique_by_key<F, K, Key>(mut values: Vec<K>, key_fn: F) -> Vec<K>
where
    F: Fn(&K) -> Key,
    Key: Eq + Hash,
{
    let mut seen = HashSet::<Key>::new();
    values.retain(|value| seen.insert(key_fn(value)));
    values
}

Edit: This fails if Key is borrowed from K, as it is in the OP, and not in a way that can be easily fixed. The HashSet will need to store &Key, but this will be borrowing from the &K arg to the retain closure, and there is no guaranteeing that said K remains after the call, because it could be e.g. a local scoped to a single iteration inside the definition of retain.


On the sort-dedup strategy, it is worth noting that sorting also allocates. You may want to use sort_unstable if memory is your greatest concern. (the cost is that it will no longer be guaranteed to return the first item with a given key; I couldn't tell whether this is a requirement of your problem)

2 Likes

In the real world, self.0.get_values() is joining together multiple DataSources which often (but not always) use a BTreeMap (or well, many nested BTreeMaps) internally. So it's basically many chunks of sorted data.

The whole point of this EffectiveDataSource stuff is to massage the data when using the ManyDataSources (uh that's not the name we're gonna go with but w/e), so this is the "produce better data" step.

For context:

/// A kind of value that can be retrieved from a data source, which can
/// override values of the same kind.
pub trait OverridableKind: Kind {
    /// The key, as in a map key, that gets overridden.
    type Key: Hash + Eq + Ord;

    /// Returns the key for use in a key-value map such as a `BTreeMap` or
    /// a set such as `BTreeSet`.
    fn as_key(&self) -> &Self::Key;
}

A sorted merge of those chunks might be able to do better than an uninformed sort afterward, and it's possible to do the deduplication at that time too.

We would prefer not to duplicate the deduplication code, and, as this is technically a library, we can't always guarantee that the things are actually sorted.

I think this will only work when the Key is not borrowed.

1 Like

Turns out, we cannot sort it.

However, this code was designed to be uhh "modular". That is, these:

/// A source of data of some specified kind.
pub trait DataSource<T: Kind>: DataSourceBase {
    /// Returns the values associated with this kind.
    fn get_values(&self) -> T::Values;

    /// Returns the value associated with this kind.
    ///
    /// This is the same as [`get_values`] but allows using the singular name,
    /// for kinds with up to only one value.
    fn get_value(&self) -> Option<T> where T: Kind<Values=Option<T>> {
        self.get_values()
    }
}

/// A kind of value that can be retrieved from a data source.
pub trait Kind {
    /// Values of this kind.
    type Values: IntoIterator<Item=Self>;
}

/// A kind of value that can be retrieved from a data source, which can
/// override values of the same kind.
pub trait OverridableKind: Kind {
    /// The key, as in a map key, that gets overridden.
    type Key: Hash + Eq + Ord;

    /// Returns the key for use in a key-value map such as a `BTreeMap` or
    /// a set such as `BTreeSet`.
    fn as_key(&self) -> &Self::Key;
}

allow us to for example substitute a Values from a Vec to a BTreeSet where the order of set elements doesn't matter... Uh we'll get back to this once we figure out what we need to do here. But yeah, as it turns out the order does matter sometimes, and we had forgotten the cases where it does so.

We do take it the approach in the OP is the best approach we can take in safe Rust tho?

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.