```rust
const ALL: &str = "foo bar baz";
const FOO: &str = "foo";
const BAR: &str = "bar";
const BAZ: &str = "baz";

fn main() {
    dbg!(ALL.as_ptr(), FOO.as_ptr(), BAR.as_ptr(), BAZ.as_ptr());
    dbg!((FOO.as_ptr() as u64) - (ALL.as_ptr() as u64));
    dbg!((BAR.as_ptr() as u64) - (FOO.as_ptr() as u64));
    dbg!((BAZ.as_ptr() as u64) - (BAR.as_ptr() as u64));
    // dbg!(unsafe { *FOO.as_bytes().get_unchecked(3) });
}
```
The above code outputs:

```
[src/main.rs:8:5] (FOO.as_ptr() as u64) - (ALL.as_ptr() as u64) = 45
[src/main.rs:9:5] (BAR.as_ptr() as u64) - (FOO.as_ptr() as u64) = 36
[src/main.rs:10:5] (BAZ.as_ptr() as u64) - (BAR.as_ptr() as u64) = 36
```
which indicates that FOO does not reuse part of ALL, even though ALL contains all of FOO. I even set `opt-level` to `s` and `z` on my local machine, and it gives the same result.

Is this just a missed opportunity for optimization, or is there a deeper reason behind it?
While it is a missed optimization, it's harder than it seems: the time complexity is quadratic in the number of strings. Needless to say, this blows up quite quickly.
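To illustrate, here is a hypothetical sketch of the pairwise deduplication pass that quadratic cost refers to (the `dedup` function and its return shape are made up for this example; this is not what rustc actually does):

```rust
/// For each string: Some((host index, byte offset)) if it can be stored
/// inside another literal, or None if it needs its own storage. With n
/// strings this performs O(n^2) substring searches.
fn dedup(strings: &[&str]) -> Vec<Option<(usize, usize)>> {
    let mut placement = vec![None; strings.len()];
    for (i, s) in strings.iter().enumerate() {
        for (j, host) in strings.iter().enumerate() {
            // Only consider strictly longer hosts; identical literals are
            // already merged by ordinary constant deduplication.
            if host.len() > s.len() {
                if let Some(offset) = host.find(s) {
                    placement[i] = Some((j, offset));
                    break;
                }
            }
        }
    }
    placement
}
```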
If you really want to, you can explicitly run such a search at compile time. This will be very slow, since it gives the const evaluator a lot of work to do, and ugly, because various string operations are not available as const fns today. A faster version for use with big strings would be to use a build script to run a better, compiled algorithm, and of course the fastest would be to compute the offsets once and store them permanently, merely validating their correctness during the build (see the sketch after the output below).
```rust
const fn reuse(combined: &'static str, substr: &'static str) -> &'static str {
    // `str` search and slicing is not available in const,
    // so we use byte slices instead.
    let bcombined = combined.as_bytes();
    let bsubstr = substr.as_bytes();
    let Some(position_limit) = bcombined.len().checked_sub(bsubstr.len()) else {
        // `combined` is too short.
        return substr;
    };
    // Note: This is a naive string search algorithm, which is much slower
    // than one which takes advantage of properties of `substr` would be.
    let mut position = 0;
    while position <= position_limit {
        // `[..]` slicing on strings is not const, but `split_at` is.
        let candidate = bcombined.split_at(position).1.split_at(bsubstr.len()).0;
        if const_eq(candidate, bsubstr) {
            // Cannot fail. `unwrap()` isn't const so we match instead.
            return match core::str::from_utf8(candidate) {
                Ok(s) => s,
                Err(_) => unreachable!(),
            };
        }
        position += 1;
    }
    // Search failed.
    substr
}

// `==` on slices is not const, so we have to write it ourselves.
const fn const_eq(a: &[u8], b: &[u8]) -> bool {
    let len = a.len();
    if len != b.len() {
        return false;
    }
    let mut i = 0;
    while i < len {
        if a[i] != b[i] {
            return false;
        }
        i += 1;
    }
    true
}

const ALL: &str = "foo bar baz";
const FOO: &str = reuse(ALL, "foo");
const BAR: &str = reuse(ALL, "bar");
const BAZ: &str = reuse(ALL, "baz");

fn main() {
    dbg!(ALL.as_ptr(), FOO.as_ptr(), BAR.as_ptr(), BAZ.as_ptr());
    dbg!((FOO.as_ptr() as u64) - (ALL.as_ptr() as u64));
    dbg!((BAR.as_ptr() as u64) - (FOO.as_ptr() as u64));
    dbg!((BAZ.as_ptr() as u64) - (BAR.as_ptr() as u64));
    // dbg!(unsafe { *FOO.as_bytes().get_unchecked(3) });
}
```
This now outputs:

```
[src/main.rs:47:5] ALL.as_ptr() = 0x0000629c58565000
[src/main.rs:47:5] FOO.as_ptr() = 0x0000629c58565000
[src/main.rs:47:5] BAR.as_ptr() = 0x0000629c58565004
[src/main.rs:47:5] BAZ.as_ptr() = 0x0000629c58565008
[src/main.rs:48:5] (FOO.as_ptr() as u64) - (ALL.as_ptr() as u64) = 0
[src/main.rs:49:5] (BAR.as_ptr() as u64) - (FOO.as_ptr() as u64) = 4
[src/main.rs:50:5] (BAZ.as_ptr() as u64) - (BAR.as_ptr() as u64) = 4
```
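And for completeness, a minimal sketch of the "compute the offsets once and validate during the build" variant (the name `FOO_CHECKED` and the hand-written `(start, len)` values are made up for illustration; it reuses the `const_eq` helper from above):

```rust
// Hand-maintained offset of "foo" within ALL, checked at compile time.
const FOO_CHECKED: &str = {
    let (start, len) = (0, 3); // hypothetical hand-computed values
    let bytes = ALL.as_bytes().split_at(start).1.split_at(len).0;
    // If ALL changes and the offset goes stale, compilation fails here.
    assert!(const_eq(bytes, "foo".as_bytes()));
    match core::str::from_utf8(bytes) {
        Ok(s) => s,
        Err(_) => unreachable!(),
    }
};
```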
@jhpratt Do you have a source for that quadratic time complexity? I believe it can be done in linear time in the total number of characters across all of the strings. The idea is to build a suffix tree from one big string containing all of the strings concatenated together; you can then use the suffix tree to look up, for each string, the longest string containing it.
Yes, that works in linear time. Building a suffix array instead of a suffix tree also works and is simpler.
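For concreteness, a rough sketch of the suffix-array version (illustrative names; it assumes the inputs are nonempty and NUL-free, and for brevity it builds the array by naively sorting suffixes rather than with a linear-time construction such as SA-IS):

```rust
/// Returns, for each input string, the index of the longest input string
/// containing it (possibly itself).
fn longest_hosts(strings: &[&str]) -> Vec<usize> {
    // Concatenate with '\0' separators so no match can span two inputs,
    // recording which input owns each byte.
    let mut text = Vec::new();
    let mut owner = Vec::new();
    for (i, s) in strings.iter().enumerate() {
        text.extend_from_slice(s.as_bytes());
        owner.extend(std::iter::repeat(i).take(s.len()));
        text.push(0);
        owner.push(usize::MAX); // separator byte belongs to no input
    }

    // Suffix array: every start position, sorted by the suffix text.
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));

    strings
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let q = s.as_bytes();
            // Suffixes starting with `q` form one contiguous run in `sa`;
            // binary-search for its start, then scan it for the longest host.
            let lo = sa.partition_point(|&p| &text[p..] < q);
            let mut best = i;
            for &p in &sa[lo..] {
                if !text[p..].starts_with(q) {
                    break;
                }
                // No '\0' occurs inside a match, so a single input owns it.
                let host = owner[p];
                if strings[host].len() > strings[best].len() {
                    best = host;
                }
            }
            best
        })
        .collect()
}
```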
It's more complicated if you want to deal more generally with partially overlapping strings, e.g. "abc" and "bcd" can be stored as "abcd". Doing that optimally is NP-hard (it is the shortest common superstring problem).
But of course the optimal solution is not required! There are good simple heuristics and approximation algorithms that would provide most of the benefit.
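For example, here is a minimal sketch of the classic greedy approximation (repeatedly merge the two strings with the largest overlap); the function names are made up, and it assumes no input is a substring of another, since the full-overlap pass above would handle those first:

```rust
fn greedy_superstring(mut strings: Vec<String>) -> String {
    // Longest suffix of `a` that is also a prefix of `b`; only overlaps on
    // char boundaries are considered so the slicing below cannot panic.
    fn overlap(a: &str, b: &str) -> usize {
        (1..=a.len().min(b.len()))
            .rev()
            .filter(|&k| b.is_char_boundary(k) && a.is_char_boundary(a.len() - k))
            .find(|&k| a.ends_with(&b[..k]))
            .unwrap_or(0)
    }

    while strings.len() > 1 {
        // Find the ordered pair (bi, bj) with the largest overlap bk.
        let (mut bi, mut bj, mut bk) = (0, 1, 0);
        for i in 0..strings.len() {
            for j in 0..strings.len() {
                if i != j {
                    let k = overlap(&strings[i], &strings[j]);
                    if k > bk {
                        (bi, bj, bk) = (i, j, k);
                    }
                }
            }
        }
        // Merge: append `strings[bj]` minus its overlapped prefix.
        let tail = strings[bj][bk..].to_string();
        strings[bi].push_str(&tail);
        strings.swap_remove(bj);
    }
    strings.pop().unwrap_or_default()
}
```

For instance, `greedy_superstring(vec!["abc".into(), "bcd".into()])` returns "abcd".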
I knew about that one. I remember convincing myself that it was equivalent to a variant of the travelling salesman problem. But I did not look at the only-full-overlaps problem at the time.