Sub-string deduplication

I have a conceptual question. I have a const declaration for a snippet of text TEXT.

const TEXT: &str = "Atticus was right. One time he said you never really know a man until you stand in his shoes and walk around in them. Just standing on the Radley porch was enough. The summer that had begun so long ago had ended, and another summer had taken its place, and a fall, and Boo Radley had come out. 𝗜 𝘄𝗮𝘀 𝘁𝗼 𝘁𝗵𝗶𝗻𝗸 𝗼𝗳 𝘁𝗵𝗲𝘀𝗲 𝗱𝗮𝘆𝘀 𝗺𝗮𝗻𝘆 𝘁𝗶𝗺𝗲𝘀-𝗼𝗳 𝗝𝗲𝗺, 𝗮𝗻𝗱 𝗗𝗶𝗹𝗹 𝗮𝗻𝗱 𝗕𝗼𝗼 𝗥𝗮𝗱𝗹𝗲𝘆, 𝗮𝗻𝗱 𝗧𝗼𝗺 𝗥𝗼𝗯𝗶𝗻𝘀𝗼𝗻, 𝗮𝗻𝗱 𝗔𝘁𝘁𝗶𝗰𝘂𝘀. He would be in Jem’s room all night, and he would be there when Jem waked up in the morning. I looked around the front yard, wondering how many times Jem and I had made our journey across the street. It was different now. Daylight... in my mind, the night faded. It was daytime, and the neighborhood was busy. Miss Stephanie Crawford crossed the street to tell the latest news to Miss Rachel. Miss Maudie bent over her azaleas. It was summertime again, and the children played in the yard. But it was not our yard anymore. Jem was nearly thirteen now, and I was nearly ten. We had grown up.";

Independently, I need a const slice SENTENCE_X for a sentence that is part of TEXT.

const SENTENCE_X: &str = "I was to think of these days many times-of Jem, and Dill and Boo Radley, and Tom Robinson, and Atticus.";

Is there a way to have the second const point to a sub-slice within TEXT,
so as not to waste memory?

This question is conceptual, so please do not pick on the insignificance of memory savings
in this specific example.

This optimization should exist already. Rust's strings don't have a NUL terminator, so the optimizer and linker are able to merge them.


Hah! That's good to know and surprising at the same time. Does that mean that the compiler/linker maintains a sort of GIST/trigram index of all declared const slices?

You can use const pointer arithmetic to get guaranteed reuse, with something roughly like this:

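// Const-compatible equivalent of `&s[range]`: the result points into `s` itself.
const fn str_index(s: &str, range: std::ops::Range<usize>) -> &str {
    // Reject ranges that are inverted or that extend past the end of `s`.
    if s.len() < range.end || range.start > range.end {
        panic!("range out of bounds");
    }

    // SAFETY: the check above guarantees that `range` lies within `s`, so the
    // offset pointer and the length stay inside the original string data.
    unsafe {
        let res = std::str::from_utf8(std::slice::from_raw_parts(
            s.as_ptr().add(range.start),
            range.end - range.start,
        ));

        // `from_utf8` fails if the range splits a multi-byte character.
        match res {
            Ok(res) => res,
            Err(_) => panic!("invalid utf8"),
        }
    }
}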

const A: &str = "abcdefg";

const B: &str = str_index(A, 1..6);
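
To apply the same idea to the original question without counting byte offsets by hand, the range can be located at compile time as well. The following is only a sketch: `find` and `sub_slice_of` are hypothetical helper names, not standard library functions, and TEXT here is a shortened stand-in for the real passage. It assumes a toolchain recent enough to allow `from_raw_parts` and `from_utf8_unchecked` in const fn.

```rust
/// Hypothetical helper: byte offset of the first occurrence of `needle`
/// in `haystack`, using a naive scan that works in const contexts.
const fn find(haystack: &str, needle: &str) -> Option<usize> {
    let h = haystack.as_bytes();
    let n = needle.as_bytes();
    let mut i = 0;
    while i + n.len() <= h.len() {
        // Count how many bytes of `needle` match starting at offset `i`.
        let mut j = 0;
        while j < n.len() && h[i + j] == n[j] {
            j += 1;
        }
        if j == n.len() {
            return Some(i);
        }
        i += 1;
    }
    None
}

/// Hypothetical helper: returns the part of `haystack` that equals `needle`.
/// Fails at compile time (when used in a const) if there is no match.
const fn sub_slice_of<'a>(haystack: &'a str, needle: &str) -> &'a str {
    let start = match find(haystack, needle) {
        Some(i) => i,
        None => panic!("needle is not a substring of haystack"),
    };
    // SAFETY: `find` only returns `start` when `start + needle.len()` is within
    // `haystack`, and the bytes in that range equal `needle`, which is valid UTF-8.
    unsafe {
        std::str::from_utf8_unchecked(std::slice::from_raw_parts(
            haystack.as_ptr().add(start),
            needle.len(),
        ))
    }
}

// Shortened stand-in for the passage in the question.
const TEXT: &str =
    "It was different now. It was daytime, and the neighborhood was busy. We had grown up.";

// The literal passed here is only consumed during const evaluation;
// SENTENCE_X itself is a pointer into TEXT's data plus a length.
const SENTENCE_X: &str = sub_slice_of(TEXT, "It was daytime, and the neighborhood was busy.");

fn main() {
    let offset = find(TEXT, SENTENCE_X).unwrap();
    println!("{SENTENCE_X}");
    println!(
        "shares storage with TEXT: {}",
        std::ptr::eq(SENTENCE_X.as_ptr(), TEXT[offset..].as_ptr())
    );
}
```

The scan is quadratic in the worst case, which only costs compile time. Note that the language reference makes no general promise about the addresses of const items, so the pointer comparison in `main` is a way to check the behavior on a given toolchain rather than a documented guarantee.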

It doesn't exist already. The only way to tell the linker to allow string merging outside of merging exactly identical sections (which we don't enable right now) is C string merging, which requires NUL terminators (on nightly C string merging is not automatically enabled if a string ends with a NUL terminator). As for the optimizer, LLVM doesn't support substring deduplication and the rustc frontend doesn't do this deduplication either.
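
One way to see this for yourself is to compare addresses at run time. The sketch below uses a hypothetical helper, `is_subslice_of`, that checks whether one string's bytes lie inside another's memory range. A slice derived from A always does; whether an independently declared literal does depends on the compiler and linker, so no particular result is claimed for that case.

```rust
/// Hypothetical helper: true if the bytes of `inner` lie entirely within
/// the memory range occupied by `outer`.
fn is_subslice_of(inner: &str, outer: &str) -> bool {
    let outer_start = outer.as_ptr() as usize;
    let outer_end = outer_start + outer.len();
    let inner_start = inner.as_ptr() as usize;
    let inner_end = inner_start + inner.len();
    outer_start <= inner_start && inner_end <= outer_end
}

const A: &str = "abcdefg";
// Declared on its own, with the same contents as A[1..6].
const INDEPENDENT: &str = "bcdef";

fn main() {
    let a = A;
    // Slicing borrows from `a`, so this slice is inside `a` by construction.
    let derived = &a[1..6];
    println!("derived slice inside A: {}", is_subslice_of(derived, a));
    // Reports whatever this particular toolchain happened to do.
    println!("independent literal inside A: {}", is_subslice_of(INDEPENDENT, a));
}
```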
