Checking simple repeats in strings

Hello all,

I feel like the problem I am about to present is pretty simple, but maybe I am a bit simple.

I want to check if strings contain simple repeats. For example, in a vector of Strings or &strs, I'd like to filter out those which either:

  • contain all the same chars (e.g. AAAAA)
  • contain repeating dimers (e.g. ACACAC)
  • contain repeating trimers (e.g. ACTACTACT)

I came up with something, and I'm pretty sure it works. But it's ugly, and verbose, and loops over three times. I don't care so much about performance, but it would be nice to get some other ideas of how people would tackle this problem. Bonus if it can be extended to any substring length.

My attempt:

fn filter_count_vec(v: &mut Vec<String>) {
    // monomers
    v.retain(|s| {
        let init = s.chars().next().unwrap();
        
        !s.chars().all(|other| other == init)
    });

    // dimers
    v.retain(|s| {
        let all_alt_0 = s.chars().step_by(2).collect::<String>();
        let all_alt_1 = s.chars().skip(1).step_by(2).collect::<String>();

        let all_alt_0_init = all_alt_0.chars().next().unwrap();
        let all_alt_0_same = !all_alt_0.chars().all(|other| other == all_alt_0_init);

        let all_alt_1_init =  all_alt_1.chars().next().unwrap();
        let all_alt_1_same = !all_alt_1.chars().all(|other| other == all_alt_1_init);
        
        all_alt_0_same && all_alt_1_same
    });

    // trimers.
    v.retain(|s| {
        let all_alt_0 = s.chars().step_by(2).collect::<String>();
        let all_alt_1 = s.chars().skip(1).step_by(2).collect::<String>();
        let all_alt_2 = s.chars().skip(1).step_by(3).collect::<String>();

        let all_alt_0_init = all_alt_0.chars().next().unwrap();
        let all_alt_0_same = !all_alt_0.chars().all(|other| other == all_alt_0_init);

        let all_alt_1_init =  all_alt_1.chars().next().unwrap();
        let all_alt_1_same = !all_alt_1.chars().all(|other| other == all_alt_1_init);

        let all_alt_2_init =  all_alt_2.chars().next().unwrap();
        let all_alt_2_same = !all_alt_2.chars().all(|other| other == all_alt_2_init);
        
        all_alt_0_same && all_alt_1_same && all_alt_2_same
    });
}


fn main() {
    let mut vec = vec![
        "asdfgh".to_string(), // keep 
        "hahaha".to_string(), // remove
        "haahaahaahaa".to_string(), // remove
        "aaaaaaaaaa".to_string() // remove
        ];

    filter_count_vec(&mut vec);

    println!("{:?}", vec);
}

I couldn't easily see if this problem had a nice solution, so many thanks in advance for any help :slight_smile:

M

What about using Regex with backreferences? Playground example.

1 Like

I don't have time tonight to look at your code or give much of an explanation, but you could base a solution on something like this:

/// Returns the shortest period of repetition in s.
/// If s does not repeat, returns the number of characters in s.
fn check_repeats(s:&str)->usize {
    let mut delays:Vec<Option<core::str::Chars>> = vec![];
    for c in s.chars() {
        for it in delays.iter_mut() {
            if Some(c) != it.as_mut().and_then(Iterator::next) {
                *it = None;
            }
        }
        delays.push(Some(s.chars()));
    }
    delays.into_iter().position(|x| x.is_some()).unwrap()+1
}
1 Like

If I was gonna leetcode it, I would probably do something like this:

fn examine(s: &str) -> String {
    let mut possibles : Vec<Vec<char>> = Vec::new();
    let mut correct : Vec<bool> = Vec::new();

    for (index, char) in s.char_indices() {
        if let Some(v) = possibles.last() {
            let mut new = v.clone();
            new.push(char);
            possibles.push(new);
        } else {
            possibles.push(vec![char]);
        }
        correct.push(true);
        
        for (poss_index, poss_string) in possibles.iter().enumerate() {
            let rep_index = index % poss_string.len();
            let rep_char = poss_string.get(rep_index).expect("should not be OOB");
            if *rep_char != char {
                correct[poss_index] = false;
            }
        }
    }
    
    let s_len = s.chars().count();
    
    let all_repeats: Vec<_> = correct.iter().enumerate().filter(|(i, &v)| v && s_len % (i + 1) == 0).collect();
    
    let valid = all_repeats.first();
    
    if let Some(&(index, _)) = valid {
        let x: String = possibles[index].iter().collect();
        return x;
    }  else {
        return "OH NO".to_string();
    }
}

fn main() {
    let a = "AAAAA";
    let b = "ABABAB";
    let c = "ABCABCABC";
    
    println!("{}", examine(a));
    println!("{}", examine(b));
    println!("{}", examine(c));

}

Obviously this is real dirty, and there are loads of opportunities for optimization. But the general idea would be to go through the string character by character, building up both a list of possible "mers" and checking all "mers" to see if the current character violates their repetition. Then at the end, you just look for mers which were never violated and also evenly divide the string length, and take the smallest one.

This assumes that you are only interested in strings where the entire string is a repeat, and also has all caveats of "what is a char"

I cleaned this up a bit by using a BTreeMap<...> instead of Vec<Option<...>>. This version should be more efficient for long strings as well as easier to follow.

/// Returns the shortest period of repetition in s.
/// If s does not repeat, returns the number of characters in s.
fn check_repeats(s:&str)->usize {
    let mut delays: BTreeMap<_, std::str::Chars> = BTreeMap::new();
    for (i,c) in s.chars().enumerate() {
        delays.retain(|_,iter| iter.next() == Some(c));
        delays.insert(i+1, s.chars());
    }
    delays.into_keys().next().unwrap()
}
2 Likes

This solution is looking super neat, thanks for this!

Hadn’t considered a regex solution, thanks for this :slight_smile: