Checking simple repeats in strings

euphrasiologist · August 11, 2022, 9:03pm

Hello all,

I feel like the problem I am about to present is pretty simple, but maybe I am a bit simple.

I want to check if strings contain simple repeats. For example, in a vector of Strings or &strs, I'd like to filter out those which either:

contain all the same chars (e.g. AAAAA)
contain repeating dimers (e.g. ACACAC)
contain repeating trimers (e.g. ACTACTACT)

I came up with something, and I'm pretty sure it works. But it's ugly, and verbose, and loops over three times. I don't care so much about performance, but it would be nice to get some other ideas of how people would tackle this problem. Bonus if it can be extended to any substring length.

My attempt:

fn filter_count_vec(v: &mut Vec<String>) {
    // monomers
    v.retain(|s| {
        let init = s.chars().next().unwrap();
        
        !s.chars().all(|other| other == init)
    });

    // dimers
    v.retain(|s| {
        let all_alt_0 = s.chars().step_by(2).collect::<String>();
        let all_alt_1 = s.chars().skip(1).step_by(2).collect::<String>();

        let all_alt_0_init = all_alt_0.chars().next().unwrap();
        let all_alt_0_same = !all_alt_0.chars().all(|other| other == all_alt_0_init);

        let all_alt_1_init =  all_alt_1.chars().next().unwrap();
        let all_alt_1_same = !all_alt_1.chars().all(|other| other == all_alt_1_init);
        
        all_alt_0_same && all_alt_1_same
    });

    // trimers.
    v.retain(|s| {
        let all_alt_0 = s.chars().step_by(2).collect::<String>();
        let all_alt_1 = s.chars().skip(1).step_by(2).collect::<String>();
        let all_alt_2 = s.chars().skip(1).step_by(3).collect::<String>();

        let all_alt_0_init = all_alt_0.chars().next().unwrap();
        let all_alt_0_same = !all_alt_0.chars().all(|other| other == all_alt_0_init);

        let all_alt_1_init =  all_alt_1.chars().next().unwrap();
        let all_alt_1_same = !all_alt_1.chars().all(|other| other == all_alt_1_init);

        let all_alt_2_init =  all_alt_2.chars().next().unwrap();
        let all_alt_2_same = !all_alt_2.chars().all(|other| other == all_alt_2_init);
        
        all_alt_0_same && all_alt_1_same && all_alt_2_same
    });
}


fn main() {
    let mut vec = vec![
        "asdfgh".to_string(), // keep 
        "hahaha".to_string(), // remove
        "haahaahaahaa".to_string(), // remove
        "aaaaaaaaaa".to_string() // remove
        ];

    filter_count_vec(&mut vec);

    println!("{:?}", vec);
}

I couldn't easily see if this problem had a nice solution, so many thanks in advance for any help

M

moy2010 · August 11, 2022, 9:33pm

What about using Regex with backreferences? Playground example.

2e71828 · August 11, 2022, 9:33pm

I don't have time tonight to look at your code or give much of an explanation, but you could base a solution on something like this:

/// Returns the shortest period of repetition in s.
/// If s does not repeat, returns the number of characters in s.
fn check_repeats(s:&str)->usize {
    let mut delays:Vec<Option<core::str::Chars>> = vec![];
    for c in s.chars() {
        for it in delays.iter_mut() {
            if Some(c) != it.as_mut().and_then(Iterator::next) {
                *it = None;
            }
        }
        delays.push(Some(s.chars()));
    }
    delays.into_iter().position(|x| x.is_some()).unwrap()+1
}

KSwanson · August 11, 2022, 10:56pm

If I was gonna leetcode it, I would probably do something like this:

fn examine(s: &str) -> String {
    let mut possibles : Vec<Vec<char>> = Vec::new();
    let mut correct : Vec<bool> = Vec::new();

    for (index, char) in s.char_indices() {
        if let Some(v) = possibles.last() {
            let mut new = v.clone();
            new.push(char);
            possibles.push(new);
        } else {
            possibles.push(vec![char]);
        }
        correct.push(true);
        
        for (poss_index, poss_string) in possibles.iter().enumerate() {
            let rep_index = index % poss_string.len();
            let rep_char = poss_string.get(rep_index).expect("should not be OOB");
            if *rep_char != char {
                correct[poss_index] = false;
            }
        }
    }
    
    let s_len = s.chars().count();
    
    let all_repeats: Vec<_> = correct.iter().enumerate().filter(|(i, &v)| v && s_len % (i + 1) == 0).collect();
    
    let valid = all_repeats.first();
    
    if let Some(&(index, _)) = valid {
        let x: String = possibles[index].iter().collect();
        return x;
    }  else {
        return "OH NO".to_string();
    }
}

fn main() {
    let a = "AAAAA";
    let b = "ABABAB";
    let c = "ABCABCABC";
    
    println!("{}", examine(a));
    println!("{}", examine(b));
    println!("{}", examine(c));

}

Obviously this is real dirty, and there are loads of opportunities for optimization. But the general idea would be to go through the string character by character, building up both a list of possible "mers" and checking all "mers" to see if the current character violates their repetition. Then at the end, you just look for mers which were never violated and also evenly divide the string length, and take the smallest one.

This assumes that you are only interested in strings where the entire string is a repeat, and also has all caveats of "what is a char"

2e71828 · August 12, 2022, 12:35pm

I cleaned this up a bit by using a BTreeMap<...> instead of Vec<Option<...>>. This version should be more efficient for long strings as well as easier to follow.

/// Returns the shortest period of repetition in s.
/// If s does not repeat, returns the number of characters in s.
fn check_repeats(s:&str)->usize {
    let mut delays: BTreeMap<_, std::str::Chars> = BTreeMap::new();
    for (i,c) in s.chars().enumerate() {
        delays.retain(|_,iter| iter.next() == Some(c));
        delays.insert(i+1, s.chars());
    }
    delays.into_keys().next().unwrap()
}

euphrasiologist · August 15, 2022, 4:28pm

This solution is looking super neat, thanks for this!

euphrasiologist · August 15, 2022, 4:33pm

Hadn’t considered a regex solution, thanks for this

system · November 13, 2022, 4:33pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Checking vector help	5	357	September 24, 2022
Fill string with repeated character	10	50061	July 3, 2022
How to check if String exists in a vec? help	4	5870	November 3, 2021
Help in understanding the documentation help	10	397	July 20, 2020
Include array of strings in the possible patterns help	8	1805	March 31, 2021

Checking simple repeats in strings

Related Topics