How to sort a vector of strings by substring within a struct

Hi,

I feel this is a bit too complicated for me so i would need some help here. What I have is a vec of structs such that one element is a string and i would like to sort my vec of structs so that the pattern within a string is utilized in sorting: Example:

use regex::Regex;

#[derive(Debug)]
struct A {
  x: String,
  y: usize
}

fn main() {
 
  let vec : Vec<A>  = vec![
    A{x:"a2:33-7".to_string(), y:5},
    A{x:"a1:23-75".to_string(), y:7},
    A{x:"a10:233-76".to_string(), y:8},
    A{x:"a1:3-77".to_string(), y:9},
    A{x:"b2:3-78".to_string(), y:0},
    A{x:"c2:263-79".to_string(), y:3},
    A{x:"a34:264-79".to_string(), y:4}
  ];
  
  println!("{:?}", vec);
  
  // i need to sort A's by x s.t. the 
  // structs are first sorted by substring 
  // up to the first ':' and then int that follows 
  // up to '-'
  
  // expected result:
  //vec![
  //  A{x:"a1:3-77".to_string(),  y:9},
  //  A{x:"a1:23-75".to_string(), y:7},
  //  A{x:"a2:33-7".to_string(), y:5},
  //  A{x:"a10:233-76".to_string(), y:8},
  //  A{x:"a34:264-79".to_string(), y:4},
  //  A{x:"b2:3-78".to_string(), y:0},
  //  A{x:"c2:263-79".to_string(), y:3},
  //]
 
}

I understand that i should utilize sort_by() but i am getting lost in parsing the String patterns when it comes to double sorting.

Thank you for all the advice you can provide!

Your description sounds as if a10:233-76 should come before a2:33-7, as the a10 and a2 part would, lexicographically, be in that order.

But your “expected result” has them in the opposite order. If there’s more structure to the part before the :, you need to explain the general case. Currently, I’m seeing always one letter followed by an integer, but who knows what you’re after.

You also didn’t express the desired behavior if everything up until the - is identical; of if the whole string is identical and only y differs. Or is that expected to never happen? Do you want a fallback or a panic?

4 Likes

ah sorry , my clumsy wording... there will always be a fixed letter prefix (3 letters from a-z) and int is actually int+ so not only one but one or more digits and i would like to do a numerical sort on those as well as the number following :.

as for the y scenario.. i have not preference. if all is teh same then i do not care about what happens to structs with different y's

Could you also give a representative set of examples for better illustration of what kind of data can come before the :? Does int+ mean multiple integers or just multiple digits?[1] Are leading zeroes possible?


  1. sounds like you’re saying multiple digits ↩︎

3 Likes

man i failed in explaining this properly. Sorry for that!

so as for a more representative case:

options for x:

aaa1:33-7
aaa2:454-8
aag21:4565-8544
abg121:75-8
...

so :
[a-z]{3}([0-9]+):([0-9]+)-([0-9]+)

as for the leading zeros , they are non existent.

EDIT:

As for the size limit they are all u32 and will never be larger :slight_smile:

Any size limits on the integers? Will they fit u32 or u64 or something?

Assuming it would fit a u32, here’s a quick implementation I could come up with

use core::cmp::Ordering;

#[derive(Debug, PartialEq)]
struct A {
    x: String,
    y: usize,
}

fn get_int(s: &mut &str) -> u32 {
    let u = s.find(|c| c < '0' || c > '9').unwrap_or(s.len());
    let (l, r) = s.split_at(u);
    *s = r;
    l.parse().unwrap()
}
fn consume(s: &mut &str, n: usize) {
    *s = &s[n..];
}

fn compare(ref mut l: &str, ref mut r: &str) -> Ordering {
    l[0..3].cmp(&r[0..3]).then_with(|| {
        consume(l, 3);
        consume(r, 3);
        get_int(l).cmp(&get_int(r)).then_with(|| {
            assert_eq!(l[0..1], *":");
            assert_eq!(r[0..1], *":");
            consume(l, 1);
            consume(r, 1);
            get_int(l).cmp(&get_int(r)).then_with(|| {
                assert_eq!(l[0..1], *"-");
                assert_eq!(r[0..1], *"-");
                consume(l, 1);
                consume(r, 1);
                get_int(l).cmp(&get_int(r)).then_with(|| {
                    assert_eq!(l[..], *"");
                    assert_eq!(r[..], *"");
                    Ordering::Equal
                })
            })
        })
    })
}

fn main() {
    let mut vec: Vec<A> = vec![
        A{x:"aaa2:33-7".to_string(), y:5},
        A{x:"aaa1:23-75".to_string(), y:7},
        A{x:"aaa10:233-76".to_string(), y:8},
        A{x:"aaa1:3-77".to_string(), y:9},
        A{x:"aab2:3-78".to_string(), y:0},
        A{x:"aac2:263-79".to_string(), y:3},
        A{x:"aaa34:264-79".to_string(), y:4}
    ];

    println!("{:#?}", vec);

    vec.sort_by(|l, r| compare(&l.x, &r.x));

    println!("{:#?}", vec);

    // expected result:
    let expected = vec![
        A{x:"aaa1:3-77".to_string(),  y:9},
        A{x:"aaa1:23-75".to_string(), y:7},
        A{x:"aaa2:33-7".to_string(), y:5},
        A{x:"aaa10:233-76".to_string(), y:8},
        A{x:"aaa34:264-79".to_string(), y:4},
        A{x:"aab2:3-78".to_string(), y:0},
        A{x:"aac2:263-79".to_string(), y:3},
    ];
    assert_eq!(vec, expected);
}

Rust Playground

One could probably also use a regex, maybe even more efficiently. Also I’m not verifying the first 3 characters are a-z. Also I now noticed that it doesn’t verify the validity of later parts at all when the earlier parts are already comparing non-equal… I suppose my verifying story is a bit inconsistent – my reasoning while writing the code was that it’d help me more quickly find any bugs I might have introduced – you could simply remove all the asserts, anyways (and then the whole final then_with(|| Ordering::Equal) can go, too.)

I’ll be offline for a while now, but others can certainly answer and follow-up questions (e.g. the ref mut l: &str argument patterns are a bit funky :innocent:), or suggest improvements / alternatives :slight_smile:

1 Like

I suggest parsing your Strings into a struct with fields that represent the data you care about within the String, so that you can take advantage of strong typing, derived Ord implementations, and so on.

Here's a sketch.

Notes:

  • I chose u16 for the numbers arbitrarily and assumed their built-in FromStr implementations were appropriate
  • Slicing a str usually takes more care than this code needs; s[1..] is relying on the ('a'..='z').contains check to only allow letters whos length in UTF8 is 1
  • Errors can be improved with enough elbow grease
    • I didn't even implement Display or Error for XParseErr for example
  • Unwrapping in the sort is not a good approach and is just for illustration
    • You could use the parsed version everywhere after parsing
    • You could pre-check that everything is well formed before sorting
      • You could have a newtype holding a String you know is well formed and have a From<WellFormedA> for ParsedA implementation or such
    • Etc
  • You could implement Display for ParsedX and From<ParsedA> for A, etc
3 Likes

Though i marked the quinedot as a solution I find yours as helpful as his/hers but i do not know hoe to mark both thank you both for your help