Can a regex match only at a specific index?


#1

Hello, newbie here.

I’m making a library for parsing and syntax-highlighting a string. Assume the string can be very long (10,000 lines or more). I want to know whether a regex matches the string at a specific position.

Simple example:

use regex::Regex;

let re = Regex::new(r"\btakimata\b").unwrap();
let string = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.";

if let Some(m) = re.find_at(string, 10) {
    if m.start() == 10 {
        // do something
    }
}

Ideally, this would be an O(1) operation with respect to the length of the string, because the word “takimata” is not at index 10 and that can be ruled out right away. However, Regex::find_at keeps searching the rest of the string from index 10 onward, which is unnecessary.

To prevent this, I’m adding ^ to every regex, which means I have to slice the string accordingly:

let re = Regex::new(r"^(\btakimata\b)").unwrap();
let string = &string[10..];

if let Some(m) = re.find(string) {
    // do something
}

This is way more efficient, but now a word boundary \b at the beginning of the regex can no longer be checked correctly. For example:

let re = Regex::new(r"^(\beat\b)").unwrap();
let string = "You are great!";
let string = &string[10..];

if let Some(m) = re.find(string) {
    // do something
}

Here, the regex, which searches for the word “eat”, finds a match, because it only considers the sub-slice “eat!” and assumes a word boundary at the beginning (this is not desired).
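To make the lost context concrete, here is a minimal std-only sketch. The helper name split_context is made up for illustration, and the index is assumed to be a byte offset. It shows what the slice still contains and which character falls out of view.

```rust
/// Hypothetical helper: splits `s` at byte offset `i` and returns the
/// character just before the cut together with the remaining tail.
/// Returns None if `i` is past the end or not on a char boundary.
fn split_context(s: &str, i: usize) -> Option<(Option<char>, &str)> {
    if !s.is_char_boundary(i) {
        return None;
    }
    // Safe: `i` was just verified to be a valid char boundary.
    let (head, tail) = s.split_at(i);
    // The last char of `head` is the context that a regex run on `tail`
    // alone can no longer see.
    Some((head.chars().next_back(), tail))
}
```

For "You are great!" and index 10, the tail is "eat!" and the hidden character is 'r'. A regex that only sees the tail has no way to know that 'r' is there.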

How can I make it work? Or is it impossible at the moment? If so, can you recommend a different approach?


PS: The library is almost finished, and it’s pretty fast, but it’s rather awkward to use because of what I described.


#2

Hmm. The find_at and is_match_at methods look tempting, but they still search to the end of the string, and you can no longer use ^ to prevent this, because ^ won’t match when the search starts in the middle of the string.

Have you tried a regex like ^.{10}\beat\b?


#3

I guess that would work, but then I can’t re-use regular expressions for different indices, and compiling a regex is expensive.


#4

I would probably use ^eat, and upon a match, confirm that it exists at a word boundary (either manually or by using another regex on a more restricted substring based on the aforementioned match).


#5

I’m trying to do that now, but it’s harder than I thought. Pseudocode:

if 0 < i < len:
    word_boundary = string[i - 1].is_alphanumeric() != string[i].is_alphanumeric()
elif 0 == i < len:
    word_boundary = string[i].is_alphanumeric()
elif 0 < i == len:
    word_boundary = string[i - 1].is_alphanumeric()
else (i.e. 0 == i == len):
    word_boundary = false

Is this correct? Or is there a simpler way to check whether there’s a word boundary?
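
One possible Rust version of the pseudocode above, as a minimal sketch rather than a definitive answer. It assumes that i is a byte offset into the string, that a word character is anything alphanumeric or an underscore (an approximation of the regex crate’s \w, not a guarantee of identical behavior), and that a missing neighbor at either end of the string counts as a non-word character. The names is_word_char and is_word_boundary are made up for illustration.

```rust
/// Assumption for this sketch: a word character is alphanumeric or '_'.
/// Note that `is_alphanumeric` alone would miss the underscore.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Hypothetical helper: reports whether byte offset `i` in `s` sits on a
/// word boundary. Returns None if `i` is past the end of `s` or falls
/// inside a multi-byte character.
fn is_word_boundary(s: &str, i: usize) -> Option<bool> {
    if !s.is_char_boundary(i) {
        return None;
    }
    // Character just before the offset; absent at the start of the string.
    let before = s[..i].chars().next_back().map_or(false, is_word_char);
    // Character at the offset; absent at the end of the string.
    let after = s[i..].chars().next().map_or(false, is_word_char);
    // A boundary exists exactly when the two sides differ.
    Some(before != after)
}
```

Under these assumptions the four cases of the pseudocode collapse into a single comparison. Offset 10 in "You are great!" is not a boundary, since both 'r' and 'e' are word characters, while offset 8 in "You can eat!" is one. An empty string has no boundary at offset 0.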