Looking for a fast lookup algorithm

Hi,

This might not be a right place to ask this but I'll try in hope someone can provide at least a reference to a solution.

My impasse is the following situation. On one hand i have a list of overlapping intervals like this:

[1..2]
[1..3]
[2..7]
[1..5]
[4..5]
[3..10]
...
# each interval has a unique pair of start and stop coordinates

on the other i have a list of points:

4
6
8
...

Task is for each point in my list, to find all intervals that overlap is. What i am currently doing is extremely stupid. I mapped the points and then i roll-out each interval (loop from x..y in each interval) and check if there is any point equal to a position in my hashmap. if it is, i simply associate a key ( point in my hashmap to a value (which is a vector of all intervals that overlap that point). My aim is to be able to answer multiply queries once the index (map) is created. Problem is that for a large number of intervals this index construction becomes extremely slow (especially when the average overlap between interval increases ). Could anyone provide hint on how to reduce algorithm construction runtime for this problem?

thank you

If you want something similar to your HashMap approach but way more efficient, use a BTreeMap and its range_mut API.

use std::ops::Range;
use std::collections::BTreeMap;

static RANGES: &[std::ops::Range<i32>] = &[
    1..2,
    1..3,
    2..7,
    1..5,
    4..5,
    3..10,
];

static NUMBERS: &[i32] = &[4, 6, 8];

fn main() {
    let mut map: BTreeMap<i32, Vec<Range<i32>>> = NUMBERS.iter().map(|n| (*n, vec![])).collect();
    for r in RANGES {
        map.range_mut(r.clone()).for_each(|e| e.1.push(r.clone()));
    }
    dbg!(&map);
}
[src/main.rs:20] &map = {
    4: [
        2..7,
        1..5,
        4..5,
        3..10,
    ],
    6: [
        2..7,
        3..10,
    ],
    8: [
        3..10,
    ],
}

As a more simpler data structure, you could also use a sorted vector, and just binary-search for the range start point.

use std::ops::Range;

static RANGES: &[std::ops::Range<i32>] = &[
    1..2,
    1..3,
    2..7,
    1..5,
    4..5,
    3..10,
];

static NUMBERS: &[i32] = &[4, 6, 8];

fn main() {
    let mut map: Vec<(i32, Vec<Range<i32>>)> = NUMBERS.into_iter().map(|n| (*n, vec![])).collect();
    
    // pessimistically assume NUMBERS migth not be sorted yet
    map.sort_unstable_by_key(|e| e.0);

    // main loop
    for r in RANGES {
        let start_ix = map.partition_point(|e| e.0 < r.start);
        map[start_ix..].iter_mut().take_while(|e| e.0 < r.end).for_each(|e| e.1.push(r.clone()));
    }
    dbg!(&map);
}
[src/main.rs:25] &map = [
    (
        4,
        [
            2..7,
            1..5,
            4..5,
            3..10,
        ],
    ),
    (
        6,
        [
            2..7,
            3..10,
        ],
    ),
    (
        8,
        [
            3..10,
        ],
    ),
]

one could also use two vectors instead of one vector of pairs.

5 Likes

Just for reference, the segment tree is a data structure that allows efficient mapping from a point to a set of intervals containing the given point. There's the segment-tree crate, but I never used it.

2 Likes

I’m only noticing this now, you state both

Task is for each point in my list, to find all intervals that overlap is.

but also

My aim is to be able to answer multiply queries once the index (map) is created.

The latter statements sounds like you are not interested in finding all intervals for all points? Unless I’m misunderstanding this second statement. So then, what are the queries you want to answer?

3 Likes

Alternatively the points can be stored in a balanced tree and queried with an interval (returning matching points).

Queries for nested intervals (e.g. matching endpoint) could be grouped to some extent.

2 Likes