Dynamic pool of work items that must be processed (but not consumed) by a dynamic pool of workers?


#1

The “pool” (maybe wrong term) would have:

  1. A dynamic list of work items.
  2. A dynamic list of workers to process those items.
  3. Work items are not consumed by the workers.
  4. The constraint is that, as long as a worker is active in the pool, the pool must schedule it to process every item and every change (insert/update/delete) to the work item list (see the sketch below).
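
To make the shape of that concrete, here is a minimal sketch of the state such a pool would hold; the names (`Pool`, `Change`) are hypothetical and only for illustration.

```rust
// Hypothetical sketch of the pool state described in the list above.
use std::collections::HashMap;

// Changes to the work item list that still need to be scheduled
// against every active worker.
enum Change<K, V> {
    Inserted(K, V),
    Updated(K, V),
    Deleted(K),
}

struct Pool<K, V, W> {
    // 1. a dynamic list of work items (not consumed by processing)
    items: HashMap<K, V>,
    // 2. a dynamic list of workers; each must see every item and every change
    workers: Vec<W>,
    // 4. changes not yet scheduled against the active workers
    pending: Vec<Change<K, V>>,
}
```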

The concrete case I’m thinking about is searching a bunch of text files with a bunch of searches in parallel: the work items would be the text files and the workers would be the different searches.

Searches and files might be added/removed over time, and the system should stay efficient: it should limit the amount of work scheduled at once, and it should not redo work for existing searches when only a few files change.


I’m interested in any thoughts or suggestions. I’m looking for existing examples (any language) that might solve a similar problem, or suggestions on how to (or how not to) build my own solution.

I’ve been working on a solution that uses @BurntSushi's ignore crate for the initial processing, then watches the filesystem for changes and incrementally updates the results based on those changes… but I think I need a better high-level design.

I know little about such things, but it seems like the database pattern of an event log plus materialized views might be a pretty clean solution: the events would model inserted/updated/deleted work items, and the materialized views would represent the workers. But getting that overall design into Rust, and in parallel, is still non-trivial for me.
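
Not a full design, but a rough sketch of what that event-log/materialized-view idea could look like in Rust; the trait and type names here are made up for illustration.

```rust
// Hypothetical sketch of the event-log / materialized-view idea:
// work-item changes are modeled as events, and each "worker" (search)
// maintains its own view by folding over new events.
use std::path::PathBuf;

enum FileEvent {
    Inserted(PathBuf),
    Updated(PathBuf),
    Deleted(PathBuf),
}

trait SearchView {
    // Apply one event, incrementally updating this search's results
    // instead of redoing work for unchanged files.
    fn apply(&mut self, event: &FileEvent);
}

struct EventLog {
    events: Vec<FileEvent>,
}

impl EventLog {
    fn replay_into(&self, view: &mut dyn SearchView) {
        // A newly added search catches up by replaying the whole log;
        // existing searches only see events appended after they joined.
        for event in &self.events {
            view.apply(event);
        }
    }
}
```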


#2

I think your definition of “work item” is off, or at least a little odd, and that’s making the problem sound more unusual than it really is.

I think your files and patterns are ‘data items’. I think the work item is a task like ‘search this file’, and it does get consumed. I’m not sure whether each worker is a different search or something more generic, which would make the work item ‘search this file for this pattern’. Either would work, but I think the latter is maybe the simplest start.

I’d keep a set of files, and a set of patterns, and something like a crossbeam channel for work distribution. On each file addition, emit a set of tasks to search that file for each pattern. On each pattern addition, emit tasks to search each file. Then you can have a “sensible” number of threads running the work, without worrying about too many files or too many patterns leading to thrashing the machine.
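
A rough sketch of that setup with crossbeam_channel; the `Task` type and the thread count of 4 are just placeholders.

```rust
// Rough sketch of the channel-based work distribution described above.
use crossbeam_channel::unbounded;
use std::path::PathBuf;
use std::thread;

struct Task {
    file: PathBuf,
    pattern: String, // "search this file for this pattern"
}

fn main() {
    let (tx, rx) = unbounded::<Task>();

    // A "sensible" number of worker threads, regardless of how many
    // files or patterns exist.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                while let Ok(task) = rx.recv() {
                    // The actual search would run here.
                    println!("searching {:?} for {:?}", task.file, task.pattern);
                }
            })
        })
        .collect();

    let files = vec![PathBuf::from("a.txt"), PathBuf::from("b.txt")];
    let patterns = vec!["foo".to_string(), "bar".to_string()];

    // On each file addition, emit one task per known pattern
    // (and symmetrically, on each pattern addition, one task per file).
    for file in &files {
        for pattern in &patterns {
            tx.send(Task { file: file.clone(), pattern: pattern.clone() }).unwrap();
        }
    }

    drop(tx); // close the channel so workers exit once the queue drains
    for w in workers {
        w.join().unwrap();
    }
}
```

If limiting the amount of work queued at once matters, `crossbeam_channel::bounded` gives backpressure instead of an unbounded queue.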

That keeps things simple and isolated, but has a load problem: you’ll want to avoid duplicated copies of file data and repeated IO for each file. Part of that is simply ordering the work sensibly. Part of that might involve some shared, reference-counted file buffer. And part of it should involve searching for multiple patterns in a single pass, the way regex::RegexSet can, and changing the work item to include the current pattern set for each file, so that only one worker reads each file.
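
For instance, regex::RegexSet lets one scan report which of the current patterns matched at all (it doesn’t report match positions, so individual regexes are still needed if locations matter):

```rust
use regex::RegexSet;

fn main() {
    // The current pattern set for a given file.
    let patterns = [r"fn\s+\w+", "TODO", "unsafe"];
    let set = RegexSet::new(patterns).unwrap();

    // In the real system this would be the file's contents, read once.
    let contents = "fn main() { /* TODO: remove unsafe */ }";

    // One pass over the text tells us which patterns matched.
    for idx in set.matches(contents) {
        println!("pattern {:?} matched", patterns[idx]);
    }
}
```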


#3

Thanks, I think you are right about the terminology.

Also thanks for the regex::RegexSet mention, I didn’t know about it, and it will be useful for combining the searches into a single pass.

> That keeps things simple and isolated, but has a load problem: you’ll want to avoid duplicated copies of file data and repeated IO for each file.

Yes, and it’s even more important because some of the searching I want to perform requires that the file be parsed into a tree first. So preparing a file for searching will require a load, and sometimes a parse, before the search can execute.

> Part of that is simply ordering the work sensibly

I’m not quite sure what you mean by this, but what I am trying to do is group search “tasks” together so that all tasks targeting a particular file will run together (one after another in the same thread) so that they can share a common context and will only load/parse the file once.
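
A hypothetical sketch of that grouping: every task for one file runs against a shared context, loading once and parsing lazily only if some task needs the tree (`ParseTree` and `SearchTask` are stand-ins, not real types from this project).

```rust
use std::fs;
use std::io;
use std::path::Path;

// Stand-in for whatever parsed representation the tree-based searches need.
struct ParseTree;

enum SearchTask {
    // A plain text/regex-style search over the raw contents.
    Text(String),
    // A search that needs the file parsed into a tree first.
    Structural(String),
}

struct FileContext {
    contents: String,
    tree: Option<ParseTree>, // built lazily, at most once per file
}

// All tasks targeting one file run together, sharing one context.
fn run_tasks_for_file(path: &Path, tasks: &[SearchTask]) -> io::Result<()> {
    let mut ctx = FileContext {
        contents: fs::read_to_string(path)?, // one load per file
        tree: None,
    };

    for task in tasks {
        match task {
            SearchTask::Text(pattern) => {
                let _found = ctx.contents.contains(pattern.as_str());
            }
            SearchTask::Structural(query) => {
                // Parse only when the first structural search asks for it.
                let _tree = ctx.tree.get_or_insert_with(|| ParseTree);
                let _ = query; // run the tree query here
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let tasks = vec![
        SearchTask::Text("TODO".to_string()),
        SearchTask::Structural("fn_items".to_string()),
    ];
    run_tasks_for_file(Path::new("example.txt"), &tasks)
}
```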


#4

Yes, exactly this, for the very simple initial model I was using: emit all the patterns for one file, rather than all the files for one pattern, into the work queue in order. Once we move on to the multi-pattern model it’s irrelevant again (or rather, subsumed by an even better solution).

> (one after another in the same thread) so that they can share a common context and will only load/parse the file once

Even this is notably less efficient than searching in a single pass, because of cache and paging locality (now that you know it can be done).


#5

Yes, for searches that can be mapped down to regexes I’ll definitely use RegexSet. But I also expect to have other searches that require parsing the file… those are the ones I really want to group. Anyway, I think you’ve gotten me a bit more on the right track. Thanks.