I have this piece of code, which is shared among multiple tasks:
use std::{collections::HashSet, sync::Arc};

use anyhow::Result;
use log::trace; // or `tracing::trace`
use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, BufReader};
use tokio::sync::Mutex;

async fn _push_to_set(set: Arc<Mutex<HashSet<String>>>, file: String) -> Result<()> {
    trace!("Loading file: {}", file);
    let file = File::open(file).await?;
    let mut reader = BufReader::new(file).lines();
    // `next_line` yields `Ok(Some(line))` until EOF; `while let` loops over
    // every line (a `for` over the returned `Option` would stop after one).
    while let Some(line) = reader.next_line().await? {
        let mut set = set.lock().await;
        set.insert(line.trim().to_string());
    }
    Ok(())
}
The files are big, over a million lines each. I feel that acquiring the lock for every insert comes with a great overhead. Is there a better way to insert into a shared set?
File IO is not really possible without blocking in any case, so async does not gain you much. For this reason, reading the files synchronously in a spawn_blocking call is a fine way to do it. As for the RAM disk, it will speed up either approach, simply because RAM disks allow faster access to the files.
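For illustration, here is a minimal sketch of that approach, assuming tokio and anyhow (load_file is a hypothetical helper, not something from your code):

use std::collections::HashSet;
use std::io::BufRead;

async fn load_file(path: String) -> anyhow::Result<HashSet<String>> {
    // Run the blocking read on tokio's dedicated blocking thread pool,
    // then hand the finished set back to the async caller.
    tokio::task::spawn_blocking(move || -> anyhow::Result<HashSet<String>> {
        let file = std::fs::File::open(path)?;
        let reader = std::io::BufReader::new(file);
        let mut set = HashSet::new();
        for line in reader.lines() {
            set.insert(line?.trim().to_string());
        }
        Ok(set)
    })
    .await? // a JoinError also becomes an anyhow::Error here
}

Each task builds its own set this way, so there is no lock to contend on while reading.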
@cuviper As I understood the purpose, the set is created once and then used, in which case I would not recommend a concurrent set. To be fair, this assessment was aided by having also answered this thread.
For now I am using a Python script to process the files, which is really slow (~30 minutes). I am trying to replicate the logic in Rust so I can reduce the time and process the files efficiently.
Is the procedure above better, or should I go with extend?
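By "go with extend" I mean roughly this sketch (same imports as my snippet above): buffer each file's lines locally, then take the lock once per file instead of once per line.

async fn push_to_set(set: Arc<Mutex<HashSet<String>>>, file: String) -> Result<()> {
    let file = File::open(file).await?;
    let mut lines = BufReader::new(file).lines();
    let mut local = Vec::new();
    while let Some(line) = lines.next_line().await? {
        local.push(line.trim().to_string());
    }
    // One lock acquisition per file instead of one per insert.
    set.lock().await.extend(local);
    Ok(())
}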
Basically, all this server does is send all the actions that occur during a period of time to another server. Every file contains a list of the actions that happened during a period of time (one action per line). My goal is to get a unique list of the actions and send them to another server.
The files are sent using scp (I don't have control over that) to my server and are placed in a directory (which I can change).
What I have now is a cron job that runs every hour and loads the Python script that processes the files sent to me. The script iterates through the files inside the directory, processes them one by one, creates a set of unique actions, saves it to a file, and sends it to another server.
This process takes a lot of time to finish, and sometimes, when the files are too big, it takes more than an hour, which causes two instances of the script to run at once (and that causes issues). The file sizes vary from 10 MB to 1 GB, and the number of files is not constant.
In my current (Rust) executable, I want to remove the cron job and let Rust wait for files to come in and process them.
Sure, that makes sense. It sounds like the number of files is reasonably small, in which case I'd probably go for just spawning a thread, since the bulk of your work appears to be either file IO or CPU bound. If you need to process at most one file at a time, you can even avoid the whole thread business.
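As a rough sketch of the thread version (process_file here is a hypothetical stand-in for the synchronous read-and-dedup above):

use std::collections::HashSet;
use std::io::BufRead;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

fn process_file(path: PathBuf) -> HashSet<String> {
    // synchronous read + dedup, as in the spawn_blocking sketch above
    let file = std::fs::File::open(&path).expect("open failed");
    std::io::BufReader::new(file)
        .lines()
        .map(|l| l.expect("read failed").trim().to_string())
        .collect()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for path in ["a.txt", "b.txt"] {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(process_file(path.into())).unwrap();
        });
    }
    drop(tx); // the receiver loop below ends once every worker is done
    let mut unique: HashSet<String> = HashSet::new();
    for set in rx {
        unique.extend(set);
    }
}

Note that each worker owns its own set, so no Mutex is needed at all; the merge happens once on the receiving side.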
In parallel, you have some form of daemon / background process that listens on that Unix socket to collect the events, and then outputs them wherever you need to output.
But the key idea is that one of the most efficient ways to have a form of mpsc with multiple processes in Unix is to use a UNIX_DATAGRAM.
server sending the files -> inotify -> spawn a process that will send bytes to a UNIX_DGRAM -> app listening on UNIX_DGRAM (process and send to next server)
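A minimal sketch of the listening end, using the standard library's UnixDatagram (the socket path is just a placeholder):

use std::os::unix::net::UnixDatagram;

fn main() -> std::io::Result<()> {
    // bind fails if the path already exists, so clean up any stale socket
    let _ = std::fs::remove_file("/tmp/actions.sock");
    let socket = UnixDatagram::bind("/tmp/actions.sock")?;
    let mut buf = vec![0u8; 64 * 1024]; // generous buffer for one datagram
    loop {
        let (len, _addr) = socket.recv_from(&mut buf)?;
        // merge/deduplicate here, then forward to the next server
        println!("received {} bytes", len);
    }
}

Each sender process only needs UnixDatagram::unbound()?.send_to(bytes, "/tmp/actions.sock").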
Sorry @Yandros, I may have misunderstood your idea, but how can adding a DGRAM socket between inotify and the application improve performance?
What I ended up doing is:
- a loop that has a delay_for and runs in a tmux window;
- when new files come in, iterate through them and spawn a task to handle each file;
- when the tasks are finished, collect the results and send them to the next recipient (sketched below).
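Roughly, in sketch form (assuming tokio 1.x, where tokio::time::sleep is the equivalent of the older delay_for; process_file stands in for the per-file work):

use std::collections::HashSet;
use std::path::PathBuf;
use std::time::Duration;

async fn process_file(path: PathBuf) -> anyhow::Result<HashSet<String>> {
    // per-file work: read the file and return its unique actions
    let content = tokio::fs::read_to_string(&path).await?;
    Ok(content.lines().map(|l| l.trim().to_string()).collect())
}

async fn watch_loop(dir: PathBuf) -> anyhow::Result<()> {
    loop {
        let mut handles = Vec::new();
        for entry in std::fs::read_dir(&dir)? {
            handles.push(tokio::spawn(process_file(entry?.path())));
        }
        let mut unique = HashSet::new();
        for handle in handles {
            unique.extend(handle.await??);
        }
        // send `unique` to the next recipient here; processed files should
        // be moved or deleted so they are not picked up again
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}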
If DGRAM is better, can you please explain why?
Because, if I have understood correctly, you need to do some "merge" over the data spread across the multiple files. If that isn't the case / if you can handle each file separately and send the right data accordingly, then the inotify layer will indeed suffice.
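For completeness, watching the directory is only a few lines with the notify crate (which wraps inotify on Linux); "/incoming" is a placeholder path:

use notify::{recommended_watcher, RecursiveMode, Watcher};
use std::path::Path;

fn main() -> notify::Result<()> {
    let mut watcher = recommended_watcher(|res: notify::Result<notify::Event>| match res {
        Ok(event) => println!("file event: {:?}", event.paths),
        Err(e) => eprintln!("watch error: {e}"),
    })?;
    watcher.watch(Path::new("/incoming"), RecursiveMode::NonRecursive)?;
    std::thread::park(); // keep the main thread (and the watcher) alive
    Ok(())
}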