How to modify content inside a file following a match

I have a text file that looks like this:

aaa
bbb
ccc
ddd
EEE
f
// stuff here
f
f
// stuff here
f

I need to:

  1. Locate the line that contains "EEE"
  2. Locate the 4 "f" lines following the EEE (there are many more "f"s in the doc; I specifically need the 4 that follow).
  3. Replace content between the 1st and 2nd f, as well as 3rd and 4th. Note that old and new content may or may not be the same size (number of lines).

What's the best way to do this? Just pointing me to the right libraries / tutorials would be enough. I just don't really know where to start:)

First, open the file and read it into a String (assuming the file is entirely ASCII or UTF-8 text). The convenience function std::fs::read_to_string() is probably the easiest way to do this, but for future reference you can also open and read the file manually with the functions under OpenOptions and File. Then use the methods on String to edit the returned String. The methods match_indices() (which produces an iterator over all the indices where a particular pattern matches) and replace_range() may be particularly useful for finding and replacing the relevant portions of the String. If the replacement text is the same in each case, replace() may be a more efficient approach. For this sort of use case you could also consider a regular expression library like regex. Finally, write the text back to the file with something like std::fs::write().
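A minimal sketch of that in-memory approach, assuming the file uses LF line endings and that each "f" marker sits alone on its own line. The function names replace_after_anchor and edit_file_in_memory are made up for illustration; only read_to_string(), write(), find(), match_indices() and replace_range() come from the standard library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replaces the text between the 1st/2nd and the 3rd/4th `marker` lines that
/// follow the first occurrence of `anchor`. Returns None if the anchor or
/// four marker lines are not found. (Hypothetical helper, LF endings assumed.)
fn replace_after_anchor(
    text: &str,
    anchor: &str,
    marker: &str,
    first: &str,
    second: &str,
) -> Option<String> {
    let anchor_at = text.find(anchor)?;
    let bytes = text.as_bytes();

    // Byte offsets of the first four matches after the anchor that occupy a
    // whole line. The filter skips matches inside other text, e.g. "stuff".
    let hits: Vec<usize> = text[anchor_at..]
        .match_indices(marker)
        .map(|(i, _)| anchor_at + i)
        .filter(|&i| {
            let end = i + marker.len();
            (i == 0 || bytes[i - 1] == b'\n') && (end == bytes.len() || bytes[end] == b'\n')
        })
        .take(4)
        .collect();
    if hits.len() < 4 {
        return None;
    }

    // Skip each marker plus its newline. Edit the later range first so the
    // earlier offsets are still valid afterwards.
    let skip = marker.len() + 1;
    let mut out = text.to_string();
    out.replace_range(hits[2] + skip..hits[3], second);
    out.replace_range(hits[0] + skip..hits[1], first);
    Some(out)
}

/// Reads the whole file, edits it in memory, and writes it back.
/// Returns Ok(false) and leaves the file alone if the markers are missing.
fn edit_file_in_memory(path: &Path, first: &str, second: &str) -> io::Result<bool> {
    let text = fs::read_to_string(path)?;
    match replace_after_anchor(&text, "EEE", "f", first, second) {
        Some(edited) => {
            fs::write(path, edited)?;
            Ok(true)
        }
        None => Ok(false),
    }
}
```

The replacement strings are expected to end with a newline, since they take the place of whole lines. Old and new content can differ in length because replace_range() shifts the rest of the String as needed.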

Modifying files in place isn't easy. Text editors make it look easy, but files are like arrays. Inserting or deleting content in the middle means you have to shift everything afterwards around.

Tools that modify files usually take one of two approaches:

  1. Read the entire file into memory, modify an in-memory buffer, then write the whole thing back out when saving.
  2. Read the input file, write a modified version to a temporary file, then if the entire write is successful move the temp file over top of the original.

The second approach has a couple of nice properties: It can be done in a streaming manner without reading the whole file in at once, avoiding O(file size) memory usage. It avoids corruption if the write fails partway through; the original file is still intact. Moving a file is an atomic operation, so other programs won't see the update until it's finished.
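In Rust, the "write to a temp file, then move it over the original" step could look roughly like the sketch below. The function name replace_file_atomically and the ".tmp" suffix are hypothetical choices. The temp file is placed next to the original so that the rename stays on one filesystem, which is what makes it atomic on POSIX systems; other platforms may give weaker guarantees.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `new_contents` to a sibling temp file, then renames it over `path`.
/// If any step before the rename fails, the original file is left untouched.
fn replace_file_atomically(path: &Path, new_contents: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme: "<original name>.tmp" in the same directory.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(new_contents)?;
    // Ask the OS to flush the data to disk before the rename "commits" it.
    tmp.sync_all()?;
    drop(tmp);

    // The commit point: readers see either the old file or the new one.
    fs::rename(&tmp_path, path)
}
```

A real tool would probably also clean up the temp file on failure and pick a name that cannot collide with an existing file.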

This is all language agnostic. For some Rust-specific starting points see:

Interesting - so would I do something like this:

  1. search for the line containing EEE
  2. once found, save the remaining contents of the file to a temp variable
  3. search through the contents, this time for f using match_indices()
  4. replace_range() between 1st and 2nd, and 3rd and 4th

Does that sound reasonable?

The suggestions I gave were mainly oriented toward the "Read the entire file into memory, modify an in-memory buffer" approach @jkugelman suggested: you search for "EEE" in the buffer and do the editing there. If the file you're working with is particularly large, you could instead read only the part after "EEE" into the buffer, do the replacing, then write it back to the file at the appropriate point. That makes the use of the file access APIs in the standard library a little more complicated, though: you would be manually reading a little at a time, and then you would need to open the file for editing, move to the right point before writing back the buffer, and reduce the file length if the replacement is smaller than the original. Reading the entire file into memory is somewhat easier to work with.
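For reference, the more complicated "only buffer the part after EEE" variant might be sketched like this. The function name edit_tail_in_place and the closure parameter are assumptions made for the example; the closure stands in for whatever search-and-replace you run on the buffered tail. Note that this overwrites the file in place, so a failure partway through can leave it half-written.

```rust
use std::fs::OpenOptions;
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Buffers only the text after the first line containing `anchor`, passes it
/// to `edit`, and writes the result back at the same position in the file.
/// Returns Ok(false) without touching the file if the anchor is not found.
fn edit_tail_in_place(
    path: &Path,
    anchor: &str,
    edit: impl FnOnce(&str) -> String,
) -> io::Result<bool> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    let mut reader = BufReader::new(file);

    // Read a line at a time, counting bytes, until the anchor line is consumed.
    let mut offset: u64 = 0;
    let mut line = String::new();
    let mut found = false;
    loop {
        line.clear();
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // EOF
        }
        offset += n as u64;
        if line.contains(anchor) {
            found = true;
            break;
        }
    }
    if !found {
        return Ok(false);
    }

    // Everything after the anchor line goes into memory and gets edited.
    let mut tail = String::new();
    reader.read_to_string(&mut tail)?;
    let new_tail = edit(&tail);

    // Move back to just after the anchor line and overwrite from there.
    let mut file = reader.into_inner();
    file.seek(SeekFrom::Start(offset))?;
    file.write_all(new_tail.as_bytes())?;
    // Shrink the file if the new tail is shorter than the old one.
    file.set_len(offset + new_tail.len() as u64)?;
    Ok(true)
}
```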

I would go with approach #2 from my last post, reading and writing line-by-line in a streaming manner. Something like:

  1. Open the input file ("in") and a temporary output file ("out").
  2. Loop #1: Read a line at a time until you hit EEE. Write each line to out.
  3. Loop #2: Read a line at a time until you hit the 1st f. Write each line to out.
  4. Write the first set of replacement content to out.
  5. Loop #3: Read a line at a time until you hit the 2nd f. Discard these lines.
  6. Loop #4: Read a line at a time until you hit the 3rd f. Write each line to out.
  7. Write the second set of replacement content to out.
  8. Loop #5: Read a line at a time until you hit the 4th f. Discard these lines.
  9. Loop #6: Read a line at a time until you hit EOF. Write each line to out.
  10. Rename out to in.

Why so many loops? It's effectively a state machine. The loops have slightly different actions based on where it's at in the input file. When it gets to the end, step #10 is where the changes are actually "committed". If it doesn't make it to step #10 then the input file is left unchanged.

I admit, this is rather longwinded compared to doing everything in memory. Still, it's a good exercise in doing things efficiently. Reading everything into a memory buffer and doing a couple of searches and replaces is easier to code up, but it'll use a lot more memory and probably do a bunch of passes over the file without you even realizing. Looping over the file by hand guarantees that everything's done in one pass and with minimal memory usage.

And hey, extract the repetitive logic into a helper function or two and the code won't even be that hard on the eyes.
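Here is one way the ten steps above might look with the repetitive loop pulled into a helper. This is a sketch under some assumptions: the names advance_to, rewrite_stream and rewrite_file are invented for the example, the "f" markers are taken to be lines that consist of exactly "f", the marker lines themselves are kept in the output, and the temp file name is a simple sibling path.

```rust
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::Path;

fn has_anchor(line: &str) -> bool { line.contains("EEE") }
fn is_marker(line: &str) -> bool { line == "f" }

/// Reads lines until one satisfies `stop`. Lines before it are written to
/// `out` only if `keep` is true; the stopping line itself is always written.
/// Returns Ok(false) if EOF is reached first.
fn advance_to<R: BufRead, W: Write>(
    input: &mut R,
    out: &mut W,
    stop: impl Fn(&str) -> bool,
    keep: bool,
) -> io::Result<bool> {
    let mut line = String::new();
    loop {
        line.clear();
        if input.read_line(&mut line)? == 0 {
            return Ok(false);
        }
        let hit = stop(line.trim_end_matches(|c| c == '\r' || c == '\n'));
        if keep || hit {
            out.write_all(line.as_bytes())?;
        }
        if hit {
            return Ok(true);
        }
    }
}

/// Steps 2 to 9: one pass over `input`, writing the edited text to `out`.
/// Returns Ok(false) if EEE or any of the four markers is missing.
fn rewrite_stream<R: BufRead, W: Write>(
    mut input: R,
    mut out: W,
    first: &str,
    second: &str,
) -> io::Result<bool> {
    if !advance_to(&mut input, &mut out, has_anchor, true)? { return Ok(false); }
    if !advance_to(&mut input, &mut out, is_marker, true)? { return Ok(false); }
    out.write_all(first.as_bytes())?;
    if !advance_to(&mut input, &mut out, is_marker, false)? { return Ok(false); }
    if !advance_to(&mut input, &mut out, is_marker, true)? { return Ok(false); }
    out.write_all(second.as_bytes())?;
    if !advance_to(&mut input, &mut out, is_marker, false)? { return Ok(false); }
    io::copy(&mut input, &mut out)?; // copy the rest unchanged
    Ok(true)
}

/// Steps 1 and 10: open the files, then rename the temp file over the
/// original only if the whole rewrite succeeded.
fn rewrite_file(path: &Path, first: &str, second: &str) -> io::Result<bool> {
    let tmp_path = path.with_extension("tmp"); // hypothetical temp name
    let input = BufReader::new(File::open(path)?);
    let mut out = BufWriter::new(File::create(&tmp_path)?);
    let found = rewrite_stream(input, &mut out, first, second)?;
    out.flush()?;
    drop(out);
    if found {
        fs::rename(&tmp_path, path)?;
    } else {
        fs::remove_file(&tmp_path)?;
    }
    Ok(found)
}
```

Because rewrite_stream only needs BufRead and Write, it can be tried out on in-memory data (a byte slice in, a Vec<u8> out) before pointing rewrite_file at a real file. If an I/O error occurs partway through, this sketch leaves the temp file behind, but the original is untouched.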

Thanks @jameseb7 and @jkugelman - super helpful!