How to perform a regex match on lines with the intention to extract fields and account for multiline logs

fritshoogland · February 27, 2022, 5:44pm

I have the following working function:

fn parse_loglines( logs_data: String ) -> Vec<LogLine> {
    let mut loglines: Vec<LogLine> = Vec::new();
    // This regex does not capture multi-line errors, such as:
    // long_operation_tracker:
    //W0214 11:22:36.980569  7803 long_operation_tracker.cc:114] UpdateReplica running for 1.000s in thread 7814:
    //@     0x7fa344eb611f  (unknown)
    //@     0x7fa345833b39  __lll_lock_wait
    //@     0x7fa34582e6e2  __GI___pthread_mutex_lock
    //@     0x7fa34e609858  rocksdb::port::Mutex::Lock()
    //@     0x7fa34e6897fb  rocksdb::InstrumentedMutex::Lock()
    //@     0x7fa34e564a4f  rocksdb::DBImpl::WriteImpl()
    //@     0x7fa34e5666a3  rocksdb::DBImpl::Write()
    //@     0x7fa353ff6e40  yb::tablet::Tablet::WriteToRocksDB()
    //@     0x7fa354007114  yb::tablet::Tablet::ApplyKeyValueRowOperations()
    //@     0x7fa354007660  yb::tablet::Tablet::ApplyOperation()
    //@     0x7fa35400791a  yb::tablet::Tablet::ApplyRowOperations()
    //@     0x7fa3540929d4  yb::tablet::WriteOperation::DoReplicated()
    //@     0x7fa354083efb  yb::tablet::Operation::Replicated()
    //@     0x7fa354089720  yb::tablet::OperationDriver::ApplyTask()
    //@     0x7fa354089e8e  yb::tablet::OperationDriver::ReplicationFinished()
    //@     0x7fa353cb3ae7  yb::consensus::ReplicaState::NotifyReplicationFinishedUnlocked()
    // fs_manager:
    //I0217 10:12:35.491056 26960 fs_manager.cc:278] Opened local filesystem: /mnt/d0
    //uuid: "05b8d17620eb4cd79eddaddb2fbcbb42"
    //format_stamp: "Formatted at 2022-02-13 16:26:17 on yb-1.local"
    // tablet_server_main:
    //I0217 10:12:35.521059 26960 tablet_server_main.cc:225] ulimit cur(max)...
    //ulimit: core file size 0(unlimited) blks
    //ulimit: data seg size unlimited(unlimited) kb
    //ulimit: open files 1048576(1048576)
    //ulimit: file size unlimited(unlimited) blks
    //ulimit: pending signals 119934(119934)
    //ulimit: file locks unlimited(unlimited)
    //ulimit: max locked memory 64(64) kb
    //ulimit: max memory size unlimited(unlimited) kb
    //ulimit: stack size 8192(unlimited) kb
    //ulimit: cpu time unlimited(unlimited) secs
    //ulimit: max user processes 12000(12000) 

    // it does match the most common logline format:
    // I0217 10:19:56.834905 31987 docdb_rocksdb_util.cc:416] FLAGS_rocksdb_base_background_compactions was not set, automatically configuring 1 base background compactions.
    let regular_log_line = Regex::new( r"([IWFE])(\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\d{1,6}) ([a-z_A-Z.:0-9]*)] (.*)\n" ).unwrap();

    // Just take the year, it's not in the loglines, however, when the year switches this will lead to error results
    let year= Local::now().format("%Y").to_string();
    let timezone = Local::now().format("%z").to_string();
    for captures in regular_log_line.captures_iter(&logs_data) {
        let mut timestamp_string: String = year.to_owned();
        timestamp_string.push_str(":");
        timestamp_string.push_str(&captures.get(2).unwrap().as_str());        // month
        timestamp_string.push_str(":");
        timestamp_string.push_str(&captures.get(3).unwrap().as_str());        // day
        timestamp_string.push_str(":");
        timestamp_string.push_str(&captures.get(4).unwrap().as_str());        // hour
        timestamp_string.push_str(":");
        timestamp_string.push_str(&captures.get(5).unwrap().as_str());        // minute
        timestamp_string.push_str(":");
        timestamp_string.push_str(&captures.get(6).unwrap().as_str());        // seconds
        timestamp_string.push_str(":");
        timestamp_string.push_str(&captures.get(7).unwrap().as_str());        // seconds
        timestamp_string.push_str(":");
        timestamp_string.push_str(&timezone.to_owned());
        loglines.push(LogLine {
            severity: captures.get(1).unwrap().as_str().to_string(),
            timestamp: DateTime::parse_from_str(&timestamp_string, "%Y:%m:%d:%H:%M:%S:%6f:%z").unwrap().with_timezone(&Local),
            tid: captures.get(8).unwrap().as_str().to_string(),
            sourcefile_nr: captures.get(9).unwrap().as_str().to_string(),
            message: captures.get(10).unwrap().as_str().to_string()
        });
    }
    loglines
}

Can the building of the timestamp be done in a more elegant way?
I essentially want to build a row, and it there are trailing rows add them to the message field of the LogLines struct. The common documentation shows multi match regex sets and captures, but not both. Can anyone show me how that's done?
Is there anything else that I can do in a better way? I am still learning.

Thank you

quinedot · February 28, 2022, 1:03am

Your current regex doesn't match the first commented example due to this double space:

W0214 11:22:36.980569  7803 
                     ^^

So maybe you want to be more lenient about whitespace.

You're capturing it in pieces and then rebuilding the pieces one by one; you could just capture the whole thing, prepend the year, and parse that. You could also use format! instead of a bunch of push_str. And, you can parse as the local TimeZone instead of appending the timezone, parsing it, and then setting it.

Here's what I got with those adjustments.

A regex to look ahead for non-log-line lines seemed too gnarly for me, and I didn't see an iterator over matches and non-matches, but you can just do so manually. The basic idea is to hold on to the current LogLine and append any non-matching when you find the next match. There's an extra step to find the first LogLine.

You should throw a bunch of tests at it, and make it more robust as needed. For example, the whitespace flexibility mentioned earlier, having some sort of fallback other than panicking for invalid dates, etc.

fritshoogland · March 2, 2022, 8:08pm

Thank you so much for helping me. This does exactly what I struggled with.

Would you mind talking me through the manual line addition? I get what is happening, but I try to master it so I can use it myself, and currently it still looks too alien to me to comprehend. Sorry about that.

I added tests for all the multiline exceptions, like you said, and I changed the fetching of the date in the way you showed, works like a charm!

quinedot · March 3, 2022, 12:55am

Sure. I factored this part out, because we need to create a LogLine from the regex match of a "regular log line" in more than one place now:

    let to_logline = |captures: Captures<'_>| { /* ... */ };

Next, we look for the first regular match out of all the logs given to us:

    let mut logline;
    let mut remaining;
    // `captures` finds the first match in all of the data
    match regular_log_line.captures(&logs_data) {
        // If there are no regular lines at all, just return empty `Vec`
        None => return loglines,
        Some(captures) => {
            // We found the first regular log line
            // `offset` is the `String` index just after the line ends
            let offset = captures.get(0).map(|m| m.end()).unwrap_or(0);
            // So `remaining` is the rest of the `log_data` after the match
            remaining = &logs_data[offset..];
            // Here we create a `LogLine` from the regular log line, like
            // you were already doing before
            logline = to_logline(captures);
        }
    };

While inside the Some(captures) branch, the points in the log data are like:

offset 0             | Junk\n
                     | Junk\n
line of `captures`   | W0214 ... \n
offset of `offset`   | @     0x7fa344eb611f ... \n

We store a reference to all the data from offset forward in remaining.

So at this point, we have the first regular log line converted into a LogLine (stored in logline), and we have all the log data that comes afterwards stored in remaining. That's an invariant that will hold just before the loop condition every time -- logline is the parsed version of the most recently seen regular log line, and remaining is all unparsed data that follows.

Next we loop, looking for the next regular log line:

    while let Some(captures) = regular_log_line.captures(remaining) {
        let all = captures.get(0).unwrap();
        let from = all.start();
        let offset = all.end();

If we match, the situation with regards to the log data in remaining looks like so:

start of `remaining` | @     0x7fa344eb611f  (unknown)\n
                     | @     0x7fa345833b39  __lll_lock_wait\n
offset of `from`     | I0217 10:19:56.834905 31987 ... \n
offset of `offset`   | I0217 10:12:35.491056 26960 ... \n

And the line of captures is the line between from and offset. Everything in remaining that is before from are the lines that came after the last "regular log line", but before this one. In this case, that's two lines. So this is the part where we append the extra lines to the previous LogLine:

        logline.message += &remaining[..from];

We're not going to change that last LogLine any more after this, so we store it in the Vec:

        loglines.push(logline);

Then we can parse the current regular log line:

        logline = to_logline(captures);

We'll hold on to that in case we need to append more to it.

The last thing we do in the loop is to re-assign remaining to be everything that comes after the current LogLine we just parsed:

        remaining = &remaining[offset..]

This restores the invariants. So then we loop and repeat this procedure until we stop matching log lines.

Append any intermediate lines to the "currently active" LogLine
Store that LogLine in the Vec
Parse the latest match and make it the new "currently active" LogLine
Re-assign remaining so we only look at the unparsed parts of the log data next time

Once we stop matching -- just after the while let loop -- remaining holds all of the log data that came after the last LogLine we created:

start of `remaining`: | uuid: \"05b8d17620eb4cd79eddaddb2fbcbb42\"\n
                      | format_stamp: \"Formatted at 2022-02-13 ...\n
                      | tablet_server_main:\n
(End Of Data)         |

So we just append it all to the most recent LogLine (the last one in the data):

    logline.message += remaining;

and finally, add that LogLine to the Vec and return.

    loglines.push(logline);
    loglines

system · June 1, 2022, 12:56am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
How to do Regexp matching code review	5	471	December 1, 2020
How to match multiline regex?	4	184	February 29, 2024
Scanning a logfile using scan_rules::scanner	5	433	November 24, 2020
[SOLVED] 's' does not live long enough help	4	194	December 3, 2023
Regex with a newline character help	4	2594	April 10, 2022

How to perform a regex match on lines with the intention to extract fields and account for multiline logs

Related Topics