Use Regex to match

I am able to find the text between the tags for a single line through RegEx.

use regex::Regex;

fn main() {
    let re = Regex::new(r"<head>(.|\n)*?</head>").unwrap();
    let caps = re.captures("hvfiuhgiuer gh <head> hfiuhrfu ferijfiurfh  hu </head> fr gerg" ).unwrap();
    println!("{:?}", caps);
}

How can this be achieved for the whole HTML file using FileIO?

You can use std::fs::read_to_string() to read the file into a String and run the regex on that. I don't think the regex crate supports matching on I/O streams directly.

And an obligatory note that regex can't handle HTML in general. HTML allows syntax like:

<!doctype html>
<title>hi</title>
<body>
<!-- I'm not <head> -->
<script> console.log("</head> doesn't end here")</script> 

where the <title> element is in <head>, but both start and end tags for <head> are optional, and automatically implied from the context (<title> opens it, <body> closes it).
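To make the failure concrete, here is a minimal sketch of what a "first <head>, then the next </head>" search does on that document. It uses plain substring search from the standard library as a stand-in for the lazy regex (for literal tags the two should pick the same span), and the names naive_between and EXAMPLE_HTML are made up for this illustration.

```rust
// The example document from above: it contains no real <head> tags.
const EXAMPLE_HTML: &str = r#"<!doctype html>
<title>hi</title>
<body>
<!-- I'm not <head> -->
<script> console.log("</head> doesn't end here")</script>
"#;

/// Returns the text between the first `open` and the first `close` after it.
/// Hypothetical helper: a substring-search stand-in for `<head>(.*?)</head>`.
fn naive_between<'a>(haystack: &'a str, open: &str, close: &str) -> Option<&'a str> {
    // Position just past the first occurrence of the opening tag.
    let start = haystack.find(open)? + open.len();
    // Offset of the first closing tag after that position.
    let len = haystack[start..].find(close)?;
    Some(&haystack[start..start + len])
}

fn main() {
    // The search latches onto the comment and the script string,
    // so the "head" it reports is a fragment of the body.
    println!("{:?}", naive_between(EXAMPLE_HTML, "<head>", "</head>"));
}
```

The span it returns starts inside the comment and ends inside the script string, while the real head content (the <title> element) is missed entirely.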

For serious HTML processing there's html5ever that will see even the implied <head>. There's also lol-html for very fast streaming rewrites of HTML without loading it fully.

First off, don't parse HTML with regex. Unless you're just scraping a specific set of HTML pages that happen to have a fixed, easily regexable pattern, regexes (especially the regex crate's relatively limited ones) won't be able to get the job done.

Secondly, the regex crate does not support streaming data. You need to either read the whole file into memory or use memory-mapped I/O to get a contiguous buffer you can search.

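use regex::RegexBuilder;

fn main() {
    // Let `.` match `\n` too, so the capture can span multiple lines.
    let re = RegexBuilder::new(r"<head>(.*?)</head>")
        .dot_matches_new_line(true)
        .build()
        .unwrap();
    // Read the whole file into memory; the regex crate cannot search a stream.
    let string = std::fs::read_to_string("a.html").unwrap();
    // `captures` returns None when the file has no literal <head>...</head> pair.
    match re.captures(&string) {
        Some(caps) => println!("{:?}", &caps[1]),
        None => println!("no <head>...</head> found"),
    }
}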

Given all the previously mentioned caveats, you could improve the regex a little by using RegexBuilder and setting dot_matches_new_line(true).
