I am able to find the text between the tags in a single-line string using a regex:
use regex::Regex;

fn main() {
    let re = Regex::new(r"<head>(.|\n)*?</head>").unwrap();
    let caps = re
        .captures("hvfiuhgiuer gh <head> hfiuhrfu ferijfiurfh hu </head> fr gerg")
        .unwrap();
    println!("{:?}", caps);
}
How can this be achieved for a whole HTML file using file I/O?
You can use std::fs::read_to_string() to read the file into a String and match a regex on it. I don't think the regex crate supports matching on I/O streams directly.
And an obligatory note that regex can't handle HTML in general. HTML allows syntax like:
<!doctype html>
<title>hi</title>
<body>
<!-- I'm not <head> -->
<script> console.log("</head> doesn't end here")</script>
where the <title> element is in <head>, but both start and end tags for <head> are optional, and automatically implied from the context (<title> opens it, <body> closes it).
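To make the failure concrete, here is a minimal sketch that runs a naive search over the document above. It uses plain substring search from the standard library as a stand-in for the `<head>(.*?)</head>` regex, which behaves the same way on this input; the names `SAMPLE` and `between_head_tags` are made up for illustration.

```rust
/// The example document from above: no explicit <head> tags, but the
/// literal strings "<head>" and "</head>" appear in a comment and a script.
const SAMPLE: &str = r#"<!doctype html>
<title>hi</title>
<body>
<!-- I'm not <head> -->
<script> console.log("</head> doesn't end here")</script>
"#;

/// Naive extraction: the text between the first literal "<head>" and the
/// next literal "</head>" after it. This is what a non-greedy regex would
/// capture as well.
fn between_head_tags(html: &str) -> Option<&str> {
    let start = html.find("<head>")? + "<head>".len();
    let len = html[start..].find("</head>")?;
    Some(&html[start..start + len])
}

fn main() {
    // Prints the tail of the comment and the start of the script,
    // not the <title> element that is actually inside the implied head.
    println!("{:?}", between_head_tags(SAMPLE));
}
```

The search "succeeds", but what it returns is a slice running from the comment into the script, and the real contents of the head (the <title>) are not in it.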
For serious HTML processing there's html5ever, which will see even the implied <head>. There's also lol-html for very fast streaming rewrites of HTML without loading the whole document into memory.
First off, don't parse HTML with regex. Unless you're just scraping a specific set of HTML pages that happen to follow a fixed, easily regexable pattern, regexes won't be able to get the job done, especially the relatively limited ones offered by the regex crate.
Given all the previously mentioned caveats, you could improve the regex a little by using RegexBuilder and setting dot_matches_new_line(true):

use regex::RegexBuilder;

fn main() {
    // With dot_matches_new_line(true), `.` also matches '\n',
    // so the (.|\n) alternation is no longer needed.
    let re = RegexBuilder::new(r"<head>(.*?)</head>")
        .dot_matches_new_line(true)
        .build()
        .unwrap();
    let string = std::fs::read_to_string("a.html").unwrap();
    let caps = re.captures(&string).unwrap();
    // Group 1 is the text between the tags.
    println!("{:?}", &caps[1]);
}