Drafty code for an `mdbook` preprocessor

I guess this code isn't quite ready for review, but it's not an explicit help question either. Any suggestions or comments would be cool.

Background

The mdbook has a template for creating preprocessors over here.

The goal is to fetch remote raw-markdown files from anywhere and place them in a book. There is already a crate for this, but this is just for learning.

For now it only works for raw-markdown text. [1]

Snippet

Preprocessors receive the mdbook's chapters one by one, each as a String, and can transform them. My current idea is:

/// Replaces the URL to markdown-content by the content itself.
/// Could be used for other formats eventually.
fn urls_to_content(content: &str) -> String {
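    regex_replace_all!(
        // `\.md` escapes the dot so only a literal `.md` suffix matches.
        r"\s(\{\{\s*#remote\s+([^\s}{]{5,200}\.md)\s*\}\})",
        content,
        |_, _whole, url| {
            // NOTE: `unwrap` panics on any network or decoding error.
            reqwest::blocking::get(url).unwrap().text().unwrap()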
        }
    )
    .to_string()
}

The way it works, with a bit more context, is:

  1. The user adds {{ #remote <BASE_URL>/path/to/file.md}} inside their markdown book (see snippet above)
  2. The remote markdown content is placed in the book where the placeholder above was.
    • The content should be saved (still to do) so the same files are not downloaded on every build.
    • A single Client should be shared between all GET requests, but for now I've only sketched the idea.
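
The "do not download the same file twice" point could be sketched roughly as below. This is a minimal, std-only sketch under assumptions: `FetchCache` and its `fetch` closure are hypothetical names, the closure stands in for the real `reqwest` call, and the cache only lives in memory for a single run (saving to disk between builds would be a further step).

```rust
use std::collections::HashMap;

/// Hypothetical in-memory cache keyed by URL. `fetch` stands in for the
/// real HTTP call, so the sketch has no network dependency.
pub struct FetchCache<F>
where
    F: FnMut(&str) -> Result<String, String>,
{
    fetch: F,
    bodies: HashMap<String, String>,
}

impl<F> FetchCache<F>
where
    F: FnMut(&str) -> Result<String, String>,
{
    pub fn new(fetch: F) -> Self {
        Self { fetch, bodies: HashMap::new() }
    }

    /// Returns the cached body for `url`, calling `fetch` only on a miss.
    /// Failed fetches are not cached, so a later call retries them.
    pub fn get(&mut self, url: &str) -> Result<&str, String> {
        if !self.bodies.contains_key(url) {
            let body = (self.fetch)(url)?;
            self.bodies.insert(url.to_owned(), body);
        }
        Ok(self.bodies[url].as_str())
    }
}
```

In the preprocessor, one such cache could be created at the start of `run` and consulted for every placeholder, so a URL that appears in several chapters is only requested once.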

Something I couldn't figure out is how to re-use the regex (i.e. r"\s(...)") in the different places where it is needed, rather than copying it.
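
One common way around macros that insist on a literal is to build the pattern once in a `static` and have every caller borrow that static. The sketch below only shows the shape of that idea with the standard library: `Compiled`, `REMOTE_RE`, `BUILDS` and the two accessor functions are hypothetical names, and `Compiled` is a stand-in for a real `regex::Regex` (which would be built with `Regex::new(REMOTE_PATTERN)`). If I remember correctly, `lazy_regex` also has a `lazy_regex!` macro meant for declaring such a static, but I have not checked that against the current docs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how often the initializer runs; only here to make the
/// "built once" behaviour observable in this sketch.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// The pattern text lives in exactly one place.
const REMOTE_PATTERN: &str = r"\s(\{\{\s*#remote\s+([^\s}{]{5,200}\.md)\s*\}\})";

/// Stand-in for a compiled regex (hypothetical type, see the note above).
pub struct Compiled {
    pub source: &'static str,
}

/// Initialized lazily on first access, then shared by every caller.
pub static REMOTE_RE: LazyLock<Compiled> = LazyLock::new(|| {
    BUILDS.fetch_add(1, Ordering::SeqCst);
    Compiled { source: REMOTE_PATTERN }
});

/// What the replacement code would borrow.
pub fn pattern_for_replace() -> &'static Compiled {
    &REMOTE_RE
}

/// What the tests would borrow: the very same value.
pub fn pattern_for_tests() -> &'static Compiled {
    &REMOTE_RE
}
```

With this layout the test module and the replacement function cannot drift apart, because neither of them spells out the pattern itself.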


The longer snippet is:
use lazy_regex::regex_replace_all;
use mdbook_preprocessor::{
    Preprocessor, PreprocessorContext,
    book::{Book, Chapter},
    errors::Result,
};

/// Preprocessor that fetches remote markdown files
pub struct Fetch;

impl Fetch {
    pub fn new() -> Fetch {
        Fetch
    }
}

impl Preprocessor for Fetch {
    fn name(&self) -> &str {
        "fetch"
    }

    /// Modify chapters replacing `{{#remote URLs}}` by the .md content.
    fn run(
        &self,
        ctx: &PreprocessorContext,
        mut book: Book,
    ) -> Result<Book> {
        // book.toml option for this preprocessor.
        let option = "preprocessor.fetch.disable";
        match ctx.config.get::<bool>(option) {
            // Ok(None) is field unset.
            Ok(None) | Ok(Some(false)) => {
                book.for_each_chapter_mut(include_markdown);
                Ok(book)
            }
            Ok(Some(true)) => Ok(book),
            Err(err) => Err(err.into()),
        }
    }
    /// Runs when rendering to HTML,
    /// but operates on the markdown files.
    fn supports_renderer(&self, renderer: &str) -> Result<bool> {
        Ok(renderer == "html")
    }
}

/// Write markdown to book.
/// This function is kept separate so the replacement can be tested on its own.
fn include_markdown(chapter: &mut Chapter) {
    chapter.content = urls_to_content(&chapter.content)
}
/// Replaces the URL to markdown-content by the content itself.
/// Could be used for other formats eventually.
fn urls_to_content(content: &str) -> String {
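    regex_replace_all!(
        // `\.md` escapes the dot so only a literal `.md` suffix matches.
        r"\s(\{\{\s*#remote\s+([^\s}{]{5,200}\.md)\s*\}\})",
        content,
        |_, _whole, url| {
            // NOTE: `unwrap` panics on any network or decoding error.
            reqwest::blocking::get(url).unwrap().text().unwrap()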
        }
    )
    .to_string()
}

#[cfg(test)]
mod test {

    use lazy_regex::{regex, regex::Match};

    use super::*;

    #[test]
    fn test_regex() {
        let input_str: &str = r#"some text and even more but now 
            // Should fail: blank in `// a.`
            {{ #remote https:// abc.def.g/mypath/to.md }} 
            // Should pass
            {{ #remote https://abc.def.g/mypath/to.md }} 
            // Should pass
            {{#remote https://abc.def.ga.b.c/mypath/to.md}}
            // Should pass: `http` is accepted
            {{ #remote http://this.is.insecure/fails/to.md }}
            // Should pass:
            {{#remote https://github.com/rvben/rumdl/blob/main/docs/markdownlint-comparison.md}}
        //"#;
        fn find_markdown_urls(str_file: &str) -> Vec<&str> {
            // I could not find a way to reuse the same regex,
            // since `regex!` and `regex_replace_all!` need a
            // literal, and using `static reg = ..` was too hard.
            let found: Vec<&str> =
                regex!(r"\s(\{\{\s*#remote\s+([^\s}{]{5,200})\s*\}\})")
                    .find_iter(str_file)
                    .map(|m: Match| m.as_str())
                    .collect();
            found
        }

        let result = find_markdown_urls(input_str);
        assert_eq!(result.len(), 4)
    }
    #[test]
    fn test_url_replacement() {
        let content = r"safgdsafgdsaf
        hello world

        {{#remote https://raw.githubusercontent.com/rust-lang/mdBook/7b29f8a7174fa4b7b31536b84ee62e50a786658b/README.md}}
        ";
        let new_doc = urls_to_content(content);
        assert!(new_doc.starts_with("safgd"));
        assert!(
            new_doc
                .contains("mdBook is a utility to create modern online books from Markdown files.")
        )
    }
}
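
`test_url_replacement` above needs network access and a reachable GitHub. One way to test the replacement logic offline is to pass the download step in as a closure and return a `Result` instead of unwrapping. The following is only a sketch under assumptions: `replace_remote` is a hypothetical helper, it scans for `{{ ... }}` by hand rather than using the regex, and it therefore does not enforce the leading whitespace, the 5 to 200 character limit or the `.md` suffix that the regex checks.

```rust
/// Replaces every `{{ #remote URL }}` placeholder in `content` with whatever
/// `fetch(URL)` returns. `fetch` is injected so tests can avoid the network.
/// Hypothetical helper: it scans by hand instead of using the regex.
pub fn replace_remote<F>(content: &str, mut fetch: F) -> Result<String, String>
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(start) = rest.find("{{") {
        let after_open = &rest[start + 2..];
        // No closing braces left: stop scanning and copy the tail as-is.
        let Some(end) = after_open.find("}}") else { break };
        let inner = after_open[..end].trim();
        match inner.strip_prefix("#remote") {
            // A valid directive is `#remote`, whitespace, then a URL without blanks.
            Some(arg)
                if arg.starts_with(char::is_whitespace)
                    && !arg.trim().is_empty()
                    && !arg.trim().contains(char::is_whitespace) =>
            {
                out.push_str(&rest[..start]);
                out.push_str(&fetch(arg.trim())?);
            }
            // Anything else (e.g. `{{#include ...}}`) is copied through untouched.
            _ => out.push_str(&rest[..start + 2 + end + 2]),
        }
        rest = &after_open[end + 2..];
    }
    out.push_str(rest);
    Ok(out)
}
```

A test can then call `replace_remote(text, |_| Ok("REMOTE BODY".to_string()))` and compare the whole output, while the real preprocessor would pass a closure that wraps the `reqwest` call and maps its error to a message.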

  [1] Example of raw markdown

Version 2

use std::sync::LazyLock;

use mdbook_preprocessor::{
    Preprocessor, PreprocessorContext,
    book::{Book, Chapter},
    errors::Result,
};
use pulldown_cmark::{Event, Parser, TextMergeStream};
use regex::{Captures, Regex};
use reqwest::blocking::get as get_reqwest;

/// Build the regex only once.
static RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"\s\{\{\s*#remote\s+([^\s}{]{5,200})\s*\}\}").unwrap()
});
/// Preprocessor that fetches remote markdown files
pub struct Fetch;

impl Preprocessor for Fetch {
    fn name(&self) -> &str {
        "fetch"
    }

    /// Modify chapters replacing `{{#remote URLs}}` by the .md content.
    fn run(
        &self,
        ctx: &PreprocessorContext,
        mut book: Book,
    ) -> Result<Book> {
        // book.toml option for this preprocessor.
        let option = "preprocessor.fetch.disable";
        match ctx.config.get::<bool>(option) {
            // `Ok(None)` is field unset.
            Ok(None) | Ok(Some(false)) => {
                book.for_each_chapter_mut(|ch| {
                    match include_markdown(ch) {
                        Ok(s) => ch.content = s,
                        Err(e) => {
                            eprintln!("failed to process chapter: {e:?}")
                        }
                    }
                });
                Ok(book)
            }
            Ok(_) => Ok(book),
            Err(err) => Err(err.into()),
        }
    }
    /// Preprocess Markdown, regardless of
    /// the final output being .html or .md
    fn supports_renderer(&self, renderer: &str) -> Result<bool> {
        Ok(renderer == "html" || renderer == "md")
    }
}

/// Rebuild the chapter's markdown, replacing any `{{#remote URL}}`
/// placeholders found in its text events.
fn include_markdown(ch: &mut Chapter) -> Result<String> {
    let mut buf = String::with_capacity(ch.content.len());

    // Iterator over events
    let parser =
        TextMergeStream::new(Parser::new(&ch.content)).map(|e| match e {
            Event::Text(text) => {
                let result = url_to_content(&text).into();
                Event::Text(result)
            }
            _ => e,
        });
    // Serialize the (modified) events back to markdown.
    pulldown_cmark_to_cmark::cmark(parser, &mut buf)?;
    Ok(buf)
}
/// Replaces the URL to markdown-content by the content itself.
/// Could be used for other formats eventually.
fn url_to_content(content: &str) -> String {
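    // `replace_all` handles several placeholders in one text event;
    // `replace` would stop after the first match.
    RE.replace_all(content, |caps: &Captures| {
        // NOTE: `unwrap` panics on any network or decoding error.
        let mut r = get_reqwest(&caps[1]).unwrap().text().unwrap();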
        r.insert_str(0, "\n");
        r
    })
    .to_string()
}