New CommonMark parser


#1

I’ve released a CommonMark parser written in pure, safe Rust. The code is on GitHub at https://github.com/google/pulldown-cmark, and it’s also published as the “pulldown-cmark” crate on crates.io.

I’d like for this to become the engine in rustdoc. Right now it implements CommonMark (an effort to standardize Markdown) with no extensions, but if extensions are necessary I’m open to adding them.

Performance should be quite good. In my tests, it’s a little slower than hoedown, and a little faster than the cmark reference implementation written in C. I’m not going to post detailed benchmarks now because there’s probably a little more optimization to be done and I want to make sure my benchmark methodology is sound.

It was fun writing this. I felt the pull parser architecture meshed well with Rust’s iterators, and overall made it easy to write in a style that does very little allocation and copying.
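To illustrate the pull-parser-as-iterator idea, here is a toy sketch. This is not pulldown-cmark’s actual code; the `Event` shape and the `*`-emphasis handling are simplified stand-ins, just to show how a parser that is itself an `Iterator` lets consumers stream events with borrowed slices and no intermediate tree:

```rust
// Toy sketch of a pull parser: the parser is an Iterator over events,
// so consumers can stream, map, and filter, and text is emitted as
// borrowed slices of the input (no allocation or copying).
// NOT pulldown-cmark's real code; Event and the parsing are simplified.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Start(&'a str), // start of a tag, e.g. "em"
    End(&'a str),   // end of a tag
    Text(&'a str),  // literal text, borrowed from the input
}

struct ToyParser<'a> {
    rest: &'a str,
    in_emph: bool,
}

impl<'a> ToyParser<'a> {
    fn new(input: &'a str) -> Self {
        ToyParser { rest: input, in_emph: false }
    }
}

impl<'a> Iterator for ToyParser<'a> {
    type Item = Event<'a>;

    fn next(&mut self) -> Option<Event<'a>> {
        if self.rest.is_empty() {
            return None;
        }
        match self.rest.find('*') {
            Some(0) => {
                // a bare `*` toggles emphasis
                self.rest = &self.rest[1..];
                self.in_emph = !self.in_emph;
                Some(if self.in_emph { Event::Start("em") } else { Event::End("em") })
            }
            Some(pos) => {
                // emit text up to the next `*` as a borrowed slice
                let text = &self.rest[..pos];
                self.rest = &self.rest[pos..];
                Some(Event::Text(text))
            }
            None => {
                let text = self.rest;
                self.rest = "";
                Some(Event::Text(text))
            }
        }
    }
}

fn main() {
    let events: Vec<_> = ToyParser::new("Hello *world*").collect();
    assert_eq!(
        events,
        vec![
            Event::Text("Hello "),
            Event::Start("em"),
            Event::Text("world"),
            Event::End("em"),
        ]
    );
}
```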

I hope you find it useful! Feedback is welcome.


#2

This looks neat :smile:

Is there an extension API? To extend the functionality or overwrite current behaviour?


#3

There is not one yet, but I’m considering adding one. That said, while I think some extensions would be reasonably straightforward to support generically (new inline markup like math and strikethrough), other extensions would probably require modifying the scanners by hand. Those would probably work like hoedown: you set options when creating the parser, and that turns the extensions on.
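A flag-style options API along the lines described might look like the sketch below. This is purely hypothetical; none of these constants or types exist in pulldown-cmark:

```rust
// Hypothetical sketch of hoedown-style extension flags, set once when the
// parser is created. All names here are invented for illustration.
#[derive(Clone, Copy)]
struct Options(u32);

const OPTION_TABLES: u32 = 1 << 0;
const OPTION_FOOTNOTES: u32 = 1 << 1;
const OPTION_STRIKETHROUGH: u32 = 1 << 2;

impl Options {
    fn empty() -> Options { Options(0) }
    fn with(self, flag: u32) -> Options { Options(self.0 | flag) }
    fn contains(self, flag: u32) -> bool { self.0 & flag != 0 }
}

fn main() {
    // the caller would pick extensions when constructing the parser
    let opts = Options::empty().with(OPTION_TABLES).with(OPTION_STRIKETHROUGH);
    assert!(opts.contains(OPTION_TABLES));
    assert!(!opts.contains(OPTION_FOOTNOTES));
}
```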


#4

Just last week we were discussing whether a pure-Rust implementation of CommonMark existed with which we could replace Hoedown in rustdoc. This sounds great!


#5

That would be a great feature :wink:
There have been some implementations of markdown in Rust, but I haven’t heard of one supporting custom extensions yet. It would be nice to have that kind of control (like in the python markdown module).


#6

Woah, nice!

One extension rustdoc definitely uses is tables: example, source.


#7

This is really cool! Would love to see parsers for CSS preprocessor languages like Sass/SCSS implemented too!


#8

Personally I prefer Stylus, but I agree. That would probably yield an enormous speedup over the Node-based parser.


#9

Very cool! I’ve thought about implementing a Markdown parser in Rust myself, but well… too much work and too little time… :wink:

A pull parser seems like quite a nice idea for this, although I haven’t really looked into the mechanics of your implementation. You mention compatibility with CommonMark. Is there a way to run the test suite?

I might be inclined to try and write some of the ‘common’ CommonMark extensions if you give me a ‘getting started’ guide.


#10

Well, I believe I’m your first user; the crates.io download count was 0 when I got there, and 1 when I left. :smiley:

I swapped out hoedown for pulldown-cmark in cargo-script (the commit in question is here) in an experimental branch. Some feedback:

  1. Event doesn’t derive Debug. I have filed an issue. (Looks like I’m #1 on that, too :stuck_out_tongue: #2, curses!)

  2. Considering there is no documentation, it was pretty straightforward. The interface was a lot easier to understand as well. If it weren’t for the missing Debug impl on Event, the code would have been noticeably shorter, too.

  3. It passed the existing tests on the first try. Just to emphasise how impressed I am: it did so when I still wasn’t entirely sure what events would be generated for a fenced code block (which I was trying to extract). I just guessed from the definition of the Event and Tag enums.

Like I said, it’s in an experimental branch, but… I can’t really see any reason not to merge it into the next version at this point.

:+1:


#11

:confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball:

This is wonderful

We definitely need/want extension support, since we need the ability to enable and disable features via command-line flags.

See https://github.com/rust-lang/rust/blob/master/src/librustdoc/html/markdown.rs#L53-L66 for the custom flags we pass into hoedown today.


#12

Something I yearn for in our doc parser is the ability for us to “handle” special URLs, e.g. a hook for doc:std::collections::BTreeMap to get passed back to us to be interpreted as the desired URL. Would this be viable in your design?


#13

@raphlinus Would you mind adding the MIT license to this project so it has the same license as the rest of the Rust source? Since there’s a good chance this could end up part of Rust we need the licensing to be right, and better do that now than later.

(The reason we can’t do Apache 2 only is that it is not compatible with GPLv2.)


#14

Great idea, easy linking of other doc items would be really neat! From what I’ve read of the code so far, this might work:

    let parser = parser.map(|event| match event {
        Event::Start(Tag::Link(url, title)) => {
            let new_url = if doc_link_pattern.is_match(&url) {
                expand_doc_url(&url)
            } else {
                url
            };
            Event::Start(Tag::Link(new_url, title))
        }
        _ => event,
    });

#15

If it can be done in this way, I think it’s great. It’s exactly what the iterator API is designed for. The other idea I had was to have a callback for looking up link refs, so you could write (for example) [Vec] and the callback would supply an optional url and title. This would be conceptually identical to appending a bunch of these to the source, for every possible input the callback can process:

[Vec]: https://doc.rust-lang.org/std/vec/struct.Vec.html "Struct std::vec::Vec"

My feeling is that we want to be sensitive to how well the Markdown will work when cut and pasted into a different processor, one without the fancy URL processing. What I like about the callback idea is that it renders as [Vec] in that case, which I think is suggestive of “there is a link but we don’t have it”.
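A rough sketch of what such a callback could look like (the function name and signature here are invented for illustration, not an existing API):

```rust
// Invented sketch of the reference-lookup callback idea: when the parser
// hits a shortcut reference like [Vec] with no matching definition, it
// asks the callback for an (url, title) pair, exactly as if the
// definition line had been appended to the source.
fn lookup(reference: &str) -> Option<(String, String)> {
    match reference {
        "Vec" => Some((
            "https://doc.rust-lang.org/std/vec/struct.Vec.html".to_string(),
            "Struct std::vec::Vec".to_string(),
        )),
        // unknown references fall through and render literally as [Vec]
        _ => None,
    }
}

fn main() {
    assert!(lookup("Vec").is_some());
    assert!(lookup("NotAThing").is_none());
}
```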


#16

Just so people know, we’re having this discussion in an email thread.


#17

I’m concerned that mapping over all elements will have considerable overhead compared to just having the HTML rendering/parsing call a callback at the right point.

That said, it’s certainly maximally extensible!


#18

Very useful to see the list. Tables I already knew about from @huon. I think “no intra emphasis”, fenced code, and autolinks are already standard in CommonMark. I haven’t researched autolinks deeply, but I do know that while some Markdown variants parse email addresses and URLs in plain text, CommonMark requires angle brackets around them. I recommend the latter.

Superscript and strikethrough sound pretty easy, I imagine they’re just new inline markup similar to * and _.

Does Rust source actually need footnotes? Those feel like the trickiest, not so much because of the implementation, but to understand the way it should be spec’ed, especially the potential interactions with other elements.


#19

Sorry, I probably wasn’t as clear as I should have been. I am proposing a callback, so it only processes the link when it occurs, but that the result is conceptually the same as if you had mapped over all possible links. I’m thinking along these lines because I want the diff from the CommonMark spec to be minimal and very easy to understand; my feeling is that this qualifies.


#20

It looks like grammar.md, reference.md, and a couple of parts of TRPL use footnotes. All in all, maybe 10 usages.

Roughly, the spec (at least for parsing) seems to amount to handling something like reference-style links.

[^foo] specifies where to inject a footnote

[^foo]: blah blah blah specifies what the footnote’s text is.

    git grep "\[^.*\]" -- ./*/*.md
    src/doc/grammar.md:explicit codepoint lists. [^inputformat]
    src/doc/grammar.md:[^inputformat]: Substitute definitions for the special Unicode productions are
    src/doc/grammar.md:The `ident` production is any nonempty Unicode[^non_ascii_idents] string of
    src/doc/grammar.md:[^non_ascii_idents]: Non-ASCII characters in identifiers are currently feature
    src/doc/reference.md:explicit code point lists. [^inputformat]
    src/doc/reference.md:[^inputformat]: Substitute definitions for the special Unicode productions are
    src/doc/reference.md:An identifier is any nonempty Unicode[^non_ascii_idents] string of the following form:
    src/doc/reference.md:[^non_ascii_idents]: Non-ASCII characters in identifiers are currently feature
    src/doc/reference.md:run-time.[^phase-distinction] Those semantic rules that have a *static
    src/doc/reference.md:[^phase-distinction]: This distinction would also exist in an interpreter.
    src/doc/reference.md:library.[^cratesourcefile]
    src/doc/reference.md:[^cratesourcefile]: A crate is somewhat analogous to an *assembly* in the
    src/doc/reference.md:*fields* of the type.[^structtype]
    src/doc/reference.md:[^structtype]: `struct` types are analogous to `struct` types in C,
    src/doc/reference.md:by the name of an [`enum` item](#enumerations). [^enumtype]
    src/doc/reference.md:[^enumtype]: The `enum` type is analogous to a `data` constructor declaration in
    src/doc/trpl/macros.md:We can implement this shorthand, using a macro: [^actual]
    src/doc/trpl/macros.md:[^actual]: The actual definition of `vec!` in libcollections differs from the
    src/doc/trpl/the-stack-and-the-heap.md:it doesn’t.[^moving] When the function is over, we need to free the stack frame
    src/doc/trpl/the-stack-and-the-heap.md:[^moving]: We can make the memory live longer by transferring ownership,