New CommonMark parser


#8

Personally I prefer Stylus, but I agree. That would probably yield an enormous speed-up over the node-based parser.


#9

Very cool! I’ve thought about implementing a Markdown parser in Rust myself, but well… too much work and too little time… :wink:

A pull-parser seems like quite a nice idea for this, although I haven’t really looked into the mechanics of your implementation. You mention compatibility with CommonMark. Is there a way to run the test suite?

I might be inclined to try and write some of the ‘common’ CommonMark extensions if you give me a ‘getting started’ guide.


#10

Well, I believe I’m your first user; the crates.io download count was 0 when I got there, and 1 when I left. :smiley:

I swapped out hoedown for pulldown-cmark in cargo-script (the commit in question is here) in an experimental branch. Some feedback:

  1. Event doesn’t derive Debug. I have filed an issue. (Looks like I’m #1 on that, too :stuck_out_tongue: #2, curses!)

  2. Considering there is no documentation, it was pretty straightforward. The interface was a lot easier to understand, too. If it weren’t for the missing Debug impl on Event, the code would also have been noticeably shorter.

  3. It passed the existing tests on the first try. Just to emphasise how impressed I am: it did so when I still wasn’t entirely sure what events would be generated for a fenced code block (which I was trying to extract). I just guessed from the definition of the Event and Tag enums.

Like I said, it’s in an experimental branch, but… I can’t really see any reason not to merge it into the next version at this point.

:+1:


#11

:confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball: :confetti_ball:

This is wonderful

We definitely want extension support, as we need the ability to enable and disable features via command-line flags.

See https://github.com/rust-lang/rust/blob/master/src/librustdoc/html/markdown.rs#L53-L66 for the custom flags we pass into hoedown today.


#12

Something I yearn for in our doc parser is the ability for us to “handle” special URLs, e.g. a hook for `doc:std::collections::BTreeMap` to get passed back to us to be interpreted as the desired URL. Would this be viable in your design?


#13

@raphlinus Would you mind adding the MIT license to this project so it has the same license as the rest of the Rust source? Since there’s a good chance this could end up part of Rust we need the licensing to be right, and better do that now than later.

(The reason we can’t do Apache 2.0 only is that it is not compatible with GPLv2.)


#14

Great idea, easy linking of other doc items would be really neat! From what I’ve read of the code so far, this might work:

let parser = parser.map(|event| match event {
    Event::Start(Tag::Link(url, title)) => {
        // Rewrite doc: links; pass everything else through unchanged.
        let new_url = if doc_link_pattern.is_match(&url) {
            expand_doc_url(&url)
        } else {
            url
        };
        Event::Start(Tag::Link(new_url, title))
    }
    _ => event,
});

#15

If it can be done in this way, I think it’s great. It’s exactly what the iterator API is designed for. The other idea I had was to have a callback for looking up link refs, so you could write (for example) [Vec] and the callback would supply an optional url and title. This would be conceptually identical to appending a bunch of these to the source, for every possible input the callback can process:

[Vec]: https://doc.rust-lang.org/std/vec/struct.Vec.html "Struct std::vec::Vec"

My feeling is that we want to be sensitive to how well the Markdown will work when cut and pasted into a different processor, one without the fancy URL processing. What I like about the callback idea is that it renders as [Vec] in that case, which I think is suggestive of “there is a link but we don’t have it”.
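As a rough illustration of the callback idea (all names here are hypothetical, not an actual pulldown-cmark API): the parser would ask a callback for a URL/title pair whenever a shortcut reference like `[Vec]` has no matching definition in the source, and fall back to literal `[Vec]` text when the callback declines.

```rust
// Hypothetical sketch of the "link reference callback" idea: when a
// shortcut reference like [Vec] has no definition in the source, ask a
// callback for an (url, title) pair. Returning None leaves the text
// rendered literally as "[Vec]", which degrades gracefully elsewhere.
fn resolve_ref(label: &str, lookup: &dyn Fn(&str) -> Option<(String, String)>) -> String {
    match lookup(label) {
        Some((url, title)) => format!("<a href=\"{}\" title=\"{}\">{}</a>", url, title, label),
        None => format!("[{}]", label),
    }
}

fn main() {
    // A doc-aware callback: conceptually the same as if the source ended with
    // [Vec]: https://doc.rust-lang.org/std/vec/struct.Vec.html "Struct std::vec::Vec"
    let lookup = |label: &str| -> Option<(String, String)> {
        if label == "Vec" {
            Some((
                "https://doc.rust-lang.org/std/vec/struct.Vec.html".to_string(),
                "Struct std::vec::Vec".to_string(),
            ))
        } else {
            None
        }
    };

    println!("{}", resolve_ref("Vec", &lookup));
    println!("{}", resolve_ref("Unknown", &lookup)); // prints "[Unknown]"
}
```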


#16

Just so people know, we’re having this discussion in an email thread.


#17

I’m concerned that mapping over all elements will have considerable overhead compared to just having the HTML rendering/parsing call a callback at the right point.

That said, it’s certainly maximally extensible!


#18

Very useful to see the list. Tables I already knew about from @huon. I think “no intra emphasis”, fenced code, and autolinks are already standard in CommonMark. I haven’t researched autolinks deeply, but I know that while some Markdown variants parse email addresses and URLs in plain text, CommonMark requires angle brackets around them. I recommend the latter.

Superscript and strikethrough sound pretty easy, I imagine they’re just new inline markup similar to * and _.
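To show why strikethrough feels easy, here is a toy sketch (my own illustration, not the parser’s actual mechanics) treating `~~` as just another inline delimiter: scan for an opener, find the matching closer, and wrap the span. A real parser would track a delimiter stack the way it does for `*` and `_`; this single pass only handles non-nested spans.

```rust
// Toy sketch of strikethrough as "just another inline delimiter":
// rewrite ~~text~~ to <del>text</del>, leaving everything else alone.
fn strikethrough(input: &str) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(open) = rest.find("~~") {
        out.push_str(&rest[..open]);
        let after_open = &rest[open + 2..];
        if let Some(close) = after_open.find("~~") {
            out.push_str("<del>");
            out.push_str(&after_open[..close]);
            out.push_str("</del>");
            rest = &after_open[close + 2..];
        } else {
            // Unmatched opener: keep it literal, as Markdown usually does.
            out.push_str("~~");
            rest = after_open;
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    assert_eq!(strikethrough("a ~~b~~ c"), "a <del>b</del> c");
    assert_eq!(strikethrough("no close ~~here"), "no close ~~here");
    println!("{}", strikethrough("a ~~b~~ c"));
}
```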

Does Rust source actually need footnotes? Those feel like the trickiest, not so much because of the implementation, but to understand the way it should be spec’ed, especially the potential interactions with other elements.


#19

Sorry, I probably wasn’t as clear as I should have been. I am proposing a callback, so it only processes the link when it occurs, but that the result is conceptually the same as if you had mapped over all possible links. I’m thinking along these lines because I want the diff from the CommonMark spec to be minimal and very easy to understand; my feeling is that this qualifies.


#20

It looks like grammar.md, reference.md, and a couple parts of TRPL use footnotes. All in all maybe 10 usages.

Roughly, the spec (at least for parsing) seems to amount to handling reference-style links:

`[^foo]` specifies where to inject a footnote.

`[^foo]: blah blah blah` specifies what the footnote’s text is.

    git grep "\[^.*\]" -- ./*/*.md
    src/doc/grammar.md:explicit codepoint lists. [^inputformat]
    src/doc/grammar.md:[^inputformat]: Substitute definitions for the special Unicode productions are
    src/doc/grammar.md:The `ident` production is any nonempty Unicode[^non_ascii_idents] string of
    src/doc/grammar.md:[^non_ascii_idents]: Non-ASCII characters in identifiers are currently feature
    src/doc/reference.md:explicit code point lists. [^inputformat]
    src/doc/reference.md:[^inputformat]: Substitute definitions for the special Unicode productions are
    src/doc/reference.md:An identifier is any nonempty Unicode[^non_ascii_idents] string of the following form:
    src/doc/reference.md:[^non_ascii_idents]: Non-ASCII characters in identifiers are currently feature
    src/doc/reference.md:run-time.[^phase-distinction] Those semantic rules that have a *static
    src/doc/reference.md:[^phase-distinction]: This distinction would also exist in an interpreter.
    src/doc/reference.md:library.[^cratesourcefile]
    src/doc/reference.md:[^cratesourcefile]: A crate is somewhat analogous to an *assembly* in the
    src/doc/reference.md:*fields* of the type.[^structtype]
    src/doc/reference.md:[^structtype]: `struct` types are analogous to `struct` types in C,
    src/doc/reference.md:by the name of an [`enum` item](#enumerations). [^enumtype]
    src/doc/reference.md:[^enumtype]: The `enum` type is analogous to a `data` constructor declaration in
    src/doc/trpl/macros.md:We can implement this shorthand, using a macro: [^actual]
    src/doc/trpl/macros.md:[^actual]: The actual definition of `vec!` in libcollections differs from the
    src/doc/trpl/the-stack-and-the-heap.md:it doesn’t.[^moving] When the function is over, we need to free the stack frame
    src/doc/trpl/the-stack-and-the-heap.md:[^moving]: We can make the memory live longer by transferring ownership,
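The reference-style handling described above can be sketched roughly like this (a toy scanner of my own, not a spec-conformant implementation): collect `[^label]: text` definitions in one pass, then resolve each `[^label]` reference against that map.

```rust
use std::collections::HashMap;

// Toy sketch of the two halves of footnote parsing described above:
// a definition line "[^foo]: blah" maps the label to its text, and a
// reference "[^foo]" elsewhere is later resolved against that map.
// Real footnote defs can span multiple lines; this only handles one.
fn collect_footnote_defs(source: &str) -> HashMap<String, String> {
    let mut defs = HashMap::new();
    for line in source.lines() {
        if let Some(rest) = line.strip_prefix("[^") {
            if let Some(end) = rest.find("]:") {
                let label = rest[..end].to_string();
                let text = rest[end + 2..].trim().to_string();
                defs.insert(label, text);
            }
        }
    }
    defs
}

fn main() {
    let doc = "explicit codepoint lists. [^inputformat]\n\
               [^inputformat]: Substitute definitions are given elsewhere.\n";
    let defs = collect_footnote_defs(doc);
    assert_eq!(
        defs.get("inputformat").map(String::as_str),
        Some("Substitute definitions are given elsewhere.")
    );
    println!("{:?}", defs);
}
```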


#21

I didn’t even realize footnotes were part of the spec, and had I known, there would probably be way more of them :smile:


#22

One other solution to this that came up in the RST discussion was having a custom URI scheme for addressing dynamic links.


#23

This is really cool, do you think you could spend a few minutes writing some sort of post about your experience with Rust and writing this parser? I would love to hear from someone with as much experience as you about how you feel about the language.

Also, jgm seems to think that this is a faster implementation than the C reference implementation!


#24

I’ll probably write something. Overall it was a good experience. Since I was going for performance (it is faster than the cmark reference implementation), I had to go fairly low level; most of my scanning methods deal directly with UTF-8 byte arrays. Other than that, I felt that the language and libraries (especially iterators) helped get this done in a clean way. It’s all pure safe Rust 1.0, and I didn’t feel a strong need to step outside that, even in the pursuit of performance.

I think the combination of safety and performance really is a big deal. While I was researching edge cases, I came across an undefined-behavior bug in cmark. This is in spite of the library being “extensively fuzz-tested using american fuzzy lop”.

Anyway, glad you like it!


#25

License is MIT now: https://github.com/google/pulldown-cmark/commit/dd7054e625022128455aac5c2d8cec2f244afadf

Just need to wait for a new release.


#26

Holy shit yesssss :confetti_ball:

So hype to clean up rustdoc


#27

Hey @raphlinus! What ever happened with this?