Substitutions of tokens in a stream


#1

Hi everybody! I have a “simple” question. Basically, I want to write something that, given a text input of arbitrary length, converts emoji shortcodes to the corresponding emoji (e.g. :cat: to the cat emoji, using images from emojione). I am aware that some libraries already exist (e.g. rust-emojicons), but the ones I found use regular expressions, and I don’t want to use that approach. Before you accuse me of reinventing the wheel: my only goal for now is learning, not producing yet another emoji library. If something good comes out of this experiment I will of course release it into the wild, but that is not my current goal.

The approach I’m trying is to build something with nom, since its zero-copy capabilities should let me satisfy the arbitrary-length requirement. The issue I’m having at the moment is nom’s learning curve, which is making things harder for me (I’m in no hurry, so it’s not really a problem). Another approach I thought about, given that the set of emojis is finite, is to build a tree of the possible byte sequences (both shortcodes and Unicode representations) and substitute whenever I reach a leaf.
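To make the tree idea concrete, here is a minimal std-only sketch of a byte trie with longest-match substitution. The names `Node`, `insert`, `longest_match` and `substitute` are made up for illustration, the two codes in the usage note are just samples, and the sketch works on an in-memory `&str` rather than a true stream.

```rust
use std::collections::HashMap;

/// One trie node; `replacement` is set when a complete code ends here.
#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    replacement: Option<&'static str>,
}

impl Node {
    /// Add `code` (e.g. ":cat:") with the text that should replace it.
    fn insert(&mut self, code: &str, replacement: &'static str) {
        let mut node = self;
        for &b in code.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.replacement = Some(replacement);
    }

    /// Longest code starting at the beginning of `input`, as (byte length, replacement).
    fn longest_match(&self, input: &[u8]) -> Option<(usize, &'static str)> {
        let mut node = self;
        let mut best = None;
        for (i, b) in input.iter().enumerate() {
            match node.children.get(b) {
                Some(next) => node = next,
                None => break,
            }
            if let Some(r) = node.replacement {
                best = Some((i + 1, r));
            }
        }
        best
    }
}

/// Replace every known code in `text`, copying everything else unchanged.
fn substitute(trie: &Node, text: &str) -> String {
    let bytes = text.as_bytes();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < bytes.len() {
        if let Some((len, r)) = trie.longest_match(&bytes[i..]) {
            out.push_str(r);
            i += len;
        } else {
            // No code starts here: copy one whole UTF-8 character.
            let ch = text[i..].chars().next().unwrap();
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}
```

With `:cat:` and `:v:` inserted, `substitute(&trie, "a :cat: says hi :v:")` yields `"a 🐱 says hi ✌"`. Note that this restarts the trie walk at every position, so it can re-read bytes after a failed partial match.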

Am I choosing the right approaches? Is there something smarter I could do?

Thanks! :v:


#2

@druzn3k and I spoke on IRC and I suggested using Aho-Corasick since the problem boils down to finding a set of fixed substrings in text.
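For readers who have not met Aho-Corasick: it is a trie of the patterns plus "failure links", so the text is scanned once without backing up. Below is a rough hand-rolled sketch using only std, not the API of the aho-corasick crate. The `Automaton` type and its methods are hypothetical names, patterns are assumed to be non-empty, and matches are replaced as soon as one ends (non-overlapping).

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative byte-level Aho-Corasick automaton (not the crate's API).
struct Automaton {
    next: Vec<HashMap<u8, usize>>, // trie edges for each state
    fail: Vec<usize>,              // failure link for each state
    hit: Vec<Option<usize>>,       // longest pattern ending at each state
    lens: Vec<usize>,              // pattern lengths in bytes
}

impl Automaton {
    /// Build the automaton; patterns are assumed to be non-empty.
    fn new(patterns: &[&str]) -> Self {
        let mut next = vec![HashMap::new()];
        let mut hit = vec![None];
        for (id, p) in patterns.iter().enumerate() {
            let mut s = 0;
            for &b in p.as_bytes() {
                let found = next[s].get(&b).copied();
                s = match found {
                    Some(t) => t,
                    None => {
                        next.push(HashMap::new());
                        hit.push(None);
                        let t = next.len() - 1;
                        next[s].insert(b, t);
                        t
                    }
                };
            }
            hit[s] = Some(id);
        }
        // Breadth-first pass: states directly under the root keep fail = 0.
        let mut fail = vec![0; next.len()];
        let mut queue: VecDeque<usize> = next[0].values().copied().collect();
        while let Some(s) = queue.pop_front() {
            for (&b, &t) in &next[s] {
                let mut f = fail[s];
                while f != 0 && !next[f].contains_key(&b) {
                    f = fail[f];
                }
                fail[t] = next[f].get(&b).copied().unwrap_or(0);
                if hit[t].is_none() {
                    // Inherit a shorter pattern that ends at the same position.
                    hit[t] = hit[fail[t]];
                }
                queue.push_back(t);
            }
        }
        let lens = patterns.iter().map(|p| p.len()).collect();
        Automaton { next, fail, hit, lens }
    }

    /// Replace pattern `i` with `replacements[i]` in a single pass over `text`.
    fn replace_all(&self, text: &str, replacements: &[&str]) -> String {
        let mut out = String::with_capacity(text.len());
        let (mut state, mut last) = (0, 0);
        for (i, b) in text.bytes().enumerate() {
            while state != 0 && !self.next[state].contains_key(&b) {
                state = self.fail[state];
            }
            state = self.next[state].get(&b).copied().unwrap_or(0);
            if let Some(id) = self.hit[state] {
                let start = i + 1 - self.lens[id];
                out.push_str(&text[last..start]);
                out.push_str(replacements[id]);
                last = i + 1;
                state = 0; // restart so matches never overlap
            }
        }
        out.push_str(&text[last..]);
        out
    }
}
```

Building it with `Automaton::new(&[":cat:", ":v:"])` and calling `replace_all("Thanks! :v:", &["🐱", "✌"])` gives `"Thanks! ✌"`. For real use the crate is the better choice; this only shows the shape of the algorithm.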

The approach taken in the rust-emojicons crate is also reasonable, assuming the rate of false positives is low (which I expect it would be for normal prose).

Probably a parser combinator library is overkill for this.

The part where you need to map from the emoji text to the Unicode codepoint is probably best done with the phf crate, although the fst crate could also work (with more friction, admittedly).
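To illustrate the lookup step without pulling in a crate, here is a std-only stand-in: a sorted static table searched with binary search. This is not phf (which generates a perfect hash table at compile time), and the `EMOJI` table and `lookup` function are hypothetical, with only a few sample entries.

```rust
/// Shortcode -> emoji pairs, kept sorted by shortcode so binary search works.
/// A tiny illustrative sample, not a complete emoji table.
static EMOJI: &[(&str, &str)] = &[
    ("cat", "\u{1F431}"),          // cat face
    ("slight_smile", "\u{1F642}"), // slightly smiling face
    ("v", "\u{270C}"),             // victory hand
];

/// Look up a shortcode given without its surrounding colons.
fn lookup(shortcode: &str) -> Option<&'static str> {
    EMOJI
        .binary_search_by(|&(name, _)| name.cmp(shortcode))
        .ok()
        .map(|i| EMOJI[i].1)
}
```

Here `lookup("cat")` returns `Some("🐱")` and an unknown name such as `lookup("dog")` returns `None`. The table lives in static memory, which is the same property that makes phf attractive for this job.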


#3

I learned about three cool crates today, thanks!


#4

You’re welcome, @cbreeden :slight_smile: