Unicode tokenizer for no_std environment

I am somewhat new to Rust programming, and I am looking for small projects to improve my skills. I thought I could write a Unicode tokenizer that runs over one or more UTF-8 byte buffers and produces Unicode tokens. It targets no_std, so no heap memory will be used. There will be an option to transform CR and CR/LF into a standardized line-ending convention (a rough sketch of that option is below).
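Something like this iterator adapter, using only `core` so nothing is heap-allocated (all names here are made up for illustration):

```rust
use core::iter::Peekable;

struct NormalizeNewlines<I: Iterator<Item = char>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = char>> NormalizeNewlines<I> {
    fn new(iter: I) -> Self {
        Self { inner: iter.peekable() }
    }
}

impl<I: Iterator<Item = char>> Iterator for NormalizeNewlines<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        match self.inner.next()? {
            '\r' => {
                // Collapse a CR/LF pair, or a lone CR, into a single LF.
                if self.inner.peek() == Some(&'\n') {
                    self.inner.next();
                }
                Some('\n')
            }
            c => Some(c),
        }
    }
}
```

With that, `NormalizeNewlines::new("a\r\nb\rc".chars())` should compare equal to `"a\nb\nc".chars()`.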

Do you know if there are existing no_std crates which perform this function?

I don't know if there's anything that exactly matches what you want to do, but there are some related no_std-compatible crates that you might find useful:

  • unicode-segmentation to find grapheme cluster and word boundaries, etc. (see the short sketch after this list)
  • nom is a parser-combinator library that could be used to write a tokenizer.
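
For example, unicode-segmentation operates on `&str` and its iterators borrow from the input, so as far as I know it works without allocation in no_std. A rough sketch (I haven't verified every detail):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn count_graphemes(s: &str) -> usize {
    // `true` selects extended grapheme cluster rules (UAX #29).
    // The iterator borrows from `s`, so no heap is involved.
    s.graphemes(true).count()
}
```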

It's not clear to me what you mean by a "token" here, as that word isn't defined in the glossary: https://www.unicode.org/glossary/#T

You might get more precise help if you explained specifically what the goal is.

The function is to produce Unicode codepoints sequentially from one or more byte arrays. It will correctly handle UTF-8 sequences that straddle the boundary between two byte arrays, among other minor duties. I might add an option to skip a leading BOM. A rough sketch of what I mean follows.
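Something along these lines, using only `core`; the names are made up and the error handling is deliberately simplified (invalid bytes become U+FFFD):

```rust
pub struct Utf8Decoder {
    pending: [u8; 4], // at most 3 bytes of a fragment carry over
    len: usize,
}

impl Utf8Decoder {
    pub const fn new() -> Self {
        Self { pending: [0; 4], len: 0 }
    }

    /// Feed one buffer; `sink` receives each decoded codepoint. A trailing
    /// incomplete sequence is stashed and completed on the next call.
    pub fn feed(&mut self, mut buf: &[u8], mut sink: impl FnMut(char)) {
        // First finish any fragment left over from the previous buffer.
        while self.len > 0 && !buf.is_empty() {
            self.pending[self.len] = buf[0];
            self.len += 1;
            buf = &buf[1..];
            match core::str::from_utf8(&self.pending[..self.len]) {
                Ok(s) => {
                    sink(s.chars().next().unwrap());
                    self.len = 0;
                }
                // `error_len() == None` means "incomplete": keep reading.
                Err(e) if e.error_len().is_none() => {}
                Err(_) => {
                    sink(char::REPLACEMENT_CHARACTER);
                    self.len = 0;
                }
            }
        }
        // Then decode the body of this buffer.
        loop {
            match core::str::from_utf8(buf) {
                Ok(s) => {
                    s.chars().for_each(&mut sink);
                    return;
                }
                Err(e) => {
                    let (valid, rest) = buf.split_at(e.valid_up_to());
                    // `valid` has just been checked, so this is sound.
                    unsafe { core::str::from_utf8_unchecked(valid) }
                        .chars()
                        .for_each(&mut sink);
                    match e.error_len() {
                        // Incomplete trailing sequence: stash it.
                        None => {
                            self.pending[..rest.len()].copy_from_slice(rest);
                            self.len = rest.len();
                            return;
                        }
                        // Invalid bytes: emit U+FFFD and skip them.
                        Some(n) => {
                            sink(char::REPLACEMENT_CHARACTER);
                            buf = &rest[n..];
                        }
                    }
                }
            }
        }
    }
}
```

Feeding `b"\xE2\x82"` and then `b"\xAC"` should produce a single `'€'` (U+20AC).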

I thought there would already be crates that do similar things in the no_std environment. The main requirement is simply not to use any heap memory; otherwise it is nothing special, just basic functionality.