Nom, &str vs &[u8] as input type in text parser


#1

Hey there,
I’m exploring the nom crate atm and I’m having a great time so far!
But there is one thing I don’t get so far:
nom can work with string slices (&str) as well as byte slices (&[u8]) as input.

I prefer strings as input, since my input is text and my output is either parsed data or strings as well. But all of the other projects I have seen so far use byte slices as input, even though they only process text too.
Is there any particular reason for this?

It doesn’t change the code much, except that it gets cluttered with str::from_utf8 calls.

Thanks in advance,
Anton


#2

The only way to answer this question is to look at the actual problem you’re trying to solve. Without knowing the higher level problem you’re addressing, it’s probably not possible to say, in general, whether you should be using &[u8] or &str (or both). If you’re writing core search primitives, for example, then it’s probably best to accept a &[u8] (or at least provide an alternative API for it). If you’re writing routines for natural language processing, then you might want to demand a &str for input.

If you’re writing str::from_utf8 in multiple places and aren’t sure why, then there’s either an issue in the design somewhere, or you just shouldn’t offer &[u8] APIs and instead require callers to do UTF-8 conversion.
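To illustrate the second option, here is a minimal sketch that validates UTF-8 exactly once at the boundary and keeps the parser itself on &str. It uses only the standard library; parse_pair and parse_pair_from_bytes are hypothetical names made up for this example, not part of nom.

```rust
use std::str::Utf8Error;

/// Hypothetical text-only parser: splits "key=value" at the first '='
/// and trims whitespace around both halves.
fn parse_pair(input: &str) -> Option<(&str, &str)> {
    let (key, value) = input.split_once('=')?;
    Some((key.trim(), value.trim()))
}

/// Boundary function: the only place where str::from_utf8 is called.
/// Everything behind it can assume valid UTF-8.
fn parse_pair_from_bytes(raw: &[u8]) -> Result<Option<(&str, &str)>, Utf8Error> {
    let text = std::str::from_utf8(raw)?;
    Ok(parse_pair(text))
}
```

Callers holding raw bytes go through parse_pair_from_bytes once; the rest of the code never sees a &[u8].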

Which projects?


#3

What do you mean by core search primitives?

For example, this iso8601 date parser uses CompleteByteSlice as input, which is a wrapper around &[u8]. The INI file parser from the nom repo works on bytes as well.
The INI parser especially makes me wonder, because as far as I can see, an INI file is just text.


#4

Could you please elaborate on the problem you’re trying to solve? That’s the most important next step to answering your question.

By core search primitives I mean regexes, substring search, and replacements: mostly anything that can be done on raw bytes and is otherwise encoding agnostic.

Again, it depends on the problem you’re trying to solve. If you want an INI parser that works on any ASCII compatible encoding (e.g., latin1, ascii, utf8) without any additional transcoding, then you can write your parser to be on raw bytes.
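As a rough sketch of what that looks like, the function below splits a single "key = value" line while only ever inspecting ASCII bytes ('=' and ASCII whitespace), so any other byte passes through untouched. It uses only the standard library; parse_ini_line and trim_ascii_ws are hypothetical helpers written for this example, not taken from the nom INI parser.

```rust
/// Trim ASCII whitespace from both ends of a byte slice.
fn trim_ascii_ws(mut bytes: &[u8]) -> &[u8] {
    while let [first, rest @ ..] = bytes {
        if first.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    while let [rest @ .., last] = bytes {
        if last.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    bytes
}

/// Hypothetical INI line parser: splits at the first '=' byte.
/// Only ASCII bytes are interpreted, so the value may be latin1, UTF-8,
/// or any other ASCII-compatible encoding.
fn parse_ini_line(line: &[u8]) -> Option<(&[u8], &[u8])> {
    let eq = line.iter().position(|&b| b == b'=')?;
    let (key, rest) = line.split_at(eq);
    // rest starts with the '=' byte itself, so skip it.
    Some((trim_ascii_ws(key), trim_ascii_ws(&rest[1..])))
}
```

For instance, b"name = caf\xe9" (latin1 for "café") is not valid UTF-8, so a &str-based parser could not accept it without transcoding first, while this one returns the value bytes unchanged.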


#5

Yeah, sorry, of course.
I’m writing a toy http header parser.

The point about different encodings is very interesting. I didn’t think of that at all, and it makes perfect sense, thank you :slight_smile:


#6

I would probably use &[u8] then, based on this: https://stackoverflow.com/questions/818122/which-encoding-is-used-by-the-http-protocol

Indeed, hyper’s httparse crate accepts an &[u8].
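For a toy parser, a byte-based header line function could look roughly like the sketch below. This is not httparse's API, just a standard-library illustration; parse_header is a hypothetical name, and the name check is a simplification (it accepts only ASCII letters, digits, and '-', which is narrower than what HTTP allows). The header name is ASCII, so it can safely become a &str, while the value stays as raw bytes.

```rust
/// Hypothetical toy parser for one "Name: value" header line
/// (without the trailing CRLF).
fn parse_header(line: &[u8]) -> Option<(&str, &[u8])> {
    let colon = line.iter().position(|&b| b == b':')?;
    let (name, rest) = line.split_at(colon);

    // Simplified assumption: names are non-empty and consist of
    // ASCII letters, digits, and '-'.
    let name_ok = !name.is_empty()
        && name.iter().all(|b| b.is_ascii_alphanumeric() || *b == b'-');
    if !name_ok {
        return None;
    }
    // Cannot fail after the ASCII check above, but avoids unsafe code.
    let name = std::str::from_utf8(name).ok()?;

    // Skip the ':' and strip spaces and tabs around the value.
    let mut value = &rest[1..];
    while let [b' ' | b'\t', tail @ ..] = value {
        value = tail;
    }
    while let [head @ .., b' ' | b'\t'] = value {
        value = head;
    }
    Some((name, value))
}
```

The caller can then decide per header whether the value needs to be decoded as text at all.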


#7

Thank you very much, then I’ll change that. :blush: