Type cast return type possible?

So I have this function that I think needs to be looked at first:

fn pchar<'a, I, E>() 
-> impl Fn(I) -> nom::IResult<I, I, E>
where I: 'a + Clone + nom::InputTake + nom::Compare<&'static str> + nom::InputIter + nom::InputLength + nom::InputTakeAtPosition + std::iter::ExactSizeIterator + std::ops::Index<usize>,
      <I as nom::InputTakeAtPosition>::Item: nom::AsChar,
      <I as std::ops::Index<usize>>::Output: Sized,
      E: nom::ParseError<I>
{
    let homogenize_pct_encoded = |i: ((I, I), I)| {
        // Params `I` will always be a UTF-8 byte string. Expand the first slice to
        // include the values from the other slices so we don't have to use a Vec. So
        // long as the slices are contiguous, this won't actually be unsafe.
        let ptr = &(i.0).0[0] as *const _;
        let contiguous_matches = unsafe {
            // All slices are non overlapping, contiguous sub arrays of a greater array. 
            // ptr starts at first slice's first element and expands by the length
            // of itself + next slice + last slice
            //
            // [..., s1[0] -------- -------- -------- ]
            //        ptr   s1 len   s2 len   s3 len
            std::slice::from_raw_parts(ptr, (i.0).0.len() + (i.0).1.len() + (i.1).len())
        };

        // We are returning a contiguous result, so the parsed results must be contiguous.
        // These checks make sure we don't accidentally use a parser that gives discontiguous results.
        // COMMENTING OUT TEMPORARILY SO WE DONT NEED TO DEAL WITH DEBUG IMPLS FOR NOW
        //debug_assert_eq!(&contiguous_matches[0..(i.0).0.len()], (i.0).0);
        //debug_assert_eq!(&contiguous_matches[(i.0).0.len()..(i.0).0.len() + (i.0).1.len()], (i.0).1);
        //debug_assert_eq!(&contiguous_matches[(i.0).0.len() + (i.0).1.len()..(i.0).0.len() + (i.0).1.len() + (i.1).len()], i.1);
        // TODO: Should also debug_assert that the slices are not overlapping, but
        // even non overlapping slices can have the same values, so not sure how to test.
        // Maybe match ptr values?

        contiguous_matches
    };

    alt((
            unreserved(),
            nom::map(pct_encoded(), homogenize_pct_encoded),
            sub_delims(),
            tag(":"),
            tag("@"),
            ))
}

So what I'm doing here is attempting to parse an array of characters using pchar(). The return type as you can see is impl Fn(I) -> nom::IResult<I, I, E>. That's what the alt() function call returns there at the end.

Now I need that homogenize_pct_encoded lambda. All the types in the tuple going into alt() have to be the same, so we "homogenize" the one from the pct_encoded() parser. I can totally justify what I'm doing here, but let me get to the problem.

Compiling, I get back the message

--> src/lib.rs:386:5
    |
386 |     alt((
    |     ^^^ expected &[_], found type parameter
    |
    = note: expected type `std::result::Result<(I, &[_]), nom::Err<_>>`
               found type `std::result::Result<(I, I), nom::Err<_>>`
    = note: required because of the requirements on the impl of `nom::branch::Alt<I, I, _>` for `(impl std::ops::Fn<(I,)>, impl std::ops::Fn<(I,)>, impl std::ops::Fn<(I,)>, impl std::ops::Fn<(I,)>, impl std::ops::Fn<(I,)>)`
    = note: required by `nom::branch::alt`

Which means to me that the return type of homogenize_pct_encoded is &[_] and of course our alt() and our pchar() functions expect to return Result<(I, I), Err> so all the Fn(I) -> IResult.. arguments inside our alt(( .... )) call need to return Result<(I, I), Err>. They all do, except for nom::map(pct_encoded(), homogenize_pct_encoded). It returns Result<(I, &[_]), Err> because the homogenize_pct_encoded returns &[_].

How can I make homogenize_pct_encoded return the type I instead of &[_]?

You'd have to have some function that could turn &[_] into any given I. You could require I: From<&[T]>, I suppose (where &[T] is whatever slice type pct_encoded is returning) and then return contiguous_matches.into().
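A minimal sketch of what such a bound could look like, using only the standard library. The helper name `into_input` is made up for illustration and is not part of nom. Note that `&str` does not implement `From<&[u8]>`, so this bound alone would not cover string input.

```rust
// Hypothetical helper (not a nom API): convert a byte slice into any
// input type `I` that can be built from `&[u8]`.
fn into_input<'a, I>(bytes: &'a [u8]) -> I
where
    I: From<&'a [u8]>,
{
    // `&[u8]` satisfies the bound through the blanket `impl<T> From<T> for T`;
    // `Vec<u8>` satisfies it too, at the cost of copying the bytes.
    bytes.into()
}
```

Inside the closure this would amount to ending with `contiguous_matches.into()`, with the extra `From` bound added to the `where` clause of `pchar`.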

Does this function need to be generalized over I?


The only type I actually expect in homogenize_pct_encoded is &[u8]. I tried specifying only that as the input parameter type, but it led to the same error, except that it now mentions &[u8] instead of &[_].

I meant if pchar needed to be generalized over I.

No, it should only deal with &[u8]. It's the alt() nom function that is generalized

Okay I think I understand what you mean. I did this and it works

fn pchar<'a, E>() 
-> impl Fn(&'a [u8]) -> IResult<&'a [u8], &'a [u8], E>
where E: ParseError<&'a [u8]>
{
    let homogenize_pct_encoded = |i: ((&[u8], &[u8]), &[u8])| {
        // Params `&[u8]` will always be a UTF-8 byte string. Expand the first slice to
        // include the values from the other slices so we don't have to use a Vec. So
        // long as the slices are contiguous, this won't actually be unsafe.
        let ptr = &(i.0).0[0] as *const _;
        let contiguous_matches = unsafe {
            // All slices are non overlapping, contiguous sub arrays of a greater array. 
            // ptr starts at first slice's first element and expands by the length
            // of itself + next slice + last slice
            //
            // [..., s1[0] -------- -------- -------- ]
            //        ptr   s1 len   s2 len   s3 len
            std::slice::from_raw_parts(ptr, (i.0).0.len() + (i.0).1.len() + (i.1).len())
        };

        // We are returning a contiguous result, so the parsed results must be contiguous.
        // These checks make sure we don't accidentally use a parser that gives discontiguous results.
        debug_assert_eq!(&contiguous_matches[0..(i.0).0.len()], (i.0).0);
        debug_assert_eq!(&contiguous_matches[(i.0).0.len()..(i.0).0.len() + (i.0).1.len()], (i.0).1);
        debug_assert_eq!(&contiguous_matches[(i.0).0.len() + (i.0).1.len()..(i.0).0.len() + (i.0).1.len() + (i.1).len()], i.1);
        // TODO: Should also debug_assert that the slices are not overlapping, but
        // even non overlapping slices can have the same values, so not sure how to test.
        // Maybe match ptr values?

        contiguous_matches
    };

    alt((
            unreserved(),
            map(pct_encoded(), homogenize_pct_encoded),
            sub_delims(),
            tag(":"),
            tag("@"),
            ))
}

........

fn main() {
    let input: &[u8] = b"%11";
    let s = pchar()(input) as IResult<&[u8], &[u8]>;
}

My question now is: if input is a &str instead of a &[u8], is there a way to make homogenize_pct_encoded take a generic that covers both &[u8] and &str? Any &str I use is guaranteed to be 7-bit ASCII (which is also valid UTF-8), so the two should be completely interchangeable.

I'd just do as_bytes on the &str, pass that into your parser, and then do str::from_utf8_unchecked on the result. If you weren't doing the from_raw_parts you could just use InputTake::take to create contiguous_matches, but most of the builtins panic if you go past the end of a slice like you're doing.
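A rough sketch of that wrapper idea, using only the standard library. `parse_bytes` is a made-up stand-in for the real byte-level parser (it only matches a single ':' or '@'), and `parse_str` is a hypothetical front end for `&str` callers. The `unsafe` conversion is only sound here because the stand-in splits right after an ASCII byte; `std::str::from_utf8` is the checked alternative if that cannot be guaranteed.

```rust
// Stand-in for the byte-level parser (an assumption for this sketch):
// matches one ':' or '@' and returns (remaining input, matched slice).
fn parse_bytes(input: &[u8]) -> Option<(&[u8], &[u8])> {
    match input.first() {
        Some(b':') | Some(b'@') => Some((&input[1..], &input[..1])),
        _ => None,
    }
}

// Hypothetical `&str` front end: borrow the bytes, run the byte parser,
// then convert both halves of the result back to `&str`.
fn parse_str(input: &str) -> Option<(&str, &str)> {
    let (rest, matched) = parse_bytes(input.as_bytes())?;
    // SAFETY: `input` was valid UTF-8 and the split point sits directly
    // after a single ASCII byte, so both halves are valid UTF-8 as well.
    let rest = unsafe { std::str::from_utf8_unchecked(rest) };
    let matched = unsafe { std::str::from_utf8_unchecked(matched) };
    Some((rest, matched))
}
```

Callers with bytes use `parse_bytes` directly, and callers with strings use `parse_str`, so only one real parser has to exist.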

I'd generally recommend not doing the stuff you're doing with joining the slices, unless you have no control over how pct_encoded is returning its data - I'm assuming that internally pct_encoded has a buffer that it can guarantee contiguous allocation into and you're not relying on the layout of a tuple or struct to guarantee contiguity (which it won't,) so you might be better off returning that buffer as a whole.

> I'd just do as_bytes on the &str, pass that into your parser, and then do str::from_utf8_unchecked on the result.

You're right, but I was hoping to allow &str and &[u8] for api convenience of callers.

> If you weren't doing the from_raw_parts you could just use InputTake::take to create contiguous_matches, but most of the builtins panic if you go past the end of a slice like you're doing.

Oh nice, so you already know about the nom crate. High five! I really like nom so far. I don't think InputTake::take would let me expand a slice past its length (it panics), so it won't work even for the contiguous slices I have?

> I'd generally recommend not doing the stuff you're doing with joining the slices, unless you have no control over how pct_encoded is returning its data - I'm assuming that internally pct_encoded has a buffer that it can guarantee contiguous allocation into and you're not relying on the layout of a tuple or struct to guarantee contiguity (which it won't,) so you might be better off returning that buffer as a whole.

This is pct_encoded()

fn pct_encoded<'a, I, E>() 
-> impl Fn(I) -> IResult<I, ((I, I), I), E>
where I: 'a + Clone + InputTake + Compare<&'static str> + InputIter + InputLength + InputTakeAtPosition,
      <I as nom::InputTakeAtPosition>::Item: nom::AsChar,
      E: ParseError<I>
{
    pair(
        pair(
            tag("%"), 
            map_parser(take(1u8), hex_digit1)
        ),
        map_parser(take(1u8), hex_digit1)
    )
}

So as you can see, I have no control over what pair() returns. It returns ((I, I), I), and that isn't compatible with my alt() combinator. Technically, nothing prevents pct_encoded() from returning discontiguous slices. It is possible to use a skip-like parser combinator inside those pair() calls, one that skips tokens or does a lookahead or something like that, and that would be bad.

Basically, the three debug_assert_eq checks, in combination with a test suite, should be good enough given the invariants they enforce. If I ever use a skip-like parser combinator in the pair(), I should get a panic when running tests. Using debug_assert_eq is also a nice touch, since the checks get removed in the release build. I still need to assert against overlapping slices, because testing the slices for equality of contents isn't good enough.
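One way the "match ptr values" idea from the TODO could be sketched with the standard library: compare addresses instead of contents. The helper name `is_adjacent` is made up for this example. Address adjacency rules out overlap and gaps, but it does not by itself prove that both slices come from the same allocation, which `from_raw_parts` also requires, so treat it as a debug aid rather than a soundness proof.

```rust
// Hypothetical helper: true when `b` begins at the exact address where `a`
// ends, i.e. the slices sit back to back with no gap and no overlap.
fn is_adjacent(a: &[u8], b: &[u8]) -> bool {
    // `as_ptr_range().end` is the one-past-the-end pointer of `a`.
    a.as_ptr_range().end == b.as_ptr()
}
```

In the closure this could be used as `debug_assert!(is_adjacent((i.0).0, (i.0).1))` and `debug_assert!(is_adjacent((i.0).1, i.1))`, which catches two slices that hold equal bytes but live at different positions.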

I should also note that doing the slice join instead of just returning a Vec gives a significant performance gain (obviously) that I can't ignore, because this function is right in the hot path of the main program.


As far as making it generic over &str and &[u8] goes, the AsRef trait will work:

fn take_str_or_slice<T: AsRef<[u8]> + ?Sized>() {
    // `T` is `str` or `[u8]`; `?Sized` is needed because both are unsized.
    let homogenize_pct_encoded = |i: ((&T, &T), &T)| {
        ...
    };
    ...
}
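A small runnable sketch of the same `AsRef<[u8]>` idea, standard library only. The function `starts_with_pct_encoded` is a made-up example and is not the parser from this thread. It shows that one generic signature accepts both `&str` and `&[u8]`. Keep in mind that `AsRef` only unifies the input side: anything sliced out of `as_ref()` is a `&[u8]`, so handing a `&str` back to string callers still needs a conversion.

```rust
// Hypothetical example: accepts `&str`, `&[u8]`, `&String`, `&Vec<u8>`, ...
// and checks whether the input begins with '%' followed by two hex digits.
fn starts_with_pct_encoded<T: AsRef<[u8]> + ?Sized>(input: &T) -> bool {
    // View whatever we were given as raw bytes.
    let bytes = input.as_ref();
    bytes.len() >= 3
        && bytes[0] == b'%'
        && bytes[1].is_ascii_hexdigit()
        && bytes[2].is_ascii_hexdigit()
}
```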

Here is the playground link where I was tinkering: playground. I hope this helps.


I think I get what you're trying to do. nom already has a parser that can do that - nom::combinator::recognize. It'll take the sequence of input matched by a parser and return it directly as a slice/str.

You can remove homogenize_pct_encoded and write the alt like

alt((
    unreserved(),
    recognize(pct_encoded()),
    sub_delims(),
    tag(":"),
    tag("@"),
))
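To illustrate why this needs no unsafe code, here is a simplified, standard-library-only sketch of the idea behind recognize. It is not nom's actual implementation or signature: the `pct_encoded` and `recognize` functions below are stand-ins written for this example, using `Option` instead of `IResult`. The point is that the matched span can be recovered from the original input and the remaining input alone.

```rust
// Stand-in for `pct_encoded` (an assumption for this sketch): returns the
// remaining input plus three separate one-byte slices, mirroring the
// `((I, I), I)` output shape discussed above.
fn pct_encoded(input: &[u8]) -> Option<(&[u8], ((&[u8], &[u8]), &[u8]))> {
    if input.len() >= 3
        && input[0] == b'%'
        && input[1].is_ascii_hexdigit()
        && input[2].is_ascii_hexdigit()
    {
        Some((&input[3..], ((&input[..1], &input[1..2]), &input[2..3])))
    } else {
        None
    }
}

// Simplified `recognize`: run the parser, discard its structured output,
// and return the prefix of `input` that the parser consumed.
fn recognize<'a, O>(
    parser: impl Fn(&'a [u8]) -> Option<(&'a [u8], O)>,
    input: &'a [u8],
) -> Option<(&'a [u8], &'a [u8])> {
    let (rest, _output) = parser(input)?;
    // Everything the parser consumed is the part of `input` before `rest`.
    let consumed = input.len() - rest.len();
    Some((rest, &input[..consumed]))
}
```

Because the result is sliced from the original input, it is contiguous by construction, whatever the inner parser does with its own output.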

Ring a ding ding!