Encode/decode URI

I think it's trivial with some regex and UTF-8 operations, but is there a crate similar to ECMAScript's encodeURI(...) and decodeURI(...) for encoding general Unicode code points into %XX sequences and back into Unicode code points?

You could use the types in the http crate. I believe they take care of escaping?


urlencoding is quite popular


I tried:

use std::str::FromStr;
use http::uri::Uri;

pub fn encode_uri(s: impl AsRef<str>) -> String {
    Uri::from_str(s.as_ref()).expect("URI malformed").to_string()
}

When I give it something like app-storage:// it fails. It only works if the domain/path is filled in, so it doesn't work for my case.

Sadly, urlencoding escapes everything. I gave it app-storage://ã and it escaped even the :// characters:

app-storage%3A%2F%2F%C3%A3
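
For context, that output is what a component-style encoder produces: it treats : and / as data. In ECMAScript, encodeURI and encodeURIComponent differ only in which characters they leave alone. Here is a minimal std-only sketch of that difference. The names percent_encode, is_unescaped and is_reserved are hypothetical helpers, and the character sets are the ECMAScript ones as I understand them, not what any particular crate uses.

```rust
/// Characters that ECMAScript's encodeURIComponent leaves untouched.
fn is_unescaped(c: char) -> bool {
    c.is_ascii_alphanumeric() || "-_.!~*'()".contains(c)
}

/// Extra characters that encodeURI (but not encodeURIComponent) leaves untouched.
fn is_reserved(c: char) -> bool {
    ";/?:@&=+$,#".contains(c)
}

/// Hypothetical helper: percent-encode every char for which `keep` returns false,
/// emitting one %XX sequence per UTF-8 byte.
fn percent_encode(input: &str, keep: impl Fn(char) -> bool) -> String {
    let mut out = String::with_capacity(input.len());
    let mut buf = [0u8; 4];
    for c in input.chars() {
        if keep(c) {
            out.push(c);
        } else {
            for byte in c.encode_utf8(&mut buf).bytes() {
                out.push_str(&format!("%{:02X}", byte));
            }
        }
    }
    out
}

/// Component-style encoding: delimiters such as `:` and `/` are escaped too.
pub fn encode_uri_component(s: &str) -> String {
    percent_encode(s, is_unescaped)
}

/// Whole-URI encoding: delimiters are kept, everything else non-ASCII is escaped.
pub fn encode_uri(s: &str) -> String {
    percent_encode(s, |c| is_unescaped(c) || is_reserved(c))
}
```

With this sketch, encode_uri_component("app-storage://ã") gives app-storage%3A%2F%2F%C3%A3, while encode_uri("app-storage://ã") gives app-storage://%C3%A3.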

I think you want the url crate


use url::Url;

fn main() {
    let s = "app-storage://ã";

    println!("{}", Url::parse(s).unwrap());
}

Output

app-storage://%C3%A3

It also works when parsing just "app-storage://" without a host.


Is there a way to decode the host and path? I know of urlencoding, but it encodes the slashes...

Right, I can probably use a regex to split the combined host and path on slashes and map each part through urlencoding::decode.

What do you mean by decode? Url parses the URL and handles the encoding of the path.

If you want the raw, unencoded URL, you can use urlencoding::decode on the encoded URL string. It will ignore everything that is not %XX encoded, so you don't need to split the URL and decode the different parts one by one:

println!("{}", urlencoding::decode("app-storage://%C3%A3").unwrap());

Stdout:

app-storage://ã
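
To illustrate why decoding can run over the whole string, here is a minimal std-only sketch of a percent-decoder. It is not the urlencoding implementation; percent_decode and hex_value are hypothetical names. One deliberate difference: this sketch replaces invalid UTF-8 lossily, whereas urlencoding::decode returns an error in that case.

```rust
/// Value of a single ASCII hex digit, or None if the byte is not one.
fn hex_value(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Hypothetical helper: decode %XX sequences and copy every other byte through
/// unchanged, so delimiters like `:` and `/` need no special handling.
pub fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // Only treat '%' as an escape when two hex digits follow it.
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_value(bytes[i + 1]), hex_value(bytes[i + 2])) {
                out.push((hi << 4) | lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Assumption: invalid UTF-8 is replaced with U+FFFD rather than reported.
    String::from_utf8_lossy(&out).into_owned()
}
```

A stray % that is not followed by two hex digits, as in "100%", is copied through as-is.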

Oops, I've mixed things a bit here. I meant decode as a way to translate a URI into a native path based on host operating system.

Decoding aside, I have only one issue with the url crate: it requires a "host". In my case I need only a path. I'm using URIs like file:, app:... which may begin with a directory name like Física. I tried parsing a URI with that as the host and it was rejected as malformed.

What about using the uriparse crate? Unlike the url crate and http, this should work for arbitrary URIs and not just URLs.


I tried:

println!("{}", uriparse::URI::try_from("app://Física").unwrap().host().unwrap());

I got an InvalidIPv4OrRegisteredNameCharacter error.

I think the uriparse crate only supports URIs and not IRIs (Internationalized Resource Identifiers). URIs are limited to ASCII, while IRIs allow full Unicode.

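
A quick way to see the distinction with only the standard library: the sketch below reports the first non-ASCII character in a string. It is only a rough check and not an RFC 3986 validator, and first_non_ascii is a hypothetical name.

```rust
/// Hypothetical helper: byte offset and value of the first non-ASCII character,
/// or None if the whole string is ASCII (a necessary condition for a URI).
pub fn first_non_ascii(s: &str) -> Option<(usize, char)> {
    s.char_indices().find(|(_, c)| !c.is_ascii())
}
```

For "app://Física" this returns Some((7, 'í')), which is the kind of character a strict URI parser rejects. The percent-encoded form "app://F%C3%ADsica" is pure ASCII, so it returns None.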

The iri-string crate (disclaimer: I am the author) can encode a Unicode string into a URI-encoded string, and you can control the context in which the string is encoded (i.e. whether to encode # or not, whether to encode / or not, etc.).

However, it cannot naively decode the encoded URI into an IRI without applying some normalization.


I resolved to implement percent encoding and decoding by myself using just lazy_regex:

use lazy_regex::regex_replace_all;

pub fn encode_uri(s: impl AsRef<str>) -> String {
    // Percent-encode every character outside the allowed set, one %XX per UTF-8 byte.
    regex_replace_all!(r"[^A-Za-z0-9_\-\.:/\\]", s.as_ref(), |seq: &str| {
        let mut r = String::new();
        for byte in seq.bytes() {
            r.push('%');
            r.push_str(&octet_to_hex(byte));
        }
        r
    }).into_owned()
}

pub fn decode_uri(s: impl AsRef<str>) -> String {
    // Decode each run of %XX sequences as a whole so multi-byte UTF-8 characters survive.
    regex_replace_all!(r"(%[A-Fa-f0-9]{2})+", s.as_ref(), |seq: &str, _| {
        let inp = seq.as_bytes();
        let mut r = Vec::<u8>::new();
        let mut i: usize = 0;
        while i != inp.len() {
            // Skip the '%' at inp[i] and parse the two hex digits after it.
            r.push(u8::from_str_radix(String::from_utf8_lossy(&inp[i + 1..i + 3]).as_ref(), 16).unwrap_or(0));
            i += 3;
        }
        String::from_utf8_lossy(&r).into_owned()
    }).into_owned()
}

fn octet_to_hex(arg: u8) -> String {
    let r = format!("{:x}", arg);
    ((if r.len() == 1 { "0" } else { "" }).to_owned() + &r).to_uppercase().to_owned()
}

For now I decided to leave the colon unencoded.

octet_to_hex() can be simplified with {:02X} formatting.

fn main() {
    assert_eq!(format!("{:02X}", 1), "01");
    assert_eq!(format!("{:02X}", 254), "FE");
}
