A newtype, that has a constructor, but also a "trusted" way to bypass the constructor

Hi guys,

I'm struggling with this problem.

Say I have a type EmailAddress(String) whose constructor does validation/sanitization before returning a Result<Self, _>. This is great for data going into a database. But once in the database it is trusted. We don't want to use the constructor on the way out of the database, e.g. our validation rules may tighten over time, (making once valid database entries, now invalid).

So we want to trust that the &str or String coming from the database is safe, without allowing anyone to just construct an EmailAddress

How can this be achieved?

Easiest solution is to just have an unsafe constructor that skips validation and call that on the DB retrieval code. You are, after all, just asserting that the DB retrieval is safe, and unsafe is the canonical way to do that.

You could also go through an intermediate type, or put the EmailAddress constructor on String via an extension trait, but these kinds of solutions ultimately depend on the DB retrieval code having private access to something that your EmailAddress-constructing users don't.

The critical question is whether you want library UB for violations, or if you "just" want violating the rules to be a logic error. It sounds, in this case, like you want the latter because you need weird data in the database to be returnable.

At that point, the general answer is just to make an associated function with a big ugly name.

So that'd be something like this:

mod whatever {
    pub struct EmailAddress(String);
    impl EmailAddress {
        fn new(x: String) -> Result<Self, WhateverError> { ... }
        fn new_bypassing_validation(x: String) -> Self { Self(x) }
    }
}

And then you just don't write any unsafe code that depends on the email address being well-formed.

7 Likes

unsafe is meant to relate to memory safety, so I don't think this would be an appropriate use of unsafe in this situation. Per

"Code or interfaces whose memory safety cannot be verified by the type system."

1 Like

But memory is a synonym for data. And I really like the unsafe approach, because the data returned from a database is just like data you may get from FFI.

That's not my understanding of what memory safety means, which is things like "use after free", accessing memory that is not defined ( outside the range of memory allocated ), and such-like.

[ I just checked, and both clippy and cargo will issue warnings if you mark a code block as unsafe when it isn't ]

I am not saying using unsafe in this way is definitively wrong, I would consider it confusing and un-ideomatic though, just my opinion.

1 Like

It really depends what invariants the library guarantees. Typically, NonZeroU8 does not obviously refers to memory, but because the library docs say it has to be non zero, its new_unchecked method is unsafe. The same goes for str being valid UTF-8 for example.

Regarding the OP issue, it depends whether the "valid email" invariant is used to guarantee soundness afterwards, or exposed to the user with such a guarantee.

My understanding is that the memory safety of the standard library does depend on Strings having valid UTF-8 data. Whereas (I presume) this does not apply regarding the validity of an email address (it's possible it does, if there is unsafe code in use which assumes email addresses are valid, but this seems an unlikely scenario to me ).

Well it really depends what the library doc says. If it says the email is valid for example, I could expect EmailAddress to contain an @, and write some really optimized code:

fn get_domain_name(email: &EmailAddress) -> &str {
 let opt_res = email.as_str().split_once('@');
 unsafe { opt_res.unwrap_unchecked() }.1
}

which would be UB if there is no @.

The key point is that once the promise is made in the documentation, the safety guarantees extend to what consumers of your library do.

2 Likes

Two things here.

  • The general: Please don't use unsafe unless you're actually depending on the invariant for memory safety elsewhere. Despite the standard library being quite strict about this (see how Box::leak isn't unsafe, for example) we still get many people who try the "but how bad could it be?" response when told that they're doing something UB. Any place that needs unsafe for just logical correctness actually weakens Rust's soundness propositions.

  • The specific: the OP specifically mentioned that the rules will tighten over time, so there's no guarantee that what's in the database actually meets the rules, and thus it's particularly inappropriate to say "I promise this is validation-clean" for this data from the DB.


These are both relied upon for memory safety. The only reason Option<NonZeroU8> can have size 1 is because it's immediately unsound to make a NonZeroU8 that holds a zero. And str depends on it for codepoint decoding (like .chars()) -- it can read off the end of a buffer if the last byte has the high bit set, for example, because that can't happen in valid UTF-8 where such a byte always has the correct number of continuation bytes after it.

Here's a quick example of MIRI catching the UB from a non-UTF-8 str: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=247eec5ee027ebf6a061a63bb46ed7d2

error: Undefined Behavior: entering unreachable code
  --> /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/hint.rs:51:14
   |
51 |     unsafe { intrinsics::unreachable() }
   |              ^^^^^^^^^^^^^^^^^^^^^^^^^ entering unreachable code
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
           
   = note: inside `std::hint::unreachable_unchecked` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/hint.rs:51:14
   = note: inside `std::option::Option::<&u8>::unwrap_unchecked` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs:820:30
   = note: inside `core::str::validations::next_code_point::<std::slice::Iter<u8>>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23
   = note: inside `<std::str::Chars as std::iter::Iterator>::next` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/iter.rs:44:18
   = note: inside `<std::str::Chars as std::iter::Iterator>::fold::<(), [closure@std::iter::Iterator::for_each::call<char, fn(char) {std::mem::drop::<char>}>::{closure#0}]>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2170:29
   = note: inside `<std::str::Chars as std::iter::Iterator>::for_each::<fn(char) {std::mem::drop::<char>}>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:736:9
note: inside `main` at src/main.rs:5:5
  --> src/main.rs:5:5
   |
5  |     s.chars().for_each(drop);
   |     ^^^^^^^^^^^^^^^^^^^^^^^^
3 Likes

It's also probably easier to justify if you think of the DB as arbitrary memory that can be accessed from anywhere. If you can safely decode an invalid email from the DB into an EmailAddress, then you can round-trip an invalid email back into the DB without validation. Then the DB is effectively corrupted, and who knows where that will have an impact.

Note that for those fond of requiring something noisy like unsafe, but different than unsafe since not UB-related, you can:

Create your own unsafe-like effect:

pub use dangerous::dangerous;
pub mod dangerous {
    #[derive(Clone, Copy)]
    pub struct HotPotato<'lt>([&'lt (); 0]);

    pub fn dangerous<R> (f: impl FnOnce(HotPotato<'_>) -> R)
      -> R
    {
        f(HotPotato([]))
    }
}

This gives you dangerous blocks, written as:

dangerous(|hot_potato| {
    ...
})
  • which grants you a hot_potato, token of the dangerous stuff being done, and with which you'll be able to call dangerous APIs.

And from there, the only missing part is having the equivalent of an unsafe fn, by virtue of having a function requiring a Dangerous<'_> parameter:

impl MyFancyString {
    fn new_unchecked (
        _: dangerous::HotPotato<'_>,
       arg: impl Into<String>,
    ) -> Self
    ...
}

No matter how hard you try (without unsafe of course), the only way to call MyFancyString::new_unchecked() will be within a dangerous block:

dangerous(|hot_potato| {
    MyFancyString::new_unchecked(hot_potato, "...")
})
  • Playground

  • A similar API can be featured using macros rather than a callback so as to be async-friendly, for instance.

  • feel free to rename HotPotato to Dangerous, and |hot_potato| to |token| :upside_down_face:

10 Likes

That's quite clever, thanks! At first I thought that I'd be able to smuggle the token to the non-dangerous code via dangerous(|token| token), but compiler stopped me:

error: lifetime may not live long enough
  --> src/main.rs:24:35
   |
24 |     let token = dangerous(|token| token);
   |                            ------ ^^^^^ returning this value requires that `'1` must outlive `'2`
   |                            |    |
   |                            |    return type of closure is dangerous::Dangerous<'2>
   |                            has type `dangerous::Dangerous<'1>`
1 Like

Reminds me of something I thought of for one of my projects, something along the lines of:

pub struct Foo {
    /* ... */
}

impl Foo {
    pub fn new() -> Foo {/* ... */}

    pub fn safe_thing(&self) {/* ... */}

    fn dangerous_thing(&self) {/* ... */}

    pub fn dangerous(&self) -> DangerousFoo<'_> {
        DangerousFoo(self)
    }
}

pub struct DangerousFoo<'a>(&'a Foo);

impl<'a> DangerousFoo<'a> {
    pub fn dangerous_thing(&self) {
        self.0.dangerous_thing()
    }
}

fn main () {
    let foo = Foo::new();
    foo.safe_thing();
    foo.dangerous().dangerous_thing();
}

But yours is probably better, it doesn't need as much repetition, as my method would require doing this for everything with dangerous operations

1 Like

We just came up with a much better idea in the Discord chat!!

We create a struct DbAccessToken; somewhere protected inside of the datatabse package/mod/struct whatever, that only the database can construct.

use std::marker::PhantomData;

pub struct DatabasePriviledgeToken(PhantomData<()>);

impl DatabasePriviledgeToken {
    pub(super) fn new() -> Self {
        Self(PhantomData)
    }
}

Now the constructor that bypasses validation must accept one of these types as a parameter to the function.

I'd probably have to re-export the token in another package though to avoid cyclical dependencies (Database needs Email to create the EmailAddress, Email needs Database to ask for the AccessToken)

Edit: 25-04-2023: So honestly, I've come to regret this a little bit. It's really cool, and I put the constructor behind a feature flag so that tests in foreign crates can construct these types, but every so often running unit tests just breaks because for one reason or another the feature is not enabled correctly in one of my crates, and sometimes the feature is enabled when it shouldn't be etc. so it's just more effort to maintain, than it is to review bad uses of a from_trusted fn that doesn't ask you to supply a unconstructable PriviledgeToken

5 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.