My first project: Bindings to a C library

Hi all,

I recently published my first crate (https://github.com/kodemartin/rustpostal) and I would greatly appreciate a review.

Thank you!

Hi! Thanks for your work, it's an interesting library. This is a use case that I've been pondering myself for some time.

Let's jump into the review!

  • First of all, the customary way of wrapping a C library in Rust is to split the functionality in two separate crates:

    1. A "sys" crate, in your example it would be called libpostal-sys. This should contain the immediate C FFI bindings. These include C types and functions, as well as equivalents of other C constructs that can't be directly translated to Rust, e.g. const items that replace #define macros.
    2. A higher-level crate providing a safe, idiomatic, preferably RAII-enabled wrapper around the lower-level -sys one.

    This is preferable to your current approach of mixing safe and unsafe APIs in the same create.

  • Next, it would be nice to follow the Rust API Guidelines as closely as possible. A couple of things that stick out immediately:

    1. Most of your types and functions have no documentation.
    2. unsafe blocks are generally very long and coarse-grained, often spanning almost entire functions.
    3. There are no derived trait implementations on most of your types. Default, Debug, Clone, Copy (if applicable, especially on simple enums and newtype wrappers around primitive types), PartialEq, Eq (if applicable) and Hash are the most important ones you'll want to provide for interoperability. It is very hard to use types that lack these implementations.
    4. Relatedly, NormalizeOptions::get_default_options() should just be the Default::default() impl. NormalizeOptions::initialize() could be the derived Default impl. (Having a constructor named initialize() is confusing anyway – why do you need it? Why isn't NormalizeOptions::new() enough?)
  • In general, taking Vec<_> in function arguments is considered an anti-pattern (see e.g. expand_address_with_options()). They can only be used if the caller actually has a Vec. It's preferable to accept a slice (&[T]) or better yet, an Iterator<Item=T> instead.

  • Your NormalizeOptions::as_libpostal_options() function takes &mut self, which is inappropriate for a function named as_something() (without it being called as_mut_something()). Do you really need this? If you are mutating the structure, then the method shouldn't be called as_something(), since the idiomatic naming convention dictates that as_something() methods perform a read-only conversion between types, and do not mutate the receiver.

  • Exposing separate (and unsafe!) setup and the teardown functions is very error-prone. Consider wrapping setup and teardown functionality into an RAII "context" type, which calls setup() in its constructor and teradown() in its Drop implementation.

  • Your documentation for address::parse_address() is unidiomatic: you explicitly call .into_iter() on the result before iterating, which is unnecessary.

  • The use libc; statement in ffi.rs is unnecessary.

  • Globally suppressing compiler warnings against unused items (#![allow(unused)]) is a code smell, too. Why do you need it? You should in general fix warnings rather than allowing them, especially if you are allowing them globally.

  • The AddressComponents bitset shouldn't be implemented manually. Bitset boilerplate can be easily implemented using the bitflags crate.

  • The code is unnecessarily verbose in some places. For example, the following code snippet:

    let mut n: libc::size_t = 0;
    let n = &mut n as *mut libc::size_t;
    let raw = ffi::libpostal_expand_address(address.as_ptr() as *const libc::c_char, options, n);
    

    could be rewritten as just

    let mut n: usize = 0;
    let raw = ffi::libpostal_expand_address(address.as_ptr() as *const libc::c_char, options, &mut n);
    

    because usize is guaranteed to be the same as libc::size_t, and an implicit conversion ("coercion") from &mut T to *mut T exists (as well as from &T to *const T, of course).

3 Likes

This is an amazing review! Thank you very much for the time you spent, the depth
you went into, and the constructive quality of your comments.

Indeed, many aspects of my first release seemed a bit off, and getting feedback
on the customary and idiomatic ways to do things in Rust, is extremely helpful.
A lot of things you mention, simply did not occur to me due to lack of
experience with the language.

Most of the recommendations are straight-forward and outline a roadmap for the
project.

To recap:

  1. Use dedicated libpostal-sys crate for the ffi.

    Q: Do you think this relates only to the ffi.rs module of the crate?
    Or do you see declarations in other modules that should not be there?

  2. Use RAII-enabled wrappers on the higher-level crate.

  3. Converge to the Rust API Guidelines.

    Note: You raise very good points in your specific remarks (derive
    traits, implement Defaultetc.), so no argue there!

  4. Use bitflags for AddressComponents. Note: I value also the generic
    implications of the comment, to reuse accepted community solutions and void
    boilerplate.

  5. Revisit reasons for suppressing compiler warning, and prefer fixing any
    underlying issues.

Apart from these general directions, I would like to discuss more two specific
comments:

  1. NormalizeOptions::as_libpostal_options() :

    Your NormalizeOptions::as_libpostal_options() function takes &mut self ,
    which is inappropriate for a function named as_something() (without it being
    called as_mut_something() ). Do you really need this? If you are mutating
    the structure, then the method shouldn't be called as_something() , since
    the idiomatic naming convention dictates that as_something() methods perform
    a read-only conversion between types, and do not mutate the receiver.

    This has troubled me a lot during the implementation. Wanted to create a
    higher-level interface to construct ffi::libpostal_normalize_options. This
    requires the population of a *const *const libc::c_char field, and the
    only solutions I could think of, was to mutate
    NormalizeOptions.language_c_strs and NormalizeOptions.language_ptrs.
    The latter holds the data in the target ffi structure for the whole
    lifetime of NormalizeOptions. So this was the reaseon for &mut self.

    After reviewing API guidelines C-Conv I reckon the to_ convention is more appropriate, no?

  2. About verbose code snippets:

    an implicit conversion ("coercion") from &mut T to *mut T exists (as well as from &T
    to *const T

    Is this because of the CoercedUnsized
    implementation?

Thanks again!

All of the other modules contain pub unsafe functions, which should not be there. The prime example of this is the setup-teardown pair which is directly in lib.rs. But as_libpostal_options() is also unsafe.

Not really. to_xyz() and into_xyz() methods still shouldn't mutate. In general, conversion methods should not mutate the receiver. The difference between them is only that:

  • as_() methods convert from either references or owned values to references or cheaply-constructed values (typically, primitive types or small Copy types), e.g.: as_ref(), as_str(), as_bytes().
  • as_mut_() methods convert from either mutable references or owned values to a mutable reference. E.g.: as_mut(), as_mut_slice().
  • to_() methods convert from borrowed to owned / expensive-to-construct types, e.g.: to_owned(), to_string().
  • into_() methods convert from owned to owned types, e.g.: into(), into_inner().

Why? Couldn't you create a new *const *const char value? Currently, you are indeed returning the pointers inside self – and while I'm not familiar with the internals of libpostal, it seems really dubious that you are returning something with pointers inside self but that return value has raw pointers and thus it doesn't have a lifetime that would tie it to self. All in all, this smells like Undefined Behavior waiting to happen.

No, it isn't. Nothing's unsized here. &T always coerces to *const T and &mut T always coerces to *mut T, full stop. Unsized coercions are an entirely different kind of beast. They are related to the fact that you can e.g. convert a &[T; N] reference-to-array into a &[T] reference-to-slice, and that you can do the same using Box<[T]>, and trait objects, etc. None of those compilcations is happening here. Note how that CoerceUnsized impl you linked to has T on one side but U, a different type, on the other side. In your case, there aren't two different pointed types. It's &mut T to *mut T, not &mut T to *mut U.

It's a lot simpler: a raw pointer has strictly less capabilities and information content than a reference — basically, references are pointers with extra lifetime and borrowing information. Therefore, you can always convert a reference to a raw pointer that points to the same type.

You got a point(er) there! In fact, if currently self is moved after the ffi structure is created and dropped for some reason, we will be left with dangling reference. I have to revisit this.

This makes sense. Have to understand coercion better.

Are there any other references, sample codes you could recommend on the topics we discussed?